diff --git a/GEMINI.md b/GEMINI.md
index 5422c424..3ef07008 100644
--- a/GEMINI.md
+++ b/GEMINI.md
@@ -7,62 +7,55 @@
 
 ## Project Overview
 
-This project is a Python-based system for automated company data enrichment and lead generation. It uses a variety of data sources, including web scraping, Wikipedia, and the OpenAI API, to enrich company data from a CRM system. The project is designed to run in a Docker container and can be controlled via a Flask API.
+This project is a Python-based system for automated company data enrichment and lead generation. It focuses on identifying B2B companies with high potential for robotics automation (Cleaning, Transport, Security, Service).
 
-The system is modular and consists of the following key components:
+The system architecture has evolved from a CLI-based toolset to a modern web application (`company-explorer`) backed by Docker containers.
 
-* **`brancheneinstufung_167.py`:** The core module for data enrichment, including web scraping, Wikipedia lookups, and AI-based analysis.
-* **`company_deduplicator.py`:** A module for intelligent duplicate checking, both for external lists and internal CRM data.
-* **`generate_marketing_text.py`:** An engine for creating personalized marketing texts.
-* **`app.py`:** A Flask application that provides an API to run the different modules.
-* **`company-explorer/`:** A new React/FastAPI-based application (v2.x) replacing the legacy CLI tools. It focuses on identifying robotics potential in companies.
+## Current Status (Jan 08, 2026) - Company Explorer (Robotics Edition v0.3.0)
 
-## Git Workflow & Conventions
+### 1. Robotics Potential Analysis (v2.3)
+* **Chain-of-Thought Logic:** The AI analysis (`ClassificationService`) now uses a multi-step reasoning process to evaluate companies based on their **physical infrastructure** (factories, warehouses) rather than just keywords.
+* **Provider vs. User:** Strict logic implemented to distinguish between companies *selling* automation products and those *needing* them for their own operations.
+* **Configurable Settings:** A database-driven configuration (`RoboticsCategory`) allows users to edit the definition and scoring logic for each robotics category directly via the frontend settings menu.
 
-- **Commit messages:** Commits should follow a clear format:
-  - Title: A concise summary under 100 characters.
-  - Description: Detailed changes as a list with `- ` at the start of each line (no bullet points).
-- **File renames:** To preserve a file's Git history, it must be renamed with `git mv alter_name.py neuer_name.py`.
-- **Commit & push process:** Changes are first committed locally. Pushing to the remote server happens only after your explicit confirmation.
+### 2. Deep Wikipedia Integration (v2.1)
+* **Extraction:** The system extracts the first paragraph (cleaned of artifacts), industry, revenue (normalized to Mio €), employee count, and Wikipedia categories.
+* **Validation:** Uses a "Google-First" strategy via SerpAPI, validating candidates by checking for domain matches and city/HQ location in the article.
+* **UI:** The Inspector displays a dedicated Wikipedia profile section with visual tags.
 
-## Current Status (Jan 08, 2026) - Company Explorer (Robotics Edition)
+### 3. Web Scraping & Legal Data (v2.2)
+* **Impressum Scraping:**
+    * **2-Hop Strategy:** If no "Impressum" link is found on the landing page, the scraper automatically searches for a "Kontakt" page and checks for the link there.
+    * **Root Fallback:** If deep links (e.g. `/about-us`) fail, the scraper checks the root domain (`/`).
+    * **LLM Extraction:** Unstructured legal text is parsed by Gemini to extract structured JSON (Legal Name, Address, CEO, VAT ID).
+* **Robustness:**
+    * **JSON Cleaning:** A helper (`clean_json_response`) strips Markdown code blocks from LLM responses to prevent parsing errors.
+    * **Aggressive Cleaning:** HTML scraper removes `` tags and cookie banners before text analysis to reduce noise.
 
-* **Robotics Potential Analysis (v2.3):**
-    * **Logic Overhaul:** Switched from keyword-based scanning to a **"Chain-of-Thought" Infrastructure Analysis**. The AI now evaluates physical assets (factories, warehouses, solar parks) to determine robotics needs.
-    * **Provider vs. User:** Implemented strict reasoning to distinguish between companies *selling* cleaning products (providers) and those *operating* factories (users/potential clients).
-    * **Configurable Logic:** Added a database-backed configuration system for robotics categories (`cleaning`, `transport`, `security`, `service`). Users can now define the "Trigger Logic" and "Scoring Guide" directly in the frontend settings.
-
-* **Wikipedia Integration (v2.1):**
-    * **Deep Extraction:** Implemented the "Legacy" extraction logic (`WikipediaService`). It now pulls the **first paragraph** (cleaned of references), **categories** (filtered for relevance), revenue, employees, and HQ location.
-    * **Google-First Discovery:** Uses SerpAPI to find the correct Wikipedia article, validating via domain match and city.
-    * **Visual Inspector:** The frontend `Inspector` now displays a comprehensive Wikipedia profile including category tags.
-
-* **Web Scraping & Legal Data (v2.2):**
-    * **Impressum Scraping:** Implemented a robust finder for "Impressum" / "Legal Notice" links.
-    * **Root-URL Fallback:** If deep links (e.g., from `/about-us`) don't work, the scraper automatically checks the root domain (`example.com/impressum`).
-    * **LLM Extraction:** Uses Gemini to parse unstructured Impressum text into structured JSON (Legal Name, Address, CEO).
-    * **Clean JSON Parsing:** Implemented `clean_json_response` to handle AI responses containing Markdown (` ```json `), preventing crash loops.
-
-* **Manual Overrides & Control:**
-    * **Wikipedia Override:** Added a UI to manually correct the Wikipedia URL. This triggers a re-scan and **locks** the record (`is_locked` flag) to prevent auto-overwrite.
-    * **Website Override:** Added a UI to manually correct the company website. This automatically clears old scraping data to force a fresh analysis on the next run.
+### 4. User Control & Ops
+* **Manual Overrides:** Users can manually correct the Wikipedia URL (locking the data) and the Company Website (triggering a fresh re-scrape).
+* **Polling UI:** The frontend uses intelligent polling to auto-refresh data when background jobs (Discovery/Analysis) complete.
+* **Forced Refresh:** The "Analyze" endpoint now clears old cache data to ensure a fresh scrape on every user request.
 
 ## Lessons Learned & Best Practices
 
 1. **Numeric Extraction (German Locale):**
     * **Problem:** "1.005 Mitarbeiter" was extracted as "1" (treating dot as decimal).
-    * **Solution:** Implemented context-aware logic. If a number has a dot followed by exactly 3 digits (and no comma), it is treated as a thousands separator. For Revenue (`is_umsatz=True`), dots are generally treated as decimals (e.g. "375.6 Mio") unless multiple dots exist.
-    * **Rule:** Always check for both `,` and `.` presence to determine locale.
+    * **Solution:** Implemented context-aware logic. If a number has a dot followed by exactly 3 digits (and no comma), it is treated as a thousands separator.
+    * **Revenue:** For revenue (`is_umsatz=True`), dots are generally treated as decimals (e.g. "375.6 Mio") unless unambiguous multiple dots exist. Billion/Mrd is converted to 1000 Million.
 
 2. **LLM JSON Stability:**
-    * **Problem:** LLMs often wrap JSON in Markdown blocks, causing `json.loads()` to fail.
-    * **Solution:** ALWAYS use a `clean_json_response` helper that strips ` ```json ` markers before parsing. Never trust raw LLM output for structured data.
+    * **Problem:** LLMs often wrap JSON in Markdown blocks (` ```json `), causing `json.loads()` to fail.
+    * **Solution:** ALWAYS use a `clean_json_response` helper that strips markers before parsing. Never trust raw LLM output.
 
 3. **Scraping Navigation:**
     * **Problem:** Searching for "Impressum" only on the *scraped* URL (which might be a subpage found via Google) often fails.
-    * **Solution:** Always implement a fallback to the **Root Domain**. The legal notice is almost always linked from the homepage footer.
+    * **Solution:** Always implement a fallback to the **Root Domain** AND a **2-Hop check** via the "Kontakt" page.
+
+4. **Frontend State Management:**
+    * **Problem:** Users didn't see when a background job finished.
+    * **Solution:** Implementing a polling mechanism (`setInterval`) tied to an `isProcessing` state is superior to static timeouts for long-running AI tasks.
 
 ## Next Steps
-* **Frontend Debugging:** Verify why the "Official Legal Data" block disappears in some states (likely due to conditional rendering checks on `impressum` object structure).
-* **Quality Assurance:** Implement a dedicated "Review Mode" to validate high-potential leads.
-* **Data Import:** Finalize the "List Matcher" to import and deduplicate Excel lists against the new DB.
\ No newline at end of file
+* **Data Import:** Finalize the "List Matcher" to import and deduplicate Excel lists against the new DB.
+* **Export:** Generate Excel/CSV exports of enriched leads for CRM import.
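
The numeric-extraction rule from "Lessons Learned" item 1 can be sketched roughly as follows. This is a minimal illustration, not the project's actual parser: the function name `parse_german_number` and its return convention (a plain float, with revenue expressed in millions) are assumptions. Only the `is_umsatz` flag and the rules themselves come from the notes above.

```python
import re
from typing import Optional


def parse_german_number(text: str, is_umsatz: bool = False) -> Optional[float]:
    """Hypothetical helper: extract the first number from German-formatted text."""
    match = re.search(r"\d[\d.,]*", text)
    if match is None:
        return None
    raw = match.group(0).rstrip(".,")
    has_dot, has_comma = "." in raw, "," in raw

    if has_dot and has_comma:
        # Both separators present: whichever comes last is the decimal mark.
        if raw.rfind(",") > raw.rfind("."):
            raw = raw.replace(".", "").replace(",", ".")
        else:
            raw = raw.replace(",", "")
    elif has_comma:
        # Comma only: treat it as the German decimal comma.
        raw = raw.replace(",", ".")
    elif has_dot:
        multiple_dots = raw.count(".") > 1
        three_digit_groups = re.fullmatch(r"\d{1,3}(\.\d{3})+", raw) is not None
        # Dots are thousands separators when there are several of them, or when
        # a single dot is followed by exactly three digits and this is not revenue.
        if multiple_dots or (three_digit_groups and not is_umsatz):
            raw = raw.replace(".", "")

    value = float(raw)
    # Per the notes, "Billion/Mrd" is converted to 1000 million for revenue.
    if is_umsatz and re.search(r"\b(Mrd|Milliarden?|Billion)\b", text, re.IGNORECASE):
        value *= 1000
    return value
```

With this sketch, `parse_german_number("1.005 Mitarbeiter")` yields `1005.0`, while `parse_german_number("375.6 Mio", is_umsatz=True)` keeps the dot as a decimal and yields `375.6`.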
diff --git a/company-explorer/backend/app.py b/company-explorer/backend/app.py
index 31da4028..cfe0e68c 100644
--- a/company-explorer/backend/app.py
+++ b/company-explorer/backend/app.py
@@ -324,6 +324,19 @@ def analyze_company(req: AnalysisRequest, background_tasks: BackgroundTasks, db:
     if not company.website or company.website == "k.A.":
         return {"error": "No website to analyze. Run Discovery first."}
 
+    # FORCE SCRAPE LOGIC
+    # If explicit force_scrape is requested OR if we want to ensure fresh data for debugging
+    # We delete the old scrape data.
+    # For now, let's assume every manual "Analyze" click implies a desire for fresh results if previous failed.
+    # But let's respect the flag from frontend if we add it later.
+
+    # Always clearing scrape data for now to fix the "stuck cache" issue reported by user
+    db.query(EnrichmentData).filter(
+        EnrichmentData.company_id == company.id,
+        EnrichmentData.source_type == "website_scrape"
+    ).delete()
+    db.commit()
+
     background_tasks.add_task(run_analysis_task, company.id, company.website)
     return {"status": "queued"}
 
diff --git a/tmp/sidebar.txt b/tmp/sidebar.txt
new file mode 100644
index 00000000..eee7e4d3
--- /dev/null
+++ b/tmp/sidebar.txt
@@ -0,0 +1 @@
+

2H GmbH & Co. KG

Retail ENRICHED

AI Strategic Dossier

Business Model

2H GmbH & Co. KG is a wholesaler of printable and digital media that serves companies that use or produce media. It offers a broad range, from paper envelopes to digital info boards, and supports its customers with consulting, material processing, and technical services. Its goal is to help its customers succeed in media and communication.

Infrastructure Evidence

"Only wholesalers are able to keep an almost inexhaustible range in stock and deliver it just in time. We are happy to arrange individual logistics services with you, including special services such as the storage and delivery of finished goods."

Company Profile (Wikipedia)

No Wikipedia profile found yet.

Robotics Potential

cleaning 40%

"As a large distributor with warehousing, they likely have a need for cleaning robots to maintain their facilities, although it's not their primary focus."

transport 70%

"They operate a large warehouse and offer logistics services, including storage and delivery, indicating a significant need for intralogistics and transport solutions."

security 40%

"The presence of a large warehouse suggests a need for basic security measures, but it's unlikely to be a high-security environment requiring extensive surveillance."

service 10%

"Their business model is primarily B2B distribution, with no indication of direct customer service or hospitality operations where service robots would be relevant."

Added: 8.1.2026
ID: CE-0009
\ No newline at end of file
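
The `clean_json_response` helper referenced in the GEMINI.md changes is not part of this patch. A minimal sketch of what such a helper might look like is shown below. The name is taken from the notes; the signature, the regular expression, and the choice to return a cleaned string rather than a parsed object are all assumptions.

```python
import json
import re

# Matches a Markdown code fence, optionally tagged "json", and captures its body.
_FENCE = re.compile(r"```(?:json)?\s*(.*?)\s*```", re.DOTALL | re.IGNORECASE)


def clean_json_response(raw: str) -> str:
    """Assumed behavior: return the JSON payload with any Markdown fence removed."""
    text = raw.strip()
    match = _FENCE.search(text)
    if match is not None:
        # Keep only what sits inside the first fenced block.
        text = match.group(1)
    return text.strip()


# Example: a fenced LLM reply becomes parseable by json.loads().
reply = '```json\n{"legal_name": "2H GmbH & Co. KG", "ceo": null}\n```'
parsed = json.loads(clean_json_response(reply))
```

Returning a string keeps parsing and error handling in the caller, which can then decide how to react when `json.loads()` still fails on malformed output.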