Brancheneinstufung2/GEMINI.md
Floke b3fa036809 feat(company-explorer): force-refresh analysis and refine extraction logic
- Enforced fresh scrape on 'Analyze' request to bypass stale cache.
- Implemented 2-Hop Impressum scraping strategy (via Kontakt page).
- Refined numeric extraction for German locale (thousands separators).
- Updated documentation with Lessons Learned.
2026-01-08 16:14:01 +01:00


# Gemini Code Assistant Context
## Important Notes
- **Project documentation:** The primary and most comprehensive documentation for this project is in `readme.md`. Consult that file for a detailed understanding of the architecture and the individual modules.
- **Git repository:** This project is managed in a Git repository, and all code changes are versioned. See the "Git Workflow & Conventions" section for our working rules.
## Project Overview
This project is a Python-based system for automated company data enrichment and lead generation. It focuses on identifying B2B companies with high potential for robotics automation (Cleaning, Transport, Security, Service).
The system architecture has evolved from a CLI-based toolset to a modern web application (`company-explorer`) backed by Docker containers.
## Current Status (Jan 08, 2026) - Company Explorer (Robotics Edition v0.3.0)
### 1. Robotics Potential Analysis (v2.3)
* **Chain-of-Thought Logic:** The AI analysis (`ClassificationService`) now uses a multi-step reasoning process to evaluate companies based on their **physical infrastructure** (factories, warehouses) rather than just keywords.
* **Provider vs. User:** Strict logic implemented to distinguish between companies *selling* automation products and those *needing* them for their own operations.
* **Configurable Settings:** A database-driven configuration (`RoboticsCategory`) allows users to edit the definition and scoring logic for each robotics category directly via the frontend settings menu.
### 2. Deep Wikipedia Integration (v2.1)
* **Extraction:** The system extracts the first paragraph (cleaned of artifacts), industry, revenue (normalized to Mio €), employee count, and Wikipedia categories.
* **Validation:** Uses a "Google-First" strategy via SerpAPI, validating candidates by checking for domain matches and city/HQ location in the article.
* **UI:** The Inspector displays a dedicated Wikipedia profile section with visual tags.
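
The candidate validation described above can be sketched roughly as follows. This is an illustration, not the project's actual code: the function names `normalize_domain` and `is_plausible_match` are hypothetical, and the sketch assumes that either signal (a domain match or a city mention) is enough to accept a candidate, which the notes do not specify.

```python
from urllib.parse import urlparse


def normalize_domain(url: str) -> str:
    """Return the host of a URL without scheme or a leading 'www.'."""
    # urlparse only fills netloc when the string contains '//'.
    host = urlparse(url if "//" in url else f"//{url}").netloc.lower()
    return host[4:] if host.startswith("www.") else host


def is_plausible_match(article_text: str, article_links: list[str],
                       company_website: str, city: str) -> bool:
    """Accept a Wikipedia candidate on a domain match or a city mention.

    Assumption: one signal is sufficient; the real service may require both.
    """
    domain = normalize_domain(company_website)
    domain_match = any(normalize_domain(link) == domain for link in article_links)
    city_match = bool(city) and city.lower() in article_text.lower()
    return domain_match or city_match
```

Here `article_links` stands for the external links found in the candidate article and `article_text` for its plain text; how those are obtained is outside this sketch.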
### 3. Web Scraping & Legal Data (v2.2)
* **Impressum Scraping:**
    * **2-Hop Strategy:** If no "Impressum" link is found on the landing page, the scraper automatically searches for a "Kontakt" page and checks for the link there.
    * **Root Fallback:** If deep links (e.g. `/about-us`) fail, the scraper checks the root domain (`/`).
    * **LLM Extraction:** Unstructured legal text is parsed by Gemini to extract structured JSON (Legal Name, Address, CEO, VAT ID).
* **Robustness:**
    * **JSON Cleaning:** A helper (`clean_json_response`) strips Markdown code blocks from LLM responses to prevent parsing errors.
    * **Aggressive Cleaning:** HTML scraper removes `<a>` tags and cookie banners before text analysis to reduce noise.
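
The 2-hop strategy and root fallback can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the scraper's real implementation: `LinkCollector`, `find_link`, and `find_impressum_url` are hypothetical names, links are matched by a simple substring test, and page download is injected as a `fetch` callable that returns HTML or `None` on failure.

```python
from html.parser import HTMLParser
from typing import Callable, Optional
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a> element with an href."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[tuple[str, str]] = []
        self._href: Optional[str] = None
        self._text: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_link(html: str, base_url: str, keyword: str) -> Optional[str]:
    """Return the first absolute link whose text or href contains the keyword."""
    parser = LinkCollector()
    parser.feed(html)
    for href, text in parser.links:
        if keyword in text.lower() or keyword in href.lower():
            return urljoin(base_url, href)
    return None


def find_impressum_url(start_url: str,
                       fetch: Callable[[str], Optional[str]]) -> Optional[str]:
    """Look for an Impressum link on the scraped URL, then on the root domain."""
    parts = urlparse(start_url)
    root_url = f"{parts.scheme}://{parts.netloc}/"
    # dict.fromkeys de-duplicates while keeping order (start URL may be the root).
    for page_url in dict.fromkeys([start_url, root_url]):
        html = fetch(page_url)
        if html is None:
            continue
        # Hop 1: a direct Impressum link on this page.
        direct = find_link(html, page_url, "impressum")
        if direct:
            return direct
        # Hop 2: follow a "Kontakt" link and look for the Impressum there.
        kontakt_url = find_link(html, page_url, "kontakt")
        if kontakt_url:
            kontakt_html = fetch(kontakt_url)
            if kontakt_html:
                via_kontakt = find_link(kontakt_html, kontakt_url, "impressum")
                if via_kontakt:
                    return via_kontakt
    return None
```

Passing `fetch` in as a parameter keeps the navigation logic testable without network access; in production it would wrap whatever HTTP client the scraper already uses.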
### 4. User Control & Ops
* **Manual Overrides:** Users can manually correct the Wikipedia URL (locking the data) and the Company Website (triggering a fresh re-scrape).
* **Polling UI:** The frontend uses intelligent polling to auto-refresh data when background jobs (Discovery/Analysis) complete.
* **Forced Refresh:** The "Analyze" endpoint now clears old cache data to ensure a fresh scrape on every user request.
## Lessons Learned & Best Practices
1. **Numeric Extraction (German Locale):**
    * **Problem:** "1.005 Mitarbeiter" was extracted as "1" (treating dot as decimal).
    * **Solution:** Implemented context-aware logic. If a number has a dot followed by exactly 3 digits (and no comma), it is treated as a thousands separator.
    * **Revenue:** For revenue (`is_umsatz=True`), a single dot is generally treated as a decimal point (e.g. "375.6 Mio"); only multiple dots are read as unambiguous thousands separators. Values given in Billion/Mrd are converted to millions (× 1000).
2. **LLM JSON Stability:**
    * **Problem:** LLMs often wrap JSON in Markdown blocks (` ```json `), causing `json.loads()` to fail.
    * **Solution:** ALWAYS use a `clean_json_response` helper that strips markers before parsing. Never trust raw LLM output.
3. **Scraping Navigation:**
    * **Problem:** Searching for "Impressum" only on the *scraped* URL (which might be a subpage found via Google) often fails.
    * **Solution:** Always implement a fallback to the **Root Domain** AND a **2-Hop check** via the "Kontakt" page.
4. **Frontend State Management:**
    * **Problem:** Users didn't see when a background job finished.
    * **Solution:** A polling mechanism (`setInterval`) tied to an `isProcessing` state works better than static timeouts for long-running AI tasks.
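
The German-locale rules from lesson 1 can be sketched as below. This is an illustrative reconstruction from the description above, not the project's extraction code: the function name `parse_german_number` and the regular expressions are assumptions, and only the `is_umsatz` flag is taken from the notes.

```python
import re
from typing import Optional

# Digits with optional dot groups and at most one comma-separated decimal part.
_NUMBER = re.compile(r"\d+(?:\.\d+)*(?:,\d+)?")
_BILLION = re.compile(r"\b(?:mrd|milliarden?|billion)\b", re.IGNORECASE)


def parse_german_number(text: str, is_umsatz: bool = False) -> Optional[float]:
    """Parse the first number in text; revenue is returned in millions."""
    match = _NUMBER.search(text)
    if match is None:
        return None
    raw = match.group(0)

    if "," in raw:
        # Comma is the decimal mark, so any dots are thousands separators.
        value = float(raw.replace(".", "").replace(",", "."))
    elif raw.count(".") > 1:
        # Several dots can only be thousands separators ("1.234.567").
        value = float(raw.replace(".", ""))
    elif not is_umsatz and re.fullmatch(r"\d+\.\d{3}", raw):
        # One dot followed by exactly three digits: thousands separator.
        value = float(raw.replace(".", ""))
    else:
        # Plain integer, or a single dot read as a decimal point.
        value = float(raw)

    # Revenue given in billions (Mrd) is converted to millions.
    if is_umsatz and _BILLION.search(text):
        value *= 1000
    return value
```

For example, `parse_german_number("1.005 Mitarbeiter")` yields `1005.0`, while `parse_german_number("375.6 Mio", is_umsatz=True)` yields `375.6`.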
## Next Steps
* **Data Import:** Finalize the "List Matcher" to import and deduplicate Excel lists against the new DB.
* **Export:** Generate Excel/CSV exports of enriched leads for CRM import.