# Gemini Code Assistant Context

## Important Notes

- **Project documentation:** The primary and most comprehensive documentation for this project is in `readme.md`. Consult that file for a detailed understanding of the architecture and the individual modules.
- **Git repository:** This project is managed in a Git repository, and all code changes are versioned. See the "Git Workflow & Conventions" section for our working rules.
- **IMPORTANT:** The AI agent can commit changes, but for security reasons it often cannot run `git push`. Run `git push` manually when the agent reports this.

## Project Overview

This project is a Python-based system for automated company data enrichment and lead generation. It focuses on identifying B2B companies with high potential for robotics automation (Cleaning, Transport, Security, Service).

The system architecture has evolved from a CLI-based toolset to a modern web application (`company-explorer`) backed by Docker containers.

## Current Status (Jan 08, 2026) - Company Explorer (Robotics Edition v0.3.0)

### 1. Robotics Potential Analysis (v2.3)
* **Chain-of-Thought Logic:** The AI analysis (`ClassificationService`) now uses a multi-step reasoning process to evaluate companies based on their **physical infrastructure** (factories, warehouses) rather than just keywords.
* **Provider vs. User:** Strict logic distinguishes companies *selling* automation products from those *needing* them for their own operations.
* **Configurable Settings:** A database-driven configuration (`RoboticsCategory`) allows users to edit the definition and scoring logic for each robotics category directly via the frontend settings menu.

### 2. Deep Wikipedia Integration (v2.1)
* **Extraction:** The system extracts the first paragraph (cleaned of artifacts), industry, revenue (normalized to Mio €), employee count, and Wikipedia categories.
* **Validation:** Uses a "Google-First" strategy via SerpAPI, validating candidates by checking for domain matches and city/HQ location in the article.
* **UI:** The Inspector displays a dedicated Wikipedia profile section with visual tags.

### 3. Web Scraping & Legal Data (v2.2)
* **Impressum Scraping:**
    * **2-Hop Strategy:** If no "Impressum" link is found on the landing page, the scraper automatically searches for a "Kontakt" page and checks for the link there.
    * **Root Fallback:** If deep links (e.g. `/about-us`) fail, the scraper checks the root domain (`/`).
    * **LLM Extraction:** Unstructured legal text is parsed by Gemini to extract structured JSON (Legal Name, Address, CEO, VAT ID).
* **Robustness:**
    * **JSON Cleaning:** A helper (`clean_json_response`) strips Markdown code blocks from LLM responses to prevent parsing errors.
    * **Schema Enforcement:** Added logic to handle inconsistent LLM responses (e.g., returning a list `[{...}]` instead of a flat object `{...}`).

### 4. User Control & Ops
* **Manual Overrides:** Users can manually correct the Wikipedia URL (locking the data) and the Company Website (triggering a fresh re-scrape).
* **Polling UI:** The frontend uses intelligent polling to auto-refresh data when background jobs (Discovery/Analysis) complete.
* **Forced Refresh:** The "Analyze" endpoint now clears old cache data to ensure a fresh scrape on every user request.

## Lessons Learned & Best Practices

1. **Numeric Extraction (German Locale):**
    * **Problem:** "1.005 Mitarbeiter" was extracted as "1" because the dot was treated as a decimal point.
    * **Solution:** Implemented context-aware logic. If a number has a dot followed by exactly 3 digits (and no comma), the dot is treated as a thousands separator.
    * **Revenue:** For revenue (`is_umsatz=True`), dots are generally treated as decimal points (e.g. "375.6 Mio") unless several dots appear, which unambiguously marks them as thousands separators. Billion/Mrd values are converted to 1000 million.

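
The rules from lesson 1 can be sketched as a single parsing function. This is an illustrative reconstruction under stated assumptions, not the project's code: the name `parse_german_number` and the regular expressions are hypothetical, only the `is_umsatz` flag comes from the text, and the comma branch (comma as decimal mark) is an added assumption about German notation.

```python
import re
from typing import Optional


def parse_german_number(text: str, is_umsatz: bool = False) -> Optional[float]:
    """Parse the first number in `text` following the heuristics from lesson 1."""
    match = re.search(r"\d[\d.,]*", text)
    if match is None:
        return None
    raw = match.group(0).rstrip(".,")  # drop sentence punctuation after the number

    if "," in raw:
        # Assumption: German notation, dots group thousands, comma is the decimal mark.
        normalized = raw.replace(".", "").replace(",", ".")
    elif raw.count(".") > 1:
        # Several dots can only be thousands separators ("1.234.567").
        normalized = raw.replace(".", "")
    elif not is_umsatz and re.fullmatch(r"\d{1,3}\.\d{3}", raw):
        # One dot followed by exactly three digits: thousands separator.
        normalized = raw.replace(".", "")
    else:
        # Revenue, or any other single dot: treat the dot as a decimal point.
        normalized = raw

    try:
        value = float(normalized)
    except ValueError:
        return None

    # Revenue is kept in millions, so Mrd/Billion values are scaled by 1000.
    if is_umsatz and re.search(r"\b(mrd|milliarde|billion)", text, re.IGNORECASE):
        value *= 1000
    return value
```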
2. **LLM JSON Stability:**
    * **Problem:** LLMs often wrap JSON in Markdown blocks (` ```json `), causing `json.loads()` to fail.
    * **Solution:** ALWAYS use a `clean_json_response` helper that strips markers before parsing. Never trust raw LLM output.

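
A helper of this kind might look like the sketch below. The name `clean_json_response` is taken from the text, but the body is an assumption about how such a helper could work; `parse_llm_json` is a hypothetical wrapper added to show the helper in front of `json.loads()`. The pattern also tolerates prose before or after the fenced block.

```python
import json
import re
from typing import Any

# Matches an optional "json" language tag and captures the fenced content.
_FENCE = re.compile(r"```(?:json)?\s*(.*?)\s*```", re.DOTALL | re.IGNORECASE)


def clean_json_response(raw: str) -> str:
    """Strip Markdown code-fence markers from an LLM response, if present."""
    match = _FENCE.search(raw)
    if match:
        return match.group(1).strip()
    return raw.strip()


def parse_llm_json(raw: str) -> Any:
    """Hypothetical wrapper: clean first, then parse."""
    return json.loads(clean_json_response(raw))
```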
3. **LLM Structure Inconsistency:**
    * **Problem:** Even with `json_mode=True`, models sometimes wrap the result in a list `[...]` instead of a flat object `{...}`, breaking frontend property access.
    * **Solution:** Implement a check: `if isinstance(result, list): result = result[0]`.

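
The one-line check from lesson 3 raises `IndexError` on an empty list, so a slightly more defensive variant may be worth considering. The function name `unwrap_single_object` and the choice to raise `ValueError` are assumptions for illustration only.

```python
from typing import Any, Dict


def unwrap_single_object(result: Any) -> Dict[str, Any]:
    """Return a flat object even if the model wrapped it in a list."""
    if isinstance(result, list):
        if not result:
            raise ValueError("LLM returned an empty list instead of an object")
        result = result[0]
    if not isinstance(result, dict):
        raise ValueError(f"Expected a JSON object, got {type(result).__name__}")
    return result
```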
4. **Scraping Navigation:**
    * **Problem:** Searching for "Impressum" only on the *scraped* URL (which might be a subpage found via Google) often fails.
    * **Solution:** Always implement a fallback to the **Root Domain** AND a **2-Hop check** via the "Kontakt" page.

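
The root-domain fallback and the 2-hop check from lesson 4 can be sketched with the standard library alone. Everything named here (`find_link`, `find_impressum`, the injectable `fetch` callable that returns HTML or `None`) is hypothetical; the real scraper may use a different HTTP client and HTML parser.

```python
from html.parser import HTMLParser
from typing import Callable, List, Optional, Tuple
from urllib.parse import urljoin, urlsplit


class _LinkCollector(HTMLParser):
    """Collects (href, link text) pairs from anchor tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def find_link(html: str, base_url: str, keyword: str) -> Optional[str]:
    """Return the first absolute link whose href or text contains `keyword`."""
    collector = _LinkCollector()
    collector.feed(html)
    for href, text in collector.links:
        if keyword in href.lower() or keyword in text.lower():
            return urljoin(base_url, href)
    return None


def find_impressum(start_url: str, fetch: Callable[[str], Optional[str]]) -> Optional[str]:
    """Look for an Impressum link on the scraped URL, then on the root domain."""
    parts = urlsplit(start_url)
    root = f"{parts.scheme}://{parts.netloc}/"
    for page_url in dict.fromkeys([start_url, root]):  # de-duplicated, order kept
        html = fetch(page_url)
        if html is None:
            continue
        link = find_link(html, page_url, "impressum")
        if link:
            return link
        # 2-hop check: follow the "Kontakt" page and look for the link there.
        kontakt = find_link(html, page_url, "kontakt")
        kontakt_html = fetch(kontakt) if kontakt else None
        if kontakt_html:
            link = find_link(kontakt_html, kontakt, "impressum")
            if link:
                return link
    return None
```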
5. **Frontend State Management:**
    * **Problem:** Users didn't see when a background job finished.
    * **Solution:** Use a polling mechanism (`setInterval`) tied to an `isProcessing` state. For long-running AI tasks this is more reliable than static timeouts.

6. **Notion API - Schema First:**
    * **Problem:** Scripts failed when trying to write data to a Notion database property (column) that did not exist.
    * **Solution:** ALWAYS ensure the database schema is correct *before* attempting to import or update data. Use the `databases.update` endpoint to add the required properties (e.g., "Key Features", "Constraints") programmatically as a preliminary step. The API will not create them on the fly.

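
A schema-first step for lesson 6 could start by computing which properties are missing and building the update payload for them. This sketch assumes that all new properties are rich text and that `existing` is the `properties` mapping of a previously retrieved database object; `build_schema_patch` is a hypothetical name, and the exact endpoint and payload shape depend on the Notion API version in use.

```python
from typing import Any, Dict, Iterable


def build_schema_patch(existing: Dict[str, Any], required: Iterable[str]) -> Dict[str, Any]:
    """Build a request body that adds the missing properties as rich text."""
    missing = [name for name in required if name not in existing]
    # An empty configuration object asks for a property of that type with defaults.
    return {"properties": {name: {"rich_text": {}} for name in missing}}


# The resulting body would be sent with the database update call
# (PATCH /v1/databases/{database_id}) before any page is written.
patch = build_schema_patch(
    existing={"Name": {"title": {}}},
    required=["Key Features", "Constraints"],
)
```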
7. **Notion API - Character Limits:**
    * **Problem:** API calls failed with a `400 Bad Request` error when a rich text field exceeded the maximum length.
    * **Solution:** Be aware of the **2000-character limit** for rich text properties. Implement logic to truncate text content before sending the payload to the Notion API to prevent validation errors.

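
Truncation for lesson 7 can be a small helper applied wherever a rich text payload is built. The names `truncate_for_notion` and `rich_text_property` and the `"..."` suffix are assumptions; the sketch counts Python characters, and whether the API counts some characters (such as emoji) differently is not verified here.

```python
from typing import Any, Dict

NOTION_TEXT_LIMIT = 2000  # limit stated in lesson 7


def truncate_for_notion(text: str, limit: int = NOTION_TEXT_LIMIT, suffix: str = "...") -> str:
    """Shorten `text` to at most `limit` characters, marking the cut with `suffix`."""
    if len(text) <= limit:
        return text
    return text[: limit - len(suffix)] + suffix


def rich_text_property(text: str) -> Dict[str, Any]:
    """Build a rich text property value whose content respects the limit."""
    return {"rich_text": [{"type": "text", "text": {"content": truncate_for_notion(text)}}]}
```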
8. **Notion API - Response Structures:**
    * **Problem:** Parsing functions failed with `TypeError` or `AttributeError` because the JSON structure for a property differed depending on how it was requested.
    * **Solution:** Write robust helper functions that can handle multiple possible JSON structures. A property object retrieved via a direct property endpoint (`/pages/{id}/properties/{prop_id}`) is structured differently from the same property when it's part of a full page object (`/pages/{id}`). The parsing logic must account for these variations.

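
For text-like properties, a tolerant parser for lesson 8 might accept both shapes: the page-object form, where the value is a list of rich text fragments, and the property-endpoint form, where a list object holds one fragment per result. The helper name `rich_text_to_plain` is hypothetical, and the sketch only covers `title` and `rich_text` style properties.

```python
from typing import Any, Dict, List


def rich_text_to_plain(prop: Dict[str, Any]) -> str:
    """Extract plain text from a text-like property in either response shape."""
    fragments: List[Any]
    if prop.get("object") == "list":
        # Property endpoint: each result is an item holding ONE fragment
        # under the key named by its "type".
        fragments = [
            item.get(item.get("type", ""), {}) for item in prop.get("results", [])
        ]
    else:
        # Page object: the value under the type key is a list of fragments.
        value = prop.get(prop.get("type", ""), [])
        fragments = value if isinstance(value, list) else [value]
    return "".join(f.get("plain_text", "") for f in fragments if isinstance(f, dict))
```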
## Next Steps
* **Quality Assurance:** Implement a dedicated "Review Mode" to validate high-potential leads.
* **Export:** Generate enriched reports as Excel/CSV files.