# Gemini Code Assistant Context

## CRITICAL RULE: DOCUMENTATION PRESERVATION (DO NOT IGNORE)

**DELETING DOCUMENTATION OR REPLACING IT WITH PLACEHOLDERS SUCH AS `... (rest of the file)` IS STRICTLY FORBIDDEN.**

This has happened several times in the past and caused massive data loss in critical files such as `MIGRATION_PLAN.md`.

**Rules for the agent:**

1. **Never** delete large blocks of text unless the user *explicitly* requests it.
2. **Always** check `git diff` before creating a commit. If a documentation file loses 100 lines, that is almost always a mistake.
3. When updating documentation: **only** add new information or make precise corrections to outdated information. **Never** overwrite the rest of the file.
4. If you have to restore a file, use `git log -p <filename>` and make sure you really restore *everything*.

---

## Important Notes

- **Project documentation:** The primary and most comprehensive documentation for this project is in `readme.md`. Consult that file for a detailed understanding of the architecture and the individual modules.
- **Git repository:** This project is managed in a Git repository. All code changes are versioned. See the "Git Workflow & Conventions" section for our working rules.
- **IMPORTANT:** The AI agent can commit changes but, for security reasons, often cannot run `git push`. Please run `git push` manually when the agent reports this.

## Project Overview

This project is a Python-based system for automated company data enrichment and lead generation. It focuses on identifying B2B companies with high potential for robotics automation (Cleaning, Transport, Security, Service).

The system architecture has evolved from a CLI-based toolset to a modern web application (`company-explorer`) backed by Docker containers.

## Current Status (Jan 15, 2026) - Company Explorer (Robotics Edition v0.5.0)

### 1. Contacts Management (v0.5)

* **Full CRUD:** Integrated Contact Management system with direct editing capabilities.
* **Global List View:** Dedicated view for all contacts across all companies with search and filter.
* **Data Model:** Supports advanced fields like Academic Title, Role Interpretation (Decision Maker vs. User), and Marketing Automation Status.
* **Bulk Import:** CSV-based bulk import for contacts that automatically creates missing companies and prevents duplicates via email matching.

### 2. UI/UX Modernization

* **Light/Dark Mode:** Full theme support with toggle.
* **Grid Layout:** Unified card-based layout for both Company and Contact lists.
* **Mobile Responsiveness:** Optimized Inspector overlay and navigation for mobile devices.
* **Tabbed Inspector:** Clean separation between Company Overview and Contact Management within the details pane.

### 3. Advanced Configuration (Settings)

* **Industry Verticals:** Database-backed configuration for target industries (Description, Focus Flag, Primary Product).
* **Job Role Mapping:** Configurable patterns (Regex/Text) to map job titles on business cards to internal roles (e.g., "CTO" -> "Innovation Driver").
* **Robotics Categories:** Existing AI reasoning logic remains configurable via the UI.
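
The Job Role Mapping idea above can be illustrated with a minimal sketch. It assumes the patterns are stored as (regex, role) pairs and that the first match wins. Only the "CTO" -> "Innovation Driver" pair comes from this document; `ROLE_PATTERNS`, `map_job_title`, and the second pattern are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical pattern table. In the application these rows would be loaded
# from the database-backed settings rather than hard-coded.
ROLE_PATTERNS: List[Tuple[str, str]] = [
    (r"\b(cto|chief technology officer)\b", "Innovation Driver"),
    (r"\bfacility\b", "Operations Contact"),  # made-up example entry
]


def map_job_title(title: str, patterns: List[Tuple[str, str]] = ROLE_PATTERNS) -> Optional[str]:
    """Return the internal role for the first pattern that matches the title."""
    for pattern, role in patterns:
        if re.search(pattern, title, flags=re.IGNORECASE):
            return role
    return None  # no pattern matched; leave the role unassigned


print(map_job_title("CTO & Co-Founder"))  # Innovation Driver
```

Returning `None` for unmatched titles keeps unknown roles visible instead of guessing a default.
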

### 4. Robotics Potential Analysis (v2.3)

* **Chain-of-Thought Logic:** The AI analysis (`ClassificationService`) uses multi-step reasoning to evaluate physical infrastructure.
* **Provider vs. User:** Strict differentiation logic implemented.

### 5. Web Scraping & Legal Data (v2.2)

* **Impressum Scraping:** 2-Hop Strategy and Root Fallback logic.
* **Manual Overrides:** Users can manually correct Wikipedia, Website, and Impressum URLs directly in the UI.

## Lessons Learned & Best Practices

1. **Numeric Extraction (German Locale):**
    * **Problem:** "1.005 Mitarbeiter" was extracted as "1" (the dot was treated as a decimal point).
    * **Solution:** Implemented context-aware logic. If a number has a dot followed by exactly 3 digits (and no comma), the dot is treated as a thousands separator.
    * **Revenue:** For revenue (`is_revenue=True`), dots are generally treated as decimals (e.g. "375.6 Mio") unless multiple dots make the thousands grouping unambiguous. Billion/Mrd is converted to 1000 million.

2. **The Wolfra/Greilmeier/Erding Fixes (Advanced Metric Parsing):**
    * **Problem:** Simple regex parsers fail on complex sentences with multiple numbers, concatenated years, or misleading prefixes.
    * **Solution (Hybrid Extraction & Regression Testing):**
        1. **LLM Guidance:** The LLM provides an `expected_value` (e.g., "8.000 m²").
        2. **Robust Python Parser (`MetricParser`):** This parser aggressively cleans the `expected_value` (stripping units like "m²") to get a numerical target. It then searches the full text for this target, ignoring other numbers (like "2" in "An 2 Standorten").
        3. **Specific Bug Fixes:**
            - **Year-Suffix:** Logic to detect and remove trailing years from concatenated numbers (e.g., "802020" -> "80").
            - **Year-Prefix:** Logic to ignore year-like numbers (1900-2100) if other, more likely candidates exist in the text.
            - **Sentence Truncation:** Removed overly aggressive logic that cut off sentences after a hyphen, which caused metrics at the end of a phrase to be missed.
    * **Safeguard:** These specific cases are now locked in via `test_metric_parser.py` to prevent future regressions.

3. **LLM JSON Stability:**
    * **Problem:** LLMs often wrap JSON in Markdown blocks (` ```json `), causing `json.loads()` to fail.
    * **Solution:** ALWAYS use a `clean_json_response` helper that strips the markers before parsing. Never trust raw LLM output.

4. **LLM Structure Inconsistency:**
    * **Problem:** Even with `json_mode=True`, models sometimes wrap the result in a list `[...]` instead of a flat object `{...}`, breaking frontend property access.
    * **Solution:** Implement a check: `if isinstance(result, list): result = result[0]`.

5. **Scraping Navigation:**
    * **Problem:** Searching for "Impressum" only on the *scraped* URL (which might be a subpage found via Google) often fails.
    * **Solution:** Always implement a fallback to the **Root Domain** AND a **2-Hop check** via the "Kontakt" page.

6. **Frontend State Management:**
    * **Problem:** Users didn't see when a background job finished.
    * **Solution:** A polling mechanism (`setInterval`) tied to an `isProcessing` state is superior to static timeouts for long-running AI tasks.
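
The German-locale rules from lesson 1 can be sketched roughly as follows. This is a minimal illustration under assumptions, not the project's actual code: the function names `parse_german_number` and `revenue_in_millions` and the unit regex are hypothetical, and only the `is_revenue` flag and the rules themselves come from the text above.

```python
import re
from typing import Optional


def parse_german_number(token: str, is_revenue: bool = False) -> float:
    """Convert a numeric token such as '1.005' or '375.6' to a float."""
    token = token.strip()
    if "," in token:
        # A comma is the German decimal mark, so any dots are thousands separators.
        return float(token.replace(".", "").replace(",", "."))
    if token.count(".") > 1:
        # Several dots can only be thousands separators, e.g. '1.250.000'.
        return float(token.replace(".", ""))
    if not is_revenue and re.fullmatch(r"[0-9]{1,3}\.[0-9]{3}", token):
        # One dot followed by exactly three digits: thousands separator.
        return float(token.replace(".", ""))
    # Otherwise a single dot is read as a decimal point (the revenue default).
    return float(token)


def revenue_in_millions(text: str) -> Optional[float]:
    """Find a revenue figure in free text and normalise it to millions."""
    match = re.search(
        r"([0-9][0-9.,]*)\s*(Mrd|Milliarde|Billion|Mio|Million)", text, re.IGNORECASE
    )
    if match is None:
        return None
    value = parse_german_number(match.group(1), is_revenue=True)
    if match.group(2).lower() in ("mrd", "milliarde", "billion"):
        value *= 1000  # Billion/Mrd is converted to 1000 million
    return value


print(parse_german_number("1.005"))                   # 1005.0
print(parse_german_number("375.6", is_revenue=True))  # 375.6
```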
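
Lessons 3 and 4 combine naturally into one helper. The sketch below assumes that `clean_json_response` both strips the Markdown fence and parses the result. The document only names the helper, so this signature and the empty-list guard are assumptions.

```python
import json
import re
from typing import Any

FENCE = "`" * 3  # the Markdown code-fence marker


def clean_json_response(raw: str) -> Any:
    """Strip Markdown fences from LLM output, parse it, and unwrap a list result."""
    text = raw.strip()
    # Remove an opening fence with an optional language tag (e.g. "json").
    text = re.sub(r"^" + re.escape(FENCE) + r"[a-zA-Z]*\s*", "", text)
    # Remove a closing fence.
    text = re.sub(r"\s*" + re.escape(FENCE) + r"$", "", text)
    result = json.loads(text)
    # Models sometimes return [{...}] instead of {...}.
    if isinstance(result, list):
        if not result:
            raise ValueError("LLM returned an empty list")
        result = result[0]
    return result


wrapped = FENCE + 'json\n[{"robotics_potential": "high"}]\n' + FENCE
print(clean_json_response(wrapped))  # {'robotics_potential': 'high'}
```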
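
The navigation strategy from lesson 5 might look like the sketch below. Everything here is an assumption made for illustration: the function names, the order of the checks (scraped page, then root domain, each with a "Kontakt" hop), and the deliberately simple regex link extraction. The `fetch` callable is injected so the logic can be exercised without network access.

```python
import re
from typing import Callable, Optional
from urllib.parse import urljoin, urlsplit

# Simplistic anchor matcher; a real scraper would use an HTML parser.
LINK_RE = re.compile(r'<a\s[^>]*href="([^"]+)"[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)


def find_link(html: str, base_url: str, keyword: str) -> Optional[str]:
    """Return the absolute URL of the first link whose label or href contains keyword."""
    for href, label in LINK_RE.findall(html):
        if keyword in label.lower() or keyword in href.lower():
            return urljoin(base_url, href)
    return None


def find_impressum(scraped_url: str, fetch: Callable[[str], str]) -> Optional[str]:
    """Look for an Impressum link on the scraped page, then on the root domain."""
    parts = urlsplit(scraped_url)
    root = f"{parts.scheme}://{parts.netloc}/"
    for page in dict.fromkeys([scraped_url, root]):  # de-duplicated, order kept
        html = fetch(page)
        direct = find_link(html, page, "impressum")
        if direct:
            return direct
        # 2-hop check: follow the "Kontakt" page and search there.
        kontakt = find_link(html, page, "kontakt")
        if kontakt:
            hop = find_link(fetch(kontakt), kontakt, "impressum")
            if hop:
                return hop
    return None
```

In a test, `fetch` can be a dictionary lookup over canned HTML. In production it would wrap whatever HTTP client the scraper already uses.
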
## Metric Parser - Regression Tests

To ensure the stability and accuracy of the metric extraction logic, a dedicated test suite (`/company-explorer/backend/tests/test_metric_parser.py`) has been created. It covers the following critical, real-world bug fixes:

1. **`test_wolfra_concatenated_year_bug`**:
    * **Problem:** A number and year were concatenated (e.g., "802020").
    * **Test:** Ensures the parser correctly identifies and strips the trailing year, extracting `80`.

2. **`test_erding_year_prefix_bug`**:
    * **Problem:** A year appeared before the actual metric in the sentence (e.g., "2022 ... 200.000 Besucher").
    * **Test:** Verifies that the parser's "Smart Year Skip" logic ignores the year and correctly extracts `200000`.

3. **`test_greilmeier_multiple_numbers_bug`**:
    * **Problem:** The text contained multiple numbers ("An 2 Standorten ... 8.000 m²"), and the parser incorrectly picked the first one.
    * **Test:** Confirms that when an `expected_value` (like "8.000 m²") is provided, the parser correctly cleans it and extracts the corresponding number (`8000`), ignoring other irrelevant numbers.

These tests are crucial for preventing regressions as the parser logic evolves.
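
To make the three cases concrete, here is a self-contained sketch of how such checks could look. It does not reproduce the real `MetricParser` or the contents of `test_metric_parser.py`, whose interfaces are not shown in this document. `extract_metric` is a hypothetical stand-in, and the sample sentences are invented around the fragments quoted above.

```python
import re
from typing import List, Optional

# Grouped German numbers ('8.000') or plain digit runs ('802020').
NUMBER_RE = re.compile(r"[0-9]{1,3}(?:\.[0-9]{3})+|[0-9]+")


def _is_year(value: int) -> bool:
    return 1900 <= value <= 2100


def _to_int(token: str) -> int:
    """Convert a matched token to an int, applying the year-suffix heuristic."""
    if "." not in token and len(token) > 4 and _is_year(int(token[-4:])):
        # Ungrouped token such as '802020' is read as '80' followed by '2020'.
        # This is a heuristic: a plain '12000' would be misread as well.
        return int(token[:-4])
    return int(token.replace(".", ""))


def extract_metric(text: str, expected_value: Optional[str] = None) -> Optional[int]:
    """Pick the most plausible metric from text, guided by an optional LLM hint."""
    candidates: List[int] = [_to_int(t) for t in NUMBER_RE.findall(text)]
    if expected_value:
        hint = NUMBER_RE.findall(expected_value)  # drops units such as 'm²'
        if hint and int(hint[0].replace(".", "")) in candidates:
            return int(hint[0].replace(".", ""))
    # Year-prefix fix: skip year-like numbers when other candidates exist.
    non_years = [v for v in candidates if not _is_year(v)]
    pool = non_years or candidates
    return pool[0] if pool else None


def test_wolfra_concatenated_year_bug() -> None:
    assert extract_metric("Mitarbeiter: 802020") == 80


def test_erding_year_prefix_bug() -> None:
    assert extract_metric("2022 kamen 200.000 Besucher") == 200000


def test_greilmeier_multiple_numbers_bug() -> None:
    text = "An 2 Standorten stehen 8.000 m² bereit"
    assert extract_metric(text, expected_value="8.000 m²") == 8000
```

Written this way, the functions are collected by `pytest` automatically, so each fixed bug stays pinned to one named test.
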

## Next Steps

* **Marketing Automation:** Implement the actual sending logic (or export) based on the contact status.
* **Job Role Mapping Engine:** Connect the configured patterns to the contact import/creation process to auto-assign roles.
* **Industry Classification Engine:** Connect the configured industries to the AI Analysis prompt to enforce the "Strict Mode" mapping.
* **Export:** Generate enriched Excel/CSV reports (already partially implemented via JSON export).