Floke 3510e73c62 [2f488f42] GEMINI.md was updated to document the new #fertig command and its associated workflow. This convention ensures that the completion of work packages is detected reliably.

2026-01-27 11:56:43 +00:00

Gemini Code Assistant Context

CRITICAL RULE: DOCUMENTATION PRESERVATION (DO NOT IGNORE)

IT IS STRICTLY FORBIDDEN TO DELETE DOCUMENTATION OR TO REPLACE IT WITH PLACEHOLDERS SUCH AS ... (rest of the file).

This has happened several times in the past and caused massive data loss in critical files such as MIGRATION_PLAN.md.

Rules for the agent:

Never delete large blocks of text unless the user explicitly requests it.
Always check git diff before creating a commit. If a documentation file loses 100 lines, that is almost always a mistake.
When updating documentation: only add new information or precisely correct outdated information. Never overwrite the rest of the file.
If you need to "restore" a file, use git log -p <filename> and make sure you really restore everything.
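
The git diff rule above can be automated as a pre-commit check. The following is a minimal sketch, not part of the repository: the function names, the restriction to .md files, and the use of the net loss (deleted minus added lines) are assumptions made for illustration. It relies only on the tab-separated output of git diff --numstat.

```python
import subprocess

DOC_SUFFIXES = (".md",)   # assumption: documentation files are Markdown files
MAX_LOST_LINES = 100      # threshold taken from the rule above


def find_suspicious_doc_deletions(numstat_output, max_lost=MAX_LOST_LINES):
    """Return (path, net_lost_lines) for documentation files that shrink too much.

    Expects the output of `git diff --numstat`: one line per file with
    added, deleted and path separated by tabs.
    """
    suspicious = []
    for line in numstat_output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        added, deleted, path = parts
        if added == "-" or deleted == "-":
            continue  # binary files are reported as "-"
        net_lost = int(deleted) - int(added)
        if path.lower().endswith(DOC_SUFFIXES) and net_lost >= max_lost:
            suspicious.append((path, net_lost))
    return suspicious


def staged_numstat():
    """Read the numstat of the staged changes (requires a Git working tree)."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--numstat"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

A caller would pass staged_numstat() to find_suspicious_doc_deletions() and abort the commit if the returned list is not empty.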

Important Notes

Project documentation: The primary and most comprehensive documentation for this project is the file readme.md. Please consult it for a detailed understanding of the architecture and the individual modules.
Git repository: This project is managed in a Git repository. All code changes are versioned. See the section "Git Workflow & Conventions" for our working rules.
- IMPORTANT: The AI agent can commit changes but, for security reasons, often cannot run git push. Please run git push manually when the agent reports this.

Git Workflow & Conventions

Finishing the Workday with `#fertig`

To complete a work step or a task, use the command #fertig.

IMPORTANT: Do not use /fertig or just fertig. Only the command with the hash sign (#) is recognized correctly.

When you enter #fertig, the agent performs the following steps:

Analysis: The agent checks whether any code changes have been made since the last commit.
Summary: It automatically generates a work summary based on the code changes.
Status update: The agent runs the script python3 dev_session.py --report-status in the background.
- The time invested in the current session is calculated and stored in Notion.
- A new status report containing the summary is attached to the Notion task.
- The status of the task in Notion is set to "Done" (or another suitable status).
Commit & Push: If there are code changes, a commit is created and a git push is requested interactively.
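
The recognition rule for the command (only the variant with the hash sign counts) could be expressed as a small check. This is an illustrative sketch; the function name is hypothetical, and it assumes that the command is case-sensitive and may appear anywhere in a message as a separate word.

```python
import re

# "#fertig" must stand alone: no non-whitespace character directly before or after it.
FERTIG_PATTERN = re.compile(r"(?<!\S)#fertig(?!\S)")


def is_fertig_command(message):
    """Return True only if the message contains the exact token #fertig."""
    return FERTIG_PATTERN.search(message) is not None
```

With this rule, /fertig, a bare fertig, and longer words such as #fertiggestellt are all rejected.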

Project Overview

This project is a Python-based system for automated company data enrichment and lead generation. It focuses on identifying B2B companies with high potential for robotics automation (Cleaning, Transport, Security, Service).

The system architecture has evolved from a CLI-based toolset to a modern web application (company-explorer) backed by Docker containers.

Current Status (Jan 15, 2026) - Company Explorer (Robotics Edition v0.5.0)

1. Contacts Management (v0.5)

Full CRUD: Integrated Contact Management system with direct editing capabilities.
Global List View: Dedicated view for all contacts across all companies with search and filter.
Data Model: Supports advanced fields like Academic Title, Role Interpretation (Decision Maker vs. User), and Marketing Automation Status.
Bulk Import: CSV-based bulk import for contacts that automatically creates missing companies and prevents duplicates via email matching.
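
The bulk import behaviour (create missing companies, skip duplicates by email) can be sketched roughly as follows. The column names name, email and company, the in-memory dictionaries, and the function name are assumptions for illustration; the actual importer works against the database and may differ.

```python
import csv
import io


def import_contacts(csv_text, companies, contacts_by_email):
    """Import contacts from CSV text and return (created, skipped).

    companies maps a company name to an id, contacts_by_email maps a
    normalized email address to the stored contact. Both are updated in place.
    """
    created = skipped = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Normalize the email so that case and stray spaces do not hide duplicates.
        email = (row.get("email") or "").strip().lower()
        company = (row.get("company") or "").strip()
        if not email or email in contacts_by_email:
            skipped += 1
            continue
        if company not in companies:
            companies[company] = len(companies) + 1  # create the missing company
        contacts_by_email[email] = {
            "name": (row.get("name") or "").strip(),
            "company_id": companies[company],
        }
        created += 1
    return created, skipped
```

Matching on the normalized email rather than on the name keeps two differently spelled entries for the same person from being stored twice.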

2. UI/UX Modernization

Light/Dark Mode: Full theme support with toggle.
Grid Layout: Unified card-based layout for both Company and Contact lists.
Mobile Responsiveness: Optimized Inspector overlay and navigation for mobile devices.
Tabbed Inspector: Clean separation between Company Overview and Contact Management within the details pane.

3. Advanced Configuration (Settings)

Industry Verticals: Database-backed configuration for target industries (Description, Focus Flag, Primary Product).
Job Role Mapping: Configurable patterns (Regex/Text) to map job titles on business cards to internal roles (e.g., "CTO" -> "Innovation Driver").
Robotics Categories: Existing AI reasoning logic remains configurable via the UI.
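
To illustrate the job role mapping, here is a minimal sketch of pattern-based matching. Only the "CTO" -> "Innovation Driver" pair comes from this document; the second pattern, the data structure, and the function name are hypothetical, and in the application the patterns are loaded from the database-backed settings rather than hard-coded.

```python
import re

# (pattern, internal role); the first matching pattern wins.
ROLE_PATTERNS = [
    (re.compile(r"\bCTO\b|chief technology officer", re.IGNORECASE), "Innovation Driver"),
    (re.compile(r"facility manage", re.IGNORECASE), "Operations"),  # hypothetical example
]


def map_job_title(title, patterns=ROLE_PATTERNS, default=None):
    """Return the internal role for a job title, or default if nothing matches."""
    for pattern, role in patterns:
        if pattern.search(title):
            return role
    return default
```

The word boundaries around CTO prevent accidental hits inside longer words such as "Director".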

4. Robotics Potential Analysis (v2.3)

Chain-of-Thought Logic: The AI analysis (ClassificationService) uses multi-step reasoning to evaluate physical infrastructure.
Provider vs. User: Strict differentiation logic implemented.

5. Web Scraping & Legal Data (v2.2)

Impressum Scraping: 2-Hop Strategy and Root Fallback logic.
Manual Overrides: Users can manually correct Wikipedia, Website, and Impressum URLs directly in the UI.
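
The 2-Hop Strategy and Root Fallback could look roughly like the sketch below. It is an assumption-laden illustration: the order of the attempts, the function names, and the injected fetch callable (which returns the HTML of a URL or None) are not taken from the code base, and the regular expression is a shortcut where a real scraper would use an HTML parser.

```python
import re
from urllib.parse import urljoin, urlsplit

LINK_RE = re.compile(r'<a\s[^>]*href="([^"]+)"[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)


def find_link(html, base_url, keyword):
    """Return the absolute URL of the first link whose label or href contains keyword."""
    keyword = keyword.lower()
    for href, label in LINK_RE.findall(html):
        if keyword in label.lower() or keyword in href.lower():
            return urljoin(base_url, href)
    return None


def find_impressum_url(start_url, fetch):
    """Look for an Impressum link on the scraped page, then on the root domain.

    On each page a direct Impressum link is tried first; otherwise the
    "Kontakt" page is followed as a second hop.
    """
    parts = urlsplit(start_url)
    root_url = f"{parts.scheme}://{parts.netloc}/"
    for page_url in dict.fromkeys([start_url, root_url]):  # de-duplicate, keep order
        html = fetch(page_url)
        if not html:
            continue
        impressum = find_link(html, page_url, "impressum")
        if impressum:
            return impressum
        kontakt = find_link(html, page_url, "kontakt")
        if kontakt:
            kontakt_html = fetch(kontakt)
            if kontakt_html:
                impressum = find_link(kontakt_html, kontakt, "impressum")
                if impressum:
                    return impressum
    return None
```

Passing fetch as a parameter keeps the lookup logic testable without network access.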

Lessons Learned & Best Practices

Numeric Extraction (German Locale):
- Problem: "1.005 Mitarbeiter" was extracted as "1" (treating dot as decimal).
- Solution: Implemented context-aware logic. If a number has a dot followed by exactly 3 digits (and no comma), it is treated as a thousands separator.
- Revenue: For revenue (is_revenue=True), dots are generally treated as decimals (e.g. "375.6 Mio") unless unambiguous multiple dots exist. Billion/Mrd is converted to 1000 Million.
The Wolfra/Greilmeier/Erding Fixes (Advanced Metric Parsing):
- Problem: Simple regex parsers fail on complex sentences with multiple numbers, concatenated years, or misleading prefixes.
- Solution (Hybrid Extraction & Regression Testing):
  1. LLM Guidance: The LLM provides an expected_value (e.g., "8.000 m²").
  2. Robust Python Parser (MetricParser): This parser aggressively cleans the expected_value (stripping units like "m²") to get a numerical target. It then intelligently searches the full text for this target, ignoring other numbers (like "2" in "An 2 Standorten").
  3. Specific Bug Fixes:
    - Year-Suffix: Logic to detect and remove trailing years from concatenated numbers (e.g., "802020" -> "80").
    - Year-Prefix: Logic to ignore year-like numbers (1900-2100) if other, more likely candidates exist in the text.
    - Sentence Truncation: Removed overly aggressive logic that cut off sentences after a hyphen, which caused metrics at the end of a phrase to be missed.
- Safeguard: These specific cases are now locked in via test_metric_parser.py to prevent future regressions.
LLM JSON Stability:
- Problem: LLMs often wrap JSON in Markdown blocks (```json), causing json.loads() to fail.
- Solution: ALWAYS use a clean_json_response helper that strips markers before parsing. Never trust raw LLM output.
LLM Structure Inconsistency:
- Problem: Even with json_mode=True, models sometimes wrap the result in a list [...] instead of a flat object {...}, breaking frontend property access.
- Solution: Implement a check: if isinstance(result, list): result = result[0].
Scraping Navigation:
- Problem: Searching for "Impressum" only on the scraped URL (which might be a subpage found via Google) often fails.
- Solution: Always implement a fallback to the Root Domain AND a 2-Hop check via the "Kontakt" page.
Frontend State Management:
- Problem: Users didn't see when a background job finished.
- Solution: A polling mechanism (setInterval) tied to an isProcessing state is superior to static timeouts for long-running AI tasks.
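
The two LLM JSON lessons above (stripping Markdown markers and unwrapping a list) can be combined in one defensive parsing step. The sketch below is one possible shape of such a helper: the name clean_json_response comes from this document, while parse_llm_json, the regular expression, and the guard against an empty list are assumptions added for illustration.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence marker
FENCE_RE = re.compile(rf"^{FENCE}(?:json)?\s*|\s*{FENCE}$", re.IGNORECASE)


def clean_json_response(raw):
    """Strip a leading and trailing Markdown fence from raw LLM output."""
    return FENCE_RE.sub("", raw.strip()).strip()


def parse_llm_json(raw):
    """Parse LLM output into a flat dict, tolerating fences and list wrapping."""
    result = json.loads(clean_json_response(raw))
    if isinstance(result, list):
        if not result:
            # Assumption: an empty list is treated as an error instead of result[0].
            raise ValueError("LLM returned an empty list")
        result = result[0]
    if not isinstance(result, dict):
        raise ValueError("LLM did not return a JSON object")
    return result
```

Raising on unexpected shapes keeps malformed results from reaching the frontend, where they would otherwise only surface as broken property access.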

Metric Parser - Regression Tests

To ensure the stability and accuracy of the metric extraction logic, a dedicated test suite (/company-explorer/backend/tests/test_metric_parser.py) has been created. It covers the following critical, real-world bug fixes:

test_wolfra_concatenated_year_bug:
- Problem: A number and year were concatenated (e.g., "802020").
- Test: Ensures the parser correctly identifies and strips the trailing year, extracting 80.
test_erding_year_prefix_bug:
- Problem: A year appeared before the actual metric in the sentence (e.g., "2022 ... 200.000 Besucher").
- Test: Verifies that the parser's "Smart Year Skip" logic ignores the year and correctly extracts 200000.
test_greilmeier_multiple_numbers_bug:
- Problem: The text contained multiple numbers ("An 2 Standorten ... 8.000 m²"), and the parser incorrectly picked the first one.
- Test: Confirms that when an expected_value (like "8.000 m²") is provided, the parser correctly cleans it and extracts the corresponding number (8000), ignoring other irrelevant numbers.

These tests are crucial for preventing regressions as the parser logic evolves.
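
To make the three regression cases concrete, the following sketch reimplements the described behaviour in a simplified form. It is not the MetricParser from the repository: all names are hypothetical, the revenue-specific rules are left out, and as a deliberate simplification a trailing year is only stripped when doing so reproduces the expected_value, so that a plain number such as 12000 is not mangled.

```python
import re

# Grouped thousands ("8.000", "200.000") or a plain number ("2022", "375.6").
NUMBER_RE = re.compile(r"\d{1,3}(?:\.\d{3})+(?!\d)(?:,\d+)?|\d+(?:[.,]\d+)?", re.ASCII)
THOUSANDS_RE = re.compile(r"\d{1,3}(?:\.\d{3})+", re.ASCII)
TRAILING_YEAR_RE = re.compile(r"(\d+)((?:19|20)\d{2})", re.ASCII)


def parse_german_number(token):
    """Parse a German-formatted number: '1.005' -> 1005.0, '3,5' -> 3.5."""
    if "," in token:
        return float(token.replace(".", "").replace(",", "."))
    if THOUSANDS_RE.fullmatch(token):
        return float(token.replace(".", ""))  # dots followed by exactly 3 digits
    return float(token)


def strip_trailing_year(token):
    """Remove a concatenated trailing year: '802020' -> '80'."""
    match = TRAILING_YEAR_RE.fullmatch(token)
    return match.group(1) if match else token


def extract_metric(text, expected_value=None):
    """Extract one metric from text, guided by an optional expected_value."""
    tokens = NUMBER_RE.findall(text)
    if expected_value is not None:
        target_tokens = NUMBER_RE.findall(expected_value)  # drops units like "m²"
        if target_tokens:
            target = parse_german_number(target_tokens[0])
            for token in tokens:
                if parse_german_number(token) == target:
                    return target
                if parse_german_number(strip_trailing_year(token)) == target:
                    return target
    values = [parse_german_number(token) for token in tokens]
    # Skip year-like numbers if other candidates exist.
    non_years = [v for v in values if not (v.is_integer() and 1900 <= v <= 2100)]
    if non_years:
        return non_years[0]
    return values[0] if values else None
```

For example, extract_metric("An 2 Standorten ... 8.000 m²", expected_value="8.000 m²") skips the 2 and returns the value 8000.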

Next Steps

Marketing Automation: Implement the actual sending logic (or export) based on the contact status.
Job Role Mapping Engine: Connect the configured patterns to the contact import/creation process to auto-assign roles.
Industry Classification Engine: Connect the configured industries to the AI Analysis prompt to enforce the "Strict Mode" mapping.
Export: Generate Excel/CSV enriched reports (already partially implemented via JSON export).
