Floke 194f95f726 docs: restore and enhance MIGRATION_PLAN.md with full history and lessons learned

2026-01-24 14:06:56 +00:00

Migration Plan: Legacy GSheets -> Company Explorer (Robotics Edition v0.7.4)

Context: A fresh start for the Robotics & Facility Management sector. Goal: replace Google Sheets/CLI with a web app ("Company Explorer") backed by SQLite.

1. Strategic Realignment

| Area | Old (Legacy) | New (Robotics Edition) |
|---|---|---|
| Data basis | Google Sheets | SQLite (local, fast, filterable). |
| Target data | General / customer service | Quantifiable potential (e.g. 4,500 m² of floor area, 120 beds). |
| Industries | AI suggestion (free text) | Strict Mode: mapping to a defined Notion list (e.g. "Hotellerie", "Automotive"). |
| Scoring | 0-100 score (vague) | Data-driven: raw value (scraper/search) -> standardization (formula) -> potential. |
| Analytics | Technician ML model | Disabled. Focus on hard facts. |
| Operations | D365 sync (broken) | Excel import & deduplication. Focus on matching external lists against existing records. |

2. Architecture & Component Mapping

The system is rebuilt from scratch in company-explorer/. Dependencies on the root helpers.py are removed.

A. Core Backend (`backend/`)

| Component | Task & New Logic | Prio |
|---|---|---|
| Database | Replaces `GoogleSheetHandler`. Stores companies & "enrichment blobs". | 1 |
| Importer | Replaces `SyncManager`. Imports Excel dumps (CRM) and event lists. | 1 |
| Deduplicator | Replaces `company_deduplicator.py`. Core feature: checks event lists against the DB. Must match "intelligently" (name + city + web). | 1 |
| Scraper (Base) | Extracts text from websites. Basis for all analyses. | 1 |
| Classification Service | NEW (v0.7.0). Two-stage logic: 1. Strict Industry Classification. 2. Metric Extraction Cascade (Web -> Wiki -> SerpAPI). | 1 |
| Marketing Engine | Replaces `generate_marketing_text.py`. Uses the new `marketing_wissen_robotics.yaml`. | 3 |
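
The Metric Extraction Cascade (Web -> Wiki -> SerpAPI) of the Classification Service can be pictured as an ordered list of sources that is walked until one yields a value. The following is a minimal sketch under that assumption. The function `run_metric_cascade` and the three stub extractors are hypothetical and do not come from the project code.

```python
from typing import Callable, Optional, Sequence, Tuple

# Each source is a (label, extractor) pair. The extractor returns the raw
# metric value, or None when nothing usable was found. All names here are
# hypothetical placeholders for the real scraper / Wikipedia / SerpAPI calls.
Source = Tuple[str, Callable[[str], Optional[float]]]


def run_metric_cascade(
    company: str, sources: Sequence[Source]
) -> Tuple[Optional[float], Optional[str]]:
    """Try each source in order and stop at the first one that yields a value."""
    for label, extractor in sources:
        try:
            value = extractor(company)
        except Exception:
            # A failing source must not abort the cascade; try the next one.
            value = None
        if value is not None:
            return value, label
    return None, None


def from_website(company: str) -> Optional[float]:
    return None  # pretend the website had no usable number


def from_wikipedia(company: str) -> Optional[float]:
    return 352.0  # pretend Wikipedia listed 352 beds


def from_serpapi(company: str) -> Optional[float]:
    raise RuntimeError("search backend unavailable")


value, source = run_metric_cascade(
    "Example Clinic",
    [("website", from_website), ("wikipedia", from_wikipedia), ("serpapi", from_serpapi)],
)
print(value, source)
```

The returned label doubles as the value for a field such as `metric_source`, so the origin of every number stays traceable.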

B. Frontend (`frontend/`) - React

View 1: "Explorer": data grid of all companies. Filterable by "robot potential" and status.
View 2: "Inspector": detail view of a single company. Shows detected signals (e.g. "has spa area"). Allows manual correction.
View 3: "List Matcher": upload an Excel list -> display duplicates -> "Import new" button.
View 4: "Settings": configuration of industries, roles, and robotics logic.
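
The List Matcher depends on matching uploaded rows against existing companies by name, city, and website. The sketch below shows one possible way to do this with the standard library only. The helper names, the list of legal-form suffixes, and the 0.85 similarity threshold are assumptions for illustration, not the project's actual matching rules.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlparse

# Legal-form suffixes ignored when comparing names (illustrative, not exhaustive).
LEGAL_FORMS = re.compile(r"\b(gmbh|ag|kg|se|ug|mbh|co|ohg)\b")


def normalize_name(name: str) -> str:
    name = LEGAL_FORMS.sub(" ", name.lower())
    name = re.sub(r"[^a-z0-9äöüß ]", " ", name)  # drop punctuation such as "&" or "."
    return " ".join(name.split())


def normalize_domain(url: str) -> str:
    if not url:
        return ""
    if "://" not in url:
        url = "http://" + url  # urlparse only finds the host when a scheme is present
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def is_duplicate(candidate: dict, existing: dict, name_threshold: float = 0.85) -> bool:
    """Same domain counts as a match; otherwise require the same city and a similar name."""
    dom_a = normalize_domain(candidate.get("website", ""))
    dom_b = normalize_domain(existing.get("website", ""))
    if dom_a and dom_a == dom_b:
        return True
    city_a = candidate.get("city", "").strip().lower()
    city_b = existing.get("city", "").strip().lower()
    if not city_a or city_a != city_b:
        return False
    ratio = SequenceMatcher(
        None, normalize_name(candidate["name"]), normalize_name(existing["name"])
    ).ratio()
    return ratio >= name_threshold


existing = {"name": "Hotel Sonnenhof GmbH & Co. KG", "website": "https://www.sonnenhof.example/", "city": "Erding"}
by_domain = is_duplicate({"name": "Sonnenhof", "website": "sonnenhof.example", "city": ""}, existing)
by_name_city = is_duplicate({"name": "Hotel Sonnenhof", "website": "", "city": "erding "}, existing)
other_city = is_duplicate({"name": "Hotel Sonnenhof", "website": "", "city": "München"}, existing)
print(by_domain, by_name_city, other_city)
```

Treating an identical domain as decisive keeps the rule easy to explain in the UI. The name comparison only applies when no website is available.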

3. Handling Shared Code (`helpers.py` & Co.)

The new project is fully encapsulated ("Fork & Clean").

Source: helpers.py (root)
Target: company-explorer/backend/lib/core_utils.py
Action: Only the relevant parts are copied and then extended (e.g. safe_eval_math, run_serp_search).

4. Data Structure (SQLite Schema)

Table `companies` (master data & analysis)

id (PK)
name (String)
website (String)
crm_id (String, nullable - link to D365)
industry_crm (String - the "allowed" industry from Notion)
city (String)
country (String - default: "DE", or taken from the website's legal notice (Impressum))
status (Enum: NEW, IMPORTED, ENRICHED, QUALIFIED)
NEW (v0.7.0):
- calculated_metric_name (String - e.g. "Anzahl Betten")
- calculated_metric_value (Float - e.g. 180)
- calculated_metric_unit (String - e.g. "Betten")
- standardized_metric_value (Float - e.g. 4500)
- standardized_metric_unit (String - e.g. "m²")
- metric_source (String - "website", "wikipedia", "serpapi")
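
For orientation, the field list above could translate into SQLite DDL roughly as follows. This is a sketch only: the project may define the table through an ORM, and the column types, the default status `NEW`, and the CHECK constraint standing in for the enum are assumptions.

```python
import sqlite3

# Hypothetical DDL derived from the field list. SQLite has no native enum
# type, so the allowed status values are enforced with a CHECK constraint.
COMPANIES_DDL = """
CREATE TABLE IF NOT EXISTS companies (
    id INTEGER PRIMARY KEY,
    name TEXT,
    website TEXT,
    crm_id TEXT,
    industry_crm TEXT,
    city TEXT,
    country TEXT DEFAULT 'DE',
    status TEXT DEFAULT 'NEW'
        CHECK (status IN ('NEW', 'IMPORTED', 'ENRICHED', 'QUALIFIED')),
    calculated_metric_name TEXT,
    calculated_metric_value REAL,
    calculated_metric_unit TEXT,
    standardized_metric_value REAL,
    standardized_metric_unit TEXT,
    metric_source TEXT
)
"""

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute(COMPANIES_DDL)
conn.execute(
    "INSERT INTO companies (name, calculated_metric_name, calculated_metric_value, "
    "calculated_metric_unit, metric_source) VALUES (?, ?, ?, ?, ?)",
    ("Example Hotel", "Anzahl Betten", 180, "Betten", "website"),
)
row = conn.execute(
    "SELECT name, country, status, calculated_metric_value FROM companies"
).fetchone()
print(row)
```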

Table `signals` (Deprecated)

Deprecated as of v0.7.0. Replaced by the quantitative metrics in companies.

Table `contacts` (contact persons)

id (PK)
account_id (FK -> companies.id)
gender, title, first_name, last_name, email
job_title (as printed on the business card)
role (standardized role: "Operativer Entscheider", etc.)
status (marketing status)

Table `industries` (industry focus - synced from Notion)

id (PK)
notion_id (String, Unique)
name (String - "Vertical" in Notion)
description (Text - "Definition" in Notion)
metric_type (String - "Metric Type")
min_requirement (Float - "Min. Requirement")
whale_threshold (Float - "Whale Threshold")
proxy_factor (Float - "Proxy Factor")
scraper_search_term (String - "Scraper Search Term")
scraper_keywords (Text - "Scraper Keywords")
standardization_logic (String - "Standardization Logic")

Table `job_role_mappings` (role logic)

id (PK)
pattern (String - regex for job titles)
role (String - target role)
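
A `job_role_mappings` row pairs a regex with a target role, so resolving a contact's role amounts to finding the first pattern that matches the job title. The sketch below assumes that first-match-wins behaviour. The example patterns and the second role name are made up for illustration.

```python
import re
from typing import List, Optional, Tuple

# Example rows as they might appear in job_role_mappings: (pattern, role).
# Patterns and the second role name are invented for this sketch.
MAPPINGS: List[Tuple[str, str]] = [
    (r"facility|haustechnik|objektleit", "Operativer Entscheider"),
    (r"geschäftsführ|ceo|inhaber", "Wirtschaftlicher Entscheider"),
]


def resolve_role(job_title: str, mappings: List[Tuple[str, str]] = MAPPINGS) -> Optional[str]:
    """Return the role of the first pattern that matches the job title, else None."""
    for pattern, role in mappings:
        if re.search(pattern, job_title, flags=re.IGNORECASE):
            return role
    return None


print(resolve_role("Leiter Facility Management"))
print(resolve_role("Geschäftsführer"))
print(resolve_role("Praktikant"))
```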

7. History & Fixes (Jan 2026)

*   **[CRITICAL] v0.7.4: Service Restoration & Logic Fix (Jan 24, 2026)**
    *   **Summary:** Identified and resolved a critical issue where `ClassificationService` contained empty placeholder methods (`pass`), leading to "Others" classification and missing metrics.
    *   **Fixes Implemented:**
        *   **Service Restoration:** Completely re-implemented `classify_company_potential`, `_run_llm_classification_prompt`, and `_run_llm_metric_extraction_prompt` to restore AI functionality using `call_gemini_flash`.
        *   **Standardization Logic:** Connected the `standardization_logic` formula parser (e.g., "Values * 100m²") into the metric extraction cascade. It now correctly computes `standardized_metric_value` (e.g., 352 beds -> 35,200 m²).
        *   **Verification:** Confirmed end-to-end flow from "New Company" -> "Healthcare - Hospital" -> "352 Betten" -> "35.200 m²" via the UI "Play" button.

*   **[STABILITY] v0.7.3: Hardening Metric Parser & Regression Testing (Jan 23, 2026)**
    *   **Summary:** A series of critical fixes were applied to the `MetricParser` to handle complex real-world scenarios, and a regression test suite was created to prevent future issues.
    *   **Specific Bug Fixes:**
        *   **Wolfra Bug ("802020"):** Logic to detect and remove trailing years from concatenated numbers (e.g., "Mitarbeiter: 802020" -> "80").
        *   **Erding Bug ("Year Prefix"):** Logic to ignore year-like prefixes appearing before the actual metric (e.g., "Seit 2022 ... 200.000 Besucher").
        *   **Greilmeier Bug ("Truncation"):** Removed aggressive sentence splitting on hyphens that was truncating text and causing the parser to miss numbers at the end of a phrase.
        *   **Expected Value Cleaning:** The parser now aggressively strips units (like "m²") from the LLM's `expected_value` to ensure it can find the correct numeric target even if the source text contains multiple numbers.
    *   **Regression Test Suite:** Created `/backend/tests/test_metric_parser.py` to lock in these fixes.

*   **[STABILITY] v0.7.2: Robust Metric Parsing (Jan 23, 2026)**
    *   **Legacy Logic Restored:** Re-implemented the robust, regex-based number parsing logic (formerly in legacy helpers) as `MetricParser`.
    *   **German Formats:** Correctly handles "1.000" (thousands) vs "1,5" (decimal) and mixed formats.
    *   **Citation Cleaning:** Filters out Wikipedia citations like `[3]` and years in parentheses (e.g. "80 (2020)" -> 80).
    *   **Hybrid Extraction:** The ClassificationService now asks the LLM for the *text segment* and parses the number deterministically, fixing "LLM Hallucinations" (e.g. "1.005" -> 1).

*   **[STABILITY] v0.7.1: AI Robustness & UI Fixes (Jan 21, 2026)**
    *   **SDK Stability:** Switched to `gemini-2.0-flash` in the legacy SDK to fix `404 Not Found` errors.
    *   **API Key Management:** Robust loading of the key from `/app/gemini_api_key.txt`.
    *   **Classification Prompt:** Sharpened toward "best-fit" decisions (no premature "Others").
    *   **Scraping:** Switched to `BeautifulSoup` after problems with `trafilatura`.

*   **[MAJOR] v0.7.0: Quantitative Potential Analysis (Jan 20, 2026)**
    *   **Two-Stage Analysis:**
        1.  **Strict Classification:** Assigns each company to a Notion industry (or "Others").
        2.  **Metric Cascade:** Searches specifically for the industry-specific metric ("Scraper Search Term").
    *   **Fallback Cascade:** Website -> Wikipedia -> SerpAPI (Google Search).
    *   **Standardization:** Computes comparable values (e.g. m²) from raw data using the `Standardization Logic`.
    *   **Database:** Extended the `companies` table with metric fields.

*   **[UPGRADE] v0.6.x: Notion Integration & UI Improvements**
    *   **Notion SSoT:** Moved industry management (`Industries`) to Notion.
    *   **Sync Automation:** `backend/scripts/sync_notion_industries.py`.
    *   **Contacts Management:** Global contact list, bulk import, marketing status.
    *   **UI Overhaul:** Light/Dark Mode, Grid View, Responsive Design.
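
The number handling described for v0.7.2 (German thousands and decimal separators, citation cleaning, years in parentheses) can be sketched as follows. This is not the project's `MetricParser`. It is a reduced illustration with a hypothetical function name, and it deliberately leaves out the v0.7.3 heuristics for concatenated trailing years and year prefixes.

```python
import re
from typing import Optional

CITATION = re.compile(r"\[\d+\]")                         # Wikipedia footnotes such as [3]
YEAR_IN_PARENS = re.compile(r"\(\s*(?:19|20)\d{2}\s*\)")  # e.g. (2020)
# Either a number with dot-grouped thousands (1.000 or 1.234,56)
# or a plain number with an optional decimal comma (80 or 1,5).
NUMBER = re.compile(r"\d{1,3}(?:\.\d{3})+(?:,\d+)?|\d+(?:,\d+)?")


def parse_german_number(text: str) -> Optional[float]:
    """Return the first number in the text, read with German separators."""
    cleaned = CITATION.sub("", text)
    cleaned = YEAR_IN_PARENS.sub("", cleaned)
    match = NUMBER.search(cleaned)
    if match is None:
        return None
    # Dots are thousands separators, the comma is the decimal separator.
    token = match.group(0).replace(".", "").replace(",", ".")
    return float(token)


for sample in ["1.000", "1,5", "80 (2020)", "352 Betten[3]", "1.005", "keine Angabe"]:
    print(repr(sample), "->", parse_german_number(sample))
```

Parsing the number deterministically from the quoted text segment avoids having the model reinterpret "1.005" as a decimal.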

8. Prompts in Use (Account Analysis v0.7.4)

8.1 Strict Industry Classification

Assigns the company to one of the defined industries.

prompt = f"""
Act as a strict B2B Industry Classifier.
Company: {company_name}
Context: {website_text[:3000]}

Available Industries:
{json.dumps(industry_definitions, indent=2)}

Task: Select the ONE industry that best matches the company.
If the company is a Hospital/Klinik, select 'Healthcare - Hospital'.
If none match well, select 'Others'.

Return ONLY the exact name of the industry.
"""

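Because the classification prompt asks for the exact industry name, the reply still has to be checked against the allowed list before it is stored. The following sketch shows one way to enforce that. The function `resolve_industry` is hypothetical, and the cleanup rules (quotes, backticks, trailing period) are assumptions about typical model output.

```python
from typing import Iterable

FALLBACK_INDUSTRY = "Others"


def resolve_industry(llm_answer: str, allowed: Iterable[str]) -> str:
    """Map the raw LLM answer onto an allowed industry name, else fall back to 'Others'."""
    # Models often wrap the answer in quotes or backticks, or add a trailing period.
    cleaned = llm_answer.strip().strip("\"'`.").strip()
    by_key = {name.casefold(): name for name in allowed}
    return by_key.get(cleaned.casefold(), FALLBACK_INDUSTRY)


allowed = ["Healthcare - Hospital", "Hotellerie", "Automotive"]
print(resolve_industry(" 'healthcare - hospital'.\n", allowed))
print(resolve_industry("Robotics", allowed))
```

Returning the canonical spelling from the allowed list, rather than the model's spelling, keeps the stored value joinable with the `industries` table.
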
8.2 Metric Extraction

Extracts the specific numeric value ("Scraper Search Term") and returns JSON for the MetricParser.

prompt = f"""
Extract the following metric for the company in industry '{industry_name}':
Target Metric: "{search_term}"

Source Text:
{text_content[:6000]}

Return a JSON object with:
- "raw_value": The number found (e.g. 352 or 352.0). If text says "352 Betten", extract 352. If not found, null.
- "raw_unit": The unit found (e.g. "Betten", "m²").
- "proof_text": A short quote from the text proving this value.

JSON ONLY.
"""

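The reply to the metric extraction prompt is expected to be a JSON object with `raw_value`, `raw_unit`, and `proof_text`. The sketch below parses such a reply defensively, assuming the model may add surrounding text or a Markdown fence despite the "JSON ONLY" instruction. The function name `parse_metric_response` is hypothetical.

```python
import json
import re
from typing import Any, Dict, Optional


def parse_metric_response(raw: str) -> Optional[Dict[str, Any]]:
    """Extract the JSON object from an LLM reply; return None if it is unusable."""
    # Take everything from the first "{" to the last "}" and ignore the rest.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    # Missing keys become None so that callers can rely on a fixed shape.
    return {
        "raw_value": data.get("raw_value"),
        "raw_unit": data.get("raw_unit"),
        "proof_text": data.get("proof_text"),
    }


reply = 'Result: {"raw_value": 352, "raw_unit": "Betten", "proof_text": "352 Betten"}'
parsed = parse_metric_response(reply)
print(parsed)
```
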
9. Notion Integration (Single Source of Truth)

The system uses Notion as the central control point for strategic definitions.

9.1 Data Flow

Definition: Industries and robotics categories are maintained in Notion (whale thresholds, keywords, definitions).
Synchronization: The script sync_notion_industries.py pulls the data via the API and upserts it into the local SQLite database.
App usage: The web interface displays this data read-only. The ClassificationService uses it as a "system instruction" for the LLM.

9.2 Technical Details

Notion token: Must be stored in /app/notion_token.txt (container path).
DB mapping: Records are matched primarily by notion_id and secondarily by name, to avoid duplicates during migration.
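
The two-step mapping (notion_id first, name second) can be illustrated with a small upsert helper. This is a sketch, not the code of sync_notion_industries.py. The function `upsert_industry` is hypothetical and the table is reduced to three columns.

```python
import sqlite3


def upsert_industry(conn: sqlite3.Connection, notion_id: str, name: str, description: str) -> int:
    """Update the row matched by notion_id, then by name; insert if neither exists."""
    row = conn.execute("SELECT id FROM industries WHERE notion_id = ?", (notion_id,)).fetchone()
    if row is None:
        # Secondary match: a row created before the first sync may only have a name.
        row = conn.execute("SELECT id FROM industries WHERE name = ?", (name,)).fetchone()
    if row is None:
        cur = conn.execute(
            "INSERT INTO industries (notion_id, name, description) VALUES (?, ?, ?)",
            (notion_id, name, description),
        )
        return cur.lastrowid
    conn.execute(
        "UPDATE industries SET notion_id = ?, name = ?, description = ? WHERE id = ?",
        (notion_id, name, description, row[0]),
    )
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE industries (id INTEGER PRIMARY KEY, notion_id TEXT UNIQUE, name TEXT, description TEXT)"
)
# A legacy row without notion_id, as might exist before the first sync.
conn.execute("INSERT INTO industries (name) VALUES ('Hotellerie')")

first = upsert_industry(conn, "abc123", "Hotellerie", "Hotels and resorts")   # matched by name
second = upsert_industry(conn, "abc123", "Hospitality", "Renamed in Notion")  # matched by notion_id
count = conn.execute("SELECT COUNT(*) FROM industries").fetchone()[0]
print(first, second, count)
```

Matching by name only as a fallback means a later rename in Notion updates the existing row instead of creating a second one.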

10. Database Migration

If the industries table in an existing database needs to be updated (e.g. to support new fields from Notion), the database file must not be deleted. Run the migration script instead.

Process:

Make sure the target database exists: companies_v3_fixed_2.db must be located in the company-explorer directory (or be mounted via a volume).
Run the migration: This command adds the missing columns without deleting any data.
```
docker exec -it company-explorer python3 backend/scripts/migrate_db.py
```
Restart the container: So that the server picks up the new schema.
```
docker-compose restart company-explorer
```

Run the Notion sync: To populate the new columns with data.

docker exec -it company-explorer python3 backend/scripts/sync_notion_industries.py
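
A migration that "adds the missing columns without deleting data" typically compares the current schema with the desired one and issues `ALTER TABLE ... ADD COLUMN` for the difference. The sketch below shows that idea with the standard `sqlite3` module. It is not the content of migrate_db.py, and the helper name `add_missing_columns` is hypothetical.

```python
import sqlite3
from typing import Dict, List


def add_missing_columns(conn: sqlite3.Connection, table: str, wanted: Dict[str, str]) -> List[str]:
    """Add every column from `wanted` that the table lacks; existing rows are untouched."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for column, sql_type in wanted.items():
        if column not in existing:
            # Identifiers cannot be bound as parameters, so pass trusted names only.
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {sql_type}")
            added.append(column)
    conn.commit()
    return added


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE industries (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO industries (name) VALUES ('Hotellerie')")

wanted = {"name": "TEXT", "whale_threshold": "REAL", "proxy_factor": "REAL"}
first_run = add_missing_columns(conn, "industries", wanted)
second_run = add_missing_columns(conn, "industries", wanted)  # idempotent: nothing left to add
print(first_run, second_run)
```

New columns start out as NULL for existing rows, which is why the Notion sync is run afterwards to fill them.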

11. Lessons Learned (Retrospective Jan 24, 2026)

API routing order (FastAPI): A specific endpoint (e.g. /api/companies/export) must be declared before a dynamic endpoint (e.g. /api/companies/{company_id}). Otherwise FastAPI interprets "export" as a company_id, which results in a 422 Unprocessable Entity error (for example when company_id is typed as an integer).
Nginx proxy_pass trailing slash: Whether the proxy_pass URL in Nginx ends with a / is critical. For services such as FastAPI that run with a root_path (e.g. /ce), no trailing slash may be used (proxy_pass http://company-explorer:8000;), so that the root_path is preserved in the request forwarded to the backend.
Docker database persistence: Without an explicit volume mapping for the database file in docker-compose.yml, the container uses an internal, ephemeral copy of the database. Any changes made to the "host" DB outside the container are invisible to the application. A mapping such as ./companies_v3_fixed_2.db:/app/companies_v3_fixed_2.db is mandatory.
Code integrity & placeholders: During file operations it is critical to ensure that no placeholders (such as pass or # omitted) end up in production code. A "zombie" file that looks correct from the outside but is empty inside can cause logic errors that are hard to debug.
Formula robustness: Formulas from external sources must be sanitized before evaluation (removing units and comments) to avoid syntax errors.
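
The formula-robustness lesson can be made concrete with a small sanitizer plus a restricted evaluator. The sketch below is an assumption-laden illustration and not the project's `safe_eval_math`: the function name `apply_formula`, the placeholder word "Values" (taken from the example "Values * 100m²"), and the set of stripped unit characters are all hypothetical choices.

```python
import ast
import operator
import re

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def _eval_node(node: ast.AST) -> float:
    """Evaluate a tiny arithmetic subset: numbers, + - * /, unary minus, parentheses."""
    if isinstance(node, ast.Expression):
        return _eval_node(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return float(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval_node(node.operand)
    raise ValueError("unsupported expression")


def apply_formula(formula: str, raw_value: float) -> float:
    """Sanitize a formula such as 'Values * 100m²' and evaluate it for raw_value."""
    expr = formula.split("#", 1)[0]                      # drop a trailing comment
    expr = re.sub(r"(?i)\bvalues?\b", "@", expr)         # mark the placeholder
    expr = re.sub(r"[A-Za-zÄÖÜäöüß²³]+\d*", "", expr)    # strip units such as m², m2
    expr = re.sub(r"[^0-9.+\-*/()@ ]", "", expr)         # drop any other residue
    expr = expr.replace("@", repr(float(raw_value)))     # insert the raw number
    return _eval_node(ast.parse(expr.strip(), mode="eval"))


print(apply_formula("Values * 100m²", 352))
print(apply_formula("Values * 25 m2  # area per bed", 180))
```

Walking the parsed syntax tree instead of calling `eval` means that anything other than plain arithmetic raises an error rather than being executed.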

12. Deployment & Access Notes

Important note on the deployment setup:

This project runs in a Docker Compose environment, typically on a Synology DiskStation. The individual microservices are accessed through a central Nginx reverse proxy (proxy service) that listens on port 8090 of the host system.

Access URLs for company-explorer:

Internal (within the Docker network): http://company-explorer:8000
External (via proxy): https://floke-ai.duckdns.org/ce/ (or locally http://192.168.x.x:8090/ce/)

Database persistence:

The SQLite database file (companies_v3_fixed_2.db) must be mounted from the host file system into the company-explorer container via a Docker volume mapping (./companies_v3_fixed_2.db:/app/companies_v3_fixed_2.db). This ensures that data changes persist and are not lost when the container is restarted or recreated.
