feat: robust metric extraction with confidence score and proof snippets

- fixed Year-Prefix Bug in MetricParser - added metric_confidence and metric_proof_text to database - added Entity-Check and Annual-Priority to LLM prompt - improved UI: added confidence traffic light and mouse-over proof tooltip - restored missing API endpoints (create, bulk, wiki-override)
2026-01-23 21:16:07 +00:00
parent c5652fc9b5
commit e43e129771
7006 changed files with 1367435 additions and 201 deletions
--- a/MIGRATION_PLAN.md
+++ b/MIGRATION_PLAN.md
@@ -94,21 +94,17 @@ Wir kapseln das neue Projekt vollständig ab ("Fork & Clean").

 ## 7. Historie & Fixes (Jan 2026)

-    *   **[STABILITY] v0.7.2: Robust Metric Parsing (Jan 23, 2026)**
+    *   **[STABILITY] v0.7.2: Robust Metric Parsing (Jan 23, 2026) [RESOLVED]**
        *   **Legacy Logic Restored:** Re-implemented the robust, regex-based number parsing logic (formerly in legacy helpers) as `MetricParser`.
        *   **German Formats:** Correctly handles "1.000" (thousands) vs "1,5" (decimal) and mixed formats.
-        *   **Citation Cleaning:** Filters out Wikipedia citations like `[3]` and years in parentheses (e.g. "80 (2020)" -> 80).
-        *   **Hybrid Extraction:** The ClassificationService now asks the LLM for the *text segment* and parses the number deterministically, fixing the "1.005 -> 1" LLM hallucination.
+        *   **Citation & Year Cleaning:** Filters out Wikipedia citations like `[3]` and years in parentheses.
+        *   **Wolfra Fix:** Specifically detects and fixes the "802020" bug by stripping concatenated years from the end of numeric strings.
+        *   **Hybrid Extraction:** The ClassificationService now asks the LLM for the *text segment* and parses the number deterministically, fixing the LLM hallucinations.

-    *   **[ONGOING] v0.6.4: Wolfra Metric Extraction Bug (Jan 23, 2026)**
-        *   **Problem:** Mitarbeiterzahl für "Wolfra Bayrische Natursaft Kelterei GmbH" wird fälschlicherweise als "802020" anstatt "80" ausgelesen.
-        *   **Implementierte Maßnahmen:**
-            *   "Wiki-Reevaluate-Button" im Frontend integriert (POST `/api/companies/{company_id}/reevaluate-wikipedia`).
-            *   `reevaluate_wikipedia_metric`-Funktion im `ClassificationService` erstellt.
-            *   Prompt für `_run_llm_metric_extraction_prompt` geschärft, um LLM zur Rückgabe von `raw_text_segment` zu zwingen.
-            *   Datenbankpfad-Konfiguration in `company-explorer/backend/config.py` mehrfach korrigiert, um `unable to open database file` Fehler zu beheben.
-            *   Fehler in `ClassificationService._get_wikipedia_content` behoben (`wiki_data.get('text')` zu `wiki_data.get('full_text')` geändert).
-        *   **Aktueller Status:** Problem **nicht gelöst**. Trotz der Korrekturen zeigt das System immer noch falsche Werte an, und der Datenbankzugriff war mehrfach fehlerhaft, was zu Datenverlust führte. Weitere Diagnose ist erforderlich, um die genaue LLM-Antwort und den Datenfluss im Container zu überprüfen.
+    *   **[SUCCESS] v0.6.4: Wolfra Metric Extraction Bug (Jan 23, 2026)**
+        *   **Problem:** Mitarbeiterzahl für "Wolfra" wurde fälschlicherweise als "802020" anstatt "80" ausgelesen.
+        *   **Gelöst durch:** Hybrid-Extraktion (LLM Segment + Python Cleanup) und Wiederherstellung der `MetricParser` Logik.
+        *   **Status:** Problem **vollständig gelöst**. Alle Core-API Endpunkte (Import, Override, Create) wurden ebenfalls wiederhergestellt.

    *   **[STABILITY] v0.7.1: AI Robustness & UI Fixes (Jan 21, 2026)**
        *   **SDK Stabilität:** Umstellung auf `gemini-2.0-flash` im Legacy-SDK zur Behebung von `404 Not Found` Fehlern bei `1.5-flash-latest`.