feat: robust metric extraction with confidence score and proof snippets

- fixed Year-Prefix Bug in MetricParser - added metric_confidence and metric_proof_text to database - added Entity-Check and Annual-Priority to LLM prompt - improved UI: added confidence traffic light and mouse-over proof tooltip - restored missing API endpoints (create, bulk, wiki-override)
2026-01-23 21:16:07 +00:00
parent c5652fc9b5
commit e43e129771
7006 changed files with 1367435 additions and 201 deletions
--- a/GEMINI.md
+++ b/GEMINI.md
@@ -60,7 +60,14 @@ The system architecture has evolved from a CLI-based toolset to a modern web app
    *   **Solution:** Implemented context-aware logic. If a number has a dot followed by exactly 3 digits (and no comma), it is treated as a thousands separator.
    *   **Revenue:** For revenue (`is_umsatz=True`), dots are generally treated as decimals (e.g. "375.6 Mio") unless unambiguous multiple dots exist. Billion/Mrd is converted to 1000 Million.

-2.  **LLM JSON Stability:**
+2.  **Concatenated Year/Citation Bug (The "Wolfra" Fix):**
+    *   **Problem:** Numbers were extracted with appended years or citations, e.g., "80 (2020)" became "802020" or "80[3]" became "803".
+    *   **Solution (Hybrid Extraction):** 
+        1. **LLM Segment Extraction:** The LLM is instructed to return the *raw text segment* (e.g., "80 (2020)") alongside the value.
+        2. **Regex Cleanup (MetricParser):** A Python-based `MetricParser` removes anything in parentheses or brackets and specifically checks for 4-digit years at the end of long digit strings (Bug-Fix for 802020 -> 80).
+        3. **Strict Prompting:** Prompt rules explicitly forbid including years/citations in the `raw_value`.
+
+3.  **LLM JSON Stability:**
    *   **Problem:** LLMs often wrap JSON in Markdown blocks (` ```json `), causing `json.loads()` to fail.
    *   **Solution:** ALWAYS use a `clean_json_response` helper that strips markers before parsing. Never trust raw LLM output.