feat: robust metric extraction with confidence score and proof snippets

- fixed Year-Prefix Bug in MetricParser
- added metric_confidence and metric_proof_text to database
- added Entity-Check and Annual-Priority to LLM prompt
- improved UI: added confidence traffic light and mouse-over proof tooltip
- restored missing API endpoints (create, bulk, wiki-override)
This commit is contained in:
2026-01-23 21:16:07 +00:00
parent c5652fc9b5
commit e43e129771
7006 changed files with 1367435 additions and 201 deletions

View File

@@ -60,7 +60,14 @@ The system architecture has evolved from a CLI-based toolset to a modern web app
* **Solution:** Implemented context-aware logic. If a number has a dot followed by exactly 3 digits (and no comma), it is treated as a thousands separator.
* **Revenue:** For revenue (`is_umsatz=True`), dots are generally treated as decimals (e.g. "375.6 Mio") unless unambiguous multiple dots exist. Billion/Mrd is converted to 1000 Million.
2. **LLM JSON Stability:**
2. **Concatenated Year/Citation Bug (The "Wolfra" Fix):**
* **Problem:** Numbers were extracted with appended years or citations, e.g., "80 (2020)" became "802020" or "80[3]" became "803".
* **Solution (Hybrid Extraction):**
1. **LLM Segment Extraction:** The LLM is instructed to return the *raw text segment* (e.g., "80 (2020)") alongside the value.
2. **Regex Cleanup (MetricParser):** A Python-based `MetricParser` removes anything in parentheses or brackets and specifically checks for 4-digit years at the end of long digit strings (Bug-Fix for 802020 -> 80).
3. **Strict Prompting:** Prompt rules explicitly forbid including years/citations in the `raw_value`.
3. **LLM JSON Stability:**
* **Problem:** LLMs often wrap JSON in Markdown blocks (` ```json `), causing `json.loads()` to fail.
* **Solution:** ALWAYS use a `clean_json_response` helper that strips markers before parsing. Never trust raw LLM output.