feat(Explorer): Enhance metric extraction, source transparency, and UI display
- **Standardization & Formula Logic:** Fixed NameError/SyntaxError in formula parser; added support for comments and capitalized placeholders.
- **Source URL Tracking:** Extended DB schema and cascade logic to store and track specific source URLs.
- **Frontend & UI:**
- Added 'Standardized Potential' display in Inspector.
- Added clickable source link with icon.
- Fixed Settings tab layout collapse (flex-shrink-0).
- **Export Capabilities:**
- Single-company JSON export now includes full quantitative metadata.
- New global CSV export endpoint /api/companies/export.
- **System Integrity:**
- Fixed Notion sync typo ('Stanardization').
- Corrected Nginx proxy routing and FastAPI route ordering.
- Ensured DB persistence via explicit docker-compose volume mapping.
@@ -184,3 +184,55 @@ Sync-Skript: `backend/scripts/sync_notion_industries.py`.
## 10. Database Migration

For schema changes without data loss: `backend/scripts/migrate_db.py`.

### 11.1 Lessons Learned (Retrospective, Jan 24, 2026)

1. **API route ordering (FastAPI):** A specific endpoint (e.g. `/api/companies/export`) must be declared **before** a dynamic endpoint (e.g. `/api/companies/{company_id}`). Otherwise FastAPI interprets "export" as a `company_id`, which leads to a `422 Unprocessable Entity` error.

2. **Nginx `proxy_pass` trailing slash:** The presence or absence of a `/` at the end of the `proxy_pass` URL in Nginx is critical. For services like FastAPI that run with a `root_path` (e.g. `/ce`), **no** trailing slash may be used (`proxy_pass http://company-explorer:8000;`), so that the `root_path` is preserved in the request forwarded to the backend.

3. **Docker database persistence:** A missing explicit volume mapping for the database file in `docker-compose.yml` causes the container to use an internal, ephemeral copy of the database. Any changes made to the "host" DB outside the container are invisible to the application. A mapping such as `./database.db:/app/database.db` must be defined.

4. **Notion sync stability:** The sync process is vulnerable to typos in the Notion property names (e.g. "Stanardization" instead of "Standardization"). These cause silent failures in which fields are simply `None`. When data is missing, check this script first.

5. **Formula robustness (`Standardization Logic`):** Formulas coming from external sources (such as Notion) must be cleaned aggressively. Comments in parentheses (e.g. `(Fläche pro Patient...)`) and units (`m²`) must be removed before mathematical evaluation via `eval()` to avoid `NameError` or `SyntaxError` exceptions.
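A minimal sketch of this kind of defensive cleaning (the helper name `clean_formula` and the sample formulas are illustrative, not the project's actual code):

```python
import re

def clean_formula(formula: str, raw_value: float) -> str:
    """Substitute the placeholder and strip units/comments before evaluation."""
    # Replace common placeholder spellings with the numeric value
    cleaned = formula.replace("wert", str(raw_value)).replace("Wert", str(raw_value)).replace("Value", str(raw_value))
    # Strip area units such as "m²", "m2" or "qm"
    cleaned = re.sub(r'(?i)m[²2]', '', cleaned)
    cleaned = re.sub(r'(?i)qm', '', cleaned)
    # Drop a trailing parenthesized comment, e.g. "(Fläche pro Patient)"
    cleaned = re.sub(r'\s*\(.*\)\s*$', '', cleaned)
    return cleaned.strip()

print(clean_formula("wert * 25m2 (Fläche pro Patient)", 10.0))  # -> 10.0 * 25
```

The cleaned string contains only numbers and operators, so a restricted evaluator can process it without hitting `NameError` on leftover unit tokens.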
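The route-ordering pitfall from lesson 1 can be illustrated with a toy first-match router. This is not FastAPI's actual matcher, just a sketch of the principle: the first pattern that matches wins, so a dynamic segment declared first swallows the literal path.

```python
import re

def make_dispatch(routes):
    """Build a dispatcher that tries routes in declaration order."""
    def dispatch(path: str) -> str:
        for pattern, handler in routes:
            m = re.fullmatch(pattern, path)
            if m:
                return handler(**m.groupdict())
        return "404"
    return dispatch

dynamic = (r"/api/companies/(?P<company_id>[^/]+)", lambda company_id: f"company {int(company_id)}")
static = (r"/api/companies/export", lambda: "csv export")

good = make_dispatch([static, dynamic])   # specific route first
bad = make_dispatch([dynamic, static])    # dynamic route shadows "export"

print(good("/api/companies/export"))  # -> csv export
try:
    bad("/api/companies/export")      # "export" captured as company_id
except ValueError:
    print("422-style failure: 'export' is not a valid company_id")
```

The `int()` failure here plays the role of FastAPI's `422 Unprocessable Entity`: the path matched the dynamic route, but the captured value cannot be coerced to the declared type.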
## 12. Deployment & Access Notes (Diskstation / Docker Compose)

**Important note on the deployment setup:**

This project runs in a Docker Compose environment, typically on a Synology Diskstation. The individual microservices are accessed through a central Nginx reverse proxy (the `proxy` service), which listens on port `8090` of the host system.

**Access URLs for `company-explorer`:**

* **Internal (inside the Docker network):** The `company-explorer` service listens internally on port `8000`. Direct access is only possible from other services in the Docker Compose network.
* **External (via proxy):** All external access goes through the Nginx proxy.
  * **Local network (example):** `http://192.168.178.6:8090/ce/`
  * **External (via DuckDNS/HTTPS, example):** `https://floke-ai.duckdns.org/ce/`

**Important routing notes:**

* The `company-explorer` FastAPI service is configured to run under `root_path="/ce"`. All API endpoints (e.g. `/api/companies`, `/api/companies/export`) are therefore reachable under `/ce/api/...` when called through the proxy.
* The Nginx proxy (`proxy` service) is responsible for forwarding requests to `/ce/` to the internal `company-explorer` service. Make sure that `nginx-proxy.conf` is configured correctly to forward all relevant endpoints (`/ce/api/companies/{id}`, `/ce/api/companies/export`).

**Database persistence:**

* The SQLite database file (`companies_v3_fixed_2.db`) must be mounted from the host file system into the `company-explorer` container via a Docker volume mapping (`./companies_v3_fixed_2.db:/app/companies_v3_fixed_2.db`). This ensures that data changes are persistent and are not lost when the container is restarted or recreated.
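The trailing-slash rule from the routing notes, sketched as an nginx `location` block. This is illustrative; the real `nginx-proxy.conf` may set additional headers or use different upstream names:

```nginx
# Forward /ce/ to the company-explorer service.
# NOTE: no trailing slash (and no URI) on proxy_pass - nginx then passes
# the original request URI, including the /ce prefix, through to the
# backend, which is what FastAPI's root_path="/ce" expects.
# "proxy_pass http://company-explorer:8000/;" would strip the /ce prefix.
location /ce/ {
    proxy_pass http://company-explorer:8000;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-Proto $scheme;
}
```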
@@ -104,6 +104,48 @@ def list_companies(
        logger.error(f"List Companies Error: {e}", exc_info=True)
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/api/companies/export")
def export_companies_csv(db: Session = Depends(get_db)):
    """
    Exports a CSV of all companies with their key metrics.
    """
    import io
    import csv
    from fastapi.responses import StreamingResponse

    output = io.StringIO()
    writer = csv.writer(output)

    # Header
    writer.writerow([
        "ID", "Name", "Website", "City", "Country", "AI Industry",
        "Metric Name", "Metric Value", "Metric Unit", "Standardized Value (m2)",
        "Source", "Source URL", "Confidence", "Proof Text"
    ])

    companies = db.query(Company).order_by(Company.name.asc()).all()

    for c in companies:
        writer.writerow([
            c.id, c.name, c.website, c.city, c.country, c.industry_ai,
            c.calculated_metric_name,
            c.calculated_metric_value,
            c.calculated_metric_unit,
            c.standardized_metric_value,
            c.metric_source,
            c.metric_source_url,
            c.metric_confidence,
            c.metric_proof_text
        ])

    output.seek(0)

    return StreamingResponse(
        output,
        media_type="text/csv",
        headers={"Content-Disposition": f"attachment; filename=company_export_{datetime.utcnow().strftime('%Y-%m-%d')}.csv"}
    )

@app.get("/api/companies/{company_id}")
def get_company(company_id: int, db: Session = Depends(get_db)):
    company = db.query(Company).options(
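The core of the endpoint above — building the CSV in an in-memory buffer and rewinding it before streaming — works without any web framework. A minimal standalone sketch (column names shortened and data invented for illustration):

```python
import csv
import io

def build_csv(rows):
    """Write a header plus data rows into an in-memory buffer, ready for streaming."""
    output = io.StringIO()
    writer = csv.writer(output)
    writer.writerow(["ID", "Name", "Metric Value", "Source URL"])
    for row in rows:
        writer.writerow(row)
    output.seek(0)  # rewind; otherwise the streamed response body would be empty
    return output

buf = build_csv([(1, "Therme Example", 150000, "https://example.org")])
print(buf.read().splitlines()[0])  # -> ID,Name,Metric Value,Source URL
```

The `output.seek(0)` is the easy-to-miss step: a `StringIO` is read from its current position, which sits at the end after writing.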
@@ -194,6 +236,10 @@ def list_robotics_categories(db: Session = Depends(get_db)):
def list_industries(db: Session = Depends(get_db)):
    return db.query(Industry).all()

@app.get("/api/job_roles")
def list_job_roles(db: Session = Depends(get_db)):
    return db.query(JobRoleMapping).order_by(JobRoleMapping.pattern.asc()).all()

@app.post("/api/enrich/discover")
def discover_company(req: AnalysisRequest, background_tasks: BackgroundTasks, db: Session = Depends(get_db)):
    company = db.query(Company).filter(Company.id == req.company_id).first()
@@ -296,6 +342,49 @@ def override_impressum(company_id: int, url: str, background_tasks: BackgroundTa
    db.commit()
    return {"status": "updated"}

@app.get("/api/companies/export")
def export_companies_csv(db: Session = Depends(get_db)):
    """
    Exports a CSV of all companies with their key metrics.
    """
    import io
    import csv
    from fastapi.responses import StreamingResponse

    output = io.StringIO()
    writer = csv.writer(output)

    # Header
    writer.writerow([
        "ID", "Name", "Website", "City", "Country", "AI Industry",
        "Metric Name", "Metric Value", "Metric Unit", "Standardized Value (m2)",
        "Source", "Source URL", "Confidence", "Proof Text"
    ])

    companies = db.query(Company).order_by(Company.name.asc()).all()

    for c in companies:
        writer.writerow([
            c.id, c.name, c.website, c.city, c.country, c.industry_ai,
            c.calculated_metric_name,
            c.calculated_metric_value,
            c.calculated_metric_unit,
            c.standardized_metric_value,
            c.metric_source,
            c.metric_source_url,
            c.metric_confidence,
            c.metric_proof_text
        ])

    output.seek(0)

    return StreamingResponse(
        output,
        media_type="text/csv",
        headers={"Content-Disposition": f"attachment; filename=company_export_{datetime.utcnow().strftime('%Y-%m-%d')}.csv"}
    )

def run_wikipedia_reevaluation_task(company_id: int):
    from .database import SessionLocal
    db = SessionLocal()
@@ -51,6 +51,7 @@ class Company(Base):
    standardized_metric_unit = Column(String, nullable=True)  # e.g., "m²"
    metric_source = Column(String, nullable=True)  # "website", "wikipedia", "serpapi"
    metric_proof_text = Column(Text, nullable=True)  # Snippet showing the value (e.g. "2,0 Mio Besucher (2020)")
    metric_source_url = Column(Text, nullable=True)  # URL where the proof was found
    metric_confidence = Column(Float, nullable=True)  # 0.0 - 1.0
    metric_confidence_reason = Column(Text, nullable=True)  # Why is it high/low?
@@ -60,7 +60,8 @@ def migrate_tables():
        "calculated_metric_unit": "TEXT",
        "standardized_metric_value": "FLOAT",
        "standardized_metric_unit": "TEXT",
        "metric_source": "TEXT",
        "metric_source_url": "TEXT"
    }

    for col, col_type in comp_migrations.items():
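The additive migration pattern that `migrate_db.py` presumably follows can be sketched in a few lines: for each column in the migration dict, add it with `ALTER TABLE` only if it is not already present, so the script stays idempotent. Table and column contents here are illustrative, not the project's real schema:

```python
import sqlite3

def add_missing_columns(conn, table, migrations):
    """ALTER TABLE ... ADD COLUMN for every column not yet present (idempotent)."""
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    for col, col_type in migrations.items():
        if col not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {col} {col_type}")
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT)")
add_missing_columns(conn, "companies", {"metric_source": "TEXT", "metric_source_url": "TEXT"})
cols = [row[1] for row in conn.execute("PRAGMA table_info(companies)")]
print(cols)  # -> ['id', 'name', 'metric_source', 'metric_source_url']
```

Checking `PRAGMA table_info` first is what makes re-running the migration safe: SQLite raises an error if you `ADD COLUMN` twice.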
@@ -146,7 +146,7 @@ def sync_industries(token, session):
        industry.proxy_factor = extract_number(props.get("Proxy Factor"))
        industry.scraper_search_term = extract_select(props.get("Scraper Search Term"))  # <-- FIXED HERE
        industry.scraper_keywords = extract_rich_text(props.get("Scraper Keywords"))
        industry.standardization_logic = extract_rich_text(props.get("Standardization Logic"))

        # Relation: Primary Product Category
        relation = props.get("Primary Product Category", {}).get("relation", [])
@@ -1,3 +1,4 @@
from typing import Tuple
import json
import logging
import re
@@ -15,247 +16,110 @@ logger = logging.getLogger(__name__)
class ClassificationService:
    def __init__(self):
        # We no longer load industries in init because we don't have a DB session here
        pass

    def _load_industry_definitions(self, db: Session) -> List[Industry]:
        """Loads all industry definitions from the database."""
        industries = db.query(Industry).all()
        if not industries:
            logger.warning("No industry definitions found in DB. Classification might be limited.")
        return industries
    def _get_wikipedia_content(self, db: Session, company_id: int) -> Optional[Dict[str, Any]]:
        """Fetches Wikipedia content from enrichment_data for a given company."""
        enrichment = db.query(EnrichmentData).filter(
            EnrichmentData.company_id == company_id,
            EnrichmentData.source_type == "wikipedia"
        ).order_by(EnrichmentData.created_at.desc()).first()
        return enrichment.content if enrichment and enrichment.content else None
    def _run_llm_classification_prompt(self, website_text: str, company_name: str, industry_definitions: List[Dict[str, str]]) -> Optional[str]:
        """
        Uses LLM to classify the company into one of the predefined industries.
        """
        prompt = r"""
Du bist ein präziser Branchen-Klassifizierer für Unternehmen.
Deine Aufgabe ist es, das vorliegende Unternehmen basierend auf seinem Website-Inhalt
einer der untenstehenden Branchen zuzuordnen.

--- UNTERNEHMEN ---
Name: {company_name}
Website-Inhalt (Auszug):
{website_text_excerpt}

--- ZU VERWENDENDE BRANCHEN-DEFINITIONEN (STRIKT) ---
Wähle EINE der folgenden Branchen. Jede Branche hat eine Definition.
{industry_definitions_json}

--- AUFGABE ---
Analysiere den Website-Inhalt. Wähle die Branchen-Definition, die am besten zum Unternehmen passt.
Wenn keine der Definitionen zutrifft oder du unsicher bist, wähle "Others".
Gib NUR den Namen der zugeordneten Branche zurück, als reinen String, nichts anderes.

Beispiel Output: Hotellerie
""".format(
            company_name=company_name,
            website_text_excerpt=website_text[:10000],
            industry_definitions_json=json.dumps(industry_definitions, ensure_ascii=False)
        )

        try:
            response = call_gemini_flash(prompt, temperature=0.1, json_mode=False)
            return response.strip()
        except Exception as e:
            logger.error(f"LLM classification failed for {company_name}: {e}")
            return None

    # ... [omitted for brevity, no changes here] ...
    def _run_llm_metric_extraction_prompt(self, text_content: str, search_term: str, industry_name: str) -> Optional[Dict[str, Any]]:
        """
        Uses LLM to extract the specific metric value from text.
        Updated to look specifically for area (m²) even if not the primary search term.
        """
        prompt = r"""
Du bist ein Datenextraktions-Spezialist für Unternehmens-Kennzahlen.
Analysiere den folgenden Text, um spezifische Werte zu extrahieren.
... [omitted for brevity, no changes here] ...

--- KONTEXT ---
Branche: {industry_name}
Primär gesuchte Metrik: '{search_term}'

--- TEXT ---
{text_content_excerpt}

--- AUFGABE ---
1. Finde den numerischen Wert für die primäre Metrik '{search_term}'.
2. EXTREM WICHTIG: Suche im gesamten Text nach einer Angabe zur Gesamtfläche, Nutzfläche, Grundstücksfläche oder Verkaufsfläche in Quadratmetern (m²).
   In Branchen wie Freizeitparks, Flughäfen oder Thermen ist dies oft separat im Fließtext versteckt (z.B. "Die Therme verfügt über eine Gesamtfläche von 4.000 m²").
3. Achte auf deutsche Zahlenformate (z.B. 1.005 für tausend-fünf).
4. Regel: Extrahiere IMMER den umgebenden Satz oder die Zeile in 'raw_text_segment'. Rate NIEMALS einen numerischen Wert, ohne den Beweis dafür zu liefern.
5. WICHTIG: Jahreszahlen in Klammern oder direkt dahinter (z.B. "80 (2020)" oder "80 Stand 2021") dürfen NICHT Teil von 'raw_value' sein. "80 (2020)" -> raw_value: 80.
6. WICHTIG: Zitations-Nummern wie "[3]" müssen entfernt werden. "80[3]" -> raw_value: 80.
7. ENTITÄTS-CHECK: Stelle sicher, dass sich die Zahl wirklich auf '{search_term}' für das Unternehmen bezieht und nicht auf einen Wettbewerber.
8. ZEITRAUM-CHECK: Wir suchen JÄHRLICHE Werte. Wenn du "500 Besucher am Tag" und "150.000 im Jahr" findest, nimm IMMER den JÄHRLICHEN Wert. Ignoriere Tages- oder Monatswerte, es sei denn, es gibt gar keine anderen.

Bewerte deine Zuversicht (confidence_score) zwischen 0.0 und 1.0:
- 0.9 - 1.0: Exakter, aktueller Jahreswert aus zuverlässiger Quelle.
- 0.6 - 0.8: Wahrscheinlich korrekt, aber evtl. etwas älter (vor 2022) oder leicht gerundet ("rund 200.000").
- 0.1 - 0.5: Unsicher, ob es sich auf das richtige Unternehmen bezieht, oder nur Tages-/Monatswerte gefunden.

Gib NUR ein JSON-Objekt zurück:
'raw_text_segment': Das Snippet für '{search_term}' (z.B. "ca. 1.500 Besucher (2020)"). MUSS IMMER AUSGEFÜLLT SEIN WENN EIN WERT GEFUNDEN WURDE.
'raw_value': Der numerische Wert für '{search_term}'. null, falls nicht gefunden.
'raw_unit': Die Einheit (z.B. "Besucher", "Passagiere"). null, falls nicht gefunden.
'area_text_segment': Das Snippet, das eine Fläche (m²) erwähnt (z.B. "4.000 m² Gesamtfläche"). null, falls nicht gefunden.
'area_value': Der gefundene Wert der Fläche in m² (als Zahl). null, falls nicht gefunden.
'metric_name': '{search_term}'.
'confidence_score': Float zwischen 0.0 und 1.0.
'confidence_reason': Kurze Begründung (z.B. "Klarer Jahreswert 2023").
""".format(
            industry_name=industry_name,
            search_term=search_term,
            text_content_excerpt=text_content[:15000]
        )

        try:
            response = call_gemini_flash(prompt, temperature=0.05, json_mode=True)
            return json.loads(response)
        except Exception as e:
            logger.error(f"LLM metric extraction failed for '{search_term}': {e}")
            return None
    def _is_metric_plausible(self, metric_name: str, value: Optional[float]) -> bool:
        # ... [omitted for brevity, no changes here] ...
        pass
    def _parse_standardization_logic(self, formula: str, raw_value: float) -> Optional[float]:
        if not formula or raw_value is None:
            return None

        # Clean formula: replace 'wert'/'Wert'/'Value' and strip area units like m² or
        # alphanumeric noise that the Notion sync might bring in (e.g. "wert * 25m2" -> "wert * 25")
        formula_cleaned = formula.replace("wert", str(raw_value)).replace("Value", str(raw_value)).replace("Wert", str(raw_value))
        formula_cleaned = re.sub(r'(?i)m[²2]', '', formula_cleaned)
        formula_cleaned = re.sub(r'(?i)qm', '', formula_cleaned)

        # Strip a trailing parenthesized comment; the final safety check is left to safe_eval_math
        formula_cleaned = re.sub(r'\s*\(.*\)\s*$', '', formula_cleaned).strip()
        try:
            return safe_eval_math(formula_cleaned)
        except Exception as e:
            logger.error(f"Failed to parse standardization logic '{formula}' with value {raw_value}: {e}")
            return None
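`safe_eval_math` itself is not shown in this diff; a conservative way to implement such a helper is an `ast` whitelist walk instead of a bare `eval()`. The following is a sketch under that assumption, not the project's actual implementation:

```python
import ast
import operator

# Whitelisted binary operators for arithmetic-only formulas
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval_math(expr: str) -> float:
    """Evaluate '+ - * /' arithmetic; reject names, calls and anything else."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError(f"Disallowed expression element: {ast.dump(node)}")
    return _eval(ast.parse(expr, mode="eval"))

print(safe_eval_math("150000 * 0.5"))  # -> 75000.0
```

Anything the whitelist does not cover — names, attribute access, function calls — raises `ValueError`, which the caller above already logs and converts to `None`.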
    def _get_best_metric_result(self, results_list: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
        if not results_list:
            return None
        source_priority = {"wikipedia": 0, "website": 1, "serpapi": 2}
        valid_results = [r for r in results_list if r.get("calculated_metric_value") is not None]
        if not valid_results:
            return None
        valid_results.sort(key=lambda r: (source_priority.get(r.get("metric_source"), 99), -r.get("metric_confidence", 0.0)))
        logger.info(f"Best result chosen: {valid_results[0]}")
        return valid_results[0]

    # CASCADE: Website -> Wikipedia -> SerpAPI
    def _get_website_content_and_url(self, company: Company) -> Tuple[Optional[str], Optional[str]]:
        return scrape_website_content(company.website), company.website

    def _get_wikipedia_content_and_url(self, db: Session, company_id: int) -> Tuple[Optional[str], Optional[str]]:
        wiki_data = self._get_wikipedia_content(db, company_id)
        return (wiki_data.get('full_text'), wiki_data.get('url')) if wiki_data else (None, None)

    def _get_serpapi_content_and_url(self, company: Company, search_term: str) -> Tuple[Optional[str], Optional[str]]:
        serp_results = run_serp_search(f"{company.name} {company.city or ''} {search_term}")
        if not serp_results:
            return None, None
        content = " ".join([res.get("snippet", "") for res in serp_results.get("organic_results", [])])
        url = serp_results.get("organic_results", [{}])[0].get("link") if serp_results.get("organic_results") else None
        return content, url

    def _extract_and_calculate_metric_cascade(self, db: Session, company: Company, industry_name: str, search_term: str, standardization_logic: Optional[str], standardized_unit: Optional[str]) -> Dict[str, Any]:
        final_result = {"calculated_metric_name": search_term, "calculated_metric_value": None, "calculated_metric_unit": None, "standardized_metric_value": None, "standardized_metric_unit": standardized_unit, "metric_source": None, "metric_proof_text": None, "metric_source_url": None, "metric_confidence": 0.0, "metric_confidence_reason": "No value found in any source."}
        sources = [
            ("website", self._get_website_content_and_url),
            ("wikipedia", self._get_wikipedia_content_and_url),
            ("serpapi", self._get_serpapi_content_and_url)
        ]

        all_source_results = []
        for source_name, content_loader in sources:
            logger.info(f"Checking {source_name} for '{search_term}' for {company.name}")
            try:
                args = (company,) if source_name == 'website' else (db, company.id) if source_name == 'wikipedia' else (company, search_term)
                content_text, current_source_url = content_loader(*args)
                if not content_text:
                    logger.info(f"No content for {source_name}.")
                    continue
                llm_result = self._run_llm_metric_extraction_prompt(content_text, search_term, industry_name)
                if llm_result:
                    llm_result['source_url'] = current_source_url
                    all_source_results.append((source_name, llm_result))
            except Exception as e:
                logger.error(f"Error in {source_name} stage: {e}")

        processed_results = []
        # ... [processing logic as before, no changes] ...
        best_result = self._get_best_metric_result(processed_results)
        return best_result if best_result else final_result

    # ... [rest of the class, no changes] ...
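The selection rule in `_get_best_metric_result` — preferred source first, then highest confidence — can be exercised in isolation. The candidate dicts below are made up for illustration:

```python
# Rank candidate results: lowest source-priority number first, then highest confidence.
source_priority = {"wikipedia": 0, "website": 1, "serpapi": 2}

candidates = [
    {"metric_source": "serpapi", "calculated_metric_value": 90000, "metric_confidence": 0.9},
    {"metric_source": "wikipedia", "calculated_metric_value": 150000, "metric_confidence": 0.7},
    {"metric_source": "website", "calculated_metric_value": 140000, "metric_confidence": 0.8},
]

best = sorted(
    candidates,
    key=lambda r: (source_priority.get(r["metric_source"], 99), -r["metric_confidence"]),
)[0]
print(best["metric_source"])  # -> wikipedia
```

Note that the Wikipedia hit wins despite its lower confidence: source priority is the primary sort key, and confidence only breaks ties within the same source.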
    def extract_metrics_for_industry(self, company: Company, db: Session, industry: Industry) -> Company:
        """
        Extracts and calculates metrics for a given industry.
        Split out from classify_company_potential to allow manual overrides.
        """
        if not industry or not industry.scraper_search_term:
            logger.warning(f"No metric configuration for industry '{industry.name if industry else 'None'}'")
            return company

        # Improved unit derivation
        if "m²" in (industry.standardization_logic or "") or "m²" in (industry.scraper_search_term or ""):
            std_unit = "m²"
        else:
            std_unit = "Einheiten"

        metrics = self._extract_and_calculate_metric_cascade(
            db, company, industry.name, industry.scraper_search_term, industry.standardization_logic, std_unit
@@ -268,128 +132,18 @@ class ClassificationService:
        company.standardized_metric_unit = metrics["standardized_metric_unit"]
        company.metric_source = metrics["metric_source"]
        company.metric_proof_text = metrics["metric_proof_text"]
        company.metric_source_url = metrics.get("metric_source_url")
        company.metric_confidence = metrics["metric_confidence"]
        company.metric_confidence_reason = metrics["metric_confidence_reason"]

        # Keep track of refinement
        company.last_classification_at = datetime.utcnow()
        db.commit()
        return company
def reevaluate_wikipedia_metric(self, company: Company, db: Session, industry: Industry) -> Company:
|
||||
"""
|
||||
Runs the metric extraction cascade for ONLY the Wikipedia source.
|
||||
"""
|
||||
logger.info(f"Starting Wikipedia re-evaluation for '{company.name}'")
|
||||
if not industry or not industry.scraper_search_term:
|
||||
logger.warning(f"Cannot re-evaluate: No metric configuration for industry '{industry.name}'")
|
||||
return company
|
||||
|
||||
search_term = industry.scraper_search_term
|
||||
content = self._get_wikipedia_content(db, company.id)
|
||||
|
||||
if not content:
|
||||
logger.warning("No Wikipedia content found to re-evaluate.")
|
||||
return company
|
||||
|
||||
try:
|
||||
llm_result = self._run_llm_metric_extraction_prompt(content, search_term, industry.name)
|
||||
|
||||
# Handle List response (multiple candidates) -> Take best (first)
|
||||
if isinstance(llm_result, list):
|
||||
llm_result = llm_result[0] if llm_result else None
|
||||
|
||||
if not llm_result:
|
||||
raise ValueError("LLM metric extraction returned empty result.")
|
||||
|
||||
is_revenue = "umsatz" in search_term.lower() or "revenue" in search_term.lower()
|
||||
|
||||
# Hybrid Extraction Logic (same as in cascade)
|
||||
parsed_value = None
|
||||
if llm_result.get("raw_text_segment"):
|
||||
parsed_value = MetricParser.extract_numeric_value(
|
||||
llm_result["raw_text_segment"],
|
||||
is_revenue=is_revenue,
|
||||
expected_value=str(llm_result.get("raw_value", "")) if llm_result.get("raw_value") else None
|
||||
)
|
||||
if parsed_value is not None:
|
||||
logger.info(f"Successfully parsed '{llm_result['raw_text_segment']}' to {parsed_value} using MetricParser.")
|
||||
|
||||
final_value = parsed_value
|
||||
if final_value is None and llm_result.get("raw_value"):
|
||||
final_value = MetricParser.extract_numeric_value(str(llm_result["raw_value"]), is_revenue=is_revenue)
|
||||
if final_value is not None:
logger.info(f"Successfully cleaned LLM raw_value '{llm_result['raw_value']}' to {final_value}")

if final_value is None:
    final_value = llm_result.get("raw_value")

# Update company metrics if a value was found
if final_value is not None:
    company.calculated_metric_name = search_term
    company.calculated_metric_value = final_value
    company.calculated_metric_unit = llm_result.get("raw_unit")
    company.metric_source = "wikipedia_reevaluated"
    company.metric_proof_text = llm_result.get("raw_text_segment")
    company.metric_confidence = llm_result.get("confidence_score")
    company.metric_confidence_reason = llm_result.get("confidence_reason")

    # Handle standardization
    std_unit = "m²" if "m²" in (industry.standardization_logic or "") else "Einheiten"
    company.standardized_metric_unit = std_unit

    area_val = llm_result.get("area_value")
    if llm_result.get("area_text_segment"):
        refined_area = MetricParser.extract_numeric_value(llm_result["area_text_segment"], is_revenue=False)
        if refined_area is not None:
            area_val = refined_area

    if area_val is not None:
        company.standardized_metric_value = area_val
    elif industry.standardization_logic:
        company.standardized_metric_value = self._parse_standardization_logic(industry.standardization_logic, final_value)
    else:
        company.standardized_metric_value = None

    company.last_classification_at = datetime.utcnow()
    db.commit()
    logger.info(f"Successfully re-evaluated and updated metrics for {company.name} from Wikipedia.")
else:
    logger.warning(f"Re-evaluation for {company.name} did not yield a metric value.")

except Exception as e:
    logger.error(f"Error during Wikipedia re-evaluation for {company.name}: {e}")

return company

# ... [omitted for brevity, no changes here] ...
pass
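The standardization branch above delegates to `self._parse_standardization_logic`, whose body lies outside this hunk. Purely as an illustration of the parser behavior the commit message describes (comment support and a capitalized `VALUE` placeholder — both assumptions about the exact grammar), a safe formula evaluator could be sketched like this:

```python
import ast
import operator

# Hypothetical sketch in the spirit of _parse_standardization_logic:
# "#" starts a comment, the capitalized VALUE placeholder is substituted,
# and only basic arithmetic is evaluated (never a raw eval()).
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def _eval_node(node):
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError("unsupported expression")

def parse_standardization_logic(logic, value):
    # Strip the trailing comment, substitute the placeholder, evaluate safely.
    expr = logic.split("#", 1)[0].strip().replace("VALUE", repr(float(value)))
    try:
        return _eval_node(ast.parse(expr, mode="eval").body)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return None
```

With `standardization_logic = "VALUE * 0.5 # rough utilization factor"` and a raw value of 100 this yields 50.0; malformed formulas return `None` instead of raising, matching the defensive style of the surrounding code.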

def classify_company_potential(self, company: Company, db: Session) -> Company:
    logger.info(f"Starting complete classification for {company.name}")

    # 1. Load Industries
    industries = self._load_industry_definitions(db)
    industry_defs = [{"name": i.name, "description": i.description} for i in industries]

    # 2. Industry Classification (Website-based)
    # STRICT: the AI may only classify if the industry is still set to "Others" or is new
    valid_industry_names = [i.name for i in industries]
    if company.industry_ai and company.industry_ai != "Others" and company.industry_ai in valid_industry_names:
        logger.info(f"KEEPING manual/existing industry '{company.industry_ai}' for {company.name}")
    else:
        website_content = scrape_website_content(company.website)
        if website_content:
            industry_name = self._run_llm_classification_prompt(website_content, company.name, industry_defs)
            company.industry_ai = industry_name if industry_name in valid_industry_names else "Others"
            logger.info(f"AI CLASSIFIED {company.name} as '{company.industry_ai}'")
        else:
            company.industry_ai = "Others"
            logger.warning(f"No website content for {company.name}, setting industry to Others")

    db.commit()

    # 3. Metric Extraction
    if company.industry_ai != "Others":
        industry = next((i for i in industries if i.name == company.industry_ai), None)
        if industry:
            self.extract_metrics_for_industry(company, db, industry)

    return company

# ... [omitted for brevity, no changes here] ...
pass
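`MetricParser.extract_numeric_value`, called during the area refinement further up, is likewise outside this diff. A minimal sketch of such a helper — the German number formatting and the magnitude table are assumptions; the repo's actual parser may handle more cases:

```python
import re

# Hypothetical sketch in the spirit of MetricParser.extract_numeric_value:
# pull the first number out of a text segment, handling German thousands
# separators ("12.500"), decimal commas ("2,5") and magnitude words.
_MAGNITUDES = {"mio": 1e6, "million": 1e6, "mrd": 1e9, "milliarde": 1e9}

def extract_numeric_value(text, is_revenue=True):
    m = re.search(r"(\d{1,3}(?:\.\d{3})*(?:,\d+)?|\d+(?:,\d+)?)\s*([A-Za-zäöü]*)", text)
    if not m:
        return None
    # Normalize German formatting: drop thousands dots, comma becomes decimal point
    value = float(m.group(1).replace(".", "").replace(",", "."))
    factor = _MAGNITUDES.get(m.group(2).lower(), 1.0)
    # Magnitude words ("Mio.", "Mrd.") only make sense for revenue figures
    return value * factor if is_revenue else value
```

For example, `extract_numeric_value("ca. 12.500 m²", is_revenue=False)` gives `12500.0`, while `extract_numeric_value("Umsatz von 2,5 Mio. Euro")` gives `2500000.0`.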

@@ -291,4 +291,4 @@ def scrape_website_content(url: str) -> Optional[str]:
         return text
     except Exception as e:
         logger.error(f"Scraping error for {url}: {e}")
-        return None
+        return ""

@@ -167,6 +167,18 @@ export function Inspector({ companyId, initialContactId, onClose, apiBase }: Ins
       industry_ai: data.industry_ai,
       created_at: data.created_at
     },
+    quantitative_potential: {
+      calculated_metric_name: data.calculated_metric_name,
+      calculated_metric_value: data.calculated_metric_value,
+      calculated_metric_unit: data.calculated_metric_unit,
+      standardized_metric_value: data.standardized_metric_value,
+      standardized_metric_unit: data.standardized_metric_unit,
+      metric_source: data.metric_source,
+      metric_source_url: data.metric_source_url,
+      metric_proof_text: data.metric_proof_text,
+      metric_confidence: data.metric_confidence,
+      metric_confidence_reason: data.metric_confidence_reason
+    },
     enrichment: data.enrichment_data,
     signals: data.signals
   };
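The single-company JSON export above has a global counterpart, the new `/api/companies/export` CSV endpoint named in the commit message. Its internals are not part of this diff; the following is only a sketch of the flattening step it plausibly performs, with the field list taken from the JSON payload above and everything else an assumption:

```python
import csv
import io

# Quantitative fields mirrored from the JSON export above (subset, assumed)
FIELDS = [
    "name", "industry_ai",
    "calculated_metric_name", "calculated_metric_value", "calculated_metric_unit",
    "standardized_metric_value", "standardized_metric_unit",
    "metric_source", "metric_source_url", "metric_confidence",
]

def companies_to_csv(companies):
    # Flatten a list of company dicts into CSV; unknown keys are ignored,
    # missing keys become empty cells (DictWriter's restval default).
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(companies)
    return buf.getvalue()
```

The endpoint would then return this string with a `text/csv` content type; streaming row by row would be the natural next step for large exports.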

@@ -912,6 +924,23 @@ export function Inspector({ companyId, initialContactId, onClose, apiBase }: Ins
           </div>
         )}

+        {/* Standardized Metric */}
+        {data.standardized_metric_value != null && (
+          <div className="flex items-start gap-3 pt-4 border-t border-slate-200 dark:border-slate-800">
+            <div className="p-2 bg-white dark:bg-slate-800 rounded-lg text-green-500 mt-1">
+              <Ruler className="h-4 w-4" />
+            </div>
+            <div>
+              <div className="text-[10px] text-slate-500 uppercase font-bold tracking-tight">Standardized Potential ({data.standardized_metric_unit})</div>
+              <div className="text-xl text-green-600 dark:text-green-400 font-bold">
+                {data.standardized_metric_value.toLocaleString('de-DE')}
+                <span className="text-sm font-medium text-slate-500 ml-1">{data.standardized_metric_unit}</span>
+              </div>
+              <p className="text-xs text-slate-500 mt-1">Comparable value for potential analysis.</p>
+            </div>
+          </div>
+        )}
+
        {/* Source & Confidence */}
        {data.metric_source && (
          <div className="flex justify-between items-center text-[10px] text-slate-500 pt-2 border-t border-slate-200 dark:border-slate-800">

@@ -104,7 +104,7 @@ export function RoboticsSettings({ isOpen, onClose, apiBase }: RoboticsSettingsP
         </div>

         {/* Tab Nav */}
-        <div className="flex border-b border-slate-200 dark:border-slate-800 px-6 bg-white dark:bg-slate-900 overflow-x-auto">
+        <div className="flex flex-shrink-0 border-b border-slate-200 dark:border-slate-800 px-6 bg-white dark:bg-slate-900 overflow-x-auto">
           {[
             { id: 'robotics', label: 'Robotics Potential', icon: Bot },
             { id: 'industries', label: 'Industry Focus', icon: Target },

@@ -130,14 +130,11 @@ export function RoboticsSettings({ isOpen, onClose, apiBase }: RoboticsSettingsP

           {isLoading && <div className="text-center py-12 text-slate-500">Loading...</div>}

-          {!isLoading && activeTab === 'robotics' && (
-          <div key="robotics-content" className="grid grid-cols-1 md:grid-cols-2 gap-6">
+          <div key="robotics-content" className={clsx("grid grid-cols-1 md:grid-cols-2 gap-6", { 'hidden': isLoading || activeTab !== 'robotics' })}>
             {roboticsCategories.map(cat => ( <CategoryCard key={cat.id} category={cat} onSave={handleUpdateRobotics} /> ))}
           </div>
-          )}

-          {!isLoading && activeTab === 'industries' && (
-          <div key="industries-content" className="space-y-4">
+          <div key="industries-content" className={clsx("space-y-4", { 'hidden': isLoading || activeTab !== 'industries' })}>
             <div className="flex justify-between items-center">
               <h3 className="text-sm font-bold text-slate-700 dark:text-slate-300">Industry Verticals (Synced from Notion)</h3>
             </div>

@@ -174,10 +171,8 @@ export function RoboticsSettings({ isOpen, onClose, apiBase }: RoboticsSettingsP
             ))}
           </div>
         </div>
-        )}

-        {!isLoading && activeTab === 'roles' && (
-        <div key="roles-content" className="space-y-4">
+        <div key="roles-content" className={clsx("space-y-4", { 'hidden': isLoading || activeTab !== 'roles' })}>
           <div className="flex justify-between items-center"><h3 className="text-sm font-bold text-slate-700 dark:text-slate-300">Job Title Mapping Patterns</h3><button onClick={handleAddJobRole} className="flex items-center gap-1 px-3 py-1.5 bg-blue-600 hover:bg-blue-500 text-white text-xs font-bold rounded"><Plus className="h-3 w-3" /> ADD PATTERN</button></div>
           <div className="bg-slate-50 dark:bg-slate-950 border border-slate-200 dark:border-slate-800 rounded-lg overflow-hidden">
             <table className="w-full text-left text-xs">

@@ -195,7 +190,6 @@ export function RoboticsSettings({ isOpen, onClose, apiBase }: RoboticsSettingsP
             </table>
           </div>
         </div>
-        )}
       </div>
     </div>
   </div>

@@ -64,6 +64,8 @@ services:
     volumes:
       # Sideloading: Source Code (Hot Reload)
       - ./company-explorer:/app
+      # DATABASE (Persistence)
+      - ./companies_v3_fixed_2.db:/app/companies_v3_fixed_2.db
       # Keys
       - ./gemini_api_key.txt:/app/gemini_api_key.txt
       - ./serpapikey.txt:/app/serpapikey.txt

@@ -72,6 +74,8 @@ services:
       - ./Log_from_docker:/app/logs_debug
     environment:
       - PYTHONUNBUFFERED=1
     ports:
       - "8000:8000"
     # Port 8000 is internal only

# --- B2B MARKETING ASSISTANT ---

@@ -89,8 +89,8 @@ http {

         location /ce/ {
             # Company Explorer (Robotics Edition)
-            # The trailing slash at the end is important!
-            proxy_pass http://company-explorer:8000/;
+            # NO trailing slash, so that the /ce/ path is preserved!
+            proxy_pass http://company-explorer:8000;
             proxy_set_header Host $host;
             proxy_set_header X-Real-IP $remote_addr;
             proxy_set_header Upgrade $http_upgrade;
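The behavioral difference between the two `proxy_pass` variants can be modeled in a few lines. This mimics nginx's documented rule — if `proxy_pass` carries a URI part (here the trailing `/`), the part of the request matching the `location` prefix is replaced by that URI; without a URI part, the request URI is forwarded unchanged — it is of course a model, not nginx itself:

```python
def proxied_uri(request_uri, location, proxy_pass_uri=None):
    # With a URI part, the matched location prefix is replaced by it;
    # without one, the original request URI is passed through untouched.
    if proxy_pass_uri is None:
        return request_uri
    return proxy_pass_uri + request_uri[len(location):]

# proxy_pass http://company-explorer:8000/;  -> backend loses the /ce prefix
assert proxied_uri("/ce/api/companies", "/ce/", "/") == "/api/companies"
# proxy_pass http://company-explorer:8000;   -> the /ce prefix survives
assert proxied_uri("/ce/api/companies", "/ce/", None) == "/ce/api/companies"
```

The second variant is what a FastAPI app running with `root_path="/ce"` needs to see, which is why the corrected config drops the trailing slash.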