

Gemini Code Assistant Context

Important Notes

  • Project documentation: The primary and most comprehensive documentation for this project lives in readme.md. Please consult that file for a detailed understanding of the architecture and the individual modules.
  • Git repository: This project is managed in a Git repository; all code changes are versioned. See the "Git Workflow & Conventions" section for our working rules.

Project Overview

This project is a Python-based system for automated company data enrichment and lead generation. It uses a variety of data sources, including web scraping, Wikipedia, and the OpenAI API, to enrich company data from a CRM system. The project is designed to run in a Docker container and can be controlled via a Flask API.

The system is modular and consists of the following key components:

  • brancheneinstufung_167.py: The core module for data enrichment, including web scraping, Wikipedia lookups, and AI-based analysis.
  • company_deduplicator.py: A module for intelligent duplicate checking, both for external lists and internal CRM data.
  • generate_marketing_text.py: An engine for creating personalized marketing texts.
  • app.py: A Flask application that provides an API to run the different modules.
  • company-explorer/: A new React/FastAPI-based application (v2.x) replacing the legacy CLI tools. It focuses on identifying robotics potential in companies.

Git Workflow & Conventions

  • Commit messages: Commits must follow a clear format:
    • Title: A concise summary under 100 characters.
    • Description: Detailed changes as a list with - at the start of each line (no bullet characters).
  • File renames: To preserve a file's Git history, it must always be renamed with git mv old_name.py new_name.py.
  • Commit & push process: Changes are committed locally first. Pushing to the remote server happens only after your explicit confirmation.
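
For illustration, a commit message following these conventions might look like this (contents hypothetical):

```
feat(scraper): add VAT ID extraction to Impressum parser

- Extend the LLM prompt to return the Umsatzsteuer-ID field
- Display the VAT ID in the Inspector's Legal Data block
```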

Current Status (Jan 08, 2026) - Company Explorer (Robotics Edition)

  • Robotics Potential Analysis (v2.3):

    • Logic Overhaul: Switched from keyword-based scanning to a "Chain-of-Thought" Infrastructure Analysis. The AI now evaluates physical assets (factories, warehouses, solar parks) to determine robotics needs.
    • Provider vs. User: Implemented strict reasoning to distinguish between companies selling cleaning products (providers) and those operating factories (users/potential clients).
    • Configurable Logic: Added a database-backed configuration system for robotics categories (cleaning, transport, security, service). Users can now define the "Trigger Logic" and "Scoring Guide" directly in the frontend settings.
  • Wikipedia Integration (v2.1):

    • Deep Extraction: Implemented the "Legacy" extraction logic (WikipediaService). It now pulls the first paragraph (cleaned of references), categories (filtered for relevance), revenue, employees, and HQ location.
    • Google-First Discovery: Uses SerpAPI to find the correct Wikipedia article, validating via domain match and city.
    • Visual Inspector: The frontend Inspector now displays a comprehensive Wikipedia profile including category tags.
  • Web Scraping & Legal Data (v2.2):

    • Impressum Scraping: Implemented a robust finder for "Impressum" / "Legal Notice" links.
      • Root-URL Fallback: If deep links (e.g., from /about-us) don't work, the scraper automatically checks the root domain (example.com/impressum).
      • LLM Extraction: Uses Gemini to parse unstructured Impressum text into structured JSON (Legal Name, Address, CEO).
    • Clean JSON Parsing: Implemented clean_json_response to handle AI responses containing Markdown (```json), preventing crash loops.
  • Manual Overrides & Control:

    • Wikipedia Override: Added a UI to manually correct the Wikipedia URL. This triggers a re-scan and locks the record (is_locked flag) to prevent auto-overwrite.
    • Website Override: Added a UI to manually correct the company website. This automatically clears old scraping data to force a fresh analysis on the next run.
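
The Google-First discovery step can be sketched as a simple validation filter. This is a minimal illustration, not the real WikipediaService code; the `snippet` field and function name are assumptions, not the actual SerpAPI result schema:

```python
from urllib.parse import urlparse

def validate_wikipedia_hit(candidate: dict, company_domain: str, company_city: str) -> bool:
    """Accept a search hit as the company's Wikipedia article only if it
    mentions the company's website domain or its HQ city.

    `candidate` is assumed to carry the article text under "snippet";
    the field name is illustrative, not the real SerpAPI schema.
    """
    snippet = candidate.get("snippet", "").lower()
    # Normalize "www.acme.de" and "acme.de" to the same bare domain.
    domain = urlparse(f"https://{company_domain}").netloc.removeprefix("www.")
    return domain.lower() in snippet or company_city.lower() in snippet
```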

Lessons Learned & Best Practices

  1. Numeric Extraction (German Locale):

    • Problem: "1.005 Mitarbeiter" was extracted as "1" (treating dot as decimal).
    • Solution: Implemented context-aware logic. If a number has a dot followed by exactly 3 digits (and no comma), it is treated as a thousands separator. For Revenue (is_umsatz=True), dots are generally treated as decimals (e.g. "375.6 Mio") unless multiple dots exist.
    • Rule: Always check for both , and . presence to determine locale.
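
A minimal sketch of this locale heuristic (a hypothetical helper; the actual implementation may differ in naming and edge-case handling):

```python
import re

def parse_german_number(text: str, is_umsatz: bool = False) -> float:
    """Parse a German-locale number out of free text.

    Rules (matching the lesson above):
    - comma present -> dots are thousands separators, comma is the decimal
    - multiple dots -> all dots are thousands separators
    - single dot + exactly 3 trailing digits (non-revenue) -> thousands separator
    - revenue (is_umsatz=True) -> a single dot is a decimal point ("375.6 Mio")
    """
    match = re.search(r"\d[\d.,]*", text)
    if not match:
        raise ValueError(f"no number found in {text!r}")
    num = match.group(0)

    if "," in num:
        # German style: strip thousands dots, turn the decimal comma into a dot.
        return float(num.replace(".", "").replace(",", "."))
    if num.count(".") > 1:
        # Multiple dots can only be thousands separators.
        return float(num.replace(".", ""))
    if "." in num and not is_umsatz:
        fraction = num.rsplit(".", 1)[1]
        if len(fraction) == 3:
            # "1.005 Mitarbeiter" -> 1005, not 1.005
            return float(num.replace(".", ""))
    return float(num)
```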
  2. LLM JSON Stability:

    • Problem: LLMs often wrap JSON in Markdown blocks, causing json.loads() to fail.
    • Solution: ALWAYS use a clean_json_response helper that strips ```json markers before parsing. Never trust raw LLM output for structured data.
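
The helper name `clean_json_response` comes from the project; its internals might look roughly like this sketch:

```python
import json
import re

def clean_json_response(raw: str) -> dict:
    """Strip Markdown code fences from an LLM reply before parsing JSON.

    LLMs frequently wrap structured output in ```json ... ``` blocks;
    json.loads() fails on the fence markers, so we extract the fenced
    payload (if any) before parsing.
    """
    text = raw.strip()
    fenced = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fenced:
        text = fenced.group(1).strip()
    return json.loads(text)
```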
  3. Scraping Navigation:

    • Problem: Searching for "Impressum" only on the scraped URL (which might be a subpage found via Google) often fails.
    • Solution: Always implement a fallback to the Root Domain. The legal notice is almost always linked from the homepage footer.
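
The fallback can be expressed as a candidate-URL builder that tries the scraped page first and the root domain last. The path list and function name are illustrative assumptions (the real finder also follows "Legal Notice" links in the page itself):

```python
from urllib.parse import urljoin, urlparse

# Hypothetical path list; the real finder matches more link texts.
IMPRESSUM_PATHS = ["impressum", "imprint", "legal-notice"]

def impressum_candidates(scraped_url: str) -> list[str]:
    """Build candidate Impressum URLs: relative to the scraped page first,
    then relative to the root domain (homepage) as a fallback."""
    root = "{0.scheme}://{0.netloc}/".format(urlparse(scraped_url))
    candidates = [urljoin(scraped_url, path) for path in IMPRESSUM_PATHS]
    candidates += [urljoin(root, path) for path in IMPRESSUM_PATHS]
    # De-duplicate while preserving order (deep links first, root fallback last).
    return list(dict.fromkeys(candidates))
```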

Next Steps

  • Frontend Debugging: Determine why the "Official Legal Data" block disappears in some states (likely due to conditional rendering checks on the impressum object structure).
  • Quality Assurance: Implement a dedicated "Review Mode" to validate high-potential leads.
  • Data Import: Finalize the "List Matcher" to import and deduplicate Excel lists against the new DB.