# Gemini Code Assistant Context
## Important Notes
- **Project documentation:** The primary and most comprehensive documentation for this project is in `readme.md`. Please consult that file for a detailed understanding of the architecture and the individual modules.
- **Git repository:** This project is managed in a Git repository; all code changes are versioned. See the "Git Workflow & Conventions" section for our working rules.
## Project Overview
This project is a Python-based system for automated company data enrichment and lead generation. It uses a variety of data sources, including web scraping, Wikipedia, and the OpenAI API, to enrich company data from a CRM system. The project is designed to run in a Docker container and can be controlled via a Flask API.
The system is modular and consists of the following key components:
* **`brancheneinstufung_167.py`:** The core module for data enrichment, including web scraping, Wikipedia lookups, and AI-based analysis.
* **`company_deduplicator.py`:** A module for intelligent duplicate checking, both for external lists and internal CRM data.
* **`generate_marketing_text.py`:** An engine for creating personalized marketing texts.
* **`app.py`:** A Flask application that provides an API to run the different modules.
* **`company-explorer/`:** A React/FastAPI-based application (v2.x) that replaces the legacy CLI tools and focuses on identifying robotics potential in companies.
## Git Workflow & Conventions
- **Commit messages:** Commits must follow a clear format:
  - Title: a concise summary under 100 characters.
  - Description: detailed changes as a list with `- ` at the start of each line (no bullet points).
- **File renames:** To preserve a file's Git history, it must always be renamed with `git mv alter_name.py neuer_name.py`.
- **Commit & push process:** Changes are committed locally first; pushing to the remote server happens only after you explicitly confirm it.
## Current Status (Jan 08, 2026) - Company Explorer (Robotics Edition)
* **Robotics Potential Analysis (v2.3):**
    * **Logic Overhaul:** Switched from keyword-based scanning to a **"Chain-of-Thought" Infrastructure Analysis**. The AI now evaluates physical assets (factories, warehouses, solar parks) to determine robotics needs.
    * **Provider vs. User:** Implemented strict reasoning to distinguish between companies *selling* cleaning products (providers) and those *operating* factories (users/potential clients).
    * **Configurable Logic:** Added a database-backed configuration system for robotics categories (`cleaning`, `transport`, `security`, `service`). Users can now define the "Trigger Logic" and "Scoring Guide" directly in the frontend settings.
* **Wikipedia Integration (v2.1):**
    * **Deep Extraction:** Implemented the "Legacy" extraction logic (`WikipediaService`). It now pulls the **first paragraph** (cleaned of references), **categories** (filtered for relevance), revenue, employees, and HQ location.
    * **Google-First Discovery:** Uses SerpAPI to find the correct Wikipedia article, validating it via domain match and city.
    * **Visual Inspector:** The frontend `Inspector` now displays a comprehensive Wikipedia profile including category tags.
* **Web Scraping & Legal Data (v2.2):**
    * **Impressum Scraping:** Implemented a robust finder for "Impressum" / "Legal Notice" links.
    * **Root-URL Fallback:** If deep links (e.g. from `/about-us`) don't work, the scraper automatically checks the root domain (`example.com/impressum`).
    * **LLM Extraction:** Uses Gemini to parse unstructured Impressum text into structured JSON (Legal Name, Address, CEO).
    * **Clean JSON Parsing:** Implemented `clean_json_response` to handle AI responses containing Markdown (` ```json `), preventing crash loops.
* **Manual Overrides & Control:**
    * **Wikipedia Override:** Added a UI to manually correct the Wikipedia URL. This triggers a re-scan and **locks** the record (`is_locked` flag) to prevent auto-overwrite.
    * **Website Override:** Added a UI to manually correct the company website. This automatically clears old scraping data to force a fresh analysis on the next run.
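The fence-stripping idea behind `clean_json_response` can be sketched roughly as follows. This is a minimal illustration of the technique, not the project's actual implementation; the regex-based stripping is an assumption:

```python
import json
import re

def clean_json_response(raw: str) -> dict:
    """Strip Markdown code fences an LLM may wrap around JSON, then parse.

    Handles responses wrapped in fenced blocks (with or without a
    "json" language tag) as well as plain JSON.
    """
    text = raw.strip()
    # Drop a leading ``` or ```json fence marker, if present.
    text = re.sub(r"^```(?:json)?\s*", "", text)
    # Drop a trailing ``` fence marker, if present.
    text = re.sub(r"\s*```$", "", text)
    return json.loads(text)
```

The key design point is that the helper is a no-op for well-formed JSON, so it can be applied unconditionally to every model response before `json.loads()`.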
## Lessons Learned & Best Practices
1. **Numeric Extraction (German Locale):**
    * **Problem:** "1.005 Mitarbeiter" (1,005 employees) was extracted as "1" because the dot was treated as a decimal point.
    * **Solution:** Implemented context-aware logic: if a number has a dot followed by exactly three digits (and no comma), the dot is treated as a thousands separator. For revenue (`is_umsatz=True`), dots are generally treated as decimal points (e.g. "375.6 Mio") unless multiple dots exist.
    * **Rule:** Always check for the presence of both `,` and `.` to determine the locale.
2. **LLM JSON Stability:**
    * **Problem:** LLMs often wrap JSON in Markdown blocks, causing `json.loads()` to fail.
    * **Solution:** ALWAYS use a `clean_json_response` helper that strips ` ```json ` markers before parsing. Never trust raw LLM output for structured data.
3. **Scraping Navigation:**
    * **Problem:** Searching for "Impressum" only on the *scraped* URL (which might be a subpage found via Google) often fails.
    * **Solution:** Always implement a fallback to the **Root Domain**. The legal notice is almost always linked from the homepage footer.
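The locale heuristics from lesson 1 can be sketched as below. This is a simplified illustration under the rules stated above; the function name `parse_german_number` and the exact regex are assumptions, not the project's real code:

```python
import re

def parse_german_number(text: str, is_umsatz: bool = False) -> float:
    """Extract a number from German-formatted text.

    Heuristics:
    - Both ',' and '.' present: '.' is a thousands separator, ',' the decimal mark.
    - Only ',' present: it is the decimal mark.
    - Multiple dots: all are thousands separators ("1.234.567" -> 1234567).
    - A single dot followed by exactly 3 digits (and is_umsatz=False):
      thousands separator ("1.005" -> 1005).
    - Otherwise (notably for revenue, is_umsatz=True): decimal point
      ("375.6" -> 375.6).
    """
    match = re.search(r"\d[\d.,]*", text)
    if not match:
        raise ValueError(f"no number found in {text!r}")
    num = match.group(0).rstrip(".,")  # drop trailing punctuation
    if "," in num and "." in num:
        num = num.replace(".", "").replace(",", ".")
    elif "," in num:
        num = num.replace(",", ".")
    elif num.count(".") > 1:
        num = num.replace(".", "")
    elif "." in num:
        whole, frac = num.split(".")
        if len(frac) == 3 and not is_umsatz:
            num = whole + frac  # thousands separator: "1.005" -> "1005"
    return float(num)
```

So `parse_german_number("1.005 Mitarbeiter")` yields 1005, while `parse_german_number("375.6 Mio", is_umsatz=True)` yields 375.6.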
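The fallback chain from lesson 3 (direct Impressum link, then a "Kontakt" hop, then the root-domain convention) might look roughly like this. The name `find_impressum_url` and its signature are hypothetical; a real implementation would fetch and parse pages with an HTTP client and an HTML parser:

```python
from urllib.parse import urljoin, urlparse

LEGAL_KEYWORDS = ("impressum", "legal notice", "legal-notice", "imprint")
HOP_KEYWORDS = ("kontakt", "contact")

def find_impressum_url(page_url, links):
    """Pick the most likely legal-notice URL from (href, link_text) pairs.

    Fallback order:
    1. A link whose href or text contains an Impressum keyword.
    2. A 'Kontakt' link (2-hop: the caller re-scans that page).
    3. The conventional root-domain path, e.g. https://example.com/impressum.
    """
    for keywords in (LEGAL_KEYWORDS, HOP_KEYWORDS):
        for href, text in links:
            haystack = f"{href} {text}".lower()
            if any(kw in haystack for kw in keywords):
                return urljoin(page_url, href)
    root = urlparse(page_url)
    return f"{root.scheme}://{root.netloc}/impressum"
```

Because the last step always falls back to the root domain, the function never returns nothing even when the scraped subpage has no footer links at all.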
## Next Steps
* **Frontend Debugging:** Investigate why the "Official Legal Data" block disappears in some states (likely a conditional-rendering check on the `impressum` object structure).
* **Quality Assurance:** Implement a dedicated "Review Mode" to validate high-potential leads.
* **Data Import:** Finalize the "List Matcher" to import and deduplicate Excel lists against the new DB.