feat(company-explorer): force-refresh analysis and refine extraction logic
- Enforced fresh scrape on 'Analyze' request to bypass stale cache.
- Implemented 2-Hop Impressum scraping strategy (via Kontakt page).
- Refined numeric extraction for German locale (thousands separators).
- Updated documentation with Lessons Learned.
GEMINI.md (75 changed lines)
@@ -7,62 +7,55 @@
## Project Overview
This project is a Python-based system for automated company data enrichment and lead generation. It uses a variety of data sources, including web scraping, Wikipedia, and the OpenAI API, to enrich company data from a CRM system, and focuses on identifying B2B companies with high potential for robotics automation (Cleaning, Transport, Security, Service).

The system architecture has evolved from a CLI-based toolset to a modern web application (`company-explorer`) backed by Docker containers. The key components are:

* **`brancheneinstufung_167.py`:** The core module for data enrichment, including web scraping, Wikipedia lookups, and AI-based analysis.
* **`company_deduplicator.py`:** A module for intelligent duplicate checking, both for external lists and internal CRM data.
* **`generate_marketing_text.py`:** An engine for creating personalized marketing texts.
* **`app.py`:** A Flask application that provides an API to run the different modules.
* **`company-explorer/`:** A new React/FastAPI-based application (v2.x) replacing the legacy CLI tools. It focuses on identifying robotics potential in companies.
## Git Workflow & Conventions

- **Commit messages:** Commits should follow a clear format:
  - Title: a concise summary under 100 characters.
  - Description: detailed changes as a list with `- ` at the start of each line (no bullet points).
- **File renames:** To preserve a file's Git history, it must always be renamed with `git mv alter_name.py neuer_name.py`.
- **Commit & push process:** Changes are committed locally first. Pushing to the remote server happens only after your explicit confirmation.

## Current Status (Jan 08, 2026) - Company Explorer (Robotics Edition v0.3.0)

### 1. Robotics Potential Analysis (v2.3)

* **Chain-of-Thought Logic:** The AI analysis (`ClassificationService`) now uses a multi-step reasoning process to evaluate companies based on their **physical infrastructure** (factories, warehouses, solar parks) rather than just keywords.
* **Provider vs. User:** Strict logic distinguishes between companies *selling* automation products and those *needing* them for their own operations.
* **Configurable Settings:** A database-driven configuration (`RoboticsCategory`) lets users edit the definition, "Trigger Logic", and "Scoring Guide" for each robotics category (`cleaning`, `transport`, `security`, `service`) directly via the frontend settings menu.
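A minimal sketch of what such a database-driven category configuration could hold, using plain dataclasses for illustration; the real `RoboticsCategory` model is presumably an ORM class, and the field names shown here (`trigger_logic`, `scoring_guide`) follow this document's terminology but may differ in the backend:

```python
from dataclasses import dataclass

@dataclass
class RoboticsCategoryConfig:
    """Illustrative stand-in for the database-backed RoboticsCategory model."""
    key: str            # one of "cleaning", "transport", "security", "service"
    definition: str     # what counts as potential in this category
    trigger_logic: str  # prompt fragment: when the AI should flag a company
    scoring_guide: str  # prompt fragment: how findings map to a score

# Example row a user might edit via the frontend settings menu
DEFAULT_CATEGORIES = [
    RoboticsCategoryConfig(
        key="cleaning",
        definition="Large floor areas that require recurring cleaning.",
        trigger_logic="Company operates factories, warehouses or large offices.",
        scoring_guide="Score by floor area and shift coverage, not by keywords.",
    ),
]
```

Because the prompt fragments live in the database rather than in code, editing them in the settings menu changes the AI's reasoning without a redeploy.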
### 2. Deep Wikipedia Integration (v2.1)

* **Extraction:** The system uses the "Legacy" extraction logic (`WikipediaService`) to pull the first paragraph (cleaned of artifacts and references), industry, revenue (normalized to Mio €), employee count, and Wikipedia categories (filtered for relevance).
* **Validation:** Uses a "Google-First" strategy via SerpAPI, validating candidates by checking for domain matches and the city/HQ location in the article.
* **UI:** The Inspector displays a dedicated Wikipedia profile section with visual tags.
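The validation step described above could be sketched roughly as follows; `validate_wikipedia_candidate` is a hypothetical name, and whether the real service requires both signals or accepts either one is an assumption (here either one is enough):

```python
def validate_wikipedia_candidate(article_text: str, company_domain: str, city: str) -> bool:
    """Accept a SerpAPI-discovered Wikipedia article only if it links the
    company's domain (e.g. the homepage in the infobox) or mentions the
    headquarters city. Case-insensitive substring checks for illustration."""
    haystack = article_text.lower()
    return company_domain.lower() in haystack or city.lower() in haystack

# Usage: filter the SerpAPI result list down to plausible articles
candidates = ["... robert-bosch.com ... Gerlingen ...", "an unrelated article"]
plausible = [a for a in candidates
             if validate_wikipedia_candidate(a, "robert-bosch.com", "Gerlingen")]
```

Checking the domain is the stronger signal, since a homepage link in the infobox rarely points to the wrong company; the city check catches articles that omit the URL.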
### 3. Web Scraping & Legal Data (v2.2)

* **Impressum Scraping:**
  * **2-Hop Strategy:** If no "Impressum" link is found on the landing page, the scraper automatically searches for a "Kontakt" page and checks for the link there.
  * **Root Fallback:** If deep links (e.g. `/about-us`) fail, the scraper checks the root domain (e.g. `example.com/impressum`).
  * **LLM Extraction:** Unstructured legal text is parsed by Gemini to extract structured JSON (Legal Name, Address, CEO, VAT ID).
* **Robustness:**
  * **JSON Cleaning:** A helper (`clean_json_response`) strips Markdown code blocks from LLM responses to prevent parsing errors and crash loops.
  * **Aggressive Cleaning:** The HTML scraper removes `<a>` tags and cookie banners before text analysis to reduce noise.
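The 2-Hop strategy with root fallback can be sketched as below; the regex link finder and the injected `fetch` callable (URL in, HTML out) are illustrative choices so the navigation logic stays testable offline, not the project's actual scraper API:

```python
import re
from urllib.parse import urljoin

LINK_RE = re.compile(r'<a[^>]+href="([^"]+)"[^>]*>(.*?)</a>', re.I | re.S)

def find_legal_link(html, base_url):
    """Return the first link whose href or text looks like a legal notice."""
    for href, text in LINK_RE.findall(html):
        haystack = (href + " " + text).lower()
        if "impressum" in haystack or "legal notice" in haystack:
            return urljoin(base_url, href)
    return None

def locate_impressum(fetch, start_url, root_url):
    """2-hop strategy: landing page -> Kontakt page -> root-domain fallback."""
    html = fetch(start_url)
    link = find_legal_link(html, start_url)
    if link:
        return link
    # Hop 2: follow a "Kontakt" link and look for the Impressum link there
    for href, text in LINK_RE.findall(html):
        if "kontakt" in (href + " " + text).lower():
            kontakt_url = urljoin(start_url, href)
            link = find_legal_link(fetch(kontakt_url), kontakt_url)
            if link:
                return link
            break
    # Root fallback: the legal notice is almost always in the homepage footer
    if start_url.rstrip("/") != root_url.rstrip("/"):
        return find_legal_link(fetch(root_url), root_url)
    return None
```

Injecting `fetch` keeps HTTP concerns (timeouts, retries, user agents) out of the navigation logic and makes the hop order trivially unit-testable with a dict of canned pages.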
### 4. User Control & Ops

* **Manual Overrides:** Users can manually correct the Wikipedia URL (which triggers a re-scan and locks the record via an `is_locked` flag to prevent auto-overwrite) and the company website (which clears old scraping data to force a fresh re-scrape).
* **Polling UI:** The frontend uses intelligent polling to auto-refresh data when background jobs (Discovery/Analysis) complete.
* **Forced Refresh:** The "Analyze" endpoint now clears old cache data to ensure a fresh scrape on every user request.
## Lessons Learned & Best Practices

1. **Numeric Extraction (German Locale):**
   * **Problem:** "1.005 Mitarbeiter" was extracted as "1" (treating the dot as a decimal separator).
   * **Solution:** Implemented context-aware logic. If a number has a dot followed by exactly 3 digits (and no comma), the dot is treated as a thousands separator.
   * **Revenue:** For revenue (`is_umsatz=True`), dots are generally treated as decimals (e.g. "375.6 Mio") unless unambiguous multiple dots exist. Billion (Mrd) values are converted to thousands of millions.
   * **Rule:** Always check for the presence of both `,` and `.` to determine the locale.
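The locale rules above can be condensed into a single helper; `parse_german_number` and its exact edge-case handling are an illustrative sketch, not the project's actual implementation:

```python
import re

def parse_german_number(text: str, is_umsatz: bool = False):
    """Sketch of the context-aware German-locale number parsing described above."""
    match = re.search(r"\d[\d.,]*", text)
    if match is None:
        return None
    token = match.group(0)
    if "," in token:
        # German locale: dots are thousands separators, comma is the decimal mark
        token = token.replace(".", "").replace(",", ".")
    elif token.count(".") > 1:
        # Multiple dots are unambiguous thousands separators ("1.234.567")
        token = token.replace(".", "")
    elif re.fullmatch(r"\d+\.\d{3}", token) and not is_umsatz:
        # A single dot followed by exactly 3 digits (no comma): thousands separator
        token = token.replace(".", "")
    # Otherwise (revenue, or a dot not followed by 3 digits) the dot is a decimal
    value = float(token)
    if is_umsatz and re.search(r"\bMrd\b|\bMilliarde", text):
        value *= 1000  # normalize billions (Mrd) to millions
    return value
```

With these rules, "1.005 Mitarbeiter" parses as 1005 employees, while "375.6 Mio" revenue keeps its decimal point.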
2. **LLM JSON Stability:**
   * **Problem:** LLMs often wrap JSON in Markdown code blocks (` ```json `), causing `json.loads()` to fail.
   * **Solution:** ALWAYS use a `clean_json_response` helper that strips the fence markers before parsing. Never trust raw LLM output for structured data.
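A minimal version of such a helper is sketched below; the project's real `clean_json_response` may handle more edge cases (leading prose, trailing commentary) than this:

```python
import json
import re

def clean_json_response(raw: str) -> str:
    """Strip Markdown code fences (``` or ```json) that LLMs wrap around JSON."""
    text = raw.strip()
    text = re.sub(r"^```[a-zA-Z]*\s*", "", text)  # opening fence, e.g. ```json
    text = re.sub(r"\s*```$", "", text)           # closing fence
    return text.strip()

# Typical Gemini-style response wrapped in a fence
payload = '```json\n{"legal_name": "Example GmbH", "city": "Berlin"}\n```'
data = json.loads(clean_json_response(payload))
```

The helper is idempotent on already-clean output, so it is safe to apply to every LLM response before `json.loads()`.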
3. **Scraping Navigation:**
   * **Problem:** Searching for "Impressum" only on the *scraped* URL (which might be a subpage found via Google) often fails.
   * **Solution:** Always implement a fallback to the **Root Domain** AND a **2-Hop check** via the "Kontakt" page. The legal notice is almost always linked from the homepage footer.
4. **Frontend State Management:**
   * **Problem:** Users could not tell when a background job had finished.
   * **Solution:** A polling mechanism (`setInterval`) tied to an `isProcessing` state is superior to static timeouts for long-running AI tasks.
## Next Steps

* **Frontend Debugging:** Verify why the "Official Legal Data" block disappears in some states (likely due to conditional rendering checks on the `impressum` object structure).
* **Quality Assurance:** Implement a dedicated "Review Mode" to validate high-potential leads.
* **Data Import:** Finalize the "List Matcher" to import and deduplicate Excel lists against the new DB.
* **Export:** Generate Excel/CSV exports of enriched leads for CRM import.
@@ -324,6 +324,19 @@ def analyze_company(req: AnalysisRequest, background_tasks: BackgroundTasks, db:

    if not company.website or company.website == "k.A.":
        return {"error": "No website to analyze. Run Discovery first."}

    # FORCE SCRAPE LOGIC: every manual "Analyze" click clears cached scrape data
    # to fix the "stuck cache" issue reported by the user. A frontend-controlled
    # force_scrape flag may be honored here later.
    db.query(EnrichmentData).filter(
        EnrichmentData.company_id == company.id,
        EnrichmentData.source_type == "website_scrape"
    ).delete()
    db.commit()

    background_tasks.add_task(run_analysis_task, company.id, company.website)
    return {"status": "queued"}
tmp/sidebar.txt (new file, 1 line; diff suppressed because one or more lines are too long)