feat(company-explorer): bump version to 0.3.0, add VAT ID extraction, and fix deep-link scraping
- Updated version to v0.3.0 (UI & Backend) to clear potential caching confusion.
- Enhanced Impressum scraper to extract VAT ID (Umsatzsteuer-ID).
- Implemented 2-Hop scraping strategy: Looks for 'Kontakt' page if Impressum isn't on the start page.
- Added VAT ID display to the Legal Data block in Inspector.
26
GEMINI.md
@@ -37,14 +37,32 @@ The system is modular and consists of the following key components:
* **Google-First Discovery:** Uses SerpAPI to find the correct Wikipedia article, validating via domain match and city.
* **Visual Inspector:** The frontend `Inspector` now displays a comprehensive Wikipedia profile including category tags.
* **Web Scraping & Legal Data (v2.2):**
* **Impressum Scraping:** Implemented a robust finder for "Impressum" / "Legal Notice" links.
* **Root-URL Fallback:** If no Impressum link is found on a deep-linked page (e.g., `/about-us`), the scraper automatically checks the root domain (e.g., `example.com/impressum`).
* **LLM Extraction:** Uses Gemini to parse unstructured Impressum text into structured JSON (Legal Name, Address, CEO).
* **Clean JSON Parsing:** Implemented `clean_json_response` to handle AI responses containing Markdown (` ```json `), preventing crash loops.
* **Manual Overrides & Control:**
* **Wikipedia Override:** Added a UI to manually correct the Wikipedia URL. This triggers a re-scan and **locks** the record (`is_locked` flag) to prevent auto-overwrite.
* **Website Override:** Added a UI to manually correct the company website. This automatically clears old scraping data to force a fresh analysis on the next run.
* **Architecture & DB:**
* **Database:** Updated `companies_v3_final.db` schema to include `RoboticsCategory` and `EnrichmentData.is_locked`.
* **Services:** Refactored `ClassificationService` and `DiscoveryService` for better modularity and robustness.
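The override behaviour described under "Manual Overrides & Control" can be sketched as follows. This is a minimal in-memory illustration, not the project's actual model: `CompanyRecord`, `override_wikipedia`, `auto_update_wikipedia`, and `override_website` are hypothetical names, and the `is_locked` flag (which the notes place on `EnrichmentData`) is flattened onto a single record for brevity. The re-scan that a Wikipedia override triggers is not modelled.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CompanyRecord:
    """Hypothetical stand-in for a company row plus its enrichment data."""
    website: str
    wikipedia_url: Optional[str] = None
    is_locked: bool = False                      # mirrors EnrichmentData.is_locked
    scraped: dict = field(default_factory=dict)  # cached scraping results


def override_wikipedia(record: CompanyRecord, url: str) -> None:
    """Manual correction: store the URL and lock it against auto-overwrite."""
    record.wikipedia_url = url
    record.is_locked = True


def auto_update_wikipedia(record: CompanyRecord, url: str) -> bool:
    """Automatic discovery: skip locked records. Returns True if it wrote."""
    if record.is_locked:
        return False
    record.wikipedia_url = url
    return True


def override_website(record: CompanyRecord, website: str) -> None:
    """Manual correction: replace the website and drop stale scraping data."""
    record.website = website
    record.scraped.clear()
```

Keeping the lock check inside the automatic path (rather than in the UI) means any background job that calls it respects manual corrections without extra coordination.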
## Lessons Learned & Best Practices
1. **Numeric Extraction (German Locale):**
* **Problem:** "1.005 Mitarbeiter" was extracted as "1" (treating dot as decimal).
* **Solution:** Implemented context-aware logic. If a number has a dot followed by exactly 3 digits (and no comma), it is treated as a thousands separator. For Revenue (`is_umsatz=True`), dots are generally treated as decimals (e.g. "375.6 Mio") unless multiple dots exist.
* **Rule:** Always check whether `,` and `.` are both present before deciding which locale convention applies.
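The rules above can be sketched roughly as follows. The function name `parse_number` is hypothetical, and only the `is_umsatz` flag comes from the notes. The handling of comma-only input (a single comma is read as a German decimal comma) is an assumption that the notes do not spell out.

```python
import re
from typing import Optional

# First run of digits, possibly containing dots and commas.
_NUMBER_RE = re.compile(r"\d[\d.,]*")


def parse_number(text: str, is_umsatz: bool = False) -> Optional[float]:
    """Extract the first number from `text`, guessing the separator roles."""
    match = _NUMBER_RE.search(text)
    if match is None:
        return None
    raw = match.group(0).rstrip(".,")  # drop sentence punctuation
    has_dot, has_comma = "." in raw, "," in raw

    if has_dot and has_comma:
        # Both present: whichever separator comes last is the decimal mark.
        if raw.rfind(",") > raw.rfind("."):
            raw = raw.replace(".", "").replace(",", ".")  # 1.234,5 (German)
        else:
            raw = raw.replace(",", "")                    # 1,234.5 (English)
    elif has_comma:
        # ASSUMPTION: one comma is a decimal comma, several are thousands marks.
        raw = raw.replace(",", "") if raw.count(",") > 1 else raw.replace(",", ".")
    elif has_dot:
        digits_after_dot = len(raw.rsplit(".", 1)[1])
        if raw.count(".") > 1:
            raw = raw.replace(".", "")   # 1.234.567 -> thousands separators
        elif not is_umsatz and digits_after_dot == 3:
            raw = raw.replace(".", "")   # 1.005 Mitarbeiter -> 1005
        # otherwise keep the dot as a decimal point (e.g. 375.6 Mio)
    return float(raw)
```

For example, `parse_number("1.005 Mitarbeiter")` yields `1005.0`, while `parse_number("375.6 Mio", is_umsatz=True)` yields `375.6`.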
2. **LLM JSON Stability:**
* **Problem:** LLMs often wrap JSON in Markdown blocks, causing `json.loads()` to fail.
* **Solution:** ALWAYS use a `clean_json_response` helper that strips ` ```json ` markers before parsing. Never trust raw LLM output for structured data.
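A minimal sketch of such a helper is shown below. Only the name `clean_json_response` comes from the notes. The body and the `parse_llm_json` wrapper are illustrative assumptions, not the project's actual implementation.

```python
import json
import re
from typing import Any, Optional

# Opening fence with an optional language tag, and the closing fence.
_OPEN_FENCE = re.compile(r"^`{3}[a-zA-Z]*\s*")
_CLOSE_FENCE = re.compile(r"\s*`{3}$")


def clean_json_response(raw: str) -> str:
    """Strip a surrounding Markdown code fence from an LLM reply, if any."""
    text = raw.strip()
    text = _OPEN_FENCE.sub("", text)
    text = _CLOSE_FENCE.sub("", text)
    return text.strip()


def parse_llm_json(raw: str) -> Optional[Any]:
    """Hypothetical wrapper: return parsed JSON, or None instead of raising."""
    try:
        return json.loads(clean_json_response(raw))
    except json.JSONDecodeError:
        return None
```

Returning `None` on failure lets the caller skip or retry a record rather than crash, which is one way to avoid the crash loops mentioned earlier.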
3. **Scraping Navigation:**
* **Problem:** Searching for "Impressum" only on the *scraped* URL (which might be a subpage found via Google) often fails.
* **Solution:** Always implement a fallback to the **Root Domain**. The legal notice is almost always linked from the homepage footer.
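The fallback order could look roughly like the sketch below, which uses only the standard library. All names (`find_impressum_url`, `find_link`, `root_url`) are hypothetical, and `fetch` is an injected callable that returns HTML or `None`, so the logic can be tried without network access. The second hop via a "Kontakt" page follows the commit message. That the contact page links to the Impressum, rather than containing it, is an assumption.

```python
from html.parser import HTMLParser
from typing import Callable, Iterable, List, Optional, Tuple
from urllib.parse import urljoin, urlsplit

LEGAL_KEYWORDS = ("impressum", "legal notice")
CONTACT_KEYWORDS = ("kontakt", "contact")


class _LinkCollector(HTMLParser):
    """Collects (href, link text) pairs from anchor tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_link(html: str, base_url: str, keywords: Iterable[str]) -> Optional[str]:
    """Return the first link whose href or text contains one of the keywords."""
    parser = _LinkCollector()
    parser.feed(html)
    for href, text in parser.links:
        haystack = f"{href} {text}".lower()
        if any(keyword in haystack for keyword in keywords):
            return urljoin(base_url, href)
    return None


def root_url(url: str) -> str:
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"


def find_impressum_url(start_url: str,
                       fetch: Callable[[str], Optional[str]]) -> Optional[str]:
    """Try the scraped page first, then the root domain; on each, try a
    direct Impressum link and then one hop via a contact page."""
    candidates = [start_url]
    if root_url(start_url) != start_url:
        candidates.append(root_url(start_url))
    for page_url in candidates:
        html = fetch(page_url)
        if not html:
            continue
        link = find_link(html, page_url, LEGAL_KEYWORDS)
        if link:
            return link
        contact_url = find_link(html, page_url, CONTACT_KEYWORDS)
        contact_html = fetch(contact_url) if contact_url else None
        if contact_html:
            link = find_link(contact_html, contact_url, LEGAL_KEYWORDS)
            if link:
                return link
    return None
```

Passing `fetch` in as a parameter keeps the navigation logic separate from HTTP concerns such as timeouts and retries, and makes it straightforward to exercise with a dictionary of canned pages.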
## Next Steps
* **Frontend Debugging:** Verify why the "Official Legal Data" block disappears in some states (likely due to conditional rendering checks on `impressum` object structure).
* **Quality Assurance:** Implement a dedicated "Review Mode" to validate high-potential leads.
* **Data Import:** Finalize the "List Matcher" to import and deduplicate Excel lists against the new DB.