fix(company-explorer): enhance impressum scraping debug logging
- Increased logging verbosity in `_scrape_impressum_data` to track the raw input to the LLM and the raw LLM response.
- This helps diagnose why Impressum data extraction might be failing for specific company websites.
@@ -25,7 +25,7 @@ The system architecture has evolved from a CLI-based toolset to a modern web app
### 3. Web Scraping & Legal Data (v2.2)
* **Impressum Scraping:**
* **2-Hop Strategy:** If no "Impressum" link is found on the landing page, the scraper automatically searches for a "Kontakt" page and checks for the link there.
* **Root Fallback:** If deep links (e.g. `/about-us`) fail, the scraper checks the root domain (`/`).
* **LLM Extraction:** Unstructured legal text is parsed by Gemini to extract structured JSON (Legal Name, Address, CEO, VAT ID).
* **Robustness:**
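
The Impressum link discovery described above (2-hop strategy plus root fallback) could be sketched roughly as follows. This is a minimal illustration and not the project's actual scraper: `find_link`, `find_impressum_url`, and the injected `fetch` callable are hypothetical names, and matching links by the keywords "impressum" and "kontakt" is an assumption.

```python
from html.parser import HTMLParser
from typing import Callable, Optional
from urllib.parse import urljoin

# Hypothetical fetcher type: returns page HTML, or None if the request failed.
Fetch = Callable[[str], Optional[str]]


class _LinkCollector(HTMLParser):
    """Collects (href, visible text) pairs for every <a> tag with an href."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[tuple[str, str]] = []
        self._href: Optional[str] = None
        self._text: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_link(html: str, base_url: str, keyword: str) -> Optional[str]:
    """Return the absolute URL of the first link whose text or href contains keyword."""
    collector = _LinkCollector()
    collector.feed(html)
    for href, text in collector.links:
        if keyword in text.lower() or keyword in href.lower():
            return urljoin(base_url, href)
    return None


def _search_from(url: str, fetch: Fetch) -> Optional[str]:
    """Hop 1: look for an Impressum link on url. Hop 2: try its Kontakt page."""
    html = fetch(url)
    if not html:
        return None
    found = find_link(html, url, "impressum")
    if found:
        return found
    kontakt_url = find_link(html, url, "kontakt")
    if kontakt_url:
        kontakt_html = fetch(kontakt_url)
        if kontakt_html:
            return find_link(kontakt_html, kontakt_url, "impressum")
    return None


def find_impressum_url(start_url: str, fetch: Fetch) -> Optional[str]:
    """Search from start_url; if that fails, fall back to the root domain."""
    found = _search_from(start_url, fetch)
    if found:
        return found
    root_url = urljoin(start_url, "/")  # assumed meaning of "root fallback"
    if root_url != start_url:
        return _search_from(root_url, fetch)
    return None
```

Passing the fetcher in as a parameter keeps the search logic testable without network access; in a real scraper it would wrap an HTTP client.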
@@ -56,6 +56,10 @@ The system architecture has evolved from a CLI-based toolset to a modern web app
* **Problem:** Users didn't see when a background job finished.
* **Solution:** Implementing a polling mechanism (`setInterval`) tied to an `isProcessing` state is superior to static timeouts for long-running AI tasks.
5. **Impressum Extraction Debugging:**
* **Problem:** Impressum fields sometimes return empty/null even when the URL is correctly identified and the page exists.
* **Solution:** Increased logging verbosity in `_scrape_impressum_data` to output the exact raw text sent to the LLM and the raw LLM response. This helps diagnose issues with LLM interpretation or JSON formatting during extraction.
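
As a rough illustration of this kind of debug logging, the sketch below logs the exact text sent to the model and the untouched response before attempting to parse it. It is not the actual body of `_scrape_impressum_data`: the function name `extract_impressum_fields`, the logger name, and the injected `call_llm` callable are all hypothetical stand-ins.

```python
import json
import logging
from typing import Callable, Optional

logger = logging.getLogger("impressum_debug")  # hypothetical logger name


def extract_impressum_fields(
    raw_text: str, call_llm: Callable[[str], str]
) -> Optional[dict]:
    """Send raw_text to the LLM and parse its JSON reply, logging both sides."""
    # Log exactly what goes to the model, so empty or truncated input is visible.
    logger.debug("Impressum raw input (%d chars): %r", len(raw_text), raw_text)

    response = call_llm(raw_text)

    # Log the reply verbatim before parsing; %r exposes stray whitespace,
    # Markdown code fences, or other wrappers that would break json.loads.
    logger.debug("Impressum raw LLM response: %r", response)

    try:
        data = json.loads(response)
    except json.JSONDecodeError as exc:
        logger.warning("LLM response is not valid JSON: %s", exc)
        return None

    if not isinstance(data, dict):
        logger.warning("Expected a JSON object, got %s", type(data).__name__)
        return None
    return data
```

With both log lines in place, an empty result can be traced either to the input (for example, a page whose text extraction came back empty) or to the response (for example, JSON wrapped in a Markdown fence).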
## Next Steps
* **Data Import:** Finalize the "List Matcher" to import and deduplicate Excel lists against the new DB.
* **Export:** Generate Excel/CSV exports of enriched leads for CRM import.