fix(company-explorer): handle inconsistent LLM list responses in scraper
- Added logic to automatically flatten list-wrapped JSON responses from LLM in Impressum extraction.
- Fixed 'Unknown Legal Name' issue by ensuring property access on objects, not lists.
- Finalized v0.3.0 features and updated documentation with Lessons Learned.
20
GEMINI.md
@@ -25,12 +25,12 @@ The system architecture has evolved from a CLI-based toolset to a modern web app
### 3. Web Scraping & Legal Data (v2.2)
* **Impressum Scraping:**
* **2-Hop Strategy:** If no "Impressum" link is found on the landing page, the scraper automatically searches for for a "Kontakt" page and checks for the link there.
* **2-Hop Strategy:** If no "Impressum" link is found on the landing page, the scraper automatically searches for a "Kontakt" page and checks for the link there.
* **Root Fallback:** If deep links (e.g. `/about-us`) fail, the scraper checks the root domain (`/`).
* **LLM Extraction:** Unstructured legal text is parsed by Gemini to extract structured JSON (Legal Name, Address, CEO, VAT ID).
* **Robustness:**
* **JSON Cleaning:** A helper (`clean_json_response`) strips Markdown code blocks from LLM responses to prevent parsing errors.
* **Aggressive Cleaning:** HTML scraper removes `<a>` tags and cookie banners before text analysis to reduce noise.
* **Schema Enforcement:** Added logic to handle inconsistent LLM responses (e.g., returning a list `[{...}]` instead of a flat object `{...}`).
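
The 2-Hop Strategy and Root Fallback described above could be sketched roughly as follows. This is a minimal illustration, not the project's scraper: `find_link`, `find_impressum_url`, and the injected `fetch` callable are hypothetical names, and the order in which pages are tried (scraped URL first, then the root domain, each with a "Kontakt" hop) is an assumption.

```python
from html.parser import HTMLParser
from typing import Callable, List, Optional, Tuple
from urllib.parse import urljoin, urlsplit


class _LinkCollector(HTMLParser):
    """Collects (href, visible text) pairs for every <a> element."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_link(html: str, base_url: str, keyword: str) -> Optional[str]:
    """Return the first absolute link whose text or href contains `keyword`."""
    parser = _LinkCollector()
    parser.feed(html)
    for href, text in parser.links:
        if keyword in text.lower() or keyword in href.lower():
            return urljoin(base_url, href)
    return None


def find_impressum_url(
    start_url: str, fetch: Callable[[str], Optional[str]]
) -> Optional[str]:
    """Try the scraped page, then the root domain; on each, hop via "Kontakt"."""
    parts = urlsplit(start_url)
    root = f"{parts.scheme}://{parts.netloc}/"
    # dict.fromkeys de-duplicates while keeping order (start_url may be the root).
    for page_url in dict.fromkeys([start_url, root]):
        html = fetch(page_url)
        if not html:
            continue
        direct = find_link(html, page_url, "impressum")
        if direct:
            return direct
        # Second hop: look for a "Kontakt" page and search it for the link.
        kontakt = find_link(html, page_url, "kontakt")
        kontakt_html = fetch(kontakt) if kontakt else None
        if kontakt_html:
            hop = find_link(kontakt_html, kontakt, "impressum")
            if hop:
                return hop
    return None
```

Passing `fetch` in as a parameter keeps the sketch free of network code; any function that maps a URL to HTML text (or `None` on failure) would do.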
### 4. User Control & Ops
* **Manual Overrides:** Users can manually correct the Wikipedia URL (locking the data) and the Company Website (triggering a fresh re-scrape).
@@ -48,18 +48,18 @@ The system architecture has evolved from a CLI-based toolset to a modern web app
* **Problem:** LLMs often wrap JSON in Markdown blocks (` ```json `), causing `json.loads()` to fail.
* **Solution:** ALWAYS use a `clean_json_response` helper that strips markers before parsing. Never trust raw LLM output.
3. **Scraping Navigation:**
3. **LLM Structure Inconsistency:**
* **Problem:** Even with `json_mode=True`, models sometimes wrap the result in a list `[...]` instead of a flat object `{...}`, breaking frontend property access.
* **Solution:** Implement a check that also guards against an empty list: `if isinstance(result, list) and len(result) > 0: result = result[0]`.
4. **Scraping Navigation:**
* **Problem:** Searching for "Impressum" only on the *scraped* URL (which might be a subpage found via Google) often fails.
* **Solution:** Always implement a fallback to the **Root Domain** AND a **2-Hop check** via the "Kontakt" page.
4. **Frontend State Management:**
5. **Frontend State Management:**
* **Problem:** Users didn't see when a background job finished.
* **Solution:** A polling mechanism (`setInterval`) tied to an `isProcessing` state is superior to static timeouts for long-running AI tasks.
5. **Impressum Extraction Debugging:**
* **Problem:** Impressum fields sometimes return empty/null even when the URL is correctly identified and the page exists.
* **Solution:** Increased logging verbosity in `_scrape_impressum_data` to output the exact raw text sent to the LLM and the raw LLM response. This helps diagnose issues with LLM interpretation or JSON formatting during extraction.
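
The fence-stripping idea from the Markdown-wrapped JSON lesson above can be illustrated with a small sketch. The project's actual `clean_json_response` is not shown in this diff, so `strip_markdown_fences` below is a hypothetical stand-in and may differ from it; it only handles a single fence wrapping the whole response, not prose around the JSON.

```python
import json
import re

# Matches an opening fence (``` or ```json) at the very start of the text,
# or a closing fence at the very end.
_FENCE = re.compile(r"^```[A-Za-z0-9_-]*\s*|\s*```$")


def strip_markdown_fences(raw: str) -> str:
    """Remove one surrounding Markdown code fence, if present."""
    return _FENCE.sub("", raw.strip()).strip()


fenced = '```json\n{"legal_name": "Example GmbH"}\n```'
parsed = json.loads(strip_markdown_fences(fenced))
```

Unfenced input passes through unchanged apart from whitespace trimming, so the helper is safe to call on every response.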
## Next Steps
* **Data Import:** Finalize the "List Matcher" to import and deduplicate Excel lists against the new DB.
* **Export:** Generate Excel/CSV exports of enriched leads for CRM import.
* **Quality Assurance:** Implement a dedicated "Review Mode" to validate high-potential leads.
* **Export:** Generate Excel/CSV enriched reports.
@@ -169,7 +169,15 @@ class ScraperService:
             response_text = call_gemini(prompt, json_mode=True, temperature=0.1)
             logger.debug(f"Impressum LLM raw response ({len(response_text)} chars): {response_text[:500]}...")
-            return json.loads(clean_json_response(response_text))
+
+            result = json.loads(clean_json_response(response_text))
+
+            # --- FIX: Handle List vs Dict ---
+            # If LLM returns a list like [{...}], take the first element
+            if isinstance(result, list) and len(result) > 0:
+                result = result[0]
+
+            return result

         except Exception as e:
             logger.error(f"Impressum scrape failed for {url}: {e}", exc_info=True) # Log full traceback
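
The check in this hunk unwraps a non-empty list but passes an empty list (or any other non-dict value) through unchanged, so callers that expect an object can still fail. A slightly more defensive variant might look like the sketch below. `normalize_llm_result` is a hypothetical helper, not part of this commit, and returning `None` for unusable responses is an assumption about what callers would want.

```python
from typing import Any, Optional


def normalize_llm_result(result: Any) -> Optional[dict]:
    """Coerce a parsed LLM response into a single dict, or None if impossible."""
    if isinstance(result, dict):
        return result
    if isinstance(result, list):
        # Take the first dict in the list, skipping stray strings or nulls.
        return next((item for item in result if isinstance(item, dict)), None)
    return None
```

With this shape, `[]`, `"text"`, and `[None]` all map to `None`, which a caller can treat the same way as a failed extraction.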