feat(company-explorer): bump version to 0.3.0, add VAT ID extraction, and fix deep-link scraping

- Updated version to v0.3.0 (UI & Backend) to avoid confusion caused by cached builds.
- Enhanced Impressum scraper to extract VAT ID (Umsatzsteuer-ID).
- Implemented 2-Hop scraping strategy: Looks for 'Kontakt' page if Impressum isn't on the start page.
- Added VAT ID display to the Legal Data block in Inspector.
Commit 601593c65c (parent dbc3ce9b34) — 2026-01-08 12:10:09 +00:00
8 changed files with 156 additions and 27 deletions


@@ -37,14 +37,32 @@ The system is modular and consists of the following key components:
 * **Google-First Discovery:** Uses SerpAPI to find the correct Wikipedia article, validating via domain match and city.
 * **Visual Inspector:** The frontend `Inspector` now displays a comprehensive Wikipedia profile including category tags.
+* **Web Scraping & Legal Data (v2.2):**
+    * **Impressum Scraping:** Implemented a robust finder for "Impressum" / "Legal Notice" links.
+    * **Root-URL Fallback:** If deep links (e.g., from `/about-us`) don't work, the scraper automatically checks the root domain (`example.com/impressum`).
+    * **LLM Extraction:** Uses Gemini to parse unstructured Impressum text into structured JSON (Legal Name, Address, CEO).
+    * **Clean JSON Parsing:** Implemented `clean_json_response` to handle AI responses containing Markdown (` ```json `), preventing crash loops.
 * **Manual Overrides & Control:**
     * **Wikipedia Override:** Added a UI to manually correct the Wikipedia URL. This triggers a re-scan and **locks** the record (`is_locked` flag) to prevent auto-overwrite.
     * **Website Override:** Added a UI to manually correct the company website. This automatically clears old scraping data to force a fresh analysis on the next run.
 * **Architecture & DB:**
     * **Database:** Updated `companies_v3_final.db` schema to include `RoboticsCategory` and `EnrichmentData.is_locked`.
     * **Services:** Refactored `ClassificationService` and `DiscoveryService` for better modularity and robustness.
 
+## Lessons Learned & Best Practices
+
+1. **Numeric Extraction (German Locale):**
+    * **Problem:** "1.005 Mitarbeiter" was extracted as "1" (the dot was treated as a decimal point).
+    * **Solution:** Implemented context-aware logic. If a number has a dot followed by exactly 3 digits (and no comma), it is treated as a thousands separator. For revenue (`is_umsatz=True`), dots are generally treated as decimals (e.g. "375.6 Mio") unless multiple dots exist.
+    * **Rule:** Always check for both `,` and `.` presence to determine the locale.
+2. **LLM JSON Stability:**
+    * **Problem:** LLMs often wrap JSON in Markdown blocks, causing `json.loads()` to fail.
+    * **Solution:** ALWAYS use a `clean_json_response` helper that strips ` ```json ` markers before parsing. Never trust raw LLM output for structured data.
+3. **Scraping Navigation:**
+    * **Problem:** Searching for "Impressum" only on the *scraped* URL (which might be a subpage found via Google) often fails.
+    * **Solution:** Always implement a fallback to the **root domain**. The legal notice is almost always linked from the homepage footer.
+
 ## Next Steps
+* **Frontend Debugging:** Verify why the "Official Legal Data" block disappears in some states (likely due to conditional rendering checks on the `impressum` object structure).
 * **Quality Assurance:** Implement a dedicated "Review Mode" to validate high-potential leads.
 * **Data Import:** Finalize the "List Matcher" to import and deduplicate Excel lists against the new DB.
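The German-locale rule in lesson 1 above can be sketched as a small parser. This is a sketch only — `parse_german_number` is an illustrative name and signature, not necessarily the helper used in the services:

```python
import re

def parse_german_number(raw: str, is_umsatz: bool = False) -> float:
    """Sketch of the locale rule described above (illustrative, not the project's code).

    A dot followed by exactly 3 digits, with no comma present, is read as a
    thousands separator; for revenue (is_umsatz=True) a single dot stays a
    decimal point (e.g. "375.6 Mio"), unless multiple dots exist.
    """
    match = re.search(r"[\d.,]+", raw)
    if not match:
        raise ValueError(f"No number found in {raw!r}")
    num = match.group(0)
    if "," in num:
        # Comma present -> German decimal comma; dots separate thousands
        num = num.replace(".", "").replace(",", ".")
    elif num.count(".") > 1:
        # Several dots can only be thousands separators
        num = num.replace(".", "")
    elif re.fullmatch(r"\d+\.\d{3}", num) and not is_umsatz:
        # "1.005 Mitarbeiter" -> 1005, not 1.005
        num = num.replace(".", "")
    return float(num)
```

Following the stated rule, `"1.005 Mitarbeiter"` yields 1005 while `"375.6 Mio"` with `is_umsatz=True` yields 375.6.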


@@ -9,7 +9,7 @@ try:
 class Settings(BaseSettings):
     # App Info
     APP_NAME: str = "Company Explorer"
-    VERSION: str = "0.2.2"
+    VERSION: str = "0.3.0"
     DEBUG: bool = True
     # Database (Store in App dir for simplicity)


@@ -84,44 +84,59 @@ class ScraperService:
     def _find_impressum_link(self, soup: BeautifulSoup, base_url: str) -> Optional[str]:
         """
-        Scans all links for keywords like 'Impressum', 'Legal', 'Imprint'.
-        Returns the absolute URL.
+        Scans links for Impressum. If not found, tries to find 'Kontakt' page and looks there.
         """
-        keywords = ["impressum", "imprint", "legal notice", "anbieterkennzeichnung", "rechtliches", "legal", "disclaimer"]
+        # 1. Try Direct Impressum Link
+        direct_url = self._find_link_by_keywords(soup, base_url, ["impressum", "imprint", "legal notice", "anbieterkennzeichnung", "rechtliches"])
+        if direct_url:
+            return direct_url
 
-        # Candidate tracking
-        candidates = []
+        # 2. Try 2-Hop via "Kontakt"
+        logger.info(f"No direct Impressum found on {base_url}. Checking 'Kontakt' page...")
+        kontakt_url = self._find_link_by_keywords(soup, base_url, ["kontakt", "contact"])
+        if kontakt_url:
+            try:
+                headers = {'User-Agent': random.choice(USER_AGENTS)}
+                resp = requests.get(kontakt_url, headers=headers, timeout=10, verify=False)
+                if resp.status_code == 200:
+                    sub_soup = BeautifulSoup(resp.content, 'html.parser')
+                    # Look for Impressum on Kontakt page
+                    sub_impressum = self._find_link_by_keywords(sub_soup, kontakt_url, ["impressum", "imprint", "legal notice", "anbieterkennzeichnung"])
+                    if sub_impressum:
+                        logger.info(f"Found Impressum via Kontakt page: {sub_impressum}")
+                        return sub_impressum
+            except Exception as e:
+                logger.warning(f"Failed to scan Kontakt page {kontakt_url}: {e}")
+        return None
 
+    def _find_link_by_keywords(self, soup: BeautifulSoup, base_url: str, keywords: list) -> Optional[str]:
+        """Helper to find a link matching specific keywords."""
+        candidates = []
         for a in soup.find_all('a', href=True):
             text = clean_text(a.get_text()).lower()
             href = a['href'].lower()
 
             # Debug log for potential candidates (verbose)
             # if "imp" in text or "imp" in href:
             #     logger.debug(f"Checking link: '{text}' -> {href}")
 
             # Check text content or href keywords
             if any(kw in text for kw in keywords) or any(kw in href for kw in keywords):
                 # Avoid mailto links or purely social links if possible
                 if "mailto:" in href or "tel:" in href or "javascript:" in href:
                     continue
                 full_url = urljoin(base_url, a['href'])
 
-                # Prioritize 'impressum' in text over href
                 score = 0
-                if "impressum" in text: score += 10
-                if "impressum" in href: score += 5
+                # Higher score if keyword is in visible text
+                if any(kw in text for kw in keywords): score += 10
+                # Lower score if only in href
+                if any(kw in href for kw in keywords): score += 5
+                # Boost specific exact matches
+                if text in keywords: score += 5
                 candidates.append((score, full_url))
 
         if candidates:
             # Sort by score desc
             candidates.sort(key=lambda x: x[0], reverse=True)
-            best_match = candidates[0][1]
-            logger.info(f"Impressum Link Selection: Found {len(candidates)} candidates. Winner: {best_match}")
-            return best_match
+            return candidates[0][1]
 
         return None
 
     def _scrape_impressum_data(self, url: str) -> Dict[str, str]:
@@ -143,7 +158,7 @@ class ScraperService:
         # LLM Extraction
         prompt = f"""
         Extract the official company details from this German 'Impressum' text.
-        Return JSON ONLY. Keys: 'legal_name', 'street', 'zip', 'city', 'email', 'phone', 'ceo_name'.
+        Return JSON ONLY. Keys: 'legal_name', 'street', 'zip', 'city', 'email', 'phone', 'ceo_name', 'vat_id'.
         If a field is missing, use null.
         Text:


@@ -73,7 +73,7 @@ function App() {
                 </div>
                 <div>
                     <h1 className="text-xl font-bold text-white tracking-tight">Company Explorer</h1>
-                    <p className="text-xs text-blue-400 font-medium">ROBOTICS EDITION <span className="text-slate-600 ml-2">v0.2.2 (New DB Path)</span></p>
+                    <p className="text-xs text-blue-400 font-medium">ROBOTICS EDITION <span className="text-slate-600 ml-2">v0.3.0 (Polling & Legal Data)</span></p>
                 </div>
             </div>


@@ -281,9 +281,10 @@ export function Inspector({ companyId, onClose, apiBase }: InspectorProps) {
                 </div>
                 {(impressum.email || impressum.phone) && (
-                    <div className="mt-2 pt-2 border-t border-slate-900 flex gap-4 text-[10px] text-slate-500 font-mono">
+                    <div className="mt-2 pt-2 border-t border-slate-900 flex flex-wrap gap-4 text-[10px] text-slate-500 font-mono">
                         {impressum.email && <span>{impressum.email}</span>}
                         {impressum.phone && <span>{impressum.phone}</span>}
+                        {impressum.vat_id && <span className="text-blue-400/80">VAT: {impressum.vat_id}</span>}
                     </div>
                 )}
             </div>

debug_igepa.py (new file, 34 lines)

@@ -0,0 +1,34 @@
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://www.igepa.de/"
print(f"Fetching {url}...")

try:
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
    response = requests.get(url, headers=headers, verify=False, timeout=15)
    print(f"Status: {response.status_code}")
    soup = BeautifulSoup(response.content, 'html.parser')

    print("\n--- Searching for Impressum Candidates ---")
    keywords = ["impressum", "imprint", "legal notice", "anbieterkennzeichnung", "rechtliches", "legal", "disclaimer"]
    found = False
    for a in soup.find_all('a', href=True):
        text = a.get_text().strip().lower()
        href = a['href'].lower()
        # print(f"Link: '{text}' -> {href}")  # Verbose
        if any(kw in text for kw in keywords) or any(kw in href for kw in keywords):
            print(f"MATCH: Text='{text}' | Href='{href}'")
            found = True

    if not found:
        print("No matches found.")
except Exception as e:
    print(f"Error: {e}")

debug_igepa_deep.py (new file, 34 lines)

@@ -0,0 +1,34 @@
import requests
from bs4 import BeautifulSoup

url = "https://www.igepa.de/zweih_gmbh_co_kg/ueber-uns/"
print(f"Fetching {url}...")

try:
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
    response = requests.get(url, headers=headers, verify=False, timeout=15)
    soup = BeautifulSoup(response.content, 'html.parser')

    print("\n--- Searching for 'imp' in Href or Text ---")
    found = False
    for a in soup.find_all('a', href=True):
        text = a.get_text().strip().lower()
        href = a['href'].lower()
        if "imp" in href or "imp" in text:
            print(f"MATCH: Text='{text}' | Href='{href}'")
            found = True

    if not found:
        print("No match for 'imp' found.")

    print("\n--- Searching for '2h' specific links ---")
    for a in soup.find_all('a', href=True):
        href = a['href'].lower()
        if "zweih" in href:
            print(f"2H Link: {href}")
except Exception as e:
    print(f"Error: {e}")
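This subpage having no Impressum link is exactly the failure the commit's root-URL fallback (and the new 2-hop "Kontakt" step) addresses. A sketch of deriving root-domain candidates from a deep link — the helper name and candidate paths are illustrative, not the project's code:

```python
from urllib.parse import urljoin, urlparse

def root_fallback_candidates(page_url: str) -> list:
    """Given a deep link (e.g. /ueber-uns/), return root-domain URLs to try
    for the legal notice. Illustrative sketch, not the project's code."""
    parsed = urlparse(page_url)
    root = f"{parsed.scheme}://{parsed.netloc}/"
    # Homepage first (its footer usually links the Impressum), then common paths
    return [root, urljoin(root, "impressum"), urljoin(root, "imprint")]
```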

debug_igepa_dump.py (new file, 27 lines)

@@ -0,0 +1,27 @@
import requests
from bs4 import BeautifulSoup

url = "https://www.igepa.de/"
print(f"Fetching {url}...")

try:
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
    response = requests.get(url, headers=headers, verify=False, timeout=15)
    soup = BeautifulSoup(response.content, 'html.parser')
    print(f"Page Title: {soup.title.string if soup.title else 'No Title'}")

    print("\n--- All Links (First 50) ---")
    count = 0
    for a in soup.find_all('a', href=True):
        text = a.get_text().strip().replace('\n', ' ')
        href = a['href']
        print(f"[{count}] {text[:30]}... -> {href}")
        count += 1
        if count > 50: break
except Exception as e:
    print(f"Error: {e}")