Brancheneinstufung2/data_processor.py at 5067bf2af2e58309b3ec9b6ed5f13472838ee95b


Floke 3872ee292c Feat: Add thin content and cookie banner detection

- Enhances the `_scrape_website_task_batch` worker to improve data quality assessment.
- Implements a "Thin Content" check: if the extracted text is shorter than 200 characters, the URL status is set to `URL_SCRAPE_THIN_CONTENT`.
- Adds a heuristic for detecting cookie banners: if the text is short (< 500 characters) and contains a high density of cookie-related keywords, the status is set to `URL_SCRAPE_COOKIE_BANNER`.
- These new statuses provide more granular insights into scraping issues, allowing for better-targeted reprocessing and quality control.
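
The two checks described above can be sketched as follows. This is a minimal illustration, not the code in `_scrape_website_task_batch`: only the 200 and 500 character thresholds and the two status names come from the commit message. The function name `classify_scraped_text`, the keyword list, the hit threshold, the fallback status `URL_SCRAPE_OK`, and the order of the checks are all assumptions made for the example.

```python
# Thresholds stated in the commit message.
THIN_CONTENT_MAX_CHARS = 200
COOKIE_BANNER_MAX_CHARS = 500

# Assumptions for illustration only: the real keyword list and the
# definition of "high density" are not given in the commit message.
COOKIE_KEYWORDS = ("cookie", "consent", "datenschutz", "akzeptieren", "accept")
MIN_COOKIE_KEYWORD_HITS = 3

# Hypothetical status for text that passes both checks.
STATUS_OK = "URL_SCRAPE_OK"


def classify_scraped_text(text: str) -> str:
    """Return a scrape status for the extracted page text."""
    stripped = text.strip()

    # Thin content: too little text to be useful.
    # Checking this first is an assumption about the ordering.
    if len(stripped) < THIN_CONTENT_MAX_CHARS:
        return "URL_SCRAPE_THIN_CONTENT"

    # Cookie banner: short text dominated by cookie-related keywords.
    if len(stripped) < COOKIE_BANNER_MAX_CHARS:
        lowered = stripped.lower()
        hits = sum(lowered.count(keyword) for keyword in COOKIE_KEYWORDS)
        if hits >= MIN_COOKIE_KEYWORD_HITS:
            return "URL_SCRAPE_COOKIE_BANNER"

    return STATUS_OK
```

With this ordering, the cookie banner status can only apply to text between 200 and 499 characters; running the cookie check first would instead let very short banners be reported as `URL_SCRAPE_COOKIE_BANNER` rather than `URL_SCRAPE_THIN_CONTENT`.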

2025-07-20 19:22:11 +00:00

353 KiB
