- Enhances the `_scrape_website_task_batch` worker to improve data quality assessment.
- Implements a "Thin Content" check: If the extracted text is less than 200 characters, the URL status is set to `URL_SCRAPE_THIN_CONTENT`.
- Adds a heuristic for detecting cookie banners: If the text is short (< 500 chars) and contains a high density of cookie-related keywords, the status is set to `URL_SCRAPE_COOKIE_BANNER`.
- These new statuses provide more granular insights into scraping issues, allowing for better-targeted reprocessing and quality control.
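
The two checks described above could look roughly like the following. This is a minimal sketch, not the worker's actual code: the keyword list, the density threshold of 0.15, the helper names `cookie_keyword_density` and `classify_scraped_text`, and the `URL_SCRAPE_OK` fallback status are all assumptions made for illustration. The description also does not say which check runs first; this sketch tests the cookie heuristic first so that very short banners are not reported as thin content.

```python
import re

# Status names taken from the change description.
URL_SCRAPE_THIN_CONTENT = "URL_SCRAPE_THIN_CONTENT"
URL_SCRAPE_COOKIE_BANNER = "URL_SCRAPE_COOKIE_BANNER"
# Hypothetical placeholder for "no issue detected"; not named in the description.
URL_SCRAPE_OK = "URL_SCRAPE_OK"

# Length thresholds taken from the change description.
THIN_CONTENT_MAX_CHARS = 200
COOKIE_BANNER_MAX_CHARS = 500

# Assumed values: the description only says "high density of cookie-related keywords".
COOKIE_KEYWORDS = frozenset(
    {"cookie", "cookies", "consent", "gdpr", "privacy", "accept", "preferences"}
)
COOKIE_DENSITY_THRESHOLD = 0.15


def cookie_keyword_density(text: str) -> float:
    """Return the fraction of words in `text` that are cookie-related keywords."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in COOKIE_KEYWORDS)
    return hits / len(words)


def classify_scraped_text(text: str) -> str:
    """Map extracted page text to one of the scrape statuses."""
    stripped = text.strip()
    # Cookie banner: short text dominated by consent vocabulary.
    if (
        len(stripped) < COOKIE_BANNER_MAX_CHARS
        and cookie_keyword_density(stripped) >= COOKIE_DENSITY_THRESHOLD
    ):
        return URL_SCRAPE_COOKIE_BANNER
    # Thin content: too little text to be useful.
    if len(stripped) < THIN_CONTENT_MAX_CHARS:
        return URL_SCRAPE_THIN_CONTENT
    return URL_SCRAPE_OK
```

Keeping the thresholds as module-level constants makes them easy to tune once real reprocessing data shows how often each status fires.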