Floke dd09d5e268 Feat: Add thin content and cookie banner detection
- Enhances the `_scrape_website_task_batch` worker to improve data quality assessment.
- Implements a "Thin Content" check: If the extracted text is less than 200 characters, the URL status is set to `URL_SCRAPE_THIN_CONTENT`.
- Adds a heuristic for detecting cookie banners: If the text is short (< 500 chars) and contains a high density of cookie-related keywords, the status is set to `URL_SCRAPE_COOKIE_BANNER`.
- These new statuses separate pages that yielded too little text from pages where mostly a cookie banner was captured, so each case can be reprocessed or filtered on its own during quality control.
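
The two checks described above can be sketched roughly as follows. This is a minimal illustration and not the code in `_scrape_website_task_batch`. The 200 and 500 character limits and the two status names come from the commit message. The helper name `classify_scraped_text`, the `URL_SCRAPE_OK` fallback, the keyword list, the 10% density cutoff, and running the thin-content check first are all assumptions made for the example.

```python
import re

# Status names taken from the commit message; the string values are assumed.
URL_SCRAPE_THIN_CONTENT = "URL_SCRAPE_THIN_CONTENT"
URL_SCRAPE_COOKIE_BANNER = "URL_SCRAPE_COOKIE_BANNER"
URL_SCRAPE_OK = "URL_SCRAPE_OK"  # hypothetical fallback status

THIN_CONTENT_MAX_CHARS = 200   # from the commit message
COOKIE_BANNER_MAX_CHARS = 500  # from the commit message

# Assumed values: the commit only says "high density of cookie-related keywords".
COOKIE_KEYWORDS = ("cookie", "consent", "privacy", "accept", "gdpr")
COOKIE_DENSITY_THRESHOLD = 0.10


def classify_scraped_text(text: str) -> str:
    """Return a scrape status for the extracted page text (hypothetical helper)."""
    stripped = text.strip()

    # Thin content: fewer than 200 characters of extracted text.
    if len(stripped) < THIN_CONTENT_MAX_CHARS:
        return URL_SCRAPE_THIN_CONTENT

    # Cookie banner: short text in which many words are cookie-related.
    if len(stripped) < COOKIE_BANNER_MAX_CHARS:
        words = re.findall(r"\w+", stripped.lower())
        if words:
            hits = sum(
                1 for word in words
                if any(keyword in word for keyword in COOKIE_KEYWORDS)
            )
            if hits / len(words) >= COOKIE_DENSITY_THRESHOLD:
                return URL_SCRAPE_COOKIE_BANNER

    return URL_SCRAPE_OK
```

With this ordering, the cookie banner check only ever sees texts of 200 to 499 characters. If banners shorter than 200 characters should also be labelled as such, the two checks would need to be swapped.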
2025-07-20 19:22:11 +00:00
2.8 GiB
Languages
Python 63.6%
TypeScript 19.2%
JavaScript 15.6%
HTML 0.7%
Dockerfile 0.4%
Other 0.5%