Brancheneinstufung2

Author	SHA1	Message	Date
Floke	3872ee292c	Feat: Add thin content and cookie banner detection - Enhances the `_scrape_website_task_batch` worker to improve data quality assessment. - Implements a "Thin Content" check: If the extracted text is less than 200 characters, the URL status is set to `URL_SCRAPE_THIN_CONTENT`. - Adds a heuristic for detecting cookie banners: If the text is short (< 500 chars) and contains a high density of cookie-related keywords, the status is set to `URL_SCRAPE_COOKIE_BANNER`. - These new statuses provide more granular insights into scraping issues, allowing for better-targeted reprocessing and quality control.	2025-07-20 19:22:11 +00:00
Floke	5f5dd16c1c	Update data_processor.py	2025-07-20 18:22:26 +00:00
Floke	706ba082e9	Update data_processor.py	2025-07-20 12:46:55 +00:00
Floke	8f1d28dc07	Update data_processor.py	2025-07-20 12:42:23 +00:00
Floke	2947312236	Update data_processor.py	2025-07-20 12:41:31 +00:00
Floke	9cc58ca294	Update helpers.py	2025-07-20 12:38:45 +00:00
Floke	df52c9ab7e	Update data_processor.py	2025-07-20 12:37:45 +00:00
Floke	9771cabf55	Update data_processor.py	2025-07-20 10:43:42 +00:00
Floke	8dfe7d23ec	Major rework, much code deleted - Refactors the website scraping batch process to fix critical stability issues. - Replaces multiple redundant and conflicting scraping functions (`_scrape_website_task`, `_scrape_raw_text_task`, `_scrape_and_summarize_task`) with a single, robust worker function: `_scrape_website_task_batch`. - The new worker function now consistently returns a structured dictionary, resolving the `TypeError` that prevented results from being written to the sheet. - The main batch function `process_website_scraping_batch` is updated to correctly handle this new dictionary structure, including error states. - Functionality is now aligned with the single-row processing mode by also fetching meta-details in the batch process, not just raw text. - The two large, duplicated, and now obsolete `process_website_scraping` functions have been removed to improve code clarity and maintainability.	2025-07-20 09:18:49 +00:00
Floke	155675d827	Update data_processor.py	2025-07-20 08:49:15 +00:00
Floke	b0a7b8893a	Update data_processor.py	2025-07-20 08:47:54 +00:00
Floke	4037656029	Update data_processor.py	2025-07-20 08:33:21 +00:00
Floke	7dbd8a59f2	Update data_processor.py	2025-07-20 08:05:15 +00:00
Floke	b38fcaa7fd	Update data_processor.py	2025-07-20 07:57:06 +00:00
Floke	7b76cc09ef	Update data_processor.py	2025-07-20 07:56:31 +00:00
Floke	5cef0b0260	Update data_processor.py	2025-07-20 07:52:24 +00:00
Floke	071be8a410	Update data_processor.py	2025-07-20 07:48:03 +00:00
Floke	a09746dadd	Update data_processor.py	2025-07-20 07:46:35 +00:00
Floke	e8c901a081	Update data_processor.py	2025-07-20 07:40:30 +00:00
Floke	444e93e1d9	Update data_processor.py	2025-07-20 07:21:01 +00:00
Floke	aa8ee04c87	Update wikipedia_scraper.py	2025-07-20 06:53:02 +00:00
Floke	01d04c1b8f	Robust, linear Wikipedia search - REFACTOR: The complex, recursive `search_company_article` function in `wikipedia_scraper.py` has been replaced with a simple, linear implementation. - FIX: The persistent `TypeError` in parameter passing is finally resolved by the new, clearer structure. - FEATURE: The search now checks a list of search terms and validates each potential hit, which increases reliability.	2025-07-20 06:39:29 +00:00
Floke	dad57e852c	Update wikipedia_scraper.py	2025-07-20 06:34:33 +00:00
Floke	f7668823a4	Update wikipedia_scraper.py	2025-07-20 06:28:43 +00:00
Floke	9afb7888bd	Update data_processor.py	2025-07-20 06:04:56 +00:00
Floke	eb83ccdf48	Update data_processor.py	2025-07-20 06:03:11 +00:00
Floke	b1efeb3d8c	Update helpers.py	2025-07-20 05:44:56 +00:00
Floke	018b8acbdc	Update helpers.py	2025-07-20 05:35:57 +00:00
Floke	82ccd46c7f	Update data_processor.py	2025-07-20 05:26:00 +00:00
Floke	7a796c5460	Update google_sheet_handler.py	2025-07-20 05:07:03 +00:00
Floke	aeb5bb5ac1	Update data_processor.py	2025-07-20 04:47:04 +00:00
Floke	c77dd6938d	Update data_processor.py	2025-07-20 04:28:51 +00:00
Floke	b8c3416854	Update data_processor.py	2025-07-19 20:25:35 +00:00
Floke	585f84c3bd	Update data_processor.py	2025-07-19 20:24:50 +00:00
Floke	3cc4ec7f9a	Update data_processor.py	2025-07-19 20:23:56 +00:00
Floke	d33f025f9a	Update wikipedia_scraper.py	2025-07-19 20:10:19 +00:00
Floke	3a8bff2d27	Update data_processor.py	2025-07-19 20:03:28 +00:00
Floke	9bd38bbaea	Update data_processor.py	2025-07-19 19:56:18 +00:00
Floke	8b840cd16f	Update helpers.py	2025-07-19 19:53:33 +00:00
Floke	22afa4b1d2	Regex replacement: COLUMN_MAP\["([^"]+)"\] -> get_col_idx("$1")	2025-07-19 19:45:01 +00:00
Floke	f9cb3488da	Update data_processor.py	2025-07-19 19:14:16 +00:00
Floke	0441ba2d17	Update wikipedia_scraper.py	2025-07-19 19:03:02 +00:00
Floke	96cc188262	Update helpers.py	2025-07-19 18:45:43 +00:00
Floke	7b066d794e	Update data_processor.py	2025-07-19 18:42:41 +00:00
Floke	80f7e9afe6	Added def get_col_idx(key):	2025-07-19 18:34:20 +00:00
Floke	a6b4da61c0	Added new COLUMN_ORDER	2025-07-19 18:12:49 +00:00
Floke	81d83a48f9	Adjust reeval	2025-07-19 15:38:56 +00:00
Floke	cefda9c7c0	Adjust Verify Wiki	2025-07-19 15:25:34 +00:00
Floke	ccf394d5c2	Adjust Verify Wiki Article	2025-07-19 15:15:58 +00:00
Floke	42f3fc321b	Update wikipedia_scraper.py	2025-07-19 15:13:00 +00:00
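
Commit 3872ee292c describes two text-quality heuristics in `_scrape_website_task_batch`. The log does not show the code itself, so the following is only a minimal sketch of how such checks could look. The status strings and the 200/500 character limits come from the commit message. The function name `classify_scraped_text`, the keyword list, the density threshold, the order of the two checks, and the `None` return for unflagged text are assumptions made for illustration.

```python
import re

# Limits taken from the commit message.
THIN_CONTENT_MAX_CHARS = 200
COOKIE_BANNER_MAX_CHARS = 500

# Assumptions: the commit names neither the keywords nor the density cutoff.
COOKIE_KEYWORDS = {"cookie", "cookies", "consent", "datenschutz",
                   "akzeptieren", "einwilligung"}
COOKIE_KEYWORD_DENSITY = 0.05  # share of words that must be cookie-related


def classify_scraped_text(text):
    """Return a scrape status for suspicious text, or None if nothing is flagged.

    Hypothetical helper; the real worker may structure these checks differently.
    """
    stripped = (text or "").strip()

    # Thin content: fewer than 200 characters of extracted text.
    if len(stripped) < THIN_CONTENT_MAX_CHARS:
        return "URL_SCRAPE_THIN_CONTENT"

    # Cookie banner: short text dominated by cookie-related vocabulary.
    if len(stripped) < COOKIE_BANNER_MAX_CHARS:
        words = re.findall(r"\w+", stripped.lower())
        hits = sum(1 for word in words if word in COOKIE_KEYWORDS)
        if words and hits / len(words) >= COOKIE_KEYWORD_DENSITY:
            return "URL_SCRAPE_COOKIE_BANNER"

    return None
```

Because the thin-content check runs first in this sketch, the cookie-banner branch only sees texts between 200 and 499 characters. Whether the original worker orders the checks this way is not stated in the commit.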


1056 Commits
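
Commits a6b4da61c0, 80f7e9afe6 and 22afa4b1d2 together suggest a migration from direct `COLUMN_MAP["..."]` lookups to a `get_col_idx(key)` helper backed by `COLUMN_ORDER`. The log shows only the names and the search/replace pattern, so the sketch below is an assumption about how the pieces might fit: the column names, the 0-based index, the `None` return for unknown keys, and the `migrate_lookups` helper are all hypothetical.

```python
import re

# Hypothetical column names; the real COLUMN_ORDER is not shown in the log.
COLUMN_ORDER = ["Company Name", "Website", "Wiki URL"]

# Precomputed name -> position mapping so each lookup is O(1).
_COLUMN_INDEX = {name: idx for idx, name in enumerate(COLUMN_ORDER)}


def get_col_idx(key):
    """Return the 0-based position of a column name, or None if unknown (assumed behavior)."""
    return _COLUMN_INDEX.get(key)


def migrate_lookups(source):
    """Apply the search/replace from commit 22afa4b1d2 to a source string.

    The pattern is the one in the commit message; Python's re module
    writes the backreference as \\1 where the editor syntax uses $1.
    """
    return re.sub(r'COLUMN_MAP\["([^"]+)"\]', r'get_col_idx("\1")', source)
```

For example, `migrate_lookups('row[COLUMN_MAP["Website"]]')` yields `row[get_col_idx("Website")]`, which keeps call sites independent of how the column positions are stored.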