Brancheneinstufung2

Author	SHA1	Message	Date
Floke	564aef9d20	bugfix	2025-04-22 12:21:33 +00:00
Floke	cca2191e2f	bugfix	2025-04-22 11:18:10 +00:00
Floke	8284b5fa9f	bugfix	2025-04-22 09:54:08 +00:00
Floke	97eee67323	bugfix	2025-04-22 08:23:32 +00:00
Floke	e2fc6b9139	bugfix	2025-04-22 06:43:59 +00:00
Floke	36f1f8f10f	bugfix	2025-04-22 06:31:38 +00:00
Floke	5819e54bbd	bugfix	2025-04-22 06:17:23 +00:00
Floke	69302c17dd	bugfix	2025-04-22 06:13:52 +00:00
Floke	6fc67e4bab	bugfix	2025-04-22 06:12:55 +00:00
Floke	38eb9cbce9	bugfix	2025-04-22 05:38:11 +00:00
Floke	849840054c	v1.6.6: Add SerpAPI search for missing Wiki URLs of large companies - Add new operating mode `--mode find_wiki_serp`. - Implement new function `serp_wikipedia_lookup`, which uses SerpAPI to search specifically for Wikipedia articles matching a company name. - Implement new function `process_find_wiki_with_serp`: - Loads the current sheet data. - Filters rows where column M (Wiki URL) is empty/'k.A.' AND column K (CRM Mitarbeiter) exceeds a threshold (default: 500). - Calls `serp_wikipedia_lookup` for the filtered rows. - When a URL is found: - Writes the found URL to column M. - Sets flag 'x' in column A (ReEval Flag). - Clears the timestamps in columns AN (Wikipedia Timestamp) and AO (Timestamp letzte Prüfung). - Performs batched sheet updates at the end. - Integrate the new mode `find_wiki_serp` into the argument parsing and execution logic of the `main` function. - Add required imports and ensure the new functions use logging. - Update version number in `Config.VERSION` to v1.6.6.	2025-04-22 05:19:53 +00:00
Floke	166c87a451	bugfix	2025-04-21 12:39:07 +00:00
Floke	6015046c51	bugfix	2025-04-20 17:15:26 +00:00
Floke	ced9c3f723	bugfix	2025-04-19 17:49:33 +00:00
Floke	744a7e6cd7	bugfix	2025-04-19 17:44:51 +00:00
Floke	b640998eff	bugfix	2025-04-19 17:36:41 +00:00
Floke	a69a3594ab	bugfix	2025-04-19 17:23:36 +00:00
Floke	6cf9afca87	v1.6.5: Refactor logging & integrate improved WikipediaScraper - Replace custom `debug_print` function with standard Python `logging` module calls throughout the codebase. - Use appropriate logging levels (DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION). - Refactor logging setup in `main` for clarity and proper handler initialization. - Integrate updated `WikipediaScraper` class (previously developed as v1.6.5 logic): - Implement more robust infobox parsing (`_extract_infobox_value`) using flexible selectors, keyword checking (`in`), and improved value cleaning (incl. `sup` removal). - Remove old infobox fallback functions. - Enhance article validation (`_validate_article`) with better link checking via `_get_page_soup`. - Improve reliability of article search (`search_company_article`) with direct match attempt and better error handling. - Apply `@retry_on_failure` decorator to network-dependent scraper methods (`_get_page_soup`, `search_company_article`). - Ensure `Config.VERSION` reflects the logical state (v1.6.5 for this commit).	2025-04-19 16:53:35 +00:00
Floke	8c6866ca4b	bugfix	2025-04-19 07:17:29 +00:00
Floke	4fdcdcf77a	bugfix	2025-04-19 07:15:24 +00:00
Floke	c123d235ff	v1.6.5: Refactor WikipediaScraper for more robust infobox extraction - Rework WikipediaScraper._extract_infobox_value: - Uses a more flexible CSS selector ('table[class*="infobox"]') for the infobox lookup. - Iterates over table rows (tr) instead of only over th. - Checks whether keywords are contained in the normalized th text (instead of requiring an exact match). - Removes <sup> tags from td cells before text extraction. - Uses get_text(separator=' ') for better handling of <br>. - Extends the keywords_map for Branche (industry), Umsatz (revenue), Mitarbeiter (employees). - Adds detailed debug logging for the extraction process. - Remove the old fallback functions _extract_full_infobox_text and _parse_infobox_text_fallback. - Adjust WikipediaScraper.extract_company_data: - Calls _get_page_soup only once. - Uses the new _extract_infobox_value method. - Improve WikipediaScraper._validate_article: - Uses _get_page_soup for more reliable link checking. - Checks links in the infobox and external links. - Uses simple_normalize_url for URL comparisons. - Adjusts the similarity threshold when the domain match succeeds. - Improve WikipediaScraper.search_company_article: - Tries a direct match first. - Checks the first option of a disambiguation page where applicable. - Handles errors (PageError, DisambiguationError, RequestException) more robustly in the search loop. - Improve WikipediaScraper._get_page_soup: - Adds timeout, raise_for_status and explicit UTF-8 encoding. - Applies the @retry_on_failure decorator (assumption: the decorator exists). - Apply @retry_on_failure to search_company_article as well. - Update version number in Config and comments to v1.6.5.	2025-04-18 19:02:14 +00:00
Floke	3631704f03	refactor: v1.6.5 Minor code improvements and consistency - Add HTML logging to _extract_infobox_value for debugging - Implement _extract_infobox_value_fallback using regex - Call fallback in extract_company_data if primary fails - Add minor logging to _extract_first_paragraph_from_soup - Adjust extract_numeric_value for robustness - Add force_process flag to process_branch_batch for combined mode - Correct indentation in alignment_demo inner function colnum_string - Refine data preparation logic in DataProcessor.prepare_data_for_modeling - Add Config.HEADER_ROWS constant - Increment version to 1.6.5	2025-04-18 18:14:12 +00:00
Floke	e7c2d7c612	bugfix	2025-04-18 16:53:40 +00:00
Floke	96013a7e3d	bugfix	2025-04-18 16:45:37 +00:00
Floke	08b6f9248e	v1.6.5 Improve WikipediaScraper infobox extraction - Add HTML logging to _extract_infobox_value for debugging - Implement _extract_infobox_value_fallback using regex - Call fallback in extract_company_data if primary fails - Add minor logging to _extract_first_paragraph_from_soup - Adjust extract_numeric_value for robustness - Increment version to 1.6.5	2025-04-18 16:44:20 +00:00
Floke	3d306d48d7	bugfix	2025-04-18 14:20:36 +00:00
Floke	567c8e4178	bugfix	2025-04-18 14:08:09 +00:00
Floke	6e7a6bd949	bugfix	2025-04-18 10:57:37 +00:00
Floke	f3f55cd2e5	bugfix	2025-04-18 09:53:40 +00:00
Floke	f348171aa2	bugfix	2025-04-18 09:49:46 +00:00
Floke	f10778583f	bugfix	2025-04-18 06:44:42 +00:00
Floke	b76f5fe991	bugfix	2025-04-18 06:40:17 +00:00
Floke	f31c6b9fe1	bugfix	2025-04-18 06:35:55 +00:00
Floke	138b6ec2ae	bugfix	2025-04-18 06:30:48 +00:00
Floke	eff55c8125	bugfix	2025-04-18 06:25:05 +00:00
Floke	e2c4fe9d6b	bugfix	2025-04-18 06:18:56 +00:00
Floke	705a74655d	bugfix	2025-04-18 06:10:21 +00:00
Floke	51b733e3a9	bugfix	2025-04-17 19:32:37 +00:00
Floke	6b970ab0e7	bugfix	2025-04-17 19:19:44 +00:00
Floke	deed2acf2d	bugfix	2025-04-17 18:50:28 +00:00
Floke	a7220d9f20	bugfix	2025-04-17 18:34:42 +00:00
Floke	9a615a88fc	bugfix	2025-04-17 18:28:45 +00:00
Floke	04585f2b20	debug	2025-04-17 18:16:24 +00:00
Floke	d905c547ec	bugfix	2025-04-17 16:53:26 +00:00
Floke	b81f182706	bugfix	2025-04-17 15:28:11 +00:00
Floke	2ae3a4aa34	bugfix	2025-04-17 15:18:14 +00:00
Floke	23bac0c585	bugfix	2025-04-17 14:53:46 +00:00
Floke	99338dc9cf	bugfix	2025-04-17 14:48:10 +00:00
Floke	4c2ef0d251	bugfix	2025-04-17 14:36:23 +00:00
Floke	5c00505dff	v1.6.4: Implement ML model training for technician estimation - Add new operating mode `--mode train_technician_model`. - Implement data preparation in `DataProcessor.prepare_data_for_modeling`: - Loads the relevant columns. - Consolidates revenue/employees (Wiki takes priority over CRM). - Filters for a valid technician count (>0). - Creates the target variable `Techniker_Bucket` (7 categories). - Performs one-hot encoding for industries. - Implement the logic for the `train_technician_model` mode in `main`: - Performs a stratified train/test split. - Imputes missing numeric values with the median (fit on train, transform train/test). - Trains a `DecisionTreeClassifier` via `GridSearchCV` for hyperparameter optimization (focus on `f1_weighted`). - Evaluates the best model on the test set (accuracy, classification report, confusion matrix). - Extracts the tree rules via `export_text`. - Saves the trained imputer, the best model (`.pkl`) and the extracted rules (`.txt`). - Add required imports for `pandas`, `numpy`, `sklearn`, `pickle`, `json`. - Add new ML configuration parameters to `Config` (workers, limits). - Add command-line arguments for the model output files.	2025-04-17 14:00:30 +00:00
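
The row filter described in commit 849840054c (v1.6.6) can be sketched roughly as follows. This is a minimal illustration and not the repository's code: the helper names `parse_employee_count`, `needs_wiki_lookup` and `select_rows`, the `header_rows` parameter, and the handling of thousands separators are all assumptions. Only the column letters (K, M), the 'k.A.' marker, and the default threshold of 500 come from the commit message.

```python
# Zero-based indices for the sheet columns named in the commit message.
COL_CRM_EMPLOYEES = 10  # column K (CRM Mitarbeiter)
COL_WIKI_URL = 12       # column M (Wiki URL)

# Values treated as "no URL"; 'k.A.' is taken from the commit message.
EMPTY_MARKERS = {"", "k.a."}


def parse_employee_count(raw):
    """Parse a cell into an int, or return None if it is not numeric.

    Assumption: '.' and ',' only appear as thousands separators.
    """
    cleaned = raw.strip().replace(".", "").replace(",", "")
    return int(cleaned) if cleaned.isdecimal() else None


def needs_wiki_lookup(wiki_url, crm_employees, min_employees=500):
    """True if the Wiki URL is missing AND the employee count exceeds the threshold."""
    if wiki_url.strip().lower() not in EMPTY_MARKERS:
        return False
    count = parse_employee_count(crm_employees)
    return count is not None and count > min_employees


def select_rows(rows, header_rows=0, min_employees=500):
    """Return the 1-based sheet row numbers that qualify for a SerpAPI lookup."""
    selected = []
    for number, row in enumerate(rows[header_rows:], start=header_rows + 1):
        # Pad short rows so that column M can always be indexed.
        padded = row + [""] * (COL_WIKI_URL + 1 - len(row))
        if needs_wiki_lookup(padded[COL_WIKI_URL], padded[COL_CRM_EMPLOYEES], min_employees):
            selected.append(number)
    return selected
```

Returning row numbers instead of writing to the sheet immediately fits the batched update the commit describes: the caller can collect all changes for the selected rows and send them in one request at the end.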

346 Commits