Brancheneinstufung2

Author	SHA1	Message	Date
Floke	744a7e6cd7	bugfix	2025-04-19 17:44:51 +00:00
Floke	b640998eff	bugfix	2025-04-19 17:36:41 +00:00
Floke	a69a3594ab	bugfix	2025-04-19 17:23:36 +00:00
Floke	6cf9afca87	v1.6.5: Refactor logging & integrate improved WikipediaScraper - Replace custom `debug_print` function with standard Python `logging` module calls throughout the codebase. - Use appropriate logging levels (DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION). - Refactor logging setup in `main` for clarity and proper handler initialization. - Integrate updated `WikipediaScraper` class (previously developed as v1.6.5 logic): - Implement more robust infobox parsing (`_extract_infobox_value`) using flexible selectors, keyword checking (`in`), and improved value cleaning (incl. `sup` removal). - Remove old infobox fallback functions. - Enhance article validation (`_validate_article`) with better link checking via `_get_page_soup`. - Improve reliability of article search (`search_company_article`) with direct match attempt and better error handling. - Apply `@retry_on_failure` decorator to network-dependent scraper methods (`_get_page_soup`, `search_company_article`). - Ensure `Config.VERSION` reflects the logical state (v1.6.5 for this commit).	2025-04-19 16:53:35 +00:00
Floke	8c6866ca4b	bugfix	2025-04-19 07:17:29 +00:00
Floke	4fdcdcf77a	bugfix	2025-04-19 07:15:24 +00:00
Floke	c123d235ff	v1.6.5: Refactor WikipediaScraper for more robust infobox extraction - Rework WikipediaScraper._extract_infobox_value: - Uses a more flexible CSS selector ('table[class*="infobox"]') to find the infobox. - Iterates over table rows (tr) instead of only th elements. - Checks whether keywords are contained in the normalized th text (instead of requiring an exact match). - Removes <sup> tags from td cells before text extraction. - Uses get_text(separator=' ') for better handling of <br>. - Extends the keywords_map for industry (Branche), revenue (Umsatz), and employees (Mitarbeiter). - Adds detailed debug logging for the extraction process. - Remove the old fallback functions _extract_full_infobox_text and _parse_infobox_text_fallback. - Adapt WikipediaScraper.extract_company_data: - Calls _get_page_soup only once. - Uses the new _extract_infobox_value method. - Improve WikipediaScraper._validate_article: - Uses _get_page_soup for more reliable link checking. - Checks links in the infobox and external links. - Uses simple_normalize_url for URL comparisons. - Adjusts the similarity threshold when the domain match succeeds. - Improve WikipediaScraper.search_company_article: - Tries a direct match first. - Checks the first option of a disambiguation page if needed. - Handles errors (PageError, DisambiguationError, RequestException) more robustly in the search loop. - Improve WikipediaScraper._get_page_soup: - Adds a timeout, raise_for_status, and explicit UTF-8 encoding. - Applies the @retry_on_failure decorator (assumption: the decorator exists). - Apply @retry_on_failure to search_company_article as well. - Update the version number in Config and comments to v1.6.5.	2025-04-18 19:02:14 +00:00
Floke	3631704f03	refactor: v1.6.5 Minor code improvements and consistency - Add HTML logging to _extract_infobox_value for debugging - Implement _extract_infobox_value_fallback using regex - Call fallback in extract_company_data if primary fails - Add minor logging to _extract_first_paragraph_from_soup - Adjust extract_numeric_value for robustness - Add force_process flag to process_branch_batch for combined mode - Correct indentation in alignment_demo inner function colnum_string - Refine data preparation logic in DataProcessor.prepare_data_for_modeling - Add Config.HEADER_ROWS constant - Increment version to 1.6.5	2025-04-18 18:14:12 +00:00
Floke	e7c2d7c612	bugfix	2025-04-18 16:53:40 +00:00
Floke	96013a7e3d	bugfix	2025-04-18 16:45:37 +00:00
Floke	08b6f9248e	v1.6.5 Improve WikipediaScraper infobox extraction - Add HTML logging to _extract_infobox_value for debugging - Implement _extract_infobox_value_fallback using regex - Call fallback in extract_company_data if primary fails - Add minor logging to _extract_first_paragraph_from_soup - Adjust extract_numeric_value for robustness - Increment version to 1.6.5	2025-04-18 16:44:20 +00:00
Floke	3d306d48d7	bugfix	2025-04-18 14:20:36 +00:00
Floke	567c8e4178	bugfix	2025-04-18 14:08:09 +00:00
Floke	6e7a6bd949	bugfix	2025-04-18 10:57:37 +00:00
Floke	f3f55cd2e5	bugfix	2025-04-18 09:53:40 +00:00
Floke	f348171aa2	bugfix	2025-04-18 09:49:46 +00:00
Floke	f10778583f	bugfix	2025-04-18 06:44:42 +00:00
Floke	b76f5fe991	bugfix	2025-04-18 06:40:17 +00:00
Floke	f31c6b9fe1	bugfix	2025-04-18 06:35:55 +00:00
Floke	138b6ec2ae	bugfix	2025-04-18 06:30:48 +00:00
Floke	eff55c8125	bugfix	2025-04-18 06:25:05 +00:00
Floke	e2c4fe9d6b	bugfix	2025-04-18 06:18:56 +00:00
Floke	705a74655d	bugfix	2025-04-18 06:10:21 +00:00
Floke	51b733e3a9	bugfix	2025-04-17 19:32:37 +00:00
Floke	6b970ab0e7	bugfix	2025-04-17 19:19:44 +00:00
Floke	deed2acf2d	bugfix	2025-04-17 18:50:28 +00:00
Floke	a7220d9f20	bugfix	2025-04-17 18:34:42 +00:00
Floke	9a615a88fc	bugfix	2025-04-17 18:28:45 +00:00
Floke	04585f2b20	debug	2025-04-17 18:16:24 +00:00
Floke	d905c547ec	bugfix	2025-04-17 16:53:26 +00:00
Floke	b81f182706	bugfix	2025-04-17 15:28:11 +00:00
Floke	2ae3a4aa34	bugfix	2025-04-17 15:18:14 +00:00
Floke	23bac0c585	bugfix	2025-04-17 14:53:46 +00:00
Floke	99338dc9cf	bugfix	2025-04-17 14:48:10 +00:00
Floke	4c2ef0d251	bugfix	2025-04-17 14:36:23 +00:00
Floke	5c00505dff	v1.6.4: Implement ML model training for technician estimation - Add new operating mode `--mode train_technician_model`. - Implement data preparation in `DataProcessor.prepare_data_for_modeling`: - Loads relevant columns. - Consolidates revenue/employees (Wiki takes priority over CRM). - Filters for a valid technician count (>0). - Creates target variable `Techniker_Bucket` (7 categories). - Performs one-hot encoding for industries. - Implement logic for the `train_technician_model` mode in `main`: - Performs a train/test split (stratified). - Imputes missing numeric values with the median (fit on train, transform train/test). - Trains a `DecisionTreeClassifier` using `GridSearchCV` for hyperparameter optimization (focus on `f1_weighted`). - Evaluates the best model on the test set (accuracy, classification report, confusion matrix). - Extracts tree rules using `export_text`. - Saves the trained imputer, the best model (`.pkl`), and the extracted rules (`.txt`). - Add necessary imports for `pandas`, `numpy`, `sklearn`, `pickle`, `json`. - Add new configuration parameters for ML in `Config` (workers, limits). - Add command-line arguments for model output files.	2025-04-17 14:00:30 +00:00
Floke	ac1fde7f65	bugfix	2025-04-17 13:08:19 +00:00
Floke	0207fa0410	bugfix	2025-04-17 13:03:25 +00:00
Floke	1dd59f5abb	bugfix	2025-04-17 12:53:10 +00:00
Floke	e1f86b918d	bugfix	2025-04-17 12:48:03 +00:00
Floke	ad62e88946	bugfix	2025-04-17 12:26:07 +00:00
Floke	f26b9c3758	bugfix	2025-04-17 11:01:49 +00:00
Floke	baa503a949	bugfix	2025-04-17 10:56:02 +00:00
Floke	afd82fcd05	bugfix	2025-04-17 10:55:04 +00:00
Floke	c3930c49f4	bugfix	2025-04-17 10:53:59 +00:00
Floke	114d3ee96f	bugfix	2025-04-17 10:52:30 +00:00
Floke	a9fd711a61	bugfix	2025-04-17 10:28:13 +00:00
Floke	91906f7340	bugfix	2025-04-17 10:23:09 +00:00
Floke	e8104d9920	bugfix	2025-04-17 10:05:07 +00:00
Floke	6d764458d4	bugfix	2025-04-17 09:57:44 +00:00
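
Commits 6cf9afca87 and c123d235ff mention applying a `@retry_on_failure` decorator to network-dependent scraper methods, but the log does not show the decorator itself. The following is a minimal sketch of what such a decorator could look like. The parameters `max_attempts`, `delay_seconds`, and `exceptions`, the fixed delay between attempts, and the parenthesized call form are assumptions for illustration, not the repository's implementation. `flaky_fetch` is a made-up stand-in for a network call.

```python
import functools
import logging
import time

logger = logging.getLogger(__name__)


def retry_on_failure(max_attempts=3, delay_seconds=0.0, exceptions=(Exception,)):
    """Hypothetical retry decorator; names and defaults are assumptions."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions as exc:
                    logger.warning(
                        "Attempt %d/%d of %s failed: %s",
                        attempt, max_attempts, func.__name__, exc,
                    )
                    # Re-raise once the last allowed attempt has failed.
                    if attempt == max_attempts:
                        raise
                    time.sleep(delay_seconds)
        return wrapper
    return decorator


# Stand-in for a network call that fails twice and then succeeds.
calls = {"count": 0}


@retry_on_failure(max_attempts=3, delay_seconds=0.0, exceptions=(ConnectionError,))
def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network error")
    return "ok"


result = flaky_fetch()
```

Restricting `exceptions` to the errors that are actually transient keeps programming errors from being retried silently.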

332 Commits
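
Commit 5c00505dff describes imputing missing numeric values with the median, fitted on the training split and then applied to both splits. The idea can be sketched without scikit-learn. `fit_median` and `impute` are hypothetical helper names, and the sample values are made up for illustration; the commit itself does not show this code.

```python
import statistics


def fit_median(train_values):
    """Compute the median from observed (non-missing) training values only."""
    observed = [v for v in train_values if v is not None]
    return statistics.median(observed)


def impute(values, fill_value):
    """Replace missing entries (None) with a previously fitted fill value."""
    return [fill_value if v is None else v for v in values]


# Made-up sample data: None marks a missing value.
train = [10.0, None, 30.0, 50.0]
test = [None, 1000.0]

# Fit on the training split only, so the test split does not leak into the statistic.
median = fit_median(train)

train_filled = impute(train, median)
test_filled = impute(test, median)
```

Because the median is computed from `train` alone, the large value in `test` has no influence on the fill value used for either split.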