Commit Graph

364 Commits

Author SHA1 Message Date
55e712733b bugfix 2025-04-24 14:42:12 +00:00
cf2e611212 bugfix 2025-04-24 14:41:06 +00:00
8557144b1e bugfix 2025-04-24 14:39:50 +00:00
8f3cb16db4 bugfix 2025-04-24 06:33:57 +00:00
37100fe8a8 bugfix 2025-04-24 06:32:25 +00:00
ea015edb57 bugfix 2025-04-24 06:12:04 +00:00
43476983a4 bugfix 2025-04-24 05:59:51 +00:00
76f1c72cc3 v1.7.0: Major refactoring for flexible processing modes and UI.
This version introduces significant structural changes to improve code maintainability and user flexibility by centralizing processing logic within the DataProcessor class and implementing a new menu-driven user interface with granular control over processing steps and row selection.

- Increment version number to v1.7.0.
- Major Structural Refactoring:
    - DataProcessor Centralization: Move the core processing logic for sequential runs, re-evaluation, batch modes, and specific data lookups/updates into the `DataProcessor` class as methods.
    - Resolve AttributeErrors: Correct the indentation for all methods belonging to the `DataProcessor` class to ensure they are correctly defined within the class scope.
    - Fix DataProcessor Initialization: Update the `DataProcessor.__init__` signature and implementation to accept and store required handler instances (e.g., `GoogleSheetHandler`, `WikipediaScraper`).
- New User Interface:
    - Menu-Driven Dispatcher: Implement a new `run_user_interface` function to replace the old `main` logic block. This function provides an interactive, multi-level numeric menu for selecting processing modes and parameters. It can also process direct CLI arguments.
    - Simplified Main: The `main` function is reduced to handling initial setup (Config, Logging, Handlers, DataProcessor instantiation) and then calling `run_user_interface`.
- Granular Processing Control:
    - Step Selection: Implement the ability for users to select specific processing steps (grouped logically, e.g., 'website', 'wiki', 'chatgpt') for execution within sequential, re-evaluation, and criteria-based modes.
    - Flags for Steps: Adapt the `_process_single_row` method and the methods that call it (`process_reevaluation_rows`, `process_sequential`, `process_rows_matching_criteria`) to accept and utilize flags (e.g., `process_wiki`, `process_chatgpt`) to control which processing blocks are attempted for a given row.
    - Refined Step Logic: Ensure processing blocks within `_process_single_row` correctly check their corresponding step flag *and* the necessary timestamp/status conditions (unless `force_reeval` is active).
- New Processing Modes:
    - Criteria Mode: Implement the `process_rows_matching_criteria` method and its UI integration, allowing users to select a predefined criterion function (e.g., 'M filled and AN empty') to filter rows for processing.
    - Wiki Re-Extraction (Criteria-based): Integrate the logic for processing rows where Wiki URL (M) is filled and Wiki Timestamp (AN) is empty, likely as a specific option within the new Criteria mode.
- Fixes and Improvements:
    - SyntaxError Resolution: Resolve persistent `SyntaxError`s related to complex f-string formatting in logging calls by constructing message parts separately.
    - `find_wiki_serp` Filter Logic: Ensure the `process_find_wiki_serp` method correctly uses the `get_numeric_filter_value` helper to apply the Umsatz OR Mitarbeiter threshold filter logic based on the correct data units.
    - Timestamp/Status Logic: Consolidate and clarify the logic for checking process necessity based on timestamps, status flags (like S='X'), and the `force_reeval` parameter in helper methods like `_is_step_processing_needed`.
    - ML Integration: Ensure `prepare_data_for_modeling` and `train_technician_model` are correctly integrated as `DataProcessor` methods and function within the new structure.
    - Consistency: Address inconsistencies in timestamp setting (e.g., ensuring AP is set by batch modes) and parameter handling across different methods where identified during the refactoring.
    - Helper Functions: Define or confirm the global scope of necessary helper functions (`get_numeric_filter_value`, criteria functions, `_process_batch`, etc.).

This version marks a significant milestone in making the script more modular, maintainable, and user-controllable, laying the groundwork for further enhancements like the ML estimation mode.
2025-04-24 05:04:51 +00:00
d190188d04 bugfix 2025-04-23 12:47:54 +00:00
0c799ded26 bugfix 2025-04-23 12:27:01 +00:00
99afe4681d bugfix 2025-04-23 05:36:57 +00:00
bf5db76232 v1.6.7: Fix structural/syntax errors; adjust filter for Wiki search via SerpAPI
- Increment version number to v1.6.7.
- Fix critical AttributeError: Correct the indentation for several processing methods (_process_single_row, process_reevaluation_rows, process_serp_website_lookup_for_empty, process_website_details_for_marked_rows, prepare_data_for_modeling, process_rows_sequentially, process_find_wiki_with_serp) so that they are properly defined as methods within the DataProcessor class.
- Fix SyntaxError: Resolve the issue with complex f-strings in _process_single_row (and potentially elsewhere) by constructing expressions separately from the f-string syntax.
- Adjust filter logic for mode 'find_wiki_serp': The SerpAPI search for missing Wiki URLs (M=k.A./empty) is now triggered when (CRM Umsatz (J) > 200 million OR CRM employee count (K) > 500). Implement robust numeric extraction for J and K within the filter logic.
- Ensure that the SerpAPI Wiki Search Timestamp (AY) is always set after a search attempt in mode 'find_wiki_serp', regardless of the result.
- Various logging adjustments for clarity and debugging (e.g., in the Wiki processing step).
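The "robust numeric extraction" for columns J and K might look like the sketch below. The helper name matches `get_numeric_filter_value` from the commit messages, but the cell formats ('250 Mio. EUR', German thousands separators, 'k.A.') and thresholds are assumptions.

```python
import re

def get_numeric_filter_value(raw):
    """Pull a float out of a German-formatted sheet cell such as
    '250 Mio. EUR' or '1.200'; return None for 'k.A.' or empty cells."""
    if not raw or raw.strip().lower() in ("k.a.", "n/a", "-"):
        return None
    match = re.search(r"[\d.,]+", raw)
    if not match:
        return None
    # German notation: '.' is the thousands separator, ',' the decimal point.
    number = float(match.group(0).replace(".", "").replace(",", "."))
    if re.search(r"\bMio\b", raw, re.IGNORECASE):
        number *= 1_000_000
    elif re.search(r"\bMrd\b", raw, re.IGNORECASE):
        number *= 1_000_000_000
    return number

def needs_wiki_serp_lookup(umsatz_raw, mitarbeiter_raw):
    """The OR filter from the commit: Umsatz > 200 million OR employees > 500."""
    umsatz = get_numeric_filter_value(umsatz_raw)
    mitarbeiter = get_numeric_filter_value(mitarbeiter_raw)
    return (umsatz is not None and umsatz > 200_000_000) or \
           (mitarbeiter is not None and mitarbeiter > 500)
```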
2025-04-23 05:18:30 +00:00
808a015904 bugfix 2025-04-22 14:17:22 +00:00
27bdabc0b8 bugfix 2025-04-22 14:10:50 +00:00
66ad46665c bugfix 2025-04-22 14:03:42 +00:00
dcf33501e9 bugfix 2025-04-22 13:58:36 +00:00
21e8fa21d8 bugfix 2025-04-22 12:42:49 +00:00
e5741a23c9 bugfix 2025-04-22 12:29:03 +00:00
564aef9d20 bugfix 2025-04-22 12:21:33 +00:00
cca2191e2f bugfix 2025-04-22 11:18:10 +00:00
8284b5fa9f bugfix 2025-04-22 09:54:08 +00:00
97eee67323 bugfix 2025-04-22 08:23:32 +00:00
e2fc6b9139 bugfix 2025-04-22 06:43:59 +00:00
36f1f8f10f bugfix 2025-04-22 06:31:38 +00:00
5819e54bbd bugfix 2025-04-22 06:17:23 +00:00
69302c17dd bugfix 2025-04-22 06:13:52 +00:00
6fc67e4bab bugfix 2025-04-22 06:12:55 +00:00
38eb9cbce9 bugfix 2025-04-22 05:38:11 +00:00
849840054c v1.6.6: Add SerpAPI search for missing Wiki URLs of large companies
- Add new operating mode `--mode find_wiki_serp`.
- Implement new function `serp_wikipedia_lookup`, which uses SerpAPI to search specifically for Wikipedia articles matching a company name.
- Implement new function `process_find_wiki_with_serp`:
    - Loads the current sheet data.
    - Filters rows where column M (Wiki URL) is empty/'k.A.' AND column K (CRM employees) exceeds a threshold (default: 500).
    - Calls `serp_wikipedia_lookup` for the filtered rows.
    - When a URL is found:
        - Writes the found URL to column M.
        - Sets flag 'x' in column A (ReEval flag).
        - Clears the timestamps in columns AN (Wikipedia timestamp) and AO (last-check timestamp).
    - Performs batched sheet updates at the end.
- Integrate the new mode `find_wiki_serp` into the argument handling and execution logic of the `main` function.
- Add the necessary imports and ensure that the new functions use logging.
- Update version number in `Config.VERSION` to v1.6.6.
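The filter and the post-lookup updates described above can be sketched like this. The column letters come from the commit message; representing rows as dicts keyed by column letter is an assumption of this sketch, as are the function names.

```python
def wants_serp_wiki_lookup(row, employee_threshold=500):
    """Filter from the commit: column M (Wiki URL) empty or 'k.A.'
    AND column K (CRM employees) above the threshold."""
    wiki_url = (row.get("M") or "").strip()
    if wiki_url and wiki_url.lower() != "k.a.":
        return False
    try:
        # German thousands separator: '1.200' means 1200.
        employees = float(str(row.get("K", "")).replace(".", "").replace(",", "."))
    except ValueError:
        return False
    return employees > employee_threshold

def apply_found_url(row, url):
    """Updates described above: write the URL to M, flag A for
    re-evaluation, clear the AN/AO timestamps for reprocessing."""
    row.update({"M": url, "A": "x", "AN": "", "AO": ""})
    return row
```

Collecting the `apply_found_url` results and writing them in one batched call at the end matches the "batched sheet updates" point.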
2025-04-22 05:19:53 +00:00
166c87a451 bugfix 2025-04-21 12:39:07 +00:00
6015046c51 bugfix 2025-04-20 17:15:26 +00:00
ced9c3f723 bugfix 2025-04-19 17:49:33 +00:00
744a7e6cd7 bugfix 2025-04-19 17:44:51 +00:00
b640998eff bugfix 2025-04-19 17:36:41 +00:00
a69a3594ab bugfix 2025-04-19 17:23:36 +00:00
6cf9afca87 v1.6.5: Refactor logging & integrate improved WikipediaScraper
- Replace custom `debug_print` function with standard Python `logging` module calls throughout the codebase.
    - Use appropriate logging levels (DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION).
    - Refactor logging setup in `main` for clarity and proper handler initialization.
- Integrate updated `WikipediaScraper` class (previously developed as v1.6.5 logic):
    - Implement more robust infobox parsing (`_extract_infobox_value`) using flexible selectors, keyword checking (`in`), and improved value cleaning (incl. `sup` removal).
    - Remove old infobox fallback functions.
    - Enhance article validation (`_validate_article`) with better link checking via `_get_page_soup`.
    - Improve reliability of article search (`search_company_article`) with direct match attempt and better error handling.
    - Apply `@retry_on_failure` decorator to network-dependent scraper methods (`_get_page_soup`, `search_company_article`).
- Ensure `Config.VERSION` reflects the logical state (v1.6.5 for this commit).
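A `@retry_on_failure` decorator of the kind applied to `_get_page_soup` and `search_company_article` might look like this minimal sketch; the parameter names and the fixed-delay strategy are assumptions, not the project's actual implementation.

```python
import functools
import time

def retry_on_failure(max_attempts=3, delay=1.0, exceptions=(Exception,)):
    """Retry a network-dependent call up to max_attempts times with a
    fixed delay, re-raising the last error once attempts are exhausted."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator
```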
2025-04-19 16:53:35 +00:00
8c6866ca4b bugfix 2025-04-19 07:17:29 +00:00
4fdcdcf77a bugfix 2025-04-19 07:15:24 +00:00
c123d235ff v1.6.5: Refactor WikipediaScraper for more robust infobox extraction
- Rework WikipediaScraper._extract_infobox_value:
    - Uses the more flexible CSS selector ('table[class*="infobox"]') to find the infobox.
    - Iterates over table rows (tr) instead of only th elements.
    - Checks whether keywords are contained *in* the normalized th text (instead of requiring an exact match).
    - Removes <sup> tags before extracting text from td cells.
    - Uses get_text(separator=' ') for better handling of <br>.
    - Extends the keywords_map for industry, revenue, and employees.
    - Adds detailed debug logging for the extraction process.
- Remove the old fallback functions _extract_full_infobox_text and _parse_infobox_text_fallback.
- Adapt WikipediaScraper.extract_company_data:
    - Calls _get_page_soup only once.
    - Uses the new _extract_infobox_value method.
- Improve WikipediaScraper._validate_article:
    - Uses _get_page_soup for more reliable link checking.
    - Checks links in the infobox as well as external links.
    - Uses simple_normalize_url for URL comparisons.
    - Adjusts the similarity threshold when a domain match succeeds.
- Improve WikipediaScraper.search_company_article:
    - Tries a direct match first.
    - Checks the first option on disambiguation pages where applicable.
    - Handles errors (PageError, DisambiguationError, RequestException) more robustly in the search loop.
- Improve WikipediaScraper._get_page_soup:
    - Adds a timeout, raise_for_status, and explicit UTF-8 encoding.
    - Applies the @retry_on_failure decorator (assuming the decorator exists).
- Also apply @retry_on_failure to search_company_article.
- Update version number in Config and comments to v1.6.5.
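The row-iteration approach for `_extract_infobox_value` can be sketched as below, using BeautifulSoup as the scraper does. The keyword matching, cleaning, and return shape are simplified assumptions; the real method additionally uses a keywords_map and debug logging.

```python
from bs4 import BeautifulSoup

def extract_infobox_value(html, keywords):
    """Find the infobox via the flexible class selector, walk its <tr>
    rows, match keywords as substrings of the normalized <th> text,
    and return the cleaned <td> text (with <sup> footnotes removed)."""
    soup = BeautifulSoup(html, "html.parser")
    infobox = soup.select_one('table[class*="infobox"]')
    if infobox is None:
        return None
    for row in infobox.find_all("tr"):
        th, td = row.find("th"), row.find("td")
        if th is None or td is None:
            continue
        header = " ".join(th.get_text(" ").split()).lower()
        if any(kw.lower() in header for kw in keywords):
            for sup in td.find_all("sup"):  # drop footnote markers like [1]
                sup.decompose()
            return " ".join(td.get_text(" ").split())
    return None
```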
2025-04-18 19:02:14 +00:00
3631704f03 refactor: v1.6.5 Minor code improvements and consistency
- Add HTML logging to _extract_infobox_value for debugging
- Implement _extract_infobox_value_fallback using regex
- Call fallback in extract_company_data if primary fails
- Add minor logging to _extract_first_paragraph_from_soup
- Adjust extract_numeric_value for robustness
- Add force_process flag to process_branch_batch for combined mode
- Correct indentation in alignment_demo inner function colnum_string
- Refine data preparation logic in DataProcessor.prepare_data_for_modeling
- Add Config.HEADER_ROWS constant
- Increment version to 1.6.5
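A regex-based `_extract_infobox_value_fallback` as mentioned above might look like the following sketch: when structured parsing fails, search the raw HTML for a `<th>` containing the label followed by its `<td>`, then strip the remaining tags. The exact pattern and cleaning steps are assumptions.

```python
import re

def extract_infobox_value_fallback(html, label):
    """Regex fallback for infobox extraction on raw HTML."""
    pattern = re.compile(
        r"<th[^>]*>[^<]*" + re.escape(label) + r"[^<]*</th>\s*<td[^>]*>(.*?)</td>",
        re.IGNORECASE | re.DOTALL,
    )
    match = pattern.search(html)
    if not match:
        return None
    # Drop footnote markers, then any residual tags, then normalize whitespace.
    value = re.sub(r"<sup[^>]*>.*?</sup>", "", match.group(1), flags=re.DOTALL)
    value = re.sub(r"<[^>]+>", " ", value)
    return " ".join(value.split()) or None
```

In `extract_company_data` this would only be called when the primary BeautifulSoup-based method returns nothing.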
2025-04-18 18:14:12 +00:00
e7c2d7c612 bugfix 2025-04-18 16:53:40 +00:00
96013a7e3d bugfix 2025-04-18 16:45:37 +00:00
08b6f9248e v1.6.5 Improve WikipediaScraper infobox extraction
- Add HTML logging to _extract_infobox_value for debugging
- Implement _extract_infobox_value_fallback using regex
- Call fallback in extract_company_data if primary fails
- Add minor logging to _extract_first_paragraph_from_soup
- Adjust extract_numeric_value for robustness
- Increment version to 1.6.5
2025-04-18 16:44:20 +00:00
3d306d48d7 bugfix 2025-04-18 14:20:36 +00:00
567c8e4178 bugfix 2025-04-18 14:08:09 +00:00
6e7a6bd949 bugfix 2025-04-18 10:57:37 +00:00
f3f55cd2e5 bugfix 2025-04-18 09:53:40 +00:00
f348171aa2 bugfix 2025-04-18 09:49:46 +00:00
f10778583f bugfix 2025-04-18 06:44:42 +00:00
b76f5fe991 bugfix 2025-04-18 06:40:17 +00:00