This version introduces significant structural changes that improve maintainability and user flexibility: the core processing logic is centralized in the `DataProcessor` class, and a new menu-driven user interface provides granular control over processing steps and row selection.
- Increment version number to v1.7.0.
- Major Structural Refactoring:
  - DataProcessor Centralization: Move the core processing logic for sequential runs, re-evaluation, batch modes, and specific data lookups/updates into the `DataProcessor` class as methods.
  - Resolve AttributeErrors: Correct the indentation for all methods belonging to the `DataProcessor` class to ensure they are correctly defined within the class scope.
  - Fix DataProcessor Initialization: Update the `DataProcessor.__init__` signature and implementation to accept and store required handler instances (e.g., `GoogleSheetHandler`, `WikipediaScraper`).
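The refactored constructor can be sketched as follows. The exact parameter names are assumptions; the pattern the change describes is injecting the handler instances so methods reach them via `self.*` instead of globals:

```python
import logging

class DataProcessor:
    """Central owner of the row-processing logic (sketch; the real class
    carries many more methods, moved in from module scope)."""

    def __init__(self, config, sheet_handler, wiki_scraper):
        # Store the injected handler instances (e.g. a GoogleSheetHandler
        # and a WikipediaScraper) so every method can use them via self.*.
        self.config = config
        self.sheet_handler = sheet_handler
        self.wiki_scraper = wiki_scraper
        self.logger = logging.getLogger(self.__class__.__name__)
```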
- New User Interface:
  - Menu-Driven Dispatcher: Implement a new `run_user_interface` function to replace the old `main` logic block. This function provides an interactive, multi-level numeric menu for selecting processing modes and parameters. It can also process direct CLI arguments.
  - Simplified Main: The `main` function is reduced to handling initial setup (Config, Logging, Handlers, DataProcessor instantiation) and then calling `run_user_interface`.
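A minimal sketch of the dispatcher shape described above. The `dispatch` and `input_fn` parameters are assumptions for illustration; in the real script the callables would be bound `DataProcessor` methods and the menu has multiple levels:

```python
def run_user_interface(dispatch, argv=None, input_fn=input):
    """Route to a processing mode from direct CLI args or a numeric menu.

    `dispatch` maps mode names to zero-argument callables (hypothetical
    stand-ins for bound DataProcessor methods).
    """
    # Direct CLI invocation, e.g. argv == ["--mode", "sequential"]
    if argv and "--mode" in argv:
        return dispatch[argv[argv.index("--mode") + 1]]()

    # Interactive fallback: a flat numeric menu over the available modes
    options = dict(enumerate(sorted(dispatch), start=1))
    for number, name in options.items():
        print(f"{number}) {name}")
    choice = int(input_fn("Select mode: "))
    return dispatch[options[choice]]()
```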
- Granular Processing Control:
  - Step Selection: Implement the ability for users to select specific processing steps (grouped logically, e.g., 'website', 'wiki', 'chatgpt') for execution within sequential, re-evaluation, and criteria-based modes.
  - Flags for Steps: Adapt the `_process_single_row` method and the methods that call it (`process_reevaluation_rows`, `process_sequential`, `process_rows_matching_criteria`) to accept and utilize flags (e.g., `process_wiki`, `process_chatgpt`) to control which processing blocks are attempted for a given row.
  - Refined Step Logic: Ensure processing blocks within `_process_single_row` correctly check their corresponding step flag *and* the necessary timestamp/status conditions (unless `force_reeval` is active).
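The flag-and-timestamp gating above can be sketched as a plain function (the real code is the `_process_single_row` method; the field names and defaults here are assumptions):

```python
def process_single_row(row, process_website=True, process_wiki=True,
                       process_chatgpt=True, force_reeval=False):
    """Run only the requested step groups on one row (sketch).

    Each block checks its own flag first, then the timestamp condition;
    force_reeval overrides the timestamp check but not the flag.
    """
    executed = []
    if process_website and (force_reeval or not row.get("website_ts")):
        executed.append("website")      # website lookup/details block
    if process_wiki and (force_reeval or not row.get("wiki_ts")):
        executed.append("wiki")         # Wikipedia extraction block
    if process_chatgpt and (force_reeval or not row.get("chatgpt_ts")):
        executed.append("chatgpt")      # ChatGPT enrichment block
    return executed
```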
- New Processing Modes:
  - Criteria Mode: Implement the `process_rows_matching_criteria` method and its UI integration, allowing users to select a predefined criterion function (e.g., 'M filled and AN empty') to filter rows for processing.
  - Wiki Re-Extraction (Criteria-based): Integrate the logic for processing rows where Wiki URL (M) is filled and Wiki Timestamp (AN) is empty, likely as a specific option within the new Criteria mode.
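A minimal sketch of the criteria mode, assuming rows are dicts keyed by column letter; the predicate mirrors the 'M filled and AN empty' example, and returning indices instead of processing in place is an illustration choice:

```python
def m_filled_and_an_empty(row):
    """Example criterion: Wiki URL (M) filled, Wiki Timestamp (AN) empty."""
    m = (row.get("M") or "").strip()
    return bool(m) and m.lower() != "k.a." and not (row.get("AN") or "").strip()

def process_rows_matching_criteria(rows, criterion):
    """Return the indices of rows the selected criterion picks out (sketch;
    the real method would run the processing steps on each match)."""
    return [i for i, row in enumerate(rows) if criterion(row)]
```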
- Fixes and Improvements:
  - SyntaxError Resolution: Resolve persistent `SyntaxError`s related to complex f-string formatting in logging calls by constructing message parts separately.
  - `find_wiki_serp` Filter Logic: Ensure the `process_find_wiki_serp` method correctly uses the `get_numeric_filter_value` helper to apply the Umsatz OR Mitarbeiter threshold filter logic based on the correct data units.
  - Timestamp/Status Logic: Consolidate and clarify the logic for checking process necessity based on timestamps, status flags (like S='X'), and the `force_reeval` parameter in helper methods like `_is_step_processing_needed`.
  - ML Integration: Ensure `prepare_data_for_modeling` and `train_technician_model` are correctly integrated as `DataProcessor` methods and function within the new structure.
  - Consistency: Address inconsistencies in timestamp setting (e.g., ensuring AP is set by batch modes) and parameter handling across different methods where identified during the refactoring.
  - Helper Functions: Define or confirm the global scope of necessary helper functions (`get_numeric_filter_value`, criteria functions, `_process_batch`, etc.).
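The consolidated necessity check could look like the following free function; the precedence order assumed here (force_reeval wins, then the S='X' skip status, then the timestamp) is an interpretation of the rules listed above, and the real helper is `_is_step_processing_needed`:

```python
def is_step_processing_needed(timestamp, status, force_reeval=False):
    """Decide whether a processing step should run for a row (sketch)."""
    if force_reeval:
        return True            # explicit re-evaluation overrides everything
    if status == "X":
        return False           # status flag (e.g. column S) marks the row as done/skipped
    return not timestamp       # otherwise run only if the step never completed
```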
This version marks a significant milestone in making the script more modular, maintainable, and user-controllable, laying the groundwork for further enhancements like the ML estimation mode.
- Increment version number to v1.6.7.
- Fix critical AttributeError: Correct the indentation of several processing methods (`_process_single_row`, `process_reevaluation_rows`, `process_serp_website_lookup_for_empty`, `process_website_details_for_marked_rows`, `prepare_data_for_modeling`, `process_rows_sequentially`, `process_find_wiki_with_serp`) so they are correctly defined as methods within the `DataProcessor` class.
- Fix SyntaxError: Resolve the issue with complex f-strings in `_process_single_row` (and potentially elsewhere) by constructing expression strings separately from the f-string syntax.
- Adjust filter logic for mode 'find_wiki_serp': The SerpAPI search for missing Wiki URLs (M = 'k.A.'/empty) is now triggered when CRM Umsatz (J) > 200 million OR CRM employee count (K) > 500. Implement robust numeric extraction for J and K within the filter logic.
- Ensure the SerpAPI Wiki Search Timestamp (AY) is always set after a search attempt in 'find_wiki_serp' mode, regardless of the result.
- Various logging adjustments for clarity and debugging (e.g., in the Wikipedia processing step).
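The robust numeric extraction for the J/K threshold filter could look like this; the German number formats and unit suffixes handled here ('Mio', 'Mrd') are assumptions about the CRM data, and `matches_find_wiki_serp_filter` is a hypothetical name for the combined check:

```python
import re

def get_numeric_filter_value(raw):
    """Extract a plain number from a CRM cell like '250 Mio. EUR' (sketch).

    Assumed conventions: German thousands dots are dropped, decimal
    commas become dots, 'Mio' scales by 1e6 and 'Mrd' by 1e9.
    """
    if not raw:
        return None
    text = str(raw)
    match = re.search(r"\d+(?:\.\d{3})*(?:,\d+)?", text)
    if not match:
        return None
    number = float(match.group(0).replace(".", "").replace(",", "."))
    lowered = text.lower()
    if "mrd" in lowered:
        number *= 1e9
    elif "mio" in lowered:
        number *= 1e6
    return number

def matches_find_wiki_serp_filter(umsatz_raw, mitarbeiter_raw):
    """CRM Umsatz (J) > 200 million OR CRM employee count (K) > 500."""
    umsatz = get_numeric_filter_value(umsatz_raw)
    mitarbeiter = get_numeric_filter_value(mitarbeiter_raw)
    return (umsatz is not None and umsatz > 200e6) or \
           (mitarbeiter is not None and mitarbeiter > 500)
```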
- Add new operating mode `--mode find_wiki_serp`.
- Implement new function `serp_wikipedia_lookup`, which uses SerpAPI to search specifically for Wikipedia articles matching a company name.
- Implement new function `process_find_wiki_with_serp`:
  - Loads the current sheet data.
  - Filters rows where column M (Wiki URL) is empty/'k.A.' AND column K (CRM employees) exceeds a threshold (default: 500).
  - Calls `serp_wikipedia_lookup` for the filtered rows.
  - On a successful URL find:
    - Writes the found URL to column M.
    - Sets flag 'x' in column A (ReEval flag).
    - Clears the timestamps in columns AN (Wikipedia timestamp) and AO (last-check timestamp).
  - Performs bundled sheet updates at the end.
- Integrate the new mode `find_wiki_serp` into the argument handling and execution logic of the `main` function.
- Add the necessary imports and ensure the new functions use logging.
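The `process_find_wiki_with_serp` flow can be sketched as follows. Rows as dicts keyed by column letter and the returned update list are illustration assumptions; `lookup` stands in for `serp_wikipedia_lookup`, and the real function performs the bundled sheet write itself:

```python
def process_find_wiki_with_serp(rows, lookup, employee_threshold=500):
    """Sketch of the find_wiki_serp flow over already-loaded sheet rows.

    Returns the cell updates instead of writing them, so the write can
    stay bundled in a single sheet update at the end.
    """
    updates = []
    for idx, row in enumerate(rows):
        wiki_url = (row.get("M") or "").strip()
        if wiki_url and wiki_url.lower() != "k.a.":
            continue                      # column M already filled
        try:
            employees = float(str(row.get("K", "0")).replace(".", "").replace(",", "."))
        except ValueError:
            continue                      # unparseable employee count
        if employees <= employee_threshold:
            continue                      # below the CRM employee threshold
        found = lookup(row.get("name", ""))
        if found:
            updates += [(idx, "M", found),   # write the discovered URL
                        (idx, "A", "x"),     # set the ReEval flag
                        (idx, "AN", ""),     # clear Wikipedia timestamp
                        (idx, "AO", "")]     # clear last-check timestamp
    return updates
```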
- Aktualisiere Versionsnummer in `Config.VERSION` auf v1.6.6.
- Replace custom `debug_print` function with standard Python `logging` module calls throughout the codebase.
- Use appropriate logging levels (DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION).
- Refactor logging setup in `main` for clarity and proper handler initialization.
- Integrate updated `WikipediaScraper` class (previously developed as v1.6.5 logic):
  - Implement more robust infobox parsing (`_extract_infobox_value`) using flexible selectors, keyword checking (`in`), and improved value cleaning (incl. `sup` removal).
  - Remove old infobox fallback functions.
  - Enhance article validation (`_validate_article`) with better link checking via `_get_page_soup`.
  - Improve reliability of article search (`search_company_article`) with direct match attempt and better error handling.
  - Apply `@retry_on_failure` decorator to network-dependent scraper methods (`_get_page_soup`, `search_company_article`).
- Ensure `Config.VERSION` reflects the logical state (v1.6.5 for this commit).
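The infobox-parsing improvements (keyword matching via `in`, `sup` removal, value cleaning) can be illustrated with a stdlib-only sketch; the real `_extract_infobox_value` operates on BeautifulSoup nodes, so these helper names and the regex-based approach are hypothetical:

```python
import re

def clean_infobox_value(cell_html):
    """Sketch of the value cleaning: drop <sup> footnote markers, strip
    the remaining tags, and collapse whitespace."""
    without_sup = re.sub(r"<sup\b[^>]*>.*?</sup>", "", cell_html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", without_sup)
    return re.sub(r"\s+", " ", text).strip()

def find_infobox_row(label_to_value, keywords):
    """Keyword matching via `in`, so a keyword like 'Mitarbeiter' also
    matches an infobox label such as 'Mitarbeiterzahl'."""
    for label, value in label_to_value.items():
        if any(keyword.lower() in label.lower() for keyword in keywords):
            return value
    return None
```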