3d30565c9746cac51407565879cb5903f02c6a89
Key Improvements

- Better HTML Parsing: I've replaced the XPath-based extraction with BeautifulSoup, which is more robust for parsing HTML content.
- Improved Infobox Detection: The code now identifies and extracts data from Wikipedia infoboxes using a more flexible approach:
  - It looks for various synonyms of "Branche" and "Umsatz" in the header text.
  - It handles different formats of these values within the infobox.
- Text Cleaning: Added a clean_text() function to:
  - Remove HTML tags and entities.
  - Strip out references (text in square brackets).
  - Remove parenthetical text that might contain irrelevant information.
  - Handle whitespace issues.
- Better Error Handling: The code now includes more robust error handling:
  - Multiple retries for Wikipedia data fetching.
  - Proper exception handling with informative error messages.
  - Fallback to existing values if new data can't be obtained.
- Domain Filtering: Improved the domain key extraction to ignore common subdomains like "www", "de", or "com".
- Data Preservation: The code now preserves existing data in the sheet when new data can't be found, rather than overwriting it with "k.A."
- Better Logging: Added more detailed logging to help with debugging and tracking the progress of the script.

This improved version should more reliably extract industry and revenue information from Wikipedia articles and update your Google Sheet accordingly.
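The text cleaning, domain filtering, and data preservation steps described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from this commit: only the name clean_text() comes from the message, while domain_key(), keep_or_update(), IGNORED_LABELS, and the exact regular expressions are hypothetical choices made for the example.

```python
import html
import re
from urllib.parse import urlparse

# Assumption: the labels the commit message lists as "common subdomains".
IGNORED_LABELS = {"www", "de", "com"}


def clean_text(raw: str) -> str:
    """Reduce an infobox cell to plain text (sketch, not the commit's code)."""
    text = re.sub(r"<[^>]+>", " ", raw)      # drop HTML tags
    text = html.unescape(text)               # decode entities such as &nbsp;
    text = re.sub(r"\[[^\]]*\]", "", text)   # strip references like [3]
    text = re.sub(r"\([^)]*\)", "", text)    # strip parenthetical remarks
    text = re.sub(r"\s+", " ", text)         # collapse all whitespace runs
    return text.strip()


def domain_key(url: str) -> str:
    """Return the first host label that is not in IGNORED_LABELS."""
    # urlparse only fills in the host when a scheme or "//" prefix is present.
    parsed = urlparse(url if "://" in url else "//" + url)
    host = (parsed.hostname or "").lower()
    labels = [part for part in host.split(".") if part and part not in IGNORED_LABELS]
    return labels[0] if labels else ""


def keep_or_update(new_value: str, existing_value: str) -> str:
    """Prefer fresh data, then the value already in the sheet, then "k.A."."""
    if new_value and new_value != "k.A.":
        return new_value
    return existing_value or "k.A."


if __name__ == "__main__":
    print(clean_text("<td>12,5&nbsp;Mrd. Euro (2022)<sup>[3]</sup></td>"))
    print(domain_key("https://www.siemens.com/de/"))
    print(keep_or_update("", "Maschinenbau"))
```

Stripping tags before decoding entities keeps an escaped "&lt;" in the cell text from being mistaken for markup. The fallback to "k.A." when neither a new nor an existing value is available is an assumption about the intended behaviour.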
Languages: Python 63.6%, TypeScript 19.2%, JavaScript 15.6%, HTML 0.7%, Dockerfile 0.4%, Other 0.5%