Brancheneinstufung2

Go to file

Key Improvements

Better HTML Parsing: I've replaced the XPath-based extraction with BeautifulSoup, which is more robust for parsing HTML content.
Improved Infobox Detection: The code now properly identifies and extracts data from Wikipedia infoboxes using a more flexible approach:

It looks for various synonyms of "Branche" and "Umsatz" in the header text
It handles different formats of these values within the infobox


Text Cleaning: Added a clean_text() function to:

Remove HTML tags and entities
Strip out references (text in square brackets)
Remove parenthetical text that might contain irrelevant information
Handle whitespace issues


Better Error Handling: The code now includes more robust error handling:

Multiple retries for Wikipedia data fetching
Proper exception handling with informative error messages
Fallback to existing values if new data can't be obtained


Domain Filtering: Improved the domain key extraction to ignore common subdomains like "www", "de", or "com".
Data Preservation: The code now preserves existing data in the sheet when new data can't be found, rather than overwriting with "k.A."
Better Logging: Added more detailed logging to help with debugging and tracking the progress of the script.

This improved version should more reliably extract industry and revenue information from Wikipedia articles and update your Google Sheet accordingly.

2025-03-31 09:55:56 +00:00

@eaDir

Erste Version

2025-03-29 18:47:15 +01:00

Bestandsfirmen.xlsx

Erste Version

2025-03-29 18:47:15 +01:00

brancheneinstufung - Kopie.py

Erste Version

2025-03-29 18:47:15 +01:00

brancheneinstufung.py

Claude V 1.0

2025-03-31 09:55:56 +00:00

service_account.json

Erste Version

2025-03-29 18:47:15 +01:00

update.log

Erste Version

2025-03-29 18:47:15 +01:00

Languages

Python 61.8%

TypeScript 20.2%

JavaScript 14.5%

HTML 2.5%

Dockerfile 0.4%

Other 0.6%