feat(company-explorer): Initial Web UI & Backend with Enrichment Flow

This commit introduces the foundational elements for the new "Company Explorer" web application, marking a significant step away from the legacy Google Sheets / CLI system.

Key changes include:
- Project Structure: A new  directory with separate  (FastAPI) and  (React/Vite) components.
- Data Persistence: Migration from Google Sheets to a local SQLite database () using SQLAlchemy.
- Core Utilities: Extraction and cleanup of essential helper functions (LLM wrappers, text utilities) into .
- Backend Services: , ,  for AI-powered analysis, and  logic.
- Frontend UI: Basic React application with company table, import wizard, and dynamic inspector sidebar.
- Docker Integration: Updated  and  for multi-stage builds and sideloading.
- Deployment & Access: Integrated into central Nginx proxy and dashboard, accessible via .

Lessons Learned & Fixed during development:
- Frontend Asset Loading: Addressed issues with Vite's  path and FastAPI's .
- TypeScript Configuration: Added  and .
- Database Schema Evolution: Solved  errors by forcing a new database file and correcting  override.
- Logging: Implemented robust file-based logging ().

This new foundation provides a powerful and maintainable platform for future B2B robotics lead generation.
2026-01-07 17:55:08 +00:00
parent bc1cff825a
commit 95634d7bb6
51 changed files with 3475 additions and 2 deletions

BIN
FRITZbox7530.pdf Normal file

Binary file not shown.

80
MIGRATION_PLAN.md Normal file

@@ -0,0 +1,80 @@
# Migration Plan: Legacy GSheets -> Company Explorer (Robotics Edition)
**Context:** Fresh start for the **Robotics & Facility Management** industry.
**Goal:** Replace Google Sheets/CLI with a web app ("Company Explorer") backed by SQLite.
## 1. Strategic Realignment
| Area | Old (Legacy) | New (Robotics Edition) |
| :--- | :--- | :--- |
| **Data basis** | Google Sheets | **SQLite** (local, fast, filterable). |
| **Target data** | General / customer service | **Robotics signals** (spa area? intralogistics? plant security?). |
| **Industries** | AI suggestion (free text) | **Strict Mode:** mapping to a fixed CRM list (e.g. "Hotellerie", "Maschinenbau"). |
| **Text generation** | Pain/Gain Matrix (Service) | **Pain/Gain Matrix (Robotics)**. "Translation" of the existing knowledge to robots. |
| **Analytics** | Technician ML model | **Disabled**. Not relevant for now. |
| **Operations** | D365 Sync (broken) | **Excel import & deduplication**. Focus on matching external lists against existing records. |
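
Strict Mode, as described in the table, can be pictured as a lookup against a fixed list. The following is a minimal sketch under assumptions: `ALLOWED_INDUSTRIES`, `to_crm_industry`, and the normalization rule are hypothetical names and choices, and the two list entries are only the examples quoted in the table.

```python
import re
from typing import Optional

# Hypothetical excerpt of the fixed CRM list; the real list is not part of this plan.
ALLOWED_INDUSTRIES = ["Hotellerie", "Maschinenbau"]


def _key(text: str) -> str:
    """Lowercase the text and drop everything except letters, digits, and umlauts."""
    return re.sub(r"[^a-z0-9äöüß]", "", text.lower())


# Normalized key -> canonical spelling from the allowed list.
_LOOKUP = {_key(name): name for name in ALLOWED_INDUSTRIES}


def to_crm_industry(ai_suggestion: str) -> Optional[str]:
    """Return the canonical CRM industry, or None if the suggestion is not on the list."""
    return _LOOKUP.get(_key(ai_suggestion))
```

Returning `None` instead of the free-text suggestion is the point of the sketch: anything off the list is rejected rather than stored.
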
## 2. Architecture & Component Mapping
The system is rebuilt from scratch in `company-explorer/`. We remove the dependencies on the root `helpers.py`.
### A. Core Backend (`backend/`)
| Component | Task & New Logic | Prio |
| :--- | :--- | :--- |
| **Database** | Replaces `GoogleSheetHandler`. Stores companies & "enrichment blobs". | 1 |
| **Importer** | Replaces `SyncManager`. Imports Excel dumps (CRM) and event lists. | 1 |
| **Deduplicator** | Replaces `company_deduplicator.py`. **Core feature:** checks event lists against the DB. Must match "intelligently" (name + city + web). | 1 |
| **Scraper (Base)** | Extracts text from websites. Basis for all analyses. | 1 |
| **Signal Detector** | **NEW.** Analyzes website text for robotics potential. <br> *Logic:* If industry = Hotel & keyword = "Wellness" -> potential: cleaning robot. | 1 |
| **Classifier** | Industry classification. **Strict Mode:** checks against `config/allowed_industries.json`. | 2 |
| **Marketing Engine** | Replaces `generate_marketing_text.py`. Uses the new `marketing_wissen_robotics.yaml`. | 3 |
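
The Signal Detector rule from the table (industry = Hotel and keyword = "Wellness" implies a cleaning robot) could look roughly like this. It is a sketch, not the committed implementation: `Rule`, `RULES`, and `detect_potentials` are hypothetical names, and plain substring matching is an assumption.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Rule:
    industry: str   # industry the rule applies to
    keyword: str    # keyword searched for in the website text
    potential: str  # robot category suggested when both match


# Hypothetical rule set; only this one rule is taken from the plan.
RULES = (
    Rule("Hotel", "wellness", "cleaning robot"),
)


def detect_potentials(industry: str, website_text: str,
                      rules: Sequence[Rule] = RULES) -> List[str]:
    """Return the potentials of all rules whose industry and keyword both match."""
    text = website_text.lower()
    return [
        rule.potential
        for rule in rules
        if rule.industry.lower() == industry.lower() and rule.keyword.lower() in text
    ]
```

Keeping the rules as data rather than as `if` chains would let new industries be added without touching the matching logic.
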
### B. Frontend (`frontend/`) - React
* **View 1: The "Explorer":** DataGrid of all companies. Filterable by "robotics potential" and status.
* **View 2: The "Inspector":** Detail view of a single company. Shows detected signals ("Has spa area"). Allows manual correction.
* **View 3: "List Matcher":** Upload an Excel list -> display of duplicates -> "Import new" button.
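
The duplicate check behind the List Matcher is described only as matching on name, city, and web. A minimal sketch of one possible rule follows; `is_duplicate` and its helpers are hypothetical, and the rule itself (same domain, or same normalized name in the same city) is an assumption.

```python
import re
from urllib.parse import urlparse


def _norm(text: str) -> str:
    """Lowercase and keep only ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", (text or "").lower())


def _domain(url: str) -> str:
    """Reduce a URL to its host, without scheme and without a leading 'www.'."""
    if not url:
        return ""
    host = urlparse(url if "//" in url else "//" + url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def is_duplicate(candidate: dict, existing: dict) -> bool:
    """Assumed rule: same domain, or same normalized name in the same city."""
    dom_a = _domain(candidate.get("website", ""))
    dom_b = _domain(existing.get("website", ""))
    if dom_a and dom_a == dom_b:
        return True
    name_a, name_b = _norm(candidate.get("name", "")), _norm(existing.get("name", ""))
    city_a, city_b = _norm(candidate.get("city", "")), _norm(existing.get("city", ""))
    return bool(name_a) and name_a == name_b and bool(city_a) and city_a == city_b
```

An exact comparison like this will miss spelling variants; a fuzzy score could replace the equality checks without changing the interface.
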
## 3. Handling Shared Code (`helpers.py` & Co.)
We fully encapsulate the new project ("Fork & Clean").
* **Source:** `helpers.py` (root)
* **Target:** `company-explorer/backend/lib/core_utils.py`
* **Action:** We copy only:
    * OpenAI/Gemini wrapper (retry logic).
    * Text cleaning (`clean_text`, `normalize_string`).
    * URL normalization.
* **Source:** Other Gemini apps (`duckdns`, `gtm-architect`, `market-intel`)
* **Action:** We treat these as reference. Useful logic (e.g. the "Grit" prompts from `market-intel`) is explicitly copied into the new service modules.
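
The retry logic mentioned for the LLM wrappers is not shown in this plan. As an illustration only, a generic decorator with exponential backoff might look like this; `retry` and its parameters are hypothetical and do not reproduce the code in `helpers.py`.

```python
import time
from functools import wraps


def retry(max_retries: int = 3, base_delay: float = 1.0, exceptions=(Exception,)):
    """Retry the wrapped call with exponential backoff; re-raise after the last attempt."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_retries - 1:
                        raise
                    # Wait base_delay, 2 * base_delay, 4 * base_delay, ...
                    time.sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator
```

A wrapper around an API call would then be decorated with, for example, `@retry(max_retries=5, base_delay=10)`.
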
## 4. Data Structure (SQLite Schema)
### Table `companies` (master data)
* `id` (PK)
* `name` (String)
* `website` (String)
* `crm_id` (String, nullable - link to D365)
* `industry_crm` (String - the "allowed" industry)
* `city` (String)
* `country` (String)
* `status` (Enum: NEW, IMPORTED, ENRICHED, QUALIFIED)
### Table `signals` (robotics potential)
* `company_id` (FK)
* `signal_type` (e.g. "has_spa", "has_large_warehouse", "has_security_needs")
* `confidence` (Float)
* `proof_text` (snippet from the website)
### Table `duplicates_log`
* Stores results of list comparisons ("Upload X contained 20 known companies").
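
The two fully specified tables can be written down as DDL. The sketch below uses the standard-library `sqlite3` module rather than SQLAlchemy, purely to stay self-contained. The `NOT NULL` constraints, the `CHECK` on `status`, and the helper `init_db` are assumptions. `duplicates_log` is left out because the plan does not list its columns.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS companies (
    id           INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,          -- NOT NULL is an assumption
    website      TEXT,
    crm_id       TEXT,                   -- nullable, link to D365
    industry_crm TEXT,
    city         TEXT,
    country      TEXT,
    status       TEXT NOT NULL DEFAULT 'NEW'
                 CHECK (status IN ('NEW', 'IMPORTED', 'ENRICHED', 'QUALIFIED'))
);
CREATE TABLE IF NOT EXISTS signals (
    company_id  INTEGER NOT NULL REFERENCES companies(id),
    signal_type TEXT NOT NULL,
    confidence  REAL,
    proof_text  TEXT
);
"""


def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database, enable foreign keys, and create the tables if missing."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn
```

SQLite has no native enum type, so the status values are enforced with a `CHECK` constraint here.
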
## 5. Implementation Phases
1. **Housekeeping:** Archive the legacy code (`_legacy_gsheets_system`).
2. **Setup:** Init `company-explorer` (backend + frontend skeleton).
3. **Foundation:** DB schema + "List Matcher" (deduplication is priority A for operations).
4. **Enrichment:** Implement the scraper + Signal Detector (Robotics).
5. **UI:** React interface for the data.

@@ -0,0 +1,674 @@
#!/usr/bin/env python3
"""
config.py
Central configuration for the project "Automatisierte Unternehmensbewertung" (automated company evaluation).
Contains file paths, API key paths, the global Config class,
and the column mapping for the Google Sheet.
"""
import os
import re
import logging
# ==============================================================================
# 1. GLOBAL CONSTANTS AND FILE PATHS
# ==============================================================================
# --- File paths (NEW: fixed paths for Docker operation) ---
# In the Docker context the base directory is always /app.
BASE_DIR = "/app"
CREDENTIALS_FILE = os.path.join(BASE_DIR, "service_account.json")
API_KEY_FILE = os.path.join(BASE_DIR, "gemini_api_key.txt")
SERP_API_KEY_FILE = os.path.join(BASE_DIR, "serpapikey.txt")
GENDERIZE_API_KEY_FILE = os.path.join(BASE_DIR, "genderize_API_Key.txt")
BRANCH_MAPPING_FILE = None
LOG_DIR = os.path.join(BASE_DIR, "Log_from_docker")  # Write logs to the mounted folder
# --- ML model artifacts ---
MODEL_FILE = os.path.join(BASE_DIR, "technician_decision_tree_model.pkl")
IMPUTER_FILE = os.path.join(BASE_DIR, "median_imputer.pkl")
PATTERNS_FILE_TXT = os.path.join(BASE_DIR, "technician_patterns.txt")  # Old (optionally kept)
PATTERNS_FILE_JSON = os.path.join(BASE_DIR, "technician_patterns.json")  # New (recommended)
# Marker for URLs that should be searched again via SERP
URL_CHECK_MARKER = "URL_CHECK_NEEDED"
# --- User agents for rotation ---
USER_AGENTS = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 13_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/109.0',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:108.0) Gecko/20100101 Firefox/108.0',
'Mozilla/5.0 (X11; Linux i686; rv:108.0) Gecko/20100101 Firefox/108.0',
'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0',
]
# ==============================================================================
# 2. PRELIMINARY HELPER FUNCTION (required by the Config class)
# ==============================================================================
def normalize_for_mapping(text):
    """
    Aggressively normalizes a string for mapping purposes.
    Must be defined BEFORE the Config class because it is used there.
    """
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = text.strip()
    text = re.sub(r'[^a-z0-9]', '', text)
    return text
# ==============================================================================
# 3. CENTRAL CONFIGURATION CLASS
# ==============================================================================
class Config:
    """Central configuration settings."""
    VERSION = "v2.0.0"  # Version bumped after refactoring
    LANG = "de"  # Language for Wikipedia etc.
    # NOTE: SHEET_URL is a placeholder here. Replace it with your actual URL.
    SHEET_URL = "https://docs.google.com/spreadsheets/d/1u_gHr9JUfmV1-iviRzbSe3575QEp7KLhK5jFV_gJcgo"  # <<< REPLACE THIS!
    MAX_RETRIES = 5
    RETRY_DELAY = 10
    REQUEST_TIMEOUT = 20
    SIMILARITY_THRESHOLD = 0.65
    DEBUG = True
    WIKIPEDIA_SEARCH_RESULTS = 5
    HTML_PARSER = "html.parser"
    TOKEN_MODEL = "gpt-3.5-turbo"
    USER_AGENT = 'Mozilla/5.0 (compatible; UnternehmenSkript/1.0; +https://www.example.com/bot)'
    # --- Configuration for batching & parallelization ---
    PROCESSING_BATCH_SIZE = 20
    OPENAI_BATCH_SIZE_LIMIT = 4
    MAX_SCRAPING_WORKERS = 10
    UPDATE_BATCH_ROW_LIMIT = 50
    MAX_BRANCH_WORKERS = 10
    OPENAI_CONCURRENCY_LIMIT = 3
    PROCESSING_BRANCH_BATCH_SIZE = 20
    SERPAPI_DELAY = 1.5
    # --- (NEW) GTM Architect: style guideline for image generation ---
    CORPORATE_DESIGN_PROMPT = (
        "cinematic industrial photography, sleek high-tech aesthetic, futuristic but grounded reality, "
        "volumetric lighting, sharp focus on modern technology, 8k resolution, photorealistic, "
        "highly detailed textures, cool steel-blue color grading with subtle safety-yellow accents, "
        "wide angle lens, shallow depth of field."
    )
    # --- Plausibility thresholds ---
    PLAUSI_UMSATZ_MIN_WARNUNG = 50000
    PLAUSI_UMSATZ_MAX_WARNUNG = 200000000000
    PLAUSI_MA_MIN_WARNUNG_ABS = 1
    PLAUSI_MA_MIN_WARNUNG_BEI_UMSATZ = 3
    PLAUSI_UMSATZ_MIN_SCHWELLE_FUER_MA_CHECK = 1000000
    PLAUSI_MA_MAX_WARNUNG = 1000000
    PLAUSI_RATIO_UMSATZ_PRO_MA_MIN = 25000
    PLAUSI_RATIO_UMSATZ_PRO_MA_MAX = 1500000
    PLAUSI_ABWEICHUNG_CRM_WIKI_PROZENT = 30
    # --- Mapping for country codes ---
    # Translates D365 country codes into the long form used in the GSheet.
    # IMPORTANT: The keys (codes) should be lowercase for a robust comparison.
    COUNTRY_CODE_MAP = {
'de': 'Deutschland',
'gb': 'Vereinigtes Königreich',
'ch': 'Schweiz',
'at': 'Österreich',
'it': 'Italien',
'es': 'Spanien',
'dk': 'Dänemark',
'hu': 'Ungarn',
'se': 'Schweden',
'fr': 'Frankreich',
'us': 'USA',
'br': 'Brasilien',
'cz': 'Tschechien',
'au': 'Australien',
'mx': 'Mexiko',
'nl': 'Niederlande',
'pl': 'Polen',
'be': 'Belgien',
'sk': 'Slowakei',
'nz': 'Neuseeland',
'in': 'Indien',
'li': 'Liechtenstein',
'ae': 'Vereinigte Arabische Emirate',
'ru': 'Russland',
'jp': 'Japan',
'ro': 'Rumänien',
'is': 'Island',
'lu': 'Luxemburg',
'me': 'Montenegro',
'ph': 'Philippinen',
'fi': 'Finnland',
'no': 'Norwegen',
'ma': 'Marokko',
'hr': 'Kroatien',
'ca': 'Kanada',
'ua': 'Ukraine',
'sb': 'Salomonen',
'za': 'Südafrika',
'ee': 'Estland',
'cn': 'China',
'si': 'Slowenien',
'lt': 'Litauen',
    }
    # --- Industry group mapping (v2.0 - enriched with definitions & examples) ---
    # Single source of truth for all industries.
    BRANCH_GROUP_MAPPING = {
"Maschinenbau": {
"gruppe": "Hersteller / Produzenten",
"definition": "Herstellung von zumeist größeren und komplexen Maschinen. Abgrenzung: Keine Anlagen wie z.B. Aufzüge, Rolltreppen oder komplette Produktionsstraßen.",
"beispiele": "EBM Papst, Kärcher, Winterhalter, Testo, ZwickRoell, Koch Pac, Uhlmann, BHS, Schlie, Kasto, Chiron",
"d365_branch_detail": "Maschinenbau"
},
"Automobil": {
"gruppe": "Hersteller / Produzenten",
"definition": "Hersteller von (Spezial)-Fahrzeugen, die meist in ihrer Bewegung eingeschränkt sind (z.B. Mähdrescher, Pistenraupen). Abgrenzung: Keine Autohändler oder Service an PKWs.",
"beispiele": "Kässbohrer, Aebi Schmidt, Pesko, Nova, PV Automotive",
"d365_branch_detail": "Automobil"
},
"Anlagenbau": {
"gruppe": "Hersteller / Produzenten",
"definition": "Hersteller von komplexen Anlagen, die fest beim Kunden installiert werden (z.B. Fertigungsanlagen) und oft der Herstellung nachgelagerter Erzeugnisse dienen. Abgrenzung: Keine Aufzugsanlagen, keine Rolltreppen.",
"beispiele": "Yaskawa, Good Mills, Jungheinrich, Abus, BWT",
"d365_branch_detail": "Anlagenbau"
},
"Medizintechnik": {
"gruppe": "Hersteller / Produzenten",
"definition": "Hersteller von medizinischen Geräten für Krankenhäuser, (Zahn-)Arztpraxen oder den Privatbereich. Abgrenzung: Keine reinen Dienstleister/Pflegedienste.",
"beispiele": "Carl Zeiss, MMM, Olympus, Sysmex, Henry Schein, Dental Bauer, Vitalaire",
"d365_branch_detail": "Medizintechnik"
},
"Chemie & Pharma": {
"gruppe": "Hersteller / Produzenten",
"definition": "Unternehmen, die chemische oder pharmazeutische Erzeugnisse herstellen. Abgrenzung: Keine Lebensmittel.",
"beispiele": "Brillux",
"d365_branch_detail": "Chemie & Pharma"
},
"Elektrotechnik": {
"gruppe": "Hersteller / Produzenten",
"definition": "Hersteller von Maschinen und Geräten, die sich hauptsächlich durch elektrische Komponenten auszeichnen.",
"beispiele": "Triathlon, SBS BatterieSystem",
"d365_branch_detail": "Elektrotechnik"
},
"Lebensmittelproduktion": {
"gruppe": "Hersteller / Produzenten",
"definition": "Unternehmen, die Lebensmittel im industriellen Maßstab produzieren.",
"beispiele": "Ferrero, Lohmann, Mars, Fuchs, Teekanne, Frischli",
"d365_branch_detail": "Lebensmittelproduktion"
},
"IT / Telekommunikation": {
"gruppe": "Hersteller / Produzenten",
"definition": "Hersteller von Telekommunikations-Hardware und -Equipment. Abgrenzung: Keine Telekommunikations-Netzbetreiber.",
"beispiele": "NDI Nordisk Daek Import Danmark",
"d365_branch_detail": "IT / Telekommunikation"
},
"Bürotechnik": {
"gruppe": "Hersteller / Produzenten",
"definition": "Hersteller von Geräten für die Büro-Infrastruktur wie Drucker, Kopierer oder Aktenvernichter.",
"beispiele": "Ricoh, Rosskopf",
"d365_branch_detail": "Bürotechnik"
},
"Automaten (Vending / Slot)": {
"gruppe": "Hersteller / Produzenten",
"definition": "Reine Hersteller von Verkaufs-, Service- oder Spielautomaten, die mitunter einen eigenen Kundenservice haben.",
"beispiele": "Coffema, Melitta, Tchibo, Selecta",
"d365_branch_detail": "Automaten (Vending, Slot)"
},
"Gebäudetechnik Heizung / Lüftung / Klima": {
"gruppe": "Hersteller / Produzenten",
"definition": "Reine Hersteller von Heizungs-, Lüftungs- und Klimaanlagen (HLK), die mitunter einen eigenen Kundenservice haben.",
"beispiele": "Wolf, ETA, Fröling, Ochsner, Windhager, DKA",
"d365_branch_detail": "Gebäudetechnik Heizung, Lüftung, Klima"
},
"Gebäudetechnik Allgemein": {
"gruppe": "Hersteller / Produzenten",
"definition": "Hersteller von Produkten, die fest in Gebäuden installiert werden (z.B. Sicherheitstechnik, Türen, Sonnenschutz).",
"beispiele": "Geze, Bothe Hild, Warema, Hagleitner",
"d365_branch_detail": "Gebäudetechnik Allgemein"
},
"Schädlingsbekämpfung": {
"gruppe": "Hersteller / Produzenten",
"definition": "Hersteller von Systemen und Produkten zur Schädlingsbekämpfung.",
"beispiele": "BioTec, RSD Systems",
"d365_branch_detail": "Schädlingsbekämpfung"
},
"Braune & Weiße Ware": {
"gruppe": "Hersteller / Produzenten",
"definition": "Hersteller von Haushaltsgroßgeräten (Weiße Ware) und Unterhaltungselektronik (Braune Ware).",
"beispiele": "BSH",
"d365_branch_detail": "Braune & Weiße Ware"
},
"Fenster / Glas": {
"gruppe": "Hersteller / Produzenten",
"definition": "Hersteller von Fenstern, Türen oder Glaselementen.",
"beispiele": "",
"d365_branch_detail": "Fenster / Glas"
},
"Getränke": {
"gruppe": "Hersteller / Produzenten",
"definition": "Industrielle Hersteller von Getränken.",
"beispiele": "Wesergold, Schlossquelle, Winkels",
"d365_branch_detail": "Getränke"
},
"Möbel": {
"gruppe": "Hersteller / Produzenten",
"definition": "Industrielle Hersteller von Möbeln.",
"beispiele": "mycs",
"d365_branch_detail": "Möbel"
},
"Agrar / Pellets": {
"gruppe": "Hersteller / Produzenten",
"definition": "Hersteller von landwirtschaftlichen Produkten, Maschinen oder Brennstoffen wie Holzpellets.",
"beispiele": "KWB Energiesysteme",
"d365_branch_detail": "Agrar, Pellets"
},
"Stadtwerke": {
"gruppe": "Versorger",
"definition": "Lokale Stadtwerke, die die lokale Infrastruktur für die Energieversorgung (Strom, Gas, Wasser) betreiben.",
"beispiele": "Badenova, Drewag, Stadtwerke Leipzig, Stadtwerke Kiel",
"d365_branch_detail": "Stadtwerke"
},
"Verteilnetzbetreiber": {
"gruppe": "Versorger",
"definition": "Überregionale Betreiber von Verteilnetzen (Strom, Gas), die oft keine direkten Endkundenversorger sind.",
"beispiele": "Rheinenergie, Open Grid, ENBW",
"d365_branch_detail": "Verteilnetzbetreiber"
},
"Telekommunikation": {
"gruppe": "Versorger",
"definition": "Betreiber von Telekommunikations-Infrastruktur und Netzen (z.B. Telefon, Internet, Mobilfunk).",
"beispiele": "M-Net, NetCologne, Thiele, Willy.tel",
"d365_branch_detail": "Telekommunikation"
},
"Gase & Mineralöl": {
"gruppe": "Versorger",
"definition": "Unternehmen, die Gas- oder Mineralölprodukte an Endkunden oder Unternehmen liefern.",
"beispiele": "Westfalen AG, GasCom",
"d365_branch_detail": "Gase & Mineralöl"
},
"Messdienstleister": {
"gruppe": "Service provider (Dienstleister)",
"definition": "Unternehmen, die sich auf die Ablesung und Abrechnung von Verbrauchszählern (Heizung, Wasser) spezialisiert haben. Abgrenzung: Kein Versorger.",
"beispiele": "Brunata, Ista, Telent",
"d365_branch_detail": "Messdienstleister"
},
"Facility Management": {
"gruppe": "Service provider (Dienstleister)",
"definition": "Anbieter von Dienstleistungen rund um Immobilien, von der technischen Instandhaltung bis zur Reinigung.",
"beispiele": "Wisag, Vonovia, Infraserv, Gewofag, B&O, Sprint Sanierungen, BWTS",
"d365_branch_detail": "Facility Management"
},
"Healthcare/Pflegedienste": {
"gruppe": "Service provider (Dienstleister)",
"definition": "Erbringen von reinen Dienstleistungen an medizinischen Geräten (z.B. Wartung, Lieferung) oder direkt an Menschen (Pflege). Abgrenzung: Keine Hersteller.",
"beispiele": "Sanimed, Fuchs+Möller, Strehlow, Healthcare at Home",
"d365_branch_detail": "Healthcare/Pflegedienste"
},
"Servicedienstleister / Reparatur ohne Produktion": {
"gruppe": "Service provider (Dienstleister)",
"definition": "Reine Service-Organisationen, die technische Geräte warten und reparieren, aber nicht selbst herstellen.",
"beispiele": "HSR, FFB",
"d365_branch_detail": "Servicedienstleister / Reparatur ohne Produktion"
},
"Aufzüge und Rolltreppen": {
"gruppe": "Service provider (Dienstleister)",
"definition": "Hersteller und Unternehmen, die Service, Wartung und Installation von Aufzügen und Rolltreppen anbieten.",
"beispiele": "TKE, Liftstar, Lifta",
"d365_branch_detail": "Aufzüge und Rolltreppen"
},
"Feuer- und Sicherheitssysteme": {
"gruppe": "Service provider (Dienstleister)",
"definition": "Dienstleister für die Wartung, Installation und Überprüfung von Brandmelde- und Sicherheitssystemen.",
"beispiele": "Minimax, Securiton",
"d365_branch_detail": "Feuer- und Sicherheitssysteme"
},
"Personentransport": {
"gruppe": "Service provider (Dienstleister)",
"definition": "Unternehmen, die Personen befördern (z.B. Busunternehmen, Taxi-Zentralen) und eine eigene Fahrzeugflotte warten.",
"beispiele": "Rhein-Sieg-Verkehrsgesellschaft",
"d365_branch_detail": "Personentransport"
},
"Entsorgung": {
"gruppe": "Service provider (Dienstleister)",
"definition": "Unternehmen der Abfall- und Entsorgungswirtschaft mit komplexer Logistik und Fahrzeugmanagement.",
"beispiele": "",
"d365_branch_detail": "Entsorgung"
},
"Catering Services": {
"gruppe": "Service provider (Dienstleister)",
"definition": "Anbieter von Verpflegungsdienstleistungen, oft mit komplexer Logistik und Wartung von Küchengeräten.",
"beispiele": "Café+Co International",
"d365_branch_detail": "Catering Services"
},
"Auslieferdienste": {
"gruppe": "Handel & Logistik",
"definition": "Unternehmen, deren Kerngeschäft der Transport und die Logistik von Waren zum Endkunden ist (Lieferdienste). Abgrenzung: Keine reinen Logistik-Dienstleister.",
"beispiele": "Edeka, Rewe, Saturn, Gamma Reifen",
"d365_branch_detail": "Auslieferdienste"
},
"Energie (Brennstoffe)": {
"gruppe": "Handel & Logistik",
"definition": "Unternehmen, deren Kerngeschäft der Transport und die Logistik von Brennstoffen wie Heizöl zum Endkunden ist.",
"beispiele": "Eckert & Ziegler",
"d365_branch_detail": "Energie (Brennstoffe)"
},
"Großhandel": {
"gruppe": "Handel & Logistik",
"definition": "Großhandelsunternehmen, bei denen der Transport und die Logistik eine zentrale Rolle spielen.",
"beispiele": "Hairhaus, NDI Nordisk",
"d365_branch_detail": "Großhandel"
},
"Einzelhandel": {
"gruppe": "Handel & Logistik",
"definition": "Einzelhandelsunternehmen, oft mit eigener Lieferlogistik zum Endkunden.",
"beispiele": "Cactus, mertens, Teuto",
"d365_branch_detail": "Einzelhandel"
},
"Logistik": {
"gruppe": "Handel & Logistik",
"definition": "Allgemeine Logistikdienstleister, die nicht in eine der spezifischeren Kategorien passen.",
"beispiele": "Gerdes + Landwehr, Rüdebusch, Winner",
"d365_branch_detail": "Logistik - Sonstige"
},
"Baustoffhandel": {
"gruppe": "Baubranche",
"definition": "Großhandel mit Baustoffen wie Zement, Kies, Holz oder Fliesen oft mit eigenen Fuhrparks und komplexer Filiallogistik.",
"beispiele": "Kemmler Baustoffe, Henri Benthack",
"d365_branch_detail": "Baustoffhandel"
},
"Baustoffindustrie": {
"gruppe": "Baubranche",
"definition": "Produktion von Baustoffen wie Beton, Ziegeln, Gips oder Dämmmaterial häufig mit werkseigener Logistik.",
"beispiele": "Heidelberg Materials, Saint Gobain Weber",
"d365_branch_detail": "Baustoffindustrie"
},
"Logistiker Baustoffe": {
"gruppe": "Baubranche",
"definition": "Spezialisierte Transportdienstleister für Baustoffe häufig im Nahverkehr, mit engen Zeitfenstern und Baustellenbelieferung.",
"beispiele": "C.Bergmann, HENGE Baustoff GmbH",
"d365_branch_detail": "Logistiker Baustoffe"
},
"Bauunternehmen": {
"gruppe": "Baubranche",
"definition": "Ausführung von Bauprojekten, oft mit eigenem Materialtransport hoher Koordinationsaufwand bei Fahrzeugen, Maschinen und Baustellen.",
"beispiele": "Max Bögl, Leonhard Weiss",
"d365_branch_detail": "Bauunternehmen"
},
"Versicherungsgutachten": {
"gruppe": "Gutachter / Versicherungen",
"definition": "Gutachter, die im Auftrag von Versicherungen Schäden prüfen und bewerten.",
"beispiele": "DEVK, Allianz",
"d365_branch_detail": "Versicherungsgutachten"
},
"Technische Gutachten": {
"gruppe": "Gutachter / Versicherungen",
"definition": "Sachverständige und Organisationen, die technische Prüfungen, Inspektionen und Gutachten durchführen.",
"beispiele": "TÜV, Audatex, Value, MDK",
"d365_branch_detail": "Technische Gutachten"
},
"Medizinische Gutachten": {
"gruppe": "Gutachter / Versicherungen",
"definition": "Sachverständige und Organisationen (z.B. MDK), die medizinische Gutachten erstellen.",
"beispiele": "MDK",
"d365_branch_detail": "Medizinische Gutachten"
},
"Baugutachter": {
"gruppe": "Gutachter / Versicherungen",
"definition": "Sachverständige, die Bauschäden oder den Wert von Immobilien begutachten.",
"beispiele": "",
"d365_branch_detail": "Baugutachter"
},
"Wohnungswirtschaft": {
"gruppe": "Housing",
"definition": "Wohnungsbaugesellschaften oder -genossenschaften, die ihre Immobilien instand halten.",
"beispiele": "GEWOFAG",
"d365_branch_detail": "Wohnungswirtschaft"
},
"Renovierungsunternehmen": {
"gruppe": "Housing",
"definition": "Dienstleister, die auf die Renovierung und Sanierung von Wohnimmobilien spezialisiert sind.",
"beispiele": "",
"d365_branch_detail": "Renovierungsunternehmen"
},
"Sozialbau Unternehmen": {
"gruppe": "Housing",
"definition": "Unternehmen, die im Bereich des sozialen Wohnungsbaus tätig sind.",
"beispiele": "",
"d365_branch_detail": "Anbieter für Soziales Wohnen"
},
"IT Beratung": {
"gruppe": "Sonstige",
"definition": "Beratungsunternehmen mit Fokus auf IT-Strategie und -Implementierung. Abgrenzung: Keine Systemhäuser mit eigenem Außendienst.",
"beispiele": "",
"d365_branch_detail": "IT Beratung"
},
"Unternehmensberatung": {
"gruppe": "Sonstige",
"definition": "Klassische Management- und Strategieberatungen.",
"beispiele": "",
"d365_branch_detail": "Unternehmensberatung (old)"
},
"Engineering": {
"gruppe": "Sonstige",
"definition": "Ingenieurbüros und technische Planungsdienstleister.",
"beispiele": "",
"d365_branch_detail": "Engineering"
},
"Öffentliche Verwaltung": {
"gruppe": "Sonstige",
"definition": "Behörden und öffentliche Einrichtungen, oft mit eigenen technischen Abteilungen (z.B. Bauhöfe).",
"beispiele": "",
"d365_branch_detail": "Öffentliche Verwaltung"
},
"Sonstiger Service": {
"gruppe": "Sonstige",
"definition": "Auffangkategorie für Dienstleistungen, die keiner anderen Kategorie zugeordnet werden können.",
"beispiele": "",
"d365_branch_detail": "Sonstiger Service (old)"
}
    }
    # Cross-industry top references as fallback
    FALLBACK_REFERENCES = [
        "Jungheinrich (weltweit >4.000 Techniker)",
        "Vivawest (Kundenzufriedenheit > 95%)",
        "TK Elevators (1.500 Techniker)",
        "NetCologne"
    ]
    # --- API key storage (loaded in main()) ---
    API_KEYS = {}
    @classmethod
    def load_api_keys(cls):
        """Load the API keys from the configured files."""
        logger = logging.getLogger(__name__)
        logger.info("Lade API-Schluessel...")
        cls.API_KEYS['openai'] = cls._load_key_from_file(API_KEY_FILE)
        cls.API_KEYS['serpapi'] = cls._load_key_from_file(SERP_API_KEY_FILE)
        cls.API_KEYS['genderize'] = cls._load_key_from_file(GENDERIZE_API_KEY_FILE)
        if cls.API_KEYS.get('openai'):
            # We assume here that the 'openai' slot is used for Gemini (legacy).
            # If helpers.py accesses 'gemini' directly, that key would have to be set here as well.
            logger.info("Gemini API Key (via 'openai' slot) erfolgreich geladen.")
        else:
            logger.warning("Gemini API Key konnte nicht geladen werden. KI-Funktionen sind deaktiviert.")
        if not cls.API_KEYS.get('serpapi'):
            logger.warning("SerpAPI Key konnte nicht geladen werden. Suchfunktionen sind deaktiviert.")
        if not cls.API_KEYS.get('genderize'):
            logger.warning("Genderize API Key konnte nicht geladen werden. Geschlechtserkennung ist eingeschraenkt.")
    @staticmethod
    def _load_key_from_file(filepath):
        """Helper that loads a key from a file; returns None if it is missing or empty."""
        logger = logging.getLogger(__name__)
        abs_path = os.path.abspath(filepath)
        try:
            with open(abs_path, "r", encoding="utf-8") as f:
                key = f.read().strip()
            if key:
                return key
            else:
                logger.warning(f"API key file is empty: '{abs_path}'")
                return None
        except FileNotFoundError:
            logger.warning(f"API key file not found at path: '{abs_path}'")
            return None
        except Exception as e:
            logger.error(f"Error reading key file '{abs_path}': {e}")
            return None
# ==============================================================================
# 4. GLOBAL DATA STRUCTURE VARIABLES
# ==============================================================================
# NEW: Defines the exact, guaranteed order of the columns.
# This is the new "single source of truth" for all index calculations.
COLUMN_ORDER = [
"ReEval Flag", "CRM Name", "CRM Kurzform", "Parent Account Name", "CRM Website", "CRM Ort", "CRM Land",
"CRM Beschreibung", "CRM Branche", "CRM Beschreibung Branche extern", "CRM Anzahl Techniker", "CRM Umsatz",
"CRM Anzahl Mitarbeiter", "CRM Vorschlag Wiki URL", "System Vorschlag Parent Account", "Parent Vorschlag Status",
"Parent Vorschlag Timestamp", "Wiki URL", "Wiki Sitz Stadt", "Wiki Sitz Land", "Wiki Absatz", "Wiki Branche",
"Wiki Umsatz", "Wiki Mitarbeiter", "Wiki Kategorien", "Wikipedia Timestamp", "Wiki Verif. Timestamp",
"SerpAPI Wiki Search Timestamp", "Chat Wiki Konsistenzpruefung", "Chat Begründung Wiki Inkonsistenz",
"Chat Vorschlag Wiki Artikel", "Begründung bei Abweichung", "Website Rohtext", "Website Zusammenfassung",
"Website Meta-Details", "Website Scrape Timestamp", "URL Prüfstatus", "Chat Vorschlag Branche",
"Chat Branche Konfidenz", "Chat Konsistenz Branche", "Chat Begruendung Abweichung Branche",
"Chat Prüfung FSM Relevanz", "Chat Begründung für FSM Relevanz", "Chat Schätzung Anzahl Mitarbeiter",
"Chat Konsistenzprüfung Mitarbeiterzahl", "Chat Begruendung Abweichung Mitarbeiterzahl",
"Chat Einschätzung Anzahl Servicetechniker", "Chat Begründung Abweichung Anzahl Servicetechniker",
"Chat Schätzung Umsatz", "Chat Begründung Abweichung Umsatz", "FSM Pitch", "FSM Pitch Timestamp",
"Linked Serviceleiter gefunden", "Linked It-Leiter gefunden", "Linked Management gefunden",
"Linked Disponent gefunden", "Contact Search Timestamp", "Finaler Umsatz (Wiki>CRM)",
"Finaler Mitarbeiter (Wiki>CRM)", "Geschaetzter Techniker Bucket", "Plausibilität Umsatz",
"Plausibilität Mitarbeiter", "Plausibilität Umsatz/MA Ratio", "Abweichung Umsatz CRM/Wiki",
"Abweichung MA CRM/Wiki", "Plausibilität Begründung", "Plausibilität Prüfdatum",
"Archiviert", "SyncConflict", "Timestamp letzte Pruefung", "Version", "Tokens", "CRM ID"
]
# --- Column mapping (single source of truth) ---
# Version 1.8.0 - 68 columns (A-BP)
COLUMN_MAP = {
# A-E: Master data & process control
"ReEval Flag": {"Titel": "A", "index": 0},
"CRM Name": {"Titel": "B", "index": 1},
"CRM Kurzform": {"Titel": "C", "index": 2},
"Parent Account Name": {"Titel": "D", "index": 3},
"CRM Website": {"Titel": "E", "index": 4},
# F-M: CRM data
"CRM Ort": {"Titel": "F", "index": 5},
"CRM Land": {"Titel": "G", "index": 6},
"CRM Beschreibung": {"Titel": "H", "index": 7},
"CRM Branche": {"Titel": "I", "index": 8},
"CRM Beschreibung Branche extern": {"Titel": "J", "index": 9},
"CRM Anzahl Techniker": {"Titel": "K", "index": 10},
"CRM Umsatz": {"Titel": "L", "index": 11},
"CRM Anzahl Mitarbeiter": {"Titel": "M", "index": 12},
# N-Q: System & parent suggestions
"CRM Vorschlag Wiki URL": {"Titel": "N", "index": 13},
"System Vorschlag Parent Account": {"Titel": "O", "index": 14},
"Parent Vorschlag Status": {"Titel": "P", "index": 15},
"Parent Vorschlag Timestamp": {"Titel": "Q", "index": 16},
# R-AB: Wikipedia extraction
"Wiki URL": {"Titel": "R", "index": 17},
"Wiki Sitz Stadt": {"Titel": "S", "index": 18},
"Wiki Sitz Land": {"Titel": "T", "index": 19},
"Wiki Absatz": {"Titel": "U", "index": 20},
"Wiki Branche": {"Titel": "V", "index": 21},
"Wiki Umsatz": {"Titel": "W", "index": 22},
"Wiki Mitarbeiter": {"Titel": "X", "index": 23},
"Wiki Kategorien": {"Titel": "Y", "index": 24},
"Wikipedia Timestamp": {"Titel": "Z", "index": 25},
"Wiki Verif. Timestamp": {"Titel": "AA", "index": 26},
"SerpAPI Wiki Search Timestamp": {"Titel": "AB", "index": 27},
# AC-AF: ChatGPT wiki verification
"Chat Wiki Konsistenzpruefung": {"Titel": "AC", "index": 28},
"Chat Begründung Wiki Inkonsistenz": {"Titel": "AD", "index": 29},
"Chat Vorschlag Wiki Artikel": {"Titel": "AE", "index": 30},
"Begründung bei Abweichung": {"Titel": "AF", "index": 31},
# AG-AK: Website Scraping
"Website Rohtext": {"Titel": "AG", "index": 32},
"Website Zusammenfassung": {"Titel": "AH", "index": 33},
"Website Meta-Details": {"Titel": "AI", "index": 34},
"Website Scrape Timestamp": {"Titel": "AJ", "index": 35},
"URL Prüfstatus": {"Titel": "AK", "index": 36},
# AL-AU: ChatGPT industry & FSM analysis
"Chat Vorschlag Branche": {"Titel": "AL", "index": 37},
"Chat Branche Konfidenz": {"Titel": "AM", "index": 38},
"Chat Konsistenz Branche": {"Titel": "AN", "index": 39},
"Chat Begruendung Abweichung Branche": {"Titel": "AO", "index": 40},
"Chat Prüfung FSM Relevanz": {"Titel": "AP", "index": 41},
"Chat Begründung für FSM Relevanz": {"Titel": "AQ", "index": 42},
"Chat Schätzung Anzahl Mitarbeiter": {"Titel": "AR", "index": 43},
"Chat Konsistenzprüfung Mitarbeiterzahl": {"Titel": "AS", "index": 44},
"Chat Begruendung Abweichung Mitarbeiterzahl": {"Titel": "AT", "index": 45},
"Chat Einschätzung Anzahl Servicetechniker": {"Titel": "AU", "index": 46},
# AV-AZ: ChatGPT (continued) & FSM pitch
"Chat Begründung Abweichung Anzahl Servicetechniker": {"Titel": "AV", "index": 47},
"Chat Schätzung Umsatz": {"Titel": "AW", "index": 48},
"Chat Begründung Abweichung Umsatz": {"Titel": "AX", "index": 49},
"FSM Pitch": {"Titel": "AY", "index": 50},
"FSM Pitch Timestamp": {"Titel": "AZ", "index": 51},
# BA-BE: LinkedIn contact search
"Linked Serviceleiter gefunden": {"Titel": "BA", "index": 52},
"Linked It-Leiter gefunden": {"Titel": "BB", "index": 53},
"Linked Management gefunden": {"Titel": "BC", "index": 54},
"Linked Disponent gefunden": {"Titel": "BD", "index": 55},
"Contact Search Timestamp": {"Titel": "BE", "index": 56},
# BF-BH: Consolidated data & ML
"Finaler Umsatz (Wiki>CRM)": {"Titel": "BF", "index": 57},
"Finaler Mitarbeiter (Wiki>CRM)": {"Titel": "BG", "index": 58},
"Geschaetzter Techniker Bucket": {"Titel": "BH", "index": 59},
# BI-BQ: Plausibility checks, archive flag & sync conflict
"Plausibilität Umsatz": {"Titel": "BI", "index": 60},
"Plausibilität Mitarbeiter": {"Titel": "BJ", "index": 61},
"Plausibilität Umsatz/MA Ratio": {"Titel": "BK", "index": 62},
"Abweichung Umsatz CRM/Wiki": {"Titel": "BL", "index": 63},
"Abweichung MA CRM/Wiki": {"Titel": "BM", "index": 64},
"Plausibilität Begründung": {"Titel": "BN", "index": 65},
"Plausibilität Prüfdatum": {"Titel": "BO", "index": 66},
"Archiviert": {"Titel": "BP", "index": 67},
"SyncConflict": {"Titel": "BQ", "index": 68},
# BR-BU: Metadata (indices shifted)
"Timestamp letzte Pruefung": {"Titel": "BR", "index": 69},
"Version": {"Titel": "BS", "index": 70},
"Tokens": {"Titel": "BT", "index": 71},
"CRM ID": {"Titel": "BU", "index": 72}
}
# ==============================================================================
# 5. DEALFRONT AUTOMATION CONFIGURATION
# ==============================================================================
DEALFRONT_CREDENTIALS_FILE = os.path.join(BASE_DIR, "dealfront_credentials.json")
DEALFRONT_LOGIN_URL = "https://app.dealfront.com/login"
# Direct URL to the 'Target' area; this has proven to be the most robust approach.
DEALFRONT_TARGET_URL = "https://app.dealfront.com/t/prospector/companies"
# IMPORTANT: the exact name of the predefined search to load after navigation.
TARGET_SEARCH_NAME = "Facility Management" # <-- ADJUST THIS NAME TO MATCH YOUR TARGET LIST
# --- END OF FILE config.py ---
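The `Titel` letters and `index` values in `COLUMN_MAP` are maintained by hand, so the two can drift apart when columns are inserted. A minimal sketch of a consistency check, assuming zero-based indices and spreadsheet-style column letters; `index_to_letter` and `find_mismatches` are hypothetical helper names and not part of config.py:

```python
def index_to_letter(index: int) -> str:
    """Convert a zero-based column index to a spreadsheet letter (0 -> A, 26 -> AA)."""
    if index < 0:
        raise ValueError("index must be non-negative")
    letters = ""
    n = index + 1  # switch to 1-based for the bijective base-26 conversion
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def find_mismatches(column_map: dict) -> list:
    """Return the keys whose "Titel" letter does not match their "index"."""
    return [
        key
        for key, spec in column_map.items()
        if index_to_letter(spec["index"]) != spec["Titel"]
    ]


# Illustrative subset of the mapping above (hypothetical selection)
sample_map = {
    "ReEval Flag": {"Titel": "A", "index": 0},
    "Wiki Verif. Timestamp": {"Titel": "AA", "index": 26},
    "CRM ID": {"Titel": "BU", "index": 72},
}
mismatches = find_mismatches(sample_map)
```

Running `find_mismatches(COLUMN_MAP)` at import time or in a test would flag any entry whose letter and index disagree after the map is edited.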


@@ -0,0 +1,412 @@
#!/usr/bin/env python3
"""
helpers.py
Collection of global, reusable helper functions for the project
"Automatisierte Unternehmensbewertung" (automated company evaluation). Contains
decorators, text normalization, API wrappers, and other utilities.
"""
__version__ = "v2.4.0_Final_Fix"
ALLOWED_TARGET_BRANCHES = []
# ==============================================================================
# 1. IMPORTS
# ==============================================================================
# Standard library
import os
import time
import re
import csv
import json
import random
import logging
import traceback
import unicodedata
from datetime import datetime
from urllib.parse import urlparse, unquote
from difflib import SequenceMatcher
import base64
import sys
# Third-party libraries
try:
import gspread
GSPREAD_AVAILABLE = True
except ImportError:
GSPREAD_AVAILABLE = False
gspread = None
try:
import wikipedia
WIKIPEDIA_AVAILABLE = True
except ImportError:
WIKIPEDIA_AVAILABLE = False
wikipedia = None
import requests
from bs4 import BeautifulSoup
try:
import pandas as pd
PANDAS_AVAILABLE = True
except Exception as e:
logging.warning(f"Pandas import failed: {e}")
PANDAS_AVAILABLE = False
pd = None
# --- AI SWITCH: Google Generative AI (dual support) ---
HAS_NEW_GENAI = False
HAS_OLD_GENAI = False
# 1. New library (google-genai)
try:
from google import genai
from google.genai import types
HAS_NEW_GENAI = True
logging.info("Bibliothek 'google.genai' (v1.0+) geladen.")
except ImportError:
logging.warning("Bibliothek 'google.genai' nicht gefunden. Versuche Fallback.")
# 2. Legacy library (google-generativeai)
try:
import google.generativeai as old_genai
HAS_OLD_GENAI = True
logging.info("Bibliothek 'google.generativeai' (Legacy) geladen.")
except ImportError:
logging.warning("Bibliothek 'google.generativeai' nicht gefunden.")
HAS_GEMINI = HAS_NEW_GENAI or HAS_OLD_GENAI
# OpenAI Imports (Legacy)
try:
import openai
from openai.error import AuthenticationError, OpenAIError, RateLimitError, APIError, Timeout, InvalidRequestError, ServiceUnavailableError
OPENAI_AVAILABLE = True
except ImportError:
OPENAI_AVAILABLE = False
class AuthenticationError(Exception): pass
class OpenAIError(Exception): pass
class RateLimitError(Exception): pass
class APIError(Exception): pass
class Timeout(Exception): pass
class InvalidRequestError(Exception): pass
class ServiceUnavailableError(Exception): pass
from config import (Config, BRANCH_MAPPING_FILE, URL_CHECK_MARKER, USER_AGENTS, LOG_DIR, COLUMN_MAP, COLUMN_ORDER)
# Optional libraries
try:
import tiktoken
except ImportError:
tiktoken = None
gender = None
gender_detector = None
def get_col_idx(key):
try:
return COLUMN_ORDER.index(key)
except ValueError:
return None
# ==============================================================================
# 2. RETRY DECORATOR
# ==============================================================================
decorator_logger = logging.getLogger(__name__ + ".Retry")
def retry_on_failure(func):
def wrapper(*args, **kwargs):
func_name = func.__name__
self_arg = args[0] if args and hasattr(args[0], func_name) and isinstance(args[0], object) else None
effective_func_name = f"{self_arg.__class__.__name__}.{func_name}" if self_arg else func_name
max_retries_config = getattr(Config, 'MAX_RETRIES', 3)
base_delay = getattr(Config, 'RETRY_DELAY', 5)
if max_retries_config <= 0:
return func(*args, **kwargs)
for attempt in range(max_retries_config):
try:
if attempt > 0:
decorator_logger.warning(f"Wiederhole Versuch {attempt + 1}/{max_retries_config} fuer '{effective_func_name}'...")
return func(*args, **kwargs)
except Exception as e:
permanent_errors = [ValueError]
if GSPREAD_AVAILABLE:
permanent_errors.append(gspread.exceptions.SpreadsheetNotFound)
if any(isinstance(e, error_type) for error_type in permanent_errors):
raise e
if attempt < max_retries_config - 1:
wait_time = base_delay * (2 ** attempt) + random.uniform(0, 1)
time.sleep(wait_time)
else:
raise e
raise RuntimeError(f"Retry loop error for {effective_func_name}")
return wrapper
# ==============================================================================
# 3. LOGGING & UTILS
# ==============================================================================
def token_count(text, model=None):
if not text or not isinstance(text, str): return 0
return len(str(text).split())
def log_module_versions(modules_to_log):
pass
def create_log_filename(mode):
try:
now = datetime.now().strftime("%Y-%m-%d_%H-%M")
ver_short = getattr(Config, 'VERSION', 'unknown').replace(".", "")
return os.path.join(LOG_DIR, f"{now}_{ver_short}_Modus-{mode}.txt")
except Exception:
return None
# ==============================================================================
# 4. TEXT, STRING & URL UTILITIES
# ==============================================================================
def simple_normalize_url(url): return url if url else "k.A."
def normalize_string(s): return s
def clean_text(text): return str(text).strip() if text else "k.A."
def normalize_company_name(name): return name.lower().strip() if name else ""
def _get_col_letter(col_num): return ""
def fuzzy_similarity(str1, str2): return 0.0
def extract_numeric_value(raw_value, is_umsatz=False): return "k.A."
def get_numeric_filter_value(value_str, is_umsatz=False): return 0.0
@retry_on_failure
def _call_genderize_api(name, api_key): return {}
def get_gender(firstname): return "unknown"
def get_email_address(firstname, lastname, website): return ""
# ==============================================================================
# 8. GEMINI API WRAPPERS
# ==============================================================================
def _get_gemini_api_key():
api_key = Config.API_KEYS.get('gemini') or Config.API_KEYS.get('openai')
if api_key: return api_key
api_key = os.environ.get("GEMINI_API_KEY") or os.environ.get("OPENAI_API_KEY")
if api_key: return api_key
raise ValueError("API Key missing.")
@retry_on_failure
def call_gemini_flash(prompt, system_instruction=None, temperature=0.3, json_mode=False):
"""
Calls Gemini (text). Uses gemini-2.0-flash by default.
"""
logger = logging.getLogger(__name__)
api_key = _get_gemini_api_key()
# Priority 1: legacy library (proven for text in this setup)
if HAS_OLD_GENAI:
try:
old_genai.configure(api_key=api_key)
generation_config = {
"temperature": temperature,
"top_p": 0.95,
"top_k": 40,
"max_output_tokens": 8192,
}
if json_mode:
generation_config["response_mime_type"] = "application/json"
# IMPORTANT: use 2.0, because 1.5 was not available
model = old_genai.GenerativeModel(
model_name="gemini-2.0-flash",
generation_config=generation_config,
system_instruction=system_instruction
)
contents = [prompt] if isinstance(prompt, str) else prompt
response = model.generate_content(contents)
return response.text.strip()
except Exception as e:
logger.error(f"Fehler mit alter GenAI Lib: {e}")
if not HAS_NEW_GENAI: raise e
# Fallthrough to new lib
# Priority 2: new library
if HAS_NEW_GENAI:
try:
client = genai.Client(api_key=api_key)
config = {
"temperature": temperature,
"top_p": 0.95,
"top_k": 40,
"max_output_tokens": 8192,
}
if json_mode:
config["response_mime_type"] = "application/json"
response = client.models.generate_content(
model="gemini-2.0-flash",
contents=[prompt] if isinstance(prompt, str) else prompt,
config=config
)
return response.text.strip()
except Exception as e:
logger.error(f"Fehler mit neuer GenAI Lib: {e}")
raise e
raise ImportError("Keine Gemini Bibliothek verfügbar.")
@retry_on_failure
def call_gemini_image(prompt, reference_image_b64=None, aspect_ratio=None):
"""
Generates an image.
- With a reference image: Gemini 2.5 Flash Image.
- Without a reference image: Imagen 4.0.
- NEW: accepts `aspect_ratio` (e.g. "16:9").
- NEW: applies the central corporate design prompt (text-to-image path only).
"""
logger = logging.getLogger(__name__)
api_key = _get_gemini_api_key()
if HAS_NEW_GENAI:
try:
client = genai.Client(api_key=api_key)
# --- CASE A: REFERENCE IMAGE PROVIDED (Gemini 2.5) ---
if reference_image_b64:
try:
from PIL import Image
import io
except ImportError:
raise ImportError("Pillow (PIL) fehlt. Bitte 'pip install Pillow' ausführen.")
logger.info(f"Start Image-to-Image Generation mit gemini-2.5-flash-image. Seitenverhältnis: {aspect_ratio or 'default'}")
# Base64 to PIL image
try:
if "," in reference_image_b64:
reference_image_b64 = reference_image_b64.split(",")[1]
image_data = base64.b64decode(reference_image_b64)
raw_image = Image.open(io.BytesIO(image_data))
except Exception as e:
logger.error(f"Fehler beim Laden des Referenzbildes: {e}")
raise ValueError("Ungültiges Referenzbild.")
# Stricter prompt
full_prompt = (
"Use the provided reference image as the absolute truth. "
f"Place EXACTLY this product into the scene: {prompt}. "
"Do NOT alter the product's design, shape, or colors. "
"Keep the product 100% identical to the reference. "
"Only adjust lighting and perspective to match the scene."
)
# The aspect ratio cannot be controlled directly here because it follows
# the reference image, so it is passed as a hint in the prompt instead.
if aspect_ratio:
full_prompt += f" The final image composition should have an aspect ratio of {aspect_ratio}."
response = client.models.generate_content(
model='gemini-2.5-flash-image',
contents=[raw_image, full_prompt]
)
if response.candidates and response.candidates[0].content.parts:
for part in response.candidates[0].content.parts:
if part.inline_data:
return base64.b64encode(part.inline_data.data).decode('utf-8')
raise ValueError("Gemini 2.5 hat kein Bild zurückgeliefert.")
# --- CASE B: NO REFERENCE IMAGE (Imagen 4) ---
else:
img_config = {
"number_of_images": 1,
"output_mime_type": "image/jpeg",
}
# Add the aspect ratio if one was provided
if aspect_ratio in ["16:9", "9:16", "1:1", "4:3"]:
img_config["aspect_ratio"] = aspect_ratio
logger.info(f"Seitenverhältnis auf {aspect_ratio} gesetzt.")
# Apply the central style
final_prompt = f"{Config.CORPORATE_DESIGN_PROMPT}\n\nTask: {prompt}"
method = getattr(client.models, 'generate_images', None)
if not method:
available_methods = [m for m in dir(client.models) if not m.startswith('_')]
raise AttributeError(f"Client hat keine Image-Methode. Verfügbar: {available_methods}")
candidates = [
'imagen-4.0-generate-001',
'imagen-4.0-fast-generate-001',
'imagen-4.0-ultra-generate-001'
]
last_error = None
for model_name in candidates:
try:
logger.info(f"Versuche Text-zu-Bild mit Modell: {model_name}")
response = method(
model=model_name,
prompt=final_prompt,
config=img_config
)
if response.generated_images:
image_bytes = response.generated_images[0].image.image_bytes
return base64.b64encode(image_bytes).decode('utf-8')
except Exception as e:
logger.warning(f"Modell {model_name} fehlgeschlagen: {e}")
last_error = e
if last_error: raise last_error
raise ValueError("Kein Modell konnte Bilder generieren.")
except Exception as e:
logger.error(f"Fehler bei Image Gen: {e}")
raise e
else:
logger.error("Image Generation erfordert die neue 'google-genai' Bibliothek.")
raise ImportError("Installieren Sie 'google-genai' für Bildgenerierung.")
@retry_on_failure
def call_openai_chat(prompt, temperature=0.3, model=None, response_format_json=False):
return call_gemini_flash(
prompt=prompt,
temperature=temperature,
json_mode=response_format_json,
system_instruction=None
)
def summarize_website_content(raw_text, company_name): return "k.A."
def summarize_wikipedia_article(full_text, company_name): return "k.A."
def evaluate_branche_chatgpt(company_name, website_summary, wiki_absatz): return {}
def evaluate_branches_batch(companies_data): return []
def verify_wiki_article_chatgpt(company_name, parent_name, website, wiki_title, wiki_summary): return {}
def generate_fsm_pitch(company_name, company_short_name, ki_branche, website_summary, wiki_absatz, anzahl_ma, anzahl_techniker, techniker_bucket_ml): return ""
def serp_website_lookup(company_name): return "k.A."
def search_linkedin_contacts(company_name, website, position_query, crm_kurzform, num_results=10): return []
def get_website_raw(url, max_length=30000, verify_cert=False): return "k.A."
def scrape_website_details(url):
logger = logging.getLogger(__name__)
if not url or not isinstance(url, str) or not url.startswith('http'):
return "Keine gültige URL angegeben."
try:
headers = {'User-Agent': random.choice(USER_AGENTS)}
response = requests.get(url, headers=headers, timeout=getattr(Config, 'REQUEST_TIMEOUT', 15), verify=False)
response.raise_for_status()
if 'text/html' not in response.headers.get('Content-Type', ''): return "Kein HTML."
soup = BeautifulSoup(response.content, 'html.parser')
for element in soup(['script', 'style', 'noscript', 'iframe', 'svg', 'header', 'footer', 'nav', 'aside', 'form', 'button', 'a']):
element.decompose()
body = soup.find('body')
text = body.get_text(separator=' ', strip=True) if body else soup.get_text(separator=' ', strip=True)
text = re.sub(r'\s+', ' ', text).strip()
return text[:25000] if text else "Leer."
except Exception as e:
logger.error(f"Fehler URL {url}: {e}")
return "Fehler beim Scraping."
def is_valid_wikipedia_article_url(url): return False
def alignment_demo(sheet_handler): pass
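`retry_on_failure` waits `base_delay * 2**attempt` seconds plus up to one second of random jitter between attempts, and does not wait after the final attempt. A small sketch of the resulting schedule with the jitter left out; `backoff_schedule` is a hypothetical helper used only for illustration, and its defaults mirror the decorator's fallbacks (`MAX_RETRIES` 3, `RETRY_DELAY` 5):

```python
def backoff_schedule(max_retries: int = 3, base_delay: float = 5.0) -> list:
    """Deterministic part of the waits between attempts (jitter omitted)."""
    # A wait only happens when another attempt follows, hence max_retries - 1.
    return [base_delay * (2 ** attempt) for attempt in range(max(max_retries - 1, 0))]


schedule = backoff_schedule()
worst_case_wait = sum(schedule)  # seconds spent sleeping if every attempt fails
```

Since `call_openai_chat` is itself decorated and delegates to the decorated `call_gemini_flash`, the two retry loops nest, so the total wait in a persistent failure can be several times this figure.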

7
cat_log.py Normal file

@@ -0,0 +1,7 @@
import sys
try:
file_path = sys.argv[1] if len(sys.argv) > 1 else 'company-explorer/logs_debug/company_explorer_debug.log'
with open(file_path, 'r') as f:
print(f.read())
except Exception as e:
print(f"Error reading {file_path}: {e}")


@@ -0,0 +1,36 @@
# --- STAGE 1: Build Frontend ---
FROM node:20-slim AS frontend-builder
WORKDIR /build
COPY frontend/package*.json ./
RUN npm install
COPY frontend/ ./
RUN npm run build
# --- STAGE 2: Backend & Runtime ---
FROM python:3.11-slim
WORKDIR /app
# System Dependencies
RUN apt-get update && apt-get install -y \
build-essential \
&& rm -rf /var/lib/apt/lists/*
# Copy Requirements & Install
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy Built Frontend from Stage 1 (To a safe location outside /app)
COPY --from=frontend-builder /build/dist /frontend_static
# Copy Backend Source
COPY backend ./backend
# Environment Variables
ENV PYTHONPATH=/app
ENV PYTHONUNBUFFERED=1
# Expose Port
EXPOSE 8000
# Start FastAPI
CMD ["uvicorn", "backend.app:app", "--host", "0.0.0.0", "--port", "8000"]


@@ -0,0 +1,314 @@
from fastapi import FastAPI, Depends, HTTPException, Query, BackgroundTasks
from fastapi.middleware.cors import CORSMiddleware
from fastapi.staticfiles import StaticFiles
from fastapi.responses import FileResponse
from sqlalchemy.orm import Session, joinedload
from typing import List, Optional, Dict, Any
from pydantic import BaseModel
from datetime import datetime
import os
import sys
from .config import settings
from .lib.logging_setup import setup_logging
# Setup Logging first
setup_logging()
import logging
logger = logging.getLogger(__name__)
from .database import init_db, get_db, Company, Signal, EnrichmentData
from .services.deduplication import Deduplicator
from .services.discovery import DiscoveryService
from .services.scraping import ScraperService
from .services.classification import ClassificationService
# Initialize App
app = FastAPI(
title=settings.APP_NAME,
version=settings.VERSION,
description="Backend for Company Explorer (Robotics Edition)",
root_path="/ce"
)
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
# Service Singletons
scraper = ScraperService()
classifier = ClassificationService()
discovery = DiscoveryService()
# --- Pydantic Models ---
class CompanyCreate(BaseModel):
name: str
city: Optional[str] = None
country: str = "DE"
website: Optional[str] = None
class BulkImportRequest(BaseModel):
names: List[str]
class AnalysisRequest(BaseModel):
company_id: int
force_scrape: bool = False
# --- Events ---
@app.on_event("startup")
def on_startup():
logger.info("Startup Event: Initializing Database...")
try:
init_db()
logger.info("Database initialized successfully.")
except Exception as e:
logger.critical(f"Database init failed: {e}", exc_info=True)
# --- Routes ---
@app.get("/api/health")
def health_check():
return {"status": "ok", "version": settings.VERSION, "db": settings.DATABASE_URL}
@app.get("/api/companies")
def list_companies(
skip: int = 0,
limit: int = 50,
search: Optional[str] = None,
db: Session = Depends(get_db)
):
try:
query = db.query(Company)
if search:
query = query.filter(Company.name.ilike(f"%{search}%"))
total = query.count()
# Sort by ID desc (newest first)
items = query.order_by(Company.id.desc()).offset(skip).limit(limit).all()
return {"total": total, "items": items}
except Exception as e:
logger.error(f"List Companies Error: {e}", exc_info=True)
raise HTTPException(status_code=500, detail=str(e))
@app.get("/api/companies/{company_id}")
def get_company(company_id: int, db: Session = Depends(get_db)):
company = db.query(Company).options(joinedload(Company.signals)).filter(Company.id == company_id).first()
if not company:
raise HTTPException(status_code=404, detail="Company not found")
return company
@app.post("/api/companies/bulk")
def bulk_import_names(req: BulkImportRequest, db: Session = Depends(get_db)):
"""
Quick import for testing. Just a list of names.
"""
logger.info(f"Starting bulk import of {len(req.names)} names.")
try:
added = 0
skipped = 0
# Deduplicator init
try:
dedup = Deduplicator(db)
logger.info("Deduplicator initialized.")
except Exception as e:
logger.warning(f"Deduplicator init failed: {e}")
dedup = None
for name in req.names:
clean_name = name.strip()
if not clean_name: continue
# 1. Simple Deduplication (Exact Name)
exists = db.query(Company).filter(Company.name == clean_name).first()
if exists:
skipped += 1
continue
# 2. Smart Deduplication (if available)
if dedup:
matches = dedup.find_duplicates({"name": clean_name})
if matches and matches[0]['score'] > 95:
logger.info(f"Duplicate found for {clean_name}: {matches[0]['name']}")
skipped += 1
continue
# 3. Create
new_comp = Company(
name=clean_name,
status="NEW" # This triggered the error before
)
db.add(new_comp)
added += 1
db.commit()
logger.info(f"Import success. Added: {added}, Skipped: {skipped}")
return {"added": added, "skipped": skipped}
except Exception as e:
logger.error(f"Bulk Import Failed: {e}", exc_info=True)
db.rollback()
raise HTTPException(status_code=500, detail=str(e))
@app.post("/api/enrich/discover")
def discover_company(req: AnalysisRequest, background_tasks: BackgroundTasks, db: Session = Depends(get_db)):
    """
    Triggers Stage 1: Discovery (Website Search + Wikipedia Search)
    """
    try:
        company = db.query(Company).filter(Company.id == req.company_id).first()
        if not company:
            raise HTTPException(404, "Company not found")
        # Run in background
        background_tasks.add_task(run_discovery_task, company.id)
        return {"status": "queued", "message": f"Discovery started for {company.name}"}
    except HTTPException:
        # Let deliberate HTTP errors (such as the 404 above) pass through
        # instead of converting them into a 500 in the generic handler below.
        raise
    except Exception as e:
        logger.error(f"Discovery Error: {e}")
        raise HTTPException(status_code=500, detail=str(e))
def run_discovery_task(company_id: int):
# New Session for Background Task
from .database import SessionLocal
db = SessionLocal()
try:
company = db.query(Company).filter(Company.id == company_id).first()
if not company: return
logger.info(f"Running Discovery Task for {company.name}")
# 1. Website Search
if not company.website or company.website == "k.A.":
found_url = discovery.find_company_website(company.name, company.city)
if found_url and found_url != "k.A.":
company.website = found_url
logger.info(f"-> Found URL: {found_url}")
# 2. Wikipedia Search
wiki_url = discovery.find_wikipedia_url(company.name)
company.last_wiki_search_at = datetime.utcnow()
existing_wiki = db.query(EnrichmentData).filter(
EnrichmentData.company_id == company.id,
EnrichmentData.source_type == "wikipedia_url"
).first()
if not existing_wiki:
db.add(EnrichmentData(company_id=company.id, source_type="wikipedia_url", content={"url": wiki_url}))
else:
existing_wiki.content = {"url": wiki_url}
existing_wiki.updated_at = datetime.utcnow()
if company.status == "NEW" and company.website and company.website != "k.A.":
company.status = "DISCOVERED"
db.commit()
logger.info(f"Discovery finished for {company.id}")
except Exception as e:
logger.error(f"Background Task Error: {e}", exc_info=True)
db.rollback()
finally:
db.close()
@app.post("/api/enrich/analyze")
def analyze_company(req: AnalysisRequest, background_tasks: BackgroundTasks, db: Session = Depends(get_db)):
company = db.query(Company).filter(Company.id == req.company_id).first()
if not company:
raise HTTPException(404, "Company not found")
if not company.website or company.website == "k.A.":
return {"error": "No website to analyze. Run Discovery first."}
background_tasks.add_task(run_analysis_task, company.id, company.website)
return {"status": "queued"}
def run_analysis_task(company_id: int, url: str):
from .database import SessionLocal
db = SessionLocal()
try:
company = db.query(Company).filter(Company.id == company_id).first()
if not company: return
logger.info(f"Running Analysis Task for {company.name}")
# 1. Scrape Website
scrape_result = scraper.scrape_url(url)
# Save Scrape Data
existing_scrape_data = db.query(EnrichmentData).filter(
EnrichmentData.company_id == company.id,
EnrichmentData.source_type == "website_scrape"
).first()
if "text" in scrape_result and scrape_result["text"]:
if not existing_scrape_data:
db.add(EnrichmentData(company_id=company.id, source_type="website_scrape", content=scrape_result))
else:
existing_scrape_data.content = scrape_result
existing_scrape_data.updated_at = datetime.utcnow()
elif "error" in scrape_result:
logger.warning(f"Scraping failed for {company.name}: {scrape_result['error']}")
# 2. Classify Robotics Potential
if "text" in scrape_result and scrape_result["text"]:
analysis = classifier.analyze_robotics_potential(
company_name=company.name,
website_text=scrape_result["text"]
)
if "error" in analysis:
logger.error(f"Robotics classification failed for {company.name}: {analysis['error']}")
else:
industry = analysis.get("industry")
if industry:
company.industry_ai = industry
# Delete old signals
db.query(Signal).filter(Signal.company_id == company.id).delete()
# Save new signals
potentials = analysis.get("potentials", {})
for signal_type, data in potentials.items():
new_signal = Signal(
company_id=company.id,
signal_type=f"robotics_{signal_type}_potential",
confidence=data.get("score", 0),
value="High" if data.get("score", 0) > 70 else "Medium" if data.get("score", 0) > 30 else "Low",
proof_text=data.get("reason")
)
db.add(new_signal)
company.status = "ENRICHED"
company.last_classification_at = datetime.utcnow()
logger.info(f"Robotics analysis complete for {company.name}.")
db.commit()
logger.info(f"Analysis finished for {company.id}")
except Exception as e:
logger.error(f"Analyze Task Error: {e}", exc_info=True)
db.rollback()
finally:
db.close()
# --- Serve Frontend ---
# Priority 1: Container Path (outside of /app volume)
static_path = "/frontend_static"
# Priority 2: Local Dev Path (relative to this file)
if not os.path.exists(static_path):
static_path = os.path.join(os.path.dirname(__file__), "../static")
if os.path.exists(static_path):
logger.info(f"Serving frontend from {static_path}")
app.mount("/", StaticFiles(directory=static_path, html=True), name="static")
else:
logger.warning(f"Frontend static files not found at {static_path} or local fallback.")
if __name__ == "__main__":
import uvicorn
uvicorn.run("backend.app:app", host="0.0.0.0", port=8000, reload=True)
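`run_analysis_task` maps each potential's score to a label inline: above 70 is "High", above 30 is "Medium", and everything else is "Low". The same rule as a standalone function, which is easier to unit-test; `score_to_label` is a hypothetical name, and the 0-100 scale is an assumption inferred from the thresholds:

```python
def score_to_label(score: float) -> str:
    """Bucket a classifier score using the thresholds from run_analysis_task."""
    if score > 70:
        return "High"
    if score > 30:
        return "Medium"
    return "Low"


labels = [score_to_label(s) for s in (0, 30, 31, 70, 71, 100)]
```

Both thresholds are exclusive, so a score of exactly 70 is still "Medium" and exactly 30 is still "Low".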


@@ -0,0 +1,63 @@
import os
import logging
from typing import Optional
# Try to use pydantic-settings; fall back to os.environ
try:
from pydantic_settings import BaseSettings
class Settings(BaseSettings):
# App Info
APP_NAME: str = "Company Explorer"
VERSION: str = "0.2.2"
DEBUG: bool = True
# Database (Store in App dir for simplicity)
DATABASE_URL: str = "sqlite:////app/companies_v3_final.db"
# API Keys
GEMINI_API_KEY: Optional[str] = None
OPENAI_API_KEY: Optional[str] = None
SERP_API_KEY: Optional[str] = None
# Paths
LOG_DIR: str = "/app/logs_debug"
class Config:
env_file = ".env"
settings = Settings()
except ImportError:
# Fallback when pydantic-settings is not installed
class Settings:
APP_NAME = "Company Explorer"
VERSION = "0.2.1"
DEBUG = True
DATABASE_URL = "sqlite:////app/logs_debug/companies_debug.db"
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
SERP_API_KEY = os.getenv("SERP_API_KEY")
LOG_DIR = "/app/logs_debug"
settings = Settings()
# Ensure Log Dir
os.makedirs(settings.LOG_DIR, exist_ok=True)
# API Key Loading Helper (from file if env missing)
def load_api_key_from_file(filename: str) -> Optional[str]:
try:
if os.path.exists(filename):
with open(filename, 'r') as f:
return f.read().strip()
except Exception as e:
print(f"Could not load key from {filename}: {e}") # Print because logging might not be ready
return None
# Auto-load keys if not in env
if not settings.GEMINI_API_KEY:
settings.GEMINI_API_KEY = load_api_key_from_file("/app/gemini_api_key.txt")
if not settings.SERP_API_KEY:
settings.SERP_API_KEY = load_api_key_from_file("/app/serpapikey.txt")
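The key lookup above prefers the value from the environment and only then falls back to a text file. A stdlib-only sketch of that order, assuming a missing or empty file should yield `None`; `resolve_api_key`, `DEMO_API_KEY` and the demo file are hypothetical and exist only for illustration:

```python
import os
import tempfile
from typing import Optional


def resolve_api_key(env_var: str, fallback_file: str) -> Optional[str]:
    """Return the key from the environment, else from a file, else None."""
    value = os.environ.get(env_var)
    if value:
        return value
    try:
        with open(fallback_file, "r", encoding="utf-8") as f:
            return f.read().strip() or None
    except OSError:
        # Missing or unreadable file: treat as "no key configured"
        return None


# Demo with a throwaway file instead of /app/gemini_api_key.txt
with tempfile.TemporaryDirectory() as tmp:
    key_file = os.path.join(tmp, "demo_key.txt")
    with open(key_file, "w", encoding="utf-8") as f:
        f.write("demo-key\n")
    os.environ.pop("DEMO_API_KEY", None)
    from_file = resolve_api_key("DEMO_API_KEY", key_file)
    os.environ["DEMO_API_KEY"] = "env-key"
    from_env = resolve_api_key("DEMO_API_KEY", key_file)
    os.environ.pop("DEMO_API_KEY", None)
    missing = resolve_api_key("DEMO_API_KEY", os.path.join(tmp, "absent.txt"))
```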


@@ -0,0 +1,113 @@
from sqlalchemy import create_engine, Column, Integer, String, Text, DateTime, ForeignKey, Float, Boolean, JSON
from sqlalchemy.orm import sessionmaker, relationship, declarative_base
from datetime import datetime
from .config import settings
# Setup
engine = create_engine(settings.DATABASE_URL, connect_args={"check_same_thread": False})
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
Base = declarative_base()
# ==============================================================================
# MODELS
# ==============================================================================
class Company(Base):
__tablename__ = "companies"
id = Column(Integer, primary_key=True, index=True)
# Core Identity
name = Column(String, index=True)
website = Column(String, index=True) # Normalized Domain preferred
crm_id = Column(String, unique=True, index=True, nullable=True) # Link to D365
# Classification
industry_crm = Column(String, nullable=True) # The "allowed" industry
industry_ai = Column(String, nullable=True) # The AI suggested industry
# Location
city = Column(String, nullable=True)
country = Column(String, default="DE")
# Workflow Status
status = Column(String, default="NEW", index=True)
# Granular Process Tracking (Timestamps)
created_at = Column(DateTime, default=datetime.utcnow)
updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
last_scraped_at = Column(DateTime, nullable=True)
last_wiki_search_at = Column(DateTime, nullable=True)
last_classification_at = Column(DateTime, nullable=True)
last_signal_check_at = Column(DateTime, nullable=True)
# Relationships
signals = relationship("Signal", back_populates="company", cascade="all, delete-orphan")
enrichment_data = relationship("EnrichmentData", back_populates="company", cascade="all, delete-orphan")
class Signal(Base):
"""
Represents a specific sales signal or potential.
Example: type='has_spa', value='true', proof='Wellnessbereich mit 2000qm'
"""
__tablename__ = "signals"
id = Column(Integer, primary_key=True, index=True)
company_id = Column(Integer, ForeignKey("companies.id"))
signal_type = Column(String, index=True) # e.g. "robotics_cleaning_potential"
confidence = Column(Float, default=0.0) # Classifier score as written by app.py (its 30/70 thresholds imply a 0-100 scale)
value = Column(String) # "High", "Medium", "Yes", "No"
proof_text = Column(Text, nullable=True) # Snippet from website/source
created_at = Column(DateTime, default=datetime.utcnow)
company = relationship("Company", back_populates="signals")
class EnrichmentData(Base):
"""
Stores raw data blobs (HTML, API responses) to allow re-processing.
"""
__tablename__ = "enrichment_data"
id = Column(Integer, primary_key=True, index=True)
company_id = Column(Integer, ForeignKey("companies.id"))
source_type = Column(String) # "website_scrape", "wikipedia_url", "wikipedia_api", "google_serp"
content = Column(JSON) # The raw data
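created_at = Column(DateTime, default=datetime.utcnow)
updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow) # app.py assigns this when a record is overwritten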
company = relationship("Company", back_populates="enrichment_data")
class ImportLog(Base):
"""
Logs bulk imports (e.g. from Excel lists).
"""
__tablename__ = "import_logs"
id = Column(Integer, primary_key=True)
filename = Column(String)
import_type = Column(String) # "crm_dump" or "event_list"
total_rows = Column(Integer)
imported_rows = Column(Integer)
duplicate_rows = Column(Integer)
created_at = Column(DateTime, default=datetime.utcnow)
# ==============================================================================
# UTILS
# ==============================================================================
def init_db():
Base.metadata.create_all(bind=engine)
def get_db():
db = SessionLocal()
try:
yield db
finally:
db.close()
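The `get_db` generator above uses the yield-then-close pattern that FastAPI accepts as a dependency. The following is a minimal sketch of why the `finally` clause matters. It uses the standard library's `sqlite3` in place of SQLAlchemy, and `get_conn`, `session_scope`, and `count_rows_demo` are hypothetical names introduced only for this illustration.

```python
import sqlite3
from contextlib import contextmanager
from typing import Iterator


def get_conn(path: str = ":memory:") -> Iterator[sqlite3.Connection]:
    """Yield a connection and always close it, mirroring get_db()."""
    conn = sqlite3.connect(path)
    try:
        yield conn
    finally:
        # Runs on normal exit and when the caller raises.
        conn.close()


# Outside a web framework, the same generator can be wrapped as a context manager.
session_scope = contextmanager(get_conn)


def count_rows_demo() -> int:
    """Create a throwaway table, insert one row, and return the row count."""
    with session_scope() as conn:
        conn.execute("CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT)")
        conn.execute("INSERT INTO companies (name) VALUES (?)", ("Acme GmbH",))
        return conn.execute("SELECT COUNT(*) FROM companies").fetchone()[0]
```

Wrapping the generator with `contextmanager` may also be convenient for scripts that need a session without going through a request.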

View File

@@ -0,0 +1,56 @@
from abc import ABC, abstractmethod
from typing import List, Optional, Dict, Any
from pydantic import BaseModel


# --- Generic data model ---
# Keeps the app independent of how SuperOffice names its fields.
class LeadData(BaseModel):
    name: str
    website: Optional[str] = None
    city: Optional[str] = None
    country: str = "DE"
    industry: Optional[str] = None
    # Enrichment Data
    robotics_potential_score: int = 0
    robotics_potential_reason: Optional[str] = None
    # Meta
    source_id: Optional[str] = None  # ID in the source system (e.g. SuperOffice ID)


class TaskData(BaseModel):
    subject: str
    description: str
    deadline: Optional[str] = None


# --- The contract (repository interface) ---
class CRMRepository(ABC):
    """
    Abstract base class for all CRM integrations.
    Whether Notion, SuperOffice, or Odoo: every integration must implement these methods.
    """

    @abstractmethod
    def get_name(self) -> str:
        """Returns the name of the system (e.g. 'SuperOffice')."""
        pass

    @abstractmethod
    def find_company(self, name: str, email: Optional[str] = None) -> Optional[str]:
        """Searches for a company and returns its external ID if found."""
        pass

    @abstractmethod
    def create_lead(self, lead: LeadData) -> str:
        """Creates a new lead and returns the external ID."""
        pass

    @abstractmethod
    def update_lead(self, external_id: str, lead: LeadData) -> bool:
        """Updates an existing lead with new enrichment data."""
        pass

    @abstractmethod
    def create_task(self, external_id: str, task: TaskData) -> bool:
        """Creates a task/follow-up for the sales rep on the lead."""
        pass

View File

@@ -0,0 +1,144 @@
import time
import logging
import random
import os
import re
from functools import wraps
from typing import Optional, Union, List
# Try the new Google GenAI library (v1.0+)
try:
from google import genai
from google.genai import types
HAS_NEW_GENAI = True
except ImportError:
HAS_NEW_GENAI = False
# Fall back to the old library
try:
import google.generativeai as old_genai
HAS_OLD_GENAI = True
except ImportError:
HAS_OLD_GENAI = False
from ..config import settings
logger = logging.getLogger(__name__)
# ==============================================================================
# 1. DECORATORS
# ==============================================================================
def retry_on_failure(max_retries: int = 3, delay: float = 2.0):
    """
    Decorator for retrying functions with exponential backoff and jitter.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            last_exception = None
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    last_exception = e
                    # Don't retry on certain fatal errors (can be extended).
                    # Matches both "API Key" and "GEMINI_API_KEY" style messages.
                    if isinstance(e, ValueError) and "api_key" in str(e).lower().replace(" ", "_"):
                        raise
                    # No point in sleeping after the final attempt.
                    if attempt == max_retries - 1:
                        break
                    wait_time = delay * (2 ** attempt) + random.uniform(0, 1)
                    logger.warning(f"Retry {attempt + 1}/{max_retries} for '{func.__name__}' after error: {e}. Waiting {wait_time:.1f}s")
                    time.sleep(wait_time)
            logger.error(f"Function '{func.__name__}' failed after {max_retries} attempts.")
            raise last_exception
        return wrapper
    return decorator
# ==============================================================================
# 2. TEXT TOOLS
# ==============================================================================
def clean_text(text: str) -> str:
"""Removes excess whitespace and control characters."""
if not text:
return ""
text = str(text).strip()
text = re.sub(r'\s+', ' ', text)
return text
def normalize_string(s: str) -> str:
"""Basic normalization (lowercase, stripped)."""
return s.lower().strip() if s else ""
# ==============================================================================
# 3. LLM WRAPPER (GEMINI)
# ==============================================================================
@retry_on_failure(max_retries=3)
def call_gemini(
prompt: Union[str, List[str]],
model_name: str = "gemini-2.0-flash",
temperature: float = 0.3,
json_mode: bool = False,
system_instruction: Optional[str] = None
) -> str:
"""
Unified caller for Gemini API. Prefers new `google.genai` library.
"""
api_key = settings.GEMINI_API_KEY
if not api_key:
raise ValueError("GEMINI_API_KEY is missing in configuration.")
# Option A: New Library (google-genai)
if HAS_NEW_GENAI:
try:
client = genai.Client(api_key=api_key)
config = {
"temperature": temperature,
"top_p": 0.95,
"top_k": 40,
"max_output_tokens": 8192,
}
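if json_mode:
config["response_mime_type"] = "application/json"
# Pass the system instruction on the new-library path too (it was only used in Option B)
if system_instruction:
config["system_instruction"] = system_instruction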
response = client.models.generate_content(
model=model_name,
contents=[prompt] if isinstance(prompt, str) else prompt,
config=config,
)
if not response.text:
raise ValueError("Empty response from Gemini")
return response.text.strip()
except Exception as e:
logger.error(f"Error with google-genai lib: {e}")
if not HAS_OLD_GENAI:
raise e
# Fallthrough to Option B
# Option B: Old Library (google-generativeai)
if HAS_OLD_GENAI:
try:
old_genai.configure(api_key=api_key)
generation_config = {
"temperature": temperature,
"top_p": 0.95,
"top_k": 40,
"max_output_tokens": 8192,
}
if json_mode:
generation_config["response_mime_type"] = "application/json"
model = old_genai.GenerativeModel(
model_name=model_name,
generation_config=generation_config,
system_instruction=system_instruction
)
response = model.generate_content(prompt)
return response.text.strip()
except Exception as e:
logger.error(f"Error with google-generativeai lib: {e}")
raise e
raise ImportError("No Google GenAI library installed (neither google-genai nor google-generativeai).")
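To make the backoff formula in `retry_on_failure` concrete, here is a small sketch that evaluates `delay * (2 ** attempt)` for each attempt index. The random jitter of up to one second is left out so the values are deterministic. `backoff_schedule` is a hypothetical helper, not part of the module.

```python
from typing import List


def backoff_schedule(max_retries: int = 3, delay: float = 2.0) -> List[float]:
    """Base wait in seconds that the formula yields per attempt index (jitter excluded)."""
    return [delay * (2 ** attempt) for attempt in range(max_retries)]


if __name__ == "__main__":
    for attempt, wait in enumerate(backoff_schedule()):
        print(f"attempt index {attempt}: base wait {wait:.1f}s (plus up to 1s jitter)")
```

With the decorator defaults the base waits double on every attempt, so a larger `max_retries` lengthens the worst-case blocking time quickly.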

View File

@@ -0,0 +1,39 @@
import logging
import sys
import os
from logging.handlers import RotatingFileHandler
from ..config import settings
def setup_logging():
log_file = os.path.join(settings.LOG_DIR, "company_explorer_debug.log")
# Create Formatter
formatter = logging.Formatter(
"%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
# File Handler
try:
os.makedirs(settings.LOG_DIR, exist_ok=True) # Ensure the log directory exists
file_handler = RotatingFileHandler(log_file, maxBytes=10*1024*1024, backupCount=5)
file_handler.setFormatter(formatter)
file_handler.setLevel(logging.DEBUG)
except Exception as e:
print(f"FATAL: Could not create log file at {log_file}: {e}")
return
# Console Handler
console_handler = logging.StreamHandler(sys.stdout)
console_handler.setFormatter(formatter)
console_handler.setLevel(logging.INFO) # Keep console clean
# Root Logger Config
root_logger = logging.getLogger()
root_logger.setLevel(logging.DEBUG) # Catch ALL
root_logger.addHandler(file_handler)
root_logger.addHandler(console_handler)
# Silence noisy libs partially
logging.getLogger("uvicorn.access").setLevel(logging.INFO)
logging.getLogger("sqlalchemy.engine").setLevel(logging.INFO) # Set to DEBUG to see SQL queries!
logging.info(f"Logging initialized. Writing to {log_file}")
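One thing to watch with a setup function like this is that calling it twice attaches every handler twice, which duplicates each log line. A minimal sketch of an idempotent variant follows. It assumes a named logger instead of the root logger, and `setup_logging_once` plus the `_ce_configured` marker attribute are hypothetical names for this illustration.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def setup_logging_once(log_dir: str, logger_name: str = "ce_demo") -> logging.Logger:
    """Attach one rotating file handler; repeated calls are no-ops."""
    logger = logging.getLogger(logger_name)
    if getattr(logger, "_ce_configured", False):
        return logger

    os.makedirs(log_dir, exist_ok=True)  # the handler does not create directories
    handler = RotatingFileHandler(
        os.path.join(log_dir, "demo.log"), maxBytes=1024 * 1024, backupCount=2
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    logger._ce_configured = True  # hypothetical marker attribute
    return logger


if __name__ == "__main__":
    log = setup_logging_once(tempfile.mkdtemp())
    setup_logging_once(tempfile.mkdtemp())  # second call changes nothing
    log.info("handlers attached: %d", len(log.handlers))
```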

View File

@@ -0,0 +1,42 @@
import logging
import uuid
from typing import Optional
from ..interfaces import CRMRepository, LeadData, TaskData
logger = logging.getLogger(__name__)
class MockRepository(CRMRepository):
"""
Simulates a CRM. Use this for local dev or tests.
Stores data in memory (lost on restart).
"""
def __init__(self):
self._store = {}
def get_name(self) -> str:
return "Local Mock CRM"
def find_company(self, name: str, email: Optional[str] = None) -> Optional[str]:
# Simple Exact Match Simulation
for lead_id, lead in self._store.items():
if lead.name.lower() == name.lower():
logger.info(f"[MockCRM] Found existing company '{name}' with ID {lead_id}")
return lead_id
return None
def create_lead(self, lead: LeadData) -> str:
new_id = f"MOCK_{uuid.uuid4().hex[:8]}"
self._store[new_id] = lead
logger.info(f"[MockCRM] Created company '{lead.name}' (ID: {new_id}). Total records: {len(self._store)}")
return new_id
def update_lead(self, external_id: str, lead: LeadData) -> bool:
if external_id in self._store:
self._store[external_id] = lead
logger.info(f"[MockCRM] Updated company {external_id} with robotics score: {lead.robotics_potential_score}")
return True
return False
def create_task(self, external_id: str, task: TaskData) -> bool:
logger.info(f"[MockCRM] 🔔 TASK CREATED for {external_id}: '{task.subject}'")
return True

View File

@@ -0,0 +1,40 @@
import logging
import requests
from typing import Optional
from ..interfaces import CRMRepository, LeadData, TaskData
from ..config import settings
logger = logging.getLogger(__name__)
class SuperOfficeRepository(CRMRepository):
def __init__(self, tenant_id: str, api_token: str):
self.base_url = f"https://{tenant_id}.superoffice.com/api/v1"
self.headers = {
"Authorization": f"Bearer {api_token}",
"Accept": "application/json"
}
def get_name(self) -> str:
return "SuperOffice"
def find_company(self, name: str, email: Optional[str] = None) -> Optional[str]:
# TODO: Implement actual OData query
# Example: GET /Contact?$filter=Name eq '{name}'
logger.info(f"[SuperOffice] Searching for '{name}'...")
return None
def create_lead(self, lead: LeadData) -> str:
logger.info(f"[SuperOffice] Creating Lead: {lead.name}")
# TODO: POST /Contact
# Payload mapping: lead.industry -> SuperOffice BusinessId
return "SO_DUMMY_ID_123"
def update_lead(self, external_id: str, lead: LeadData) -> bool:
logger.info(f"[SuperOffice] Updating Lead {external_id} with Score {lead.robotics_potential_score}")
# TODO: PUT /Contact/{id}
# We write the robotics potential into, e.g., a user-defined field (UserDefinedField)
return True
def create_task(self, external_id: str, task: TaskData) -> bool:
logger.info(f"[SuperOffice] Creating Task for {external_id}: {task.subject}")
return True

View File

@@ -0,0 +1,91 @@
import sys
import os
import logging
from sqlalchemy.orm import Session
# Add paths to access legacy and new modules
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../../../"))) # Root for legacy
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../../"))) # Company Explorer Root
# Legacy Import
try:
from _legacy_gsheets_system.google_sheet_handler import GoogleSheetHandler
from _legacy_gsheets_system.config import Config as LegacyConfig
except ImportError as e:
print(f"Failed to import legacy modules: {e}")
sys.exit(1)
# New DB
from backend.database import SessionLocal, Company, init_db
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("LegacyImporter")
def _cell(row, column: str, default=None):
    """Returns the stripped cell value, or `default` for empty/NaN cells."""
    value = str(row.get(column, '')).strip()
    if not value or value.lower() in ('nan', 'none'):
        return default
    return value


def migrate():
    logger.info("Starting migration from Google Sheets...")

    # 1. Connect to GSheets
    LegacyConfig.load_api_keys()  # Ensure keys are loaded
    try:
        handler = GoogleSheetHandler()
        df = handler.get_sheet_as_dataframe("CRM_Accounts")  # Assuming standard sheet name
    except Exception as e:
        logger.error(f"GSheet Connection failed: {e}")
        return

    if df is None or df.empty:
        logger.warning("No data found in sheet.")
        return

    logger.info(f"Found {len(df)} rows. Transforming...")

    # 2. Connect to New DB
    init_db()  # Ensure tables exist
    db = SessionLocal()
    count = 0
    skipped = 0
    try:
        for _, row in df.iterrows():
            name = _cell(row, 'CRM Name')
            if not name:
                continue

            # Check duplicate (simple check by name for migration)
            exists = db.query(Company).filter(Company.name == name).first()
            if exists:
                skipped += 1
                continue

            # Create Company. _cell() keeps pandas NaN from being stored as the string "nan".
            comp = Company(
                name=name,
                website=_cell(row, 'CRM Website'),
                crm_id=_cell(row, 'CRM ID'),
                city=_cell(row, 'CRM Ort'),
                country=_cell(row, 'CRM Land', 'DE'),
                status="IMPORTED"  # Mark as imported so we know to enrich them
            )
            # Map old industry if useful, otherwise leave blank for re-classification
            # comp.industry_ai = str(row.get('Chat Vorschlag Branche', ''))
            db.add(comp)
            count += 1

            # Commit in batches of 100, then report what was actually committed.
            if count % 100 == 0:
                db.commit()
                logger.info(f"Committed {count}...")

        db.commit()
        logger.info(f"Migration finished. Imported: {count}, Skipped: {skipped}")
    except Exception as e:
        logger.error(f"Migration error: {e}")
        db.rollback()
    finally:
        db.close()


if __name__ == "__main__":
    migrate()

View File

@@ -0,0 +1,77 @@
import json
import logging
import os
from typing import Dict, Any, List
from ..lib.core_utils import call_gemini
from ..config import settings
logger = logging.getLogger(__name__)
ALLOWED_INDUSTRIES_FILE = os.path.join(os.path.dirname(__file__), "../data/allowed_industries.json")
class ClassificationService:
def __init__(self):
self.allowed_industries = self._load_allowed_industries()
def _load_allowed_industries(self) -> List[str]:
try:
with open(ALLOWED_INDUSTRIES_FILE, 'r', encoding='utf-8') as f:
return json.load(f)
except Exception as e:
logger.error(f"Failed to load allowed industries: {e}")
return ["Sonstige"]
def analyze_robotics_potential(self, company_name: str, website_text: str) -> Dict[str, Any]:
"""
Analyzes the company for robotics potential based on website content.
Returns strict JSON.
"""
if not website_text or len(website_text) < 100:
return {"error": "Insufficient text content"}
prompt = f"""
You are a Senior B2B Market Analyst for 'Roboplanet', a robotics distributor.
Your job is to analyze a target company based on their website text and determine their potential for using robots.
--- TARGET COMPANY ---
Name: {company_name}
Website Content (Excerpt):
{website_text[:15000]}
--- ALLOWED INDUSTRIES (STRICT) ---
You MUST assign the company to exactly ONE of these industries. If unsure, choose the closest match or "Sonstige".
{json.dumps(self.allowed_industries, ensure_ascii=False)}
--- ANALYSIS TASKS ---
1. **Industry Classification:** Pick one from the list.
2. **Robotics Potential Scoring (0-100):**
- **Cleaning:** Does the company manage large floors, hospitals, hotels, or public spaces? (Keywords: Hygiene, Cleaning, SPA, Facility Management)
- **Transport/Logistics:** Do they move goods internally? (Keywords: Warehouse, Intralogistics, Production line, Hospital logistics)
- **Security:** Do they have large perimeters or night patrols? (Keywords: Werkschutz, Security, Monitoring)
- **Service:** Do they interact with guests/patients? (Keywords: Reception, Restaurant, Nursing)
3. **Explanation:** A short, strategic reason for the scoring (German).
--- OUTPUT FORMAT (JSON ONLY) ---
{{
"industry": "String (from list)",
"summary": "Short business summary (German)",
"potentials": {{
"cleaning": {{ "score": 0-100, "reason": "..." }},
"transport": {{ "score": 0-100, "reason": "..." }},
"security": {{ "score": 0-100, "reason": "..." }},
"service": {{ "score": 0-100, "reason": "..." }}
}}
}}
"""
try:
response_text = call_gemini(
prompt=prompt,
json_mode=True,
temperature=0.2 # Low temp for consistency
)
return json.loads(response_text)
except Exception as e:
logger.error(f"Classification failed: {e}")
return {"error": str(e)}
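`analyze_robotics_potential` returns whatever JSON the model produced, so the "strict JSON" promise depends on the model following the prompt. A hedged sketch of a post-parse validation step follows. It assumes the output shape described in the prompt, and `validate_classification` and `POTENTIAL_KEYS` are hypothetical names that do not exist in the service.

```python
import json
from typing import Any, Dict, List

POTENTIAL_KEYS = ("cleaning", "transport", "security", "service")


def validate_classification(raw: str, allowed_industries: List[str]) -> Dict[str, Any]:
    """Parse the model's JSON reply and coerce it into the expected shape."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object")

    industry = data.get("industry")
    if industry not in allowed_industries:
        industry = "Sonstige"  # same fallback label the prompt offers

    raw_potentials = data.get("potentials")
    if not isinstance(raw_potentials, dict):
        raw_potentials = {}

    potentials = {}
    for key in POTENTIAL_KEYS:
        entry = raw_potentials.get(key)
        if not isinstance(entry, dict):
            entry = {}
        try:
            score = int(entry.get("score", 0))
        except (TypeError, ValueError):
            score = 0
        potentials[key] = {
            "score": max(0, min(100, score)),  # clamp to the documented 0-100 range
            "reason": str(entry.get("reason", "")),
        }

    return {
        "industry": industry,
        "summary": str(data.get("summary", "")),
        "potentials": potentials,
    }
```

Clamping and defaulting, rather than rejecting the whole reply, is one possible design choice here. A stricter pipeline might prefer to raise and let the retry decorator ask the model again.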

View File

@@ -0,0 +1,209 @@
import logging
import re
from collections import Counter
from typing import List, Tuple, Dict, Any, Optional
from sqlalchemy.orm import Session
from sqlalchemy import select
# External libs (must be in requirements.txt)
from thefuzz import fuzz
from ..database import Company
from ..lib.core_utils import clean_text, normalize_string
logger = logging.getLogger(__name__)
# --- Configuration (Ported from Legacy) ---
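SCORE_THRESHOLD = 80 # Minimum score when the domain or the location also matches
SCORE_THRESHOLD_WEAK = 95 # Stricter minimum when only the name matches (no domain/location evidence)
MIN_NAME_FOR_DOMAIN = 70 # Ported from legacy; not referenced in this module yet
CITY_MISMATCH_PENALTY = 30 # Subtracted when both sides have a city and they differ
COUNTRY_MISMATCH_PENALTY = 40 # Subtracted when both sides have a country and they differ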
STOP_TOKENS_BASE = {
'gmbh','mbh','ag','kg','ug','ohg','se','co','kgaa','inc','llc','ltd','sarl',
'holding','gruppe','group','international','solutions','solution','service','services',
'deutschland','austria','germany','technik','technology','technologies','systems','systeme',
'logistik','logistics','industries','industrie','management','consulting','vertrieb','handel',
'international','company','gesellschaft','mbh&co','mbhco','werke','werk'
}
# ==============================================================================
# Helpers
# ==============================================================================
def _tokenize(s: str) -> List[str]:
if not s: return []
return re.split(r"[^a-z0-9]+", str(s).lower())
def split_tokens(name: str) -> List[str]:
if not name: return []
tokens = [t for t in _tokenize(name) if len(t) >= 3]
return [t for t in tokens if t not in STOP_TOKENS_BASE]
def clean_name_for_scoring(norm_name: str) -> Tuple[str, set]:
toks = split_tokens(norm_name)
return " ".join(toks), set(toks)
# ==============================================================================
# Core Deduplication Logic
# ==============================================================================
class Deduplicator:
def __init__(self, db: Session):
self.db = db
self.reference_data = [] # Cache for DB records
self.domain_index = {}
self.token_freq = Counter()
self.token_index = {}
self._load_reference_data()
def _load_reference_data(self):
"""
Loads minimal dataset from DB into RAM for fast fuzzy matching.
Optimized for 10k-50k records.
"""
logger.info("Loading reference data for deduplication...")
query = self.db.query(Company.id, Company.name, Company.website, Company.city, Company.country)
companies = query.all()
for c in companies:
norm_name = normalize_string(c.name)
norm_domain = normalize_string(c.website) # Simplified, should extract domain
record = {
'id': c.id,
'name': c.name,
'normalized_name': norm_name,
'normalized_domain': norm_domain,
'city': normalize_string(c.city),
'country': normalize_string(c.country)
}
self.reference_data.append(record)
# Build Indexes
if norm_domain:
self.domain_index.setdefault(norm_domain, []).append(record)
# Token Frequency
_, toks = clean_name_for_scoring(norm_name)
for t in toks:
self.token_freq[t] += 1
self.token_index.setdefault(t, []).append(record)
logger.info(f"Loaded {len(self.reference_data)} records for deduplication.")
def _choose_rarest_token(self, norm_name: str) -> Optional[str]:
_, toks = clean_name_for_scoring(norm_name)
if not toks: return None
# Sort by frequency (asc) then length (desc)
lst = sorted(list(toks), key=lambda x: (self.token_freq.get(x, 10**9), -len(x)))
return lst[0] if lst else None
def find_duplicates(self, candidate: Dict[str, Any]) -> List[Dict[str, Any]]:
"""
Checks a single candidate against the loaded index.
Returns list of matches with score >= Threshold.
"""
# Prepare Candidate
c_norm_name = normalize_string(candidate.get('name', ''))
c_norm_domain = normalize_string(candidate.get('website', ''))
c_city = normalize_string(candidate.get('city', ''))
c_country = normalize_string(candidate.get('country', ''))
candidates_to_check = {} # Map ID -> Record
# 1. Domain Match (Fastest)
if c_norm_domain and c_norm_domain in self.domain_index:
for r in self.domain_index[c_norm_domain]:
candidates_to_check[r['id']] = r
# 2. Rarest Token Match (Blocking)
rtok = self._choose_rarest_token(c_norm_name)
if rtok and rtok in self.token_index:
for r in self.token_index[rtok]:
candidates_to_check[r['id']] = r
if not candidates_to_check:
return []
# 3. Scoring
matches = []
for db_rec in candidates_to_check.values():
score, details = self._calculate_similarity(
cand={'n': c_norm_name, 'd': c_norm_domain, 'c': c_city, 'ct': c_country},
ref=db_rec
)
# Threshold Logic (Weak vs Strong)
is_weak = (details['domain_match'] == 0 and not (details['loc_match']))
threshold = SCORE_THRESHOLD_WEAK if is_weak else SCORE_THRESHOLD
if score >= threshold:
matches.append({
'company_id': db_rec['id'],
'name': db_rec['name'],
'score': score,
'details': details
})
matches.sort(key=lambda x: x['score'], reverse=True)
return matches
def _calculate_similarity(self, cand, ref):
# Data Prep
n1, n2 = cand['n'], ref['normalized_name']
# Exact Name Shortcut
if n1 and n1 == n2:
return 100, {'exact': True, 'name_score': 100, 'domain_match': 0, 'loc_match': 0, 'penalties': 0}
# Domain
d1, d2 = cand['d'], ref['normalized_domain']
domain_match = 1 if (d1 and d2 and d1 == d2) else 0
# Location
city_match = 1 if (cand['c'] and ref['city'] and cand['c'] == ref['city']) else 0
country_match = 1 if (cand['ct'] and ref['country'] and cand['ct'] == ref['country']) else 0
loc_match = city_match and country_match
# Name Fuzzy Score
clean1, _ = clean_name_for_scoring(n1)
clean2, _ = clean_name_for_scoring(n2)
if clean1 and clean2:
ts = fuzz.token_set_ratio(clean1, clean2)
pr = fuzz.partial_ratio(clean1, clean2)
ss = fuzz.token_sort_ratio(clean1, clean2)
name_score = max(ts, pr, ss)
else:
name_score = 0
# Penalties
penalties = 0
if cand['ct'] and ref['country'] and not country_match:
penalties += COUNTRY_MISMATCH_PENALTY
if cand['c'] and ref['city'] and not city_match:
penalties += CITY_MISMATCH_PENALTY
# Final Calc
# Base weights: Domain is king (100), Name is mandatory (unless domain match)
total = 0
if domain_match:
total = 100
else:
total = name_score
if loc_match:
total += 10 # Bonus
total -= penalties
# Capping
total = min(100, max(0, total))
return total, {
'name_score': name_score,
'domain_match': domain_match,
'loc_match': loc_match,
'penalties': penalties
}
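The comment in `_load_reference_data` notes that the website should be reduced to its domain before indexing. Until then, `https://www.example.com/` and `example.com/about` never share a `domain_index` key. A minimal sketch of such a normalizer using only the standard library follows. `extract_domain` is a hypothetical helper and is not wired into `Deduplicator` here.

```python
from urllib.parse import urlparse


def extract_domain(url: str) -> str:
    """Return the lowercased host without a leading 'www.', or '' if none."""
    if not url:
        return ""
    url = url.strip()
    if "://" not in url:
        # Without '//' urlparse treats the host as part of the path.
        url = "http://" + url
    try:
        host = urlparse(url).hostname or ""  # .hostname is lowercased and drops the port
    except ValueError:
        return ""
    if host.startswith("www."):
        host = host[4:]
    return host
```

If this were used, it would have to be applied on both sides: to `c.website` when building the index and to the candidate's `website` in `find_duplicates`.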

View File

@@ -0,0 +1,126 @@
import logging
import requests
import re
from typing import Optional, Dict, Tuple
from urllib.parse import urlparse
from ..config import settings
from ..lib.core_utils import retry_on_failure, normalize_string
logger = logging.getLogger(__name__)
# Domains to ignore when looking for official company homepage
BLACKLIST_DOMAINS = {
"linkedin.com", "xing.com", "facebook.com", "instagram.com", "twitter.com",
"northdata.de", "northdata.com", "firmenwissen.de", "creditreform.de",
"dnb.com", "kompass.com", "wer-zu-wem.de", "kununu.com", "glassdoor.com",
"stepstone.de", "indeed.com", "monster.de", "youtube.com", "wikipedia.org"
}
class DiscoveryService:
def __init__(self):
self.api_key = settings.SERP_API_KEY
if not self.api_key:
logger.warning("SERP_API_KEY not set. Discovery features will fail.")
@retry_on_failure(max_retries=2)
def find_company_website(self, company_name: str, city: Optional[str] = None) -> str:
"""
Uses Google Search via SerpAPI to find the most likely official homepage.
Returns "k.A." if nothing credible is found.
"""
if not self.api_key:
return "k.A."
query = f"{company_name} offizielle Website"
if city:
query += f" {city}"
logger.info(f"Searching website for: {query}")
try:
params = {
"engine": "google",
"q": query,
"api_key": self.api_key,
"num": 5,
"gl": "de",
"hl": "de"
}
response = requests.get("https://serpapi.com/search", params=params, timeout=15)
response.raise_for_status()
data = response.json()
if "organic_results" not in data:
return "k.A."
for result in data["organic_results"]:
link = result.get("link", "")
if self._is_credible_url(link):
# Simple heuristic: If the company name is part of the domain, high confidence
# Otherwise, take the first credible result.
return link
return "k.A."
except Exception as e:
logger.error(f"SerpAPI Error: {e}")
return "k.A."
@retry_on_failure(max_retries=2)
def find_wikipedia_url(self, company_name: str) -> str:
"""
Searches for a specific German Wikipedia article.
"""
if not self.api_key:
return "k.A."
query = f"{company_name} Wikipedia"
try:
params = {
"engine": "google",
"q": query,
"api_key": self.api_key,
"num": 3,
"gl": "de",
"hl": "de"
}
response = requests.get("https://serpapi.com/search", params=params, timeout=15)
response.raise_for_status()
data = response.json()
for result in data.get("organic_results", []):
link = result.get("link", "")
if "de.wikipedia.org/wiki/" in link:
# Basic validation: Is the title roughly the company?
title = result.get("title", "").replace(" Wikipedia", "")
if self._check_name_similarity(company_name, title):
return link
return "k.A."
except Exception as e:
logger.error(f"Wiki Search Error: {e}")
return "k.A."
def _is_credible_url(self, url: str) -> bool:
"""Filters out social media, directories, and junk."""
if not url: return False
try:
domain = urlparse(url).netloc.lower().replace("www.", "")
if domain in BLACKLIST_DOMAINS:
return False
# Check for subdomains of blacklist (e.g. de.linkedin.com)
for bad in BLACKLIST_DOMAINS:
if domain.endswith("." + bad):
return False
return True
except Exception:
return False
def _check_name_similarity(self, name1: str, name2: str) -> bool:
"""Simple fuzzy check for validation."""
n1 = normalize_string(name1)
n2 = normalize_string(name2)
# Very permissive: if one is contained in the other.
# Empty strings are rejected, since "" is a substring of everything.
return bool(n1 and n2) and (n1 in n2 or n2 in n1)
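The comment in `find_company_website` mentions a heuristic (prefer results whose domain contains the company name), but the loop returns the first credible link. One way that heuristic might look is sketched below, assuming the links have already passed `_is_credible_url`. `pick_best_link` and `_name_tokens` are hypothetical helpers, and the URLs in the usage example are placeholders.

```python
import re
from typing import List, Optional
from urllib.parse import urlparse


def _name_tokens(company_name: str) -> List[str]:
    """Lowercase alphanumeric tokens of at least 3 characters."""
    return [t for t in re.split(r"[^a-z0-9]+", company_name.lower()) if len(t) >= 3]


def pick_best_link(company_name: str, credible_links: List[str]) -> Optional[str]:
    """Prefer a link whose host contains a name token; otherwise take the first link."""
    tokens = _name_tokens(company_name)
    for link in credible_links:
        host = (urlparse(link).hostname or "").lower()
        if any(tok in host for tok in tokens):
            return link
    return credible_links[0] if credible_links else None


if __name__ == "__main__":
    links = [
        "https://www.branchenbuch.example/roboplanet",
        "https://www.roboplanet.example/",
    ]
    print(pick_best_link("Roboplanet GmbH", links))
```

Legal-form tokens such as "gmbh" are not filtered in this sketch. Reusing a stop-token list like the one in the deduplication module would reduce false positives.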

View File

@@ -0,0 +1,82 @@
import logging
import requests
import random
import re
from bs4 import BeautifulSoup
from typing import Optional, Dict
from ..lib.core_utils import clean_text, retry_on_failure
logger = logging.getLogger(__name__)
USER_AGENTS = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0'
]
class ScraperService:
def __init__(self, timeout: int = 15):
self.timeout = timeout
@retry_on_failure(max_retries=2)
def scrape_url(self, url: str) -> Dict[str, str]:
"""
Fetches a URL and returns cleaned text content + meta info.
"""
if not url.startswith("http"):
url = "https://" + url
try:
headers = {'User-Agent': random.choice(USER_AGENTS)}
# verify=False is risky but often needed for poorly configured corporate sites
response = requests.get(url, headers=headers, timeout=self.timeout, verify=False)
response.raise_for_status()
# Check Content Type
content_type = response.headers.get('Content-Type', '').lower()
if 'text/html' not in content_type:
logger.warning(f"Skipping non-HTML content for {url}: {content_type}")
return {"error": "Not HTML"}
return self._parse_html(response.content)
except requests.exceptions.SSLError:
# Retry with HTTP if HTTPS fails
if url.startswith("https://"):
logger.info(f"SSL failed for {url}, retrying with http://...")
return self.scrape_url(url.replace("https://", "http://"))
raise
except Exception as e:
logger.error(f"Scraping failed for {url}: {e}")
return {"error": str(e)}
def _parse_html(self, html_content: bytes) -> Dict[str, str]:
soup = BeautifulSoup(html_content, 'html.parser')
# 1. Cleanup Junk
for element in soup(['script', 'style', 'noscript', 'iframe', 'svg', 'header', 'footer', 'nav', 'aside', 'form', 'button']):
element.decompose()
# 2. Extract Title & Meta Description
title = soup.title.string if soup.title else ""
meta_desc = ""
meta_tag = soup.find('meta', attrs={'name': 'description'})
if meta_tag:
meta_desc = meta_tag.get('content', '')
# 3. Extract Main Text
# Prefer body, fallback to full soup
body = soup.find('body')
raw_text = body.get_text(separator=' ', strip=True) if body else soup.get_text(separator=' ', strip=True)
cleaned_text = clean_text(raw_text)
# 4. Extract Emails (Basic Regex)
emails = set(re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', raw_text))
return {
"title": clean_text(title),
"description": clean_text(meta_desc),
"text": cleaned_text[:25000], # Limit to avoid context overflow
"emails": sorted(emails)[:5] # Limit to 5; sorted so the selection is deterministic
}
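Taking the first five items of a `set` gives an arbitrary selection, because sets do not preserve insertion order. A small sketch of an order-preserving alternative follows. It keeps addresses in the order they appear on the page, on the assumption that earlier addresses tend to be the more prominent contact ones. `extract_emails` is a hypothetical helper that reuses the regex from `_parse_html`.

```python
import re
from typing import Dict, List

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def extract_emails(text: str, limit: int = 5) -> List[str]:
    """Return up to `limit` unique addresses in first-seen order.

    Deduplication is case-insensitive; the first spelling encountered is kept.
    """
    seen: Dict[str, str] = {}
    for match in EMAIL_RE.findall(text or ""):
        seen.setdefault(match.lower(), match)
    return list(seen.values())[:limit]
```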

View File

@@ -0,0 +1,103 @@
import os
import logging
from sqlalchemy.orm import Session
from ..database import Company
from ..interfaces import LeadData, TaskData, CRMRepository
from ..repositories.mock import MockRepository
from ..repositories.superoffice import SuperOfficeRepository
from ..config import settings
logger = logging.getLogger(__name__)
class CRMFactory:
_instance: CRMRepository = None
@classmethod
def get_repository(cls) -> CRMRepository:
if cls._instance:
return cls._instance
crm_type = os.getenv("CRM_TYPE", "MOCK").upper()
if crm_type == "SUPEROFFICE":
# Load credentials securely from settings/env
tenant = os.getenv("SO_TENANT_ID", "")
token = os.getenv("SO_API_TOKEN", "")
logger.info("Initializing SuperOffice Repository...")
cls._instance = SuperOfficeRepository(tenant, token)
else:
logger.info("Initializing Mock Repository (Default)...")
cls._instance = MockRepository()
return cls._instance
class SyncService:
def __init__(self, db: Session):
self.db = db
self.repo = CRMFactory.get_repository()
    def sync_company(self, company_id: int) -> dict:
        """
        Pushes a local company to the external CRM.
        """
        local_company = self.db.query(Company).filter(Company.id == company_id).first()
        if not local_company:
            return {"error": "Company not found"}

        # 1. Map Data
        # Extract the highest robotics potential score.
        # Signal.confidence is documented as 0.0 to 1.0, so it is scaled to 0-100
        # to match robotics_potential_score and the task threshold below.
        best_confidence = 0.0
        reason = ""
        for sig in local_company.signals:
            confidence = sig.confidence or 0.0
            if confidence > best_confidence:
                best_confidence = confidence
                reason = f"{sig.signal_type} ({sig.value})"
        max_score = int(round(best_confidence * 100))

        lead_data = LeadData(
            name=local_company.name,
            website=local_company.website,
            city=local_company.city,
            country=local_company.country,
            industry=local_company.industry_ai,  # We suggest our AI industry
            robotics_potential_score=max_score,
            robotics_potential_reason=reason
        )

        # 2. Check if already linked
        external_id = local_company.crm_id

        # 3. Check if exists in CRM (by name) if not linked yet
        if not external_id:
            external_id = self.repo.find_company(local_company.name)

        action = "none"
        if external_id:
            # Update
            success = self.repo.update_lead(external_id, lead_data)
            if success:
                action = "updated"
                # If we found it by search, link it locally
                if not local_company.crm_id:
                    local_company.crm_id = external_id
                    self.db.commit()
        else:
            # Create
            new_id = self.repo.create_lead(lead_data)
            if new_id:
                action = "created"
                local_company.crm_id = new_id
                self.db.commit()
                # Create a task for the sales rep if high potential
                if max_score > 70:
                    self.repo.create_task(new_id, TaskData(
                        subject="🔥 Hot Robotics Lead",
                        description=f"AI detected high potential ({max_score}%). Reason: {reason}. Please check website."
                    ))

        return {
            "status": "success",
            "action": action,
            "crm": self.repo.get_name(),
            "external_id": local_company.crm_id
        }
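The score mapping in `sync_company` mixes two scales: `Signal.confidence` is documented on the model as 0.0 to 1.0, while the task threshold compares against 70 and the task text prints a percentage. Here is a sketch of the conversion in isolation, assuming the 0.0 to 1.0 range holds. `SignalStub` is a hypothetical stand-in for the ORM row and `top_signal_score` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class SignalStub:
    """Hypothetical stand-in for the ORM Signal row."""
    signal_type: str
    value: str
    confidence: Optional[float]  # assumed 0.0 to 1.0, as documented on the model


def top_signal_score(signals: Iterable[SignalStub]) -> Tuple[int, str]:
    """Return (score 0-100, reason) for the most confident signal."""
    best: Optional[SignalStub] = None
    for sig in signals:
        conf = sig.confidence or 0.0  # treat NULL confidence as 0
        if conf > 0 and (best is None or conf > (best.confidence or 0.0)):
            best = sig
    if best is None:
        return 0, ""
    score = int(round(min(1.0, best.confidence) * 100))
    return score, f"{best.signal_type} ({best.value})"
```

Keeping the comparison in floats and converting once at the end avoids the truncation that `int()` applies to values below 1.0.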

View File

@@ -0,0 +1,12 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Company Explorer (Robotics)</title>
</head>
<body class="bg-slate-950 text-slate-100">
<div id="root"></div>
<script type="module" src="/src/main.tsx"></script>
</body>
</html>

View File

@@ -0,0 +1,31 @@
{
"name": "company-explorer-frontend",
"private": true,
"version": "0.1.0",
"type": "module",
"scripts": {
"dev": "vite",
"build": "tsc && vite build",
"preview": "vite preview"
},
"dependencies": {
"@tanstack/react-table": "^8.10.7",
"axios": "^1.6.2",
"clsx": "^2.0.0",
"lucide-react": "^0.294.0",
"react": "^18.2.0",
"react-dom": "^18.2.0",
"tailwind-merge": "^2.1.0"
},
"devDependencies": {
"@types/node": "^20.10.4",
"@types/react": "^18.2.43",
"@types/react-dom": "^18.2.17",
"@vitejs/plugin-react": "^4.2.1",
"autoprefixer": "^10.4.16",
"postcss": "^8.4.32",
"tailwindcss": "^3.3.6",
"typescript": "^5.3.3",
"vite": "^5.0.8"
}
}

View File

@@ -0,0 +1,6 @@
export default {
plugins: {
tailwindcss: {},
autoprefixer: {},
},
}

View File

@@ -0,0 +1,116 @@
import { useState, useEffect } from 'react'
import axios from 'axios'
import { CompanyTable } from './components/CompanyTable'
import { ImportWizard } from './components/ImportWizard'
import { Inspector } from './components/Inspector' // NEW
import { LayoutDashboard, UploadCloud, Search, RefreshCw } from 'lucide-react'
// Base URL detection (Production vs Dev)
const API_BASE = import.meta.env.BASE_URL === '/ce/' ? '/ce/api' : '/api';
interface Stats {
total: number;
}
function App() {
const [stats, setStats] = useState<Stats>({ total: 0 })
const [refreshKey, setRefreshKey] = useState(0)
const [isImportOpen, setIsImportOpen] = useState(false)
const [selectedCompanyId, setSelectedCompanyId] = useState<number | null>(null) // NEW
const fetchStats = async () => {
try {
const res = await axios.get(`${API_BASE}/companies?limit=1`)
setStats({ total: res.data.total })
} catch (e) {
console.error("Failed to fetch stats", e)
}
}
useEffect(() => {
fetchStats()
}, [refreshKey])
const handleCompanySelect = (id: number) => {
setSelectedCompanyId(id)
}
const handleCloseInspector = () => {
setSelectedCompanyId(null)
}
return (
<div className="min-h-screen bg-slate-950 text-slate-200 font-sans">
<ImportWizard
isOpen={isImportOpen}
onClose={() => setIsImportOpen(false)}
apiBase={API_BASE}
onSuccess={() => setRefreshKey(k => k + 1)}
/>
{/* Inspector Sidebar */}
<Inspector
companyId={selectedCompanyId}
onClose={handleCloseInspector}
apiBase={API_BASE}
/>
{/* Header */}
<header className="border-b border-slate-800 bg-slate-900/50 sticky top-0 z-10 backdrop-blur-md">
<div className="max-w-7xl mx-auto px-4 sm:px-6 lg:px-8 h-16 flex items-center justify-between">
<div className="flex items-center gap-3">
<div className="p-2 bg-blue-600 rounded-lg">
<LayoutDashboard className="h-6 w-6 text-white" />
</div>
<div>
<h1 className="text-xl font-bold text-white tracking-tight">Company Explorer</h1>
<p className="text-xs text-blue-400 font-medium">ROBOTICS EDITION <span className="text-slate-600 ml-2">v0.2.2 (New DB Path)</span></p>
</div>
</div>
<div className="flex items-center gap-4">
<div className="text-sm text-slate-400">
<span className="text-white font-bold">{stats.total}</span> Companies
</div>
<button
onClick={() => setRefreshKey(k => k + 1)}
className="p-2 hover:bg-slate-800 rounded-full transition-colors text-slate-400 hover:text-white"
title="Refresh Data"
>
<RefreshCw className="h-5 w-5" />
</button>
<button
className="flex items-center gap-2 bg-blue-600 hover:bg-blue-500 text-white px-4 py-2 rounded-md font-medium text-sm transition-all shadow-lg shadow-blue-900/20"
onClick={() => setIsImportOpen(true)}
>
<UploadCloud className="h-4 w-4" />
Import List
</button>
</div>
</div>
</header>
{/* Main Content */}
<main className="max-w-7xl mx-auto px-4 sm:px-6 lg:px-8 py-8">
<div className="mb-6 flex gap-4">
<div className="relative flex-1 max-w-md">
<Search className="absolute left-3 top-2.5 h-5 w-5 text-slate-500" />
<input
type="text"
placeholder="Search companies..."
className="w-full bg-slate-900 border border-slate-700 text-slate-200 rounded-md pl-10 pr-4 py-2 focus:ring-2 focus:ring-blue-500 focus:border-transparent outline-none"
/>
</div>
</div>
<div className="bg-slate-900 border border-slate-800 rounded-xl overflow-hidden shadow-xl">
<CompanyTable key={refreshKey} apiBase={API_BASE} onRowClick={handleCompanySelect} /> {/* NEW PROP */}
</div>
</main>
</div>
)
}
export default App

View File

@@ -0,0 +1,205 @@
import { useState, useEffect, useMemo } from 'react'
import {
useReactTable,
getCoreRowModel,
flexRender,
createColumnHelper,
} from '@tanstack/react-table'
import axios from 'axios'
import { Play, Globe, AlertCircle, Search as SearchIcon, Loader2 } from 'lucide-react'
import clsx from 'clsx'
type Company = {
id: number
name: string
city: string | null
country: string
website: string | null
status: string
industry_ai: string | null
}
const columnHelper = createColumnHelper<Company>()
interface CompanyTableProps {
apiBase: string
onRowClick: (companyId: number) => void // NEW PROP
}
export function CompanyTable({ apiBase, onRowClick }: CompanyTableProps) {
const [data, setData] = useState<Company[]>([])
const [loading, setLoading] = useState(true)
const [processingId, setProcessingId] = useState<number | null>(null)
const fetchData = async () => {
setLoading(true)
try {
const res = await axios.get(`${apiBase}/companies?limit=100`)
setData(res.data.items)
} catch (e) {
console.error(e)
} finally {
setLoading(false)
}
}
useEffect(() => {
fetchData()
}, [])
const triggerDiscovery = async (id: number) => {
setProcessingId(id)
try {
await axios.post(`${apiBase}/enrich/discover`, { company_id: id })
// Refresh shortly after to pick up the results, then clear the row spinner
setTimeout(async () => {
await fetchData()
setProcessingId(null)
}, 2000)
} catch (e) {
alert("Discovery Error")
setProcessingId(null)
}
}
const triggerAnalysis = async (id: number) => {
setProcessingId(id)
try {
await axios.post(`${apiBase}/enrich/analyze`, { company_id: id })
// Same pattern as discovery: refresh, then clear the row spinner
setTimeout(async () => {
await fetchData()
setProcessingId(null)
}, 2000)
} catch (e) {
alert("Analysis Error")
setProcessingId(null)
}
}
const columns = useMemo(() => [
columnHelper.accessor('name', {
header: 'Company',
cell: info => <span className="font-semibold text-white">{info.getValue()}</span>,
}),
columnHelper.accessor('city', {
header: 'Location',
cell: info => (
<div className="text-slate-400 text-sm">
{info.getValue() || '-'} <span className="text-slate-600">({info.row.original.country})</span>
</div>
),
}),
columnHelper.accessor('website', {
header: 'Website',
cell: info => {
const url = info.getValue()
if (url && url !== "k.A.") {
return (
<a href={url} target="_blank" rel="noreferrer" className="flex items-center gap-1 text-blue-400 hover:underline text-sm">
<Globe className="h-3 w-3" /> {new URL(url).hostname.replace('www.', '')}
</a>
)
}
return <span className="text-slate-600 text-sm italic">Not found</span>
},
}),
columnHelper.accessor('status', {
header: 'Status',
cell: info => {
const s = info.getValue()
return (
<span className={clsx(
"px-2 py-0.5 rounded-full text-[10px] font-bold uppercase tracking-wider",
s === 'NEW' && "bg-slate-800 text-slate-400 border border-slate-700",
s === 'DISCOVERED' && "bg-blue-500/10 text-blue-400 border border-blue-500/20",
s === 'ENRICHED' && "bg-green-500/10 text-green-400 border border-green-500/20",
)}>
{s}
</span>
)
}
}),
columnHelper.display({
id: 'actions',
header: '',
cell: info => {
const c = info.row.original
const isProcessing = processingId === c.id
if (isProcessing) {
return <Loader2 className="h-4 w-4 animate-spin text-blue-500" />
}
// Action Logic
if (c.status === 'NEW' || !c.website || c.website === "k.A.") {
return (
<button
onClick={(e) => { e.stopPropagation(); triggerDiscovery(c.id); }}
className="flex items-center gap-1 px-2 py-1 bg-slate-800 hover:bg-slate-700 text-xs font-medium text-slate-300 rounded border border-slate-700 transition-colors"
title="Search Website & Wiki"
>
<SearchIcon className="h-3 w-3" /> Find
</button>
)
}
// Ready for Analysis
return (
<button
onClick={(e) => { e.stopPropagation(); triggerAnalysis(c.id); }}
className="flex items-center gap-1 px-2 py-1 bg-blue-600/10 hover:bg-blue-600/20 text-blue-400 text-xs font-medium rounded border border-blue-500/20 transition-colors"
title="Run AI Analysis"
>
<Play className="h-3 w-3 fill-current" /> Analyze
</button>
)
}
})
], [processingId])
const table = useReactTable({
data,
columns,
getCoreRowModel: getCoreRowModel(),
})
if (loading && data.length === 0) return <div className="p-8 text-center text-slate-500">Loading companies...</div>
if (data.length === 0) return (
<div className="p-12 text-center">
<div className="inline-block p-4 bg-slate-800 rounded-full mb-4">
<AlertCircle className="h-8 w-8 text-slate-500" />
</div>
<h3 className="text-lg font-medium text-white">No companies found</h3>
<p className="text-slate-400 mt-2">Import a list to get started.</p>
</div>
)
return (
<div className="overflow-x-auto">
<table className="w-full text-left border-collapse">
<thead>
{table.getHeaderGroups().map(headerGroup => (
<tr key={headerGroup.id} className="border-b border-slate-800 bg-slate-900/50">
{headerGroup.headers.map(header => (
<th key={header.id} className="p-4 text-xs font-medium text-slate-500 uppercase tracking-wider">
{flexRender(header.column.columnDef.header, header.getContext())}
</th>
))}
</tr>
))}
</thead>
<tbody className="divide-y divide-slate-800/50">
{table.getRowModel().rows.map(row => (
// Make row clickable
<tr
key={row.id}
onClick={() => onRowClick(row.original.id)} // NEW: Row Click Handler
className="hover:bg-slate-800/30 transition-colors cursor-pointer"
>
{row.getVisibleCells().map(cell => (
<td key={cell.id} className="p-4 align-middle">
{flexRender(cell.column.columnDef.cell, cell.getContext())}
</td>
))}
</tr>
))}
</tbody>
</table>
</div>
)
}

View File

@@ -0,0 +1,85 @@
import { useState } from 'react'
import axios from 'axios'
import { X, UploadCloud } from 'lucide-react'
interface ImportWizardProps {
isOpen: boolean
onClose: () => void
onSuccess: () => void
apiBase: string
}
export function ImportWizard({ isOpen, onClose, onSuccess, apiBase }: ImportWizardProps) {
const [text, setText] = useState("")
const [loading, setLoading] = useState(false)
if (!isOpen) return null
const handleImport = async () => {
const lines = text.split('\n').map(l => l.trim()).filter(l => l.length > 0)
if (lines.length === 0) return
setLoading(true)
try {
await axios.post(`${apiBase}/companies/bulk`, { names: lines })
setText("")
onSuccess()
onClose()
} catch (e: any) {
console.error(e)
const msg = e.response?.data?.detail || e.message || "Unknown Error"
alert(`Import failed: ${msg}`)
} finally {
setLoading(false)
}
}
return (
<div className="fixed inset-0 bg-black/70 backdrop-blur-sm z-50 flex items-center justify-center p-4">
<div className="bg-slate-900 border border-slate-700 rounded-xl w-full max-w-lg shadow-2xl">
{/* Header */}
<div className="flex items-center justify-between p-4 border-b border-slate-800">
<h3 className="text-lg font-semibold text-white flex items-center gap-2">
<UploadCloud className="h-5 w-5 text-blue-400" />
Quick Import
</h3>
<button onClick={onClose} className="text-slate-400 hover:text-white">
<X className="h-5 w-5" />
</button>
</div>
{/* Body */}
<div className="p-4 space-y-4">
<p className="text-sm text-slate-400">
Paste company names below (one per line). Duplicates in the database will be skipped automatically.
</p>
<textarea
className="w-full h-64 bg-slate-950 border border-slate-700 rounded-lg p-3 text-sm text-slate-200 focus:ring-2 focus:ring-blue-600 outline-none font-mono"
placeholder="Company A&#10;Company B&#10;Company C..."
value={text}
onChange={e => setText(e.target.value)}
/>
</div>
{/* Footer */}
<div className="p-4 border-t border-slate-800 flex justify-end gap-3">
<button
onClick={onClose}
className="px-4 py-2 text-sm font-medium text-slate-400 hover:text-white"
>
Cancel
</button>
<button
onClick={handleImport}
disabled={loading || !text.trim()}
className="px-4 py-2 bg-blue-600 hover:bg-blue-500 text-white rounded-md text-sm font-medium disabled:opacity-50 disabled:cursor-not-allowed"
>
{loading ? "Importing..." : "Import Companies"}
</button>
</div>
</div>
</div>
)
}

View File

@@ -0,0 +1,123 @@
import { useEffect, useState } from 'react'
import axios from 'axios'
import { X, ExternalLink, Bot as Robot, Briefcase, Calendar } from 'lucide-react' // lucide-react exports `Bot`, not `Robot`
import clsx from 'clsx'
interface InspectorProps {
companyId: number | null
onClose: () => void
apiBase: string
}
type Signal = {
signal_type: string
confidence: number
value: string
proof_text: string
}
type CompanyDetail = {
id: number
name: string
website: string | null
industry_ai: string | null
status: string
created_at: string
signals: Signal[]
}
export function Inspector({ companyId, onClose, apiBase }: InspectorProps) {
const [data, setData] = useState<CompanyDetail | null>(null)
const [loading, setLoading] = useState(false)
useEffect(() => {
if (!companyId) return
setLoading(true)
setData(null) // drop the previous company's details so stale data is never shown
axios.get(`${apiBase}/companies/${companyId}`)
.then(res => setData(res.data))
.catch(console.error)
.finally(() => setLoading(false))
}, [companyId, apiBase])
if (!companyId) return null
return (
<div className="fixed inset-y-0 right-0 w-[500px] bg-slate-900 border-l border-slate-800 shadow-2xl transform transition-transform duration-300 ease-in-out z-40 overflow-y-auto">
{loading ? (
<div className="p-8 text-slate-500">Loading details...</div>
) : !data ? (
<div className="p-8 text-red-400">Failed to load data.</div>
) : (
<div className="flex flex-col h-full">
{/* Header */}
<div className="p-6 border-b border-slate-800 bg-slate-950/50">
<div className="flex justify-between items-start mb-4">
<h2 className="text-xl font-bold text-white leading-tight">{data.name}</h2>
<button onClick={onClose} className="text-slate-400 hover:text-white">
<X className="h-6 w-6" />
</button>
</div>
<div className="flex flex-wrap gap-2 text-sm">
{data.website && data.website !== "k.A." && (
<a href={data.website} target="_blank" rel="noreferrer" className="flex items-center gap-1 text-blue-400 hover:underline">
<ExternalLink className="h-3 w-3" /> {new URL(data.website).hostname.replace('www.', '')}
</a>
)}
{data.industry_ai && (
<span className="flex items-center gap-1 px-2 py-0.5 bg-slate-800 text-slate-300 rounded border border-slate-700">
<Briefcase className="h-3 w-3" /> {data.industry_ai}
</span>
)}
</div>
</div>
{/* Robotics Scorecard */}
<div className="p-6 space-y-6">
<div>
<h3 className="text-sm font-semibold text-slate-400 uppercase tracking-wider mb-3 flex items-center gap-2">
<Robot className="h-4 w-4" /> Robotics Potential
</h3>
<div className="grid grid-cols-2 gap-4">
{['cleaning', 'transport', 'security', 'service'].map(type => {
const sig = data.signals.find(s => s.signal_type.includes(type))
const score = sig ? sig.confidence : 0
return (
<div key={type} className="bg-slate-800/50 p-3 rounded-lg border border-slate-700">
<div className="flex justify-between mb-1">
<span className="text-sm text-slate-300 capitalize">{type}</span>
<span className={clsx("text-sm font-bold", score > 70 ? "text-green-400" : score > 30 ? "text-yellow-400" : "text-slate-500")}>
{score}%
</span>
</div>
<div className="w-full bg-slate-700 h-1.5 rounded-full overflow-hidden">
<div
className={clsx("h-full rounded-full", score > 70 ? "bg-green-500" : score > 30 ? "bg-yellow-500" : "bg-slate-600")}
style={{ width: `${score}%` }}
/>
</div>
{sig?.proof_text && (
<p className="text-xs text-slate-500 mt-2 line-clamp-2" title={sig.proof_text}>
"{sig.proof_text}"
</p>
)}
</div>
)
})}
</div>
</div>
{/* Meta Info */}
<div className="pt-6 border-t border-slate-800">
<div className="text-xs text-slate-500 flex items-center gap-2">
<Calendar className="h-3 w-3" /> Added: {new Date(data.created_at).toLocaleDateString()}
</div>
</div>
</div>
</div>
)}
</div>
)
}

View File

@@ -0,0 +1,19 @@
@tailwind base;
@tailwind components;
@tailwind utilities;
/* Custom Scrollbar for dark theme */
::-webkit-scrollbar {
width: 8px;
height: 8px;
}
::-webkit-scrollbar-track {
background: #1e293b;
}
::-webkit-scrollbar-thumb {
background: #475569;
border-radius: 4px;
}
::-webkit-scrollbar-thumb:hover {
background: #64748b;
}

View File

@@ -0,0 +1,10 @@
import React from 'react'
import ReactDOM from 'react-dom/client'
import App from './App.tsx'
import './index.css'
ReactDOM.createRoot(document.getElementById('root')!).render(
<React.StrictMode>
<App />
</React.StrictMode>,
)

View File

@@ -0,0 +1 @@
/// <reference types="vite/client" />

View File

@@ -0,0 +1,11 @@
/** @type {import('tailwindcss').Config} */
export default {
content: [
"./index.html",
"./src/**/*.{js,ts,jsx,tsx}",
],
theme: {
extend: {},
},
plugins: [],
}

View File

@@ -0,0 +1,16 @@
import { defineConfig } from 'vite'
import react from '@vitejs/plugin-react'
// https://vitejs.dev/config/
export default defineConfig({
plugins: [react()],
base: '/ce/', // Critical for Nginx Reverse Proxy
server: {
proxy: {
'/api': {
target: 'http://localhost:8000', // Forward API calls to FastAPI during dev
changeOrigin: true
}
}
}
})

View File

@@ -0,0 +1,15 @@
fastapi
uvicorn
sqlalchemy
pydantic
pydantic-settings
requests
beautifulsoup4
pandas
openpyxl
thefuzz
python-Levenshtein
google-genai
pillow
python-multipart
python-dotenv

39
create_dashboard.py Normal file
View File

@@ -0,0 +1,39 @@
import os
import time

from notion_client import Client


def final_push():
    # --- CONFIGURATION ---
    # Read the integration token from the environment rather than committing
    # the secret to the repository (the variable name NOTION_TOKEN is an assumption).
    token = os.environ["NOTION_TOKEN"]
    database_id = "acf0e7e1-fff2-425b-81a1-00fbc76085b8"
    notion = Client(auth=token)
    print(f"🚀 Starting injection into DB: {database_id}")
    sectors = [
        {"name": "Hotellerie", "desc": "Relevant für Empfang, Reinigung Zimmer, Parkplatz & Spa. Fokus auf Wellness vs. Business."},
        {"name": "Pflege & Kliniken", "desc": "Hohe Hygienestandards, Desinfektion, Transport von Mahlzeiten/Wäsche."},
        {"name": "Lager & Produktion", "desc": "Großflächenreinigung, Objektschutz (Security), Intralogistik-Transport."},
        {"name": "Einzelhandel", "desc": "Frequenzorientierte Reinigung, interaktive Verkaufsförderung (Ads), Nachtreinigung."},
    ]
    for s in sectors:
        try:
            # Property names must match the columns of the target Notion database
            notion.pages.create(
                parent={"database_id": database_id},
                properties={
                    "Name": {"title": [{"text": {"content": s["name"]}}]},
                    "Beschreibung": {"rich_text": [{"text": {"content": s["desc"]}}]},
                    "Art": {"select": {"name": "Sector"}},
                },
            )
            print(f"{s['name']} was created successfully.")
            time.sleep(0.5)  # short pause between requests
        except Exception as e:
            print(f" ❌ Error for {s['name']}: {e}")
    print("\n🏁 DONE. Check your Notion dashboard now!")


if __name__ == "__main__":
    final_push()

View File

@@ -152,6 +152,17 @@
<!-- IMPORTANT: relative link for the reverse proxy -->
<a href="/gtm/" class="btn">Starten &rarr;</a>
</div>
<!-- Company Explorer (Robotics) -->
<div class="card">
<span class="card-icon">🤖</span>
<h2>Company Explorer</h2>
<p>
Das zentrale CRM-Data-Mining Tool. Importieren, Deduplizieren und Anreichern von Firmenlisten mit Fokus auf Robotik-Potential.
</p>
<!-- Links directly to the frontend -->
<a href="/ce/" class="btn">Starten &rarr;</a>
</div>
</div>

View File

@@ -16,6 +16,7 @@ services:
- dashboard
- b2b-app
- market-frontend
- company-explorer # NEW
# --- DASHBOARD (Landing Page) ---
dashboard:
@@ -25,6 +26,25 @@ services:
container_name: gemini-dashboard
restart: unless-stopped
# --- COMPANY EXPLORER (Robotics Edition) ---
company-explorer:
build:
context: ./company-explorer
dockerfile: Dockerfile
container_name: company-explorer
restart: unless-stopped
volumes:
# Sideloading: Source Code (Hot Reload)
- ./company-explorer:/app
# Keys
- ./gemini_api_key.txt:/app/gemini_api_key.txt
- ./serpapikey.txt:/app/serpapikey.txt
# Logs (Debug)
- ./Log_from_docker:/app/logs_debug
environment:
- PYTHONUNBUFFERED=1
# Port 8000 is internal only
# --- B2B MARKETING ASSISTANT ---
b2b-app:
build:
@@ -124,11 +144,13 @@ services:
dns-monitor:
image: alpine
container_name: dns-monitor
dns:
- 8.8.8.8
- 1.1.1.1
environment:
- SUBDOMAINS=floke,floke-ai,floke-gitea,floke-ha,floke-n8n
- TZ=Europe/Berlin
volumes:
- ./dns-monitor:/app
command: /app/monitor.sh
restart: unless-stopped

View File

@@ -72,5 +72,20 @@ http {
proxy_connect_timeout 1200s;
proxy_send_timeout 1200s;
}
location /ce/ {
# Company Explorer (Robotics Edition)
# The trailing slash on proxy_pass is important: it makes Nginx strip the /ce/ prefix before forwarding.
proxy_pass http://company-explorer:8000/;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
# Explicit timeouts
proxy_read_timeout 1200s;
proxy_connect_timeout 1200s;
proxy_send_timeout 1200s;
}
}
}

View File

@@ -861,3 +861,31 @@ Um eine stabile Erreichbarkeit der Dienste zu gewährleisten, wurde eine Docker-
* **[DuckDNS & Monitoring Setup](duckdns_setup.md):** Documentation for setting up the `duckdns` updater and the `dns-monitor` sidecar to fix connection problems and caching errors.
---
## 13. How It Works in Detail
### Analysis Depth of "Digital Signals" in the Market Intelligence Tool
Digital signals at target companies are identified in a pragmatic two-stage process that balances analytical depth against performance:
1. **Full parse of the company homepage:**
    * The main URL of the company under analysis is crawled in full exactly once. The extracted text (`homepage_text`) gives the language model the basic context it needs to understand the company's business model and core messages.
2. **Analysis of search-result snippets for signals:**
    * The targeted search for specific signals (e.g. competitor products in use, open positions, strategic initiatives) uses SerpAPI (Google Search).
    * **Important:** The target URLs found in the search results are **not** visited and parsed again. Only the **title (`title`)** and the **text snippet (`snippet`)** generated by the search engine are passed to the language model as `evidence`.
    * This approach is a deliberate trade-off: it is very fast and cost-efficient because it avoids expensive crawling. The snippets are usually informative enough to validate the presence of a signal with high probability.
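
The second stage can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the function names `build_evidence` and `build_prompt`, the `max_items` limit, and the 2000-character cut of `homepage_text` are assumptions made for this example. The input is assumed to be a list of search hits carrying `title`, `snippet`, and `link` fields, as found in SerpAPI's `organic_results`.

```python
from typing import Iterable


def build_evidence(organic_results: Iterable[dict], max_items: int = 5) -> list[dict]:
    """Keep only title and snippet of each search hit; the target URL is never fetched."""
    evidence = []
    for hit in organic_results:
        title = (hit.get("title") or "").strip()
        snippet = (hit.get("snippet") or "").strip()
        if not title and not snippet:
            continue  # nothing usable as proof
        # The link is stored for traceability only, it is not crawled
        evidence.append({"title": title, "snippet": snippet, "source": hit.get("link")})
        if len(evidence) >= max_items:
            break
    return evidence


def build_prompt(homepage_text: str, signal: str, evidence: list[dict]) -> str:
    """Combine the homepage context (stage 1) with the snippet evidence (stage 2)."""
    lines = [f"- {e['title']}: {e['snippet']}" for e in evidence]
    return (
        f"Company context (homepage):\n{homepage_text[:2000]}\n\n"
        f"Signal to validate: {signal}\n"
        "Evidence (search titles and snippets only):\n" + "\n".join(lines)
    )
```

Keeping the evidence builder free of network calls makes the trade-off explicit: the only expensive requests are the single homepage crawl and the search query itself.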
### Future Extension: Detailed Analysis of Job Postings
A detailed, automated analysis of job postings is earmarked as a valuable future extension.
* **Strategic value:**
    * **Insight into the company's economic situation:** The type and number of open positions (e.g. sales vs. engineering vs. administration) can indicate whether a company is currently in a growth, consolidation, or crisis phase.
    * **IT landscape & tech stack:** IT job postings in particular are a gold mine for technographic data. They often explicitly list the programming languages, frameworks, databases, ERP systems (e.g. SAP, D365), and cloud providers in use. This allows an unusually detailed "IT landscape" of a target company to be drawn up.
* **Challenge:**
    * The technical effort for a robust system that finds career pages, parses the various job portals, and extracts the relevant information is immense.
* **Status:**
    * This extension is earmarked for a later development phase and, given its complexity, should be implemented within a clearly delimited, manageable scope.
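
As a rough idea of what the smallest useful slice of this extension might look like, the sketch below counts technology mentions in job posting texts that have already been obtained. Everything here is hypothetical: the `TECH_KEYWORDS` catalogue, its categories, and the function name `extract_tech_stack` are placeholders for illustration, and the hard part (finding and parsing career pages) is deliberately left out.

```python
import re
from collections import Counter

# Hypothetical keyword catalogue; a real one would be far larger and curated.
TECH_KEYWORDS = {
    "erp": ["SAP", "D365"],
    "language": ["Python", "Java", "C#"],
    "cloud": ["AWS", "Azure"],
}


def extract_tech_stack(job_ads: list[str]) -> dict[str, Counter]:
    """Count in how many postings each keyword appears, grouped by category."""
    found = {category: Counter() for category in TECH_KEYWORDS}
    for ad in job_ads:
        for category, keywords in TECH_KEYWORDS.items():
            for keyword in keywords:
                # Lookarounds act as word boundaries, so "Java" does not match "JavaScript"
                pattern = rf"(?<!\w){re.escape(keyword)}(?!\w)"
                if re.search(pattern, ad, flags=re.IGNORECASE):
                    found[category][keyword] += 1
    return found
```

Counting postings rather than raw occurrences keeps a single verbose job ad from dominating the picture of a company's stack.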