Gemini Code Assistant Context
CRITICAL RULE: DOCUMENTATION PRESERVATION (DO NOT IGNORE)
IT IS STRICTLY FORBIDDEN TO DELETE DOCUMENTATION OR TO REPLACE IT WITH PLACEHOLDERS SUCH AS "... (rest of the file)".
This has happened repeatedly in the past and has led to massive data loss in critical files such as MIGRATION_PLAN.md.
Rules for the agent:
- Never delete large blocks of text unless the user explicitly requests it.
- Always check `git diff` before creating a commit. If a documentation file loses 100 lines, that is almost always a mistake.
- When updating documentation: only add new information or precisely correct outdated content. Never overwrite the rest of the file.
- If you need to restore a file, use `git log -p <filename>` and make sure that you really restore everything.
Important Notes
- Project documentation: The primary and most comprehensive documentation for this project lives in `readme.md`. Please consult that file for a detailed understanding of the architecture and the individual modules.
- Git repository: This project is managed via a Git repository. All code changes are versioned. See the section "Git Workflow & Conventions" for our working rules.
- IMPORTANT: The AI agent can commit changes but, for security reasons, often cannot run `git push`. Please run `git push` manually when the agent reports this.
Git Workflow & Conventions
Finishing the workday with #fertig
To complete a work step or a task, use the command `#fertig`.
IMPORTANT: Do not use `/fertig` or plain `fertig`. Only the command with the hash sign (`#`) is recognized correctly.
When you enter `#fertig`, the agent performs the following steps:
- Analysis: The agent checks whether any code changes have been made since the last commit.
- Summary: It generates an automatic work summary based on the code changes.
- Status update: The agent runs the script `python3 dev_session.py --report-status` in the background.
  - The time invested in the current session is calculated and stored in Notion.
  - A new status report containing the summary is attached to the Notion task.
  - The task's status in Notion is set to "Done" (or another appropriate status).
- Commit & Push: If code changes exist, a commit is created and a `git push` is requested interactively.
Project Overview
This project is a Python-based system for automated company data enrichment and lead generation. It focuses on identifying B2B companies with high potential for robotics automation (Cleaning, Transport, Security, Service).
The system architecture has evolved from a CLI-based toolset to a modern web application (company-explorer) backed by Docker containers.
Current Status (Jan 15, 2026) - Company Explorer (Robotics Edition v0.5.0)
1. Contacts Management (v0.5)
- Full CRUD: Integrated Contact Management system with direct editing capabilities.
- Global List View: Dedicated view for all contacts across all companies with search and filter.
- Data Model: Supports advanced fields like Academic Title, Role Interpretation (Decision Maker vs. User), and Marketing Automation Status.
- Bulk Import: CSV-based bulk import for contacts that automatically creates missing companies and prevents duplicates via email matching.
2. UI/UX Modernization
- Light/Dark Mode: Full theme support with toggle.
- Grid Layout: Unified card-based layout for both Company and Contact lists.
- Mobile Responsiveness: Optimized Inspector overlay and navigation for mobile devices.
- Tabbed Inspector: Clean separation between Company Overview and Contact Management within the details pane.
3. Advanced Configuration (Settings)
- Industry Verticals: Database-backed configuration for target industries (Description, Focus Flag, Primary Product).
- Job Role Mapping: Configurable patterns (Regex/Text) to map job titles on business cards to internal roles (e.g., "CTO" -> "Innovation Driver").
- Robotics Categories: Existing AI reasoning logic remains configurable via the UI.
4. Robotics Potential Analysis (v2.3)
- Chain-of-Thought Logic: The AI analysis (`ClassificationService`) uses multi-step reasoning to evaluate physical infrastructure.
- Provider vs. User: Strict differentiation logic implemented.
5. Web Scraping & Legal Data (v2.2)
- Impressum Scraping: 2-Hop Strategy and Root Fallback logic.
- Manual Overrides: Users can manually correct Wikipedia, Website, and Impressum URLs directly in the UI.
Lessons Learned & Best Practices
- Numeric Extraction (German Locale):
  - Problem: "1.005 Mitarbeiter" was extracted as "1" (the dot was treated as a decimal point).
  - Solution: Implemented context-aware logic. If a number has a dot followed by exactly 3 digits (and no comma), it is treated as a thousands separator.
  - Revenue: For revenue (`is_revenue=True`), dots are generally treated as decimals (e.g. "375.6 Mio") unless unambiguous multiple dots exist. Billion/Mrd is converted to 1000 Million.
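The heuristic above can be sketched as follows. This is a minimal illustration, not the production code; `parse_german_number` is a hypothetical name.

```python
import re

def parse_german_number(text: str, is_revenue: bool = False) -> float:
    """Parse a German-formatted number out of a text snippet.

    Heuristic: a single dot followed by exactly three digits (with no
    comma present) is a thousands separator; for revenue values a
    single dot is treated as a decimal point instead.
    """
    match = re.search(r"\d[\d.,]*", text)
    if not match:
        raise ValueError(f"no number found in {text!r}")
    num = match.group()
    if "," in num:
        # German decimal comma: "1.234,56" -> 1234.56
        return float(num.replace(".", "").replace(",", "."))
    if num.count(".") >= 2:
        # Multiple dots are unambiguous thousands separators: "1.234.567"
        return float(num.replace(".", ""))
    if "." in num:
        integer, frac = num.split(".")
        if len(frac) == 3 and not is_revenue:
            # "1.005" -> 1005 (thousands separator)
            return float(integer + frac)
        # "375.6" with is_revenue=True -> 375.6 (decimal)
        return float(num)
    return float(num)
```

The `is_revenue` flag mirrors the rule in the bullet above: headcount-style numbers favor the thousands-separator reading, revenue favors the decimal reading.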
- The Wolfra/Greilmeier/Erding Fixes (Advanced Metric Parsing):
  - Problem: Simple regex parsers fail on complex sentences with multiple numbers, concatenated years, or misleading prefixes.
  - Solution (Hybrid Extraction & Regression Testing):
    - LLM Guidance: The LLM provides an `expected_value` (e.g., "8.000 m²").
    - Robust Python Parser (`MetricParser`): This parser aggressively cleans the `expected_value` (stripping units like "m²") to get a numerical target. It then intelligently searches the full text for this target, ignoring other numbers (like "2" in "An 2 Standorten").
    - Specific Bug Fixes:
      - Year-Suffix: Logic to detect and remove trailing years from concatenated numbers (e.g., "802020" -> "80").
      - Year-Prefix: Logic to ignore year-like numbers (1900-2100) if other, more likely candidates exist in the text.
      - Sentence Truncation: Removed overly aggressive logic that cut off sentences after a hyphen, which caused metrics at the end of a phrase to be missed.
  - Safeguard: These specific cases are now locked in via `test_metric_parser.py` to prevent future regressions.
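The hint-guided search and the year-suffix fix can be sketched together. This is an illustrative simplification, not the real `MetricParser`; it assumes the `expected_value` hint uses thousands-separator dots.

```python
import re
from typing import Optional

def find_expected_metric(text: str, expected_value: str) -> Optional[float]:
    """Find the LLM-suggested metric in the full text.

    Sketch of the hybrid approach: clean the expected_value hint
    (strip units like "m²"), then scan every number candidate in the
    text for a match, ignoring unrelated numbers.
    """
    digits = re.sub(r"[^\d.,]", "", expected_value)
    if not digits:
        return None
    # Assumes dots in the hint are thousands separators ("8.000" -> 8000)
    target = float(digits.replace(".", "").replace(",", "."))

    for candidate in re.findall(r"\d[\d.]*", text):
        value = float(candidate.replace(".", ""))
        # Year-suffix fix: strip a concatenated trailing year ("802020" -> "80")
        if value != target and len(candidate) > 4 and re.search(r"(19|20)\d\d$", candidate):
            value = float(candidate[:-4].replace(".", ""))
        if value == target:
            return value
    return None
```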
- LLM JSON Stability:
  - Problem: LLMs often wrap JSON in Markdown blocks (```json), causing `json.loads()` to fail.
  - Solution: ALWAYS use a `clean_json_response` helper that strips markers before parsing. Never trust raw LLM output.
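A minimal sketch of such a helper (the actual `clean_json_response` in the codebase may handle more cases):

```python
import json
import re

def clean_json_response(raw: str):
    """Strip Markdown code-fence wrappers before parsing LLM output."""
    cleaned = raw.strip()
    # Remove an opening fence such as ```json or ``` at the start
    cleaned = re.sub(r"^```(?:json)?\s*", "", cleaned)
    # Remove a closing ``` fence at the end
    cleaned = re.sub(r"\s*```$", "", cleaned)
    return json.loads(cleaned)
```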
- LLM Structure Inconsistency:
  - Problem: Even with `json_mode=True`, models sometimes wrap the result in a list `[...]` instead of a flat object `{...}`, breaking frontend property access.
  - Solution: Implement a check: `if isinstance(result, list): result = result[0]`.
- Scraping Navigation:
  - Problem: Searching for "Impressum" only on the scraped URL (which might be a subpage found via Google) often fails.
  - Solution: Always implement a fallback to the Root Domain AND a 2-Hop check via the "Kontakt" page.
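The fallback order can be sketched like this (a simplified illustration; `impressum_candidates` and the `page_links` mapping of link text to href are hypothetical names, not the real scraper API):

```python
from urllib.parse import urljoin, urlparse

def impressum_candidates(scraped_url: str, page_links: dict) -> list:
    """Ordered list of URLs to check for an Impressum page.

    Order: direct "Impressum" link on the scraped page, a root-domain
    guess, then the "Kontakt" page as hop 1 of the 2-hop check.
    """
    candidates = []
    # Hop 0: an Impressum link on the scraped page itself
    for text, href in page_links.items():
        if "impressum" in text.lower():
            candidates.append(urljoin(scraped_url, href))
    # Root fallback: the scraped URL may be a deep subpage
    parsed = urlparse(scraped_url)
    candidates.append(f"{parsed.scheme}://{parsed.netloc}/impressum")
    # 2-hop: the "Kontakt" page often links to the Impressum
    for text, href in page_links.items():
        if "kontakt" in text.lower():
            candidates.append(urljoin(scraped_url, href))
    # Deduplicate while preserving order
    return list(dict.fromkeys(candidates))
```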
- Frontend State Management:
  - Problem: Users didn't see when a background job finished.
  - Solution: A polling mechanism (`setInterval`) tied to an `isProcessing` state is superior to static timeouts for long-running AI tasks.
- Hyper-Personalized Marketing Engine (v3.2) - "Deep Persona Injection":
  - Problem: Marketing texts were too generic and didn't reflect the specific psychological or operative profile of the different target roles (e.g., CFO vs. Facility Manager).
  - Solution (Deep Sync & Prompt Hardening):
    - Extended Schema: Added `description`, `convincing_arguments`, and `kpis` to the `Persona` database model to store richer profile data.
    - Notion Master Sync: Updated the synchronization logic to pull these deep insights directly from the Notion "Personas / Roles" database.
    - Role-Centric Prompts: The `MarketingMatrix` generator was re-engineered to inject the persona's "Mindset" and "KPIs" into the prompt.
  - Example (Healthcare):
    - Infrastructure Lead: Now focuses on "IT Security", "DSGVO Compliance", and "WLAN integration".
    - Economic Buyer (CFO): Focuses on "ROI Amortization", "Reduction of Overtime", and "Flexible Financing (RaaS)".
  - Verification: Confirmed that the transition from a company-specific Opener (e.g., observing staff shortages at Klinikum Erding) to the role-specific Intro (e.g., pitching transport robots to reduce walking distances for nursing directors) is seamless and logical.
Metric Parser - Regression Tests
To ensure the stability and accuracy of the metric extraction logic, a dedicated test suite (/company-explorer/backend/tests/test_metric_parser.py) has been created. It covers the following critical, real-world bug fixes:
- `test_wolfra_concatenated_year_bug`:
  - Problem: A number and year were concatenated (e.g., "802020").
  - Test: Ensures the parser correctly identifies and strips the trailing year, extracting `80`.
- `test_erding_year_prefix_bug`:
  - Problem: A year appeared before the actual metric in the sentence (e.g., "2022 ... 200.000 Besucher").
  - Test: Verifies that the parser's "Smart Year Skip" logic ignores the year and correctly extracts `200000`.
- `test_greilmeier_multiple_numbers_bug`:
  - Problem: The text contained multiple numbers ("An 2 Standorten ... 8.000 m²"), and the parser incorrectly picked the first one.
  - Test: Confirms that when an `expected_value` (like "8.000 m²") is provided, the parser correctly cleans it and extracts the corresponding number (8000), ignoring other irrelevant numbers.
These tests are crucial for preventing regressions as the parser logic evolves.
Notion Maintenance & Data Sync
Since the "Golden Record" for Industry Verticals (Pains, Gains, Products) resides in Notion, specific tools are available to read and sync this data.
Location: /app/company-explorer/backend/scripts/notion_maintenance/
Prerequisites:
- Ensure `.env` is loaded with `NOTION_API_KEY` and the correct DB IDs.
Key Scripts:
- `check_relations.py` (Reader - Deep):
  - Purpose: Reads Verticals and resolves linked Product Categories (Relation IDs -> Names). Essential for verifying the "Primary/Secondary Product" logic.
  - Usage: `python3 check_relations.py`
- `update_notion_full.py` (Writer - Batch):
  - Purpose: Batch updates Pains and Gains for multiple verticals. Use this as a template when refining the messaging strategy.
  - Usage: Edit the dictionary in the script, then run `python3 update_notion_full.py`.
- `list_notion_structure.py` (Schema Discovery):
  - Purpose: Lists all property keys and page titles. Use this to debug schema changes (e.g. if a column was renamed).
  - Usage: `python3 list_notion_structure.py`
Next Steps (Updated Feb 27, 2026)
- Notion Content: Finalize "Pains" and "Gains" for all 25 verticals in the Notion master database.
- Intelligence: Run `generate_matrix.py` in the Company Explorer backend to populate the matrix for all new English vertical names.
- Automation: Register the production webhook (requires `admin-webhooks` rights) to enable real-time CRM sync without manual job injection.
- Execution: Connect the "Sending Engine" (the actual email dispatch logic) to the SuperOffice fields.
- Monitoring: Monitor the 'Atomic PATCH' logs in production for any 400 errors regarding field length or specific character sets.
Company Explorer Access & Debugging
The Company Explorer is the central intelligence engine.
Core Paths:
- Database: `/app/companies_v3_fixed_2.db` (SQLite)
- Backend Code: `/app/company-explorer/backend/`
- Logs: `/app/logs_debug/company_explorer_debug.log`
Accessing Data: To inspect live data without starting the full stack, use `sqlite3` directly or the helper scripts (if the environment permits).
- Direct SQL: `sqlite3 /app/companies_v3_fixed_2.db "SELECT * FROM companies WHERE name LIKE '%Firma%';"`
- Python (requires env): The app runs in a Docker container. When debugging from outside (CLI agent), Python dependencies like `sqlalchemy` might be missing in the global scope. Prefer `sqlite3` for quick checks.
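When Python itself is available, the stdlib `sqlite3` module suffices for read-only checks. A sketch (the `companies` table name matches the SQL example above; other schema details are assumptions to verify first):

```python
import sqlite3

def find_companies(db_path: str, name_fragment: str) -> list:
    """Read-only lookup against the Company Explorer SQLite database.

    Pass "/app/companies_v3_fixed_2.db" as db_path. mode=ro keeps the
    debug session from writing to a file the container may hold open.
    """
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        cur = conn.execute(
            "SELECT * FROM companies WHERE name LIKE ?",
            (f"%{name_fragment}%",),
        )
        return cur.fetchall()
    finally:
        conn.close()
```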
Key Endpoints (Internal API :8000):
- `POST /api/provision/superoffice-contact`: Triggers the text generation logic.
- `GET /api/companies/{id}`: Full company profile including enrichment data.
Troubleshooting:
- "BaseModel" Error: Usually a mix-up between Pydantic and SQLAlchemy `Base`. Check the imports in `database.py`.
- Missing Dependencies: The CLI agent runs in `/app` but not necessarily inside the container's venv. Use standard tools (`grep`, `sqlite3`) where possible.
Critical Debugging Session (Feb 21, 2026) - Re-Stabilizing the Analysis Engine
A critical session was required to fix a series of cascading failures in the ClassificationService. The key takeaways are documented here to prevent future issues.
- The "Phantom" `NameError`:
  - Symptom: The application crashed with `NameError: name 'joinedload' is not defined`, even though the import was correctly added to `classification.py`.
  - Root Cause: The `uvicorn` server's hot-reload mechanism within the Docker container did not reliably pick up file changes made from outside the container. A simple `docker-compose restart` was insufficient to clear the process's cached state.
  - Solution: After any significant code change, especially to imports or core logic, a forced recreation of the container is mandatory:
    ```bash
    # Correct way to apply changes:
    docker-compose up -d --build --force-recreate company-explorer
    ```
- The "Invisible" Logs:
  - Symptom: No debug logs were being written, making it impossible to trace the execution flow.
  - Root Cause: The `LOG_DIR` path in `/company-explorer/backend/config.py` was misconfigured (`/app/logs_debug`) and did not point to the actual, historical log directory (`/app/Log_from_docker`).
  - Solution: Configuration paths must be treated as absolute and verified. Correcting the `LOG_DIR` path immediately resolved the issue.
- Inefficient Debugging Loop:
  - Symptom: The cycle of triggering a background job via API, waiting, and then manually checking logs was slow and inefficient.
  - Root Cause: Lack of a tool to test the core application logic in isolation.
  - Solution: A dedicated, interactive test script (`/company-explorer/backend/scripts/debug_single_company.py`) was created. It runs the entire analysis for a single company in the foreground, providing immediate and detailed feedback. This pattern is invaluable for complex, multi-step processes and should be a standard for future development.
Production Migration & Multi-Campaign Support (Feb 27, 2026)
The system has been fully migrated to the SuperOffice production environment (online3.superoffice.com, tenant Cust26720).
1. Final UDF Mappings (Production)
These ProgIDs are verified and active for the production tenant:
| Field Purpose | Entity | ProgID | Notes |
|---|---|---|---|
| MA Subject | Person | SuperOffice:19 | |
| MA Intro | Person | SuperOffice:20 | |
| MA Social Proof | Person | SuperOffice:21 | |
| MA Unsubscribe | Person | SuperOffice:22 | URL format |
| MA Campaign | Person | SuperOffice:23 | List field (uses `:DisplayText`) |
| Vertical | Contact | SuperOffice:83 | List field (mapped via JSON) |
| AI Summary | Contact | SuperOffice:84 | Truncated to 132 chars |
| AI Last Update | Contact | SuperOffice:85 | Format: `[D:MM/DD/YYYY HH:MM:SS]` |
| Opener Primary | Contact | SuperOffice:86 | |
| Opener Secondary | Contact | SuperOffice:87 | |
| Last Outreach | Contact | SuperOffice:88 | |
2. Vertical ID Mapping (Production)
The full list of 25 verticals with their internal SuperOffice IDs (List udlist331):
Automotive - Dealer: 1613, Corporate - Campus: 1614, Energy - Grid & Utilities: 1615, Energy - Solar/Wind: 1616, Healthcare - Care Home: 1617, Healthcare - Hospital: 1618, Hospitality - Gastronomy: 1619, Hospitality - Hotel: 1620, Industry - Manufacturing: 1621, Infrastructure - Communities: 1622, Infrastructure - Public: 1623, Infrastructure - Transport: 1624, Infrastructure - Parking: 1625, Leisure - Entertainment: 1626, Leisure - Fitness: 1627, Leisure - Indoor Active: 1628, Leisure - Outdoor Park: 1629, Leisure - Wet & Spa: 1630, Logistics - Warehouse: 1631, Others: 1632, Reinigungsdienstleister: 1633, Retail - Food: 1634, Retail - Non-Food: 1635, Retail - Shopping Center: 1636, Tech - Data Center: 1637.
3. Technical Lessons Learned (SO REST API)
- Atomic PATCH (Stability): Bundling all contact updates into a single `PATCH` request to the `/Contact/{id}` endpoint is far more stable than sequential UDF updates. If one field fails (e.g. invalid property), the whole transaction might roll back or partially fail, so proactive validation is key.
- Website Sync (`Urls` Array): Updating the website via REST requires manipulating the `Urls` array property. Simple field assignment to `UrlAddress` fails during `PATCH`.
  - Correct Format: `"Urls": [{"Value": "https://example.com", "Description": "AI Discovered"}]`
- List Resolution (`:DisplayText`): To get the clean string value of a list field (like Campaign Name) without extra API calls, use the pseudo-field `ProgID:DisplayText` in the `$select` parameter.
- Field Length Limits: Standard SuperOffice text UDFs are limited to approx. 140-254 characters. AI-generated summaries must be truncated (e.g. 132 chars) to avoid 400 Bad Request errors.
- Docker `env_file` Importance: For production, mapping individual variables in `docker-compose.yml` is error-prone. Using `env_file: .env` ensures all services stay synchronized with the latest UDF IDs and mappings.
- Production URL Schema: The production API is strictly hosted on `online3.superoffice.com` (for this tenant), while OAuth remains at `online.superoffice.com`.
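Assuming the JSON-Patch style body that SuperOffice REST `PATCH` endpoints accept, the atomic-update pattern can be sketched as a pure payload builder (the helper name and the exact `path` strings are illustrative; verify them against the live API before use):

```python
SUMMARY_LIMIT = 132  # stay well under the ~140-254 char UDF limits

def build_contact_patch(website: str, summary: str, udf: dict) -> list:
    """Bundle all Contact updates into one PATCH payload (op list)."""
    ops = [
        # Website must go through the Urls array, not UrlAddress
        {"op": "replace", "path": "/Urls",
         "value": [{"Value": website, "Description": "AI Discovered"}]},
        # Truncate AI text proactively to avoid 400 Bad Request
        {"op": "replace", "path": "/UserDefinedFields/SuperOffice:84",
         "value": summary[:SUMMARY_LIMIT]},
    ]
    for prog_id, value in udf.items():
        ops.append({"op": "replace",
                    "path": f"/UserDefinedFields/{prog_id}",
                    "value": value})
    return ops
```

Sending this as a single request means one round trip and one place to validate every field before anything is written.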
4. Campaign Trigger Logic
The `worker.py` (v1.8) now extracts the `campaign_tag` from `SuperOffice:23:DisplayText`. This tag is passed to the Company Explorer's provisioning API. If a matching entry exists in the `MarketingMatrix` for that tag, its specific texts are used; otherwise, it falls back to the "standard" Kaltakquise texts.
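The fallback rule reduces to a simple lookup. A minimal sketch (the real code queries the `MarketingMatrix` table rather than an in-memory dict; `select_texts` is a hypothetical name):

```python
def select_texts(campaign_tag: str, marketing_matrix: dict) -> dict:
    """Pick campaign-specific texts, falling back to 'standard'."""
    # Unknown or empty tags fall back to the standard Kaltakquise texts
    return marketing_matrix.get(campaign_tag) or marketing_matrix["standard"]
```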
5. SuperOffice Authentication (Critical Update Feb 28, 2026)
Problem: Authentication failures ("Invalid refresh token" or "Invalid client_id") occurred because a standard `load_dotenv()` call did not override stale environment variables already present in the shell process.
Solution: Always use `load_dotenv(override=True)` in Python scripts to force loading the actual values from the `.env` file.
Correct Authentication Pattern (Python):
```python
from dotenv import load_dotenv
import os

# CRITICAL: override=True ensures we read from .env even if env vars are already set
load_dotenv(override=True)
client_id = os.getenv("SO_CLIENT_ID")
# ...
```
Known Working Config (Production):
- Environment: `online3`
- Tenant: `Cust26720`
- Token Logic: The `AuthHandler` implementation in `health_check_so.py` is the reference standard. Avoid using the legacy `superoffice_client.py` without verifying that it uses `override=True`.
6. Sales & Opportunities (Roboplanet Specifics)
When creating sales via API, specific constraints apply due to the shared tenant with Wackler:
- SaleTypeId: MUST be 14 (`GE:"Roboplanet Verkauf";`) to ensure the sale is assigned to the correct business unit.
  - Alternative: ID 16 (`GE:"Roboplanet Teststellung";`) for trials.
- Mandatory Fields:
  - `Saledate` (Estimated Date): Must be provided in ISO format (e.g., `YYYY-MM-DDTHH:MM:SSZ`).
  - `Person`: Linking to a specific person, not just the company, is highly recommended.
- Context: Avoid creating sales on the parent company "Wackler Service Group" (ID 3). Always target the specific lead company.
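A hedged sketch of a sale payload honoring these constraints (`build_roboplanet_sale` is a hypothetical helper; field names and casing should be verified against the live Sale endpoint):

```python
from datetime import datetime, timezone

def build_roboplanet_sale(contact_id: int, person_id: int, heading: str,
                          trial: bool = False) -> dict:
    """Payload for a Roboplanet sale with the mandatory fields set."""
    return {
        "Contact": {"ContactId": contact_id},
        "Person": {"PersonId": person_id},  # link a person, not just the company
        "Heading": heading,
        # 14 = "Roboplanet Verkauf", 16 = "Roboplanet Teststellung"
        "SaleType": {"SaleTypeId": 16 if trial else 14},
        # Saledate is mandatory and must be ISO formatted
        "Saledate": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```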
7. Service & Tickets (Anfragen)
SuperOffice Tickets represent the support and request system. Like Sales, they are organized to allow separation between Roboplanet and Wackler.
- Entity Name: `ticket`
- Roboplanet-Specific Categories (CategoryId):
  - ID 46: `GE:"Lead Roboplanet";`
  - ID 47: `GE:"Vertriebspartner Roboplanet";`
  - ID 48: `GE:"Weitergabe Roboplanet";`
  - Hierarchical: `Roboplanet/Support` (often used for technical issues).
- Key Fields:
  - `ticketId`: Internal ID.
  - `title`: The subject of the request.
  - `contactId` / `personId`: Links to the company and contact person.
  - `ticketStatusId`: 1 (Unbearbeitet), 2 (In Arbeit), 3 (Bearbeitet).
  - `ownedBy`: Often "ROBO" for Roboplanet staff.
- Cross-Links: Tickets can be linked to a `saleId` (to track support during a sale) or a `projectId`.
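For quick log or debug output, the codes above can be wrapped in small lookup tables (a sketch; `describe_ticket` is illustrative and the canonical enums live in SuperOffice, not in our codebase):

```python
TICKET_STATUS = {1: "Unbearbeitet", 2: "In Arbeit", 3: "Bearbeitet"}
ROBOPLANET_CATEGORIES = {
    46: "Lead Roboplanet",
    47: "Vertriebspartner Roboplanet",
    48: "Weitergabe Roboplanet",
}

def describe_ticket(ticket: dict) -> str:
    """One-line human-readable summary of a SuperOffice ticket dict."""
    status = TICKET_STATUS.get(ticket["ticketStatusId"], "Unbekannt")
    category = ROBOPLANET_CATEGORIES.get(ticket.get("categoryId", -1), "Andere")
    return f"#{ticket['ticketId']} [{category} / {status}] {ticket['title']}"
```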