orion/app/modules/prospecting/services
Samir Boulahtit 754bfca87d
fix(prospecting): fix contact scraper and add address extraction
- Fix contact_type column: Enum(ContactType) → String(20) to match the
  migration (fixes "type contacttype does not exist" on insert)
- Rewrite scrape_contacts with structured-first approach:
  Phase 1: tel:/mailto: href extraction (high confidence)
  Phase 2: regex fallback with SVG/script stripping, international phone
           pattern (requires + prefix, min 10 digits)
  Phase 3: address extraction from Schema.org JSON-LD, <address> tags,
           and European street address regex (FR/DE/EN street keywords)
- URL-decode email values; strip tags to plain text so that addresses
  split across several elements can still be matched
- Add /mentions-legales to scanned paths
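
The structured-first idea in Phases 1 and 2 can be sketched roughly as
follows. This is a minimal illustration under stated assumptions, not the
code in this commit: the names extract_contacts and _HrefCollector and all
regex patterns are hypothetical, and only the standard library is used.

```python
import re
from html.parser import HTMLParser
from urllib.parse import unquote


class _HrefCollector(HTMLParser):
    """Collects tel: and mailto: targets from <a href=...> (Phase 1)."""

    def __init__(self):
        super().__init__()
        self.phones = []
        self.emails = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        scheme, _, value = href.partition(":")
        scheme = scheme.lower()
        if scheme == "tel":
            self.phones.append(unquote(value).strip())
        elif scheme == "mailto":
            # Drop any ?subject=... query, then URL-decode the address.
            self.emails.append(unquote(value.split("?", 1)[0]).strip())


# Hypothetical patterns for illustration; the real ones may differ.
_NOISE_RE = re.compile(r"<(script|style|svg)\b.*?</\1\s*>", re.I | re.S)
_TAG_RE = re.compile(r"<[^>]+>")
_PHONE_RE = re.compile(r"\+\d[\d\s().-]{8,}\d")  # must start with "+"
_EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_contacts(html):
    """Return phones and emails, preferring structured hrefs over regex."""
    collector = _HrefCollector()
    collector.feed(html)
    phones, emails = collector.phones, collector.emails

    if not phones or not emails:
        # Phase 2: remove script/style/svg bodies, then all remaining tags.
        text = _TAG_RE.sub(" ", _NOISE_RE.sub(" ", html))
        if not phones:
            phones = [
                m for m in _PHONE_RE.findall(text)
                if sum(c.isdigit() for c in m) >= 10  # min 10 digits
            ]
        if not emails:
            emails = _EMAIL_RE.findall(text)
    return {"phones": phones, "emails": emails}
```

Running the regex only when the href pass finds nothing is one plausible
way to keep false positives down; numbers embedded in SVG path data or
inline scripts never reach the phone pattern.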

Tested on batirenovation-strasbourg.fr: the scraper now finds 3 contacts
(email, phone, address); previously it produced 120+ false positives and
crashed.
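
For the Schema.org JSON-LD part of Phase 3, a rough sketch might look like
the following. It assumes addresses appear as PostalAddress objects inside
application/ld+json script blocks; the function name jsonld_addresses, the
helper _walk, and the choice of output fields are hypothetical and not
taken from this commit.

```python
import json
import re

_JSONLD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.I | re.S,
)
_FIELDS = ("streetAddress", "postalCode", "addressLocality", "addressCountry")


def _walk(node):
    """Yield every dict nested anywhere inside a decoded JSON value."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from _walk(value)
    elif isinstance(node, list):
        for item in node:
            yield from _walk(item)


def jsonld_addresses(html):
    """Return one formatted line per distinct PostalAddress found."""
    found = []
    for raw in _JSONLD_RE.findall(html):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the page
        for obj in _walk(data):
            if obj.get("@type") != "PostalAddress":
                continue
            # Only plain string fields are used in this sketch.
            parts = [obj.get(key) for key in _FIELDS]
            line = ", ".join(
                p.strip() for p in parts if isinstance(p, str) and p.strip()
            )
            if line and line not in found:
                found.append(line)
    return found
```

Walking the whole decoded document rather than only the top level means an
address nested under a LocalBusiness or Organization node is still found.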

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 21:18:43 +02:00