orion/docs/proposals/prospecting-contact-scraper-fix.md
Samir Boulahtit f310363f7c fix(prospecting): fix scan-jobs batch endpoints and add job tracking
- Reorder routes: batch endpoints before /{prospect_id} to fix FastAPI
  route matching (was parsing "batch" as prospect_id → 422)
- Add scan job tracking via stats_service.create_job/complete_job so
  the scan-jobs table gets populated after each batch run
- Add contact scrape batch endpoint (POST /contacts/batch) with
  get_pending_contact_scrape query
- Fix scan-jobs.js: explicit route map instead of naive replace
- Normalize domain_name on create/update (strip protocol, www, slash)
- Add domain_name to ProspectUpdate schema
- Add proposal for contact scraper enum + regex fixes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 23:31:33 +02:00


Prospecting Contact Scraper — Fix Enum + Improve Regex

Problem 1: DB Enum type mismatch

ProspectContact.contact_type is declared in the model as Enum(ContactType), which SQLAlchemy maps to a PostgreSQL enum type named contacttype. The migration, however, created the column as a plain VARCHAR and never created that type. On insert, SQLAlchemy casts the value to ::contacttype, which does not exist in PostgreSQL.

Error: type "contacttype" does not exist

File: app/modules/prospecting/models/prospect_contact.py

Fix options:

  • A) Change the model column from Enum(ContactType) to String to match the migration
  • B) Create an Alembic migration to add the contacttype enum to PostgreSQL

Option A is simpler and consistent with how the scraper creates contacts (using plain strings like "email", "phone").
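If option A is chosen, the allowed values can still be checked in application code before a row is written. A minimal sketch, assuming ContactType has members behind the "email" and "phone" strings mentioned above; the member names and the to_column_value helper are hypothetical and not taken from the codebase:

```python
from enum import Enum


class ContactType(str, Enum):
    # Assumed members, based on the plain strings the scraper already uses.
    EMAIL = "email"
    PHONE = "phone"


def to_column_value(contact_type) -> str:
    """Return the plain string to store in a VARCHAR contact_type column.

    Accepts either a ContactType member or its string value and raises
    ValueError for anything that is not a known contact type.
    """
    return ContactType(contact_type).value


print(to_column_value("email"))            # plain string input
print(to_column_value(ContactType.PHONE))  # enum member input
```

This keeps the column a String, as in the migration, while unknown values still fail early in Python rather than reaching the database.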

Problem 2: Phone regex too loose and Luxembourg-specific

The phone pattern (?:\+352|00352)?[\s.-]?\d{2,3}[\s.-]?\d{2,3}[\s.-]?\d{2,3} has two issues:

  1. Too loose: because the prefix is optional, it matches any run of 6 to 9 digits (CSS values, timestamps, hex colors, zip codes). On batirenovation-strasbourg.fr it found 120+ false positives.
  2. Luxembourg-only: the only prefixes it recognizes are +352 and 00352. batirenovation-strasbourg.fr is a French site, so its numbers use +33.
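Both issues can be reproduced in isolation. The sketch below copies the current pattern from the text above and runs it on three made-up samples (a hex color, an ISO date, and a French-style number); the old_matches helper and the sample strings are illustrative assumptions, not code from enrichment_service.py:

```python
import re

# Current pattern, copied verbatim from the problem statement above.
OLD_PHONE_PATTERN = re.compile(
    r"(?:\+352|00352)?[\s.-]?\d{2,3}[\s.-]?\d{2,3}[\s.-]?\d{2,3}"
)


def old_matches(text):
    """Return every substring the current pattern reports as a phone number."""
    return [m.strip() for m in OLD_PHONE_PATTERN.findall(text)]


# Made-up inputs, chosen to mirror the false positives listed above.
samples = ["color: #336699", "2026-03-29", "+33 3 88 12 34 56"]
for sample in samples:
    print(repr(sample), "->", old_matches(sample))
```

The hex color and the date each produce a "phone number" (336699 and 2026-03), while the French number is reported only as the fragment 88 12 34, with its +33 prefix lost.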

File: app/modules/prospecting/services/enrichment_service.py:274

Fix: Replace with a broader international phone regex:

import re

phone_pattern = re.compile(
    r'(?:\+\d{1,3}[\s.-]?)?'           # optional international prefix (+33, +352, etc.)
    r'\(?\d{1,4}\)?[\s.-]?'            # area code with optional parens
    r'\d{2,4}[\s.-]?'                  # first group
    r'\d{2,4}(?:[\s.-]?\d{2,4}){0,2}'  # second group + up to two more, so numbers written as five pairs (e.g. 03 88 12 34 56) match in full
)

Also add a minimum-length filter (10+ digits for international numbers) and exclude matches that look like dates, hex colors, or CSS values.
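One possible shape for that post-filter is sketched below. It works on a match span rather than on the regex itself, so it is independent of the exact pattern. The function name, the ISO-date heuristic, and the list of CSS units are assumptions made for illustration; only the 10-digit threshold comes from this proposal.

```python
import re

MIN_DIGITS = 10  # threshold suggested in this proposal

# Assumed heuristics, not taken from the codebase:
ISO_DATE_PREFIX = re.compile(
    r"(?:19|20)\d{2}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])"
)
CSS_UNIT = re.compile(r"px|rem|em|pt|%")


def is_plausible_phone(text, start, end):
    """Decide whether text[start:end], a regex match span, looks like a phone."""
    candidate = text[start:end].strip()

    # Short tokens such as bare hex colors or dates fall below the threshold.
    if sum(ch.isdigit() for ch in candidate) < MIN_DIGITS:
        return False

    # A candidate that begins like an ISO date is treated as a timestamp.
    if ISO_DATE_PREFIX.match(candidate):
        return False

    # '#' right before the match suggests a hex color; a letter or digit
    # suggests the match is the tail of a longer identifier or number.
    before = text[start - 1] if start > 0 else ""
    if before == "#" or before.isalnum():
        return False

    # A CSS unit right after the match suggests a style value.
    if CSS_UNIT.match(text, end):
        return False

    return True


sample = "Tel: +33 3 88 12 34 56"  # made-up number
print(is_plausible_phone(sample, sample.index("+"), len(sample)))
```

In the scraper this would be called with m.start() and m.end() for each match returned by phone_pattern.finditer(text).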

Problem 3: Email with URL-encoded space

The scraper finds %20btirenovation@gmail.com (from an href="mailto:%20btirenovation@gmail.com") alongside the clean btirenovation@gmail.com. The %20 prefix should be stripped.

File: app/modules/prospecting/services/enrichment_service.py:293-303

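Fix: URL-decode email values before storing and trim surrounding whitespace (decoding turns %20 into a leading space), or simply strip the %20 prefix. Either way the two variants collapse into one address.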
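The decode-then-trim variant can be sketched with the standard library alone. The helper names clean_email and dedupe_emails are hypothetical; stripping a mailto: prefix and a ?subject=... query is an extra assumption about what the scraper may pass in:

```python
from urllib.parse import unquote


def clean_email(raw):
    """Normalize an address taken from page text or a mailto: href."""
    value = raw.strip()
    if value.lower().startswith("mailto:"):
        value = value[len("mailto:"):]
    # Drop mailto query parameters before decoding, so an encoded '?'
    # inside the address is not mistaken for the query separator.
    value = value.split("?", 1)[0]
    # unquote() turns %20 into a space, which strip() then removes.
    return unquote(value).strip()


def dedupe_emails(raw_values):
    """Clean each value and keep the first occurrence of each address."""
    seen = set()
    result = []
    for raw in raw_values:
        email = clean_email(raw)
        if email and email not in seen:
            seen.add(email)
            result.append(email)
    return result


print(dedupe_emails(["%20btirenovation@gmail.com", "btirenovation@gmail.com"]))
```

With the two values from the example above, only the clean btirenovation@gmail.com remains.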

Files to change

| File | Change |
| --- | --- |
| prospecting/models/prospect_contact.py | Change contact_type from Enum to String |
| prospecting/services/enrichment_service.py | Improve phone regex, add min-length filter, URL-decode emails |
| Alembic migration | If needed for the enum change |