fix(prospecting): fix scan-jobs batch endpoints and add job tracking
- Reorder routes: batch endpoints before /{prospect_id} to fix FastAPI
route matching (was parsing "batch" as prospect_id → 422)
- Add scan job tracking via stats_service.create_job/complete_job so
the scan-jobs table gets populated after each batch run
- Add contact scrape batch endpoint (POST /contacts/batch) with
get_pending_contact_scrape query
- Fix scan-jobs.js: explicit route map instead of naive replace
- Normalize domain_name on create/update (strip protocol, www, slash)
- Add domain_name to ProspectUpdate schema
- Add proposal for contact scraper enum + regex fixes
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
51
docs/proposals/prospecting-contact-scraper-fix.md
Normal file
@@ -0,0 +1,51 @@
# Prospecting Contact Scraper: Fix Enum + Improve Regex

## Problem 1: DB Enum type mismatch

`ProspectContact.contact_type` is declared as `Enum(ContactType)` in the model (PostgreSQL type name `contacttype`), but the migration created the DB column as a plain `VARCHAR`. On insert, SQLAlchemy casts the value to `::contacttype`, a type that does not exist in PostgreSQL.

**Error:** `type "contacttype" does not exist`

**File:** `app/modules/prospecting/models/prospect_contact.py`

**Fix options:**

- A) Change the model column from `Enum(ContactType)` to `String` to match the migration
- B) Create an Alembic migration that adds the `contacttype` enum type to PostgreSQL

Option A is simpler and consistent with how the scraper creates contacts (using plain strings like `"email"`, `"phone"`).

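
If Option A is chosen, the database no longer constrains the allowed values, so one possible complement is to validate contact types in Python before they are stored. The following is a minimal sketch using only the standard library. The `ContactType` members are limited to the two values named above, and `contact_type_value` is a hypothetical helper, not an existing function in the codebase.

```python
from enum import Enum


class ContactType(str, Enum):
    """Stand-in for the model's enum; only the two values named in the proposal."""

    EMAIL = "email"
    PHONE = "phone"


def contact_type_value(raw: str) -> str:
    """Validate a scraped contact type and return the plain string to store.

    Hypothetical helper: raises ValueError for values outside ContactType.
    """
    return ContactType(raw.strip().lower()).value
```

With this in place the column can stay a plain `String` while unknown values such as `"fax"` are still rejected at the application layer.
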
## Problem 2: Phone regex too loose and Luxembourg-specific

The phone pattern `(?:\+352|00352)?[\s.-]?\d{2,3}[\s.-]?\d{2,3}[\s.-]?\d{2,3}` has two issues:

1. **Too loose**: because the prefix is optional, it matches any 6-9 digit sequence (CSS values, timestamps, hex colors, zip codes). On batirenovation-strasbourg.fr it found 120+ false positives.
2. **Luxembourg-only**: it only recognizes the `+352`/`00352` prefix, but batirenovation-strasbourg.fr is a French site with `+33` numbers.

**File:** `app/modules/prospecting/services/enrichment_service.py:274`

**Fix:** Replace with a broader international phone regex:
```python
import re

phone_pattern = re.compile(
    r'(?:\+\d{1,3}[\s.-]?)?'            # optional international prefix (+33, +352, etc.)
    r'\(?\d{1,4}\)?[\s.-]?'             # area code with optional parens
    r'\d{2,4}[\s.-]?'                   # first group
    r'\d{2,4}(?:[\s.-]?\d{2,4}){0,3}'   # second group + up to three more, so that
                                        # five-group numbers (01 23 45 67 89) match in full
)
```

Also add a minimum-length filter (10+ digits for international numbers) and exclude matches that look like dates, hex colors, or CSS values.

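
The post-filtering described above could look roughly like the sketch below. It is illustrative only: `extract_phone_candidates` and `MIN_DIGITS` are hypothetical names, the pattern is a variant of the proposed one that allows up to three extra trailing groups (an assumption, so that five-group French formats match in full), and the context check (skip matches glued to `#` or to letters and digits) is one simple heuristic for hex colors and CSS values, not a complete one.

```python
import re

# Variant of the proposed pattern; the trailing {0,3} is an assumption of this sketch.
PHONE_PATTERN = re.compile(
    r'(?:\+\d{1,3}[\s.-]?)?'           # optional international prefix
    r'\(?\d{1,4}\)?[\s.-]?'            # area code with optional parens
    r'\d{2,4}[\s.-]?'                  # first group
    r'\d{2,4}(?:[\s.-]?\d{2,4}){0,3}'  # second group + up to three more
)

MIN_DIGITS = 10  # threshold taken from the proposal


def extract_phone_candidates(text: str) -> list[str]:
    """Return de-duplicated phone-like matches that survive the filters."""
    results: list[str] = []
    for match in PHONE_PATTERN.finditer(text):
        candidate = match.group()
        start, end = match.span()

        # Minimum-length filter: short runs (dates, 6-digit hex colors) drop out here.
        if len(re.sub(r'\D', '', candidate)) < MIN_DIGITS:
            continue

        # Context filter: skip digits attached to '#', letters, or more digits,
        # e.g. long hex values or CSS tokens such as "1200px".
        if start > 0 and (text[start - 1] == '#' or text[start - 1].isalnum()):
            continue
        if end < len(text) and text[end].isalnum():
            continue

        if candidate not in results:
            results.append(candidate)
    return results
```

Note that a bare 10-digit run (for example a Unix timestamp) is indistinguishable from an unformatted national number under these rules, so some false positives would remain.
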
## Problem 3: Email with URL-encoded space

The scraper finds `%20btirenovation@gmail.com` (from `href="mailto:%20btirenovation@gmail.com"`) alongside the clean `btirenovation@gmail.com`. The `%20` prefix is a URL-encoded space and should be stripped so that both values resolve to the same address.

**File:** `app/modules/prospecting/services/enrichment_service.py:293-303`

**Fix:** URL-decode email values before storing, or strip the `%20` prefix.

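
A minimal sketch of the URL-decoding approach, using `urllib.parse.unquote` from the standard library. The helper names `clean_email` and `dedupe_emails` are hypothetical and do not refer to existing functions in `enrichment_service.py`.

```python
from urllib.parse import unquote


def clean_email(raw: str) -> str:
    """Normalize an email taken from page text or a mailto: href."""
    value = raw.strip()
    if value.lower().startswith("mailto:"):
        value = value[len("mailto:"):]
    # Drop any mailto query string (?subject=...) before decoding.
    value = value.split("?", 1)[0]
    # Decode percent-escapes such as %20, then trim the resulting whitespace.
    return unquote(value).strip()


def dedupe_emails(raw_values: list[str]) -> list[str]:
    """Clean every value and keep the first occurrence of each address."""
    return list(dict.fromkeys(clean_email(v) for v in raw_values))
```

Decoding rather than stripping a literal `%20` also covers other escapes (for example `%40` for `@`), which is why it is shown here as the more general of the two options.
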
## Files to change

| File | Change |
|---|---|
| `prospecting/models/prospect_contact.py` | Change `contact_type` from `Enum` to `String` |
| `prospecting/services/enrichment_service.py` | Improve phone regex, add min-length filter, URL-decode emails |
| Alembic migration | If needed for the enum change |