
Samir Boulahtit 50a4fc38a7 feat(prospecting): add batch delay + fix Celery error_message field

- Add PROSPECTING_BATCH_DELAY_SECONDS config (default 1.0s) — polite
  delay between prospects in batch scans to avoid rate limiting
- Apply delay to all 5 batch API endpoints and all Celery tasks
- Fix Celery tasks: error_message → error_log (matches model field)
- Add batch-scanning.md docs with rate limiting guide, scaling estimates
  for 70k+ URL imports, and pipeline order recommendations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-01 21:55:24 +02:00

2.9 KiB


Batch Scanning & Rate Limiting

Overview

The prospecting module performs passive scans against prospect websites to gather intelligence. Batch operations process multiple prospects sequentially with a configurable delay between each.

Scan Types

Type	What It Does	HTTP Requests/Prospect
HTTP Check	Connectivity, HTTPS, redirects	2 (HTTP + HTTPS)
Tech Scan	CMS, framework, server detection	1 (homepage)
Performance	PageSpeed Insights audit	1 (Google API)
Contact Scrape	Email, phone, address extraction	6 (homepage + 5 subpages)
Security Audit	Headers, SSL, exposed files, cookies	~35 (homepage + 30 path checks)
Score Compute	Calculate opportunity score	0 (local computation)

Rate Limiting

Configuration

# .env
PROSPECTING_BATCH_DELAY_SECONDS=1.0   # delay between prospects (default: 1s)
PROSPECTING_HTTP_TIMEOUT=10            # per-request timeout (default: 10s)
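
A minimal sketch of how these two variables could be read in Python, falling back to the documented defaults. The `ProspectingSettings` class and `load_settings` function are hypothetical names used for illustration; the module's actual config loader may look different.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ProspectingSettings:
    """Hypothetical settings holder; the real config class may differ."""
    batch_delay_seconds: float = 1.0  # documented default: 1s between prospects
    http_timeout: int = 10            # documented default: 10s per request


def load_settings(env=None) -> ProspectingSettings:
    """Read the two documented variables, using the documented defaults if unset."""
    env = os.environ if env is None else env
    return ProspectingSettings(
        batch_delay_seconds=float(env.get("PROSPECTING_BATCH_DELAY_SECONDS", "1.0")),
        http_timeout=int(env.get("PROSPECTING_HTTP_TIMEOUT", "10")),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests without touching the process environment.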

Where Delays Apply

Batch API endpoints (POST /enrichment/*/batch): PROSPECTING_BATCH_DELAY_SECONDS (default 1s) between prospects
Celery background tasks (scan_tasks.py): same configured delay
Full enrichment (POST /enrichment/full/{id}): no delay (single prospect)
Score compute batch: no delay (no outbound HTTP)
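
The delay rule for batch runs can be sketched as a plain sequential loop. This is an illustration only: `run_batch`, its parameters, and the choice to skip the pause before the first prospect are assumptions, not the code in the batch endpoints or scan_tasks.py.

```python
import time
from typing import Callable, Iterable, List


def run_batch(
    prospects: Iterable[str],
    scan: Callable[[str], object],
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[object]:
    """Scan prospects one at a time, pausing between consecutive prospects."""
    results = []
    for index, prospect in enumerate(prospects):
        if index > 0 and delay_seconds > 0:
            sleep(delay_seconds)  # pause before every prospect except the first
        results.append(scan(prospect))
    return results
```

The `sleep` parameter is injectable so the pauses can be recorded in tests instead of actually waiting. A score-compute batch would call this with `delay_seconds=0`, since it makes no outbound requests.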

Scaling to 70k+ URLs

For bulk imports (e.g., a domain registrar list), run the scans as Celery tasks with a per-batch limit:

Scan Type	Time per prospect	70k URLs	Recommended Batch
HTTP Check	~2s (timeout + delay)	~39 hours	500/batch via Celery
Tech Scan	~2s	~39 hours	500/batch
Contact Scrape	~12s (6 pages + delay)	~10 days	100/batch
Security Audit	~40s (35 paths + delay)	~32 days	50/batch

Recommendation: For 70k URLs, run HTTP Check first (fastest, filters out dead sites). Then run subsequent scans only on prospects with has_website=True (~50-70% of domains typically have working sites).
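
The estimates in the table follow from multiplying the per-prospect time by the number of prospects, since batches run sequentially. The sketch below reproduces that arithmetic and shows the effect of scanning only live sites; the function names and the `live_fraction` parameter are illustrative, not part of the module.

```python
from datetime import timedelta

# Per-prospect timings copied from the table above (seconds, delay included).
SECONDS_PER_PROSPECT = {
    "HTTP Check": 2,
    "Tech Scan": 2,
    "Contact Scrape": 12,
    "Security Audit": 40,
}


def sequential_duration(prospects: int, seconds_per_prospect: float) -> timedelta:
    """Total wall-clock time when prospects are scanned strictly one after another."""
    return timedelta(seconds=prospects * seconds_per_prospect)


def estimate(prospects: int, live_fraction: float = 1.0) -> dict:
    """Hours per scan type; live_fraction shrinks every scan except HTTP Check."""
    hours = {}
    for scan_type, seconds in SECONDS_PER_PROSPECT.items():
        # HTTP Check always runs on the full list because it produces has_website.
        count = prospects if scan_type == "HTTP Check" else round(prospects * live_fraction)
        hours[scan_type] = sequential_duration(count, seconds).total_seconds() / 3600
    return hours
```

With these inputs, `estimate(70_000)` gives about 39 hours for HTTP Check and about 32 days for Security Audit. Restricting later scans to 50% live sites (`live_fraction=0.5`) brings Security Audit down to roughly 16 days.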

Pipeline Order

1. HTTP Check batch    → sets has_website, filters dead domains
2. Tech Scan batch     → only where has_website=True
3. Contact Scrape      → only where has_website=True
4. Security Audit      → only where has_website=True
5. Score Compute       → all prospects (local, fast)

Each scan type uses last_*_at timestamps to track what's been processed. Re-running a batch only processes prospects that haven't been scanned yet.
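
A minimal sketch of how a batch might select its work under these rules. The `Prospect` dataclass is a hypothetical stand-in for the real model: only `has_website` and the `last_*_at` naming pattern come from this document, and the specific field `last_tech_scan_at` is an assumed instance of that pattern.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Prospect:
    # Hypothetical stand-in for the real model.
    domain: str
    has_website: Optional[bool] = None            # set by the HTTP Check step
    last_tech_scan_at: Optional[datetime] = None  # assumed last_*_at field


def pending_for_tech_scan(prospects: List[Prospect], limit: int) -> List[Prospect]:
    """Pick prospects with a working site that have not been tech-scanned yet."""
    pending = [
        p for p in prospects
        if p.has_website and p.last_tech_scan_at is None
    ]
    return pending[:limit]  # cap the batch size, e.g. 500 for Tech Scan
```

Because already-scanned prospects carry a timestamp, calling the selector again after a partial run returns only the remaining ones.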

User-Agent

All scans use a standard Chrome User-Agent:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36

The contact scraper also identifies itself as OrionBot/1.0 for transparency.

Error Handling

Individual prospect failures don't stop the batch
Errors are logged, and the batch moves on to the next prospect
The scan job record tracks processed_items vs total_items
Celery tasks retry on failure (2 retries with exponential backoff)
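
The failure-isolation behaviour can be sketched as follows. `ScanJob` is a hypothetical record: `processed_items` and `total_items` are the counters named above, while `failed_items` and the choice to count only successful scans as processed are assumptions made for this example. Celery's retry mechanism is not modelled here.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Iterable

logger = logging.getLogger("prospecting.batch")


@dataclass
class ScanJob:
    # Hypothetical job record; failed_items is an illustrative addition.
    total_items: int = 0
    processed_items: int = 0
    failed_items: int = 0


def run_batch_tolerant(prospects: Iterable[str], scan: Callable[[str], object]) -> ScanJob:
    """Scan every prospect; one failure is logged and never aborts the batch."""
    items = list(prospects)
    job = ScanJob(total_items=len(items))
    for prospect in items:
        try:
            scan(prospect)
        except Exception:
            # Log with traceback, record the failure, move on to the next prospect.
            logger.exception("scan failed for %s", prospect)
            job.failed_items += 1
            continue
        job.processed_items += 1
    return job
```

Comparing `processed_items` with `total_items` after the run shows how many prospects still need attention.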
