- Add `PROSPECTING_BATCH_DELAY_SECONDS` config (default 1.0s) — polite delay between prospects in batch scans to avoid rate limiting
- Apply delay to all 5 batch API endpoints and all Celery tasks
- Fix Celery tasks: `error_message` → `error_log` (matches model field)
- Add batch-scanning.md docs with rate limiting guide, scaling estimates for 70k+ URL imports, and pipeline order recommendations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Batch Scanning & Rate Limiting
## Overview
The prospecting module performs passive scans against prospect websites to gather intelligence. Batch operations process multiple prospects sequentially with a configurable delay between each.
## Scan Types
| Type | What It Does | HTTP Requests/Prospect |
|---|---|---|
| **HTTP Check** | Connectivity, HTTPS, redirects | 2 (HTTP + HTTPS) |
| **Tech Scan** | CMS, framework, server detection | 1 (homepage) |
| **Performance** | PageSpeed Insights audit | 1 (Google API) |
| **Contact Scrape** | Email, phone, address extraction | 6 (homepage + 5 subpages) |
| **Security Audit** | Headers, SSL, exposed files, cookies | ~35 (homepage + 30 path checks) |
| **Score Compute** | Calculate opportunity score | 0 (local computation) |
## Rate Limiting
### Configuration
```bash
# .env
PROSPECTING_BATCH_DELAY_SECONDS=1.0   # delay between prospects (default: 1s)
PROSPECTING_HTTP_TIMEOUT=10           # per-request timeout (default: 10s)
```
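The two settings above can be read in application code with sensible fallbacks. A minimal sketch, assuming the values are exposed as environment variables (the variable names come from the `.env` example; the constant names here are illustrative, not the project's actual config module):

```python
import os

# Read rate-limit settings from the environment, falling back to the
# documented defaults (1.0s delay, 10s timeout).
BATCH_DELAY_SECONDS = float(os.environ.get("PROSPECTING_BATCH_DELAY_SECONDS", "1.0"))
HTTP_TIMEOUT_SECONDS = float(os.environ.get("PROSPECTING_HTTP_TIMEOUT", "10"))
```

Parsing through `float()` keeps fractional delays like `0.5` valid, which matters if you later tune the delay below one second.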
### Where Delays Apply
- **Batch API endpoints** (`POST /enrichment/*/batch`) — 1s delay between prospects
- **Celery background tasks** (`scan_tasks.py`) — same 1s delay
- **Full enrichment** (`POST /enrichment/full/{id}`) — no delay (single prospect)
- **Score compute batch** — no delay (no outbound HTTP)
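The shared pattern behind the first two bullets is a sequential loop that sleeps between prospects and never lets one failure abort the batch. A minimal sketch (function and parameter names are illustrative, not the project's actual code):

```python
import time

def run_batch(prospects, scan_fn, delay_seconds=1.0):
    """Scan prospects one at a time, sleeping between each.

    A failing prospect is recorded and the batch continues, mirroring
    the error-handling behavior described in this document.
    """
    results = []
    for i, prospect in enumerate(prospects):
        try:
            results.append(scan_fn(prospect))
        except Exception as exc:
            results.append(exc)  # log-and-continue; don't stop the batch
        if i < len(prospects) - 1:
            time.sleep(delay_seconds)  # polite delay, skipped after the last item
    return results
```

Skipping the sleep after the final prospect shaves one delay interval off every batch, which adds up on large runs.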
### Scaling to 70k+ URLs
For bulk imports (e.g., domain registrar list), use Celery tasks with limits:
| Scan Type | Time per Prospect | 70k URLs | Recommended Batch |
|---|---|---|---|
| HTTP Check | ~2s (timeout + delay) | ~39 hours | 500/batch via Celery |
| Tech Scan | ~2s | ~39 hours | 500/batch |
| Contact Scrape | ~12s (6 pages + delay) | ~10 days | 100/batch |
| Security Audit | ~40s (35 paths + delay) | ~32 days | 50/batch |
**Recommendation:** For 70k URLs, run HTTP Check first (fastest, filters out dead sites). Then run subsequent scans only on prospects with `has_website=True` (~50-70% of domains typically have working sites).
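The wall-clock figures in the table above come from straightforward arithmetic: total seconds divided by 3600. A quick helper to reproduce them for your own URL counts (the function name is illustrative):

```python
def estimate_runtime_hours(url_count, seconds_per_prospect):
    """Sequential wall-clock estimate for a batch scan, in hours."""
    return url_count * seconds_per_prospect / 3600

# 70,000 URLs at ~2s each (HTTP Check) ≈ 38.9 hours, matching the table.
```

Note these estimates assume one sequential worker; splitting the load across multiple Celery workers divides the wall-clock time accordingly, at the cost of a higher aggregate request rate against target sites.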
### Pipeline Order
```
1. HTTP Check batch  → sets has_website, filters dead domains
2. Tech Scan batch   → only where has_website=True
3. Contact Scrape    → only where has_website=True
4. Security Audit    → only where has_website=True
5. Score Compute     → all prospects (local, fast)
```
Each scan type uses `last_*_at` timestamps to track what's been processed. Re-running a batch only processes prospects that haven't been scanned yet.
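The "not yet scanned" filter implied by the `last_*_at` timestamps can be sketched as a plain predicate: a prospect qualifies when it has a working site and the relevant timestamp is still unset. A minimal sketch over dict-shaped rows (field names follow the document's `has_website` / `last_*_at` convention; the function name is an assumption, not the project's actual query code):

```python
def pending_for_scan(prospects, timestamp_field):
    """Select prospects with a live site whose given scan hasn't run yet."""
    return [
        p for p in prospects
        if p.get("has_website") and p.get(timestamp_field) is None
    ]
```

In the real application this would typically be a database query filtering on the same two conditions, so re-running a batch naturally picks up only the unscanned remainder.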
## User-Agent
All scans use a standard Chrome User-Agent:
```
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
```
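Setting this User-Agent on an outbound request is a one-liner with the standard library. A minimal sketch using `urllib` (the scanner itself may use a different HTTP client; this only illustrates the header):

```python
import urllib.request

# The Chrome User-Agent string documented above.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0.0.0 Safari/537.36"
)

# Build a request carrying the User-Agent header (no network I/O yet).
req = urllib.request.Request(
    "https://example.com",
    headers={"User-Agent": USER_AGENT},
)
```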
The contact scraper additionally identifies itself as `OrionBot/1.0` for transparency.
## Error Handling
- Individual prospect failures don't stop the batch
- Errors are logged and processing continues with the next prospect
- The scan job record tracks `processed_items` vs `total_items`
- Celery tasks retry on failure (2 retries with exponential backoff)