feat(prospecting): add content scraping for POC builder (Workstream 3A)

- New scrape_content() method in enrichment_service: extracts meta
  description, H1/H2 headings, paragraphs, images (filtered for size),
  social links, service items, and detected languages using BeautifulSoup
- Scans 6 pages per prospect: /, /about, /a-propos, /services,
  /nos-services, /contact
- Results stored as JSON in prospect.scraped_content_json
- New endpoints: POST /content-scrape/{id} and /content-scrape/batch
- Added to full_enrichment pipeline (Step 5, before security audit)
- CONTENT_SCRAPE job type for scan-jobs tracking
- "Content Scrape" batch button on scan-jobs page
- Add beautifulsoup4 to requirements.txt
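
A minimal sketch of the kind of extraction listed above, not the actual scrape_content() code. The commit uses BeautifulSoup; to stay dependency-free this sketch swaps in the standard-library html.parser. The names ContentExtractor, extract_content, SOCIAL_HOSTS and MIN_IMAGE_WIDTH are hypothetical, the size threshold is an assumption, and service items and language detection are left out.

```python
from html.parser import HTMLParser

# Hypothetical host list; the commit does not say which networks are matched.
SOCIAL_HOSTS = ("facebook.com", "instagram.com", "linkedin.com")
MIN_IMAGE_WIDTH = 200  # assumed threshold for "filtered for size"


class ContentExtractor(HTMLParser):
    """Collect meta description, H1/H2 text, paragraphs, images, social links."""

    def __init__(self) -> None:
        super().__init__()
        self.result = {"meta_description": None, "headings": [],
                       "paragraphs": [], "images": [], "social_links": []}
        self._current = None  # tag whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "description":
            self.result["meta_description"] = a.get("content")
        elif tag in ("h1", "h2", "p"):
            self._current, self._buffer = tag, []
        elif tag == "img" and a.get("src"):
            width = a.get("width") or ""
            # Keep images with no declared width; drop ones declared too small.
            if not width.isdigit() or int(width) >= MIN_IMAGE_WIDTH:
                self.result["images"].append(a["src"])
        elif tag == "a" and any(h in (a.get("href") or "") for h in SOCIAL_HOSTS):
            self.result["social_links"].append(a["href"])

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse runs of whitespace and skip empty elements.
            text = " ".join("".join(self._buffer).split())
            if text:
                key = "paragraphs" if tag == "p" else "headings"
                self.result[key].append(text)
            self._current = None


def extract_content(html: str) -> dict:
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return parser.result
```

Calling extract_content() on one page's HTML returns a plain dict, which makes it straightforward to serialize as JSON afterwards.
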
Tested on batirenovation-strasbourg.fr: extracted 30 headings,
21 paragraphs, 13 images.
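
The six-page scan and the JSON storage mentioned in the commit message could look roughly like the sketch below, and a summary such as the one reported for the test site is then just a count over the merged lists. This is an illustration under stated assumptions: SCRAPE_PATHS, build_page_urls, merge_pages and summarize are hypothetical names, HTTPS is assumed, and the real layout of prospect.scraped_content_json is not shown in this commit.

```python
import json
from urllib.parse import urljoin

# Paths taken from the commit message; the constant name is hypothetical.
SCRAPE_PATHS = ["/", "/about", "/a-propos", "/services", "/nos-services", "/contact"]


def build_page_urls(domain: str) -> list[str]:
    """Return the six candidate URLs for a prospect's domain (HTTPS assumed)."""
    base = f"https://{domain}"
    return [urljoin(base, path) for path in SCRAPE_PATHS]


def merge_pages(pages: list[dict]) -> str:
    """Merge per-page results into one JSON string, dropping duplicates."""
    merged: dict[str, list] = {"headings": [], "paragraphs": [], "images": []}
    for page in pages:
        for key in merged:
            for item in page.get(key, []):
                if item not in merged[key]:
                    merged[key].append(item)
    # ensure_ascii=False keeps accented French text readable in storage.
    return json.dumps(merged, ensure_ascii=False)


def summarize(scraped_content_json: str) -> str:
    """Count stored items per category, e.g. for a quick manual check."""
    data = json.loads(scraped_content_json)
    keys = ("headings", "paragraphs", "images")
    return ", ".join(f"{len(data.get(k, []))} {k}" for k in keys)
```

Deduplicating during the merge matters because headers, footers and logos tend to repeat on every page of a small business site.
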
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -34,6 +34,11 @@
 <span x-html="$icon('mail', 'w-4 h-4 mr-2')"></span>
 Contact Scrape
 </button>
+<button type="button" @click="startBatchJob('content_scrape')"
+class="inline-flex items-center px-4 py-2 text-sm font-medium leading-5 text-white transition-colors duration-150 bg-teal-600 border border-transparent rounded-lg hover:bg-teal-700 focus:outline-none">
+<span x-html="$icon('document-text', 'w-4 h-4 mr-2')"></span>
+Content Scrape
+</button>
 <button type="button" @click="startBatchJob('security_audit')"
 class="inline-flex items-center px-4 py-2 text-sm font-medium leading-5 text-white transition-colors duration-150 bg-yellow-600 border border-transparent rounded-lg hover:bg-yellow-700 focus:outline-none">
 <span x-html="$icon('shield-check', 'w-4 h-4 mr-2')"></span>