feat(prospecting): add content scraping for POC builder (Workstream 3A)

- New scrape_content() method in enrichment_service: extracts meta
  description, H1/H2 headings, paragraphs, images (filtered for size),
  social links, service items, and detected languages using BeautifulSoup
- Scans 6 pages per prospect: /, /about, /a-propos, /services,
  /nos-services, /contact
- Results stored as JSON in prospect.scraped_content_json
- New endpoints: POST /content-scrape/{id} and /content-scrape/batch
- Added to full_enrichment pipeline (Step 5, before security audit)
- CONTENT_SCRAPE job type for scan-jobs tracking
- "Content Scrape" batch button on scan-jobs page
- Add beautifulsoup4 to requirements.txt

Tested on batirenovation-strasbourg.fr: extracted 30 headings,
21 paragraphs, 13 images.
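The extraction described in the bullets above can be sketched roughly as follows. This is a hypothetical reconstruction, not the actual `enrichment_service` code: the function signature, return shape, and social-link filter are assumptions; only the extracted fields (meta description, H1/H2 headings, paragraphs, images, social links) come from the commit.

```python
# Hypothetical sketch of the scrape_content() extraction logic; names and
# structure are assumptions, only the extracted fields match the commit.
from bs4 import BeautifulSoup

# Pages scanned per prospect, as listed in the commit message.
PAGES = ["/", "/about", "/a-propos", "/services", "/nos-services", "/contact"]

def scrape_content(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    meta = soup.find("meta", attrs={"name": "description"})
    return {
        "meta_description": meta["content"] if meta and meta.has_attr("content") else None,
        # H1/H2 headings in document order
        "headings": [h.get_text(strip=True) for h in soup.find_all(["h1", "h2"])],
        # Non-empty paragraph text only
        "paragraphs": [p.get_text(strip=True) for p in soup.find_all("p")
                       if p.get_text(strip=True)],
        "images": [img["src"] for img in soup.find_all("img", src=True)],
        # Crude social-link heuristic (assumed; the real filter may differ)
        "social_links": [a["href"] for a in soup.find_all("a", href=True)
                         if any(s in a["href"] for s in ("facebook.", "instagram.", "linkedin."))],
    }

html = """<html><head><meta name="description" content="Renovation experts"></head>
<body><h1>Welcome</h1><h2>Our Services</h2><p>We renovate.</p>
<img src="/logo.png"><a href="https://facebook.com/acme">FB</a></body></html>"""
result = scrape_content(html)
```

In the real service, each of the six paths would be fetched and its result merged into the per-prospect JSON; image filtering for size (mentioned in the commit) is omitted here.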

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 22:26:56 +02:00
parent 50a4fc38a7
commit 1828ac85eb
8 changed files with 218 additions and 2 deletions


@@ -70,6 +70,10 @@ class Prospect(Base, TimestampMixin):
     last_perf_scan_at = Column(DateTime, nullable=True)
     last_contact_scrape_at = Column(DateTime, nullable=True)
     last_security_audit_at = Column(DateTime, nullable=True)
+    last_content_scrape_at = Column(DateTime, nullable=True)
+
+    # Scraped page content for POC builder
+    scraped_content_json = Column(Text, nullable=True)

     # Relationships
     tech_profile = relationship("ProspectTechProfile", back_populates="prospect", uselist=False, cascade="all, delete-orphan")
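Since `scraped_content_json` is a plain `Text` column, reads and writes go through JSON serialization. A minimal sketch, assuming hypothetical helper names (`store_scraped_content` / `load_scraped_content` are not in the commit):

```python
import json
from typing import Optional

# Hypothetical helpers for the scraped_content_json Text column added in the
# diff above; the helper names and FakeProspect stand-in are assumptions.
def store_scraped_content(prospect, content: dict) -> None:
    # Serialize the scrape result to JSON for the Text column.
    prospect.scraped_content_json = json.dumps(content, ensure_ascii=False)

def load_scraped_content(prospect) -> Optional[dict]:
    # Deserialize, tolerating prospects that were never scraped.
    raw = prospect.scraped_content_json
    return json.loads(raw) if raw else None

class FakeProspect:  # stand-in for the SQLAlchemy model
    scraped_content_json = None

p = FakeProspect()
store_scraped_content(p, {"headings": ["Accueil"]})
```

`ensure_ascii=False` keeps French text (accents on pages like `/a-propos`) readable in the stored JSON rather than escaped.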


@@ -20,6 +20,7 @@ class JobType(str, enum.Enum):
     SCORE_COMPUTE = "score_compute"
     FULL_ENRICHMENT = "full_enrichment"
     SECURITY_AUDIT = "security_audit"
+    CONTENT_SCRAPE = "content_scrape"


 class JobStatus(str, enum.Enum):
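The `str` mixin on `JobType` is what lets job types serialize directly in JSON payloads and compare equal to their plain-string form. A runnable reproduction of the pattern, with the members copied from the hunk above:

```python
import enum

# Reproduction of the str-mixin enum from the diff above; CONTENT_SCRAPE is
# the member this commit adds, the others are context lines from the hunk.
class JobType(str, enum.Enum):
    SCORE_COMPUTE = "score_compute"
    FULL_ENRICHMENT = "full_enrichment"
    SECURITY_AUDIT = "security_audit"
    CONTENT_SCRAPE = "content_scrape"

# The str mixin means the member compares equal to its value string and can
# be round-tripped from API input with JobType("content_scrape").
job = JobType.CONTENT_SCRAPE
```

This is why the scan-jobs tracking and the batch endpoint can pass `"content_scrape"` as an ordinary string.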