orion/docs/proposals/poc-content-mapping.md
Samir Boulahtit dfd42c1b10
docs: add proposal for POC content mapping (scraped → template)
Details the gap between scraped content (21 paragraphs, 30 headings,
13 images) and what appears on POC pages (only placeholder fields).

Phase 1 plan: programmatic mapping of scraped headings/paragraphs/
images into template sections (hero subtitle, gallery, about text).
Phase 2: AI-powered content enhancement (deferred, provider TBD).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 21:14:17 +02:00


# POC Content Mapping — Scraped Content → Template Sections
## Problem
The POC builder creates pages from industry templates, but almost none of the content scraped from the prospect's site appears on them. The homepage shows generic template text ("Quality construction and renovation") instead of the prospect's actual content ("Depuis trois générations, nous mettons notre savoir-faire au service de la qualité...").
For batirenovation-strasbourg.fr:
- **Scraped:** 30 headings, 21 paragraphs, 13 images, 3 contacts
- **Shows on POC:** only business name, phone, email, address via placeholders
- **Missing:** all the prose content, service descriptions, company history, project descriptions
## Current Flow
```
Scraped content → context dict → {{placeholder}} replacement → CMS pages
```
Placeholders are limited to simple fields: `{{business_name}}`, `{{phone}}`, `{{email}}`, `{{address}}`, `{{meta_description}}`, `{{about_paragraph}}`.
The rich content (paragraphs, headings, images) is stored in `prospect.scraped_content_json` but never mapped into the template sections.
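To make the limitation concrete, the placeholder step can be pictured as a flat string substitution. The sketch below is an assumption about how such a step behaves, not the code in `poc_builder_service.py`; `replace_placeholders` and the sample values are hypothetical.

```python
import re

# Matches {{key}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def replace_placeholders(template_text, context):
    """Replace {{key}} with context[key]; unknown or non-string keys are left as-is."""

    def substitute(match):
        value = context.get(match.group(1))
        # Only flat string values fit this mechanism; lists of
        # paragraphs, headings or images have nowhere to go.
        return value if isinstance(value, str) else match.group(0)

    return PLACEHOLDER.sub(substitute, template_text)


# Made-up sample values, for illustration only.
rendered = replace_placeholders(
    "Call {{business_name}} on {{phone}}. {{about_paragraph}}",
    {"business_name": "Example SARL", "phone": "0000", "paragraphs": ["p1", "p2"]},
)
```

In this sketch the `paragraphs` list is silently ignored, which mirrors the gap described above: list-shaped content needs a mapping step of its own.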
## Desired Flow
```
Scraped content → intelligent mapping → template sections populated with real content
```
### Without AI (Phase 1 — programmatic mapping)
Map scraped content to template sections by position and keyword matching:
| Template Section | Scraped Source | Logic |
|---|---|---|
| Hero title | `headings[0]` | First heading = main title |
| Hero subtitle | `headings[1]` or `meta_description` | Second heading or meta desc |
| Hero background | `images[0]` | First large image |
| Features items | `headings` containing service keywords | Match headings to service names, use following paragraph as description |
| About content | `paragraphs[0:3]` | First 3 paragraphs = company story |
| Services content | `paragraphs` matching service keywords | Group paragraphs by service |
| Projects/Gallery | `images[1:8]` | Scraped images as gallery |
| Contact details | contacts (email, phone, address) | Already working |
| Social links | `social_links` from scrape | Footer social icons |
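The "Features items" row could be sketched as follows. This is a minimal illustration under stated assumptions: `build_feature_items` and `SERVICE_KEYWORDS` are hypothetical names, the keyword list is invented, and because headings and paragraphs are treated here as two separate lists, the description is taken from the first paragraph mentioning the same keyword rather than from the paragraph that follows the heading on the page.

```python
# Hypothetical keyword list; a real one would depend on the industry template.
SERVICE_KEYWORDS = ("rénovation", "construction", "isolation", "toiture")


def build_feature_items(headings, paragraphs, keywords=SERVICE_KEYWORDS, limit=6):
    """Return [{"title", "description"}] for headings that mention a service keyword."""
    items = []
    for heading in headings:
        matched = next((kw for kw in keywords if kw in heading.lower()), None)
        if matched is None:
            continue
        # Assumption: no document order is shared between the two lists, so
        # the first paragraph containing the keyword serves as the description.
        description = next((p for p in paragraphs if matched in p.lower()), "")
        items.append({"title": heading, "description": description})
        if len(items) == limit:
            break
    return items


# Made-up sample input, for illustration only.
features = build_feature_items(
    ["Bienvenue", "Rénovation complète", "Isolation thermique"],
    [
        "Nos travaux d'isolation réduisent la facture.",
        "La rénovation de votre maison, clé en main.",
    ],
)
```

If the scraper can preserve document order, pairing each heading with the paragraph that follows it would match the table more closely than keyword lookup.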
### With AI (Phase 2 — Workstream 4)
Send the scraped content and the template structure to an LLM with a prompt such as:
```
Given this scraped content from a construction company website and this
template structure, generate professional marketing copy for each section.
Rewrite and enhance the original text, keeping the facts but improving
tone and clarity. Output JSON matching the template section format.
```
AI would:
1. Extract the company's key selling points from paragraphs
2. Write a compelling hero tagline
3. Generate professional service descriptions from raw text
4. Create an about section from company history paragraphs
5. Translate to multiple languages
6. Generate missing content (testimonial placeholders, CTA copy)
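Since the provider is still to be decided, only the provider-independent parts can be sketched: assembling the prompt and validating the JSON that comes back. The helpers below (`build_enhancement_prompt`, `parse_enhanced_sections`) are hypothetical names and the payload layout is an assumption; no LLM call is shown.

```python
import json

PROMPT_HEADER = (
    "Given this scraped content from a construction company website and this\n"
    "template structure, generate professional marketing copy for each section.\n"
    "Rewrite and enhance the original text, keeping the facts but improving\n"
    "tone and clarity. Output JSON matching the template section format."
)


def build_enhancement_prompt(scraped, template_sections):
    """Combine the instruction text with the scraped content and template as JSON."""
    payload = {
        "scraped_content": {
            "headings": scraped.get("headings", []),
            "paragraphs": scraped.get("paragraphs", []),
        },
        "template_sections": template_sections,
    }
    return PROMPT_HEADER + "\n\n" + json.dumps(payload, ensure_ascii=False, indent=2)


def parse_enhanced_sections(raw_response, template_sections):
    """Keep only sections that exist in the template; raise ValueError on bad output."""
    data = json.loads(raw_response)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object keyed by section name")
    return {name: data[name] for name in template_sections if name in data}
```

Filtering the response against the template's own section names keeps unexpected keys out of the CMS data, whichever provider is chosen.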
## Plan — Phase 1 (Programmatic, no AI)
### Changes to `poc_builder_service.py`
#### 1. Enhanced `_build_context()` — extract more from scraped content
```python
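from html import escape  # at module level
headings = scraped.get("headings", [])
paragraphs = scraped.get("paragraphs", [])
images = scraped.get("images", [])
# Second heading, else the meta description (assumed to be in context already)
context["hero_subtitle"] = headings[1] if len(headings) > 1 else context.get("meta_description", "")
context["hero_image"] = images[0] if images else None
context["about_paragraphs"] = paragraphs[:3]
# Assumes scraped paragraphs are plain text, so they are escaped before wrapping
context["all_paragraphs_html"] = "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs[:8])
context["gallery_images"] = images[1:8]  # images[0] is reserved for the hero background
context["social_links"] = scraped.get("social_links", {})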
```
#### 2. New `_enrich_homepage_sections()` — inject scraped data into sections JSON
After placeholder replacement, before saving to DB:
```python
def _enrich_homepage_sections(self, sections: dict, context: dict) -> dict:
    """Inject scraped content into the homepage sections JSON."""
    hero = sections.get("hero")
    if hero:
        # Hero subtitle: overwrite every existing translation with the scraped text
        translations = (hero.get("subtitle") or {}).get("translations") or {}
        if context.get("hero_subtitle"):
            for lang in translations:
                translations[lang] = context["hero_subtitle"]
        # Hero background: first scraped image, independent of the subtitle
        if context.get("hero_image"):
            hero["background_image"] = context["hero_image"]

    # Add a gallery section when more than two scraped images are available
    gallery_images = context.get("gallery_images") or []
    if len(gallery_images) > 2:
        sections["gallery"] = {
            "enabled": True,
            "title": {"translations": {"en": "Our Work", "fr": "Nos Réalisations"}},
            "images": [{"src": img, "alt": ""} for img in gallery_images],
        }
    return sections
```
#### 3. Enrich subpages with scraped paragraphs
This is already partially done: `scraped_paragraphs_html` is appended to the about, services and projects pages. It can be improved by:
- Using scraped headings as section titles when they match service keywords
- Distributing paragraphs across pages based on keyword proximity
- Adding scraped images inline in content pages
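One way to distribute paragraphs across pages is sketched below. It reads "keyword proximity" as a simple keyword-count score per page, which is an assumption; `distribute_paragraphs`, `PAGE_KEYWORDS` and the keyword sets are hypothetical.

```python
# Hypothetical keyword sets, one per subpage.
PAGE_KEYWORDS = {
    "about": ("histoire", "équipe", "générations", "depuis"),
    "services": ("rénovation", "construction", "isolation", "devis"),
    "projects": ("réalisation", "chantier", "projet"),
}


def distribute_paragraphs(paragraphs, page_keywords=PAGE_KEYWORDS, default_page="about"):
    """Assign each paragraph to the page whose keywords it mentions most often."""
    pages = {page: [] for page in page_keywords}
    pages.setdefault(default_page, [])
    for paragraph in paragraphs:
        text = paragraph.lower()
        scores = {
            page: sum(text.count(kw) for kw in kws)
            for page, kws in page_keywords.items()
        }
        # On a tie, max() keeps the first page in dict order.
        best_page = max(scores, key=scores.get)
        # Paragraphs with no keyword hit fall back to the default page.
        target = best_page if scores[best_page] > 0 else default_page
        pages[target].append(paragraph)
    return pages


# Made-up sample input, for illustration only.
pages = distribute_paragraphs([
    "Depuis trois générations, notre équipe vous accompagne.",
    "Rénovation et isolation de votre maison.",
    "Un chantier récent à Strasbourg.",
    "Contactez-nous.",
])
```

Counting substrings is crude (it ignores word boundaries and accents), so a production version would likely need tokenisation and per-industry keyword lists.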
### Changes to templates
#### Hero section: support `background_image` field
```html
{% if hero.background_image %}
<section style="background-image: url('{{ hero.background_image }}'); background-size: cover;">
{% else %}
<section class="gradient-primary">
{% endif %}
```
#### Add gallery rendering to landing-full.html
Already supported via the `_gallery.html` macro; only the section data needs to be populated.
### Changes to storefront base template
#### Social links in footer
When `request.state.is_preview` is True, render social links from the store's CMS data or from the prospect's scraped social links.
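The selection logic might look like the sketch below. The precedence is an assumption, since the proposal does not say which source wins: here the store's CMS links take priority and scraped links are used only as a preview-mode fallback. `resolve_footer_social_links` is a hypothetical helper, not an existing function.

```python
def resolve_footer_social_links(is_preview, cms_links=None, scraped_links=None):
    """Pick which social links the footer should render."""
    # Assumption: links curated in the CMS always win when present.
    if cms_links:
        return dict(cms_links)
    # Assumption: scraped links are shown only on preview (POC) pages.
    if is_preview and scraped_links:
        return dict(scraped_links)
    return {}
```

The view would pass `request.state.is_preview` as the first argument and hand the result to the footer template.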
## Files to modify
| File | Change |
|---|---|
| `hosting/services/poc_builder_service.py` | Enhanced `_build_context()`, new `_enrich_homepage_sections()`, better content distribution |
| `cms/platform/sections/_hero.html` | Support `background_image` field for scraped hero images |
| `cms/storefront/landing-full.html` | Ensure gallery section renders |
| Template JSON files | Add `{{hero_subtitle}}`, `{{hero_image}}` placeholders |
## Estimated effort
- Phase 1 (programmatic): ~2-3 hours
- Phase 2 (AI): depends on provider integration (deferred)