diff --git a/docs/proposals/poc-content-mapping.md b/docs/proposals/poc-content-mapping.md
new file mode 100644
index 00000000..633028a4
--- /dev/null
+++ b/docs/proposals/poc-content-mapping.md
@@ -0,0 +1,143 @@
+# POC Content Mapping: Scraped Content → Template Sections
+
+## Problem
+
+The POC builder creates pages from industry templates, but the content scraped from the prospect's site barely appears on them. The homepage shows generic template text ("Quality construction and renovation") instead of the prospect's actual content ("Depuis trois générations, nous mettons notre savoir-faire au service de la qualité...").
+
+For batirenovation-strasbourg.fr:
+- **Scraped:** 30 headings, 21 paragraphs, 13 images, 3 contacts
+- **Shows on POC:** only business name, phone, email, and address, filled in via placeholders
+- **Missing:** all the prose content, service descriptions, company history, project descriptions
+
+## Current Flow
+
+```
+Scraped content → context dict → {{placeholder}} replacement → CMS pages
+```
+
+Placeholders are limited to simple fields: `{{business_name}}`, `{{phone}}`, `{{email}}`, `{{address}}`, `{{meta_description}}`, `{{about_paragraph}}`.
+
+The rich content (paragraphs, headings, images) is stored in `prospect.scraped_content_json` but is never mapped into the template sections.
+
+## Desired Flow
+
+```
+Scraped content → intelligent mapping → template sections populated with real content
+```
+
+### Without AI (Phase 1: programmatic mapping)
+
+Map scraped content to template sections by position and keyword matching:
+
+| Template Section | Scraped Source | Logic |
+|---|---|---|
+| Hero title | `headings[0]` | First heading = main title |
+| Hero subtitle | `headings[1]` or `meta_description` | Second heading or meta desc |
+| Hero background | `images[0]` | First large image |
+| Features items | `headings` containing service keywords | Match headings to service names, use following paragraph as description |
+| About content | `paragraphs[0:3]` | First 3 paragraphs = company story |
+| Services content | `paragraphs` matching service keywords | Group paragraphs by service |
+| Projects/Gallery | `images[1:8]` | Scraped images as gallery |
+| Contact details | contacts (email, phone, address) | Already working |
+| Social links | `social_links` from scrape | Footer social icons |
+
+### With AI (Phase 2: Workstream 4)
+
+Send the scraped content and the template structure to an LLM with this prompt:
+```
+Given this scraped content from a construction company website and this
+template structure, generate professional marketing copy for each section.
+Rewrite and enhance the original text, keeping the facts but improving
+tone and clarity. Output JSON matching the template section format.
+```
+
+AI would:
+1. Extract the company's key selling points from paragraphs
+2. Write a compelling hero tagline
+3. Generate professional service descriptions from raw text
+4. Create an about section from company history paragraphs
+5. Translate to multiple languages
+6. Generate missing content (testimonial placeholders, CTA copy)
+
+## Plan: Phase 1 (Programmatic, no AI)
+
+### Changes to `poc_builder_service.py`
+
+#### 1. Enhanced `_build_context()`: extract more from scraped content
+
+```python
+context["hero_subtitle"] = scraped["headings"][1] if len(scraped.get("headings", [])) > 1 else ""
+context["hero_image"] = scraped["images"][0] if scraped.get("images") else None
+context["about_paragraphs"] = scraped.get("paragraphs", [])[:3]
+context["all_paragraphs_html"] = "\n".join(f"<p>{p}</p>" for p in scraped.get("paragraphs", [])[:8])
+context["gallery_images"] = scraped.get("images", [])[:8]
+context["social_links"] = scraped.get("social_links", {})
+```
+
+#### 2. New `_enrich_homepage_sections()`: inject scraped data into sections JSON
+
+After placeholder replacement, before saving to DB:
+
+```python
+def _enrich_homepage_sections(self, sections: dict, context: dict) -> dict:
+    # Hero: use scraped subtitle and image (each applied independently)
+    hero = sections.get("hero")
+    if hero and context.get("hero_subtitle"):
+        for lang in (hero.get("subtitle", {}).get("translations", {}) or {}):
+            hero["subtitle"]["translations"][lang] = context["hero_subtitle"]
+    if hero and context.get("hero_image"):
+        hero["background_image"] = context["hero_image"]
+
+    # Add gallery section from scraped images
+    if context.get("gallery_images") and len(context["gallery_images"]) > 2:
+        sections["gallery"] = {
+            "enabled": True,
+            "title": {"translations": {"en": "Our Work", "fr": "Nos Réalisations"}},
+            "images": [{"src": img, "alt": ""} for img in context["gallery_images"]],
+        }
+
+    return sections
+```
+
+#### 3. Enrich subpages with scraped paragraphs
+
+Already partially done (appending `scraped_paragraphs_html` to about/services/projects). Improve by:
+- Using scraped headings as section titles when they match service keywords
+- Distributing paragraphs across pages based on keyword proximity
+- Adding scraped images inline in content pages
+
+### Changes to templates
+
+#### Hero section: support `background_image` field
+
+```html
+{% if hero.background_image %}
+
+{% else %}
+
+{% endif %}
+```
+
+#### Add gallery rendering to `landing-full.html`
+
+Already supported via the `_gallery.html` macro; only the section data needs to be populated.
+
+### Changes to storefront base template
+
+#### Social links in footer
+
+When `request.state.is_preview` is True, render social links from the store's CMS data or from the prospect's scraped social links.
+
+## Files to modify
+
+| File | Change |
+|---|---|
+| `hosting/services/poc_builder_service.py` | Enhanced `_build_context()`, new `_enrich_homepage_sections()`, better content distribution |
+| `cms/platform/sections/_hero.html` | Support `background_image` field for scraped hero images |
+| `cms/storefront/landing-full.html` | Ensure gallery section renders |
+| Template JSON files | Add `{{hero_subtitle}}`, `{{hero_image}}` placeholders |
+
+## Estimated effort
+
+- Phase 1 (programmatic): ~2-3 hours
+- Phase 2 (AI): depends on provider integration (deferred)
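
The proposal's step 3 mentions distributing paragraphs across pages based on keyword proximity but does not say how. One possible reading, as a minimal sketch only: score each paragraph against a keyword list per page and assign it to the best-scoring page. The function name `distribute_paragraphs`, the `PAGE_KEYWORDS` table, its keyword lists, and the `about` fallback are all hypothetical and do not come from `poc_builder_service.py`.

```python
from collections import defaultdict

# Hypothetical keyword lists; the proposal does not define them.
PAGE_KEYWORDS = {
    "services": ("renovation", "rénovation", "construction", "isolation"),
    "projects": ("project", "projet", "réalisation", "chantier"),
    "about": ("generation", "génération", "histoire", "savoir-faire"),
}


def distribute_paragraphs(paragraphs, page_keywords=PAGE_KEYWORDS, default_page="about"):
    """Assign each paragraph to the page whose keywords it mentions most often."""
    pages = defaultdict(list)
    for text in paragraphs:
        lowered = text.lower()
        # Count case-insensitive substring hits per page.
        scores = {
            page: sum(lowered.count(keyword) for keyword in keywords)
            for page, keywords in page_keywords.items()
        }
        best_page = max(scores, key=scores.get)
        # Paragraphs that match no keyword fall back to the default page.
        if scores[best_page] == 0:
            best_page = default_page
        pages[best_page].append(text)
    return dict(pages)
```

Substring counting is deliberately crude: it handles plurals such as "projets" without a stemmer, at the cost of occasional false matches. Ties go to whichever page is listed first in the keyword table.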