docs(prospecting): add scoring, database, and research docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 16:04:52 +01:00
parent 6d6eba75bf
commit 78ee05f50e
3 changed files with 361 additions and 0 deletions


@@ -0,0 +1,171 @@
# Database Schema
## Entity Relationship Diagram
```
┌─────────────────────┐ ┌────────────────────────┐
│ prospects │────<│ prospect_tech_profiles │
├─────────────────────┤ ├────────────────────────┤
│ id │ │ id │
│ channel │ │ prospect_id (FK) │
│ business_name │ │ cms, server │
│ domain_name │ │ hosting_provider │
│ status │ │ js_framework, cdn │
│ source │ │ analytics │
│ has_website │ │ ecommerce_platform │
│ uses_https │ │ tech_stack_json (JSON) │
│ ... │ └────────────────────────┘
└─────────────────────┘
│ ┌──────────────────────────────┐
└──────────────<│ prospect_performance_profiles │
│ ├──────────────────────────────┤
│ │ id │
│ │ prospect_id (FK) │
│ │ performance_score (0-100) │
│ │ accessibility_score │
│ │ seo_score │
│ │ FCP, LCP, TBT, CLS │
│ │ is_mobile_friendly │
│ └──────────────────────────────┘
│ ┌───────────────────────┐
└──────────────<│ prospect_scores │
│ ├───────────────────────┤
│ │ id │
│ │ prospect_id (FK) │
│ │ score (0-100) │
│ │ technical_health_score│
│ │ modernity_score │
│ │ business_value_score │
│ │ engagement_score │
│ │ reason_flags (JSON) │
│ │ lead_tier │
│ └───────────────────────┘
│ ┌───────────────────────┐
└──────────────<│ prospect_contacts │
│ ├───────────────────────┤
│ │ id │
│ │ prospect_id (FK) │
│ │ contact_type │
│ │ value │
│ │ source_url │
│ │ is_primary │
│ └───────────────────────┘
│ ┌───────────────────────┐
└──────────────<│ prospect_interactions │
│ ├───────────────────────┤
│ │ id │
│ │ prospect_id (FK) │
│ │ interaction_type │
│ │ subject, notes │
│ │ outcome │
│ │ next_action │
│ │ next_action_date │
│ │ created_by_user_id │
│ └───────────────────────┘
│ ┌───────────────────────┐
└──────────────<│ prospect_scan_jobs │
├───────────────────────┤
│ id │
│ job_type │
│ status │
│ total_items │
│ processed_items │
│ celery_task_id │
└───────────────────────┘
┌──────────────────────┐ ┌──────────────────┐
│ campaign_templates │────<│ campaign_sends │
├──────────────────────┤ ├──────────────────┤
│ id │ │ id │
│ name │ │ template_id (FK) │
│ lead_type │ │ prospect_id (FK) │
│ channel │ │ channel │
│ language │ │ rendered_subject │
│ subject_template │ │ rendered_body │
│ body_template │ │ status │
│ is_active │ │ sent_at │
└──────────────────────┘ │ sent_by_user_id │
└──────────────────┘
```
## Tables
### prospects
Central table for all leads — both digital (domain-based) and offline (in-person).
| Column | Type | Description |
|--------|------|-------------|
| id | INTEGER PK | Auto-increment |
| channel | ENUM(digital, offline) | How the lead was discovered |
| business_name | VARCHAR(255) | Required for offline |
| domain_name | VARCHAR(255) | Required for digital, unique |
| status | ENUM | pending, active, inactive, parked, error, contacted, converted |
| source | VARCHAR(100) | e.g. "domain_scan", "networking_event", "street" |
| has_website | BOOLEAN | Determined by HTTP check |
| uses_https | BOOLEAN | Whether the site is served over HTTPS |
| http_status_code | INTEGER | Last HTTP response |
| address | VARCHAR(500) | Physical address (offline) |
| city | VARCHAR(100) | City |
| postal_code | VARCHAR(10) | Postal code |
| country | VARCHAR(2) | Default "LU" |
| notes | TEXT | Free-form notes |
| tags | JSON | Flexible tagging |
| captured_by_user_id | INTEGER FK | Who captured this lead |
| location_lat / location_lng | FLOAT | GPS from mobile capture |
| last_*_at | DATETIME | Timestamps for each scan type |
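
The channel-dependent requirements in the table above (`business_name` for offline, `domain_name` for digital) can be checked before a row is written. The following is a minimal sketch, not the module's actual code: `ProspectDraft` and `validate_prospect` are hypothetical names, and only the column names, lengths, and the `"LU"` default come from the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProspectDraft:
    """Hypothetical in-memory stand-in for a row in `prospects`."""
    channel: str                         # "digital" or "offline"
    business_name: Optional[str] = None
    domain_name: Optional[str] = None
    country: str = "LU"                  # documented default


def validate_prospect(p: ProspectDraft) -> list[str]:
    """Return a list of problems; an empty list means the draft is valid."""
    errors = []
    if p.channel not in ("digital", "offline"):
        errors.append(f"unknown channel: {p.channel!r}")
    # Each channel has its own required identifying column.
    if p.channel == "digital" and not p.domain_name:
        errors.append("domain_name is required for digital prospects")
    if p.channel == "offline" and not p.business_name:
        errors.append("business_name is required for offline prospects")
    # Length limits mirror the VARCHAR sizes in the table.
    if p.business_name and len(p.business_name) > 255:
        errors.append("business_name exceeds VARCHAR(255)")
    if p.domain_name and len(p.domain_name) > 255:
        errors.append("domain_name exceeds VARCHAR(255)")
    if len(p.country) != 2:
        errors.append("country must be a 2-letter code")
    return errors
```

Uniqueness of `domain_name` is left to the database constraint, since it cannot be checked on a single draft in isolation.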
### prospect_tech_profiles
Technology stack detection results. One per prospect.
| Column | Type | Description |
|--------|------|-------------|
| cms | VARCHAR(100) | WordPress, Drupal, Joomla, etc. |
| server | VARCHAR(100) | Nginx, Apache |
| hosting_provider | VARCHAR(100) | Hosting company |
| cdn | VARCHAR(100) | CDN provider |
| js_framework | VARCHAR(100) | React, Vue, Angular, jQuery |
| analytics | VARCHAR(200) | Google Analytics, Matomo, etc. |
| ecommerce_platform | VARCHAR(100) | Shopify, WooCommerce, etc. |
| tech_stack_json | JSON | Full detection results |
### prospect_performance_profiles
Lighthouse audit results. One per prospect.
| Column | Type | Description |
|--------|------|-------------|
| performance_score | INTEGER | 0-100 |
| accessibility_score | INTEGER | 0-100 |
| seo_score | INTEGER | 0-100 |
| first_contentful_paint_ms | INTEGER | FCP |
| largest_contentful_paint_ms | INTEGER | LCP |
| total_blocking_time_ms | INTEGER | TBT |
| cumulative_layout_shift | FLOAT | CLS |
| is_mobile_friendly | BOOLEAN | Mobile test |
### prospect_scores
Calculated opportunity scores. One per prospect. See [scoring.md](scoring.md) for algorithm details.
### prospect_contacts
Scraped or manually entered contact info. Many per prospect.
### prospect_interactions
CRM-style interaction log. Many per prospect. Types: note, call, email_sent, email_received, meeting, visit, sms, proposal_sent.
### prospect_scan_jobs
Background job tracking for batch operations.
### campaign_templates / campaign_sends
Marketing campaign templates and send tracking. Templates support placeholders like `{business_name}`, `{domain}`, `{score}`, `{issues}`.
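
The placeholder syntax matches Python's `str.format` fields, so rendering can be sketched as below. This assumes the templates are rendered with `str.format_map`, which the docs do not confirm; `render_template` and `_KeepMissing` are hypothetical names, and the sample values are made up.

```python
class _KeepMissing(dict):
    """Leave unknown placeholders untouched instead of raising KeyError."""

    def __missing__(self, key):
        return "{" + key + "}"


def render_template(template: str, context: dict) -> str:
    # format_map looks each {name} up in the mapping; missing names fall
    # through to __missing__ and are re-emitted verbatim.
    return template.format_map(_KeepMissing(context))


subject = render_template(
    "Quick question about {domain}",
    {"domain": "example.lu"},
)
body = render_template(
    "Hello {business_name}, your site scored {score}/100. Issues: {issues}.",
    {"business_name": "Example Sarl", "score": 72, "issues": "no SSL, slow"},
)
```

Keeping unknown placeholders visible makes an incomplete context easy to spot in `rendered_subject` and `rendered_body` before a send. Literal braces in a template would need to be doubled (`{{`, `}}`) under this approach.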


@@ -0,0 +1,80 @@
# .lu Domain Lead Generation — Research Findings
Research on data sources, APIs, legal requirements, and cost analysis for the prospecting module.
## 1. Data Sources for .lu Domains
The official .lu registry (DNS-LU / RESTENA) does **not** publish zone files. Third-party providers rely on web crawling to discover domains, so no list is complete; expect 70-80% coverage.
### Providers
| Provider | Domains | Price | Format | Notes |
|----------|---------|-------|--------|-------|
| NetworksDB | ~70,000 | $5 | Zipped text | Best value, one-time purchase |
| DomainMetaData | Varies | $9.90/mo | CSV | Daily updates |
| Webatla | ~75,000 | Unknown | CSV | Good coverage |
## 2. Technical APIs — Cost Analysis
### Technology Detection
| Service | Free Tier | Notes |
|---------|-----------|-------|
| CRFT Lookup | Unlimited | Budget option, includes Lighthouse |
| Wappalyzer | 50/month | Most accurate |
| WhatCMS | Free lookups | CMS-only |
**Approach used**: Custom HTML parsing for CMS, JS framework, analytics, and server detection (no external API dependency).
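
As an illustration of what such HTML parsing can look like, here is a small sketch using only the standard library. It is not the module's detector: `detect_stack` is a hypothetical name, and the heuristics (the `generator` meta tag, a `/wp-content/` path, `jquery` in a script URL) are common conventions chosen for the example.

```python
from html.parser import HTMLParser


class _StackSniffer(HTMLParser):
    """Collect the generator meta tag and external script URLs."""

    def __init__(self):
        super().__init__()
        self.generator = None
        self.script_srcs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "generator":
            self.generator = attrs.get("content") or ""
        elif tag == "script" and attrs.get("src"):
            self.script_srcs.append(attrs["src"].lower())


def detect_stack(html: str) -> dict:
    sniffer = _StackSniffer()
    sniffer.feed(html)
    generator = (sniffer.generator or "").lower()

    cms = None
    for name in ("WordPress", "Drupal", "Joomla", "TYPO3"):
        if name.lower() in generator:
            cms = name
            break
    # Fallback heuristic: WordPress asset paths often appear in the markup.
    if cms is None and "/wp-content/" in html:
        cms = "WordPress"

    uses_jquery = any("jquery" in src for src in sniffer.script_srcs)
    return {"cms": cms, "uses_jquery": uses_jquery}
```

Heuristics like these produce false negatives when a site hides its generator tag or bundles its scripts, which is one reason `tech_stack_json` keeping the full detection output is useful.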
### Performance Audits
PageSpeed Insights API: **free**, with quotas of 25,000 queries per day and 400 queries per 100 seconds.
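
A sketch of how a scan job might build a request and read the score back, under the quotas stated above. The endpoint is the public PageSpeed Insights v5 URL; the function names are hypothetical, and how the module actually calls the API is not documented here. Lighthouse reports category scores as a 0-1 float, so the sketch converts to the 0-100 scale used by `performance_score`.

```python
from urllib.parse import urlencode

PSI_ENDPOINT = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed"

# 400 requests per 100 seconds means at most one request every 0.25 s.
MIN_INTERVAL_S = 100 / 400


def build_psi_url(domain: str, strategy: str = "mobile", api_key: str = "") -> str:
    """Build the request URL for one domain (no network access here)."""
    params = {"url": f"https://{domain}", "strategy": strategy}
    if api_key:
        params["key"] = api_key
    return f"{PSI_ENDPOINT}?{urlencode(params)}"


def performance_score(psi_response: dict):
    """Convert Lighthouse's 0-1 performance score to 0-100, or None."""
    try:
        raw = psi_response["lighthouseResult"]["categories"]["performance"]["score"]
    except (KeyError, TypeError):
        return None
    return None if raw is None else round(raw * 100)
```

Sleeping `MIN_INTERVAL_S` between requests keeps a single worker inside the per-100-second quota; at that pace the daily quota of 25,000 is reached after roughly 1 hour 45 minutes of continuous scanning.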
### SSL Checks
A simple HTTPS connectivity check is used because it is fast. The SSL Labs API is available for deeper analysis of high-priority leads.
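
One way to implement a fast connectivity check is a plain TLS handshake with certificate validation. This is a sketch under that assumption; `check_https` is a hypothetical name and the module's real check may differ (for example, it may issue an HTTP request instead).

```python
import socket
import ssl


def check_https(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TLS handshake with a valid certificate succeeds."""
    context = ssl.create_default_context()  # verifies chain and hostname
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except OSError:
        # Covers refused connections, timeouts, DNS failures, and
        # ssl.SSLError (a subclass of OSError), including bad certificates.
        return False
```

With this definition an expired or self-signed certificate counts as "no HTTPS", which matches the sales angle of the `no_ssl` flag but is stricter than checking whether port 443 is open.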
### WHOIS
Due to GDPR, .lu WHOIS data for private individuals is hidden; only the owner type and country are visible. Contact info is scraped from websites instead.
## 3. Legal — Luxembourg & GDPR
### B2B Cold Email Rules
Luxembourg places **no specific restrictions on B2B cold email**: Article 11(1) of the Electronic Privacy Act applies only to natural persons.
**Requirements**:
1. Identify yourself clearly (company name, address)
2. Provide opt-out mechanism in every email
3. Message must relate to recipient's business
4. Store contact data securely
5. Only contact businesses, not private individuals
**Legal basis**: Legitimate interest (GDPR Art. 6(1)(f))
### GDPR Penalties
Fines of up to EUR 20 million or 4% of global annual turnover, whichever is higher, for violations.
**Key violations to avoid**:
- Emailing private individuals without consent
- No opt-out mechanism
- Holding personal data longer than necessary
### Recommendation
- Focus on `info@`, `contact@`, and business role emails
- Always include unsubscribe link
- Document legitimate interest basis
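
The first recommendation can be enforced with a simple filter on scraped addresses. In this sketch only `info` and `contact` come from the text above; the other entries in `ROLE_LOCAL_PARTS` and the function name are illustrative assumptions to be adjusted.

```python
# Local parts treated as business role mailboxes. Only "info" and "contact"
# are named in the recommendation; the rest are example additions.
ROLE_LOCAL_PARTS = {"info", "contact", "hello", "office", "sales", "support"}


def is_role_address(email: str) -> bool:
    """True for addresses like info@example.lu, False for personal ones."""
    local, sep, domain = email.strip().lower().partition("@")
    return bool(sep) and bool(domain) and local in ROLE_LOCAL_PARTS


scraped = ["Info@example.lu", "jean.dupont@example.lu", "contact@shop.lu"]
allowed = [e for e in scraped if is_role_address(e)]
```

An allow-list errs on the side of skipping addresses that might belong to a natural person, which fits the "only contact businesses" requirement.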
## 4. Cost Summary
| Item | Cost | Type |
|------|------|------|
| Domain list (NetworksDB) | $5 | One-time |
| PageSpeed API | Free | Ongoing |
| Contact scraping | Free | Self-hosted |
| Tech detection | Free | Self-hosted |
The items above total $5, so a working MVP costs under $25 in total.


@@ -0,0 +1,110 @@
# Opportunity Scoring Model
## Overview
The scoring model assigns each prospect a score from 0-100 based on the opportunity potential for offering web services. Higher scores indicate better leads. The model supports two channels: **digital** (domain-based) and **offline** (in-person discovery).
## Score Components — Digital Channel
### Technical Health (Max 40 points)
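Issues that indicate immediate opportunities. The three speed rows are tiers of a single check: the 40-point maximum (15 + 15 + 10) implies that only the most severe matching tier is counted.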
| Issue | Points | Condition |
|-------|--------|-----------|
| No SSL | 15 | `uses_https = false` |
| Very Slow | 15 | `performance_score < 30` |
| Slow | 10 | `performance_score < 50` |
| Moderate Speed | 5 | `performance_score < 70` |
| Not Mobile Friendly | 10 | `is_mobile_friendly = false` |
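
A minimal sketch of this component, assuming the speed tiers are mutually exclusive (otherwise the rows would sum past 40). It is not the code in `scoring_service.py`; the function name and signature are hypothetical.

```python
from typing import Optional


def technical_health(
    uses_https: Optional[bool],
    performance_score: Optional[int],
    is_mobile_friendly: Optional[bool],
) -> tuple[int, list[str]]:
    """Return (points, reason_flags) for the Technical Health component."""
    points, flags = 0, []

    # None means "not scanned yet", so only an explicit False scores.
    if uses_https is False:
        points += 15
        flags.append("no_ssl")

    # Speed tiers: only the most severe matching tier applies.
    if performance_score is not None:
        if performance_score < 30:
            points += 15
            flags.append("very_slow")
        elif performance_score < 50:
            points += 10
            flags.append("slow")
        elif performance_score < 70:
            points += 5  # the docs list no flag for this tier

    if is_mobile_friendly is False:
        points += 10
        flags.append("not_mobile_friendly")

    return min(points, 40), flags
```

For example, a site without HTTPS, with a performance score of 25, that fails the mobile test reaches the full 40 points.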
### Modernity / Stack (Max 25 points)
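Signs of an outdated technology stack. Outdated CMS and Unknown CMS are mutually exclusive, which gives the 25-point maximum (15 + 5 + 5):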
| Issue | Points | Condition |
|-------|--------|-----------|
| Outdated CMS | 15 | CMS is Drupal, Joomla, or TYPO3 |
| Unknown CMS | 5 | No CMS detected but has website |
| Legacy JavaScript | 5 | Uses jQuery without modern framework |
| No Analytics | 5 | No Google Analytics or similar |
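
The same table as a sketch in code. It assumes "jQuery without modern framework" shows up as `js_framework == "jQuery"` in the tech profile, which the schema suggests but does not state; the function name is hypothetical.

```python
from typing import Optional

OUTDATED_CMS = {"drupal", "joomla", "typo3"}


def modernity(
    cms: Optional[str],
    js_framework: Optional[str],
    analytics: Optional[str],
    has_website: bool = True,
) -> tuple[int, list[str]]:
    """Return (points, reason_flags) for the Modernity / Stack component."""
    points, flags = 0, []

    if cms and cms.lower() in OUTDATED_CMS:
        points += 15
        flags.append("outdated_cms")
    elif not cms and has_website:
        points += 5  # unknown CMS; the docs list no flag for this row

    if js_framework and js_framework.lower() == "jquery":
        points += 5
        flags.append("legacy_js")

    if not analytics:
        points += 5
        flags.append("no_analytics")

    return min(points, 25), flags
```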
### Business Value (Max 25 points)
Indicators of business potential:
| Factor | Points | Condition |
|--------|--------|-----------|
| Has Website | 10 | Active website exists |
| Has E-commerce | 10 | E-commerce platform detected |
| Short Domain | 5 | Domain name <= 15 characters |
### Engagement Potential (Max 10 points)
Ability to contact the business:
| Factor | Points | Condition |
|--------|--------|-----------|
| Has Contacts | 5 | Any contact info found |
| Has Email | 3 | Email address found |
| Has Phone | 2 | Phone number found |
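
The Business Value and Engagement tables above are purely additive, so both fit in a few lines. This sketch assumes `contact_type` takes the values `"email"` and `"phone"` and that the length check applies to the full stored `domain_name`; neither is confirmed by the docs, and the function names are hypothetical.

```python
from typing import Optional


def business_value(
    has_website: bool,
    ecommerce_platform: Optional[str],
    domain_name: Optional[str],
) -> int:
    points = 0
    if has_website:
        points += 10
    if ecommerce_platform:
        points += 10
    if domain_name and len(domain_name) <= 15:
        points += 5
    return points  # at most 25


def engagement(contacts: list[dict]) -> int:
    """`contacts` mirrors rows of prospect_contacts as plain dicts."""
    types = {c["contact_type"] for c in contacts}
    points = 0
    if contacts:
        points += 5
    if "email" in types:
        points += 3
    if "phone" in types:
        points += 2
    return points  # at most 10
```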
## Score Components — Offline Channel
Offline leads have a simplified scoring model based on the information captured during in-person encounters:
| Scenario | Technical Health | Modernity | Business Value | Engagement | Total |
|----------|-----------------|-----------|----------------|------------|-------|
| No website at all | 30 | 20 | 20 | 0 | **70** (top_priority) |
| Uses gmail/free email | +0 | +10 | +0 | +0 | +10 |
| Met in person | +0 | +0 | +0 | +5 | +5 |
| Has email contact | +0 | +0 | +0 | +3 | +3 |
| Has phone contact | +0 | +0 | +0 | +2 | +2 |
A business with no website, met in person, with both an email and a phone contact scores: 70 + 5 + 3 + 2 = **80** (top_priority).
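
The offline table reads as one base scenario plus additive modifiers, which can be sketched as follows. The function name and boolean parameters are hypothetical, and the cap at 100 is an assumption taken from the 0-100 score range.

```python
def offline_score(
    has_website: bool,
    uses_free_email: bool,
    met_in_person: bool,
    has_email: bool,
    has_phone: bool,
) -> tuple[int, dict, list[str]]:
    """Return (total, per-component points, reason_flags)."""
    parts = {"technical_health": 0, "modernity": 0,
             "business_value": 0, "engagement": 0}
    flags = []

    if not has_website:  # base scenario: 30 + 20 + 20 = 70
        parts["technical_health"] += 30
        parts["modernity"] += 20
        parts["business_value"] += 20
        flags.append("no_website")
    if uses_free_email:
        parts["modernity"] += 10
        flags.append("uses_gmail")
    if met_in_person:
        parts["engagement"] += 5
        flags.append("met_in_person")
    if has_email:
        parts["engagement"] += 3
    if has_phone:
        parts["engagement"] += 2

    return min(sum(parts.values()), 100), parts, flags


total, parts, flags = offline_score(
    has_website=False, uses_free_email=False,
    met_in_person=True, has_email=True, has_phone=True,
)
```

Here `total` is 80, matching the worked example; adding a free email address would bring it to 90.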
## Lead Tiers
Based on the total score:
| Tier | Score Range | Description |
|------|-------------|-------------|
| `top_priority` | 70-100 | Best leads, multiple issues or no website at all |
| `quick_win` | 50-69 | Good leads, 1-2 easy fixes |
| `strategic` | 30-49 | Moderate potential |
| `low_priority` | 0-29 | Low opportunity |
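
The tier boundaries above map to a simple threshold function. A sketch, with a hypothetical function name:

```python
def lead_tier(score: int) -> str:
    """Map a 0-100 opportunity score to its lead tier."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score >= 70:
        return "top_priority"
    if score >= 50:
        return "quick_win"
    if score >= 30:
        return "strategic"
    return "low_priority"
```

Checking from the highest threshold down keeps each boundary in one place, so changing a tier range means editing a single number.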
## Reason Flags
Each score includes `reason_flags` that explain why points were awarded:
```json
{
"score": 75,
"reason_flags": ["no_ssl", "slow", "outdated_cms"],
"lead_tier": "top_priority"
}
```
Common flags (digital):
- `no_ssl` — Missing HTTPS
- `very_slow` — Performance score < 30
- `slow` — Performance score < 50
- `not_mobile_friendly` — Fails mobile tests
- `outdated_cms` — Using old CMS
- `legacy_js` — Using jQuery only
- `no_analytics` — No tracking installed
Offline-specific flags:
- `no_website` — Business has no website
- `uses_gmail` — Uses free email provider
- `met_in_person` — Lead captured in person (warm lead)
## Customizing the Model
The scoring logic is in `app/modules/prospecting/services/scoring_service.py`. You can adjust:
1. **Point values** — Change weights for different issues
2. **Thresholds** — Adjust performance score cutoffs
3. **Conditions** — Add new scoring criteria
4. **Tier boundaries** — Change score ranges for tiers