orion/docs/proposals/loyalty-go-live-readiness.md
Samir Boulahtit 947ca43c7b
docs(loyalty): record 2026-05-30 afternoon — prod-readiness 1-3 done + alerting back online
Picked up the morning's carry-over and ran the full prod-readiness
chain end-to-end. Resolution: SG credential out of git permanently
via untrack + .example template (e44f5c04); per-host migration on
prod (alertmanager.yml gitignored, real file lives outside git);
deploy-api-only.sh succeeded for the first time; today's 9 queued
loyalty commits live on prod with ?v=e44f5c04 (and verified by
re-running the loyalty redirect flicker repro — clean).

Multi-hour rabbit hole on actual email delivery: provider's port 587
PLAIN backend is OAuth-wired (returns RFC 6749 invalid_grant text
for password auth); switched to provider's documented port 465 SSL/TLS
endpoint. Discovered Hetzner Cloud blocks outbound 25 and 465 by
default as anti-spam policy. Auto-approved unblock ticket landed in
minutes; one-line smarthost change to :465 reactivated email
alerting after 13+ days down. alertmanager handles implicit TLS on
465 natively, no stunnel/relay needed.

Hetzner doc updated with the egress-block warning + mail1 SMTP
callout in 1227567d as 5h-debug payback. Next session resumes at
Test 5.2 (/account/loyalty with 168 pts customer) → 5.3 history.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-30 20:20:00 +02:00


Loyalty Go-Live Readiness — 2026-05-10

Snapshot of where the loyalty platform stands the night of 2026-05-10. The canonical sequenced plan is still app/modules/loyalty/docs/production-launch-plan.md; this doc records the current state ( / / 🟡) and what surfaced during the prod readiness pass.

TL;DR

The technical pre-launch checklist is green. The remaining gate is a human one: walking the 8 user-journey E2E tests on prod with a real test customer and confirming nothing surprises us. After that, flip the loyalty platform live for FASHIONHUB's stores and start the Google Wallet production-access review in parallel (1-3 day Google review, non-blocking).

2026-05-16 update — Test 1 round 1: 7 bugs found, 6 fixed, 1 pending

First attempt at the customer-facing journey on FASHIONHUB's fallback subdomain (fashionhub.rewardflow.lu) surfaced more than expected. A critical timestamp bug was masquerading as a re-enrollment confusion, which sent us briefly down the wrong investigation path. The clean-slate reset described below cleared the bad data so the remaining gates can be verified on a known-good baseline.

Six bugs fixed and deployed to prod (5 commits):

| Bug | Layer | Fix |
|---|---|---|
| TimestampMixin evaluated datetime.now(UTC) once at module import, so every row was stamped at process-start time | models/database/base.py | Pass the _utc_now callable as default / onupdate. Critical: affected every created_at / updated_at on every table that uses the mixin since the last app restart. |
| Admin/store/merchant card detail page showed "-" for phone + birthday even when both were captured during enrollment | app/modules/loyalty/schemas/card.py + 3 endpoints | Added customer_phone + customer_birthday to CardDetailResponse and populated them from customer.phone / customer.birth_date. Data was persisting all along; this was purely a serialization gap. |
| Storefront <html lang="en"> was hardcoded, which made <input type="date"> show mm/dd/yyyy on the FR storefront | app/templates/storefront/base.html | Dynamic lang="{{ current_language\|default('en') }}" so the browser respects the FR locale. |
| Storefront nav "Home" rendered as an English literal across all locales despite nav.home existing in every locale file | app/templates/storefront/base.html | Use {{ _('nav.home') }} on both desktop and mobile nav. |
| Store.description (the per-store tagline) was single-language only: FASHIONHUB's "Trendy clothing and accessories" rendered in EN on the FR storefront footer | Store model + migration tenancy_005 + template + seed_demo.py | Added description_translations JSON column with the same shape used by CMS / Platform / Subscription. Added get_translated_description(lang) getter with FR/DE → DEFAULT_LANGUAGE → description fallback. Seeded FR/DE/LB/EN for Fashion Group's two stores so they render correctly out of the box. |
| make init-prod and make db-reset referenced scripts/seed/seed_email_templates.py, which doesn't exist (the real seeders are _core.py + _loyalty.py), so db-reset would silently bomb mid-way | Makefile | Call both real scripts in both targets. |
| scripts/seed/create_default_content_pages.py still passed meta_keywords to ContentPage, but the column was dropped in migration cms_003, so fresh seeding failed on the first platform | scripts/seed/create_default_content_pages.py | Drop the meta_keywords kwarg. |
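
The timestamp bug in the first row comes down to passing a value where a callable was needed. The following is a minimal sketch of that difference using only the standard library. resolve_default and stamp_rows are hypothetical stand-ins for how an ORM applies a column default; they are not the project's TimestampMixin.

```python
import time
from datetime import datetime, timezone


def _utc_now() -> datetime:
    """Return the current UTC time; evaluated on every call."""
    return datetime.now(timezone.utc)


# Buggy shape: this expression runs exactly once, when the module is imported.
IMPORT_TIME_STAMP = datetime.now(timezone.utc)


def resolve_default(default):
    """Hypothetical stand-in for an ORM default: call it if callable, else reuse the value."""
    return default() if callable(default) else default


def stamp_rows(default, count: int = 3, pause: float = 0.02) -> list:
    """Simulate inserting `count` rows a few milliseconds apart."""
    stamps = []
    for _ in range(count):
        stamps.append(resolve_default(default))
        time.sleep(pause)
    return stamps


frozen = stamp_rows(IMPORT_TIME_STAMP)  # every row shares the import-time value
fresh = stamp_rows(_utc_now)            # each row gets its own timestamp

print("distinct timestamps with a value default:", len(set(frozen)))
print("last row is later with a callable default:", fresh[-1] > fresh[0])
```

With the value form, rows written hours apart still carry the process-start time, which is why a fresh customer could look several days old.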

One bug still open:

  • B1-F — welcome email not received. The original investigation was confounded by the timestamp bug (customer looked like it was from May 12 when it was actually fresh, making the re-enrollment hypothesis seem plausible). Needs fresh repro on the clean DB: enroll with a new email, tail api + celery-worker logs live, check email_logs for a row. If still no email, then there's a real bug in the dispatch path — notification_service.send_enrollment_confirmation is called from card_service.enroll_customer:636 and wraps the call in a try/except that only logs warnings (card_service.py:631-645), so a silent failure in _resolve_context or the Celery enqueue would be invisible from the user's perspective.

Two product decisions pending (from the same session, not yet implemented):

  • B1-E — QR code in welcome email. Scoped: pass wallet_save_url into the loyalty_enrollment template, generate QR server-side (Python qrcode), update the HTML body in all 4 locales in scripts/seed/seed_email_templates_loyalty.py:294-299, reseed. Blocked on B1-F (no point adding a QR to an email that doesn't send).
  • C1-C backfill scope. Other stores (WizaTech, BookWorld, LuxWeb, WizaMart etc.) still only have a single-language description. Fashion Group was seeded; rest can be done by hand via admin UI as merchants come online, or batch-updated later. No code work needed.

Prod data reset

Wiped and reseeded — used the corrected sequence from deployment/hetzner-server-setup.md section 12. Two doc gaps found and patched in the same pass:

  • Reset procedure called scripts/seed/seed_email_templates.py (doesn't exist) — now calls both real scripts
  • Reset procedure was missing seed_demo.py at the end of step 8 — now included

After reset, admin credentials are back to the defaults from init_production.py (admin / Ollama@8044, etc.); platform admin SMTP overrides in /admin/settings need to be re-applied (port 587, STARTTLS, support@wizard.lu).

Status board delta

  • Step 1 (email templates seeded): re-seeded post-reset, still green
  • Step 3 (migrations): now at tenancy_005, still green
  • Step 6 (web user-journey E2E tests): Test 1 round 2 pending on clean DB; the bugs found in round 1 are no longer blockers

Next session

Session paused 2026-05-16 evening. To resume Test 1 round 2:

  1. Re-apply SMTP overrides under /admin/settings (port 587, STARTTLS, support@wizard.lu) — the reset wiped them.
  2. Confirm /admin/loyalty/programs shows the Fashion Group program (should be seeded by seed_demo.py).
  3. Tail api and celery-worker logs live, then enroll at https://fashionhub.rewardflow.lu/loyalty/join with a fresh email. The point of the live tail is to catch where B1-F actually dies — at dispatch, at SMTP, or somewhere else.

2026-05-17 update — B1-F resolved (chain of 4 nested bugs)

End-to-end enrollment → Celery dispatch → email_logs status=sent → real emails arriving in inbox. Verified with the FR locale: enrollment ("Bienvenue chez Fashion Group S.A. Loyalty !") and welcome-bonus ("Vous avez gagné 50 points bonus !") both send within ~4s of submit.

The "no welcome email" symptom hid four layered bugs; each silently masked the next, which is why early diagnostics looked clean:

| # | Bug | Fix |
|---|---|---|
| 1 | @shared_task defaulted to amqp://localhost// because celery_app.set_default() was never called AND the api process never imported celery_config. .delay() raised kombu.OperationalError: Connection refused. | 44c42909: set_default() + early import in main.py (with # isort: split so ruff doesn't reorder it). |
| 2 | on_failure log handler crashed on the reserved LogRecord attribute name args; the resulting KeyError masked every real task exception. | 3e650ff8: rename to task_args / task_kwargs. |
| 3 | loyalty.send_notification_email wasn't in the worker's task registry because notifications.py wasn't imported by loyalty/tasks/__init__.py. Worker received the message, couldn't find the task, ACKed silently. | 2a216101: add the import + __all__ entry. |
| 4 | Celery worker process never imported all models. First DB query failed with InvalidRequestError: expression 'ContentPage' failed to locate a name. | 5b21908b: _preload_all_module_models() walks the registry and force-imports each module's models package at celery_config load. |

Three earlier same-session commits also shipped: SMTP password eye toggle (64a178f4), JS error on /admin/loyalty/programs (8d6830fc), 422 on ProgramCreate (120532e6).

Audit finding

app/modules/prospecting/tasks/__init__.py has the same shape as bug #3 above — scan_tasks.py exists but isn't imported. Not blocking anything today (no prospecting Celery dispatch is wired up yet), but should be fixed alongside the unit-test pass below.

Follow-ups (queued for next session)

  1. Two Test 1 nits — date format mm/dd/yyyy on FR storefront enrollment form (verify the <html lang> deploy actually landed; if it did, the user's browser doesn't honor lang for <input type="date"> and we need a JS date-picker swap); "Continuer mes achats" CTA on enroll-success.html:118 is wrong for loyalty-only storefronts with no catalog.
  2. Test 2 — cross-store re-enrollment at FASHIONOUTLET with the email from Test 1.
  3. Hetzner doc check — verify whether docs/deployment/hetzner-server-setup.md needs any new step from tonight's fixes. Most likely no (the fixes are in-code, not deployment), but worth a glance.
  4. Unit tests — none of the four B1-F bugs were caught by the existing suite. Add at minimum:
    • Assert celery_app.conf.broker_url is redis://... after importing main (catches future set_default() ordering regressions).
    • Assert loyalty.send_notification_email is in celery_app.tasks after importing app.modules.loyalty.tasks (catches future missing imports in task package __init__.py).
    • Assert configure_mappers() succeeds after importing app.core.celery_config (catches future missing-models regressions in celery).
    • Either assert task_base.on_failure doesn't crash on a synthetic failure, or standardize an extra= sanitiser that strips reserved LogRecord attribute names.
  5. Fix prospecting/tasks/__init__.py — add the missing import.
  6. Audit every other module's email path — are billing's trial-expiration emails really dispatched via Celery? Messaging's password-reset emails? If yes, same silent-failure risk exists until a real send hits prod. Add an integration test that triggers a representative email from every module and asserts an email_logs row appears within N seconds.

2026-05-18 update — Test 1 round 2 cleanup + admin polish

Both follow-up #1 items from yesterday shipped + verified by user. Plus a bonus polish on the admin program form, surfaced when the user spotted a non-clickable "Conditions Générales" link on the storefront and asked why.

Test 1 nits resolved

| Bug | Diagnosis | Fix | Commit |
|---|---|---|---|
| Birthday picker shows mm/dd/yyyy on FR even though <html lang="fr"> IS in the page source | Firefox-specific: it ignores the lang attribute for <input type="date"> (Mozilla bug #1344625, open since 2017). Chrome/Safari/Edge respect it. | Swap to flatpickr on both loyalty/storefront/enroll.html and loyalty/store/enroll.html. Configured dateFormat: 'Y-m-d' (ISO to API) + altInput: true + altFormat: 'd/m/Y' (dd/mm/yyyy visible) + maxDate: 'today' + locale: '{{ current_language }}' for month/day name translations. Loaded via extra_head + extra_scripts blocks. | ab3e133a |
| "Continuer mes achats" CTA on enrollment success page makes no sense for a loyalty-only storefront with no catalog | Same destination is fine ({{ base_url }}), only the label needed to change. | Renamed i18n key continue_shopping → back_to_home in all four locales: EN "Back to Home" / FR "Retour à l'accueil" / DE "Zurück zur Startseite" / LB "Zréck op d'Haaptsäit". | 236fee01 |

Admin program-form polish (bonus)

User noticed the storefront's "Conditions Générales" link wasn't clickable. Root cause: the program had both terms_text and terms_cms_page_slug empty. The storefront then renders a plain <span> instead of an <a> (enroll.html:122-124) — intentional, so the link doesn't open an empty modal, but easy to accidentally publish a program in this broken state.

Two changes to loyalty/shared/program-form.html:

  1. Yellow warning banner inside the Terms section, visible only when both fields are empty. Tells the admin exactly what the storefront will look like and what to fix.
  2. Save button disabled until at least one of the two terms fields is filled. The button gets a localised :title tooltip explaining why it's disabled, with disabled:cursor-not-allowed so the disabled state is obvious on hover.

Three new i18n keys (terms_required_warning, terms_text_hint, terms_required_tooltip) added in en/fr/de/lb. Shipped as 5f288502.

Test 1 status

All 6 originally-reported bugs (B1-A through B1-F) plus the 2 follow-up nits are now resolved end-to-end on prod. Test 1 is fully done.

Remaining follow-ups (carry over to next session)

From the 2026-05-17 list, items #2 (Test 2), #3 (Hetzner doc check), #4 (unit tests for the B1-F chain), #5 (prospecting tasks/__init__.py missing import), and #6 (other-module email audit) are still queued. Item #1 (Test 1 nits) is now closed by this session's commits.

2026-05-19 update — Test 2 complete

Cross-store re-enrollment at FASHIONOUTLET (same email as Test 1's successful +17mayf@gmail.com) walked cleanly. Behavioral checks all green, one copy bug fixed.

Test 2 verification

| Check | Result |
|---|---|
| Exactly 1 loyalty_cards row for customer / merchant (no duplicate) | Pass |
| Zero new email_logs rows (no duplicate welcome email) | Pass |
| Cross-location locations block lists both Fashion Hub + Fashion Outlet | Pass |
| Title renders "Vous êtes déjà membre !" (already conditional from prior fix) | Pass |

Bug fixed: contradicting subtitle on success page (dee2eab2)

The title at enroll-success.html:21 was already x-text-conditional on enrollContext.already_enrolled — so it correctly switched between "Welcome!" and "You're already a member!". But the subtitle just below at line 24 was a static {{ _('success.message') }}, always rendering "Vous êtes maintenant membre..." even on the already-enrolled branch. Two contradicting messages stacked.

Added an already_enrolled_message i18n key in en/fr/de/lb and made the subtitle conditional the same way as the title:

  • EN: "Welcome back — your card is ready whenever you are."
  • FR: "Heureux de vous revoir — votre carte est prête à l'emploi."
  • DE: "Willkommen zurück — Ihre Karte ist einsatzbereit."
  • LB: "Wëllkomm zréck — Är Kaart ass prett wann Dir et sidd."

Status board delta

  • Step 6 (web user-journey E2E tests): Tests 1 and 2 done. Tests 3-8 still ahead.

Carry over for next session

  • Test 3 — Staff stamps/points at the terminal (/store/FASHIONHUB/loyalty/terminal)
  • Items #3 (Hetzner doc check), #4 (unit tests for the B1-F chain), #5 (prospecting tasks/__init__.py missing import), #6 (other-module email audit) still queued from the 2026-05-17 follow-up list.

2026-05-23 update — Test 3 done + cooldown bug + routing investigation

Test 3 (staff stamps/points at terminal) — all 6 sub-steps verified

Lookup by card-number AND by email both work; phone + birthday show correctly on the card detail (B1-D regression check passed); points earning credits; cooldown rejection fires (after the fix below).

Cooldown bug fixed (93ab072f)

stamp_service.add_stamp properly checks cooldown before crediting. The parallel points_service.earn_points wrote card.last_points_at but never read it — so the program's cooldown_minutes was silently ignored for points-based programs. Mirrored the stamp check in points_service after the row lock; added PointsCooldownException with error_code POINTS_COOLDOWN.
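
A minimal sketch of the check that was missing, assuming a simplified in-memory card. The Card dataclass, the earn_points signature, and reporting the remaining minutes in details are illustrative assumptions; only PointsCooldownException, POINTS_COOLDOWN, cooldown_minutes and last_points_at come from the description above.

```python
from __future__ import annotations

import math
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


class PointsCooldownException(Exception):
    error_code = "POINTS_COOLDOWN"

    def __init__(self, cooldown_minutes: int):
        super().__init__(f"Points cooldown active, retry in {cooldown_minutes} min")
        # Assumption: details carries the minutes still to wait.
        self.details = {"cooldown_minutes": cooldown_minutes}


@dataclass
class Card:  # hypothetical stand-in for the loyalty card row
    points_balance: int = 0
    last_points_at: datetime | None = None


def earn_points(card: Card, points: int, cooldown_minutes: int, now: datetime) -> None:
    """Credit points, but read last_points_at first (the step the bug skipped)."""
    if card.last_points_at is not None and cooldown_minutes > 0:
        ready_at = card.last_points_at + timedelta(minutes=cooldown_minutes)
        if now < ready_at:
            remaining = math.ceil((ready_at - now).total_seconds() / 60)
            raise PointsCooldownException(remaining)
    card.points_balance += points
    card.last_points_at = now


demo_card = Card()
start = datetime(2026, 5, 23, 12, 0, tzinfo=timezone.utc)
earn_points(demo_card, 10, cooldown_minutes=15, now=start)
try:
    earn_points(demo_card, 10, cooldown_minutes=15, now=start + timedelta(minutes=5))
except PointsCooldownException as exc:
    print(exc.error_code, exc.details)
```

In the real service the check runs after the row lock, so two concurrent terminal requests cannot both pass it.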

Cooldown toast localised (aa8ca594)

After the cooldown fix shipped, the FR-locale toast still showed the raw English from the backend. Three small changes:

  • static/shared/js/api-client.js — propagate error.details (alongside errorCode) so callers can render localised toasts.
  • loyalty-terminal.js:277 — in the transaction-dispatch catch, branch on errorCode === 'POINTS_COOLDOWN' | 'STAMP_COOLDOWN' and render loyalty.store.terminal.cooldown_wait_minutes with {minutes} from error.details.cooldown_minutes; toast type switches to warning since the rejection is soft.
  • New cooldown_wait_minutes key in en/fr/de/lb under loyalty.store.terminal.*.

Routing investigation — 4 distinct bugs in path/host handling (not yet fixed)

User hit a 404 on https://fashionhub.rewardflow.lu/platforms/loyalty/store/fashionhub/dashboard after login, then noticed several other oddities. Diagnostics found four distinct routing-implementation bugs, all from the same architectural drift (path-based dev → subdomain/custom-domain prod):

  1. Mount 1 store-resolution broken on subdomain: /store/login returns "Failed to load store information" even though the route is mounted at main.py:449-458 and the host should resolve the store via middleware. Workaround: use Mount 2 /store/{STORE_CODE}/login.
  2. Server-side post-login redirect leaks dev prefix: app/modules/tenancy/routes/pages/store.py:86 builds /platforms/{platform_code}/store/{store_code}/dashboard on a subdomain hit (should pick the :88 branch). Same pattern as B1-B but for redirects.
  3. JS post-login redirect uses wrong heuristic: app/modules/tenancy/static/store/js/login.js:155-158 treats "platform_code is set" as "we're in path-based mode" and always prepends /platforms/{code}/. Should check window.location.pathname.startsWith('/platforms/') instead.
  4. Sidebar URL builder uses code-bearing form on subdomain: works (Mount 2 also matches) but is inconsistent with the canonical platform-debug pattern and adds visible cruft to URLs.
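
The heuristic change in item 3 can be sketched as follows. The real code is JavaScript in login.js; this is a Python rendering of the decision only, and the /store/{store_code}/dashboard form used on subdomains is an assumption based on Mount 2, not a confirmed canonical URL.

```python
def buggy_dashboard_redirect(current_path: str, platform_code: str, store_code: str) -> str:
    """Shape of the bug: a non-empty platform_code is taken to mean path-based mode."""
    prefix = f"/platforms/{platform_code}" if platform_code else ""
    return f"{prefix}/store/{store_code}/dashboard"


def dashboard_redirect(current_path: str, platform_code: str, store_code: str) -> str:
    """Suggested heuristic: only keep the dev prefix if the current URL already has it."""
    path_based = current_path.startswith("/platforms/")
    prefix = f"/platforms/{platform_code}" if path_based else ""
    return f"{prefix}/store/{store_code}/dashboard"


# Subdomain hit: the login page path carries no /platforms/ prefix.
print(buggy_dashboard_redirect("/store/fashionhub/login", "loyalty", "fashionhub"))
print(dashboard_redirect("/store/fashionhub/login", "loyalty", "fashionhub"))
```

The first call reproduces the /platforms/loyalty/store/fashionhub/dashboard URL that returned 404 on the subdomain; the second keeps the redirect on the same URL scheme the user arrived with.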

Why didn't tests catch this?

Ran the full middleware suite (185 tests, 29s, all green). Confirmed thorough coverage of inbound resolution (host → platform/store, request.state population). Zero coverage on outbound URL construction — no test asserts post-login Location header, sidebar URLs, or Mount 1 actually serving on subdomain. The bugs exist precisely because nothing red-flags them.

Platform-debug enhancement scoped (not implemented)

User suggested enhancing /admin/platform-debug to test redirects. My scope: add a 5th panel called "Redirect Trace" alongside Platform Trace, Domain Health, Permissions Audit, Tenant Isolation. Auto-runs the 12-row (host × URL-pattern) matrix the page already enumerates, simulates each via httpx.AsyncClient(transport=ASGITransport(app=app)), asserts the redirect Location vs the expected canonical. The same backing endpoint becomes the harness for tests/integration/test_redirect_trace.py so the 4 routing bugs would surface in red.

Status board delta

  • Step 6 (web user-journey E2E tests): Tests 1, 2, 3 done. Tests 4-8 ahead.

Carry over for next session

  • Test 4 — Cross-store redemption at FASHIONOUTLET with the card from Tests 1-3
  • Routing pass (after Test 8 finishes so we don't churn mid-walkthrough): fix the 4 routing bugs in one focused commit, add the RedirectTrace admin tool + the corresponding integration test, update hetzner doc + user-journeys doc Case 3 to match the canonical platform-debug pattern.
  • Existing follow-ups still queued: Hetzner doc check, B1-F unit tests, prospecting tasks/__init__.py missing import, other-module email audit.

2026-05-24 update — Test 4 done + storefront auth body-schema fix

Test 4 (cross-store redemption) — verified

Card #5 has its full earning history at FASHIONHUB (store_id=4): welcome bonus 50 + three points_earned totalling 218 = 268 total earned. Today's points_redeemed -100 @ store_id=5 (FASHIONOUTLET) succeeded cleanly, producing the mixed-store transaction history the cross-location flow is supposed to deliver. Balance = 168 pts.

Storefront forgot/reset password endpoints now accept JSON body (478c3a9c)

Both POST /api/v1/storefront/auth/forgot-password and .../reset-password were declared with bare email: str / reset_token: str, new_password: str parameters. FastAPI treats unannotated str params as query parameters, so the storefront's JSON request body was ignored and the endpoint 422'd with {"loc":["query","email"],"msg":"Field required"}. The endpoint docstrings even said "Request Body: email" — intent was clear, the implementation drifted.

Added two body schemas in app/modules/tenancy/schemas/auth.py (PasswordResetRequest, PasswordResetConfirm), re-exported via __init__.py, and switched both endpoint signatures to body: <Schema>.

Surfaced when the user tried to test Test 5 (customer storefront login) and needed to set a password on the customer that self-enrolled with just email + name + birthday.

Skill created: /loyalty-wrap (d03b96da)

Mechanises the end-of-day routine that's been manual every session. Lives at .claude/skills/loyalty-wrap/SKILL.md. Triggers on phrases like "call it a night", "save memory and docs", "wrap up", etc. Skills load at session start, so the first session where the user can actually invoke it as /loyalty-wrap is the next one after the one that committed it.

Status board delta

  • Step 6 (web user-journey E2E tests): Tests 1, 2, 3, 4 done. Test 5 in progress (blocked tonight on the password-reset flow; now unblocked by the 478c3a9c fix, verification pending next session).

Carry over for next session

  1. Test 5 — password-reset end-to-end (new top priority): with the 478c3a9c fix deployed, retry the forgot-password flow → confirm an email_logs row appears with template_code='password_reset', status='sent' → click the link in the email → set a password → login → continue from step 5.3 (visit /account/loyalty dashboard + history).
  2. Transaction categories — permissions audit (new item raised by user): today only admin can create transaction categories. Merchants and store owners should be able to. Investigate the existing endpoint in app/modules/loyalty/services/category_service.py + app/modules/loyalty/routes/api/admin.py, decide the right scope (merchant-level? store-level?), wire up the merchant + store UIs, add the appropriate RBAC permissions.
  3. Routing pass still queued (after Test 8): fix the 4 routing bugs + Redirect Trace admin tool + integration tests + doc updates.
  4. Existing follow-ups: Hetzner doc check, B1-F unit tests, prospecting tasks/__init__.py missing import, other-module email audit.

2026-05-29 update — Test 5.0 storefront i18n sweep + FR/DE email accents

Test 5.0 (forgot-password) surfaced 5 distinct issues, all fixed

The user retried Test 5 with the 478c3a9c JSON-body fix in place. The forgot-password POST succeeded, but five downstream issues showed up in one walk-through. Triaged analysis-first per user request before batching the fix, then shipped as four commits.

| # | Where | Fix |
|---|---|---|
| 1 | API forgot-password handler | Read request.state.language first; fall back to customer.preferred_language (which is now backfilled at loyalty self-enrollment for both new and returning customers so future emails respect storefront locale). |
| 2A | customers/storefront/reset-password.html icons | Replaced $icon('x-mark' / 'spinner' / 'check') with inline SVGs (matches forgot-password.html convention; standalone templates don't load icons.js). |
| 2B | Same template, full i18n | Added lang attribute, swapped every hardcoded string for _() (22 new auth.* keys × 4 locales), added language selector, threaded JS validation strings via window.__resetPasswordI18n. |
| 3 | login + forgot + reset CTAs | Renamed auth.continue_shopping → auth.back_to_home (loyalty storefronts have no catalog). 4-locale rename. |
| 5 | /account/dashboard, /profile, /addresses body | i18n sweep across all three customer-area templates (~80 new customers.storefront.pages.{dashboard,profile,addresses}.* keys × 4 locales). |

Issue 4 (login looked "strangely in FR" after the broken reset page) was NOT a bug — login.html was correctly translated all along; the contrast with the broken reset page just made it look weird.

FR + DE password_reset email body restored to native orthography

The seeded FR template body had every accent stripped (Envoye, recu, demande, equipe, Reinitialiser, etc.). Same pattern in DE (zurucksetzen, Schaltflache, lauft, konnen, Grussen).

Both templates now read natively. As a bonus, signatures on all 4 locales were changed from generic ("L'équipe" / "The Team" / "Das Team" / "D'Team") to {{ store_name }}-interpolated ("L'équipe Fashion Hub" / "The Fashion Hub Team" / etc.) using the auto-injected store_name branding variable from EmailService.get_branding.

The seeder is idempotent (upserts on (code, language)), so re-running scripts/seed/seed_email_templates_core.py updates existing rows in place — no DB wipe needed.

Alpine quoting bug surfaced and fixed downstream

The customer-dashboard unread-message line crashed Alpine with "expected expression, got '}'" because the original sweep emitted {{ _('...')|tojson }} directly inside x-text="..." — the JSON's double quotes broke out of the double-quoted HTML attribute. Fix moved the singular/plural strings onto window.__accountDashboardI18n and referenced them by global path from x-text. The nested x-data="{ unreadCount: 0 }" scope can't see the parent component's i18n property, but window.* is always reachable.

The other auth templates using |tojson (language-selector blocks) escape it via single-quoted outer attributes (x-data='...'), so the collision was unique to the new dashboard code.
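
The quoting collision can be reproduced outside the templates. The sketch below uses Python's json and html.parser modules as stand-ins for the tojson filter and the browser's attribute parsing; the sample string is invented, and the escaped variant is shown only as a contrast, since the fix that actually shipped moved the strings onto window.__accountDashboardI18n.

```python
import html
import json
from html.parser import HTMLParser


class XTextGrabber(HTMLParser):
    """Record the parsed value of the first x-text attribute seen."""

    def __init__(self):
        super().__init__()
        self.x_text = None

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "x-text" and self.x_text is None:
                self.x_text = value


def parsed_x_text(markup: str):
    grabber = XTextGrabber()
    grabber.feed(markup)
    grabber.close()
    return grabber.x_text


# A JSON string literal is wrapped in double quotes, like the filter output.
expression = json.dumps("1 nouveau message")

# Dropped raw into a double-quoted attribute, the first inner quote ends the attribute.
broken = f'<span x-text="{expression}"></span>'
# Escaping the quotes as &quot; keeps the attribute value intact.
escaped = f'<span x-text="{html.escape(expression, quote=True)}"></span>'

print("intended expression survives raw:", parsed_x_text(broken) == expression)
print("intended expression survives escaped:", parsed_x_text(escaped) == expression)
```

Alpine then evaluates whatever fragment the attribute parser left behind, which is consistent with the "expected expression" error reported on the dashboard.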

Seed-script path bug surfaced during the prod reseed

scripts/seed/seed_email_templates_core.py had Path(__file__).parent.parent which resolves to scripts/, not the project root, so from app.core.database import get_db failed with ModuleNotFoundError: No module named 'app'. The loyalty sibling had parent.parent.parent already (correct). Fixed to match. The canonical deploy command in docs/deployment/hetzner-server-setup.md:549 sets PYTHONPATH=/app and would have masked the bug anyway, but defence in depth is cheap.
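
The off-by-one is easy to see with pathlib alone. This sketch uses a pure path, so nothing needs to exist on disk; the /app root is taken from the PYTHONPATH=/app mentioned above and is otherwise illustrative.

```python
from pathlib import PurePosixPath

script = PurePosixPath("/app/scripts/seed/seed_email_templates_core.py")

two_up = script.parent.parent            # stops at the scripts/ directory
three_up = script.parent.parent.parent   # reaches the project root

print(two_up)
print(three_up)
# parents[2] is an equivalent spelling of three .parent hops.
print(script.parents[2] == three_up)
```

Only the three-level form puts the directory that contains the app package on sys.path, which is what from app.core.database import get_db needs.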

Status board delta

  • Step 6 (web user-journey E2E tests): Tests 1, 2, 3, 4 and 5.0 done (forgot-password end-to-end on FR, including email reception and accent correctness). Test 5 itself (login + dashboard + history) is the next concrete step, gated on recreating the prod api container to serve the i18n-swept HTML.

Carry over for next session

  1. Recreate the prod api container first thing tomorrow: docker compose --profile full up -d --force-recreate api. The Alpine fix (1bade6e6) is in the image built today but the long-running container is still on the old image, so the dashboard still throws the x-text error end-of-day. Verify the dashboard renders cleanly after recreate.
  2. Continue Test 5 from step 5.1 (login as customer) → 5.2 (/account/loyalty dashboard, expect 168 pts) → 5.3 (/account/loyalty/history, expect cross-store transaction list).
  3. Static asset cache-busting gaps (new item raised by user): the ?v=<commit-sha> system from the 2026-05-18 cache-busting work is in place, but some JS/CSS still load without the ?v= query param. Audit which files miss it (likely standalone templates that bypass the static_v() / url_for helpers). The FE-024 arch rule was supposed to guard this — check whether it's firing on these gaps.
  4. DE/LB email template quality sweep — other DE templates likely have the same missing-umlaut pattern as password_reset (signup_welcome, order_confirmation, team_invite, etc.; ~11 codes × 4 locales). LB has inconsistent accents too. Worth a single pass with a native-speaker review.
  5. Transaction categories permissions audit (carried from 2026-05-24).
  6. Routing pass (carried — after Test 8).
  7. Existing backlog (carried): Hetzner doc check, B1-F unit tests, prospecting tasks/__init__.py missing import, other-module email audit.

2026-05-30 update — Test 5 widget i18n + cache-bust sweep + 401 storefront redirect + critical prod-readiness findings

Test 5 — customer dashboard surfaced 2 i18n defects

After Test 5.1 (customer login) succeeded, /account/dashboard showed two issues on FR locale: the Loyalty Rewards card was hardcoded English ("Loyalty Rewards" / "View your points & rewards" / "Points Balance") and the Account Summary section had a raw customers.customer_number key.

Root cause for the card: StorefrontDashboardCard is populated by widget providers (loyalty, orders), and the widget contract had no language threading. Root cause for the raw key: the customers-module locale JSON has a redundant top-level "customers" wrapper, so the real resolvable path is customers.customers.customer_number (the same double-prefix pattern as loyalty.loyalty.wallet.apple).

Fix in 5f359283: added language field to WidgetContext, customer dashboard route passes request.state.language, loyalty and orders widget providers translate server-side via the new widget.* namespace in their locale files (4 locales each). Fixed the 8 single-prefix references to use the actual double-prefix path.
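
The double-prefix behaviour can be illustrated with a small dotted-key resolver. This is a sketch under two assumptions: that each module's locale file is mounted under its module name, and that a missing key falls back to the raw key string. The resolve function and the French label are invented for illustration.

```python
def resolve(catalog: dict, dotted_key: str) -> str:
    """Walk a nested dict by dotted key; return the raw key if any hop is missing."""
    node = catalog
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return dotted_key
        node = node[part]
    return node if isinstance(node, str) else dotted_key


# The locale file itself starts with a redundant top-level "customers" wrapper...
customers_locale_file = {"customers": {"customer_number": "Numéro client"}}
# ...and is then mounted under the module name, which adds a second "customers".
catalog = {"customers": customers_locale_file}

print(resolve(catalog, "customers.customer_number"))
print(resolve(catalog, "customers.customers.customer_number"))
```

The first lookup falls through to the raw key, which matches what the Account Summary section displayed before the 8 references were corrected.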

Cache-busting audit — FE-024 had two real gaps

User flagged that ?v=<commit-sha> was missing from many assets. Audit traced it to two problems in the FE-024 architecture rule:

  1. The anti-pattern only matched url_for('<module>_static', ...) mount names — missed the bare 'static' mount which is what every persona base.html uses for shared JS / CSS / Tailwind output.
  2. base.html files were in the rule's exception list — exactly the files where most shared includes live.

Fix in 3ce94683: swept 5 persona base.html files + 15 standalone templates (login, register, forgot/reset password, error pages, onboarding, invitation-accept, admin module-info/config, etc.) — 53 references for .js/.css files converted from raw url_for('static', ...) to static_v(request, 'static', ...). Then tightened the FE-024 rule to add an anti-pattern for the bare 'static' mount and dropped base.html from the exception list (kept partials/). Validator baseline unchanged at 126 warnings, 0 FE-024 hits.
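
The first gap is a pattern-coverage problem, which a pair of regular expressions can illustrate. The two patterns below are hypothetical reconstructions, not the actual FE-024 rule definitions, and the template lines are made-up samples.

```python
import re

# Hypothetical original anti-pattern: only module mounts such as 'loyalty_static'.
MODULE_STATIC = re.compile(r"""url_for\(\s*['"][a-z_]+_static['"]""")
# Hypothetical added anti-pattern: the bare 'static' mount.
BARE_STATIC = re.compile(r"""url_for\(\s*['"]static['"]""")


def flagged(line: str, patterns) -> bool:
    """Return True if any anti-pattern matches the template line."""
    return any(pattern.search(line) for pattern in patterns)


bare = "<script src=\"{{ url_for('static', path='shared/js/api-client.js') }}\"></script>"
module = "<script src=\"{{ url_for('loyalty_static', path='js/loyalty-dashboard.js') }}\"></script>"
fixed = "<script src=\"{{ static_v(request, 'static', path='shared/js/api-client.js') }}\"></script>"

old_rule = [MODULE_STATIC]
new_rule = [MODULE_STATIC, BARE_STATIC]

for label, line in (("bare", bare), ("module", module), ("fixed", fixed)):
    print(label, flagged(line, old_rule), flagged(line, new_rule))
```

Under the old rule the bare mount passes unflagged, which is how shared includes in base.html could ship without ?v= while the validator stayed quiet.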

401 → /account/login redirect on customer storefront

User saw the loyalty dashboard render the "Rejoignez notre programme" CTA even though they were enrolled. Diagnosis: the page route accepts the customer cookie; JS then calls /api/v1/storefront/loyalty/card which requires the Bearer token from localStorage.customer_token. The stored token was stale, server returned 401, JS swallowed it, the template's x-show="!loading && !card" branch fired with the join CTA.

Fix in a0ae6388: added redirectIfCustomerAreaUnauthorized() helper to apiClient. On a /account/* page (and not on /account/login) it sets window.location.href = '/account/login?next=<encoded-path>'. Called from all three apiClient 401 handlers (request, requestFormData, getBlob). Customer login now honours ?next= (alongside the legacy ?return=). Also fixed getToken() and clearTokens() path detection to recognise /account/* and /api/v1/storefront/* (was hardcoded to /shop/* from before the migration to /storefront). Customer JWT TTL is 30 minutes (JWT_EXPIRE_MINUTES env var, middleware/auth.py:75).
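
The decision inside redirectIfCustomerAreaUnauthorized() can be sketched as a pure function. The real helper is JavaScript in apiClient; this Python version only mirrors the rule described above, and both the function name and the choice to encode the path alone (without query string) are assumptions.

```python
from typing import Optional
from urllib.parse import quote


def customer_area_redirect(path: str) -> Optional[str]:
    """Return the login URL to navigate to after a 401, or None to leave the page alone."""
    if not path.startswith("/account/"):
        return None  # not a customer-area page
    if path.startswith("/account/login"):
        return None  # already on the login page, avoid a redirect loop
    # safe="" also encodes "/", similar to encodeURIComponent in the browser.
    return "/account/login?next=" + quote(path, safe="")


print(customer_area_redirect("/account/loyalty"))
print(customer_area_redirect("/account/login"))
print(customer_area_redirect("/loyalty/join"))
```

Returning None for non-account pages keeps public storefront pages such as the enrollment form from bouncing anonymous visitors to the login screen.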

Followed up with 856db328 — removed the dead /shop/ predicates entirely. Pure dead-code cleanup, no behaviour change.

Loyalty redirect flicker — two-stage fix

User repro'd by deleting localStorage.customer_token and F5'ing /account/loyalty, and saw the "Rejoignez..." CTA flash for ~half a second before the redirect landed. Stage 1 (b04b36a2): flipped the initial state from loading: false → loading: true in loyalty-dashboard.js and loyalty-history.js so the template's x-show="loading" spinner covers the in-flight window. NOT enough on its own: the API throw triggered the caller's .finally(() => loading = false) before the browser actually navigated, so Alpine re-rendered with the wrong state mid-redirect. Stage 2 (6564f138): in all three apiClient 401 handlers, return a never-resolving new Promise(() => {}) instead of throwing when the redirect helper returns true. Caller's await never returns, .finally never fires, spinner stays up until navigation.

Login JS i18n sweep

bbb481aa translated the "Welcome back to your shopping experience" branding subtitle on /account/login. c9fe7171 translated the three remaining hardcoded Alpine toasts in the same template: post-registration banner, post-login success toast, login-failure fallback. Two new auth.* keys × 4 locales; the third reuses the existing auth.invalid_credentials.

.build-info stale → new scripts/deploy-api-only.sh

User repeatedly redeployed and refreshed but every redirect repro still flickered. Eventually noticed in the browser console: loadCard https://.../js/loyalty-dashboard.js?v=acbe2eff:50 — the ?v= was yesterday's commit hash. Browser was serving cached pre-fix JS because the cache-bust query never bumped.

Root cause: ?v= is computed by templates_config._asset_version() from app/core/build_info.py, which reads .build-info. That file is bind-mounted from the host and is only written by scripts/deploy.sh (lines 42-45). The manual git pull && docker compose up --build api sequence everyone had been using never touched it, so ?v= stayed pinned at the last deploy.sh run's commit, even though every intervening rebuild was correctly putting new code into the image. Five hours of "is this even deployed?" debugging chased to root.
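The failure mode above is easier to see in a small sketch. This is not the project's `_asset_version()` implementation: the function names, the `dev` fallback, and the assumption that the first line of `.build-info` holds the commit hash are all hypothetical. It only illustrates why a stale file pins `?v=` regardless of what code is in the image.

```python
from pathlib import Path
from urllib.parse import urlencode


def read_asset_version(build_info: Path, fallback: str = "dev") -> str:
    """Return a short version token from a build-info file.

    Assumption: the first line of the file is the deployed commit hash.
    """
    try:
        text = build_info.read_text(encoding="utf-8").strip()
    except OSError:
        # Missing or unreadable file: fall back instead of failing the render.
        return fallback
    first_line = text.splitlines()[0].strip() if text else ""
    return first_line[:8] or fallback


def versioned_url(path: str, version: str) -> str:
    """Append a cache-busting ?v= query parameter to a static asset path."""
    separator = "&" if "?" in path else "?"
    return f"{path}{separator}{urlencode({'v': version})}"
```

If the file is never rewritten, `read_asset_version` keeps returning the old hash, so the browser keeps requesting (and serving from cache) the same URL even after a rebuild.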

deploy.sh itself wasn't a substitute because it's a CI/CD script — stashes the working tree, runs alembic, restarts every service in the full profile (db, redis, api, celery-worker, celery-beat, flower), 60s health budget. Heavy and disruptive for an api-only hotfix; the narrower manual pattern is correct, it was just missing the .build-info write.

Built scripts/deploy-api-only.sh (c13e8e29) to fill the gap: refuses if working tree is dirty, git pull --ff-only, writes .build-info, docker compose -f docker-compose.yml --profile full up -d --build api (api only — db/redis/celery untouched), tight 30s health budget. Hetzner doc §16.5 split into 16.5a (code-only fix, default to the new script) and 16.5b (full deploy.sh fallback for migrations / Dockerfile / requirements changes).
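The dirty-tree gate is the part of the script that surfaced the alertmanager problem, so a minimal sketch of that check may help. The real script is shell; this Python version is illustrative only, the function names are hypothetical, and whether the real gate also counts untracked files is an assumption here.

```python
import subprocess


def dirty_paths(porcelain: str) -> list[str]:
    """Extract paths from `git status --porcelain` (v1) output.

    Each line is two status characters, a space, then the path.
    """
    paths = []
    for line in porcelain.splitlines():
        if not line.strip():
            continue
        paths.append(line[3:])
    return paths


def working_tree_is_clean(repo_dir: str = ".") -> bool:
    """Return True when git reports no modified or untracked paths."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return not dirty_paths(result.stdout)
```

A deploy wrapper would refuse to continue when `working_tree_is_clean()` is false and print the offending paths, which is how a locally edited tracked file shows up before anything is pulled or rebuilt.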

🔴 Critical prod-readiness findings — SG credential in git + alertmanager misconfigured post-SMTP-migration

The new dirty-tree gate blocked the deploy because monitoring/alertmanager/alertmanager.yml has local modifications on prod. Diff inspection:

-  smtp_auth_password: ''                # TODO: Paste your SG.xxx API key here
+  smtp_auth_password: 'SG.xxxxxxxxx'    # TODO: Paste your SG.xxx API key here

Three production-readiness problems surfaced in one finding:

  1. A SendGrid API key is pasted into a tracked git file on prod, and the in-repo template literally says "Paste your SG.xxx API key here" next to the empty value — actively encouraging the anti-pattern.
  2. The alertmanager container has been Up 13 days, started before the credential was pasted (mtime 2026-05-29 01:09 UTC). So the running alertmanager process is still using the old empty smtp_auth_password from the file at container-start time. Any alert that needs to send email today silently fails — alerting has been broken for at least 13 days, probably longer.
  3. The SMTP migration earlier this year never touched alertmanager.yml. That migration only updated the app's notification settings in the email_settings DB table; alertmanager reads its own config from disk and was never updated. So even with a properly-loaded credential, the config still points at SendGrid instead of mail1.myservices.hosting.

User decided to defer today's loyalty deploy and tackle the alertmanager work as the first thing tomorrow — production-readiness gate ranks over incremental Test 5 progress, and fixing the root cause (credential out of git + correct SMTP smarthost + alertmanager reload) means the deploy will run clean without --skip-worktree gymnastics.

Status board delta

  • Step 6 (web user-journey E2E tests): Tests 1, 2, 3, 4 and 5.0 done; 5.1 in progress (login + dashboard work, blocked on prod deploy of today's fixes which are queued on gitea/master but not yet served because of the unrelated alertmanager dirty-tree blocker).
  • New step surfaced — alerting infrastructure is silently broken in production (13+ days). Should be tracked as a go-live blocker; prod is currently flying blind on alerting.

Carry over for next session

User explicitly chose tomorrow's order: prod-readiness items 1+2 BEFORE continuing Test 5.

  1. Trace the SG credential paste origin — user claims sole-developer status but doesn't remember pasting. Grep shell history, check file ownership, find when the credential was introduced. Understand the path so it doesn't happen again.
  2. Update alertmanager.yml for the SendGrid → SMTP migration that never landed: smtp_smarthost: 'mail1.myservices.hosting:587', smtp_auth_username: 'support@wizard.lu', the SMTP password from /admin/settings. Then SIGHUP alertmanager to hot-reload (docker compose -f docker-compose.yml --profile full kill -s SIGHUP alertmanager). Verify with a synthetic alert that email delivery actually works.
  3. Move credential out of git: git rm --cached monitoring/alertmanager/alertmanager.yml, add to .gitignore, ship monitoring/alertmanager/alertmanager.yml.example as the template (with empty placeholder + comment pointing at the deploy doc for the real values). Closes the recurrence path.
  4. Deploy today's queued loyalty fixes — with alertmanager.yml gitignored, the working tree on prod is clean and bash scripts/deploy-api-only.sh should run without the --skip-worktree dance. Then verify ?v=c13e8e29 (or later) on rendered assets.
  5. Re-run the loyalty redirect repro to confirm the flicker is gone now that today's JS actually reaches the browser.
  6. Continue Test 5 from 5.1 → 5.2 (/account/loyalty, 168 pts) → 5.3 (/account/loyalty/history).
  7. Standing backlog (lower priority): DE/LB email template quality sweep, transaction categories permissions audit, routing pass, Hetzner doc check, B1-F unit tests, prospecting/tasks/__init__.py, other-module email audit.

2026-05-30 update (afternoon) — production-readiness items 1-3 resolved + alerting back online after 13+ days

Picked up the carry-over list from this morning's wrap and ran it end-to-end. All four blockers are now closed and the loyalty queue landed on prod for the first time today.

SG-credential forensics + alertmanager.yml untracked

The credential came from user-driven nano edits in past sessions (bash history lines 290-357 confirmed it). No rogue actor, just forgotten exploratory work that was never committed. Resolution shipped as e44f5c04: git rm --cached monitoring/alertmanager/alertmanager.yml, .gitignore entry added, and monitoring/alertmanager/alertmanager.yml.example ships in repo with the post-migration routing pre-filled (mail1.myservices.hosting:465, support@wizard.lu auth, alerts@wizard.lu From, only smtp_auth_password: 'CHANGEME' left for prod-side fill-in).

Per-host migration on prod + loyalty queue deployed

On the Hetzner box: backup → git checkout → pull e44f5c04 → copy template over the old file → fill in the SMTP password → SIGHUP alertmanager. Then bash scripts/deploy-api-only.sh for the first time ran cleanly (working tree no longer dirty), pulling the 9 queued loyalty commits from this morning. Verified ?v=e44f5c04 on all rendered assets, ran the loyalty redirect repro (localStorage delete + F5 on /account/loyalty): spinner straight through to login redirect, no "Rejoignez..." CTA flash. Stage 1 + Stage 2 of the flicker fix work as designed once the browser actually sees the new JS.

The alertmanager email delivery rabbit hole — and the answer

After the SMTP migration, alertmanager still couldn't send. The error log was identical to before the migration: *smtp.plainAuth auth: 535 Authentication failed: The provided authorization grant is invalid, expired, or revoked — verbatim OAuth 2.0 RFC 6749 §5.2 invalid_grant text. Multi-hour diagnosis chased through every plausible layer:

  1. Provider's port 587 AUTH PLAIN backend is OAuth-wired (returns the OAuth-flavored 535 with a regular password). AUTH LOGIN on the same port accepts the same credential cleanly. swaks proved this.
  2. alertmanager uses Go stdlib smtp.PlainAuth, which prefers PLAIN whenever the server advertises it. No config knob to force LOGIN. smtp_auth_identity tweak had no effect.
  3. Provider's docs name port 465 SSL/TLS as the official submission endpoint — not 587. Switched to 465 → connection timed out from prod AND from user's home laptop.
  4. VM-side sweep (UFW outbound = allow, iptables OUTPUT = ACCEPT, nftables empty, DOCKER-USER empty) cleared the local firewall as a cause. Block had to be upstream.
  5. Found Hetzner's documented anti-spam policy: "Outgoing traffic to ports 25 and 465 are blocked by default on all Cloud Servers. Send us a request to unblock these ports." The block is at the cloud network layer, completely invisible from the VM.
  6. Filed Hetzner unblock ticket via Cloud Console. Auto-approved within minutes — Hetzner has tooling for legitimate SMTP unblock requests.
  7. Post-unblock: nc -4 -zv mail1.myservices.hosting 465 succeeds. swaks AUTH PLAIN on 465 succeeds (235 Authentication successful + 250 2.0.0 Ok: queued). One-line alertmanager change from :587 to :465, SIGHUP, watched tcpdump confirm implicit-TLS handshake on port 465. Three pending alerts (TargetDown, HostHighCpuUsage, HostHighDiskUsage) delivered to inbox within minutes. Alerting back online for the first time in 13+ days.

Key finding worth documenting: alertmanager's email integration (built on Go's stdlib net/smtp) DOES handle implicit TLS on port 465 natively. No smtp_tls_config block needed, no stunnel sidecar. Just set the smarthost port to 465 + smtp_require_tls: true and reload. tcpdump confirms the TLS-on-connect handshake completes correctly.
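The difference between the two submission ports can be reproduced outside alertmanager. The sketch below uses Python's standard `smtplib` to mirror the swaks checks: implicit TLS (TLS from the first byte) on 465, STARTTLS upgrade on 587. The function names are hypothetical and the credentials are supplied by the caller; nothing here is taken from the project's code.

```python
import smtplib
import ssl


def transport_for_port(port: int) -> str:
    """Pick the TLS mode conventionally used on a submission port."""
    return "implicit-tls" if port == 465 else "starttls"


def check_smtp_login(host: str, port: int, user: str, password: str,
                     timeout: float = 10.0) -> None:
    """Connect, secure the channel, and authenticate; raises on any failure."""
    context = ssl.create_default_context()
    mode = transport_for_port(port)
    if mode == "implicit-tls":
        # TLS handshake happens immediately on connect (port 465).
        client = smtplib.SMTP_SSL(host, port, timeout=timeout, context=context)
    else:
        # Plain connect first, then upgrade with STARTTLS (port 587).
        client = smtplib.SMTP(host, port, timeout=timeout)
    with client:
        if mode == "starttls":
            client.starttls(context=context)
        client.login(user, password)
```

A connect timeout from `check_smtp_login` on 465 with a working 587 points at a network-level block rather than at credentials, which matches the sequence of findings in the list above.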

SMTP password rotation (mid-flow)

User rotated the SMTP password mid-debugging because it leaked into chat (swaks base64 line was redactable but not redacted on the initial paste). New value propagated to /admin/settings SMTP block AND alertmanager.yml. swaks verified with --auth LOGIN on 587 (the path the app uses) — 235 Authentication successful followed by 250 Ok: queued. Test email landed.

Hetzner doc 5h debug payback (1227567d)

Updated docs/deployment/hetzner-server-setup.md:

  • Step 4 (Firewall Configuration) gets a warning admonition right after the UFW status check, explaining that Hetzner Cloud blocks outbound 25 and 465 at the network layer (invisible from the VM), with the symptom signature and the auto-approved unblock ticket template ready to paste.
  • Step 19.5 (Alertmanager SMTP Setup) gets a "live prod uses mail1.myservices.hosting:465, not SendGrid" callout reflecting the reality that the SendGrid configuration documented in §19.5 is no longer how this prod env is wired. Includes the live alertmanager SMTP block (with smtp_auth_password kept gitignored, only .example ships in repo), the two prerequisites (Hetzner 465 unblock + implicit-TLS-aware smarthost port), and the redacted swaks verification command.

Saves the next person from repeating the same 5-hour detour.

Status board delta

  • Step 6 (web user-journey E2E tests): Tests 1, 2, 3, 4, 5.0 and 5.1 done. Tests 5.2/5.3 are the next concrete browser steps (loyalty dashboard + history with the 168-pt customer).
  • Step 19 (alerting infrastructure) — email delivery now works, remove the previously-flagged "silently broken in production" item.
  • Step 6 implicit blockers — all cleared: prod is serving today's i18n + redirect + flicker fixes, alertmanager email flows, no outstanding deploy blockers.

Carry over for next session

  1. Test 5.2 → login as samir.boulahtit+17mayf@gmail.com, visit /account/loyalty, confirm 168 pts balance + cross-store rewards render correctly.
  2. Test 5.3 → /account/loyalty/history, confirm the 5 transactions (50+143+43+32 earned at FASHIONHUB, 100 redeemed at FASHIONOUTLET).
  3. Standing backlog as before — DE/LB email template quality sweep, transaction categories permissions audit, routing pass, B1-F unit tests, prospecting/tasks/__init__.py, other-module email audit.
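As a quick sanity check before walking Tests 5.2 and 5.3, the expected history and balance can be reconciled. The `Txn` type and `balance` function below are hypothetical helpers for this check only; the point values and store names come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    store: str
    kind: str  # "earn" or "redeem"
    points: int


def balance(txns: list[Txn]) -> int:
    """Sum earned points and subtract redeemed points."""
    return sum(t.points if t.kind == "earn" else -t.points for t in txns)


EXPECTED_HISTORY = [
    Txn("FASHIONHUB", "earn", 50),
    Txn("FASHIONHUB", "earn", 143),
    Txn("FASHIONHUB", "earn", 43),
    Txn("FASHIONHUB", "earn", 32),
    Txn("FASHIONOUTLET", "redeem", 100),
]

# 50 + 143 + 43 + 32 = 268 earned, minus 100 redeemed = 168
print(balance(EXPECTED_HISTORY))
```

The five rows and the 168-point balance agree, so a mismatch on either page during the test would indicate a display or query bug rather than bad test data.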

Status board

| # | Pre-launch step | State | Notes |
|---|---|---|---|
| 1 | Seed loyalty email templates on prod | | 20 rows (5 templates × 4 locales) all is_active=true |
| 2 | Google Wallet config on Hetzner | | Wallet config validator green: credentials valid, issuer 3388000000023089598, origin https://rewardflow.lu, default logo reachable |
| 3 | Database migrations | | All four module heads current incl. loyalty_011 (acting-device audit) on prod |
| 4 | FR/DE/LB translations for analytics i18n keys | 🟡 | 8 keys still EN-only. Cosmetic, doesn't block soft launch |
| 5 | messaging.manage_templates permission for store owners | 🟡 | Only matters if merchants self-edit templates. Admin can edit centrally. Defer |
| 6 | 8 web user-journey E2E tests | | The remaining gate: user does this with a real test customer |
| 6b | 6 Android terminal E2E tests | | Pairing, PIN, daily flows, offline queue, auto-lock, device revoke; gated on user obtaining a tablet |
| 7 | Google Wallet real-device pass test | | Already confirmed earlier: cards register, points/redeem visible on personal Google Wallet |
| 8 | Go live | | Gated by #6. Cleanup test data + enable platform feature flags for FASHIONHUB |
| 9 | Google Wallet production access | | Post-launch, 1-3 day Google review. App-side change is zero; same issuer + service account, passes become public-visible once approved |

What got sorted tonight

SMTP wired to a self-hosted mail server

Started here:

  • prod .env had EMAIL_PROVIDER=sendgrid + a SendGrid API key
  • SendGrid free trial (60 days) had expired
  • SMTP_* env vars were placeholders pointing at smtp.example.com

Discovered that /admin/settings lets you store SMTP config in the DB (table admin_settings, category email) and those values win over .env. User had already configured:

  • email_provider=smtp
  • smtp_host=mail1.myservices.hosting
  • smtp_port=465 ← problematic
  • smtp_user=support@wizard.lu / encrypted password
  • smtp_use_ssl=true, smtp_use_tls=false

Diagnosis from the prod container:

| Check | Result |
|---|---|
| DNS resolves mail1.myservices.hosting | 185.26.107.245 |
| TCP mail1.myservices.hosting:465 | timed out |
| TCP mail1.myservices.hosting:587 | open |
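The TCP rows of this diagnosis can be reproduced with a few lines of standard-library Python when `nc` is not available inside the container. `probe_tcp` is a hypothetical helper name; the sketch only distinguishes the outcomes that mattered here (open, timed out, refused).

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    """Try a TCP connect and classify the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all: typical of a silent drop by an upstream firewall.
        return "timed out"
    except ConnectionRefusedError:
        # Host reachable, nothing listening (or an active reject).
        return "refused"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar errors.
        return f"error: {exc}"


if __name__ == "__main__":
    for port in (465, 587):
        print(port, probe_tcp("mail1.myservices.hosting", port))
```

A "timed out" result on one port with "open" on another for the same host is the signature of a port-specific block somewhere on the path, which cannot be located from the VM alone.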

Either Hetzner blocks 465 outbound for this VPS or the provider firewalls Hetzner's IP range on 465. Either way, port 587 (submission + STARTTLS) is the modern path and works.

Fix: changed /admin/settings to port 587, SSL off, TLS on. Test email landed in inbox immediately, sender header Support Wizard <support@wizard.lu> — proving the DB override was being used.

Cosmetic bug found and fixed

The test email's body claimed the configuration that would have been used if .env were authoritative — i.e. it said Provider: sendgrid and From: noreply@wizard.lu even though the actual send went via SMTP from support@wizard.lu. Two places in the code:

  1. app/modules/core/routes/api/admin_settings.py::send_test_email — body template hardcoded app_settings.email_provider and app_settings.email_from_address
  2. app/modules/messaging/services/email_service.py — the "template not found" EmailLog branch recorded settings.email_provider / settings.email_from_address instead of the effective config

Both now read from get_effective_email_config(db) / self._platform_config, so the test email page and audit logs reflect what was actually used.
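The precedence rule behind this bug (DB-stored settings win over .env) can be sketched as a simple merge. This is not the project's `get_effective_email_config(db)`; the function name, the key names, and the rule that empty DB values do not override are assumptions made for illustration.

```python
def effective_email_config(env_defaults: dict, db_overrides: dict) -> dict:
    """Merge settings so that non-empty DB values override .env defaults."""
    merged = dict(env_defaults)
    for key, value in db_overrides.items():
        # Assumption: None or empty string means "not set in the DB".
        if value not in (None, ""):
            merged[key] = value
    return merged


env = {"email_provider": "sendgrid", "email_from_address": "noreply@wizard.lu"}
db = {"email_provider": "smtp", "email_from_address": "support@wizard.lu"}
effective = effective_email_config(env, db)
```

Anything that reports configuration back to a human (test email body, audit log) should read from the merged result, since reading `env` directly reproduces the mismatch described above.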

Commit: f2d1bdcd on master, deployed via Gitea Actions.

What the user does next

In priority order:

  1. Tonight or tomorrow — review email copy. Open /admin/email-templates and skim the 5 loyalty templates (EN locale). loyalty_enrollment, loyalty_welcome_bonus and loyalty_reward_ready are the customer-visible ones — adjust subject lines + body copy if anything reads off-brand.
  2. Walk the 8 web user-journey E2E tests: checklist at the bottom of app/modules/loyalty/docs/user-journeys.md. Use a personal email as the test customer.
     2b. Once a tablet is on hand: walk the 6 Android terminal tests, same doc, "Android Terminal Tests" section (Tests 9-14). Covers pairing (QR + manual), offline PIN bcrypt verify, daily flows (stamp/earn/redeem/enroll), offline queue drain, idle auto-lock, and device revocation cutoff.
  3. Flip live for FASHIONHUB — clean any test data, double-check Celery (docker compose ps | grep celery), enable loyalty feature on FASHIONHUB's stores via the admin UI.
  4. In parallel, file Google Wallet production access: pay.google.com/business/console → Wallet API → Manage → Request production access. Use sample pass screenshots from FASHIONHUB. Google reviews the Issuer, not individual merchants; once approved, all merchants on the platform are covered.

Open follow-ups (non-blocking)

These can wait but are worth tracking:

  • FR/DE/LB translations for the 8 analytics i18n keys (store.analytics.revenue_title, store.analytics.cohort_title, etc.). EN shows through; cosmetic only.
  • messaging.manage_templates permission discovery for merchant_owner role — needed if/when merchants self-edit templates. Admin can edit centrally for v1.
  • Failed-PIN-attempt reporting from Android tablet → server lockout counter — tablet bcrypts locally and silently fails; a stolen tablet's brute-forcer doesn't trip server-side lockout. Add a tiny POST /pins/{id}/record-failed-attempt endpoint plus a call from the PinViewModel's failure branch.
  • Splash screen + per-action success animation for the Android tablet — Phase F polish that was intentionally deferred.

Reference