docs(loyalty): record 2026-05-30 widget i18n + cache-bust + 401 redirect + alertmanager finding

Nine code commits shipped today (5f359283 → c13e8e29) covering Test 5 widget/customer-module i18n, a 53-reference cache-bust sweep with FE-024 rule tightening, the customer-storefront 401-to-/account/login redirect, the loyalty redirect-flicker fix, the login JS i18n sweep, and a new scripts/deploy-api-only.sh script + Hetzner §16.5 split. None of them are on prod yet: the deploy attempt showed that the new dirty-tree gate is correctly blocking on monitoring/alertmanager/alertmanager.yml, which holds a SendGrid API key pasted into a tracked file. Knock-on finding: alertmanager has been running with stale empty SMTP config for 13+ days, and the file still references SendGrid instead of the post-migration smarthost, so prod's alerting is silently broken. User opted to fix the prod-readiness items first thing tomorrow before resuming Test 5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-29 23:20:07 +02:00
parent c13e8e29b5
commit cff0b3f911
1 changed files with 204 additions and 0 deletions
--- a/docs/proposals/loyalty-go-live-readiness.md
+++ b/docs/proposals/loyalty-go-live-readiness.md
@@ -429,6 +429,210 @@ depth is cheap.
   prospecting `tasks/__init__.py` missing import, other-module email
   audit.

+## 2026-05-30 update — Test 5 widget i18n + cache-bust sweep + 401 storefront redirect + critical prod-readiness findings
+
+### Test 5 — customer dashboard surfaced 2 i18n defects
+
+After Test 5.1 (customer login) succeeded, `/account/dashboard` showed
+two issues on FR locale: the Loyalty Rewards card was hardcoded English
+("Loyalty Rewards" / "View your points & rewards" / "Points Balance")
+and the Account Summary section rendered the raw
+`customers.customer_number` key instead of its translation.
+
+Root cause for the card: `StorefrontDashboardCard` is populated by
+widget providers (loyalty, orders), and the widget contract had no
+language threading. Root cause for the raw key: the customers-module
+locale JSON has a redundant top-level `"customers"` wrapper, so the
+real resolvable path is `customers.customers.customer_number` (the
+same double-prefix pattern as `loyalty.loyalty.wallet.apple`).
+
+Fix in `5f359283`: added `language` field to `WidgetContext`, customer
+dashboard route passes `request.state.language`, loyalty and orders
+widget providers translate server-side via the new `widget.*` namespace
+in their locale files (4 locales each). Fixed the 8 single-prefix
+references to use the actual double-prefix path.
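+
+A minimal sketch of why the single-prefix key fails, assuming the
+translation layer takes the first key segment as the module namespace
+and walks the remaining segments through that module's locale JSON.
+`resolve_key`, `translate`, the `CATALOG` shape, and the French string
+are hypothetical stand-ins, not the project's real i18n code:
+
+```python
+from typing import Any, Optional
+
+# Assumed shape: the locale file for the "customers" module repeats the
+# module name as a redundant top-level key. The French text is a placeholder.
+CUSTOMERS_LOCALE = {"customers": {"customer_number": "Numéro de client"}}
+CATALOG = {"customers": CUSTOMERS_LOCALE}  # namespace -> file contents
+
+
+def resolve_key(catalog: dict, dotted_key: str) -> Optional[str]:
+    """Walk a dotted key through nested dicts; None means unresolved."""
+    node: Any = catalog
+    for part in dotted_key.split("."):
+        if not isinstance(node, dict) or part not in node:
+            return None
+        node = node[part]
+    return node if isinstance(node, str) else None
+
+
+def translate(catalog: dict, dotted_key: str) -> str:
+    """Fall back to the raw key when the lookup fails."""
+    return resolve_key(catalog, dotted_key) or dotted_key
+```
+
+Under these assumptions, `translate(CATALOG, "customers.customer_number")`
+falls back to the raw key, which matches what the dashboard rendered,
+while `customers.customers.customer_number` resolves to the string.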
+
+### Cache-busting audit — FE-024 had two real gaps
+
+User flagged that `?v=<commit-sha>` was missing from many assets. Audit
+traced it to two problems in the FE-024 architecture rule:
+
+1. The anti-pattern only matched `url_for('<module>_static', ...)` mount
+   names — missed the bare `'static'` mount which is what every persona
+   `base.html` uses for shared JS / CSS / Tailwind output.
+2. `base.html` files were in the rule's exception list — exactly the
+   files where most shared includes live.
+
+Fix in `3ce94683`: swept 5 persona `base.html` files + 15 standalone
+templates (login, register, forgot/reset password, error pages,
+onboarding, invitation-accept, admin module-info/config, etc.) — 53
+references for `.js`/`.css` files converted from raw `url_for('static',
+...)` to `static_v(request, 'static', ...)`. Then tightened the FE-024
+rule to add an anti-pattern for the bare `'static'` mount and dropped
+`base.html` from the exception list (kept `partials/`). Validator
+baseline unchanged at 126 warnings, 0 FE-024 hits.
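+
+The tightened anti-pattern could look roughly like the sketch below. The
+regex, `find_violations`, and the sample asset paths are assumptions
+made for illustration; the actual FE-024 rule definition is not shown
+in this document:
+
+```python
+import re
+
+# Hypothetical anti-pattern: a raw url_for() on the bare 'static' mount
+# whose path ends in .js or .css.
+BARE_STATIC = re.compile(
+    r"""url_for\(\s*['"]static['"]\s*,[^)]*\.(?:js|css)['"]\s*\)"""
+)
+
+
+def find_violations(template_source: str) -> list[int]:
+    """Return 1-based line numbers that contain the anti-pattern."""
+    return [
+        lineno
+        for lineno, line in enumerate(template_source.splitlines(), start=1)
+        if BARE_STATIC.search(line)
+    ]
+
+
+# Line 1 is the anti-pattern, line 2 is the versioned form, line 3 is a
+# non-JS/CSS asset that the sweep described above did not target.
+SAMPLE = """\
+<script src="{{ url_for('static', path='shared/js/api-client.js') }}"></script>
+<script src="{{ static_v(request, 'static', path='shared/js/api-client.js') }}"></script>
+<img src="{{ url_for('static', path='shared/img/logo.png') }}">
+"""
+```
+
+Running `find_violations(SAMPLE)` flags only the first line, so a
+`base.html` that still uses the raw form would no longer pass once it
+is out of the exception list.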
+
+### 401 → /account/login redirect on customer storefront
+
+User saw the loyalty dashboard render the "Rejoignez notre programme"
+CTA even though they were enrolled. Diagnosis: the page route accepts
+the customer cookie; JS then calls `/api/v1/storefront/loyalty/card`
+which requires the Bearer token from `localStorage.customer_token`. The
+stored token was stale, server returned 401, JS swallowed it, the
+template's `x-show="!loading && !card"` branch fired with the join
+CTA.
+
+Fix in `a0ae6388`: added `redirectIfCustomerAreaUnauthorized()` helper
+to apiClient. On a `/account/*` page (and not on `/account/login`) it
+sets `window.location.href = '/account/login?next=<encoded-path>'`.
+Called from all three apiClient 401 handlers (request, requestFormData,
+getBlob). Customer login now honours `?next=` (alongside the legacy
+`?return=`). Also fixed `getToken()` and `clearTokens()` path detection
+to recognise `/account/*` and `/api/v1/storefront/*` (was hardcoded to
+`/shop/*` from before the migration to `/storefront`). Customer JWT
+TTL is 30 minutes (`JWT_EXPIRE_MINUTES` env var,
+`middleware/auth.py:75`).
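+
+The redirect decision itself is small. The real helper is JavaScript in
+apiClient; the sketch below restates only its path logic in Python so
+it can be checked in isolation. `login_redirect_target` is a
+hypothetical name, and the exact prefix checks are an assumption based
+on the description above:
+
+```python
+from typing import Optional
+from urllib.parse import quote
+
+LOGIN_PATH = "/account/login"
+
+
+def login_redirect_target(pathname: str) -> Optional[str]:
+    """Return the login URL to navigate to, or None when no redirect applies."""
+    in_customer_area = pathname.startswith("/account/")
+    if not in_customer_area or pathname.startswith(LOGIN_PATH):
+        return None  # not a customer page, or already on the login page
+    # safe="" percent-encodes "/" as well, so the path travels as one value.
+    return f"{LOGIN_PATH}?next={quote(pathname, safe='')}"
+```
+
+For `/account/loyalty` this yields
+`/account/login?next=%2Faccount%2Floyalty`; for `/account/login` it
+returns `None`, which is what prevents a redirect loop on the login
+page itself.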
+
+Followed up with `856db328` — removed the dead `/shop/` predicates
+entirely. Pure dead-code cleanup, no behaviour change.
+
+### Loyalty redirect flicker — two-stage fix
+
+User repro'd by deleting `localStorage.customer_token` and F5'ing
+`/account/loyalty` — saw the "Rejoignez..." CTA flash for ~half a
+second before the redirect landed. Stage 1 (`b04b36a2`): flipped
+`loading: false` → `loading: true` initial state in `loyalty-dashboard.js`
+and `loyalty-history.js` so the template's `x-show="loading"` spinner
+covers the in-flight window. NOT enough on its own — the API throw
+triggered the caller's `.finally(() => loading = false)` *before* the
+browser actually navigated, so Alpine re-rendered with the wrong
+state mid-redirect. Stage 2 (`6564f138`): in all three apiClient 401
+handlers, return a never-resolving `new Promise(() => {})` instead of
+throwing when the redirect helper returns true. Caller's `await` never
+returns, `.finally` never fires, spinner stays up until navigation.
+
+### Login JS i18n sweep
+
+`bbb481aa` translated the "Welcome back to your shopping experience"
+branding subtitle on `/account/login`. `c9fe7171` translated the three
+remaining hardcoded Alpine toasts in the same template:
+post-registration banner, post-login success toast, login-failure
+fallback. Two new `auth.*` keys × 4 locales; the third reuses the
+existing `auth.invalid_credentials`.
+
+### `.build-info` stale → new `scripts/deploy-api-only.sh`
+
+User repeatedly redeployed and refreshed but every redirect repro still
+flickered. Eventually noticed in the browser console:
+`loadCard https://.../js/loyalty-dashboard.js?v=acbe2eff:50` — the
+`?v=` was yesterday's commit hash. Browser was serving cached pre-fix
+JS because the cache-bust query never bumped.
+
+Root cause: `?v=` is computed by `templates_config._asset_version()`
+from `app/core/build_info.py`, which reads `.build-info`. That file is
+bind-mounted from the host and is only written by `scripts/deploy.sh`
+(lines 42–45). The manual `git pull && docker compose up --build api`
+sequence everyone had been using never touched it, so `?v=` stayed
+pinned at the last `deploy.sh` run's commit — even though every
+intervening rebuild was correctly putting new code into the image.
+Five hours of "is this even deployed?" debugging chased to root.
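+
+A rough sketch of the failure mode, assuming `.build-info` is a small
+JSON file with a `commit` field (the real format is not shown here).
+`asset_version`, `with_version`, and the asset path are hypothetical
+stand-ins for `_asset_version()` and the template helper:
+
+```python
+import json
+import tempfile
+from pathlib import Path
+
+
+def asset_version(build_info_path: Path, fallback: str = "dev") -> str:
+    """Read the commit hash used for ?v=; fall back if the file is unusable."""
+    try:
+        data = json.loads(build_info_path.read_text())
+    except (OSError, ValueError):
+        return fallback
+    if not isinstance(data, dict):
+        return fallback
+    return str(data.get("commit", fallback))[:8]
+
+
+def with_version(url: str, version: str) -> str:
+    """Append the cache-bust query to an asset URL."""
+    separator = "&" if "?" in url else "?"
+    return f"{url}{separator}v={version}"
+
+
+with tempfile.TemporaryDirectory() as tmp:
+    build_info = Path(tmp) / ".build-info"
+    build_info.write_text(json.dumps({"commit": "acbe2eff"}))
+    stale = with_version("/js/loyalty-dashboard.js", asset_version(build_info))
+    # Rebuilding the image changes nothing here; only rewriting the file does.
+    build_info.write_text(json.dumps({"commit": "c13e8e29"}))
+    fresh = with_version("/js/loyalty-dashboard.js", asset_version(build_info))
+```
+
+`stale` keeps `?v=acbe2eff` for as long as the file is left alone, no
+matter what code the image contains, which is the behaviour seen in
+the browser console.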
+
+`deploy.sh` itself wasn't a substitute because it's a CI/CD script —
+stashes the working tree, runs alembic, restarts every service in the
+`full` profile (db, redis, api, celery-worker, celery-beat, flower),
+60s health budget. Heavy and disruptive for an api-only hotfix; the
+narrower manual pattern is correct, it was just missing the
+`.build-info` write.
+
+Built `scripts/deploy-api-only.sh` (`c13e8e29`) to fill the gap:
+refuses if working tree is dirty, `git pull --ff-only`, writes
+`.build-info`, `docker compose -f docker-compose.yml --profile full
+up -d --build api` (api only — db/redis/celery untouched), tight 30s
+health budget. Hetzner doc §16.5 split into 16.5a (code-only fix,
+default to the new script) and 16.5b (full `deploy.sh` fallback for
+migrations / Dockerfile / requirements changes).
+
+### 🔴 Critical prod-readiness findings — SG credential in git + alertmanager misconfigured post-SMTP-migration
+
+The new dirty-tree gate blocked the deploy because
+`monitoring/alertmanager/alertmanager.yml` has local modifications on
+prod. Diff inspection:
+
+```diff
+-  smtp_auth_password: ''                # TODO: Paste your SG.xxx API key here
++  smtp_auth_password: 'SG.xxxxxxxxx'    # TODO: Paste your SG.xxx API key here
+```
+
+Three production-readiness problems surfaced in one finding:
+
+1. **A SendGrid API key is pasted into a tracked git file on prod**, and
+   the in-repo template literally says "Paste your SG.xxx API key here"
+   next to the empty value — actively encouraging the anti-pattern.
+2. **The `alertmanager` container has been Up 13 days**, so it started
+   *before* the credential was pasted (file mtime 2026-05-29 01:09 UTC).
+   The running alertmanager process is therefore still using the empty
+   `smtp_auth_password` it read at container start. Any alert that
+   needs to send email today silently fails: alerting has been broken
+   for at least 13 days, probably longer.
+3. **The SMTP migration earlier this year never touched
+   `alertmanager.yml`.** That migration only updated the app's
+   notification settings in the `email_settings` DB table; alertmanager
+   reads its own config from disk and was never updated. So even with
+   a properly-loaded credential, the config still points at SendGrid
+   instead of `mail1.myservices.hosting`.
+
+User decided to defer today's loyalty deploy and tackle the
+alertmanager work as the first thing tomorrow — production-readiness
+gate ranks over incremental Test 5 progress, and fixing the root
+cause (credential out of git + correct SMTP smarthost + alertmanager
+reload) means the deploy will run clean without `--skip-worktree`
+gymnastics.
+
+### Status board delta
+
+- Step 6 (web user-journey E2E tests) — Tests 1 ✅, 2 ✅, 3 ✅, 4 ✅,
+  5.0 ✅, **5.1 in progress** (login + dashboard work, blocked on
+  prod deploy of today's fixes which are queued on `gitea/master` but
+  not yet served because of the unrelated alertmanager dirty-tree
+  blocker).
+- New step surfaced — **alerting infrastructure is silently broken
+  in production** (13+ days). Should be tracked as a go-live blocker;
+  prod is currently flying blind on alerting.
+
+### Carry over for next session
+
+User explicitly chose tomorrow's order: prod-readiness items 1–3 BEFORE
+deploying (item 4, which relies on item 3) and continuing Test 5.
+
+1. **Trace the SG credential paste origin** — user claims sole-developer
+   status but doesn't remember pasting. Grep shell history, check file
+   ownership, find when the credential was introduced. Understand the
+   path so it doesn't happen again.
+2. **Update `alertmanager.yml`** for the SendGrid → SMTP migration that
+   never landed: `smtp_smarthost: 'mail1.myservices.hosting:587'`,
+   `smtp_auth_username: 'support@wizard.lu'`, the SMTP password from
+   `/admin/settings`. Then SIGHUP alertmanager to hot-reload
+   (`docker compose -f docker-compose.yml --profile full kill -s SIGHUP
+   alertmanager`). Verify with a synthetic alert that email delivery
+   actually works.
+3. **Move credential out of git** — `git rm --cached
+   monitoring/alertmanager/alertmanager.yml`, add to `.gitignore`,
+   ship `monitoring/alertmanager/alertmanager.yml.example` as the
+   template (with empty placeholder + comment pointing at the deploy
+   doc for the real values). Closes the recurrence path.
+4. **Deploy today's queued loyalty fixes** — with `alertmanager.yml`
+   gitignored, the working tree on prod is clean and `bash
+   scripts/deploy-api-only.sh` should run without the `--skip-worktree`
+   dance. Then verify `?v=c13e8e29` (or later) on rendered assets.
+5. **Re-run the loyalty redirect repro** to confirm the flicker is
+   gone now that today's JS actually reaches the browser.
+6. **Continue Test 5** from 5.1 → 5.2 (/account/loyalty, 168 pts) →
+   5.3 (/account/loyalty/history).
+7. **Standing backlog** (lower priority): DE/LB email template quality
+   sweep, transaction categories permissions audit, routing pass,
+   Hetzner doc check, B1-F unit tests, `prospecting/tasks/__init__.py`,
+   other-module email audit.
+
 ## Status board

 | # | Pre-launch step | State | Notes |