docs(loyalty): record 2026-05-30 afternoon — prod-readiness 1-3 done + alerting back online

Picked up the morning's carry-over and ran the full prod-readiness chain end-to-end.

Resolution: SG credential out of git permanently via untrack + .example template (e44f5c04); per-host migration on prod (alertmanager.yml gitignored, real file lives outside git); deploy-api-only.sh succeeded for the first time; today's 9 queued loyalty commits live on prod with ?v=e44f5c04 (verified by re-running the loyalty redirect flicker repro: clean).

Multi-hour rabbit hole on actual email delivery: provider's port 587 PLAIN backend is OAuth-wired (returns RFC 6749 invalid_grant text for password auth); switched to provider's documented port 465 SSL/TLS endpoint. Discovered Hetzner Cloud blocks outbound 25 and 465 by default as anti-spam policy. Auto-approved unblock ticket landed in minutes; one-line smarthost change to :465 reactivated email alerting after 13+ days down. alertmanager handles implicit TLS on 465 natively, no stunnel/relay needed.

Hetzner doc updated with the egress-block warning + mail1 SMTP callout in 1227567d as 5h-debug payback.

Next session resumes at Test 5.2 (/account/loyalty with 168 pts customer) → 5.3 history.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-30 20:20:00 +02:00
parent 1227567d08
commit 947ca43c7b
1 changed files with 126 additions and 0 deletions
--- a/docs/proposals/loyalty-go-live-readiness.md
+++ b/docs/proposals/loyalty-go-live-readiness.md
@@ -633,6 +633,132 @@ continuing Test 5.
   Hetzner doc check, B1-F unit tests, `prospecting/tasks/__init__.py`,
   other-module email audit.

+## 2026-05-30 update (afternoon) — production-readiness items 1-3 resolved + alerting back online after 13+ days
+
+Picked up the carry-over list from this morning's wrap and ran it
+end-to-end. All four blockers are now closed and the loyalty queue
+landed on prod for the first time today.
+
+### SG-credential forensics + alertmanager.yml untracked
+
+The forensics traced the SG credential to user-driven `nano` edits in
+past sessions (bash history lines 290-357 confirm it). No rogue actor,
+just forgotten exploratory work that was never committed. Resolution
+shipped as `e44f5c04`:
+`git rm --cached monitoring/alertmanager/alertmanager.yml`,
+`.gitignore` entry added, and `monitoring/alertmanager/alertmanager.yml.example`
+ships in repo with the post-migration routing pre-filled
+(`mail1.myservices.hosting:465`, `support@wizard.lu` auth,
+`alerts@wizard.lu` From, only `smtp_auth_password: 'CHANGEME'` left
+for prod-side fill-in).
+
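The untrack pattern described above can be sketched in a throwaway repository. This is an illustration only, not the exact commands behind `e44f5c04`: the temp directory, the demo commit identity, and the `real-secret` value are assumptions made for the example.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a throwaway repo so the sketch never touches a real checkout.
demo=$(mktemp -d)
cd "$demo"
git init -q .
mkdir -p monitoring/alertmanager
cfg=monitoring/alertmanager/alertmanager.yml
printf "smtp_auth_password: 'real-secret'\n" > "$cfg"   # stand-in for the live secret
git add -A
git -c user.name=demo -c user.email=demo@example.test commit -q -m "config tracked by mistake"

# 1. Stop tracking the live file but keep it on disk.
git rm -q --cached "$cfg"
# 2. Ignore it from now on.
echo "$cfg" >> .gitignore
# 3. Ship a template that carries a placeholder instead of the secret.
sed "s/'real-secret'/'CHANGEME'/" "$cfg" > "$cfg.example"
git add .gitignore "$cfg.example"
git -c user.name=demo -c user.email=demo@example.test commit -q -m "untrack alertmanager.yml, add template"

tracked=$(git ls-files)
printf '%s\n' "$tracked"
```

After the second commit the real file still exists on disk and `git status` is clean. On any other checkout, pulling that commit deletes the previously tracked file, which is why the per-host step below restores it from the template.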
+### Per-host migration on prod + loyalty queue deployed
+
+On the Hetzner box: backup → `git checkout` → pull `e44f5c04` →
+copy template over the old file → fill in the SMTP password →
+SIGHUP alertmanager. Then `bash scripts/deploy-api-only.sh` ran cleanly
+for the first time (working tree no longer dirty), pulling the 9
+queued loyalty commits from this morning. Verified `?v=e44f5c04` on
+all rendered assets, ran the loyalty redirect repro (`localStorage`
+delete + F5 on `/account/loyalty`): spinner straight through to login
+redirect, no "Rejoignez..." CTA flash. Stage 1 + Stage 2 of the
+flicker fix work as designed once the browser actually sees the new
+JS.
+
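The per-host sequence above (backup, checkout, pull, copy template, fill in password, SIGHUP) might look roughly like the following dry-run plan. The exact arguments used on prod are not recorded here, so the backup path, the `git checkout -- <file>` form, the `--ff-only` pull, and the container name `alertmanager` are all assumptions.

```shell
#!/usr/bin/env bash
set -euo pipefail

CFG=monitoring/alertmanager/alertmanager.yml
BACKUP="${HOME:-/tmp}/alertmanager.yml.pre-migration"  # hypothetical location outside the repo
DRY_RUN=${DRY_RUN:-1}                                   # 1 = print the plan only, 0 = execute

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = 1 ]; then
    printf 'would run: %s\n' "$*"
  else
    "$@"
  fi
}

migrate_host() {
  run cp -p "$CFG" "$BACKUP"                 # keep the hand-edited file
  run git checkout -- "$CFG"                 # drop local edits so the tree is clean
  run git pull --ff-only                     # brings in the commit that untracks the file
  run cp "$CFG.example" "$CFG"               # start again from the shipped template
  run "${EDITOR:-nano}" "$CFG"               # fill in smtp_auth_password by hand
  run docker kill --signal=HUP alertmanager  # assumed container name; reloads the config
}

plan=$(migrate_host)
printf '%s\n' "$plan"
```

Running it as-is only prints the six steps. Setting `DRY_RUN=0` would execute them, which should only be done after checking every path against the actual host.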
+### The alertmanager email delivery rabbit hole — and the answer
+
+After the SMTP migration, alertmanager still couldn't send. The error
+log was identical to before the migration: `*smtp.plainAuth auth: 535
+Authentication failed: The provided authorization grant is invalid,
+expired, or revoked` — verbatim OAuth 2.0 RFC 6749 §5.2
+`invalid_grant` text. Multi-hour diagnosis chased through every
+plausible layer:
+
+1. Provider's port 587 `AUTH PLAIN` backend is OAuth-wired (returns
+   the OAuth-flavored 535 with a regular password). `AUTH LOGIN` on
+   the same port accepts the same credential cleanly. swaks proved
+   this.
+2. alertmanager uses Go stdlib `smtp.PlainAuth`, which prefers PLAIN
+   whenever the server advertises it. No config knob to force LOGIN.
+   `smtp_auth_identity` tweak had no effect.
+3. Provider's docs name port **465 SSL/TLS** as the official
+   submission endpoint — not 587. Switched to 465 → connection timed
+   out from prod AND from user's home laptop.
+4. VM-side sweep (UFW outbound = allow, iptables OUTPUT = ACCEPT,
+   nftables empty, DOCKER-USER empty) cleared the local firewall as a
+   cause. Block had to be upstream.
+5. Found Hetzner's documented anti-spam policy: *"Outgoing traffic to
+   ports 25 and 465 are blocked by default on all Cloud Servers. Send
+   us a request to unblock these ports."* The block is at the cloud
+   network layer, completely invisible from the VM.
+6. Filed Hetzner unblock ticket via Cloud Console. **Auto-approved
+   within minutes** — Hetzner has tooling for legitimate SMTP unblock
+   requests.
+7. Post-unblock: `nc -4 -zv mail1.myservices.hosting 465` succeeds.
+   swaks AUTH PLAIN on 465 succeeds (`235 Authentication
+   successful` + `250 2.0.0 Ok: queued`). One-line alertmanager change
+   from `:587` to `:465`, SIGHUP, watched tcpdump confirm implicit-TLS
+   handshake on port 465. Three pending alerts (TargetDown,
+   HostHighCpuUsage, HostHighDiskUsage) delivered to inbox within
+   minutes. **Alerting back online for the first time in 13+ days.**
+
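Items 1 and 2 hinge on the difference between the two SASL mechanisms. As a small sketch of what each one puts on the wire (the account and password below are placeholders, not the real credentials), `AUTH PLAIN` sends a single base64 blob while `AUTH LOGIN` sends user and password as two separate base64 replies:

```shell
#!/usr/bin/env bash
set -euo pipefail

user='alerts@example.test'     # placeholder account
pass='not-the-real-password'   # placeholder secret

# AUTH PLAIN (RFC 4616): base64 of "<authzid> NUL <user> NUL <password>",
# with the authorization identity left empty here.
plain_blob=$(printf '\0%s\0%s' "$user" "$pass" | base64 | tr -d '\n')

# AUTH LOGIN: user and password are base64-encoded separately and sent
# in response to two server prompts.
login_user=$(printf '%s' "$user" | base64 | tr -d '\n')
login_pass=$(printf '%s' "$pass" | base64 | tr -d '\n')

echo "AUTH PLAIN $plain_blob"
echo "AUTH LOGIN, then: $login_user"
echo "             then: $login_pass"
```

Both encodings carry the same credential and are trivially reversible, which is also why a pasted swaks transcript leaks the password unless the base64 lines are redacted.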
+Key finding worth documenting: alertmanager's email integration DOES
+handle implicit TLS on port 465 natively. Go's stdlib `net/smtp` has no
+implicit-TLS mode of its own; alertmanager opens the TLS connection
+itself when the smarthost port is `465`. No `smtp_tls_config` block
+needed, no stunnel sidecar. Just set the smarthost port to `465` +
+`smtp_require_tls: true` and reload. tcpdump confirms the
+TLS-on-connect handshake completes correctly.
+
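The one-line smarthost change can be rehearsed on a scratch copy before touching the live file. The fragment below is a minimal sketch: the host and addresses are the ones named earlier in this update, while the temp file and the `CHANGEME` password are placeholders.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Scratch copy of the relevant global SMTP settings, still on port 587.
cfg=$(mktemp)
cat > "$cfg" <<'YAML'
global:
  smtp_smarthost: 'mail1.myservices.hosting:587'
  smtp_from: 'alerts@wizard.lu'
  smtp_auth_username: 'support@wizard.lu'
  smtp_auth_password: 'CHANGEME'
  smtp_require_tls: true
YAML

# The one-line change: switch the smarthost port from 587 to 465.
# -i.bak keeps the previous version next to the edited file.
sed -i.bak "s/^\(  smtp_smarthost: '.*\):587'/\1:465'/" "$cfg"

smarthost_line=$(grep smtp_smarthost "$cfg")
echo "$smarthost_line"
```

If `amtool` is installed, `amtool check-config <file>` can validate a complete config before alertmanager is sent the SIGHUP.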
+### SMTP password rotation (mid-flow)
+
+User rotated the SMTP password mid-debugging because it leaked into
+chat (swaks base64 line was redactable but not redacted on the
+initial paste). New value propagated to `/admin/settings` SMTP block
+AND `alertmanager.yml`. swaks verified with `--auth LOGIN` on 587
+(the path the app uses) — `235 Authentication successful` followed by
+`250 Ok: queued`. Test email landed.
+
+### Hetzner doc 5h debug payback (`1227567d`)
+
+Updated `docs/deployment/hetzner-server-setup.md`:
+
+- **Step 4 (Firewall Configuration)** gets a warning admonition right
+  after the UFW status check, explaining that Hetzner Cloud blocks
+  outbound 25 and 465 at the network layer (invisible from the VM),
+  with the symptom signature and the auto-approved unblock ticket
+  template ready to paste.
+- **Step 19.5 (Alertmanager SMTP Setup)** gets a "live prod uses
+  mail1.myservices.hosting:465, not SendGrid" callout, because the
+  SendGrid configuration documented in §19.5 no longer matches how
+  prod is wired. Includes the live alertmanager
+  SMTP block (with `smtp_auth_password` kept gitignored, only
+  `.example` ships in repo), the two prerequisites (Hetzner 465
+  unblock + implicit-TLS-aware smarthost port), and the redacted
+  swaks verification command.
+
+Saves the next person from repeating the same 5-hour detour.
+
+### Status board delta
+
+- Step 6 (web user-journey E2E tests) — Tests 1 ✅, 2 ✅, 3 ✅, 4 ✅,
+  5.0 ✅, 5.1 ✅. Test 5.2/5.3 are the next concrete browser steps
+  (loyalty dashboard + history with the 168-pt customer).
+- Step 19 (alerting infrastructure) — **email delivery now works**;
+  remove the previously-flagged "silently broken in production" item.
+- Step 6 implicit blockers — all cleared: prod is serving today's
+  i18n + redirect + flicker fixes, alertmanager email flows, no
+  outstanding deploy blockers.
+
+### Carry over for next session
+
+1. **Test 5.2** → login as `samir.boulahtit+17mayf@gmail.com`, visit
+   `/account/loyalty`, confirm 168 pts balance + cross-store rewards
+   render correctly.
+2. **Test 5.3** → `/account/loyalty/history`, confirm the 5
+   transactions (50+143+43+32 earned at FASHIONHUB, −100 redeemed at
+   FASHIONOUTLET).
+3. Standing backlog as before — DE/LB email template quality sweep,
+   transaction categories permissions audit, routing pass, B1-F unit
+   tests, `prospecting/tasks/__init__.py`, other-module email audit.
+
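As a quick sanity check for Tests 5.2 and 5.3, the five transactions listed for the history page should add up to the 168-pt balance expected on the dashboard. A minimal sketch of that arithmetic, using only the figures from the carry-over list:

```shell
#!/usr/bin/env bash
set -euo pipefail

earned=(50 143 43 32)   # points earned at FASHIONHUB
redeemed=100            # points redeemed at FASHIONOUTLET

total=0
for p in "${earned[@]}"; do
  total=$((total + p))
done
balance=$((total - redeemed))

echo "earned=$total redeemed=$redeemed balance=$balance"
# prints: earned=268 redeemed=100 balance=168
```

If the dashboard shows anything other than 168, the mismatch is in the data or the rendering, not in the expected figures.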
 ## Status board

 | # | Pre-launch step | State | Notes |