From 947ca43c7bd8e595625e275bfbbcc198c6c2c3fc Mon Sep 17 00:00:00 2001
From: Samir Boulahtit
Date: Sat, 30 May 2026 20:20:00 +0200
Subject: [PATCH] =?UTF-8?q?docs(loyalty):=20record=202026-05-30=20afternoo?=
 =?UTF-8?q?n=20=E2=80=94=20prod-readiness=201-3=20done=20+=20alerting=20ba?=
 =?UTF-8?q?ck=20online?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Picked up the morning's carry-over and ran the full prod-readiness chain
end-to-end. Resolution: SG credential out of git permanently via untrack
+ .example template (e44f5c04); per-host migration on prod
(alertmanager.yml gitignored, real file lives outside git);
deploy-api-only.sh succeeded for the first time; today's 9 queued
loyalty commits live on prod with ?v=e44f5c04 (and verified by
re-running the loyalty redirect flicker repro — clean).

Multi-hour rabbit hole on actual email delivery: provider's port 587
PLAIN backend is OAuth-wired (returns RFC 6749 invalid_grant text for
password auth); switched to provider's documented port 465 SSL/TLS
endpoint. Discovered Hetzner Cloud blocks outbound 25 and 465 by default
as anti-spam policy. Auto-approved unblock ticket landed in minutes;
one-line smarthost change to :465 reactivated email alerting after 13+
days down. alertmanager handles implicit TLS on 465 natively, no
stunnel/relay needed.

Hetzner doc updated with the egress-block warning + mail1 SMTP callout
in 1227567d as 5h-debug payback.

Next session resumes at Test 5.2 (/account/loyalty with 168 pts
customer) → 5.3 history.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/proposals/loyalty-go-live-readiness.md | 126 ++++++++++++++++++++
 1 file changed, 126 insertions(+)

diff --git a/docs/proposals/loyalty-go-live-readiness.md b/docs/proposals/loyalty-go-live-readiness.md
index b2ce4614..19099ffb 100644
--- a/docs/proposals/loyalty-go-live-readiness.md
+++ b/docs/proposals/loyalty-go-live-readiness.md
@@ -633,6 +633,132 @@ continuing Test 5.
    Hetzner doc check, B1-F unit tests, `prospecting/tasks/__init__.py`,
    other-module email audit.
 
+## 2026-05-30 update (afternoon) — production-readiness items 1-3 resolved + alerting back online after 13+ days
+
+Picked up the carry-over list from this morning's wrap and ran it
+end-to-end. All four blockers are now closed and the loyalty queue
+landed on prod for the first time today.
+
+### SG-credential forensics + alertmanager.yml untracked
+
+User-driven nano edits in past sessions (bash history lines 290-357
+confirmed it). No rogue actor, just forgotten exploratory work that
+was never committed. Resolution shipped as `e44f5c04`:
+`git rm --cached monitoring/alertmanager/alertmanager.yml`,
+`.gitignore` entry added, and `monitoring/alertmanager/alertmanager.yml.example`
+ships in repo with the post-migration routing pre-filled
+(`mail1.myservices.hosting:465`, `support@wizard.lu` auth,
+`alerts@wizard.lu` From, only `smtp_auth_password: 'CHANGEME'` left
+for prod-side fill-in).
+
+### Per-host migration on prod + loyalty queue deployed
+
+On the Hetzner box: backup → `git checkout` → pull `e44f5c04` →
+copy template over the old file → fill in the SMTP password →
+SIGHUP alertmanager. Then `bash scripts/deploy-api-only.sh` for the
+first time ran cleanly (working tree no longer dirty), pulling the 9
+queued loyalty commits from this morning. Verified `?v=e44f5c04` on
+all rendered assets, ran the loyalty redirect repro (`localStorage`
+delete + F5 on `/account/loyalty`): spinner straight through to login
+redirect, no "Rejoignez..." CTA flash. Stage 1 + Stage 2 of the
+flicker fix work as designed once the browser actually sees the new
+JS.
+
+### The alertmanager email delivery rabbit hole — and the answer
+
+After the SMTP migration, alertmanager still couldn't send. The error
+log was identical to before the migration: `*smtp.plainAuth auth: 535
+Authentication failed: The provided authorization grant is invalid,
+expired, or revoked` — verbatim OAuth 2.0 RFC 6749 §5.2
+`invalid_grant` text. Multi-hour diagnosis chased through every
+plausible layer:
+
+1. Provider's port 587 `AUTH PLAIN` backend is OAuth-wired (returns
+   the OAuth-flavored 535 with a regular password). `AUTH LOGIN` on
+   the same port accepts the same credential cleanly. swaks proved
+   this.
+2. alertmanager uses Go stdlib `smtp.PlainAuth`, which prefers PLAIN
+   whenever the server advertises it. No config knob to force LOGIN.
+   `smtp_auth_identity` tweak had no effect.
+3. Provider's docs name port **465 SSL/TLS** as the official
+   submission endpoint — not 587. Switched to 465 → connection timed
+   out from prod AND from user's home laptop.
+4. VM-side sweep (UFW outbound = allow, iptables OUTPUT = ACCEPT,
+   nftables empty, DOCKER-USER empty) cleared the local firewall as a
+   cause. Block had to be upstream.
+5. Found Hetzner's documented anti-spam policy: *"Outgoing traffic to
+   ports 25 and 465 are blocked by default on all Cloud Servers. Send
+   us a request to unblock these ports."* The block is at the cloud
+   network layer, completely invisible from the VM.
+6. Filed Hetzner unblock ticket via Cloud Console. **Auto-approved
+   within minutes** — Hetzner has tooling for legitimate SMTP unblock
+   requests.
+7. Post-unblock: `nc -4 -zv mail1.myservices.hosting 465` succeeds.
+   swaks AUTH PLAIN on 465 succeeds (`235 Authentication
+   successful` + `250 2.0.0 Ok: queued`). One-line alertmanager change
+   from `:587` to `:465`, SIGHUP, watched tcpdump confirm implicit-TLS
+   handshake on port 465. Three pending alerts (TargetDown,
+   HostHighCpuUsage, HostHighDiskUsage) delivered to inbox within
+   minutes. **Alerting back online for the first time in 13+ days.**
+
+Key finding worth documenting: alertmanager's email integration via
+Go's stdlib `net/smtp` DOES handle implicit TLS on port 465 natively.
+No `smtp_tls_config` block needed, no stunnel sidecar. Just set the
+smarthost port to `465` + `smtp_require_tls: true` and reload. tcpdump
+confirms the TLS-on-connect handshake completes correctly.
+
+### SMTP password rotation (mid-flow)
+
+User rotated the SMTP password mid-debugging because it leaked into
+chat (swaks base64 line was redactable but not redacted on the
+initial paste). New value propagated to `/admin/settings` SMTP block
+AND `alertmanager.yml`. swaks verified with `--auth LOGIN` on 587
+(the path the app uses) — `235 Authentication successful` followed by
+`250 Ok: queued`. Test email landed.
+
+### Hetzner doc 5h debug payback (`1227567d`)
+
+Updated `docs/deployment/hetzner-server-setup.md`:
+
+- **Step 4 (Firewall Configuration)** gets a warning admonition right
+  after the UFW status check, explaining that Hetzner Cloud blocks
+  outbound 25 and 465 at the network layer (invisible from the VM),
+  with the symptom signature and the auto-approved unblock ticket
+  template ready to paste.
+- **Step 19.5 (Alertmanager SMTP Setup)** gets a "live prod uses
+  mail1.myservices.hosting:465, not SendGrid" callout reflecting the
+  reality that the SendGrid configuration documented in §19.5 is no
+  longer how this prod env is wired. Includes the live alertmanager
+  SMTP block (with `smtp_auth_password` kept gitignored, only
+  `.example` ships in repo), the two prerequisites (Hetzner 465
+  unblock + implicit-TLS-aware smarthost port), and the redacted
+  swaks verification command.
+
+Saves the next person from repeating the same 5-hour detour.
+
+### Status board delta
+
+- Step 6 (web user-journey E2E tests) — Tests 1 ✅, 2 ✅, 3 ✅, 4 ✅,
+  5.0 ✅, 5.1 ✅. Test 5.2/5.3 are the next concrete browser steps
+  (loyalty dashboard + history with the 168-pt customer).
+- Step 19 (alerting infrastructure) — **email delivery now works**,
+  remove the previously-flagged "silently broken in production" item.
+- Step 6 implicit blockers — all cleared: prod is serving today's
+  i18n + redirect + flicker fixes, alertmanager email flows, no
+  outstanding deploy blockers.
+
+### Carry over for next session
+
+1. **Test 5.2** → login as `samir.boulahtit+17mayf@gmail.com`, visit
+   `/account/loyalty`, confirm 168 pts balance + cross-store rewards
+   render correctly.
+2. **Test 5.3** → `/account/loyalty/history`, confirm the 5
+   transactions (50+143+43+32 earned at FASHIONHUB, −100 redeemed at
+   FASHIONOUTLET).
+3. Standing backlog as before — DE/LB email template quality sweep,
+   transaction categories permissions audit, routing pass, B1-F unit
+   tests, `prospecting/tasks/__init__.py`, other-module email audit.
+
 ## Status board
 
 | # | Pre-launch step | State | Notes |