docs(loyalty): record 2026-05-30 afternoon — prod-readiness 1-3 done + alerting back online

Picked up the morning's carry-over and ran the full prod-readiness
chain end-to-end. Resolution: SG credential out of git permanently
via untrack + .example template (e44f5c04); per-host migration on
prod (alertmanager.yml gitignored, real file lives outside git);
deploy-api-only.sh succeeded for the first time; today's 9 queued
loyalty commits live on prod with ?v=e44f5c04 (and verified by
re-running the loyalty redirect flicker repro — clean).

Multi-hour rabbit hole on actual email delivery: provider's port 587
PLAIN backend is OAuth-wired (returns RFC 6749 invalid_grant text
for password auth); switched to provider's documented port 465 SSL/TLS
endpoint. Discovered Hetzner Cloud blocks outbound 25 and 465 by
default as anti-spam policy. Auto-approved unblock ticket landed in
minutes; one-line smarthost change to :465 reactivated email
alerting after 13+ days down. alertmanager handles implicit TLS on
465 natively, no stunnel/relay needed.

Hetzner doc updated with the egress-block warning + mail1 SMTP
callout in 1227567d as 5h-debug payback. Next session resumes at
Test 5.2 (/account/loyalty with 168 pts customer) → 5.3 history.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-30 20:20:00 +02:00
parent 1227567d08
commit 947ca43c7b

@@ -633,6 +633,132 @@ continuing Test 5.
Hetzner doc check, B1-F unit tests, `prospecting/tasks/__init__.py`,
other-module email audit.
## 2026-05-30 update (afternoon) — production-readiness items 1-3 resolved + alerting back online after 13+ days
Picked up the carry-over list from this morning's wrap and ran it
end-to-end. All four blockers are now closed and the loyalty queue
landed on prod for the first time today.
### SG-credential forensics + alertmanager.yml untracked
The forensics pointed to user-driven nano edits in past sessions (bash
history lines 290-357 confirm it). No rogue actor, just forgotten
exploratory work that was never committed. Resolution shipped as `e44f5c04`:
`git rm --cached monitoring/alertmanager/alertmanager.yml`,
`.gitignore` entry added, and `monitoring/alertmanager/alertmanager.yml.example`
ships in repo with the post-migration routing pre-filled
(`mail1.myservices.hosting:465`, `support@wizard.lu` auth,
`alerts@wizard.lu` From, only `smtp_auth_password: 'CHANGEME'` left
for prod-side fill-in).
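
The prod-side fill-in step could be scripted along these lines. This is a minimal sketch, not the procedure actually used on prod: the helper name `fill_template` and the sample template text are assumptions, and only the `smtp_auth_password: 'CHANGEME'` placeholder comes from the `.example` file described above.

```python
def fill_template(template_text: str, password: str) -> str:
    """Return the template with the CHANGEME placeholder replaced.

    Hypothetical helper: it refuses to run unless the placeholder appears
    exactly once, so an already-filled or edited file is not silently accepted.
    """
    placeholder = "smtp_auth_password: 'CHANGEME'"
    count = template_text.count(placeholder)
    if count != 1:
        raise ValueError(f"expected exactly one placeholder, found {count}")
    # YAML single-quoted scalars escape a single quote by doubling it.
    escaped = password.replace("'", "''")
    return template_text.replace(placeholder, f"smtp_auth_password: '{escaped}'")


# Illustrative template fragment (layout assumed, not copied from the repo).
example = (
    "global:\n"
    "  smtp_smarthost: 'mail1.myservices.hosting:465'\n"
    "  smtp_auth_password: 'CHANGEME'\n"
)
filled = fill_template(example, "not-the-real-secret")
```

Failing loudly when the placeholder is missing means a second run against an already-filled file raises instead of writing a no-op.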
### Per-host migration on prod + loyalty queue deployed
On the Hetzner box: backup → `git checkout` → pull `e44f5c04` →
copy template over the old file → fill in the SMTP password →
SIGHUP alertmanager. Then `bash scripts/deploy-api-only.sh` ran cleanly
for the first time (working tree no longer dirty), pulling the 9
queued loyalty commits from this morning. Verified `?v=e44f5c04` on
all rendered assets, ran the loyalty redirect repro (`localStorage`
delete + F5 on `/account/loyalty`): spinner straight through to login
redirect, no "Rejoignez..." CTA flash. Stage 1 + Stage 2 of the
flicker fix work as designed once the browser actually sees the new
JS.
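
The "`?v=e44f5c04` on all rendered assets" check can be automated roughly as follows. This is a sketch under stated assumptions: the asset paths in the sample page are invented, and the helper names `AssetCollector` and `stale_assets` are hypothetical. Only the `?v=<commit>` convention is taken from the log above.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class AssetCollector(HTMLParser):
    """Collect script src and stylesheet href URLs from an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "script" and attr.get("src"):
            self.urls.append(attr["src"])
        elif tag == "link" and attr.get("rel") == "stylesheet" and attr.get("href"):
            self.urls.append(attr["href"])


def stale_assets(html: str, expected: str) -> list[str]:
    """Return asset URLs whose ?v= parameter is not the expected version."""
    collector = AssetCollector()
    collector.feed(html)
    return [
        url for url in collector.urls
        if parse_qs(urlparse(url).query).get("v") != [expected]
    ]


# Hypothetical page fragment: two current assets and one stale one.
page = """
<link rel="stylesheet" href="/static/app.css?v=e44f5c04">
<script src="/static/loyalty.js?v=e44f5c04"></script>
<script src="/static/old.js?v=deadbeef"></script>
"""
stale = stale_assets(page, "e44f5c04")
```

An empty result means every collected asset carries the expected cache-busting version; assets with no `v` parameter at all are also reported as stale.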
### The alertmanager email delivery rabbit hole — and the answer
After the SMTP migration, alertmanager still couldn't send. The error
log was identical to before the migration: `*smtp.plainAuth auth: 535
Authentication failed: The provided authorization grant is invalid,
expired, or revoked` — verbatim OAuth 2.0 RFC 6749 §5.2
`invalid_grant` text. Multi-hour diagnosis chased through every
plausible layer:
1. Provider's port 587 `AUTH PLAIN` backend is OAuth-wired (returns
the OAuth-flavored 535 with a regular password). `AUTH LOGIN` on
the same port accepts the same credential cleanly. swaks proved
this.
2. alertmanager uses Go stdlib `smtp.PlainAuth`, which prefers PLAIN
whenever the server advertises it. No config knob to force LOGIN.
`smtp_auth_identity` tweak had no effect.
3. Provider's docs name port **465 SSL/TLS** as the official
submission endpoint — not 587. Switched to 465 → connection timed
out from prod AND from user's home laptop.
4. VM-side sweep (UFW outbound = allow, iptables OUTPUT = ACCEPT,
nftables empty, DOCKER-USER empty) cleared the local firewall as a
cause. Block had to be upstream.
5. Found Hetzner's documented anti-spam policy: *"Outgoing traffic to
ports 25 and 465 are blocked by default on all Cloud Servers. Send
us a request to unblock these ports."* The block is at the cloud
network layer, completely invisible from the VM.
6. Filed Hetzner unblock ticket via Cloud Console. **Auto-approved
within minutes** — Hetzner has tooling for legitimate SMTP unblock
requests.
7. Post-unblock: `nc -4 -zv mail1.myservices.hosting 465` succeeds.
swaks AUTH PLAIN on 465 succeeds (`235 Authentication
successful` + `250 2.0.0 Ok: queued`). One-line alertmanager change
from `:587` to `:465`, SIGHUP, watched tcpdump confirm implicit-TLS
handshake on port 465. Three pending alerts (TargetDown,
HostHighCpuUsage, HostHighDiskUsage) delivered to inbox within
minutes. **Alerting back online for the first time in 13+ days.**
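
The signal that separated "upstream block" from "wrong port or dead service" in steps 3 to 5 was a connect timeout rather than a refusal. A small probe like the one below makes that distinction explicit. It is a sketch only: `probe` is a hypothetical helper, and the mapping of outcomes to causes is a rule of thumb (a silently dropped packet typically times out, a closed port on a reachable host typically refuses), not a guarantee.

```python
import socket


def probe(host: str, port: int, timeout: float = 5.0) -> str:
    """Classify a TCP connect attempt as 'open', 'refused', 'timeout', or 'error'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered with a reset: reachable, but nothing is listening.
        return "refused"
    except (socket.timeout, TimeoutError):
        # No answer at all: consistent with a drop somewhere on the path.
        return "timeout"
    except OSError:
        # DNS failure, unreachable network, and similar conditions.
        return "error"
```

Running `probe("mail1.myservices.hosting", 465)` and the same call with `587` side by side from the affected host would show whether one port is dropped silently while the other connects.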
Key finding worth documenting: alertmanager's email integration (built
on Go's stdlib `net/smtp`) DOES handle implicit TLS on port 465 natively.
No `smtp_tls_config` block needed, no stunnel sidecar. Just set the
smarthost port to `465` + `smtp_require_tls: true` and reload. tcpdump
confirms the TLS-on-connect handshake completes correctly.
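
For comparison, the difference between implicit TLS on 465 and STARTTLS on 587 looks like this with Python's standard `smtplib`. This is an analogy for readers, not alertmanager's implementation; `tls_mode` and `open_smtp` are hypothetical helper names, and the port-to-mode mapping is the common convention rather than something the server enforces.

```python
import smtplib
import ssl


def tls_mode(port: int) -> str:
    """Map a submission port to the TLS style conventionally used on it."""
    return "implicit" if port == 465 else "starttls"


def open_smtp(host: str, port: int, timeout: float = 10.0) -> smtplib.SMTP:
    """Open an SMTP session: TLS-on-connect for 465, STARTTLS otherwise."""
    context = ssl.create_default_context()
    if tls_mode(port) == "implicit":
        # The TLS handshake happens before any SMTP bytes are exchanged.
        return smtplib.SMTP_SSL(host, port, timeout=timeout, context=context)
    client = smtplib.SMTP(host, port, timeout=timeout)
    client.ehlo()
    client.starttls(context=context)  # upgrade the plaintext session in place
    client.ehlo()  # re-read capabilities over the encrypted channel
    return client
```

With implicit TLS a packet capture shows a TLS ClientHello as the first payload, which matches the TLS-on-connect handshake seen in tcpdump.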
### SMTP password rotation (mid-flow)
User rotated the SMTP password mid-debugging because it leaked into
chat (swaks base64 line was redactable but not redacted on the
initial paste). New value propagated to `/admin/settings` SMTP block
AND `alertmanager.yml`. swaks verified with `--auth LOGIN` on 587
(the path the app uses) — `235 Authentication successful` followed by
`250 Ok: queued`. Test email landed.
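
Why a pasted base64 line counts as a leaked password: the `AUTH PLAIN` argument is only an encoding of `NUL username NUL password` (RFC 4616), so anyone who sees it can reverse it. The sketch below illustrates this with made-up credentials; the function names are hypothetical.

```python
import base64


def encode_auth_plain(username: str, password: str) -> str:
    """Build the base64 argument of an SMTP AUTH PLAIN command (empty authzid)."""
    raw = f"\0{username}\0{password}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def decode_auth_plain(token: str) -> tuple[str, str, str]:
    """Recover (authzid, username, password) from an AUTH PLAIN token."""
    authzid, username, password = base64.b64decode(token).decode("utf-8").split("\0")
    return authzid, username, password


# Made-up credentials, for illustration only.
token = encode_auth_plain("user", "pass")
recovered = decode_auth_plain(token)
```

`AUTH LOGIN` offers no more protection: it sends the username and password as two separate base64 strings, so both mechanisms need redaction before a transcript is shared.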
### Hetzner doc 5h debug payback (`1227567d`)
Updated `docs/deployment/hetzner-server-setup.md`:
- **Step 4 (Firewall Configuration)** gets a warning admonition right
after the UFW status check, explaining that Hetzner Cloud blocks
outbound 25 and 465 at the network layer (invisible from the VM),
with the symptom signature and the auto-approved unblock ticket
template ready to paste.
- **Step 19.5 (Alertmanager SMTP Setup)** gets a "live prod uses
mail1.myservices.hosting:465, not SendGrid" callout reflecting the
reality that the SendGrid configuration documented in §19.5 is no
longer how this prod env is wired. Includes the live alertmanager
SMTP block (with `smtp_auth_password` kept gitignored, only
`.example` ships in repo), the two prerequisites (Hetzner 465
unblock + implicit-TLS-aware smarthost port), and the redacted
swaks verification command.
Saves the next person from repeating the same 5-hour detour.
### Status board delta
- Step 6 (web user-journey E2E tests) — Tests 1 ✅, 2 ✅, 3 ✅, 4 ✅,
5.0 ✅, 5.1 ✅. Test 5.2/5.3 are the next concrete browser steps
(loyalty dashboard + history with the 168-pt customer).
- Step 19 (alerting infrastructure) — **email delivery now works**,
remove the previously-flagged "silently broken in production" item.
- Step 6 implicit blockers — all cleared: prod is serving today's
i18n + redirect + flicker fixes, alertmanager email flows, no
outstanding deploy blockers.
### Carry over for next session
1. **Test 5.2** → login as `samir.boulahtit+17mayf@gmail.com`, visit
`/account/loyalty`, confirm 168 pts balance + cross-store rewards
render correctly.
2. **Test 5.3** → `/account/loyalty/history`, confirm the 5
transactions (50+143+43+32 earned at FASHIONHUB, 100 redeemed at
FASHIONOUTLET).
3. Standing backlog as before — DE/LB email template quality sweep,
transaction categories permissions audit, routing pass, B1-F unit
tests, `prospecting/tasks/__init__.py`, other-module email audit.
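
A quick consistency check between the expected history in Test 5.3 and the balance in Test 5.2, assuming the balance is simply points earned minus points redeemed (no expirations or manual adjustments):

```python
earned = [50, 143, 43, 32]  # points earned at FASHIONHUB
redeemed = [100]            # points redeemed at FASHIONOUTLET

balance = sum(earned) - sum(redeemed)
transaction_count = len(earned) + len(redeemed)
```

Under that assumption the five transactions net out to the 168 pts the dashboard should show.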
## Status board
| # | Pre-launch step | State | Notes |