docs(loyalty): record 2026-05-30 afternoon — prod-readiness 1-3 done + alerting back online
All checks were successful
Picked up the morning's carry-over and ran the full prod-readiness chain end-to-end. Resolution: SG credential out of git permanently via untrack + .example template (e44f5c04); per-host migration on prod (alertmanager.yml gitignored, real file lives outside git); deploy-api-only.sh succeeded for the first time; today's 9 queued loyalty commits live on prod with ?v=e44f5c04 (verified by re-running the loyalty redirect flicker repro: clean). Multi-hour rabbit hole on actual email delivery: provider's port 587 PLAIN backend is OAuth-wired (returns RFC 6749 invalid_grant text for password auth); switched to provider's documented port 465 SSL/TLS endpoint. Discovered Hetzner Cloud blocks outbound 25 and 465 by default as anti-spam policy. Auto-approved unblock ticket landed in minutes; one-line smarthost change to :465 reactivated email alerting after 13+ days down. alertmanager handles implicit TLS on 465 natively, no stunnel/relay needed. Hetzner doc updated with the egress-block warning + mail1 SMTP callout in 1227567d as 5h-debug payback. Next session resumes at Test 5.2 (/account/loyalty with 168 pts customer) → 5.3 history.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -633,6 +633,132 @@ continuing Test 5.
Hetzner doc check, B1-F unit tests, `prospecting/tasks/__init__.py`,
other-module email audit.

## 2026-05-30 update (afternoon) — production-readiness items 1-3 resolved + alerting back online after 13+ days

Picked up the carry-over list from this morning's wrap and ran it
end-to-end. All four blockers are now closed and the loyalty queue
landed on prod for the first time today.

### SG-credential forensics + alertmanager.yml untracked

Forensics result: the changes came from user-driven nano edits in past
sessions (bash history lines 290-357 confirmed it). No rogue actor, just
forgotten exploratory work that was never committed. Resolution shipped
as `e44f5c04`: `git rm --cached monitoring/alertmanager/alertmanager.yml`,
a `.gitignore` entry, and `monitoring/alertmanager/alertmanager.yml.example`
shipped in the repo with the post-migration routing pre-filled
(`mail1.myservices.hosting:465`, `support@wizard.lu` auth,
`alerts@wizard.lu` From, only `smtp_auth_password: 'CHANGEME'` left
for prod-side fill-in).

### Per-host migration on prod + loyalty queue deployed

On the Hetzner box: backup → `git checkout` → pull `e44f5c04` →
copy template over the old file → fill in the SMTP password →
SIGHUP alertmanager. Then `bash scripts/deploy-api-only.sh` ran cleanly
for the first time (working tree no longer dirty), pulling the 9
queued loyalty commits from this morning. Verified `?v=e44f5c04` on
all rendered assets, then ran the loyalty redirect repro (`localStorage`
delete + F5 on `/account/loyalty`): spinner straight through to login
redirect, no "Rejoignez..." CTA flash. Stage 1 + Stage 2 of the
flicker fix work as designed once the browser actually sees the new
JS.

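
The "copy template over the old file, fill in the SMTP password" step could be scripted roughly as below. This is only a sketch: the `EXAMPLE_TEMPLATE` text is an assumed shape for `alertmanager.yml.example` (the real file is not reproduced in this log), and `render_config` is a hypothetical helper, not something that exists in the repo.

```python
# Assumed shape of the shipped template; only the values named in this
# log (host, port, addresses, CHANGEME placeholder) come from the source.
EXAMPLE_TEMPLATE = """\
global:
  smtp_smarthost: 'mail1.myservices.hosting:465'
  smtp_from: 'alerts@wizard.lu'
  smtp_auth_username: 'support@wizard.lu'
  smtp_auth_password: 'CHANGEME'
"""

PLACEHOLDER = "'CHANGEME'"


def render_config(template_text: str, password: str) -> str:
    """Return the template with the single password placeholder filled in."""
    if template_text.count(PLACEHOLDER) != 1:
        # Refuse to guess if the template was already filled in or edited.
        raise ValueError("expected exactly one 'CHANGEME' placeholder")
    # A YAML single-quoted scalar escapes a quote by doubling it.
    quoted = "'" + password.replace("'", "''") + "'"
    return template_text.replace(PLACEHOLDER, quoted)
```

The password would come from somewhere outside git (an environment variable or a prompt), so the rendered file never needs to be committed.
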
### The alertmanager email delivery rabbit hole — and the answer

After the SMTP migration, alertmanager still couldn't send. The error
log was identical to before the migration: `*smtp.plainAuth auth: 535
Authentication failed: The provided authorization grant is invalid,
expired, or revoked`, which closely mirrors the OAuth 2.0 RFC 6749 §5.2
`invalid_grant` description. The multi-hour diagnosis chased through
every plausible layer:

1. Provider's port 587 `AUTH PLAIN` backend is OAuth-wired (returns
   the OAuth-flavored 535 with a regular password). `AUTH LOGIN` on
   the same port accepts the same credential cleanly. swaks proved
   this.
2. alertmanager uses Go stdlib `smtp.PlainAuth`, which prefers PLAIN
   whenever the server advertises it. No config knob to force LOGIN.
   `smtp_auth_identity` tweak had no effect.
3. Provider's docs name port **465 SSL/TLS** as the official
   submission endpoint, not 587. Switched to 465 → connection timed
   out from prod AND from user's home laptop.
4. VM-side sweep (UFW outbound = allow, iptables OUTPUT = ACCEPT,
   nftables empty, DOCKER-USER empty) cleared the local firewall as a
   cause. Block had to be upstream.
5. Found Hetzner's documented anti-spam policy: *"Outgoing traffic to
   ports 25 and 465 are blocked by default on all Cloud Servers. Send
   us a request to unblock these ports."* The block is at the cloud
   network layer, completely invisible from the VM.
6. Filed Hetzner unblock ticket via Cloud Console. **Auto-approved
   within minutes**: Hetzner has tooling for legitimate SMTP unblock
   requests.
7. Post-unblock: `nc -4 -zv mail1.myservices.hosting 465` succeeds.
   swaks AUTH PLAIN on 465 succeeds (`235 Authentication
   successful` + `250 2.0.0 Ok: queued`). One-line alertmanager change
   from `:587` to `:465`, SIGHUP, watched tcpdump confirm implicit-TLS
   handshake on port 465. Three pending alerts (TargetDown,
   HostHighCpuUsage, HostHighDiskUsage) delivered to inbox within
   minutes. **Alerting back online for the first time in 13+ days.**

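
For reference, the difference between the two mechanisms in item 1 is only in how the same credential is framed on the wire. The sketch below builds both payloads as defined by SASL PLAIN (RFC 4616) and the de facto LOGIN exchange; the function names are illustrative, and the optional `authzid` argument is assumed to be the field that `smtp_auth_identity` feeds.

```python
import base64
from typing import Tuple


def _b64(text: str) -> str:
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def auth_plain_payload(username: str, password: str, authzid: str = "") -> str:
    """One base64 blob for AUTH PLAIN: authzid NUL username NUL password."""
    return _b64(f"{authzid}\0{username}\0{password}")


def auth_login_payloads(username: str, password: str) -> Tuple[str, str]:
    """Two base64 strings, sent one per server prompt, for AUTH LOGIN."""
    return _b64(username), _b64(password)


def decode_plain(payload: str) -> Tuple[str, str, str]:
    """Reverse auth_plain_payload; base64 is an encoding, not encryption."""
    authzid, username, password = base64.b64decode(payload).decode("utf-8").split("\0")
    return authzid, username, password
```

Because both forms are trivially reversible, any pasted swaks transcript that includes the base64 line discloses the password in full.
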

Key finding worth documenting: alertmanager's email integration (built
on Go's stdlib `net/smtp`) DOES handle implicit TLS on port 465 natively.
No `smtp_tls_config` block needed, no stunnel sidecar. Just set the
smarthost port to `465` + `smtp_require_tls: true` and reload. tcpdump
confirms the TLS-on-connect handshake completes correctly.

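
To make the 465 versus 587 distinction concrete, here is a minimal Python sketch of the two connection styles using the standard `smtplib` module. It illustrates the protocol difference only and is not how alertmanager is implemented; `tls_mode` and `open_smtp` are hypothetical helper names.

```python
import smtplib
import ssl


def tls_mode(port: int) -> str:
    """Port 465 conventionally means TLS-on-connect; other ports use STARTTLS."""
    return "implicit" if port == 465 else "starttls"


def open_smtp(host: str, port: int, timeout: float = 10.0) -> smtplib.SMTP:
    """Open an SMTP session with the TLS style that matches the port."""
    ctx = ssl.create_default_context()
    if tls_mode(port) == "implicit":
        # The TLS handshake happens immediately, before any SMTP greeting.
        return smtplib.SMTP_SSL(host, port, timeout=timeout, context=ctx)
    # Plaintext greeting first, then an in-band upgrade to TLS.
    client = smtplib.SMTP(host, port, timeout=timeout)
    client.starttls(context=ctx)
    return client
```

A client that speaks the wrong style for the port typically hangs or fails during the greeting, which can look similar to a network-level block.
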
### SMTP password rotation (mid-flow)

User rotated the SMTP password mid-debugging because it leaked into
chat (swaks base64 line was redactable but not redacted on the
initial paste). New value propagated to `/admin/settings` SMTP block
AND `alertmanager.yml`. swaks verified with `--auth LOGIN` on 587
(the path the app uses): `235 Authentication successful` followed by
`250 Ok: queued`. Test email landed.

### Hetzner doc 5h debug payback (`1227567d`)

Updated `docs/deployment/hetzner-server-setup.md`:

- **Step 4 (Firewall Configuration)** gets a warning admonition right
  after the UFW status check, explaining that Hetzner Cloud blocks
  outbound 25 and 465 at the network layer (invisible from the VM),
  with the symptom signature and the auto-approved unblock ticket
  template ready to paste.
- **Step 19.5 (Alertmanager SMTP Setup)** gets a "live prod uses
  mail1.myservices.hosting:465, not SendGrid" callout reflecting the
  reality that the SendGrid configuration documented in §19.5 is no
  longer how this prod env is wired. Includes the live alertmanager
  SMTP block (with `smtp_auth_password` kept gitignored, only
  `.example` ships in repo), the two prerequisites (Hetzner 465
  unblock + implicit-TLS-aware smarthost port), and the redacted
  swaks verification command.

Saves the next person from repeating the same 5-hour detour.

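
The symptom signature from the Step 4 warning (a connect that times out rather than being refused) can be checked with a small probe along these lines. It is a rough Python counterpart to the `nc -zv` check, offered as a sketch; `probe` is a hypothetical helper, and unlike `nc -4` it does not force IPv4.

```python
import socket


def probe(host: str, port: int, timeout: float = 5.0) -> str:
    """Classify an outbound TCP connect as 'open', 'refused', 'timeout', or 'error'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all: packets are being dropped somewhere on the path,
        # which is what an upstream network-layer block looks like.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered with a reset: reachable, but nothing is listening.
        return "refused"
    except OSError:
        # DNS failure, unreachable network, and similar.
        return "error"
```

A `timeout` on 25 or 465 combined with `open` on 587 from the same VM, with the local firewall already ruled out, points at the provider-side egress block.
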
### Status board delta

- Step 6 (web user-journey E2E tests): Tests 1 ✅, 2 ✅, 3 ✅, 4 ✅,
  5.0 ✅, 5.1 ✅. Test 5.2/5.3 are the next concrete browser steps
  (loyalty dashboard + history with the 168-pt customer).
- Step 19 (alerting infrastructure): **email delivery now works**,
  remove the previously-flagged "silently broken in production" item.
- Step 6 implicit blockers: all cleared. Prod is serving today's
  i18n + redirect + flicker fixes, alertmanager email flows, no
  outstanding deploy blockers.

### Carry over for next session

1. **Test 5.2** → login as `samir.boulahtit+17mayf@gmail.com`, visit
   `/account/loyalty`, confirm 168 pts balance + cross-store rewards
   render correctly.
2. **Test 5.3** → `/account/loyalty/history`, confirm the 5
   transactions (50+143+43+32 earned at FASHIONHUB, −100 redeemed at
   FASHIONOUTLET).
3. Standing backlog as before: DE/LB email template quality sweep,
   transaction categories permissions audit, routing pass, B1-F unit
   tests, `prospecting/tasks/__init__.py`, other-module email audit.

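
As a quick sanity check before running Tests 5.2 and 5.3, the history figures in item 2 do add up to the balance in item 1. The variable names below are illustrative only.

```python
# Points per transaction as listed for Test 5.3.
earned_at_fashionhub = [50, 143, 43, 32]
redeemed_at_fashionoutlet = [100]

transaction_count = len(earned_at_fashionhub) + len(redeemed_at_fashionoutlet)
balance = sum(earned_at_fashionhub) - sum(redeemed_at_fashionoutlet)

print(transaction_count, balance)  # prints "5 168"
```
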
## Status board

| # | Pre-launch step | State | Notes |