6 Commits

Author SHA1 Message Date
e44f5c0458 chore(alertmanager): untrack alertmanager.yml + ship .example template (post-SMTP migration)
All checks were successful
CI / ruff (push) Successful in 17s
CI / pytest (push) Successful in 2h48m4s
CI / validate (push) Successful in 36s
CI / docs (push) Successful in 56s
CI / deploy (push) Successful in 1m12s
CI / dependency-scanning (push) Successful in 37s
Yesterday's deploy debug surfaced a SendGrid API key pasted into the
tracked monitoring/alertmanager/alertmanager.yml on prod, with the
in-repo file literally captioning the field "TODO: Paste your SG.xxx
API key here" — actively encouraging the anti-pattern. Forensic
follow-up (bash history lines 290-357) confirmed it was a user-driven
nano edit that was never committed, just left as a long-running local
mod. Three problems collapsed into this finding:

  1. Real SMTP credential lived in a tracked git file on prod.
  2. The SendGrid → mail1.myservices.hosting SMTP migration never
     touched alertmanager — it still pointed at smtp.sendgrid.net.
  3. The alertmanager container has been Up 13 days with the
     pre-paste empty smtp_auth_password loaded from disk, so prod's
     email alerting has been silently failing.

Resolution shipped here:

- `git rm --cached monitoring/alertmanager/alertmanager.yml` so the
  prod-edited file on each host stops being a tracked file and the
  credential can't accidentally reach git again.
- Add `monitoring/alertmanager/alertmanager.yml` to .gitignore.
- Ship `monitoring/alertmanager/alertmanager.yml.example` as the
  template — pre-filled with the post-migration non-secret routing
  (`mail1.myservices.hosting:587`, `support@wizard.lu` auth,
  `alerts@wizard.lu` From for inbox clarity), only `smtp_auth_password`
  left as `CHANGEME`. Includes inline guidance for the From-vs-auth
  rule that some SMTP relays enforce.

Per-host steps (Hetzner): back up the prod-edited file → revert local
change → pull → copy the template over the old file → fill in the
password → SIGHUP alertmanager. Doc reference will follow in the next
commit (Hetzner deploy doc still needs an "alertmanager.yml lives
outside git" footnote).
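
A pre-flight check before the SIGHUP step could look like the sketch
below. It is not part of this commit: the function name and messages
are hypothetical, and it assumes `smtp_auth_password` sits on a single
line of the file, as in a typical Alertmanager `global:` block.

```python
import re
from typing import Optional

# Hypothetical helper, not shipped in the repository.
PLACEHOLDER = "CHANGEME"


def smtp_password_problem(config_text: str) -> Optional[str]:
    """Return a description of the problem, or None if the password looks filled in."""
    # [ \t]* instead of \s* so an empty value never swallows the next line.
    match = re.search(
        r"^[ \t]*smtp_auth_password:[ \t]*(.*)$", config_text, flags=re.MULTILINE
    )
    if match is None:
        return "smtp_auth_password is missing"
    # Drop a trailing inline comment, then surrounding whitespace and quotes.
    value = match.group(1).split(" #", 1)[0].strip().strip("'\"")
    if value == "":
        return "smtp_auth_password is empty"
    if value == PLACEHOLDER:
        return "smtp_auth_password is still the CHANGEME placeholder"
    return None
```

Run against the text of the untracked
`monitoring/alertmanager/alertmanager.yml` on a host, a non-None result
would mean the reload should wait until the password is filled in.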

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-30 11:32:10 +02:00
e61e02fb39 fix(redis): configure maxmemory and eviction policy to prevent OOM
Some checks failed
CI / ruff (push) Successful in 11s
CI / pytest (push) Failing after 47m48s
CI / validate (push) Successful in 24s
CI / dependency-scanning (push) Successful in 31s
CI / docs (push) Has been skipped
CI / deploy (push) Has been skipped
Redis had no maxmemory set, so it reported a limit of 0 and the
Prometheus alert expression (used/max) evaluated to +Inf. Set maxmemory
to 100mb with allkeys-lru eviction policy, and guard the alert
expression against division by zero.
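
The arithmetic behind the alert fix can be sketched in Python. This is
an illustration only: the real rule is a Prometheus expression that the
commit message does not quote, and both function names below are
hypothetical. PromQL follows IEEE 754, so a positive value divided by
zero is +Inf; plain Python would raise instead, hence the explicit
branch.

```python
import math
from typing import Optional


def unguarded_ratio(used_bytes: float, max_bytes: float) -> float:
    """Mimic PromQL float division: positive / 0 is +Inf, 0 / 0 is NaN."""
    if max_bytes == 0:
        return math.inf if used_bytes > 0 else math.nan
    return used_bytes / max_bytes


def guarded_ratio(used_bytes: float, max_bytes: float) -> Optional[float]:
    """Return None when no limit is set, so there is nothing to compare to a threshold."""
    if max_bytes <= 0:
        return None
    return used_bytes / max_bytes


# Redis sizes use binary units: 100mb is 100 * 1024 * 1024 bytes.
LIMIT_BYTES = 100 * 1024 * 1024
```

With no limit, `unguarded_ratio(5_000_000, 0)` is +Inf and exceeds any
threshold, while `guarded_ratio(5_000_000, 0)` yields no value. Once the
limit is set, `guarded_ratio(52_428_800, LIMIT_BYTES)` is 0.5.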

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 17:57:38 +01:00
35d1559162 feat(monitoring): add Redis exporter + Sentry docs to deployment guide
Some checks failed
CI / ruff (push) Successful in 10s
CI / pytest (push) Failing after 47m30s
CI / validate (push) Successful in 24s
CI / dependency-scanning (push) Successful in 29s
CI / docs (push) Has been skipped
CI / deploy (push) Has been skipped
- Add redis-exporter container to docker-compose (oliver006/redis_exporter, 32MB)
- Add Redis scrape target to Prometheus config
- Add 4 Redis alert rules: RedisDown, HighMemory, HighConnections, RejectedConnections
- Document Step 19b (Sentry Error Tracking) in Hetzner deployment guide
- Document Step 19c (Redis Monitoring) in Hetzner deployment guide
- Update resource budget and port reference tables

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 23:30:18 +01:00
f67510b706 docs: switch email provider recommendation from Mailgun to SendGrid
SendGrid handles both transactional emails and marketing campaigns
under one account. Updated alertmanager SMTP placeholders, hetzner
setup guide (Step 19.5), and environment reference to recommend
SendGrid as the primary email provider.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 22:21:49 +01:00
4bce16fb73 feat(infra): add alerting, network segmentation, and ops docs (Steps 19-24)
All checks were successful
CI / ruff (push) Successful in 11s
CI / pytest (push) Successful in 36m6s
CI / validate (push) Successful in 22s
CI / dependency-scanning (push) Successful in 28s
CI / docs (push) Successful in 37s
CI / deploy (push) Successful in 47s
- Prometheus alert rules (host, container, API, Celery, target-down)
- Alertmanager with email routing (critical 1h, warning 4h repeat)
- Docker network segmentation (frontend/backend/monitoring)
- Incident response runbook with 8 copy-paste procedures
- Environment variables reference (55+ vars documented)
- Hetzner setup docs updated with Steps 19-24
- Launch readiness updated with Feb 2026 infrastructure status

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 22:06:54 +01:00
ef7187b508 feat: add automated backups and Prometheus/Grafana monitoring stack (Steps 17-18)
Some checks failed
CI / dependency-scanning (push) Has been cancelled
CI / docs (push) Has been cancelled
CI / ruff (push) Successful in 7s
CI / validate (push) Has been cancelled
CI / deploy (push) Has been cancelled
CI / pytest (push) Has started running
Backups: pg_dump scripts with daily/weekly rotation and Cloudflare R2 offsite sync.
Monitoring: Prometheus, Grafana, node-exporter, cAdvisor in docker-compose; /metrics
endpoint activated via prometheus_client.
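
The daily/weekly rotation mentioned above can be illustrated with a
small retention function. The actual backup scripts are not shown in
this log, so everything here is an assumption: the function name, the
retention counts (7 daily, 4 weekly) and the choice of Sunday for the
weekly copy are placeholders rather than the project's settings.

```python
from datetime import date, timedelta
from typing import Iterable, Set


def backups_to_keep(
    dates: Iterable[date],
    today: date,
    daily_keep: int = 7,      # assumed: keep the last 7 days
    weekly_keep: int = 4,     # assumed: keep weekly copies for 4 weeks
    weekly_weekday: int = 6,  # assumed: Sunday (date.weekday() == 6)
) -> Set[date]:
    """Return the backup dates to retain; everything else may be pruned."""
    daily_cutoff = today - timedelta(days=daily_keep - 1)
    weekly_cutoff = today - timedelta(weeks=weekly_keep)
    keep = set()
    for d in dates:
        if daily_cutoff <= d <= today:
            # Inside the daily window: always kept.
            keep.add(d)
        elif d.weekday() == weekly_weekday and weekly_cutoff < d <= today:
            # Older than the daily window: kept only if it is a weekly copy.
            keep.add(d)
    return keep
```

A rotation script would delete any dump whose date is not in the
returned set and sync the rest offsite.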

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 22:40:08 +01:00