orion

Author	SHA1	Message	Date
Samir Boulahtit	e44f5c0458	chore(alertmanager): untrack alertmanager.yml + ship .example template (post-SMTP migration) Yesterday's deploy debug surfaced a SendGrid API key pasted into the tracked monitoring/alertmanager/alertmanager.yml on prod, with the in-repo file literally captioning the field "TODO: Paste your SG.xxx API key here", actively encouraging the anti-pattern. Forensic follow-up (bash history lines 290-357) confirmed it was a user-driven nano edit that was never committed, just left as a long-running local mod. Three problems collapsed into this finding: 1. Real SMTP credential lived in a tracked git file on prod. 2. The SendGrid → mail1.myservices.hosting SMTP migration never touched alertmanager; it still pointed at smtp.sendgrid.net. 3. The alertmanager container has been Up 13 days with the pre-paste empty smtp_auth_password loaded from disk, so prod's email alerting has been silently failing. Resolution shipped here: - `git rm --cached monitoring/alertmanager/alertmanager.yml` so the prod-edited file on each host stops being a tracked file and the credential can't accidentally reach git again. - Add `monitoring/alertmanager/alertmanager.yml` to .gitignore. - Ship `monitoring/alertmanager/alertmanager.yml.example` as the template, pre-filled with the post-migration non-secret routing (`mail1.myservices.hosting:587`, `support@wizard.lu` auth, `alerts@wizard.lu` From for inbox clarity), with only `smtp_auth_password` left as `CHANGEME`. Includes inline guidance for the From-vs-auth rule that some SMTP relays enforce.
Per-host steps (Hetzner): back up the prod-edited file → revert local change → pull → copy the template over the old file → fill in the password → SIGHUP alertmanager. Doc reference will follow in the next commit (Hetzner deploy doc still needs an "alertmanager.yml lives outside git" footnote). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-30 11:32:10 +02:00
Samir Boulahtit	e61e02fb39	fix(redis): configure maxmemory and eviction policy to prevent OOM Redis had no maxmemory set, causing the Prometheus alert expression (used/max) to evaluate to +Inf. Set maxmemory to 100mb with allkeys-lru eviction policy, and guard the alert expression against division by zero. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 17:57:38 +01:00
Samir Boulahtit	35d1559162	feat(monitoring): add Redis exporter + Sentry docs to deployment guide - Add redis-exporter container to docker-compose (oliver006/redis_exporter, 32MB) - Add Redis scrape target to Prometheus config - Add 4 Redis alert rules: RedisDown, HighMemory, HighConnections, RejectedConnections - Document Step 19b (Sentry Error Tracking) in Hetzner deployment guide - Document Step 19c (Redis Monitoring) in Hetzner deployment guide - Update resource budget and port reference tables Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-27 23:30:18 +01:00
Samir Boulahtit	f67510b706	docs: switch email provider recommendation from Mailgun to SendGrid SendGrid handles both transactional emails and marketing campaigns under one account. Updated alertmanager SMTP placeholders, hetzner setup guide (Step 19.5), and environment reference to recommend SendGrid as the primary email provider. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-15 22:21:49 +01:00
Samir Boulahtit	4bce16fb73	feat(infra): add alerting, network segmentation, and ops docs (Steps 19-24) - Prometheus alert rules (host, container, API, Celery, target-down) - Alertmanager with email routing (critical 1h, warning 4h repeat) - Docker network segmentation (frontend/backend/monitoring) - Incident response runbook with 8 copy-paste runbooks - Environment variables reference (55+ vars documented) - Hetzner setup docs updated with Steps 19-24 - Launch readiness updated with Feb 2026 infrastructure status Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-15 22:06:54 +01:00
Samir Boulahtit	ef7187b508	feat: add automated backups and Prometheus/Grafana monitoring stack (Steps 17-18) Backups: pg_dump scripts with daily/weekly rotation and Cloudflare R2 offsite sync. Monitoring: Prometheus, Grafana, node-exporter, cAdvisor in docker-compose; /metrics endpoint activated via prometheus_client. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-14 22:40:08 +01:00
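
The "copy the template, fill in the password" step described in commit e44f5c0458 could be sketched roughly as follows. This is only an illustration under stated assumptions: the actual contents of `alertmanager.yml.example` are not shown in this log, so `SAMPLE_TEMPLATE` is a guessed layout built from the values the commit message names, and `render_config` is a hypothetical helper, not something the repository is known to contain.

```python
# Hypothetical sketch: fill the CHANGEME placeholder in an Alertmanager template.
# SAMPLE_TEMPLATE is an assumed layout; the real .example file is not shown here.

PLACEHOLDER = "CHANGEME"  # placeholder value named in the commit message

SAMPLE_TEMPLATE = """\
global:
  smtp_smarthost: 'mail1.myservices.hosting:587'
  smtp_from: 'alerts@wizard.lu'
  smtp_auth_username: 'support@wizard.lu'
  smtp_auth_password: 'CHANGEME'
"""


def render_config(template_text: str, password: str) -> str:
    """Return the template text with the smtp_auth_password placeholder replaced."""
    if not password or password == PLACEHOLDER:
        raise ValueError("a real password is required")

    rendered = []
    replaced = False
    for line in template_text.splitlines():
        # Only touch the smtp_auth_password line, and only if it still holds the placeholder.
        if line.strip().startswith("smtp_auth_password:") and PLACEHOLDER in line:
            line = line.replace(PLACEHOLDER, password)
            replaced = True
        rendered.append(line)

    if not replaced:
        raise ValueError("template has no smtp_auth_password placeholder to fill")
    return "\n".join(rendered) + "\n"
```

The helper refuses to produce output when the placeholder would survive, which is the failure mode the commit is trying to rule out. It does no YAML escaping, so a password containing quote characters would need extra handling.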

6 Commits
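
The division-by-zero problem fixed in commit e61e02fb39 can be modelled in a few lines. The sketch below is not the Prometheus rule from the repository (which is not shown in this log); it only mirrors the arithmetic, assuming Redis reports a maximum of 0 when no `maxmemory` is configured. The function names and the 0.8 threshold are hypothetical.

```python
# Hypothetical model of the used/max alert expression and its guard.
import math
from typing import Optional


def memory_ratio_unguarded(used_bytes: float, max_bytes: float) -> float:
    """Mimic floating-point division as Prometheus does it: positive / 0 is +Inf."""
    if max_bytes == 0:
        if used_bytes == 0:
            return math.nan
        return math.inf if used_bytes > 0 else -math.inf
    return used_bytes / max_bytes


def memory_ratio_guarded(used_bytes: float, max_bytes: float) -> Optional[float]:
    """Return used/max, or None when no memory limit is configured."""
    if max_bytes <= 0:
        return None  # no limit set, so the ratio is undefined
    return used_bytes / max_bytes


def should_alert(used_bytes: float, max_bytes: float, threshold: float = 0.8) -> bool:
    """Fire only when a limit exists and usage exceeds the (assumed) threshold."""
    ratio = memory_ratio_guarded(used_bytes, max_bytes)
    return ratio is not None and ratio > threshold
```

With no limit configured, the unguarded ratio is infinite and would exceed any threshold, whereas the guarded version produces no alert at all. Once a limit such as the 100mb from the commit is set, the ratio becomes an ordinary fraction.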