Commit Graph

3 Commits

e61e02fb39 fix(redis): configure maxmemory and eviction policy to prevent OOM
Some checks failed
CI / ruff (push) Successful in 11s
CI / pytest (push) Failing after 47m48s
CI / validate (push) Successful in 24s
CI / dependency-scanning (push) Successful in 31s
CI / docs (push) Has been skipped
CI / deploy (push) Has been skipped
Redis had no maxmemory set and reports the limit as 0 in that case, so the
Prometheus alert expression (used/max) divided by zero and evaluated to
+Inf. Set maxmemory to 100mb with the allkeys-lru eviction policy, and
guard the alert expression against division by zero.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 17:57:38 +01:00
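
The division-by-zero guard described in the commit above can be illustrated with a minimal Python sketch. This is not the actual Prometheus rule from the repository. The function names and the 0.9 threshold are assumptions made for illustration; only the used/max ratio and the "no limit is reported as 0" behaviour come from the commit message.

```python
import math


def memory_ratio(used_bytes: float, max_bytes: float) -> float:
    """Unguarded used/max ratio, mimicking float division in an alert expression.

    With no maxmemory configured the limit is reported as 0, so any
    positive usage yields +Inf (and 0/0 yields NaN).
    """
    if max_bytes == 0:
        return math.inf if used_bytes > 0 else math.nan
    return used_bytes / max_bytes


def high_memory_alert(used_bytes: float, max_bytes: float,
                      threshold: float = 0.9) -> bool:
    """Guarded check: only evaluate the ratio when a limit is configured.

    The threshold of 0.9 is a hypothetical value, not taken from the commit.
    """
    if max_bytes <= 0:
        # No limit configured: the ratio is meaningless, so do not fire.
        return False
    return used_bytes / max_bytes > threshold


if __name__ == "__main__":
    limit = 100 * 1024 * 1024  # 100mb, the maxmemory value set in the commit
    print(memory_ratio(50 * 1024 * 1024, 0))            # unguarded, no limit
    print(high_memory_alert(50 * 1024 * 1024, 0))       # guarded, no limit
    print(high_memory_alert(95 * 1024 * 1024, limit))   # above threshold
```

The guard treats a missing limit as "cannot evaluate" rather than "over threshold", which is the behaviour the commit message describes; an alert rule could equally choose to fire a separate "no limit configured" warning instead.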
35d1559162 feat(monitoring): add Redis exporter + Sentry docs to deployment guide
Some checks failed
CI / ruff (push) Successful in 10s
CI / pytest (push) Failing after 47m30s
CI / validate (push) Successful in 24s
CI / dependency-scanning (push) Successful in 29s
CI / docs (push) Has been skipped
CI / deploy (push) Has been skipped
- Add redis-exporter container to docker-compose (oliver006/redis_exporter, 32MB)
- Add Redis scrape target to Prometheus config
- Add 4 Redis alert rules: RedisDown, HighMemory, HighConnections, RejectedConnections
- Document Step 19b (Sentry Error Tracking) in Hetzner deployment guide
- Document Step 19c (Redis Monitoring) in Hetzner deployment guide
- Update resource budget and port reference tables

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 23:30:18 +01:00
4bce16fb73 feat(infra): add alerting, network segmentation, and ops docs (Steps 19-24)
All checks were successful
CI / ruff (push) Successful in 11s
CI / pytest (push) Successful in 36m6s
CI / validate (push) Successful in 22s
CI / dependency-scanning (push) Successful in 28s
CI / docs (push) Successful in 37s
CI / deploy (push) Successful in 47s
- Prometheus alert rules (host, container, API, Celery, target-down)
- Alertmanager with email routing (repeat interval: critical 1h, warning 4h)
- Docker network segmentation (frontend/backend/monitoring)
- Incident response runbook with 8 copy-paste procedures
- Environment variables reference (55+ vars documented)
- Hetzner setup docs updated with Steps 19-24
- Launch readiness updated with Feb 2026 infrastructure status

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 22:06:54 +01:00