feat(infra): add alerting, network segmentation, and ops docs (Steps 19-24)

- Prometheus alert rules (host, container, API, Celery, target-down)
- Alertmanager with email routing (repeat interval: critical 1h, warning 4h)
- Docker network segmentation (frontend/backend/monitoring)
- Incident response guide with 8 copy-paste runbooks
- Environment variables reference (55+ vars documented)
- Hetzner setup docs updated with Steps 19-24
- Launch readiness updated with Feb 2026 infrastructure status

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 22:06:54 +01:00
parent 1cb659e3a5
commit 4bce16fb73
9 changed files with 1845 additions and 5 deletions

@@ -2,7 +2,7 @@
This document tracks the launch readiness status of the complete platform including Store Dashboard, Shop/Storefront, and Admin features.
-**Last Updated:** 2026-01-08
+**Last Updated:** 2026-02-15
**Overall Status:** 95% Feature Complete - LAUNCH READY
---
@@ -104,7 +104,7 @@ Previous blockers (password reset, search, order emails) have been resolved. Onl
|-----------|--------|-----|
| Email System | 20% | Password reset, tier change notifications |
| Payment Verification | Missing | Stripe payment intent verification |
-| Monitoring | 50% | Framework ready, alerting TODO |
+| Monitoring | Ready | Prometheus + Grafana + Alertmanager with 12 alert rules |
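
As an illustration of what one of the 12 alert rules could look like, here is a minimal sketch of a target-down rule. Only the `up == 0` expression is standard Prometheus; the group name, alert name, `for:` duration, and `severity` label are assumptions and are not taken from this repository.

```yaml
# Hypothetical rule file, e.g. loaded via `rule_files:` in prometheus.yml.
groups:
  - name: target-health            # assumed group name
    rules:
      - alert: TargetDown          # assumed alert name
        # `up` is 0 when Prometheus fails to scrape a target.
        expr: up == 0
        # Assumed grace period so a single missed scrape does not page anyone.
        for: 5m
        labels:
          severity: critical       # assumed label, used for Alertmanager routing
        annotations:
          summary: "Target {{ $labels.job }} ({{ $labels.instance }}) is down"
```

A rule file like this can be checked with `promtool check rules <file>` before reloading Prometheus.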
---
@@ -192,6 +192,24 @@ Previous blockers (password reset, search, order emails) have been resolved. Onl
---
## February 2026 Infrastructure Hardening
| Component | Status | Details |
|-----------|--------|---------|
| Hetzner VPS | Running | CAX11 (4 GB RAM, ARM64), Ubuntu 24.04 |
| Docker stack | 11 containers | API, DB, Redis, Celery x2, Flower, Prometheus, Grafana, node-exporter, cAdvisor, Alertmanager |
| Monitoring | Complete | Prometheus (5 targets), Grafana dashboards, 12 alert rules |
| Alerting | Complete | Alertmanager with email routing (critical 1h, warning 4h) |
| Backups | Complete | Daily pg_dump, R2 offsite, Hetzner snapshots |
| Network security | Complete | 3 Docker networks (frontend/backend/monitoring), fail2ban, unattended-upgrades |
| Reverse proxy | Complete | Caddy with auto-SSL for all domains |
| CI/CD | Complete | Gitea Actions, auto-deploy on push to master |
| Cloudflare proxy | Documented | Origin certs + WAF ready, deploy when needed |
| Incident response | Complete | 8 runbooks, severity levels, decision tree |
| Environment docs | Complete | 55+ env vars documented with defaults |
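
The alerting row above (email routing, critical repeating every 1h and warning every 4h) could be expressed in Alertmanager roughly as follows. This is a sketch under assumptions: the receiver names, SMTP host, and addresses are placeholders, and the real `alertmanager.yml` in this commit may be structured differently.

```yaml
global:
  smtp_smarthost: 'smtp.example.com:587'   # placeholder SMTP relay
  smtp_from: 'alertmanager@example.com'    # placeholder sender

route:
  receiver: email-warning                  # assumed default receiver
  group_by: ['alertname', 'severity']
  routes:
    # Critical alerts are re-sent every hour while they keep firing.
    - matchers:
        - 'severity="critical"'
      receiver: email-critical
      repeat_interval: 1h
    # Warnings are re-sent every four hours.
    - matchers:
        - 'severity="warning"'
      receiver: email-warning
      repeat_interval: 4h

receivers:
  - name: email-critical
    email_configs:
      - to: 'oncall@example.com'           # placeholder address
  - name: email-warning
    email_configs:
      - to: 'ops@example.com'              # placeholder address
```

The routing depends on alert rules setting a `severity` label, so that label has to be consistent between the Prometheus rule files and this configuration.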
---
## Validation Status
All code validators pass:
@@ -228,10 +246,13 @@ Performance Validator: PASSED (with skips)
### Infrastructure
- [ ] Production Stripe keys
-- [ ] SSL certificates
-- [ ] Database backups configured
-- [ ] Monitoring/alerting setup
+- [x] SSL certificates (Caddy auto-SSL via Let's Encrypt)
+- [x] Database backups configured (daily pg_dump + R2 offsite + Hetzner snapshots)
+- [x] Monitoring/alerting setup (Prometheus + Grafana + Alertmanager)
- [ ] Error tracking (Sentry)
+- [x] Docker network segmentation (frontend/backend/monitoring)
+- [x] fail2ban + unattended-upgrades
+- [ ] Cloudflare proxy (WAF, DDoS protection)
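
For readers unfamiliar with the network segmentation item, the idea can be sketched as a Docker Compose fragment like the one below. The three network names come from this document; which service joins which network is an assumption for illustration, and `image`/`build` keys are omitted, so this is a fragment to merge into an existing compose file rather than a complete one.

```yaml
services:
  api:
    # Assumed: the API is reachable from the reverse proxy, talks to the
    # database and Redis, and is scraped by Prometheus.
    networks: [frontend, backend, monitoring]
  db:
    networks: [backend]        # assumed: the database is not on frontend
  redis:
    networks: [backend]
  prometheus:
    networks: [monitoring]
  alertmanager:
    networks: [monitoring]

networks:
  frontend:
  backend:
  # Setting `internal: true` on a network would also block outbound traffic
  # for services attached only to that network; whether that suits this
  # stack is not stated here.
  monitoring:
```

With this layout, containers that share no network cannot resolve or reach each other, so a compromised monitoring container would have no direct route to the database.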
### Pre-Launch Testing
- [ ] End-to-end order flow