feat(monitoring): add Redis exporter + Sentry docs to deployment guide

- Add redis-exporter container to docker-compose (oliver006/redis_exporter, 32MB)
- Add Redis scrape target to Prometheus config
- Add 4 Redis alert rules: RedisDown, HighMemory, HighConnections, RejectedConnections
- Document Step 19b (Sentry Error Tracking) in Hetzner deployment guide
- Document Step 19c (Redis Monitoring) in Hetzner deployment guide
- Update resource budget and port reference tables

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 23:30:18 +01:00
parent ce822af883
commit 35d1559162
54 changed files with 664 additions and 343 deletions


@@ -1101,9 +1101,10 @@ Prometheus + Grafana monitoring stack with host and container metrics.
| grafana | 192 MB | Dashboards (SQLite backend) |
| node-exporter | 64 MB | Host CPU/RAM/disk metrics |
| cadvisor | 128 MB | Per-container resource metrics |
| **Total new** | **640 MB** | |
| redis-exporter | 32 MB | Redis memory, connections, command stats |
| **Total new** | **672 MB** | |
Existing stack ~1.8 GB + 640 MB new = ~2.4 GB. Leaves ~1.6 GB for OS. If too tight, live-upgrade to CAX21 (8 GB/80 GB, ~7.50 EUR/mo) via **Cloud Console > Server > Rescale** (~2 min restart).
Existing stack ~1.8 GB + 672 MB new = ~2.5 GB. Leaves ~1.5 GB for OS. If too tight, live-upgrade to CAX21 (8 GB/80 GB, ~7.50 EUR/mo) via **Cloud Console > Server > Rescale** (~2 min restart).
### 18.1 DNS Record
@@ -1372,6 +1373,264 @@ This is the professional approach — emails come from the client's domain with
---
## Step 19b: Sentry Error Tracking
Application-level error tracking with [Sentry](https://sentry.io). While Prometheus monitors infrastructure metrics (CPU, memory, HTTP error rates), Sentry captures **individual exceptions** with full stack traces, request context, and breadcrumbs — making it possible to debug production errors without SSH access.
!!! info "How Sentry fits into the monitoring stack"

    ```
    ┌──────────────────────────────────────────────────────────────┐
    │                     Observability Stack                      │
    ├──────────────────┬──────────────────┬────────────────────────┤
    │ Prometheus       │ Grafana          │ Sentry                 │
    │ Infrastructure   │ Dashboards       │ Application errors     │
    │ metrics & alerts │ & visualization  │ & performance traces   │
    ├──────────────────┴──────────────────┴────────────────────────┤
    │ Prometheus: "API 5xx rate is 3%"                             │
    │ Sentry:     "TypeError in /api/v1/orders/checkout line 42    │
    │              request_id=abc123, user_id=7, store=acme"       │
    └──────────────────────────────────────────────────────────────┘
    ```
### What's Already Wired
The codebase already initializes Sentry in two places — you just need to provide the DSN:
| Component | File | Integrations |
|---|---|---|
| FastAPI (API server) | `main.py:42-58` | `FastApiIntegration`, `SqlalchemyIntegration` |
| Celery (background workers) | `app/core/celery_config.py:31-39` | `CeleryIntegration` |
Both read from the same `SENTRY_DSN` environment variable. When unset, Sentry is silently skipped.
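The pattern both entry points follow can be sketched like this (a simplified sketch, not the literal `main.py` code; the integration lists and `send_default_pii` flag are omitted here):

```python
import os

def init_sentry() -> bool:
    """Initialize Sentry only when SENTRY_DSN is set; skip silently otherwise."""
    dsn = os.environ.get("SENTRY_DSN")
    if not dsn:
        return False  # no DSN -> Sentry stays disabled: no warning, no error
    import sentry_sdk  # imported lazily so the app also runs without the SDK
    sentry_sdk.init(
        dsn=dsn,
        environment=os.environ.get("SENTRY_ENVIRONMENT", "development"),
        traces_sample_rate=float(os.environ.get("SENTRY_TRACES_SAMPLE_RATE", "0.1")),
    )
    return True
```

This is why the only deployment step is setting the environment variable: with no `SENTRY_DSN`, the code returns before the SDK is ever touched.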
### 19b.1 Create Sentry Project
1. Sign up at [sentry.io](https://sentry.io) (free Developer plan: **5K errors/month**, 1 user)
2. Create a new project:
- **Platform**: Python → FastAPI
- **Project name**: `orion` (or `rewardflow`)
- **Team**: default
3. Copy the **DSN** from the project settings — it looks like:
```
https://abc123def456@o123456.ingest.de.sentry.io/7891011
```
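The DSN packs the public key, ingest host, and project id into one URL, so a copied value can be sanity-checked with a few lines of stdlib Python (`split_dsn` is an illustrative helper, not part of the codebase):

```python
from urllib.parse import urlparse

def split_dsn(dsn: str) -> dict:
    """Split a Sentry DSN into its public key, ingest host, and project id."""
    parts = urlparse(dsn)
    return {
        "public_key": parts.username,          # the part before the @
        "host": parts.hostname,                # e.g. o123456.ingest.de.sentry.io
        "project_id": parts.path.lstrip("/"),  # the trailing number
    }

parsed = split_dsn("https://abc123def456@o123456.ingest.de.sentry.io/7891011")
```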
!!! tip "Sentry pricing"

    | Plan | Errors/month | Cost | Notes |
    |---|---|---|---|
    | **Developer** (free) | 5,000 | $0 | 1 user, 30-day retention |
    | **Team** | 50,000 | $26/mo | Unlimited users, 90-day retention |
    | **Business** | 50,000 | $80/mo | SSO, audit logs, 90-day retention |

    The free plan is sufficient for launch. Upgrade to Team if you exceed 5K errors/month or need multiple team members.
### 19b.2 Configure Environment
Add to `~/apps/orion/.env` on the server:
```bash
# Sentry Error Tracking
SENTRY_DSN=https://your-key@o123456.ingest.de.sentry.io/your-project-id
SENTRY_ENVIRONMENT=production
SENTRY_TRACES_SAMPLE_RATE=0.1
```
| Variable | Default | Description |
|---|---|---|
| `SENTRY_DSN` | `None` (disabled) | Project DSN from Sentry dashboard |
| `SENTRY_ENVIRONMENT` | `development` | Tags errors by environment (`production`, `staging`) |
| `SENTRY_TRACES_SAMPLE_RATE` | `0.1` | Fraction of requests traced for performance (0.1 = 10%) |
!!! warning "Traces sample rate"

    `0.1` (10%) is a good starting point. At high traffic, lower to `0.01` (1%) to stay within the free plan's span limits. For initial launch with low traffic, you can temporarily set `1.0` (100%) for full visibility.
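The trade-off is easy to estimate up front. A back-of-envelope sketch (the traffic figures are made up for illustration):

```python
def traced_per_month(requests_per_day: int, sample_rate: float) -> int:
    """Rough count of performance transactions sent to Sentry per month."""
    return int(requests_per_day * sample_rate * 30)

launch = traced_per_month(1_000, 1.0)     # full visibility at low traffic
steady = traced_per_month(20_000, 0.1)    # the 10% default
busy   = traced_per_month(200_000, 0.01)  # dialed down at high traffic
```

At 20K requests/day the default 10% already sends ~60K transactions a month, so keep an eye on the quota page as traffic grows.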
### 19b.3 Deploy
Restart the API and Celery containers to pick up the new env vars:
```bash
cd ~/apps/orion
docker compose --profile full restart api celery-worker celery-beat
```
Check the API logs to confirm Sentry initialized:
```bash
docker compose --profile full logs api --tail 20 | grep -i sentry
```
You should see:
```
Sentry initialized for environment: production
```
### 19b.4 Verify
**1. Trigger a test error** by hitting the API with a request that will fail:
```bash
curl -s https://api.wizard.lu/api/v1/nonexistent-endpoint-sentry-test
```
**2. Check Sentry dashboard:**
- Go to [sentry.io](https://sentry.io) → your project → **Issues**
- By default the FastAPI integration reports unhandled exceptions and 5xx responses, so a plain 404 may not appear as an issue; if nothing shows up within a minute, trigger a request that raises a genuine server error instead
- Click into an issue to see the full stack trace, request headers, and breadcrumbs
**3. Verify Celery integration** — check that the Celery worker also reports to Sentry:
```bash
docker compose --profile full logs celery-worker --tail 10 | grep -i sentry
```
### 19b.5 Sentry Features to Configure
After verifying the basic setup, configure these in the Sentry web UI:
**Alerts (Sentry → Alerts → Create Alert):**
| Alert | Condition | Action |
|---|---|---|
| New issue spike | >10 events in 1 hour | Email notification |
| First seen error | Any new issue | Email notification |
| Unresolved high-volume | >50 events in 24h | Email notification |
**Release tracking** — Sentry automatically tags errors with the release version via `release=f"orion@{settings.version}"` in `main.py`. This lets you see which deploy introduced a bug.
**Source maps** (optional, post-launch) — if you want JS errors from the admin frontend, add the Sentry browser SDK to your base template. Not needed for launch since most errors will be server-side.
### 19b.6 What Sentry Captures
With the current integration, Sentry automatically captures:
| Data | Source | Example |
|---|---|---|
| Python exceptions | FastAPI + Celery | `TypeError`, `ValidationError`, unhandled 500s |
| Request context | `FastApiIntegration` | URL, method, headers, query params, user IP |
| DB query breadcrumbs | `SqlalchemyIntegration` | SQL queries leading up to the error |
| Celery task failures | `CeleryIntegration` | Task name, args, retry count, worker hostname |
| User info | `send_default_pii=True` | User email and IP (if authenticated) |
| Performance traces | `traces_sample_rate` | End-to-end request timing, DB query duration |
!!! note "Privacy"

    `send_default_pii=True` is set in both `main.py` and `celery_config.py`. This sends user emails and IP addresses to Sentry for debugging context. If GDPR compliance requires stricter data handling, set this to `False` and configure Sentry's [Data Scrubbing](https://docs.sentry.io/security-legal-pii/scrubbing/) rules.
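If you take the stricter route, scrubbing can also happen client-side, before anything leaves the server. A minimal `before_send` hook sketch (the field names follow Sentry's event payload; wire it up with `sentry_sdk.init(before_send=scrub_pii, ...)`):

```python
def scrub_pii(event: dict, hint: dict) -> dict:
    """before_send hook: strip user email/IP and cookies from outgoing events."""
    user = event.get("user")
    if user:
        user.pop("email", None)
        user.pop("ip_address", None)
    request = event.get("request")
    if request:
        request.get("headers", {}).pop("Cookie", None)
    return event
```

Returning `None` from the hook instead would drop the event entirely.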
---
## Step 19c: Redis Monitoring (Redis Exporter)
Add direct Redis monitoring to Prometheus. Without this, Redis can die silently — Celery tasks stop processing and emails stop sending, but no alert fires.
### Why Not Just cAdvisor?
cAdvisor tells you "the Redis container is running." The Redis exporter tells you "Redis is running, responding to commands, using 45MB memory, has 3 clients connected, and command latency is 0.2ms." It also catches scenarios where the container is running but Redis itself is unhealthy (maxmemory reached, connection limit hit).
### Resource Impact
| Container | RAM | CPU | Image Size |
|---|---|---|---|
| redis-exporter | ~5 MB | negligible | ~10 MB |
### 19c.1 Docker Compose
The `redis-exporter` service has been added to `docker-compose.yml`:
```yaml
redis-exporter:
image: oliver006/redis_exporter:latest
restart: always
profiles:
- full
ports:
- "127.0.0.1:9121:9121"
environment:
REDIS_ADDR: redis://redis:6379
depends_on:
redis:
condition: service_healthy
mem_limit: 32m
networks:
- backend
- monitoring
```
It joins both `backend` (to reach Redis) and `monitoring` (so Prometheus can scrape it).
### 19c.2 Prometheus Scrape Target
Added to `monitoring/prometheus.yml`:
```yaml
- job_name: "redis"
static_configs:
- targets: ["redis-exporter:9121"]
labels:
service: "redis"
```
### 19c.3 Alert Rules
Four Redis-specific alerts added to `monitoring/prometheus/alert.rules.yml`:
| Alert | Condition | Severity | What It Means |
|---|---|---|---|
| `RedisDown` | `redis_up == 0` for 1m | critical | Redis is unreachable — all background tasks stalled |
| `RedisHighMemoryUsage` | >80% of maxmemory for 5m | warning | Queue backlog or memory leak |
| `RedisHighConnectionCount` | >50 clients for 5m | warning | Possible connection leak |
| `RedisRejectedConnections` | Any rejected in 5m | critical | Redis is refusing new connections |
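For reference, the rules might look roughly like this in `alert.rules.yml` (a sketch using redis_exporter's standard metric names; thresholds mirror the table above, but the committed file may differ in detail):

```yaml
groups:
  - name: redis
    rules:
      - alert: RedisDown
        expr: redis_up == 0
        for: 1m
        labels: { severity: critical }
      - alert: RedisHighMemoryUsage
        # redis_memory_max_bytes is 0 when no maxmemory is set -- guard the division
        expr: redis_memory_max_bytes > 0 and redis_memory_used_bytes / redis_memory_max_bytes > 0.8
        for: 5m
        labels: { severity: warning }
      - alert: RedisHighConnectionCount
        expr: redis_connected_clients > 50
        for: 5m
        labels: { severity: warning }
      - alert: RedisRejectedConnections
        expr: increase(redis_rejected_connections_total[5m]) > 0
        labels: { severity: critical }
```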
### 19c.4 Deploy
```bash
cd ~/apps/orion
git pull
docker compose --profile full up -d
```
Verify the exporter is running and Prometheus can scrape it:
```bash
# Exporter health
curl -s http://localhost:9121/health
# Redis metrics flowing
curl -s http://localhost:9121/metrics | grep redis_up
# Prometheus target status (should show "redis" as UP)
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep -A2 '"redis"'
```
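The last check can also be scripted rather than grepped. A small helper over the shape `/api/v1/targets` returns (the sample payload below is heavily trimmed but follows the real response structure):

```python
import json

def job_health(targets_json: str, job: str) -> list:
    """List the health of every active Prometheus target with a matching job label."""
    data = json.loads(targets_json)
    return [t["health"]
            for t in data["data"]["activeTargets"]
            if t["labels"].get("job") == job]

sample = ('{"status":"success","data":{"activeTargets":'
          '[{"labels":{"job":"redis"},"health":"up"}]}}')
```

An empty list means Prometheus has not picked up the scrape target at all, which usually points at the `monitoring` network attachment.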
### 19c.5 Grafana Dashboard
Import the community Redis dashboard:
1. Open `https://grafana.wizard.lu`
2. **Dashboards** → **Import** → ID `763` → Select Prometheus datasource
3. You'll see: memory usage, connected clients, commands/sec, hit rate, key count
### 19c.6 Verification
```bash
# Redis is being monitored
curl -s http://localhost:9121/metrics | grep redis_up
# redis_up 1
# Memory usage
curl -s http://localhost:9121/metrics | grep redis_memory_used_bytes
# redis_memory_used_bytes 1.234e+07 (≈12 MB)
# Connected clients
curl -s http://localhost:9121/metrics | grep redis_connected_clients
# redis_connected_clients 4 (API + celery-worker + celery-beat + flower)
# Alert rules loaded
curl -s http://localhost:9090/api/v1/rules | python3 -m json.tool | grep -i redis
```
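To run these checks as one script instead of four greps, a minimal parser for the exporter's plain-text exposition format is enough (unlabelled samples only, which covers the metrics above):

```python
def parse_metrics(text: str) -> dict:
    """Parse plain (label-free) samples from Prometheus exposition text."""
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        name, _, value = line.partition(" ")
        if "{" in name:
            continue  # ignore labelled series for this quick check
        try:
            out[name] = float(value)
        except ValueError:
            pass
    return out

sample = "# HELP redis_up ...\nredis_up 1\nredis_memory_used_bytes 1.234e+07"
metrics = parse_metrics(sample)
```

Asserting `metrics["redis_up"] == 1` in a post-deploy script gives the same signal as the manual grep.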
---
## Step 20: Security Hardening
Docker network segmentation, fail2ban configuration, and automatic security updates.
@@ -2010,6 +2269,7 @@ After Google Wallet is verified working:
| Grafana | 3000 | 3001 (localhost) | `grafana.wizard.lu` |
| Node Exporter | 9100 | 9100 (localhost) | (internal only) |
| cAdvisor | 8080 | 8080 (localhost) | (internal only) |
| Redis Exporter | 9121 | 9121 (localhost) | (internal only) |
| Alertmanager | 9093 | 9093 (localhost) | (internal only) |
| Caddy | — | 80, 443 | (reverse proxy) |