feat(monitoring): add Redis exporter + Sentry docs to deployment guide

- Add redis-exporter container to docker-compose (oliver006/redis_exporter, 32MB)
- Add Redis scrape target to Prometheus config
- Add 4 Redis alert rules: RedisDown, HighMemory, HighConnections, RejectedConnections
- Document Step 19b (Sentry Error Tracking) in Hetzner deployment guide
- Document Step 19c (Redis Monitoring) in Hetzner deployment guide
- Update resource budget and port reference tables

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 23:30:18 +01:00
parent ce822af883
commit 35d1559162
54 changed files with 664 additions and 343 deletions


@@ -1101,9 +1101,10 @@ Prometheus + Grafana monitoring stack with host and container metrics.
| grafana | 192 MB | Dashboards (SQLite backend) |
| node-exporter | 64 MB | Host CPU/RAM/disk metrics |
| cadvisor | 128 MB | Per-container resource metrics |
| **Total new** | **640 MB** | |
| redis-exporter | 32 MB | Redis memory, connections, command stats |
| **Total new** | **672 MB** | |
Existing stack ~1.8 GB + 640 MB new = ~2.4 GB. Leaves ~1.6 GB for OS. If too tight, live-upgrade to CAX21 (8 GB/80 GB, ~7.50 EUR/mo) via **Cloud Console > Server > Rescale** (~2 min restart).
Existing stack ~1.8 GB + 672 MB new = ~2.5 GB. Leaves ~1.5 GB for OS. If too tight, live-upgrade to CAX21 (8 GB/80 GB, ~7.50 EUR/mo) via **Cloud Console > Server > Rescale** (~2 min restart).
### 18.1 DNS Record
@@ -1372,6 +1373,264 @@ This is the professional approach — emails come from the client's domain with
---
## Step 19b: Sentry Error Tracking
Application-level error tracking with [Sentry](https://sentry.io). While Prometheus monitors infrastructure metrics (CPU, memory, HTTP error rates), Sentry captures **individual exceptions** with full stack traces, request context, and breadcrumbs — making it possible to debug production errors without SSH access.
!!! info "How Sentry fits into the monitoring stack"

    ```
    ┌──────────────────────────────────────────────────────────────┐
    │                     Observability Stack                      │
    ├──────────────────┬──────────────────┬────────────────────────┤
    │ Prometheus       │ Grafana          │ Sentry                 │
    │ Infrastructure   │ Dashboards       │ Application errors     │
    │ metrics & alerts │ & visualization  │ & performance traces   │
    ├──────────────────┴──────────────────┴────────────────────────┤
    │ Prometheus: "API 5xx rate is 3%"                             │
    │ Sentry:     "TypeError in /api/v1/orders/checkout line 42    │
    │              request_id=abc123, user_id=7, store=acme"       │
    └──────────────────────────────────────────────────────────────┘
    ```
### What's Already Wired
The codebase already initializes Sentry in two places — you just need to provide the DSN:
| Component | File | Integrations |
|---|---|---|
| FastAPI (API server) | `main.py:42-58` | `FastApiIntegration`, `SqlalchemyIntegration` |
| Celery (background workers) | `app/core/celery_config.py:31-39` | `CeleryIntegration` |
Both read from the same `SENTRY_DSN` environment variable. When unset, Sentry is silently skipped.
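The pattern both entry points follow can be sketched like this (a simplified sketch, not the literal `main.py` code; the integration lists and `send_default_pii` flag are omitted here):

```python
import os

def init_sentry() -> bool:
    """Initialize Sentry only when SENTRY_DSN is set; skip silently otherwise."""
    dsn = os.environ.get("SENTRY_DSN")
    if not dsn:
        return False  # no DSN -> Sentry stays disabled: no warning, no error
    import sentry_sdk  # imported lazily so the app also runs without the SDK
    sentry_sdk.init(
        dsn=dsn,
        environment=os.environ.get("SENTRY_ENVIRONMENT", "development"),
        traces_sample_rate=float(os.environ.get("SENTRY_TRACES_SAMPLE_RATE", "0.1")),
    )
    return True
```

This is why the only deployment step is setting the environment variable: with no `SENTRY_DSN`, the code returns before the SDK is ever touched.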
### 19b.1 Create Sentry Project
1. Sign up at [sentry.io](https://sentry.io) (free Developer plan: **5K errors/month**, 1 user)
2. Create a new project:
- **Platform**: Python → FastAPI
- **Project name**: `orion` (or `rewardflow`)
- **Team**: default
3. Copy the **DSN** from the project settings — it looks like:
```
https://abc123def456@o123456.ingest.de.sentry.io/7891011
```
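The DSN packs the public key, ingest host, and project id into one URL, so a copied value can be sanity-checked with a few lines of stdlib Python (`split_dsn` is an illustrative helper, not part of the codebase):

```python
from urllib.parse import urlparse

def split_dsn(dsn: str) -> dict:
    """Split a Sentry DSN into its public key, ingest host, and project id."""
    parts = urlparse(dsn)
    return {
        "public_key": parts.username,          # the part before the @
        "host": parts.hostname,                # e.g. o123456.ingest.de.sentry.io
        "project_id": parts.path.lstrip("/"),  # the trailing number
    }

parsed = split_dsn("https://abc123def456@o123456.ingest.de.sentry.io/7891011")
```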
!!! tip "Sentry pricing"

    | Plan | Errors/month | Cost | Notes |
    |---|---|---|---|
    | **Developer** (free) | 5,000 | $0 | 1 user, 30-day retention |
    | **Team** | 50,000 | $26/mo | Unlimited users, 90-day retention |
    | **Business** | 50,000 | $80/mo | SSO, audit logs, 90-day retention |

    The free plan is sufficient for launch. Upgrade to Team if you exceed 5K errors/month or need multiple team members.
### 19b.2 Configure Environment
Add to `~/apps/orion/.env` on the server:
```bash
# Sentry Error Tracking
SENTRY_DSN=https://your-key@o123456.ingest.de.sentry.io/your-project-id
SENTRY_ENVIRONMENT=production
SENTRY_TRACES_SAMPLE_RATE=0.1
```
| Variable | Default | Description |
|---|---|---|
| `SENTRY_DSN` | `None` (disabled) | Project DSN from Sentry dashboard |
| `SENTRY_ENVIRONMENT` | `development` | Tags errors by environment (`production`, `staging`) |
| `SENTRY_TRACES_SAMPLE_RATE` | `0.1` | Fraction of requests traced for performance (0.1 = 10%) |
!!! warning "Traces sample rate"

    `0.1` (10%) is a good starting point. At high traffic, lower to `0.01` (1%) to stay within the free plan's span limits. For initial launch with low traffic, you can temporarily set `1.0` (100%) for full visibility.
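The trade-off is easy to estimate up front. A back-of-envelope sketch (the traffic figures are made up for illustration):

```python
def traced_per_month(requests_per_day: int, sample_rate: float) -> int:
    """Rough count of performance transactions sent to Sentry per month."""
    return int(requests_per_day * sample_rate * 30)

launch = traced_per_month(1_000, 1.0)     # full visibility at low traffic
steady = traced_per_month(20_000, 0.1)    # the 10% default
busy   = traced_per_month(200_000, 0.01)  # dialed down at high traffic
```

At 20K requests/day the default 10% already sends ~60K transactions a month, so keep an eye on the quota page as traffic grows.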
### 19b.3 Deploy
Restart the API and Celery containers to pick up the new env vars:
```bash
cd ~/apps/orion
docker compose --profile full restart api celery-worker celery-beat
```
Check the API logs to confirm Sentry initialized:
```bash
docker compose --profile full logs api --tail 20 | grep -i sentry
```
You should see:
```
Sentry initialized for environment: production
```
### 19b.4 Verify
**1. Trigger a test error** by hitting the API with a request that will fail:
```bash
curl -s https://api.wizard.lu/api/v1/nonexistent-endpoint-sentry-test
```
**2. Check Sentry dashboard:**
- Go to [sentry.io](https://sentry.io) → your project → **Issues**
- By default the FastAPI integration reports unhandled exceptions and 5xx responses, so a plain 404 may not appear as an issue; if nothing shows up within a minute, trigger a request that raises a genuine server error instead
- Click into an issue to see the full stack trace, request headers, and breadcrumbs
**3. Verify Celery integration** — check that the Celery worker also reports to Sentry:
```bash
docker compose --profile full logs celery-worker --tail 10 | grep -i sentry
```
### 19b.5 Sentry Features to Configure
After verifying the basic setup, configure these in the Sentry web UI:
**Alerts (Sentry → Alerts → Create Alert):**
| Alert | Condition | Action |
|---|---|---|
| New issue spike | >10 events in 1 hour | Email notification |
| First seen error | Any new issue | Email notification |
| Unresolved high-volume | >50 events in 24h | Email notification |
**Release tracking** — Sentry automatically tags errors with the release version via `release=f"orion@{settings.version}"` in `main.py`. This lets you see which deploy introduced a bug.
**Source maps** (optional, post-launch) — if you want JS errors from the admin frontend, add the Sentry browser SDK to your base template. Not needed for launch since most errors will be server-side.
### 19b.6 What Sentry Captures
With the current integration, Sentry automatically captures:
| Data | Source | Example |
|---|---|---|
| Python exceptions | FastAPI + Celery | `TypeError`, `ValidationError`, unhandled 500s |
| Request context | `FastApiIntegration` | URL, method, headers, query params, user IP |
| DB query breadcrumbs | `SqlalchemyIntegration` | SQL queries leading up to the error |
| Celery task failures | `CeleryIntegration` | Task name, args, retry count, worker hostname |
| User info | `send_default_pii=True` | User email and IP (if authenticated) |
| Performance traces | `traces_sample_rate` | End-to-end request timing, DB query duration |
!!! note "Privacy"

    `send_default_pii=True` is set in both `main.py` and `celery_config.py`. This sends user emails and IP addresses to Sentry for debugging context. If GDPR compliance requires stricter data handling, set this to `False` and configure Sentry's [Data Scrubbing](https://docs.sentry.io/security-legal-pii/scrubbing/) rules.
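If you take the stricter route, scrubbing can also happen client-side, before anything leaves the server. A minimal `before_send` hook sketch (the field names follow Sentry's event payload; wire it up with `sentry_sdk.init(before_send=scrub_pii, ...)`):

```python
def scrub_pii(event: dict, hint: dict) -> dict:
    """before_send hook: strip user email/IP and cookies from outgoing events."""
    user = event.get("user")
    if user:
        user.pop("email", None)
        user.pop("ip_address", None)
    request = event.get("request")
    if request:
        request.get("headers", {}).pop("Cookie", None)
    return event
```

Returning `None` from the hook instead would drop the event entirely.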
---
## Step 19c: Redis Monitoring (Redis Exporter)
Add direct Redis monitoring to Prometheus. Without this, Redis can die silently — Celery tasks stop processing and emails stop sending, but no alert fires.
### Why Not Just cAdvisor?
cAdvisor tells you "the Redis container is running." The Redis exporter tells you "Redis is running, responding to commands, using 45MB memory, has 3 clients connected, and command latency is 0.2ms." It also catches scenarios where the container is running but Redis itself is unhealthy (maxmemory reached, connection limit hit).
### Resource Impact
| Container | RAM | CPU | Image Size |
|---|---|---|---|
| redis-exporter | ~5 MB | negligible | ~10 MB |
### 19c.1 Docker Compose
The `redis-exporter` service has been added to `docker-compose.yml`:
```yaml
redis-exporter:
image: oliver006/redis_exporter:latest
restart: always
profiles:
- full
ports:
- "127.0.0.1:9121:9121"
environment:
REDIS_ADDR: redis://redis:6379
depends_on:
redis:
condition: service_healthy
mem_limit: 32m
networks:
- backend
- monitoring
```
It joins both `backend` (to reach Redis) and `monitoring` (so Prometheus can scrape it).
### 19c.2 Prometheus Scrape Target
Added to `monitoring/prometheus.yml`:
```yaml
- job_name: "redis"
static_configs:
- targets: ["redis-exporter:9121"]
labels:
service: "redis"
```
### 19c.3 Alert Rules
Four Redis-specific alerts added to `monitoring/prometheus/alert.rules.yml`:
| Alert | Condition | Severity | What It Means |
|---|---|---|---|
| `RedisDown` | `redis_up == 0` for 1m | critical | Redis is unreachable — all background tasks stalled |
| `RedisHighMemoryUsage` | >80% of maxmemory for 5m | warning | Queue backlog or memory leak |
| `RedisHighConnectionCount` | >50 clients for 5m | warning | Possible connection leak |
| `RedisRejectedConnections` | Any rejected in 5m | critical | Redis is refusing new connections |
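For reference, the rules might look roughly like this in `alert.rules.yml` (a sketch using redis_exporter's standard metric names; thresholds mirror the table above, but the committed file may differ in detail):

```yaml
groups:
  - name: redis
    rules:
      - alert: RedisDown
        expr: redis_up == 0
        for: 1m
        labels: { severity: critical }
      - alert: RedisHighMemoryUsage
        # redis_memory_max_bytes is 0 when no maxmemory is set -- guard the division
        expr: redis_memory_max_bytes > 0 and redis_memory_used_bytes / redis_memory_max_bytes > 0.8
        for: 5m
        labels: { severity: warning }
      - alert: RedisHighConnectionCount
        expr: redis_connected_clients > 50
        for: 5m
        labels: { severity: warning }
      - alert: RedisRejectedConnections
        expr: increase(redis_rejected_connections_total[5m]) > 0
        labels: { severity: critical }
```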
### 19c.4 Deploy
```bash
cd ~/apps/orion
git pull
docker compose --profile full up -d
```
Verify the exporter is running and Prometheus can scrape it:
```bash
# Exporter health
curl -s http://localhost:9121/health
# Redis metrics flowing
curl -s http://localhost:9121/metrics | grep redis_up
# Prometheus target status (should show "redis" as UP)
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep -A2 '"redis"'
```
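The last check can also be scripted rather than grepped. A small helper over the shape `/api/v1/targets` returns (the sample payload below is heavily trimmed but follows the real response structure):

```python
import json

def job_health(targets_json: str, job: str) -> list:
    """List the health of every active Prometheus target with a matching job label."""
    data = json.loads(targets_json)
    return [t["health"]
            for t in data["data"]["activeTargets"]
            if t["labels"].get("job") == job]

sample = ('{"status":"success","data":{"activeTargets":'
          '[{"labels":{"job":"redis"},"health":"up"}]}}')
```

An empty list means Prometheus has not picked up the scrape target at all, which usually points at the `monitoring` network attachment.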
### 19c.5 Grafana Dashboard
Import the community Redis dashboard:
1. Open `https://grafana.wizard.lu`
2. **Dashboards** → **Import** → ID `763` → Select Prometheus datasource
3. You'll see: memory usage, connected clients, commands/sec, hit rate, key count
### 19c.6 Verification
```bash
# Redis is being monitored
curl -s http://localhost:9121/metrics | grep redis_up
# redis_up 1
# Memory usage
curl -s http://localhost:9121/metrics | grep redis_memory_used_bytes
# redis_memory_used_bytes 1.234e+07 (≈12 MB)
# Connected clients
curl -s http://localhost:9121/metrics | grep redis_connected_clients
# redis_connected_clients 4 (API + celery-worker + celery-beat + flower)
# Alert rules loaded
curl -s http://localhost:9090/api/v1/rules | python3 -m json.tool | grep -i redis
```
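To run these checks as one script instead of four greps, a minimal parser for the exporter's plain-text exposition format is enough (unlabelled samples only, which covers the metrics above):

```python
def parse_metrics(text: str) -> dict:
    """Parse plain (label-free) samples from Prometheus exposition text."""
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        name, _, value = line.partition(" ")
        if "{" in name:
            continue  # ignore labelled series for this quick check
        try:
            out[name] = float(value)
        except ValueError:
            pass
    return out

sample = "# HELP redis_up ...\nredis_up 1\nredis_memory_used_bytes 1.234e+07"
metrics = parse_metrics(sample)
```

Asserting `metrics["redis_up"] == 1` in a post-deploy script gives the same signal as the manual grep.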
---
## Step 20: Security Hardening
Docker network segmentation, fail2ban configuration, and automatic security updates.
@@ -2010,6 +2269,7 @@ After Google Wallet is verified working:
| Grafana | 3000 | 3001 (localhost) | `grafana.wizard.lu` |
| Node Exporter | 9100 | 9100 (localhost) | (internal only) |
| cAdvisor | 8080 | 8080 (localhost) | (internal only) |
| Redis Exporter | 9121 | 9121 (localhost) | (internal only) |
| Alertmanager | 9093 | 9093 (localhost) | (internal only) |
| Caddy | — | 80, 443 | (reverse proxy) |