docs: update observability and deployment docs to match production stack

Update observability.md with production container table, actual init code, and correct env var names. Update docker.md with full 10-service table and backup/monitoring cross-references. Add explicit AAAA records to DNS tables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 16:44:05 +01:00
parent 10aa75aa69
commit 677e5211f9
4 changed files with 115 additions and 47 deletions
--- a/docs/architecture/observability.md
+++ b/docs/architecture/observability.md
@@ -2,6 +2,38 @@

 The Orion platform includes a comprehensive observability framework for monitoring application health, collecting metrics, and tracking errors. This is part of the Framework Layer - infrastructure that modules depend on.

+## Production Stack
+
+The full monitoring stack runs as Docker containers alongside the application:
+
+| Container | Image | Port | Purpose |
+|---|---|---|---|
+| prometheus | `prom/prometheus` | 9090 (localhost) | Metrics storage, 15-day retention |
+| grafana | `grafana/grafana` | 3001 (localhost) | Dashboards at `https://grafana.wizard.lu` |
+| node-exporter | `prom/node-exporter` | 9100 (localhost) | Host CPU/RAM/disk metrics |
+| cadvisor | `gcr.io/cadvisor/cadvisor` | 8080 (localhost) | Per-container resource metrics |
+
+All monitoring containers run under `profiles: [full]` in `docker-compose.yml` with memory limits (256 + 192 + 64 + 128 = 640 MB total).
+
+```
+┌──────────────┐     scrape      ┌─────────────────┐
+│  Prometheus  │◄────────────────│  Orion API       │ /metrics
+│  :9090       │◄────────────────│  node-exporter   │ :9100
+│              │◄────────────────│  cAdvisor        │ :8080
+└──────┬───────┘                 └─────────────────┘
+       │ query
+┌──────▼───────┐
+│   Grafana    │──── https://grafana.wizard.lu
+│   :3001      │
+└──────────────┘
+```
+
+Configuration files:
+
+- `monitoring/prometheus.yml` — scrape targets (orion-api, node-exporter, cadvisor, self)
+- `monitoring/grafana/provisioning/datasources/datasource.yml` — auto-provisions Prometheus
+- `monitoring/grafana/provisioning/dashboards/dashboard.yml` — file-based dashboard provider
+
 ## Overview

 ```
@@ -326,47 +358,47 @@ init_observability(

 ### Application Lifespan

+Observability is initialized in `app/core/lifespan.py` and the health router is mounted in `main.py`:
+
 ```python
-# main.py
-from contextlib import asynccontextmanager
-from fastapi import FastAPI
-from app.core.observability import (
-    init_observability,
-    shutdown_observability,
-    health_router,
-    register_module_health_checks,
-)
+# app/core/lifespan.py
+from app.core.config import settings
+from app.core.observability import init_observability, shutdown_observability

 @asynccontextmanager
 async def lifespan(app: FastAPI):
     # Startup
     init_observability(
-        enable_metrics=True,
-        sentry_dsn=settings.SENTRY_DSN,
-        environment=settings.ENVIRONMENT,
-        flower_url=settings.FLOWER_URL,
-        grafana_url=settings.GRAFANA_URL,
+        enable_metrics=settings.enable_metrics,
+        sentry_dsn=settings.sentry_dsn,
+        environment=settings.sentry_environment,
+        flower_url=settings.flower_url,
+        grafana_url=settings.grafana_url,
     )
-    register_module_health_checks()
-
     yield
-
     # Shutdown
     shutdown_observability()
-
-app = FastAPI(lifespan=lifespan)
-app.include_router(health_router)
 ```

+```python
+# main.py
+from app.core.observability import health_router
+app.include_router(health_router)  # /metrics, /health/live, /health/ready, /health/tools
+```
+
+Note: `/health` is defined separately in `main.py` with a richer response (DB check, feature list, docs links). The `health_router` provides the Kubernetes-style probes and Prometheus endpoint.
+
 ### Environment Variables

-| Variable | Description | Default |
-|----------|-------------|---------|
-| `SENTRY_DSN` | Sentry DSN for error tracking | None (disabled) |
-| `ENVIRONMENT` | Environment name | "development" |
-| `ENABLE_METRICS` | Enable Prometheus metrics | False |
-| `FLOWER_URL` | Flower dashboard URL | None |
-| `GRAFANA_URL` | Grafana dashboard URL | None |
+| Variable | Config field | Description | Default |
+|----------|-------------|-------------|---------|
+| `ENABLE_METRICS` | `enable_metrics` | Enable Prometheus metrics collection | `False` |
+| `GRAFANA_URL` | `grafana_url` | Grafana dashboard URL | `https://grafana.wizard.lu` |
+| `GRAFANA_ADMIN_USER` | — | Grafana admin username (docker-compose only) | `admin` |
+| `GRAFANA_ADMIN_PASSWORD` | — | Grafana admin password (docker-compose only) | `changeme` |
+| `SENTRY_DSN` | `sentry_dsn` | Sentry DSN for error tracking | `None` (disabled) |
+| `SENTRY_ENVIRONMENT` | `sentry_environment` | Environment name for Sentry | `development` |
+| `FLOWER_URL` | `flower_url` | Flower dashboard URL | `http://localhost:5555` |

 ## Kubernetes Integration

@@ -424,6 +456,7 @@ spec:

 ## Related Documentation

+- [Hetzner Server Setup — Step 18](../deployment/hetzner-server-setup.md#step-18-monitoring-observability) - Production monitoring deployment
 - [Module System](module-system.md) - Module health check integration
 - [Background Tasks](background-tasks.md) - Celery monitoring with Flower
 - [Deployment](../deployment/index.md) - Production deployment with monitoring
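
The environment variable table in the hunk above maps each variable to a lowercase config field with a default. As a rough illustration of that mapping, here is a minimal stdlib-only sketch. `ObservabilitySettings` and `load_settings` are hypothetical names, not the project's actual `app.core.config` implementation, and the truthy-string parsing for `ENABLE_METRICS` is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ObservabilitySettings:
    """Hypothetical stand-in for the observability fields of `settings`.

    Field names and defaults are copied from the documented table.
    """

    enable_metrics: bool = False
    grafana_url: str = "https://grafana.wizard.lu"
    sentry_dsn: Optional[str] = None
    sentry_environment: str = "development"
    flower_url: str = "http://localhost:5555"


def load_settings(env: Mapping[str, str] = os.environ) -> ObservabilitySettings:
    """Build settings from an environment mapping, falling back to defaults."""
    defaults = ObservabilitySettings()
    # Assumption: common truthy strings enable metrics; anything else disables.
    raw_flag = env.get("ENABLE_METRICS", "")
    return ObservabilitySettings(
        enable_metrics=raw_flag.strip().lower() in {"1", "true", "yes", "on"},
        grafana_url=env.get("GRAFANA_URL", defaults.grafana_url),
        # An empty or missing DSN leaves Sentry disabled.
        sentry_dsn=env.get("SENTRY_DSN") or None,
        sentry_environment=env.get("SENTRY_ENVIRONMENT", defaults.sentry_environment),
        flower_url=env.get("FLOWER_URL", defaults.flower_url),
    )
```

Passing an explicit mapping, for example `load_settings({"ENABLE_METRICS": "true"})`, keeps the sketch testable without touching the real process environment.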
--- a/docs/deployment/docker.md
+++ b/docs/deployment/docker.md
@@ -36,11 +36,20 @@ make docker-down

 ### Current Services

-| Service | Port | Purpose |
-|---------|------|---------|
-| db | 5432 | PostgreSQL database |
-| redis | 6379 | Cache and queue broker |
-| api | 8000 | FastAPI application |
+| Service | Port | Profile | Purpose |
+|---------|------|---------|---------|
+| db | 5432 | (default) | PostgreSQL database |
+| redis | 6379 | (default) | Cache and queue broker |
+| api | 8000 | full | FastAPI application |
+| celery-worker | — | full | Background task processing |
+| celery-beat | — | full | Scheduled task scheduler |
+| flower | 5555 | full | Celery monitoring dashboard |
+| prometheus | 9090 | full | Metrics storage (15-day retention) |
+| grafana | 3001 | full | Dashboards (`https://grafana.wizard.lu`) |
+| node-exporter | 9100 | full | Host CPU/RAM/disk metrics |
+| cadvisor | 8080 | full | Per-container resource metrics |
+
+Use `docker compose --profile full up -d` to start all services, or `docker compose up -d` for just db + redis (local development).

 ---

@@ -368,44 +377,60 @@ docker compose -f docker-compose.prod.yml up -d api

 ## Backups

-### Database Backup
+Automated backup scripts handle daily pg_dump with rotation and optional Cloudflare R2 offsite sync:

 ```bash
-# Create backup
-docker compose -f docker-compose.prod.yml exec db pg_dump -U orion_user orion_db | gzip > backup_$(date +%Y%m%d).sql.gz
+# Run backup (Orion + Gitea databases)
+bash scripts/backup.sh

-# Restore backup
-gunzip -c backup_20240115.sql.gz | docker compose -f docker-compose.prod.yml exec -T db psql -U orion_user -d orion_db
+# Run backup with R2 upload
+bash scripts/backup.sh --upload
+
+# Restore from backup
+bash scripts/restore.sh orion ~/backups/orion/daily/orion_20260214_030000.sql.gz
+bash scripts/restore.sh gitea ~/backups/gitea/daily/gitea_20260214_030000.sql.gz
 ```

-### Volume Backup
+Backups are stored in `~/backups/{orion,gitea}/{daily,weekly}/` with 7-day daily and 4-week weekly retention. A systemd timer runs the backup daily at 03:00.
+
+See [Hetzner Server Setup — Step 17](hetzner-server-setup.md#step-17-backups) for full setup instructions.
+
+### Manual Database Backup

 ```bash
-# Backup all volumes
-docker run --rm \
-  -v orion_postgres_data:/data \
-  -v $(pwd)/backups:/backup \
-  alpine tar czf /backup/postgres_$(date +%Y%m%d).tar.gz /data
+# One-off backup
+docker compose exec db pg_dump -U orion_user orion_db | gzip > backup_$(date +%Y%m%d).sql.gz
+
+# Restore
+gunzip -c backup_20240115.sql.gz | docker compose exec -T db psql -U orion_user -d orion_db
 ```

 ---

 ## Monitoring

+The monitoring stack (Prometheus, Grafana, node-exporter, cAdvisor) runs under `profiles: [full]`. See [Hetzner Server Setup — Step 18](hetzner-server-setup.md#step-18-monitoring-observability) for full setup and [Observability Framework](../architecture/observability.md) for the application-level metrics architecture.
+
 ### Resource Usage

 ```bash
-docker stats
+docker stats --no-stream
 ```

 ### Health Checks

 ```bash
 # Check service health
-docker compose -f docker-compose.prod.yml ps
+docker compose --profile full ps

-# Test API health
+# API health
 curl -s http://localhost:8000/health | jq
+
+# Prometheus metrics
+curl -s http://localhost:8000/metrics | head -5
+
+# Prometheus targets
+curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep health
 ```

 ---
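
The "Prometheus targets" check in the hunk above greps the raw JSON for `health`. A slightly more structured alternative is sketched below, assuming the standard response shape of the Prometheus `/api/v1/targets` endpoint (`data.activeTargets[]` with `labels.job`, `scrapeUrl`, and `health`). The function names and the sample payload, including its scrape URLs, are illustrative and not taken from the repository.

```python
import json
from typing import List, Tuple

# Illustrative payload only; job names follow the scrape targets listed in
# monitoring/prometheus.yml, the scrape URLs are made up for the example.
SAMPLE_PAYLOAD = json.dumps({
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "orion-api"}, "scrapeUrl": "http://api:8000/metrics", "health": "up"},
            {"labels": {"job": "node-exporter"}, "scrapeUrl": "http://node-exporter:9100/metrics", "health": "up"},
            {"labels": {"job": "cadvisor"}, "scrapeUrl": "http://cadvisor:8080/metrics", "health": "down"},
        ]
    },
})


def summarize_targets(payload: str) -> List[Tuple[str, str, str]]:
    """Return (job, scrape URL, health) for every active target."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"unexpected API status: {body.get('status')!r}")
    return [
        (target["labels"].get("job", "?"), target["scrapeUrl"], target["health"])
        for target in body["data"]["activeTargets"]
    ]


def unhealthy_jobs(payload: str) -> List[str]:
    """Return the job names of targets whose health is not 'up'."""
    return [job for job, _url, health in summarize_targets(payload) if health != "up"]
```

In practice the payload would be the output of the `curl -s http://localhost:9090/api/v1/targets` command shown in the diff, passed in as a string.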
--- a/docs/deployment/hetzner-server-setup.md
+++ b/docs/deployment/hetzner-server-setup.md
@@ -444,6 +444,8 @@ Before setting up Caddy, point your domain's DNS to the server.
 |---|---|---|---|
 | A | `@` | `91.99.65.229` | 300 |
 | A | `www` | `91.99.65.229` | 300 |
+| AAAA | `@` | `2a01:4f8:1c1a:b39c::1` | 300 |
+| AAAA | `www` | `2a01:4f8:1c1a:b39c::1` | 300 |

 ### rewardflow.lu (Loyalty+ Platform) — TODO

@@ -451,6 +453,8 @@ Before setting up Caddy, point your domain's DNS to the server.
 |---|---|---|---|
 | A | `@` | `91.99.65.229` | 300 |
 | A | `www` | `91.99.65.229` | 300 |
+| AAAA | `@` | `2a01:4f8:1c1a:b39c::1` | 300 |
+| AAAA | `www` | `2a01:4f8:1c1a:b39c::1` | 300 |

 ### IPv6 (AAAA) Records — TODO

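
The DNS tables above pair each hostname with both an A and an AAAA record. As a small sanity check, the record type can be derived from the address itself. This is a sketch using only the standard-library `ipaddress` module; `record_type` and `dns_rows` are hypothetical helpers, not part of the repository.

```python
import ipaddress
from typing import Iterable, List


def record_type(address: str) -> str:
    """Return 'A' for an IPv4 address and 'AAAA' for an IPv6 address."""
    ip = ipaddress.ip_address(address)  # raises ValueError on malformed input
    return "AAAA" if ip.version == 6 else "A"


def dns_rows(names: Iterable[str], addresses: Iterable[str], ttl: int = 300) -> List[str]:
    """Render Markdown table rows in the same column order as the DNS tables."""
    names = list(names)
    return [
        f"| {record_type(address)} | `{name}` | `{address}` | {ttl} |"
        for address in addresses
        for name in names
    ]
```

Calling `dns_rows(["@", "www"], ["91.99.65.229", "2a01:4f8:1c1a:b39c::1"])` yields four rows, two A and two AAAA, matching the layout of the tables in the diff.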
--- a/docs/deployment/production.md
+++ b/docs/deployment/production.md
@@ -328,7 +328,10 @@ sudo systemctl restart orion orion-celery

 ## Backups

-### Database Backup Script
+!!! tip "Docker deployment"
+    For Docker-based deployments, use the automated backup scripts (`scripts/backup.sh` and `scripts/restore.sh`) with a systemd timer. See [Hetzner Server Setup — Step 17](hetzner-server-setup.md#step-17-backups).
+
+### Database Backup Script (VPS without Docker)

 ```bash
 sudo nano /home/orion/backup.sh
@@ -373,6 +376,9 @@ sudo -u orion crontab -e

 ## Monitoring

+!!! tip "Docker deployment"
+    For Docker-based deployments, a full Prometheus + Grafana + node-exporter + cAdvisor stack is included in `docker-compose.yml`. See [Hetzner Server Setup — Step 18](hetzner-server-setup.md#step-18-monitoring-observability) and [Observability Framework](../architecture/observability.md).
+
 ### Basic Health Check

 ```bash