feat: add automated backups and Prometheus/Grafana monitoring stack (Steps 17-18)
Backups: pg_dump scripts with daily/weekly rotation and Cloudflare R2 offsite sync. Monitoring: Prometheus, Grafana, node-exporter, cAdvisor in docker-compose; /metrics endpoint activated via prometheus_client. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -49,8 +49,8 @@ Complete step-by-step guide for deploying Orion on a Hetzner Cloud VPS.
**Next steps:**

- [ ] Step 17: Backups — verify Hetzner backup scope, add PostgreSQL pg_dump
- [ ] Step 18: Monitoring & observability — Prometheus, Grafana, uptime checks, alerting
- [x] Step 17: Backups
- [x] Step 18: Monitoring & observability

**Deferred (not urgent, do when all platforms ready):**
@@ -69,11 +69,13 @@ Complete step-by-step guide for deploying Orion on a Hetzner Cloud VPS.
- `env_file: .env` added to `docker-compose.yml` — containers load host env vars properly
- `CapacitySnapshot` model import fixed (moved from billing to monitoring in `alembic/env.py`)
- All services verified healthy at `https://api.wizard.lu/health`
- **Step 17: Backups** — automated pg_dump scripts (daily + weekly rotation), R2 offsite upload, restore helper
- **Step 18: Monitoring** — Prometheus, Grafana, node-exporter, cAdvisor added to docker-compose; `/metrics` endpoint activated via `prometheus_client`

**Next steps:**

- [ ] Step 17: Backups — verify Hetzner backup scope, add PostgreSQL pg_dump
- [ ] Step 18: Monitoring & observability — Prometheus, Grafana, uptime checks, alerting
- [ ] Server-side: enable Hetzner backups, create R2 bucket, configure systemd timer
- [ ] Server-side: add `grafana` DNS record, Caddyfile block, redeploy with `--profile full`

## Installed Software Versions
@@ -787,6 +789,298 @@ curl -I https://flower.wizard.lu
sudo systemctl status gitea-runner
```

## Step 17: Backups

Three layers of backup protection: Hetzner server snapshots, automated PostgreSQL dumps with local rotation, and offsite sync to Cloudflare R2.

### 17.1 Enable Hetzner Server Backups

In the Hetzner Cloud Console:

1. Go to **Servers** > select your server > **Backups**
2. Click **Enable backups** (~20% of server cost, ~1.20 EUR/mo for CAX11)
3. Hetzner takes automatic weekly snapshots with 7-day retention

This covers full-disk recovery (OS, Docker volumes, config files) but is coarse-grained. Database-level backups (below) give finer restore granularity.

### 17.2 Cloudflare R2 Setup (Offsite Backup Storage)

R2 provides S3-compatible object storage with a generous free tier (10 GB storage, 10 million reads/month).

**Create Cloudflare account and R2 bucket:**

1. Sign up at [cloudflare.com](https://dash.cloudflare.com/sign-up) (free account)
2. Go to **R2 Object Storage** > **Create bucket**
3. Name: `orion-backups`, region: automatic
4. Go to **R2** > **Manage R2 API Tokens** > **Create API token**
    - Permissions: Object Read & Write
    - Specify bucket: `orion-backups`
5. Note the **Account ID**, **Access Key ID**, and **Secret Access Key**

**Install and configure AWS CLI on the server:**

```bash
sudo apt install -y awscli
aws configure --profile r2
# Access Key ID: <from step 5>
# Secret Access Key: <from step 5>
# Default region name: auto
# Default output format: json
```

**Test connectivity:**

```bash
aws s3 ls --endpoint-url https://<ACCOUNT_ID>.r2.cloudflarestorage.com --profile r2
```

Add the R2 backup bucket name to your production `.env`:

```bash
R2_BACKUP_BUCKET=orion-backups
```

### 17.3 Backup Script

The backup script at `scripts/backup.sh` handles:

- `pg_dump` of Orion DB (via `docker exec orion-db-1`)
- `pg_dump` of Gitea DB (via `docker exec gitea-db`)
- On Sundays: copies daily backup to `weekly/` subdirectory
- Rotation: keeps 7 daily, 4 weekly backups
- Optional `--upload` flag: syncs to Cloudflare R2

```bash
# Create backup directories
mkdir -p ~/backups/{orion,gitea}/{daily,weekly}

# Run a manual backup
bash ~/apps/orion/scripts/backup.sh

# Run with R2 upload
bash ~/apps/orion/scripts/backup.sh --upload

# Verify backup integrity
ls -lh ~/backups/orion/daily/
gunzip -t ~/backups/orion/daily/*.sql.gz
```
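
The rotation half of that script can be sketched as follows — a minimal sketch that assumes the directory layout above; the actual `scripts/backup.sh` also produces the dumps via `docker exec` and handles `--upload`, which appear only as comments here:

```shell
#!/usr/bin/env bash
# Sketch of the daily/weekly rotation in backup.sh (assumed layout and names).
set -eu

BACKUP_ROOT="${BACKUP_ROOT:-$HOME/backups/orion}"
STAMP="$(date +%Y%m%d_%H%M%S)"
mkdir -p "$BACKUP_ROOT"/{daily,weekly}

# Real script: docker exec orion-db-1 pg_dump -U postgres orion \
#                | gzip > "$BACKUP_ROOT/daily/orion_${STAMP}.sql.gz"
# Sketch: create a placeholder so the rotation below has something to rotate.
touch "$BACKUP_ROOT/daily/orion_${STAMP}.sql.gz"

# On Sundays, copy today's dump into weekly/
if [ "$(date +%u)" -eq 7 ]; then
  cp "$BACKUP_ROOT/daily/orion_${STAMP}.sql.gz" "$BACKUP_ROOT/weekly/"
fi

# Rotation: keep the newest 7 daily and 4 weekly dumps, drop the rest
ls -1t "$BACKUP_ROOT"/daily/*.sql.gz | tail -n +8 | xargs -r rm --
ls -1t "$BACKUP_ROOT"/weekly/*.sql.gz 2>/dev/null | tail -n +5 | xargs -r rm --
```

Swapping the placeholder `touch` for the real `pg_dump` pipeline, and repeating it for the Gitea database, gives roughly the shape of the production script.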

### 17.4 Systemd Timer (Daily at 03:00)

Create the service unit:

```bash
sudo nano /etc/systemd/system/orion-backup.service
```

```ini
[Unit]
Description=Orion database backup
After=docker.service

[Service]
Type=oneshot
User=samir
ExecStart=/usr/bin/bash /home/samir/apps/orion/scripts/backup.sh --upload
StandardOutput=journal
StandardError=journal
```

Create the timer:

```bash
sudo nano /etc/systemd/system/orion-backup.timer
```

```ini
[Unit]
Description=Run Orion backup daily at 03:00

[Timer]
OnCalendar=*-*-* 03:00:00
Persistent=true

[Install]
WantedBy=timers.target
```

Enable and start:

```bash
sudo systemctl daemon-reload
sudo systemctl enable --now orion-backup.timer

# Verify timer is active
systemctl list-timers orion-backup.timer

# Test manually
sudo systemctl start orion-backup.service
journalctl -u orion-backup.service --no-pager
```

### 17.5 Restore Procedure

The restore script at `scripts/restore.sh` handles the full restore cycle:

```bash
# Restore Orion database
bash ~/apps/orion/scripts/restore.sh orion ~/backups/orion/daily/orion_20260214_030000.sql.gz

# Restore Gitea database
bash ~/apps/orion/scripts/restore.sh gitea ~/backups/gitea/daily/gitea_20260214_030000.sql.gz
```

The script will:

1. Stop app containers (keep DB running)
2. Drop and recreate the database
3. Restore from the `.sql.gz` backup
4. Run Alembic migrations (Orion only)
5. Restart all containers
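
That sequence can be sketched as a dry-run script. The container names (`orion-db-1`, `gitea-db`), the `postgres` user, and the service list are assumptions taken from this guide, and each step is echoed rather than executed:

```shell
#!/usr/bin/env bash
# Dry-run sketch of the restore flow (assumed names; echoes, does not execute).
set -eu

DB="${1:-orion}"                        # orion | gitea
DUMP="${2:-/path/to/backup.sql.gz}"

case "$DB" in
  orion) CONTAINER=orion-db-1 ;;
  gitea) CONTAINER=gitea-db ;;
  *) echo "usage: restore.sh {orion|gitea} <dump.sql.gz>" >&2; exit 1 ;;
esac

run() { echo "+ $*"; }                  # replace the body with "$@" to really run

run docker compose stop api celery-worker celery-beat               # 1. stop apps, keep DB
run docker exec "$CONTAINER" dropdb -U postgres --if-exists "$DB"   # 2. drop ...
run docker exec "$CONTAINER" createdb -U postgres "$DB"             #    ... and recreate
run "gunzip -c $DUMP | docker exec -i $CONTAINER psql -U postgres $DB"  # 3. load dump
if [ "$DB" = orion ]; then
  run docker compose exec api alembic upgrade head                  # 4. migrations (Orion only)
fi
run docker compose up -d                                            # 5. restart everything
```

Printing the plan first makes it easy to review the exact commands before pointing the script at a live database.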

To restore from R2 (if local backups are lost):

```bash
# Download from R2
aws s3 sync s3://orion-backups/ ~/backups/ \
  --endpoint-url https://<ACCOUNT_ID>.r2.cloudflarestorage.com \
  --profile r2

# Then restore as usual
bash ~/apps/orion/scripts/restore.sh orion ~/backups/orion/daily/<latest>.sql.gz
```

### 17.6 Verification

```bash
# Backup files exist
ls -lh ~/backups/orion/daily/
ls -lh ~/backups/gitea/daily/

# Backup integrity
gunzip -t ~/backups/orion/daily/*.sql.gz

# Timer is scheduled
systemctl list-timers orion-backup.timer

# R2 sync (if configured)
aws s3 ls s3://orion-backups/ --endpoint-url https://<ACCOUNT_ID>.r2.cloudflarestorage.com --profile r2 --recursive
```

---

## Step 18: Monitoring & Observability

Prometheus + Grafana monitoring stack with host and container metrics.

### Architecture

```
┌──────────────┐     scrape     ┌─────────────────┐
│  Prometheus  │◄───────────────│  Orion API      │  /metrics
│    :9090     │◄───────────────│  node-exporter  │  :9100
│              │◄───────────────│  cAdvisor       │  :8080
└──────┬───────┘                └─────────────────┘
       │ query
┌──────▼───────┐
│   Grafana    │──── https://grafana.wizard.lu
│    :3001     │
└──────────────┘
```
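
The scrape edges in the diagram correspond to a `monitoring/prometheus.yml` along these lines — a sketch assuming the compose service names (`api`, `node-exporter`, `cadvisor`) and the API port resolve on the Docker network; the actual file lives in the repo's `monitoring/` directory:

```yaml
global:
  scrape_interval: 30s              # modest interval keeps CPU/RAM cost low

scrape_configs:
  - job_name: orion-api
    metrics_path: /metrics
    static_configs:
      - targets: ["api:8000"]       # service name and port are assumptions
  - job_name: node-exporter
    static_configs:
      - targets: ["node-exporter:9100"]
  - job_name: cadvisor
    static_configs:
      - targets: ["cadvisor:8080"]
```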

### Resource Budget (4 GB Server)

| Container | RAM Limit | Purpose |
|---|---|---|
| prometheus | 256 MB | Metrics storage (15-day retention, 2 GB max) |
| grafana | 192 MB | Dashboards (SQLite backend) |
| node-exporter | 64 MB | Host CPU/RAM/disk metrics |
| cadvisor | 128 MB | Per-container resource metrics |
| **Total new** | **640 MB** | |

Existing stack ~1.8 GB + 640 MB new = ~2.4 GB. Leaves ~1.6 GB for the OS. If too tight, live-upgrade to CAX21 (8 GB/80 GB, ~7.50 EUR/mo) via **Cloud Console > Server > Rescale** (~2 min restart).
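
Those limits map to `docker-compose.yml` roughly as below — a sketch, since the actual service definitions (image tags, volumes, networks, the `full` profile wiring) live in the repo:

```yaml
# Sketch: monitoring services with the RAM caps from the table above.
services:
  prometheus:
    image: prom/prometheus            # pinned tag in the real compose file
    profiles: ["full"]
    command:
      - "--storage.tsdb.retention.time=15d"
      - "--storage.tsdb.retention.size=2GB"
    ports:
      - "127.0.0.1:9090:9090"         # localhost only; no public exposure
    mem_limit: 256m
  grafana:
    image: grafana/grafana
    profiles: ["full"]
    ports:
      - "127.0.0.1:3001:3000"         # host 3001 -> container 3000
    mem_limit: 192m
  node-exporter:
    image: prom/node-exporter
    profiles: ["full"]
    mem_limit: 64m
  cadvisor:
    image: gcr.io/cadvisor/cadvisor
    profiles: ["full"]
    mem_limit: 128m
```

Binding every published port to `127.0.0.1` keeps the whole stack reachable only through Caddy.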

### 18.1 DNS Record

Add A and AAAA records for `grafana.wizard.lu`:

| Type | Name | Value | TTL |
|---|---|---|---|
| A | `grafana` | `91.99.65.229` | 300 |
| AAAA | `grafana` | `2a01:4f8:1c1a:b39c::1` | 300 |

### 18.2 Caddy Configuration

Add to `/etc/caddy/Caddyfile`:

```caddy
grafana.wizard.lu {
    reverse_proxy localhost:3001
}
```

Reload Caddy:

```bash
sudo systemctl reload caddy
```

### 18.3 Production Environment

Add to `~/apps/orion/.env`:

```bash
ENABLE_METRICS=true
GRAFANA_URL=https://grafana.wizard.lu
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=<strong-password>
```
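
Under the hood, `ENABLE_METRICS=true` gates the `prometheus_client` exposition. A minimal sketch of what the library provides — the counter name and label here are illustrative, not Orion's actual metrics:

```python
from prometheus_client import Counter, generate_latest

# Hypothetical counter; the real app registers its own metrics.
REQUESTS = Counter("orion_http_requests_total", "HTTP requests served", ["path"])

def metrics_payload() -> bytes:
    """Render the default registry in Prometheus text exposition format."""
    return generate_latest()

REQUESTS.labels(path="/health").inc()
print(b"orion_http_requests_total" in metrics_payload())  # prints: True
```

In a FastAPI app the same payload is typically served by mounting `make_asgi_app()` from `prometheus_client` at `/metrics`, guarded by the `ENABLE_METRICS` flag.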

### 18.4 Deploy

```bash
cd ~/apps/orion
docker compose --profile full up -d --build
```

Verify all containers are running:

```bash
docker compose --profile full ps
docker stats --no-stream
```

### 18.5 Grafana First Login

1. Open `https://grafana.wizard.lu`
2. Log in with `admin` / `<password from .env>`
3. Change the default password when prompted

**Import community dashboards:**

- **Node Exporter Full**: Dashboards > Import > ID `1860` > Select Prometheus datasource
- **Docker / cAdvisor**: Dashboards > Import > ID `193` > Select Prometheus datasource
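
The dashboard imports assume a Prometheus datasource already exists; the `monitoring/` directory can provision it at Grafana startup instead of configuring it by hand. A sketch, following Grafana's standard provisioning layout (the exact path inside this repo is an assumption):

```yaml
# monitoring/grafana/provisioning/datasources/prometheus.yml (assumed path)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090     # compose service name on the shared network
    isDefault: true
```

Mounted into the container at `/etc/grafana/provisioning/datasources/`, this makes the datasource appear on first login with no manual setup.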

### 18.6 Verification

```bash
# Prometheus metrics from Orion API
curl -s https://api.wizard.lu/metrics | head -5

# Health endpoints
curl -s https://api.wizard.lu/health/live
curl -s https://api.wizard.lu/health/ready

# Prometheus targets (all should be "up")
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep health

# Grafana accessible
curl -I https://grafana.wizard.lu

# RAM usage within limits
docker stats --no-stream
```

---

## Domain & Port Reference

@@ -801,6 +1095,10 @@ sudo systemctl status gitea-runner

| Redis | 6379 | 6380 | (internal only) |
| Flower | 5555 | 5555 | `flower.wizard.lu` |
| Gitea | 3000 | 3000 | `git.wizard.lu` |
| Prometheus | 9090 | 9090 (localhost) | (internal only) |
| Grafana | 3000 | 3001 (localhost) | `grafana.wizard.lu` |
| Node Exporter | 9100 | 9100 (localhost) | (internal only) |
| cAdvisor | 8080 | 8080 (localhost) | (internal only) |
| Caddy | — | 80, 443 | (reverse proxy) |

!!! note "Single backend, multiple domains"

@@ -810,15 +1108,23 @@ sudo systemctl status gitea-runner

```
~/
├── gitea/
│   └── docker-compose.yml        # Gitea + PostgreSQL
├── apps/
│   └── orion/                    # Orion application
│       ├── .env                  # Production environment
│       ├── docker-compose.yml    # App stack (API, DB, Redis, Celery)
│       ├── docker-compose.yml    # App stack (API, DB, Redis, Celery, monitoring)
│       ├── monitoring/           # Prometheus + Grafana config
│       ├── logs/                 # Application logs
│       ├── uploads/              # User uploads
│       └── exports/              # Export files
├── backups/
│   ├── orion/
│   │   ├── daily/                # 7-day retention
│   │   └── weekly/               # 4-week retention
│   └── gitea/
│       ├── daily/
│       └── weekly/
├── gitea/
│   └── docker-compose.yml        # Gitea + PostgreSQL
└── gitea-runner/                 # CI/CD runner (act_runner v0.2.13)
    ├── act_runner                # symlink → act_runner-0.2.13-linux-arm64
    ├── act_runner-0.2.13-linux-arm64

@@ -930,8 +1236,10 @@ After Caddy is configured:

| API ReDoc | `https://api.wizard.lu/redoc` |
| Admin panel | `https://wizard.lu/admin/login` |
| Health check | `https://api.wizard.lu/health` |
| Prometheus metrics | `https://api.wizard.lu/metrics` |
| Gitea | `https://git.wizard.lu` |
| Flower | `https://flower.wizard.lu` |
| Grafana | `https://grafana.wizard.lu` |
| OMS Platform | `https://oms.lu` (after DNS) |
| Loyalty+ Platform | `https://rewardflow.lu` (after DNS) |