feat: add automated backups and Prometheus/Grafana monitoring stack (Steps 17-18)
Backups: pg_dump scripts with daily/weekly rotation and Cloudflare R2 offsite sync. Monitoring: Prometheus, Grafana, node-exporter, cAdvisor in docker-compose; /metrics endpoint activated via prometheus_client. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -49,8 +49,8 @@ Complete step-by-step guide for deploying Orion on a Hetzner Cloud VPS.
**Next steps:**

- [ ] Step 17: Backups — verify Hetzner backup scope, add PostgreSQL pg_dump
- [ ] Step 18: Monitoring & observability — Prometheus, Grafana, uptime checks, alerting
- [x] Step 17: Backups
- [x] Step 18: Monitoring & observability

**Deferred (not urgent, do when all platforms ready):**
@@ -69,11 +69,13 @@ Complete step-by-step guide for deploying Orion on a Hetzner Cloud VPS.
- `env_file: .env` added to `docker-compose.yml` — containers load host env vars properly
- `CapacitySnapshot` model import fixed (moved from billing to monitoring in `alembic/env.py`)
- All services verified healthy at `https://api.wizard.lu/health`
- **Step 17: Backups** — automated pg_dump scripts (daily + weekly rotation), R2 offsite upload, restore helper
- **Step 18: Monitoring** — Prometheus, Grafana, node-exporter, cAdvisor added to docker-compose; `/metrics` endpoint activated via `prometheus_client`

**Next steps:**

- [ ] Step 17: Backups — verify Hetzner backup scope, add PostgreSQL pg_dump
- [ ] Step 18: Monitoring & observability — Prometheus, Grafana, uptime checks, alerting
- [ ] Server-side: enable Hetzner backups, create R2 bucket, configure systemd timer
- [ ] Server-side: add `grafana` DNS record, Caddyfile block, redeploy with `--profile full`

## Installed Software Versions
@@ -787,6 +789,298 @@ curl -I https://flower.wizard.lu
sudo systemctl status gitea-runner
```

## Step 17: Backups

Three layers of backup protection: Hetzner server snapshots, automated PostgreSQL dumps with local rotation, and offsite sync to Cloudflare R2.

### 17.1 Enable Hetzner Server Backups

In the Hetzner Cloud Console:

1. Go to **Servers** > select your server > **Backups**
2. Click **Enable backups** (~20% of server cost, ~1.20 EUR/mo for CAX11)
3. Hetzner takes automatic weekly snapshots with 7-day retention

This covers full-disk recovery (OS, Docker volumes, config files) but is coarse-grained. Database-level backups (below) give finer restore granularity.

### 17.2 Cloudflare R2 Setup (Offsite Backup Storage)

R2 provides S3-compatible object storage with a generous free tier (10 GB storage, 10 million reads/month).

**Create Cloudflare account and R2 bucket:**

1. Sign up at [cloudflare.com](https://dash.cloudflare.com/sign-up) (free account)
2. Go to **R2 Object Storage** > **Create bucket**
3. Name: `orion-backups`, region: automatic
4. Go to **R2** > **Manage R2 API Tokens** > **Create API token**
    - Permissions: Object Read & Write
    - Specify bucket: `orion-backups`
5. Note the **Account ID**, **Access Key ID**, and **Secret Access Key**

**Install and configure AWS CLI on the server:**

```bash
sudo apt install -y awscli
aws configure --profile r2
# Access Key ID: <from step 5>
# Secret Access Key: <from step 5>
# Default region name: auto
# Default output format: json
```

**Test connectivity:**

```bash
aws s3 ls --endpoint-url https://<ACCOUNT_ID>.r2.cloudflarestorage.com --profile r2
```

Add the R2 backup bucket name to your production `.env`:

```bash
R2_BACKUP_BUCKET=orion-backups
```

### 17.3 Backup Script

The backup script at `scripts/backup.sh` handles:

- `pg_dump` of Orion DB (via `docker exec orion-db-1`)
- `pg_dump` of Gitea DB (via `docker exec gitea-db`)
- On Sundays: copies daily backup to `weekly/` subdirectory
- Rotation: keeps 7 daily, 4 weekly backups
- Optional `--upload` flag: syncs to Cloudflare R2

```bash
# Create backup directories
mkdir -p ~/backups/{orion,gitea}/{daily,weekly}

# Run a manual backup
bash ~/apps/orion/scripts/backup.sh

# Run with R2 upload
bash ~/apps/orion/scripts/backup.sh --upload

# Verify backup integrity
ls -lh ~/backups/orion/daily/
gunzip -t ~/backups/orion/daily/*.sql.gz
```
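
The rotation half of that script can be sketched as follows — a minimal sketch that assumes the directory layout above; the actual `scripts/backup.sh` also produces the dumps via `docker exec` and handles `--upload`, which appear only as comments here:

```shell
#!/usr/bin/env bash
# Sketch of the daily/weekly rotation in backup.sh (assumed layout and names).
set -eu

BACKUP_ROOT="${BACKUP_ROOT:-$HOME/backups/orion}"
STAMP="$(date +%Y%m%d_%H%M%S)"
mkdir -p "$BACKUP_ROOT"/{daily,weekly}

# Real script: docker exec orion-db-1 pg_dump -U postgres orion \
#                | gzip > "$BACKUP_ROOT/daily/orion_${STAMP}.sql.gz"
# Sketch: create a placeholder so the rotation below has something to rotate.
touch "$BACKUP_ROOT/daily/orion_${STAMP}.sql.gz"

# On Sundays, copy today's dump into weekly/
if [ "$(date +%u)" -eq 7 ]; then
  cp "$BACKUP_ROOT/daily/orion_${STAMP}.sql.gz" "$BACKUP_ROOT/weekly/"
fi

# Rotation: keep the newest 7 daily and 4 weekly dumps, drop the rest
ls -1t "$BACKUP_ROOT"/daily/*.sql.gz | tail -n +8 | xargs -r rm --
ls -1t "$BACKUP_ROOT"/weekly/*.sql.gz 2>/dev/null | tail -n +5 | xargs -r rm --
```

Swapping the placeholder `touch` for the real `pg_dump` pipeline, and repeating it for the Gitea database, gives roughly the shape of the production script.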

### 17.4 Systemd Timer (Daily at 03:00)

Create the service unit:

```bash
sudo nano /etc/systemd/system/orion-backup.service
```

```ini
[Unit]
Description=Orion database backup
After=docker.service

[Service]
Type=oneshot
User=samir
ExecStart=/usr/bin/bash /home/samir/apps/orion/scripts/backup.sh --upload
StandardOutput=journal
StandardError=journal
```

Create the timer:

```bash
sudo nano /etc/systemd/system/orion-backup.timer
```

```ini
[Unit]
Description=Run Orion backup daily at 03:00

[Timer]
OnCalendar=*-*-* 03:00:00
Persistent=true

[Install]
WantedBy=timers.target
```

Enable and start:

```bash
sudo systemctl daemon-reload
sudo systemctl enable --now orion-backup.timer

# Verify timer is active
systemctl list-timers orion-backup.timer

# Test manually
sudo systemctl start orion-backup.service
journalctl -u orion-backup.service --no-pager
```

### 17.5 Restore Procedure

The restore script at `scripts/restore.sh` handles the full restore cycle:

```bash
# Restore Orion database
bash ~/apps/orion/scripts/restore.sh orion ~/backups/orion/daily/orion_20260214_030000.sql.gz

# Restore Gitea database
bash ~/apps/orion/scripts/restore.sh gitea ~/backups/gitea/daily/gitea_20260214_030000.sql.gz
```

The script will:

1. Stop app containers (keep DB running)
2. Drop and recreate the database
3. Restore from the `.sql.gz` backup
4. Run Alembic migrations (Orion only)
5. Restart all containers
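
That sequence can be sketched as a dry-run script. The container names (`orion-db-1`, `gitea-db`), the `postgres` user, and the service list are assumptions taken from this guide, and each step is echoed rather than executed:

```shell
#!/usr/bin/env bash
# Dry-run sketch of the restore flow (assumed names; echoes, does not execute).
set -eu

DB="${1:-orion}"                        # orion | gitea
DUMP="${2:-/path/to/backup.sql.gz}"

case "$DB" in
  orion) CONTAINER=orion-db-1 ;;
  gitea) CONTAINER=gitea-db ;;
  *) echo "usage: restore.sh {orion|gitea} <dump.sql.gz>" >&2; exit 1 ;;
esac

run() { echo "+ $*"; }                  # replace the body with "$@" to really run

run docker compose stop api celery-worker celery-beat               # 1. stop apps, keep DB
run docker exec "$CONTAINER" dropdb -U postgres --if-exists "$DB"   # 2. drop ...
run docker exec "$CONTAINER" createdb -U postgres "$DB"             #    ... and recreate
run "gunzip -c $DUMP | docker exec -i $CONTAINER psql -U postgres $DB"  # 3. load dump
if [ "$DB" = orion ]; then
  run docker compose exec api alembic upgrade head                  # 4. migrations (Orion only)
fi
run docker compose up -d                                            # 5. restart everything
```

Printing the plan first makes it easy to review the exact commands before pointing the script at a live database.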

To restore from R2 (if local backups are lost):

```bash
# Download from R2
aws s3 sync s3://orion-backups/ ~/backups/ \
  --endpoint-url https://<ACCOUNT_ID>.r2.cloudflarestorage.com \
  --profile r2

# Then restore as usual
bash ~/apps/orion/scripts/restore.sh orion ~/backups/orion/daily/<latest>.sql.gz
```

### 17.6 Verification

```bash
# Backup files exist
ls -lh ~/backups/orion/daily/
ls -lh ~/backups/gitea/daily/

# Backup integrity
gunzip -t ~/backups/orion/daily/*.sql.gz

# Timer is scheduled
systemctl list-timers orion-backup.timer

# R2 sync (if configured)
aws s3 ls s3://orion-backups/ --endpoint-url https://<ACCOUNT_ID>.r2.cloudflarestorage.com --profile r2 --recursive
```

---

## Step 18: Monitoring & Observability

Prometheus + Grafana monitoring stack with host and container metrics.

### Architecture

```
┌──────────────┐     scrape     ┌─────────────────┐
│  Prometheus  │◄───────────────│  Orion API      │  /metrics
│    :9090     │◄───────────────│  node-exporter  │  :9100
│              │◄───────────────│  cAdvisor       │  :8080
└──────┬───────┘                └─────────────────┘
       │ query
┌──────▼───────┐
│   Grafana    │──── https://grafana.wizard.lu
│    :3001     │
└──────────────┘
```
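
The scrape edges in the diagram correspond to a `monitoring/prometheus.yml` along these lines — a sketch assuming the compose service names (`api`, `node-exporter`, `cadvisor`) and the API port resolve on the Docker network; the actual file lives in the repo's `monitoring/` directory:

```yaml
global:
  scrape_interval: 30s              # modest interval keeps CPU/RAM cost low

scrape_configs:
  - job_name: orion-api
    metrics_path: /metrics
    static_configs:
      - targets: ["api:8000"]       # service name and port are assumptions
  - job_name: node-exporter
    static_configs:
      - targets: ["node-exporter:9100"]
  - job_name: cadvisor
    static_configs:
      - targets: ["cadvisor:8080"]
```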

### Resource Budget (4 GB Server)

| Container | RAM Limit | Purpose |
|---|---|---|
| prometheus | 256 MB | Metrics storage (15-day retention, 2 GB max) |
| grafana | 192 MB | Dashboards (SQLite backend) |
| node-exporter | 64 MB | Host CPU/RAM/disk metrics |
| cadvisor | 128 MB | Per-container resource metrics |
| **Total new** | **640 MB** | |

Existing stack ~1.8 GB + 640 MB new = ~2.4 GB. Leaves ~1.6 GB for the OS. If too tight, live-upgrade to CAX21 (8 GB/80 GB, ~7.50 EUR/mo) via **Cloud Console > Server > Rescale** (~2 min restart).
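
Those limits map to `docker-compose.yml` roughly as below — a sketch, since the actual service definitions (image tags, volumes, networks, the `full` profile wiring) live in the repo:

```yaml
# Sketch: monitoring services with the RAM caps from the table above.
services:
  prometheus:
    image: prom/prometheus            # pinned tag in the real compose file
    profiles: ["full"]
    command:
      - "--storage.tsdb.retention.time=15d"
      - "--storage.tsdb.retention.size=2GB"
    ports:
      - "127.0.0.1:9090:9090"         # localhost only; no public exposure
    mem_limit: 256m
  grafana:
    image: grafana/grafana
    profiles: ["full"]
    ports:
      - "127.0.0.1:3001:3000"         # host 3001 -> container 3000
    mem_limit: 192m
  node-exporter:
    image: prom/node-exporter
    profiles: ["full"]
    mem_limit: 64m
  cadvisor:
    image: gcr.io/cadvisor/cadvisor
    profiles: ["full"]
    mem_limit: 128m
```

Binding every published port to `127.0.0.1` keeps the whole stack reachable only through Caddy.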

### 18.1 DNS Record

Add A and AAAA records for `grafana.wizard.lu`:

| Type | Name | Value | TTL |
|---|---|---|---|
| A | `grafana` | `91.99.65.229` | 300 |
| AAAA | `grafana` | `2a01:4f8:1c1a:b39c::1` | 300 |

### 18.2 Caddy Configuration

Add to `/etc/caddy/Caddyfile`:

```caddy
grafana.wizard.lu {
    reverse_proxy localhost:3001
}
```

Reload Caddy:

```bash
sudo systemctl reload caddy
```

### 18.3 Production Environment

Add to `~/apps/orion/.env`:

```bash
ENABLE_METRICS=true
GRAFANA_URL=https://grafana.wizard.lu
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=<strong-password>
```
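
Under the hood, `ENABLE_METRICS=true` gates the `prometheus_client` exposition. A minimal sketch of what the library provides — the counter name and label here are illustrative, not Orion's actual metrics:

```python
from prometheus_client import Counter, generate_latest

# Hypothetical counter; the real app registers its own metrics.
REQUESTS = Counter("orion_http_requests_total", "HTTP requests served", ["path"])

def metrics_payload() -> bytes:
    """Render the default registry in Prometheus text exposition format."""
    return generate_latest()

REQUESTS.labels(path="/health").inc()
print(b"orion_http_requests_total" in metrics_payload())  # prints: True
```

In a FastAPI app the same payload is typically served by mounting `make_asgi_app()` from `prometheus_client` at `/metrics`, guarded by the `ENABLE_METRICS` flag.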

### 18.4 Deploy

```bash
cd ~/apps/orion
docker compose --profile full up -d --build
```

Verify all containers are running:

```bash
docker compose --profile full ps
docker stats --no-stream
```

### 18.5 Grafana First Login

1. Open `https://grafana.wizard.lu`
2. Log in with `admin` / `<password from .env>`
3. Change the default password when prompted

**Import community dashboards:**

- **Node Exporter Full**: Dashboards > Import > ID `1860` > Select Prometheus datasource
- **Docker / cAdvisor**: Dashboards > Import > ID `193` > Select Prometheus datasource
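
The dashboard imports assume a Prometheus datasource already exists; the `monitoring/` directory can provision it at Grafana startup instead of configuring it by hand. A sketch, following Grafana's standard provisioning layout (the exact path inside this repo is an assumption):

```yaml
# monitoring/grafana/provisioning/datasources/prometheus.yml (assumed path)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090     # compose service name on the shared network
    isDefault: true
```

Mounted into the container at `/etc/grafana/provisioning/datasources/`, this makes the datasource appear on first login with no manual setup.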

### 18.6 Verification

```bash
# Prometheus metrics from Orion API
curl -s https://api.wizard.lu/metrics | head -5

# Health endpoints
curl -s https://api.wizard.lu/health/live
curl -s https://api.wizard.lu/health/ready

# Prometheus targets (all should be "up")
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep health

# Grafana accessible
curl -I https://grafana.wizard.lu

# RAM usage within limits
docker stats --no-stream
```

---

## Domain & Port Reference

@@ -801,6 +1095,10 @@ sudo systemctl status gitea-runner

| Redis | 6379 | 6380 | (internal only) |
| Flower | 5555 | 5555 | `flower.wizard.lu` |
| Gitea | 3000 | 3000 | `git.wizard.lu` |
| Prometheus | 9090 | 9090 (localhost) | (internal only) |
| Grafana | 3000 | 3001 (localhost) | `grafana.wizard.lu` |
| Node Exporter | 9100 | 9100 (localhost) | (internal only) |
| cAdvisor | 8080 | 8080 (localhost) | (internal only) |
| Caddy | — | 80, 443 | (reverse proxy) |

!!! note "Single backend, multiple domains"

@@ -810,15 +1108,23 @@ sudo systemctl status gitea-runner

```
~/
├── gitea/
│   └── docker-compose.yml        # Gitea + PostgreSQL
├── apps/
│   └── orion/                    # Orion application
│       ├── .env                  # Production environment
│       ├── docker-compose.yml    # App stack (API, DB, Redis, Celery)
│       ├── docker-compose.yml    # App stack (API, DB, Redis, Celery, monitoring)
│       ├── monitoring/           # Prometheus + Grafana config
│       ├── logs/                 # Application logs
│       ├── uploads/              # User uploads
│       └── exports/              # Export files
├── backups/
│   ├── orion/
│   │   ├── daily/                # 7-day retention
│   │   └── weekly/               # 4-week retention
│   └── gitea/
│       ├── daily/
│       └── weekly/
├── gitea/
│   └── docker-compose.yml        # Gitea + PostgreSQL
└── gitea-runner/                 # CI/CD runner (act_runner v0.2.13)
    ├── act_runner                # symlink → act_runner-0.2.13-linux-arm64
    ├── act_runner-0.2.13-linux-arm64

@@ -930,8 +1236,10 @@ After Caddy is configured:

| API ReDoc | `https://api.wizard.lu/redoc` |
| Admin panel | `https://wizard.lu/admin/login` |
| Health check | `https://api.wizard.lu/health` |
| Prometheus metrics | `https://api.wizard.lu/metrics` |
| Gitea | `https://git.wizard.lu` |
| Flower | `https://flower.wizard.lu` |
| Grafana | `https://grafana.wizard.lu` |
| OMS Platform | `https://oms.lu` (after DNS) |
| Loyalty+ Platform | `https://rewardflow.lu` (after DNS) |