# Incident Response Runbook

Operational runbook for diagnosing and resolving production incidents on the Orion platform.
!!! info "Server Details"
    - Server: Hetzner Cloud CAX11 (4 GB RAM, ARM64)
    - IP: 91.99.65.229
    - App path: `~/apps/orion`
    - Docker profile: `--profile full`
    - Reverse proxy: Caddy 2.10.2 (systemd, not containerized)
    - Domains: wizard.lu, omsflow.lu, rewardflow.lu
## Severity Levels
| Level | Definition | Examples | Response Time | Notification |
|---|---|---|---|---|
| SEV-1 | Platform down, all users affected | API unreachable, database down, server unresponsive | < 15 min | Immediate page |
| SEV-2 | Feature broken, subset of users affected | Celery not processing tasks, one platform domain down, SSL expired | < 1 hour | Slack / email alert |
| SEV-3 | Minor issue, no user impact or minimal degradation | High memory warning, slow queries, disk usage above 70% | < 4 hours | Grafana alert, next business day |
!!! warning "Escalation"
    If a SEV-2 is not resolved within 2 hours, escalate to SEV-1. If a SEV-3 trends toward impacting users, escalate to SEV-2.
## Quick Diagnosis Decision Tree
Follow these steps in order when responding to any incident.
### Step 1: Can you reach the server?

```bash
ssh samir@91.99.65.229
```

- Yes -- proceed to Step 2.
- No -- check your local network. Try from a different connection. If still unreachable, check Hetzner Status and open a support ticket. As a last resort, use the Hetzner Cloud Console rescue mode.
### Step 2: Is Docker running?

```bash
sudo systemctl status docker
```

- Yes -- proceed to Step 3.
- No -- start Docker:

```bash
sudo systemctl start docker
```

### Step 3: Are the containers running?

```bash
cd ~/apps/orion && docker compose --profile full ps
```

Look for containers stuck in `Restarting` or `Exited`, or missing entirely. Healthy output shows every container as `Up` or `Up (healthy)`.
- All healthy -- proceed to Step 4.
- Some down -- go to the relevant runbook below (API, Database, Celery, etc.).
- All down -- go to Runbook 7: Full Stack Restart.
### Step 4: Is Caddy running?

```bash
sudo systemctl status caddy
```

- Yes -- proceed to Step 5.
- No -- go to Runbook 4: Caddy / SSL / Domain Issues.
### Step 5: Are domains resolving?

```bash
dig wizard.lu +short
dig api.wizard.lu +short
dig omsflow.lu +short
dig rewardflow.lu +short
```

All should return 91.99.65.229. If not, check DNS records at your registrar.
### Step 6: Is the API responding?

```bash
curl -s http://localhost:8001/health | python3 -m json.tool
curl -s https://api.wizard.lu/health
```

- Both work -- issue may be intermittent. Check Grafana for recent anomalies.
- localhost works, external fails -- Caddy or DNS issue. Go to Runbook 4.
- Neither works -- API is down. Go to Runbook 1.
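The six steps above can be condensed into a one-shot triage pass. This is a sketch, not an existing script on the server: the commands and endpoints are the ones this runbook already uses, and it assumes you run it on the server itself.

```bash
# check <label> <command...>: print PASS/FAIL and preserve the exit code
check() {
  local label=$1; shift
  if "$@" >/dev/null 2>&1; then
    printf 'PASS  %s\n' "$label"
  else
    printf 'FAIL  %s\n' "$label"
    return 1
  fi
}

# triage: run the main decision-tree checks in order
triage() {
  check "docker daemon"  systemctl is-active --quiet docker
  check "caddy service"  systemctl is-active --quiet caddy
  check "api internal"   curl -sf http://localhost:8001/health
  check "api external"   curl -sf https://api.wizard.lu/health
  true  # triage reports via PASS/FAIL lines rather than exit status
}
```

Any `FAIL` line points you at the corresponding step and runbook above.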
## Runbooks
### 1. API Container Down / Crash-Looping

!!! danger "SEV-1"
    API unavailability affects all users on all platforms.

Symptoms: `api` container shows `Restarting` or `Exited` in `docker compose ps`. External URLs return 502.
Diagnose:

```bash
cd ~/apps/orion
# Check container status
docker compose --profile full ps api
# View recent logs (last 100 lines)
docker compose --profile full logs --tail=100 api
# Look for Python exceptions
docker compose --profile full logs api 2>&1 | grep -i "error\|exception\|traceback" | tail -20
```
Common causes and fixes:
=== "Import / syntax error in code"

    The log will show a Python traceback on startup. This usually means a bad deploy.

    ```bash
    # Roll back to previous commit
    cd ~/apps/orion
    git log --oneline -5
    git checkout <previous-good-commit>
    docker compose --profile full up -d --build api
    ```

=== "Database connection refused"

    The API cannot reach PostgreSQL. See [Runbook 2](#2-database-issues).

=== "Port conflict"

    Another process is using port 8001.

    ```bash
    sudo ss -tlnp | grep 8001
    # Kill the conflicting process, then restart
    docker compose --profile full restart api
    ```

=== "Out of memory"

    The container was OOM-killed. See [Runbook 3](#3-high-memory-oom).
Recovery:

```bash
# Restart the API container
cd ~/apps/orion
docker compose --profile full restart api
# Wait 10 seconds, then verify
sleep 10
docker compose --profile full ps api
curl -s http://localhost:8001/health
```
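The fixed `sleep 10` works, but a polling loop returns as soon as the container is actually healthy and fails loudly if it never becomes healthy. A small sketch (not part of the deploy scripts):

```bash
# wait_for_health <url> [attempts]: poll once per second until the health
# endpoint answers, giving up after [attempts] tries (default 30)
wait_for_health() {
  local url=$1 attempts=${2:-30} i
  for ((i = 1; i <= attempts; i++)); do
    if curl -sf "$url" >/dev/null 2>&1; then
      echo "healthy after ${i} attempt(s)"
      return 0
    fi
    sleep 1
  done
  echo "still unhealthy after ${attempts} attempts" >&2
  return 1
}
# e.g. docker compose --profile full restart api && \
#      wait_for_health http://localhost:8001/health 30
```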
### 2. Database Issues

!!! danger "SEV-1"
    Database unavailability brings down the entire platform.

Symptoms: API logs show `connection refused`, `could not connect to server`, or `OperationalError`. Health check fails with database errors.
Diagnose:

```bash
cd ~/apps/orion
# Check PostgreSQL container
docker compose --profile full ps db
docker compose --profile full logs --tail=50 db
# Test connection from inside the network
docker compose --profile full exec db pg_isready -U orion_user -d orion_db
# Check disk space (PostgreSQL needs space for WAL)
df -h
docker system df
```
Common causes and fixes:
=== "Container stopped"

    ```bash
    cd ~/apps/orion
    docker compose --profile full up -d db
    sleep 5
    docker compose --profile full exec db pg_isready -U orion_user -d orion_db
    # Once healthy, restart the API
    docker compose --profile full restart api celery-worker celery-beat
    ```

=== "Too many connections"

    ```bash
    # Check active connections
    docker compose --profile full exec db \
      psql -U orion_user -d orion_db -c \
      "SELECT count(*) FROM pg_stat_activity;"
    # Kill idle connections
    docker compose --profile full exec db \
      psql -U orion_user -d orion_db -c \
      "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '10 minutes';"
    ```

=== "Disk full (WAL or data)"

    See [Runbook 6: Disk Full](#6-disk-full).

=== "Data corruption (last resort)"

    If PostgreSQL refuses to start with corruption errors:

    ```bash
    # Stop everything
    cd ~/apps/orion
    docker compose --profile full down
    # Restore from backup (see Runbook 8)
    bash ~/apps/orion/scripts/restore.sh orion ~/backups/orion/daily/<latest>.sql.gz
    ```
Check for slow queries:

```bash
docker compose --profile full exec db \
  psql -U orion_user -d orion_db -c \
  "SELECT pid, now() - query_start AS duration, left(query, 80)
   FROM pg_stat_activity
   WHERE state != 'idle'
   ORDER BY duration DESC
   LIMIT 10;"
```
Kill a stuck query:

```bash
docker compose --profile full exec db \
  psql -U orion_user -d orion_db -c \
  "SELECT pg_terminate_backend(<PID>);"
```
### 3. High Memory / OOM

!!! warning "SEV-2 (can escalate to SEV-1 if the OOM killer fires)"
    The server has 4 GB RAM. Normal usage is ~2.4 GB. Above 3.2 GB is critical.
Symptoms: Containers restarting unexpectedly. dmesg shows OOM killer. Grafana memory graphs spiking.
Diagnose:

```bash
# System memory
free -h
# Per-container memory usage
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"
# Check for OOM kills
sudo dmesg | grep -i "oom\|killed" | tail -10
# Top processes by memory
ps aux --sort=-%mem | head -15
```
Immediate relief:

```bash
# Clear Docker build cache
docker builder prune -f
# Remove unused images
docker image prune -f
# Remove stopped containers
docker container prune -f
# Nuclear option: remove all unused Docker data
docker system prune -f
```
If a specific container is the culprit:

```bash
cd ~/apps/orion
# Restart the offending container
docker compose --profile full restart <container-name>
# If the API is leaking memory, a restart is the fastest fix
docker compose --profile full restart api
```
If CI jobs are running (they add ~550 MB temporarily):

```bash
# Check if a Gitea Actions runner job is active
sudo systemctl status gitea-runner
# Wait for the job to finish, or stop the runner temporarily
sudo systemctl stop gitea-runner
```
!!! tip "Long-term fix"
    If OOM events become frequent, upgrade to CAX21 (8 GB RAM, ~7.50 EUR/mo) via Hetzner Cloud Console > Server > Rescale. The upgrade takes about 2 minutes and preserves all data.
### 4. Caddy / SSL / Domain Issues

!!! warning "SEV-2"
    Caddy handles TLS termination and routing for all domains. If Caddy is down, all external access is lost even though the API may be running fine internally.

Symptoms: Sites return `connection refused` on port 443. SSL certificate errors in the browser. A specific domain not working.
Diagnose:

```bash
# Check Caddy status
sudo systemctl status caddy
# View Caddy logs
sudo journalctl -u caddy --since "30 minutes ago" --no-pager
# Test internal API directly (bypasses Caddy)
curl -s http://localhost:8001/health
# Test SSL certificates
curl -vI https://wizard.lu 2>&1 | grep -E "SSL|subject|expire"
curl -vI https://api.wizard.lu 2>&1 | grep -E "SSL|subject|expire"
```
Common causes and fixes:
=== "Caddy stopped"

    ```bash
    sudo systemctl start caddy
    sudo systemctl status caddy
    ```

=== "Caddyfile syntax error"

    ```bash
    # Validate configuration
    sudo caddy validate --config /etc/caddy/Caddyfile
    # If invalid, check recent changes
    sudo nano /etc/caddy/Caddyfile
    # After fixing, reload (not restart; reload preserves certificates)
    sudo systemctl reload caddy
    ```

=== "SSL certificate issue"

    Caddy auto-renews certificates. If renewal fails, it is usually a port 80 or DNS issue.

    ```bash
    # Ensure port 80 is open (needed for the ACME HTTP challenge)
    sudo ufw status | grep 80
    # Check Caddy certificate storage
    sudo ls -la /var/lib/caddy/.local/share/caddy/certificates/
    # Force certificate renewal by restarting Caddy
    sudo systemctl restart caddy
    ```

=== "DNS not pointing to server"

    ```bash
    dig wizard.lu +short
    # Should return 91.99.65.229
    # If wrong, update DNS at registrar and wait for propagation
    # Temporary: test by adding to /etc/hosts on your local machine
    ```
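Since Caddy normally renews certificates well before they expire, a low days-remaining count is itself a signal that renewal is failing. A quick expiry check, sketched with `openssl` (this helper is not part of the server's scripts, and the date arithmetic assumes GNU `date`):

```bash
# days_until "<date string>": whole days from now until that date (GNU date)
days_until() {
  local target now
  target=$(date -d "$1" +%s)
  now=$(date +%s)
  echo $(( (target - now) / 86400 ))
}

# cert_days_left <host>: days until the certificate served on :443 expires
cert_days_left() {
  local end
  end=$(echo | openssl s_client -servername "$1" -connect "$1:443" 2>/dev/null \
        | openssl x509 -noout -enddate | cut -d= -f2)
  days_until "$end"
}
# e.g. for d in wizard.lu api.wizard.lu omsflow.lu rewardflow.lu; do
#   echo "$d: $(cert_days_left "$d") days"
# done
```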
Caddyfile reference (at `/etc/caddy/Caddyfile`):

```bash
sudo cat /etc/caddy/Caddyfile
```
### 5. Celery Worker Issues

!!! warning "SEV-2"
    Celery processes background tasks (imports, emails, scheduled jobs). If it is down, no background work happens, but the platform remains browsable.
Symptoms: Background tasks not executing. Flower shows no active workers. Emails not being sent.
Diagnose:

```bash
cd ~/apps/orion
# Check worker and beat containers
docker compose --profile full ps celery-worker celery-beat
# View worker logs
docker compose --profile full logs --tail=50 celery-worker
docker compose --profile full logs --tail=50 celery-beat
# Check Redis (the broker)
docker compose --profile full exec redis redis-cli ping
docker compose --profile full exec redis redis-cli llen celery
# Check Flower for worker status
curl -s http://localhost:5555/api/workers | python3 -m json.tool
```
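To see *which* task types are piling up, not just how many, you can decode the queued messages. A sketch assuming Celery's default JSON message format on a Redis broker, where the task name lives under `headers.task` of each queued message:

```bash
# task_names: read queued Celery messages (one JSON document per line,
# as printed by `redis-cli lrange`) and print each message's task name
task_names() {
  python3 -c '
import json, sys
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        msg = json.loads(line)
    except ValueError:
        continue
    print(msg.get("headers", {}).get("task", "<unknown>"))
'
}
# e.g. docker compose --profile full exec redis \
#        redis-cli lrange celery 0 9 | task_names | sort | uniq -c
```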
Common causes and fixes:
=== "Worker crashed / import error"

    ```bash
    # Check for Python errors in worker logs
    docker compose --profile full logs celery-worker 2>&1 | grep -i "error\|exception" | tail -10
    # Restart worker
    cd ~/apps/orion
    docker compose --profile full restart celery-worker celery-beat
    ```

=== "Redis down"

    ```bash
    # Check Redis container
    docker compose --profile full ps redis
    docker compose --profile full logs --tail=20 redis
    # Restart Redis, then the workers
    cd ~/apps/orion
    docker compose --profile full restart redis
    sleep 5
    docker compose --profile full restart celery-worker celery-beat
    ```

=== "Task queue backed up"

    ```bash
    # Check queue length
    docker compose --profile full exec redis redis-cli llen celery
    # If the queue is extremely large and the tasks are stale, purge it
    docker compose --profile full exec api \
      celery -A app.core.celery_app purge -f
    # Restart the worker to start fresh
    docker compose --profile full restart celery-worker
    ```

=== "Beat scheduler out of sync"

    ```bash
    # Remove the beat schedule file and restart
    docker compose --profile full exec celery-beat rm -f /app/celerybeat-schedule
    docker compose --profile full restart celery-beat
    ```
### 6. Disk Full

!!! warning "SEV-2 (becomes SEV-1 if PostgreSQL cannot write WAL)"
    The server has a 37 GB disk. Docker images, logs, and database WAL can fill it quickly.

Symptoms: Write errors in logs. PostgreSQL panics. Docker cannot pull images. `No space left on device` errors.
Diagnose:

```bash
# Overall disk usage
df -h /
# Docker disk usage breakdown
docker system df
# Largest directories
sudo du -sh /var/lib/docker/* 2>/dev/null | sort -rh | head -10
du -sh ~/backups/* 2>/dev/null
du -sh ~/apps/orion/logs/* 2>/dev/null
```
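When the directory-level view is not conclusive, a per-file search often finds the culprit (a runaway log, a forgotten dump). A small helper, assuming GNU `find` and `sort -h`:

```bash
# big_files <dir> [size]: list the 10 largest files under <dir> above
# [size] (default 200M); prefix with sudo for system paths
big_files() {
  find "$1" -xdev -type f -size "+${2:-200M}" -exec du -h {} + 2>/dev/null \
    | sort -rh | head -10
}
# e.g. big_files ~/apps/orion/logs 50M
```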
Immediate cleanup:

```bash
# 1. Remove old Docker images and build cache (safe, usually frees 2-5 GB).
#    Do NOT add --volumes: named volumes hold the PostgreSQL data.
docker system prune -af
# 2. Truncate application logs
cd ~/apps/orion
truncate -s 0 logs/*.log 2>/dev/null
# 3. Remove old backups beyond retention policy
find ~/backups -name "*.sql.gz" -mtime +14 -delete
# 4. Clean systemd journal logs (keep last 3 days)
sudo journalctl --vacuum-time=3d
# 5. Clean apt cache
sudo apt clean
```
After freeing space:

```bash
# Verify space recovered
df -h /
# Restart any containers that failed due to disk full
cd ~/apps/orion
docker compose --profile full up -d
```
!!! tip "Prevention"
    Set up a Grafana alert for disk usage > 70%. The node-exporter dashboard (ID 1860) includes disk usage panels. If the server persistently runs low on disk, upgrade to CAX21 (80 GB disk).
### 7. Full Stack Restart (After Reboot)

!!! info "SEV-2"
    After a server reboot (planned or unplanned), all services need to come back up in the correct order.
When to use: After a Hetzner maintenance reboot, manual reboot, or kernel upgrade.
Step-by-step recovery:

```bash
# 1. Verify Docker is running
sudo systemctl status docker
# If not: sudo systemctl start docker

# 2. Start Gitea (needed for CI, not for the app itself)
cd ~/gitea && docker compose up -d
sleep 5

# 3. Start the Orion stack (db and redis start first due to depends_on)
cd ~/apps/orion
docker compose --profile full up -d
sleep 15

# 4. Verify all containers are healthy
docker compose --profile full ps

# 5. Verify API health
curl -s http://localhost:8001/health | python3 -m json.tool

# 6. Start Caddy (should auto-start, but verify)
sudo systemctl status caddy
# If not running: sudo systemctl start caddy

# 7. Start the Gitea Actions runner
sudo systemctl status gitea-runner
# If not running: sudo systemctl start gitea-runner

# 8. Verify external access
curl -s https://api.wizard.lu/health
curl -I https://wizard.lu
curl -I https://omsflow.lu
curl -I https://rewardflow.lu

# 9. Verify monitoring
curl -I https://grafana.wizard.lu
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep -c '"health":"up"'

# 10. Verify backups timer is active
systemctl list-timers orion-backup.timer
```
!!! note "Boot order"
    Docker containers with `restart: always` will auto-start after Docker starts. Caddy and the Gitea runner are systemd services with `WantedBy=multi-user.target` and also auto-start. In practice, you mainly need to verify rather than manually start.
### 8. Restore from Backup (Disaster Recovery)

!!! danger "SEV-1"
    Use this runbook when the database is corrupted or data is lost and you need to restore from a backup.

Prerequisites: Identify the backup to restore from.
```bash
# List available local backups
ls -lh ~/backups/orion/daily/
ls -lh ~/backups/orion/weekly/
# If local backups are gone, list what is available in R2
source ~/apps/orion/.env
aws s3 ls s3://orion-backups/orion/daily/ \
  --endpoint-url "https://${R2_ACCOUNT_ID:-$(grep R2_ACCOUNT_ID ~/apps/orion/.env | cut -d= -f2)}.r2.cloudflarestorage.com" \
  --profile r2
```
Download from R2 (if local backups are unavailable):

```bash
aws s3 sync s3://orion-backups/ ~/backups/ \
  --endpoint-url "https://<ACCOUNT_ID>.r2.cloudflarestorage.com" \
  --profile r2
```
Restore using the restore script:

```bash
# Restore the Orion database
bash ~/apps/orion/scripts/restore.sh orion ~/backups/orion/daily/<backup-file>.sql.gz
```
The restore script will:

- Stop application containers (API, Celery) while keeping the database running
- Drop and recreate the `orion_db` database
- Restore from the `.sql.gz` backup file
- Run `alembic upgrade heads` to apply any pending migrations
- Restart all containers
Verify after restore:

```bash
cd ~/apps/orion
# Check API health
curl -s http://localhost:8001/health | python3 -m json.tool
# Verify data integrity (check row counts of key tables)
docker compose --profile full exec db \
  psql -U orion_user -d orion_db -c \
  "SELECT 'platforms' AS tbl, count(*) FROM platforms
   UNION ALL SELECT 'users', count(*) FROM users
   UNION ALL SELECT 'stores', count(*) FROM stores;"
# Verify external access
curl -s https://api.wizard.lu/health
```
Restore Gitea (if needed):

```bash
bash ~/apps/orion/scripts/restore.sh gitea ~/backups/gitea/daily/<backup-file>.sql.gz
```
Full server rebuild from a Hetzner snapshot (worst case):

1. Go to Hetzner Cloud Console > Servers > Snapshots
2. Select the most recent snapshot and click Rebuild from snapshot
3. After the rebuild, SSH in and verify all services per Runbook 7
## Post-Incident Report Template
After resolving any SEV-1 or SEV-2 incident, create a post-incident report. Save reports in a shared location for the team.
```markdown
# Post-Incident Report: [Brief Title]

**Date**: YYYY-MM-DD
**Severity**: SEV-1 / SEV-2
**Duration**: HH:MM (from detection to resolution)
**Author**: [Name]

## Incident Summary

[1-2 sentence description of what happened and the user impact.]

## Timeline (UTC)

| Time | Event |
|-------|--------------------------------------------|
| HH:MM | Alert triggered / issue reported |
| HH:MM | Responder acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Fix applied |
| HH:MM | Service fully restored |

## Root Cause

[What caused the incident. Be specific -- e.g., "OOM killer terminated the API container because a Celery import task loaded 50k products into memory at once."]

## Resolution

[What was done to fix it. Include exact commands if relevant.]

## Impact

- **Users affected**: [number or scope]
- **Data lost**: [none / describe]
- **Downtime**: [duration]

## Action Items

| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| [Preventive measure 1] | [Name] | YYYY-MM-DD | [ ] Open |
| [Preventive measure 2] | [Name] | YYYY-MM-DD | [ ] Open |

## Lessons Learned

[What went well, what could be improved in the response process.]
```
## Useful Monitoring URLs
| Service | URL | Purpose |
|---|---|---|
| Grafana | grafana.wizard.lu | Dashboards for host metrics, container metrics |
| Prometheus | http://localhost:9090 (SSH tunnel) | Raw metrics queries, target health |
| Prometheus Targets | http://localhost:9090/targets | Check which scrape targets are up/down |
| API Health | api.wizard.lu/health | Application health check (DB, Redis) |
| API Liveness | api.wizard.lu/health/live | Basic liveness probe |
| API Readiness | api.wizard.lu/health/ready | Readiness probe (includes dependencies) |
| API Metrics | api.wizard.lu/metrics | Prometheus-format application metrics |
| Flower | flower.wizard.lu | Celery task monitoring, worker status |
| Gitea | git.wizard.lu | Git repository and CI pipeline status |
| Main Platform | wizard.lu | Main storefront |
| OMS Platform | omsflow.lu | OMS storefront |
| Loyalty+ Platform | rewardflow.lu | Loyalty+ storefront |
| Hetzner Console | console.hetzner.cloud | Server management, snapshots, rescue mode |
| Hetzner Status | status.hetzner.com | Hetzner infrastructure status |
!!! tip "SSH tunnel for internal services"
    Prometheus and other internal services are not exposed externally. To access them from your local machine:

    ```bash
    # Prometheus (localhost:9090 on server → localhost:9090 on your machine)
    ssh -L 9090:localhost:9090 samir@91.99.65.229
    # Then open http://localhost:9090 in your browser
    ```
## Quick Reference: Essential Commands

```bash
# SSH into the server
ssh samir@91.99.65.229
# Container status
cd ~/apps/orion && docker compose --profile full ps
# Container resource usage
docker stats --no-stream
# Follow all logs
cd ~/apps/orion && docker compose --profile full logs -f
# Restart a single service
cd ~/apps/orion && docker compose --profile full restart <service>
# Full stack rebuild
cd ~/apps/orion && docker compose --profile full up -d --build
# Caddy status / logs
sudo systemctl status caddy
sudo journalctl -u caddy -f
# System resources
free -h && df -h / && uptime
# Manual deploy
cd ~/apps/orion && bash scripts/deploy.sh
# Manual backup
bash ~/apps/orion/scripts/backup.sh --upload
# Run migrations
cd ~/apps/orion && docker compose --profile full exec -e PYTHONPATH=/app api python -m alembic upgrade heads
```
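Nearly every command above starts with `cd ~/apps/orion`. A small convenience function (an assumption for your shell profile, not part of the server setup) lets you run compose commands from any directory:

```bash
# oc: run compose against the Orion project from any directory
oc() {
  docker compose --project-directory ~/apps/orion --profile full "$@"
}
# e.g. oc ps; oc logs -f api; oc restart celery-worker
```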