# Incident Response Runbook

Operational runbook for diagnosing and resolving production incidents on the Orion platform.
!!! info "Server Details"
    - Server: Hetzner Cloud CAX11 (4 GB RAM, ARM64)
    - IP: 91.99.65.229
    - App path: `~/apps/orion`
    - Docker profile: `--profile full`
    - Reverse proxy: Caddy 2.10.2 (systemd, not containerized)
    - Domains: wizard.lu, omsflow.lu, rewardflow.lu
## Severity Levels
| Level | Definition | Examples | Response Time | Notification |
|---|---|---|---|---|
| SEV-1 | Platform down, all users affected | API unreachable, database down, server unresponsive | < 15 min | Immediate page |
| SEV-2 | Feature broken, subset of users affected | Celery not processing tasks, one platform domain down, SSL expired | < 1 hour | Slack / email alert |
| SEV-3 | Minor issue, no user impact or minimal degradation | High memory warning, slow queries, disk usage above 70% | < 4 hours | Grafana alert, next business day |
!!! warning "Escalation"
    If a SEV-2 is not resolved within 2 hours, escalate to SEV-1. If a SEV-3 trends toward impacting users, escalate to SEV-2.
## Quick Diagnosis Decision Tree
Follow these steps in order when responding to any incident.
### Step 1: Can you reach the server?

```bash
ssh samir@91.99.65.229
```

- Yes -- proceed to Step 2.
- No -- check your local network. Try from a different connection. If still unreachable, check Hetzner Status and open a support ticket. As a last resort, use the Hetzner Cloud Console rescue mode.
### Step 2: Is Docker running?

```bash
sudo systemctl status docker
```

- Yes -- proceed to Step 3.
- No -- start Docker:

```bash
sudo systemctl start docker
```

### Step 3: Are the containers running?

```bash
cd ~/apps/orion && docker compose --profile full ps
```

Look for containers stuck in `Restarting` or `Exited`, or missing entirely. Healthy output shows every container as `Up` or `Up (healthy)`.
- All healthy -- proceed to Step 4.
- Some down -- go to the relevant runbook below (API, Database, Celery, etc.).
- All down -- go to Runbook 7: Full Stack Restart.
### Step 4: Is Caddy running?

```bash
sudo systemctl status caddy
```

- Yes -- proceed to Step 5.
- No -- go to Runbook 4: Caddy / SSL / Domain Issues.
### Step 5: Are domains resolving?

```bash
dig wizard.lu +short
dig api.wizard.lu +short
dig omsflow.lu +short
dig rewardflow.lu +short
```

All should return 91.99.65.229. If not, check DNS records at your registrar.
### Step 6: Is the API responding?

```bash
curl -s http://localhost:8001/health | python3 -m json.tool
curl -s https://api.wizard.lu/health
```

- Both work -- issue may be intermittent. Check Grafana for recent anomalies.
- localhost works, external fails -- Caddy or DNS issue. Go to Runbook 4.
- Neither works -- API is down. Go to Runbook 1.
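The six steps above can be condensed into a one-shot triage pass. This is a sketch, not an existing script on the server: the commands and endpoints are the ones this runbook already uses, and it assumes you run it on the server itself.

```bash
# check <label> <command...>: print PASS/FAIL and preserve the exit code
check() {
  local label=$1; shift
  if "$@" >/dev/null 2>&1; then
    printf 'PASS  %s\n' "$label"
  else
    printf 'FAIL  %s\n' "$label"
    return 1
  fi
}

# triage: run the main decision-tree checks in order
triage() {
  check "docker daemon"  systemctl is-active --quiet docker
  check "caddy service"  systemctl is-active --quiet caddy
  check "api internal"   curl -sf http://localhost:8001/health
  check "api external"   curl -sf https://api.wizard.lu/health
  true  # triage reports via PASS/FAIL lines rather than exit status
}
```

Any `FAIL` line points you at the corresponding step and runbook above.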
## Runbooks
### 1. API Container Down / Crash-Looping

!!! danger "SEV-1"
    API unavailability affects all users on all platforms.

Symptoms: `api` container shows `Restarting` or `Exited` in `docker compose ps`. External URLs return 502.
Diagnose:

```bash
cd ~/apps/orion
# Check container status
docker compose --profile full ps api
# View recent logs (last 100 lines)
docker compose --profile full logs --tail=100 api
# Look for Python exceptions
docker compose --profile full logs api 2>&1 | grep -i "error\|exception\|traceback" | tail -20
```
Common causes and fixes:
=== "Import / syntax error in code"

    The log will show a Python traceback on startup. This usually means a bad deploy.

    ```bash
    # Roll back to previous commit
    cd ~/apps/orion
    git log --oneline -5
    git checkout <previous-good-commit>
    docker compose --profile full up -d --build api
    ```

=== "Database connection refused"

    The API cannot reach PostgreSQL. See [Runbook 2](#2-database-issues).

=== "Port conflict"

    Another process is using port 8001.

    ```bash
    sudo ss -tlnp | grep 8001
    # Kill the conflicting process, then restart
    docker compose --profile full restart api
    ```

=== "Out of memory"

    The container was OOM-killed. See [Runbook 3](#3-high-memory-oom).
Recovery:

```bash
# Restart the API container
cd ~/apps/orion
docker compose --profile full restart api
# Wait 10 seconds, then verify
sleep 10
docker compose --profile full ps api
curl -s http://localhost:8001/health
```
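The fixed `sleep 10` works, but a polling loop returns as soon as the container is actually healthy and fails loudly if it never becomes healthy. A small sketch (not part of the deploy scripts):

```bash
# wait_for_health <url> [attempts]: poll once per second until the health
# endpoint answers, giving up after [attempts] tries (default 30)
wait_for_health() {
  local url=$1 attempts=${2:-30} i
  for ((i = 1; i <= attempts; i++)); do
    if curl -sf "$url" >/dev/null 2>&1; then
      echo "healthy after ${i} attempt(s)"
      return 0
    fi
    sleep 1
  done
  echo "still unhealthy after ${attempts} attempts" >&2
  return 1
}
# e.g. docker compose --profile full restart api && \
#      wait_for_health http://localhost:8001/health 30
```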
### 2. Database Issues

!!! danger "SEV-1"
    Database unavailability brings down the entire platform.

Symptoms: API logs show `connection refused`, `could not connect to server`, or `OperationalError`. Health check fails with database errors.
Diagnose:

```bash
cd ~/apps/orion
# Check PostgreSQL container
docker compose --profile full ps db
docker compose --profile full logs --tail=50 db
# Test connection from inside the network
docker compose --profile full exec db pg_isready -U orion_user -d orion_db
# Check disk space (PostgreSQL needs space for WAL)
df -h
docker system df
```
Common causes and fixes:
=== "Container stopped"

    ```bash
    cd ~/apps/orion
    docker compose --profile full up -d db
    sleep 5
    docker compose --profile full exec db pg_isready -U orion_user -d orion_db
    # Once healthy, restart the API
    docker compose --profile full restart api celery-worker celery-beat
    ```

=== "Too many connections"

    ```bash
    # Check active connections
    docker compose --profile full exec db \
      psql -U orion_user -d orion_db -c \
      "SELECT count(*) FROM pg_stat_activity;"
    # Kill idle connections
    docker compose --profile full exec db \
      psql -U orion_user -d orion_db -c \
      "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '10 minutes';"
    ```

=== "Disk full (WAL or data)"

    See [Runbook 6: Disk Full](#6-disk-full).

=== "Data corruption (last resort)"

    If PostgreSQL refuses to start with corruption errors:

    ```bash
    # Stop everything
    cd ~/apps/orion
    docker compose --profile full down
    # Restore from backup (see Runbook 8)
    bash ~/apps/orion/scripts/restore.sh orion ~/backups/orion/daily/<latest>.sql.gz
    ```
Check for slow queries:

```bash
docker compose --profile full exec db \
  psql -U orion_user -d orion_db -c \
  "SELECT pid, now() - query_start AS duration, left(query, 80)
   FROM pg_stat_activity
   WHERE state != 'idle'
   ORDER BY duration DESC
   LIMIT 10;"
```
Kill a stuck query:

```bash
docker compose --profile full exec db \
  psql -U orion_user -d orion_db -c \
  "SELECT pg_terminate_backend(<PID>);"
```
### 3. High Memory / OOM

!!! warning "SEV-2 (can escalate to SEV-1 if the OOM killer fires)"
    The server has 4 GB RAM. Normal usage is ~2.4 GB. Above 3.2 GB is critical.
Symptoms: Containers restarting unexpectedly. dmesg shows OOM killer. Grafana memory graphs spiking.
Diagnose:

```bash
# System memory
free -h
# Per-container memory usage
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"
# Check for OOM kills
sudo dmesg | grep -i "oom\|killed" | tail -10
# Top processes by memory
ps aux --sort=-%mem | head -15
```
Immediate relief:

```bash
# Clear Docker build cache
docker builder prune -f
# Remove unused images
docker image prune -f
# Remove stopped containers
docker container prune -f
# Nuclear option: remove all unused Docker data
docker system prune -f
```
If a specific container is the culprit:

```bash
cd ~/apps/orion
# Restart the offending container
docker compose --profile full restart <container-name>
# If the API is leaking memory, a restart is the fastest fix
docker compose --profile full restart api
```
If CI jobs are running (they add ~550 MB temporarily):

```bash
# Check if a Gitea Actions runner job is active
sudo systemctl status gitea-runner
# Wait for the job to finish, or stop the runner temporarily
sudo systemctl stop gitea-runner
```
!!! tip "Long-term fix"
    If OOM events become frequent, upgrade to CAX21 (8 GB RAM, ~7.50 EUR/mo) via Hetzner Cloud Console > Server > Rescale. The upgrade takes about 2 minutes and preserves all data.
### 4. Caddy / SSL / Domain Issues

!!! warning "SEV-2"
    Caddy handles TLS termination and routing for all domains. If Caddy is down, all external access is lost even though the API may be running fine internally.

Symptoms: Sites return `connection refused` on port 443. SSL certificate errors in the browser. A specific domain not working.
Diagnose:

```bash
# Check Caddy status
sudo systemctl status caddy
# View Caddy logs
sudo journalctl -u caddy --since "30 minutes ago" --no-pager
# Test internal API directly (bypasses Caddy)
curl -s http://localhost:8001/health
# Test SSL certificates
curl -vI https://wizard.lu 2>&1 | grep -E "SSL|subject|expire"
curl -vI https://api.wizard.lu 2>&1 | grep -E "SSL|subject|expire"
```
Common causes and fixes:
=== "Caddy stopped"

    ```bash
    sudo systemctl start caddy
    sudo systemctl status caddy
    ```

=== "Caddyfile syntax error"

    ```bash
    # Validate configuration
    sudo caddy validate --config /etc/caddy/Caddyfile
    # If invalid, check recent changes
    sudo nano /etc/caddy/Caddyfile
    # After fixing, reload (not restart; reload preserves certificates)
    sudo systemctl reload caddy
    ```

=== "SSL certificate issue"

    Caddy auto-renews certificates. If renewal fails, it is usually a port 80 or DNS issue.

    ```bash
    # Ensure port 80 is open (needed for the ACME HTTP challenge)
    sudo ufw status | grep 80
    # Check Caddy certificate storage
    sudo ls -la /var/lib/caddy/.local/share/caddy/certificates/
    # Force certificate renewal by restarting Caddy
    sudo systemctl restart caddy
    ```

=== "DNS not pointing to server"

    ```bash
    dig wizard.lu +short
    # Should return 91.99.65.229
    # If wrong, update DNS at registrar and wait for propagation
    # Temporary: test by adding to /etc/hosts on your local machine
    ```
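Since Caddy normally renews certificates well before they expire, a low days-remaining count is itself a signal that renewal is failing. A quick expiry check, sketched with `openssl` (this helper is not part of the server's scripts, and the date arithmetic assumes GNU `date`):

```bash
# days_until "<date string>": whole days from now until that date (GNU date)
days_until() {
  local target now
  target=$(date -d "$1" +%s)
  now=$(date +%s)
  echo $(( (target - now) / 86400 ))
}

# cert_days_left <host>: days until the certificate served on :443 expires
cert_days_left() {
  local end
  end=$(echo | openssl s_client -servername "$1" -connect "$1:443" 2>/dev/null \
        | openssl x509 -noout -enddate | cut -d= -f2)
  days_until "$end"
}
# e.g. for d in wizard.lu api.wizard.lu omsflow.lu rewardflow.lu; do
#   echo "$d: $(cert_days_left "$d") days"
# done
```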
Caddyfile reference (at `/etc/caddy/Caddyfile`):

```bash
sudo cat /etc/caddy/Caddyfile
```
### 5. Celery Worker Issues

!!! warning "SEV-2"
    Celery processes background tasks (imports, emails, scheduled jobs). If it is down, no background work happens, but the platform remains browsable.
Symptoms: Background tasks not executing. Flower shows no active workers. Emails not being sent.
Diagnose:

```bash
cd ~/apps/orion
# Check worker and beat containers
docker compose --profile full ps celery-worker celery-beat
# View worker logs
docker compose --profile full logs --tail=50 celery-worker
docker compose --profile full logs --tail=50 celery-beat
# Check Redis (the broker)
docker compose --profile full exec redis redis-cli ping
docker compose --profile full exec redis redis-cli llen celery
# Check Flower for worker status
curl -s http://localhost:5555/api/workers | python3 -m json.tool
```
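To see *which* task types are piling up, not just how many, you can decode the queued messages. A sketch assuming Celery's default JSON message format on a Redis broker, where the task name lives under `headers.task` of each queued message:

```bash
# task_names: read queued Celery messages (one JSON document per line,
# as printed by `redis-cli lrange`) and print each message's task name
task_names() {
  python3 -c '
import json, sys
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        msg = json.loads(line)
    except ValueError:
        continue
    print(msg.get("headers", {}).get("task", "<unknown>"))
'
}
# e.g. docker compose --profile full exec redis \
#        redis-cli lrange celery 0 9 | task_names | sort | uniq -c
```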
Common causes and fixes:
=== "Worker crashed / import error"

    ```bash
    # Check for Python errors in worker logs
    docker compose --profile full logs celery-worker 2>&1 | grep -i "error\|exception" | tail -10
    # Restart worker
    cd ~/apps/orion
    docker compose --profile full restart celery-worker celery-beat
    ```

=== "Redis down"

    ```bash
    # Check Redis container
    docker compose --profile full ps redis
    docker compose --profile full logs --tail=20 redis
    # Restart Redis, then the workers
    cd ~/apps/orion
    docker compose --profile full restart redis
    sleep 5
    docker compose --profile full restart celery-worker celery-beat
    ```

=== "Task queue backed up"

    ```bash
    # Check queue length
    docker compose --profile full exec redis redis-cli llen celery
    # If the queue is extremely large and the tasks are stale, purge it
    docker compose --profile full exec api \
      celery -A app.core.celery_app purge -f
    # Restart the worker to start fresh
    docker compose --profile full restart celery-worker
    ```

=== "Beat scheduler out of sync"

    ```bash
    # Remove the beat schedule file and restart
    docker compose --profile full exec celery-beat rm -f /app/celerybeat-schedule
    docker compose --profile full restart celery-beat
    ```
### 6. Disk Full

!!! warning "SEV-2 (becomes SEV-1 if PostgreSQL cannot write WAL)"
    The server has a 37 GB disk. Docker images, logs, and database WAL can fill it quickly.

Symptoms: Write errors in logs. PostgreSQL panics. Docker cannot pull images. `No space left on device` errors.
Diagnose:

```bash
# Overall disk usage
df -h /
# Docker disk usage breakdown
docker system df
# Largest directories
sudo du -sh /var/lib/docker/* 2>/dev/null | sort -rh | head -10
du -sh ~/backups/* 2>/dev/null
du -sh ~/apps/orion/logs/* 2>/dev/null
```
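When the directory-level view is not conclusive, a per-file search often finds the culprit (a runaway log, a forgotten dump). A small helper, assuming GNU `find` and `sort -h`:

```bash
# big_files <dir> [size]: list the 10 largest files under <dir> above
# [size] (default 200M); prefix with sudo for system paths
big_files() {
  find "$1" -xdev -type f -size "+${2:-200M}" -exec du -h {} + 2>/dev/null \
    | sort -rh | head -10
}
# e.g. big_files ~/apps/orion/logs 50M
```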
Immediate cleanup:

```bash
# 1. Remove old Docker images and build cache (safe, usually frees 2-5 GB).
#    Do NOT add --volumes: named volumes hold the PostgreSQL data.
docker system prune -af
# 2. Truncate application logs
cd ~/apps/orion
truncate -s 0 logs/*.log 2>/dev/null
# 3. Remove old backups beyond retention policy
find ~/backups -name "*.sql.gz" -mtime +14 -delete
# 4. Clean systemd journal logs (keep last 3 days)
sudo journalctl --vacuum-time=3d
# 5. Clean apt cache
sudo apt clean
```
After freeing space:

```bash
# Verify space recovered
df -h /
# Restart any containers that failed due to disk full
cd ~/apps/orion
docker compose --profile full up -d
```
!!! tip "Prevention"
    Set up a Grafana alert for disk usage > 70%. The node-exporter dashboard (ID 1860) includes disk usage panels. If the server persistently runs low on disk, upgrade to CAX21 (80 GB disk).
### 7. Full Stack Restart (After Reboot)

!!! info "SEV-2"
    After a server reboot (planned or unplanned), all services need to come back up in the correct order.
When to use: After a Hetzner maintenance reboot, manual reboot, or kernel upgrade.
Step-by-step recovery:

```bash
# 1. Verify Docker is running
sudo systemctl status docker
# If not: sudo systemctl start docker

# 2. Start Gitea (needed for CI, not for the app itself)
cd ~/gitea && docker compose up -d
sleep 5

# 3. Start the Orion stack (db and redis start first due to depends_on)
cd ~/apps/orion
docker compose --profile full up -d
sleep 15

# 4. Verify all containers are healthy
docker compose --profile full ps

# 5. Verify API health
curl -s http://localhost:8001/health | python3 -m json.tool

# 6. Start Caddy (should auto-start, but verify)
sudo systemctl status caddy
# If not running: sudo systemctl start caddy

# 7. Start the Gitea Actions runner
sudo systemctl status gitea-runner
# If not running: sudo systemctl start gitea-runner

# 8. Verify external access
curl -s https://api.wizard.lu/health
curl -I https://wizard.lu
curl -I https://omsflow.lu
curl -I https://rewardflow.lu

# 9. Verify monitoring
curl -I https://grafana.wizard.lu
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep -c '"health":"up"'

# 10. Verify backups timer is active
systemctl list-timers orion-backup.timer
```
!!! note "Boot order"
    Docker containers with `restart: always` will auto-start after Docker starts. Caddy and the Gitea runner are systemd services with `WantedBy=multi-user.target` and also auto-start. In practice, you mainly need to verify rather than manually start.
### 8. Restore from Backup (Disaster Recovery)

!!! danger "SEV-1"
    Use this runbook when the database is corrupted or data is lost and you need to restore from a backup.

Prerequisites: Identify the backup to restore from.
```bash
# List available local backups
ls -lh ~/backups/orion/daily/
ls -lh ~/backups/orion/weekly/
# If local backups are gone, list what is available in R2
source ~/apps/orion/.env
aws s3 ls s3://orion-backups/orion/daily/ \
  --endpoint-url "https://${R2_ACCOUNT_ID:-$(grep R2_ACCOUNT_ID ~/apps/orion/.env | cut -d= -f2)}.r2.cloudflarestorage.com" \
  --profile r2
```
Download from R2 (if local backups are unavailable):

```bash
aws s3 sync s3://orion-backups/ ~/backups/ \
  --endpoint-url "https://<ACCOUNT_ID>.r2.cloudflarestorage.com" \
  --profile r2
```
Restore using the restore script:

```bash
# Restore the Orion database
bash ~/apps/orion/scripts/restore.sh orion ~/backups/orion/daily/<backup-file>.sql.gz
```
The restore script will:

- Stop application containers (API, Celery) while keeping the database running
- Drop and recreate the `orion_db` database
- Restore from the `.sql.gz` backup file
- Run `alembic upgrade heads` to apply any pending migrations
- Restart all containers
Verify after restore:

```bash
cd ~/apps/orion
# Check API health
curl -s http://localhost:8001/health | python3 -m json.tool
# Verify data integrity (check row counts of key tables)
docker compose --profile full exec db \
  psql -U orion_user -d orion_db -c \
  "SELECT 'platforms' AS tbl, count(*) FROM platforms
   UNION ALL SELECT 'users', count(*) FROM users
   UNION ALL SELECT 'stores', count(*) FROM stores;"
# Verify external access
curl -s https://api.wizard.lu/health
```
Restore Gitea (if needed):

```bash
bash ~/apps/orion/scripts/restore.sh gitea ~/backups/gitea/daily/<backup-file>.sql.gz
```
Full server rebuild from a Hetzner snapshot (worst case):

1. Go to Hetzner Cloud Console > Servers > Snapshots
2. Select the most recent snapshot and click Rebuild from snapshot
3. After the rebuild, SSH in and verify all services per Runbook 7
## Post-Incident Report Template
After resolving any SEV-1 or SEV-2 incident, create a post-incident report. Save reports in a shared location for the team.
```markdown
# Post-Incident Report: [Brief Title]

**Date**: YYYY-MM-DD
**Severity**: SEV-1 / SEV-2
**Duration**: HH:MM (from detection to resolution)
**Author**: [Name]

## Incident Summary

[1-2 sentence description of what happened and the user impact.]

## Timeline (UTC)

| Time | Event |
|-------|--------------------------------------------|
| HH:MM | Alert triggered / issue reported |
| HH:MM | Responder acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Fix applied |
| HH:MM | Service fully restored |

## Root Cause

[What caused the incident. Be specific -- e.g., "OOM killer terminated the API container because a Celery import task loaded 50k products into memory at once."]

## Resolution

[What was done to fix it. Include exact commands if relevant.]

## Impact

- **Users affected**: [number or scope]
- **Data lost**: [none / describe]
- **Downtime**: [duration]

## Action Items

| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| [Preventive measure 1] | [Name] | YYYY-MM-DD | [ ] Open |
| [Preventive measure 2] | [Name] | YYYY-MM-DD | [ ] Open |

## Lessons Learned

[What went well, what could be improved in the response process.]
```
## Useful Monitoring URLs
| Service | URL | Purpose |
|---|---|---|
| Grafana | grafana.wizard.lu | Dashboards for host metrics, container metrics |
| Prometheus | http://localhost:9090 (SSH tunnel) | Raw metrics queries, target health |
| Prometheus Targets | http://localhost:9090/targets | Check which scrape targets are up/down |
| API Health | api.wizard.lu/health | Application health check (DB, Redis) |
| API Liveness | api.wizard.lu/health/live | Basic liveness probe |
| API Readiness | api.wizard.lu/health/ready | Readiness probe (includes dependencies) |
| API Metrics | api.wizard.lu/metrics | Prometheus-format application metrics |
| Flower | flower.wizard.lu | Celery task monitoring, worker status |
| Gitea | git.wizard.lu | Git repository and CI pipeline status |
| Main Platform | wizard.lu | Main storefront |
| OMS Platform | omsflow.lu | OMS storefront |
| Loyalty+ Platform | rewardflow.lu | Loyalty+ storefront |
| Hetzner Console | console.hetzner.cloud | Server management, snapshots, rescue mode |
| Hetzner Status | status.hetzner.com | Hetzner infrastructure status |
!!! tip "SSH tunnel for internal services"
    Prometheus and other internal services are not exposed externally. To access them from your local machine:

    ```bash
    # Prometheus (localhost:9090 on server → localhost:9090 on your machine)
    ssh -L 9090:localhost:9090 samir@91.99.65.229
    # Then open http://localhost:9090 in your browser
    ```
## Quick Reference: Essential Commands

```bash
# SSH into the server
ssh samir@91.99.65.229
# Container status
cd ~/apps/orion && docker compose --profile full ps
# Container resource usage
docker stats --no-stream
# Follow all logs
cd ~/apps/orion && docker compose --profile full logs -f
# Restart a single service
cd ~/apps/orion && docker compose --profile full restart <service>
# Full stack rebuild
cd ~/apps/orion && docker compose --profile full up -d --build
# Caddy status / logs
sudo systemctl status caddy
sudo journalctl -u caddy -f
# System resources
free -h && df -h / && uptime
# Manual deploy
cd ~/apps/orion && bash scripts/deploy.sh
# Manual backup
bash ~/apps/orion/scripts/backup.sh --upload
# Run migrations
cd ~/apps/orion && docker compose --profile full exec -e PYTHONPATH=/app api python -m alembic upgrade heads
```
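Nearly every command above starts with `cd ~/apps/orion`. A small convenience function (an assumption for your shell profile, not part of the server setup) lets you run compose commands from any directory:

```bash
# oc: run compose against the Orion project from any directory
oc() {
  docker compose --project-directory ~/apps/orion --profile full "$@"
}
# e.g. oc ps; oc logs -f api; oc restart celery-worker
```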