# Incident Response Runbook

Operational runbook for diagnosing and resolving production incidents on the Orion platform.

!!! info "Server Details"
    - **Server**: Hetzner Cloud CAX11 (4 GB RAM, ARM64)
    - **IP**: `91.99.65.229`
    - **App path**: `~/apps/orion`
    - **Docker profile**: `--profile full`
    - **Reverse proxy**: Caddy 2.10.2 (systemd, not containerized)
    - **Domains**: wizard.lu, omsflow.lu, rewardflow.lu

---

## Severity Levels

| Level | Definition | Examples | Response Time | Notification |
|-------|-----------|----------|---------------|--------------|
| **SEV-1** | Platform down, all users affected | API unreachable, database down, server unresponsive | **< 15 min** | Immediate page |
| **SEV-2** | Feature broken, subset of users affected | Celery not processing tasks, one platform domain down, SSL expired | **< 1 hour** | Slack / email alert |
| **SEV-3** | Minor issue, no user impact or minimal degradation | High memory warning, slow queries, disk usage above 70% | **< 4 hours** | Grafana alert, next business day |

!!! warning "Escalation"
    If a SEV-2 is not resolved within 2 hours, escalate to SEV-1. If a SEV-3 trends toward impacting users, escalate to SEV-2.

---

## Quick Diagnosis Decision Tree

Follow these steps in order when responding to any incident.

### Step 1: Can you reach the server?

```bash
ssh samir@91.99.65.229
```

- **Yes** -- proceed to Step 2.
- **No** -- check your local network. Try from a different connection. If still unreachable, check [Hetzner Status](https://status.hetzner.com/) and open a support ticket. As a last resort, use the Hetzner Cloud Console rescue mode.

### Step 2: Is Docker running?

```bash
sudo systemctl status docker
```

- **Yes** -- proceed to Step 3.
- **No** -- start Docker:

```bash
sudo systemctl start docker
```

### Step 3: Are the containers running?

```bash
cd ~/apps/orion && docker compose --profile full ps
```

Check for containers stuck in `Restarting`, `Exited`, or missing entirely.
Healthy output shows all containers as `Up (healthy)` or `Up`.

- **All healthy** -- proceed to Step 4.
- **Some down** -- go to the relevant runbook below (API, Database, Celery, etc.).
- **All down** -- go to [Runbook 7: Full Stack Restart](#7-full-stack-restart-after-reboot).

### Step 4: Is Caddy running?

```bash
sudo systemctl status caddy
```

- **Yes** -- proceed to Step 5.
- **No** -- go to [Runbook 4: Caddy / SSL / Domain Issues](#4-caddy-ssl-domain-issues).

### Step 5: Are domains resolving?

```bash
dig wizard.lu +short
dig api.wizard.lu +short
dig omsflow.lu +short
dig rewardflow.lu +short
```

All should return `91.99.65.229`. If not, check DNS records at your registrar.

### Step 6: Is the API responding?

```bash
curl -s http://localhost:8001/health | python3 -m json.tool
curl -s https://api.wizard.lu/health
```

- **Both work** -- issue may be intermittent. Check Grafana for recent anomalies.
- **localhost works, external fails** -- Caddy or DNS issue. Go to [Runbook 4](#4-caddy-ssl-domain-issues).
- **Neither works** -- API is down. Go to [Runbook 1](#1-api-container-down-crash-looping).

---

## Runbooks

### 1. API Container Down / Crash-Looping

!!! danger "SEV-1"
    API unavailability affects all users on all platforms.

**Symptoms**: The `api` container shows `Restarting` or `Exited` in `docker compose ps`. External URLs return 502.

**Diagnose**:

```bash
cd ~/apps/orion

# Check container status
docker compose --profile full ps api

# View recent logs (last 100 lines)
docker compose --profile full logs --tail=100 api

# Look for Python exceptions
docker compose --profile full logs api 2>&1 | grep -i "error\|exception\|traceback" | tail -20
```

**Common causes and fixes**:

=== "Import / syntax error in code"

    The log will show a Python traceback on startup. This usually means a bad deploy.
    ```bash
    # Roll back to the previous commit
    cd ~/apps/orion
    git log --oneline -5
    git checkout <previous-commit>
    docker compose --profile full up -d --build api
    ```

=== "Database connection refused"

    The API cannot reach PostgreSQL. See [Runbook 2](#2-database-issues).

=== "Port conflict"

    Another process is using port 8001.

    ```bash
    sudo ss -tlnp | grep 8001
    # Kill the conflicting process, then restart
    docker compose --profile full restart api
    ```

=== "Out of memory"

    The container was OOM-killed. See [Runbook 3](#3-high-memory-oom).

**Recovery**:

```bash
# Restart the API container
cd ~/apps/orion
docker compose --profile full restart api

# Wait 10 seconds, then verify
sleep 10
docker compose --profile full ps api
curl -s http://localhost:8001/health
```

---

### 2. Database Issues

!!! danger "SEV-1"
    Database unavailability brings down the entire platform.

**Symptoms**: API logs show `connection refused`, `could not connect to server`, or `OperationalError`. The health check fails with database errors.

**Diagnose**:

```bash
cd ~/apps/orion

# Check PostgreSQL container
docker compose --profile full ps db
docker compose --profile full logs --tail=50 db

# Test connection from inside the network
docker compose --profile full exec db pg_isready -U orion_user -d orion_db

# Check disk space (PostgreSQL needs space for WAL)
df -h
docker system df
```

**Common causes and fixes**:

=== "Container stopped"

    ```bash
    cd ~/apps/orion
    docker compose --profile full up -d db
    sleep 5
    docker compose --profile full exec db pg_isready -U orion_user -d orion_db

    # Once healthy, restart the API
    docker compose --profile full restart api celery-worker celery-beat
    ```

=== "Too many connections"

    ```bash
    # Check active connections
    docker compose --profile full exec db \
      psql -U orion_user -d orion_db -c \
      "SELECT count(*) FROM pg_stat_activity;"

    # Kill idle connections
    docker compose --profile full exec db \
      psql -U orion_user -d orion_db -c \
      "SELECT pg_terminate_backend(pid) FROM pg_stat_activity
       WHERE state = 'idle' AND
       query_start < now() - interval '10 minutes';"
    ```

=== "Disk full (WAL or data)"

    See [Runbook 6: Disk Full](#6-disk-full).

=== "Data corruption (last resort)"

    If PostgreSQL refuses to start with corruption errors:

    ```bash
    # Stop everything
    cd ~/apps/orion
    docker compose --profile full down

    # Restore from backup (see Runbook 8)
    bash ~/apps/orion/scripts/restore.sh orion ~/backups/orion/daily/<timestamp>.sql.gz
    ```

**Check for slow queries**:

```bash
docker compose --profile full exec db \
  psql -U orion_user -d orion_db -c \
  "SELECT pid, now() - query_start AS duration, left(query, 80)
   FROM pg_stat_activity WHERE state != 'idle'
   ORDER BY duration DESC LIMIT 10;"
```

**Kill a stuck query**:

```bash
docker compose --profile full exec db \
  psql -U orion_user -d orion_db -c \
  "SELECT pg_terminate_backend(<pid>);"
```

---

### 3. High Memory / OOM

!!! warning "SEV-2 (can escalate to SEV-1 if the OOM killer fires)"
    The server has 4 GB RAM. Normal usage is ~2.4 GB. Above 3.2 GB is critical.

**Symptoms**: Containers restarting unexpectedly. `dmesg` shows OOM killer activity. Grafana memory graphs spiking.
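The thresholds above can be checked in one step. A minimal helper sketch -- the 2.4 GB and 3.2 GB limits are the figures quoted in this runbook, and `mem_status` is a hypothetical name:

```shell
# mem_status USED_MB -- classify current memory usage against the runbook thresholds
mem_status() {
  if [ "$1" -gt 3200 ]; then echo "CRITICAL"   # above the 3.2 GB critical line
  elif [ "$1" -gt 2400 ]; then echo "WARN"     # above normal ~2.4 GB usage
  else echo "OK"
  fi
}

# Used memory in MB, derived from /proc/meminfo (MemTotal - MemAvailable)
used_mb=$(awk '/MemTotal/ {t=$2} /MemAvailable/ {a=$2} END {print int((t-a)/1024)}' /proc/meminfo)
mem_status "$used_mb"
```

Feeding it the derived `used_mb` gives an immediate OK / WARN / CRITICAL verdict without reading graphs.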
**Diagnose**:

```bash
# System memory
free -h

# Per-container memory usage
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"

# Check for OOM kills
sudo dmesg | grep -i "oom\|killed" | tail -10

# Top processes by memory
ps aux --sort=-%mem | head -15
```

**Immediate relief**:

```bash
# Clear Docker build cache
docker builder prune -f

# Remove unused images
docker image prune -f

# Remove stopped containers
docker container prune -f

# Nuclear option: remove all unused Docker data
docker system prune -f
```

**If a specific container is the culprit**:

```bash
cd ~/apps/orion

# Restart the offending container
docker compose --profile full restart <container>

# If the API is leaking memory, a restart is the fastest fix
docker compose --profile full restart api
```

**If CI jobs are running** (they add ~550 MB temporarily):

```bash
# Check if a Gitea Actions runner job is active
sudo systemctl status gitea-runner

# Wait for the job to finish, or stop the runner temporarily
sudo systemctl stop gitea-runner
```

!!! tip "Long-term fix"
    If OOM events become frequent, upgrade to CAX21 (8 GB RAM, ~7.50 EUR/mo) via **Hetzner Cloud Console > Server > Rescale**. The upgrade takes about 2 minutes and preserves all data.

---

### 4. Caddy / SSL / Domain Issues

!!! warning "SEV-2"
    Caddy handles TLS termination and routing for all domains. If Caddy is down, all external access is lost even though the API may be running fine internally.

**Symptoms**: Sites return connection refused on port 443. SSL certificate errors in the browser. A specific domain not working.
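To see at a glance which domains are affected, a smoke check can be run over all public hostnames. A sketch -- the domain list matches the server details above, and `smoke_check` is a hypothetical helper, not an existing script:

```shell
# smoke_check DOMAIN... -- print the HTTPS status code (or FAIL) for each domain
smoke_check() {
  for d in "$@"; do
    # curl exits non-zero on DNS/connection failure, which triggers the FAIL fallback
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "https://$d") || code=FAIL
    echo "$d -> $code"
  done
}

smoke_check wizard.lu api.wizard.lu omsflow.lu rewardflow.lu
```

If only some domains report `FAIL`, suspect DNS or per-site Caddy config; if all fail, Caddy itself or the server is down.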
**Diagnose**:

```bash
# Check Caddy status
sudo systemctl status caddy

# View Caddy logs
sudo journalctl -u caddy --since "30 minutes ago" --no-pager

# Test the internal API directly (bypasses Caddy)
curl -s http://localhost:8001/health

# Test SSL certificates
curl -vI https://wizard.lu 2>&1 | grep -E "SSL|subject|expire"
curl -vI https://api.wizard.lu 2>&1 | grep -E "SSL|subject|expire"
```

**Common causes and fixes**:

=== "Caddy stopped"

    ```bash
    sudo systemctl start caddy
    sudo systemctl status caddy
    ```

=== "Caddyfile syntax error"

    ```bash
    # Validate the configuration
    sudo caddy validate --config /etc/caddy/Caddyfile

    # If invalid, check recent changes
    sudo nano /etc/caddy/Caddyfile

    # After fixing, reload rather than restart (graceful, no dropped connections)
    sudo systemctl reload caddy
    ```

=== "SSL certificate issue"

    Caddy auto-renews certificates. If renewal fails, it is usually a port 80 or DNS issue.

    ```bash
    # Ensure port 80 is open (needed for the ACME HTTP challenge)
    sudo ufw status | grep 80

    # Check Caddy certificate storage
    sudo ls -la /var/lib/caddy/.local/share/caddy/certificates/

    # Force certificate renewal by restarting Caddy
    sudo systemctl restart caddy
    ```

=== "DNS not pointing to server"

    ```bash
    dig wizard.lu +short
    # Should return 91.99.65.229
    # If wrong, update DNS at the registrar and wait for propagation
    # Temporary: test by adding an entry to /etc/hosts on your local machine
    ```

**Caddyfile reference** (at `/etc/caddy/Caddyfile`):

```bash
sudo cat /etc/caddy/Caddyfile
```

---

### 5. Celery Worker Issues

!!! attention "SEV-2"
    Celery processes background tasks (imports, emails, scheduled jobs). If it is down, no background work happens, but the platform remains browsable.

**Symptoms**: Background tasks not executing. Flower shows no active workers. Emails not being sent.
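A quick way to judge whether the broker queue is merely busy or actually backed up. A sketch -- `queue_status` is a hypothetical helper, and the 1000-task threshold is an illustrative assumption, not a documented limit:

```shell
# queue_status LENGTH -- classify the Celery queue depth
queue_status() {
  if [ "$1" -gt 1000 ]; then echo "BACKED UP"   # illustrative threshold
  elif [ "$1" -gt 0 ]; then echo "ACTIVE"
  else echo "EMPTY"
  fi
}

# Read the queue length from Redis; fall back to 0 if the stack is unreachable
len=$(cd ~/apps/orion && docker compose --profile full exec redis redis-cli llen celery) || len=0
queue_status "${len:-0}"
```

`ACTIVE` with a shrinking length is normal operation; a large and growing length points at the "Task queue backed up" fix below in this runbook's existing steps.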
**Diagnose**:

```bash
cd ~/apps/orion

# Check worker and beat containers
docker compose --profile full ps celery-worker celery-beat

# View worker logs
docker compose --profile full logs --tail=50 celery-worker
docker compose --profile full logs --tail=50 celery-beat

# Check Redis (the broker)
docker compose --profile full exec redis redis-cli ping
docker compose --profile full exec redis redis-cli llen celery

# Check Flower for worker status
curl -s http://localhost:5555/api/workers | python3 -m json.tool
```

**Common causes and fixes**:

=== "Worker crashed / import error"

    ```bash
    # Check for Python errors in the worker logs
    docker compose --profile full logs celery-worker 2>&1 | grep -i "error\|exception" | tail -10

    # Restart the worker
    cd ~/apps/orion
    docker compose --profile full restart celery-worker celery-beat
    ```

=== "Redis down"

    ```bash
    # Check the Redis container
    docker compose --profile full ps redis
    docker compose --profile full logs --tail=20 redis

    # Restart Redis, then the workers
    cd ~/apps/orion
    docker compose --profile full restart redis
    sleep 5
    docker compose --profile full restart celery-worker celery-beat
    ```

=== "Task queue backed up"

    ```bash
    # Check the queue length
    docker compose --profile full exec redis redis-cli llen celery

    # If the queue is extremely large and the tasks are stale, purge it
    docker compose --profile full exec api \
      celery -A app.core.celery_app purge -f

    # Restart the worker to start fresh
    docker compose --profile full restart celery-worker
    ```

=== "Beat scheduler out of sync"

    ```bash
    # Remove the beat schedule file and restart
    docker compose --profile full exec celery-beat rm -f /app/celerybeat-schedule
    docker compose --profile full restart celery-beat
    ```

---

### 6. Disk Full

!!! warning "SEV-2 (becomes SEV-1 if PostgreSQL cannot write WAL)"
    The server has a 37 GB disk. Docker images, logs, and database WAL can fill it quickly.

**Symptoms**: Write errors in logs. PostgreSQL panics. Docker cannot pull images. `No space left on device` errors.
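The 70% alert threshold used elsewhere in this runbook can be checked directly. A sketch -- `disk_used_pct` is a hypothetical helper and assumes GNU coreutils `df` (standard on this Ubuntu-style server):

```shell
# disk_used_pct MOUNT -- print the used percentage of a mount point as a bare integer
disk_used_pct() {
  df --output=pcent "$1" | tail -1 | tr -dc '0-9'
}

pct=$(disk_used_pct /)
if [ "$pct" -gt 70 ]; then
  echo "ALERT: root filesystem at ${pct}% (above the 70% threshold)"
else
  echo "OK: root filesystem at ${pct}%"
fi
```

The same function works for any mount point, e.g. `disk_used_pct /var/lib/docker` if Docker storage is on a separate volume.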
**Diagnose**:

```bash
# Overall disk usage
df -h /

# Docker disk usage breakdown
docker system df

# Largest directories
sudo du -sh /var/lib/docker/* 2>/dev/null | sort -rh | head -10
du -sh ~/backups/* 2>/dev/null
du -sh ~/apps/orion/logs/* 2>/dev/null
```

**Immediate cleanup**:

```bash
# 1. Remove old Docker images and build cache (usually frees 2-5 GB)
#    Do NOT add --volumes: it also deletes unused named volumes, which can
#    include database data while containers are stopped.
docker system prune -af

# 2. Truncate application logs
cd ~/apps/orion
truncate -s 0 logs/*.log 2>/dev/null

# 3. Remove old backups beyond the retention policy
find ~/backups -name "*.sql.gz" -mtime +14 -delete

# 4. Clean systemd journal logs (keep the last 3 days)
sudo journalctl --vacuum-time=3d

# 5. Clean the apt cache
sudo apt clean
```

**After freeing space**:

```bash
# Verify space recovered
df -h /

# Restart any containers that failed due to the full disk
cd ~/apps/orion
docker compose --profile full up -d
```

!!! tip "Prevention"
    Set up a Grafana alert for disk usage > 70%. The node-exporter dashboard (ID 1860) includes disk usage panels. If the server persistently runs low on disk, upgrade to CAX21 (80 GB disk).

---

### 7. Full Stack Restart (After Reboot)

!!! info "SEV-2"
    After a server reboot (planned or unplanned), all services need to come back up in the correct order.

**When to use**: After a Hetzner maintenance reboot, a manual reboot, or a kernel upgrade.

**Step-by-step recovery**:

```bash
# 1. Verify Docker is running
sudo systemctl status docker
# If not: sudo systemctl start docker

# 2. Start Gitea (needed for CI, not for the app itself)
cd ~/gitea && docker compose up -d
sleep 5

# 3. Start the Orion stack (db and redis start first due to depends_on)
cd ~/apps/orion
docker compose --profile full up -d
sleep 15

# 4. Verify all containers are healthy
docker compose --profile full ps

# 5. Verify API health
curl -s http://localhost:8001/health | python3 -m json.tool

# 6. Start Caddy (should auto-start, but verify)
sudo systemctl status caddy
# If not running: sudo systemctl start caddy

# 7. Start the Gitea Actions runner
sudo systemctl status gitea-runner
# If not running: sudo systemctl start gitea-runner

# 8. Verify external access
curl -s https://api.wizard.lu/health
curl -I https://wizard.lu
curl -I https://omsflow.lu
curl -I https://rewardflow.lu

# 9. Verify monitoring
curl -I https://grafana.wizard.lu
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep -c '"health":"up"'

# 10. Verify the backup timer is active
systemctl list-timers orion-backup.timer
```

!!! note "Boot order"
    Docker containers with `restart: always` will auto-start after Docker starts. Caddy and the Gitea runner are systemd services with `WantedBy=multi-user.target` and also auto-start. In practice, you mainly need to verify rather than manually start.

---

### 8. Restore from Backup (Disaster Recovery)

!!! danger "SEV-1"
    Use this runbook when the database is corrupted or data is lost and you need to restore from a backup.

**Prerequisites**: Identify the backup to restore from.

```bash
# List available local backups
ls -lh ~/backups/orion/daily/
ls -lh ~/backups/orion/weekly/

# If local backups are gone, list what is available in R2
source ~/apps/orion/.env
aws s3 ls s3://orion-backups/orion/daily/ \
  --endpoint-url "https://${R2_ACCOUNT_ID:-$(grep R2_ACCOUNT_ID ~/apps/orion/.env | cut -d= -f2)}.r2.cloudflarestorage.com" \
  --profile r2
```

**Download from R2 (if local backups are unavailable)**:

```bash
aws s3 sync s3://orion-backups/ ~/backups/ \
  --endpoint-url "https://<R2_ACCOUNT_ID>.r2.cloudflarestorage.com" \
  --profile r2
```

**Restore using the restore script**:

```bash
# Restore the Orion database
bash ~/apps/orion/scripts/restore.sh orion ~/backups/orion/daily/<timestamp>.sql.gz
```

The restore script will:

1. Stop the application containers (API, Celery) while keeping the database running
2. Drop and recreate the `orion_db` database
3. Restore from the `.sql.gz` backup file
4. Run `alembic upgrade heads` to apply any pending migrations
5. Restart all containers

**Verify after restore**:

```bash
cd ~/apps/orion

# Check API health
curl -s http://localhost:8001/health | python3 -m json.tool

# Verify data integrity (check row counts of key tables)
docker compose --profile full exec db \
  psql -U orion_user -d orion_db -c \
  "SELECT 'platforms' AS tbl, count(*) FROM platforms
   UNION ALL SELECT 'users', count(*) FROM users
   UNION ALL SELECT 'stores', count(*) FROM stores;"

# Verify external access
curl -s https://api.wizard.lu/health
```

**Restore Gitea (if needed)**:

```bash
bash ~/apps/orion/scripts/restore.sh gitea ~/backups/gitea/daily/<timestamp>.sql.gz
```

**Full server rebuild from a Hetzner snapshot** (worst case):

1. Go to **Hetzner Cloud Console > Servers > Snapshots**
2. Select the most recent snapshot and click **Rebuild from snapshot**
3. After the rebuild, SSH in and verify all services per [Runbook 7](#7-full-stack-restart-after-reboot)

---

## Post-Incident Report Template

After resolving any SEV-1 or SEV-2 incident, create a post-incident report. Save reports in a shared location for the team.

```markdown
# Post-Incident Report: [Brief Title]

**Date**: YYYY-MM-DD
**Severity**: SEV-1 / SEV-2
**Duration**: HH:MM (from detection to resolution)
**Author**: [Name]

## Incident Summary

[1-2 sentence description of what happened and the user impact.]

## Timeline (UTC)

| Time  | Event                            |
|-------|----------------------------------|
| HH:MM | Alert triggered / issue reported |
| HH:MM | Responder acknowledged           |
| HH:MM | Root cause identified            |
| HH:MM | Fix applied                      |
| HH:MM | Service fully restored           |

## Root Cause

[What caused the incident. Be specific -- e.g., "OOM killer terminated the API container because a Celery import task loaded 50k products into memory at once."]

## Resolution

[What was done to fix it. Include exact commands if relevant.]

## Impact

- **Users affected**: [number or scope]
- **Data lost**: [none / describe]
- **Downtime**: [duration]

## Action Items

| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| [Preventive measure 1] | [Name] | YYYY-MM-DD | [ ] Open |
| [Preventive measure 2] | [Name] | YYYY-MM-DD | [ ] Open |

## Lessons Learned

[What went well, what could be improved in the response process.]
```

---

## Useful Monitoring URLs

| Service | URL | Purpose |
|---------|-----|---------|
| **Grafana** | [grafana.wizard.lu](https://grafana.wizard.lu) | Dashboards for host metrics, container metrics |
| **Prometheus** | `http://localhost:9090` (SSH tunnel) | Raw metrics queries, target health |
| **Prometheus Targets** | `http://localhost:9090/targets` | Check which scrape targets are up/down |
| **API Health** | [api.wizard.lu/health](https://api.wizard.lu/health) | Application health check (DB, Redis) |
| **API Liveness** | [api.wizard.lu/health/live](https://api.wizard.lu/health/live) | Basic liveness probe |
| **API Readiness** | [api.wizard.lu/health/ready](https://api.wizard.lu/health/ready) | Readiness probe (includes dependencies) |
| **API Metrics** | [api.wizard.lu/metrics](https://api.wizard.lu/metrics) | Prometheus-format application metrics |
| **Flower** | [flower.wizard.lu](https://flower.wizard.lu) | Celery task monitoring, worker status |
| **Gitea** | [git.wizard.lu](https://git.wizard.lu) | Git repository and CI pipeline status |
| **Main Platform** | [wizard.lu](https://wizard.lu) | Main storefront |
| **OMS Platform** | [omsflow.lu](https://omsflow.lu) | OMS storefront |
| **Loyalty+ Platform** | [rewardflow.lu](https://rewardflow.lu) | Loyalty+ storefront |
| **Hetzner Console** | [console.hetzner.cloud](https://console.hetzner.cloud) | Server management, snapshots, rescue mode |
| **Hetzner Status** | [status.hetzner.com](https://status.hetzner.com) | Hetzner infrastructure status |

!!! tip "SSH tunnel for internal services"
    Prometheus and other internal services are not exposed externally. To access them from your local machine:

    ```bash
    # Prometheus (localhost:9090 on the server → localhost:9090 on your machine)
    ssh -L 9090:localhost:9090 samir@91.99.65.229

    # Then open http://localhost:9090 in your browser
    ```

---

## Quick Reference: Essential Commands

```bash
# SSH into the server
ssh samir@91.99.65.229

# Container status
cd ~/apps/orion && docker compose --profile full ps

# Container resource usage
docker stats --no-stream

# Follow all logs
cd ~/apps/orion && docker compose --profile full logs -f

# Restart a single service
cd ~/apps/orion && docker compose --profile full restart <service>

# Full stack rebuild
cd ~/apps/orion && docker compose --profile full up -d --build

# Caddy status / logs
sudo systemctl status caddy
sudo journalctl -u caddy -f

# System resources
free -h && df -h / && uptime

# Manual deploy
cd ~/apps/orion && bash scripts/deploy.sh

# Manual backup
bash ~/apps/orion/scripts/backup.sh --upload

# Run migrations
cd ~/apps/orion && docker compose --profile full exec -e PYTHONPATH=/app api python -m alembic upgrade heads
```
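
The quick-diagnosis decision tree at the top of this runbook can be condensed into a single first-response script. A sketch -- the paths, ports, and domains are the ones documented above, while `triage.sh` and the `step` helper are hypothetical names, not an existing script:

```shell
#!/usr/bin/env bash
# triage.sh (hypothetical) -- one-shot version of the quick-diagnosis decision tree
step() { printf '== %s ==\n' "$1"; }

step "Docker"
systemctl is-active --quiet docker && echo "docker: up" || echo "docker: DOWN"

step "Containers"
(cd ~/apps/orion && docker compose --profile full ps) || echo "compose: could not list containers"

step "Caddy"
systemctl is-active --quiet caddy && echo "caddy: up" || echo "caddy: DOWN"

step "DNS"
for d in wizard.lu api.wizard.lu omsflow.lu rewardflow.lu; do
  printf '%s -> %s\n' "$d" "$(dig +short "$d" | head -1)"
done

step "API"
curl -fsS --max-time 10 http://localhost:8001/health >/dev/null \
  && echo "api: up" || echo "api: DOWN"
```

Any `DOWN` line maps directly to a decision-tree step and its linked runbook, so the script output doubles as a triage checklist.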