# Incident Response Runbook

Operational runbook for diagnosing and resolving production incidents on the Orion platform.

!!! info "Server Details"

    - **Server**: Hetzner Cloud CAX11 (4 GB RAM, ARM64)
    - **IP**: `91.99.65.229`
    - **App path**: `~/apps/orion`
    - **Docker profile**: `--profile full`
    - **Reverse proxy**: Caddy 2.10.2 (systemd, not containerized)
    - **Domains**: wizard.lu, omsflow.lu, rewardflow.lu
---

## Severity Levels

| Level | Definition | Examples | Response Time | Notification |
|-------|-----------|----------|---------------|--------------|
| **SEV-1** | Platform down, all users affected | API unreachable, database down, server unresponsive | **< 15 min** | Immediate page |
| **SEV-2** | Feature broken, subset of users affected | Celery not processing tasks, one platform domain down, SSL expired | **< 1 hour** | Slack / email alert |
| **SEV-3** | Minor issue, no user impact or minimal degradation | High memory warning, slow queries, disk usage above 70% | **< 4 hours** | Grafana alert, next business day |

!!! warning "Escalation"

    If a SEV-2 is not resolved within 2 hours, escalate to SEV-1. If a SEV-3 trends toward impacting users, escalate to SEV-2.
---

## Quick Diagnosis Decision Tree

Follow these steps in order when responding to any incident.

### Step 1: Can you reach the server?

```bash
ssh samir@91.99.65.229
```

- **Yes** -- proceed to Step 2.
- **No** -- check your local network. Try from a different connection. If still unreachable, check [Hetzner Status](https://status.hetzner.com/) and open a support ticket. As a last resort, use the Hetzner Cloud Console rescue mode.
### Step 2: Is Docker running?

```bash
sudo systemctl status docker
```

- **Yes** -- proceed to Step 3.
- **No** -- start Docker:

```bash
sudo systemctl start docker
```
### Step 3: Are the containers running?

```bash
cd ~/apps/orion && docker compose --profile full ps
```

Look for containers that are `Restarting`, `Exited`, or missing entirely. Healthy output shows all containers as `Up (healthy)` or `Up`.

- **All healthy** -- proceed to Step 4.
- **Some down** -- go to the relevant runbook below (API, Database, Celery, etc.).
- **All down** -- go to [Runbook 7: Full Stack Restart](#7-full-stack-restart-after-reboot).
### Step 4: Is Caddy running?

```bash
sudo systemctl status caddy
```

- **Yes** -- proceed to Step 5.
- **No** -- go to [Runbook 4: Caddy / SSL / Domain Issues](#4-caddy-ssl-domain-issues).
### Step 5: Are domains resolving?

```bash
dig wizard.lu +short
dig api.wizard.lu +short
dig omsflow.lu +short
dig rewardflow.lu +short
```

All four should return `91.99.65.229`. If not, check the DNS records at your registrar.
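To check all four domains in one pass, a small helper like this can be used (hypothetical, not part of the repo; `verify_ip` takes an already-resolved address so the comparison logic is separate from `dig` and testable offline):

```shell
# Expected server IP from the Server Details above
expected_ip="91.99.65.229"

# verify_ip DOMAIN RESOLVED_IP -- compare a resolved address to the expected one.
# Real usage: verify_ip wizard.lu "$(dig +short wizard.lu | tail -1)"
verify_ip() {
  local domain="$1" resolved="$2"
  if [ "$resolved" = "$expected_ip" ]; then
    echo "$domain OK"
  else
    echo "$domain MISMATCH: got '${resolved:-nothing}'"
  fi
}

# Demo on sample values
verify_ip wizard.lu "91.99.65.229"
verify_ip omsflow.lu ""
```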
### Step 6: Is the API responding?

```bash
curl -s http://localhost:8001/health | python3 -m json.tool
curl -s https://api.wizard.lu/health
```

- **Both work** -- the issue may be intermittent. Check Grafana for recent anomalies.
- **localhost works, external fails** -- Caddy or DNS issue. Go to [Runbook 4](#4-caddy-ssl-domain-issues).
- **Neither works** -- the API is down. Go to [Runbook 1](#1-api-container-down-crash-looping).

---

## Runbooks
### 1. API Container Down / Crash-Looping

!!! danger "SEV-1"

    API unavailability affects all users on all platforms.

**Symptoms**: The `api` container shows `Restarting` or `Exited` in `docker compose ps`. External URLs return 502.

**Diagnose**:

```bash
cd ~/apps/orion

# Check container status
docker compose --profile full ps api

# View recent logs (last 100 lines)
docker compose --profile full logs --tail=100 api

# Look for Python exceptions
docker compose --profile full logs api 2>&1 | grep -i "error\|exception\|traceback" | tail -20
```

**Common causes and fixes**:

=== "Import / syntax error in code"

    The log will show a Python traceback on startup. This usually means a bad deploy.

    ```bash
    # Roll back to previous commit
    cd ~/apps/orion
    git log --oneline -5
    git checkout <previous-good-commit>
    docker compose --profile full up -d --build api
    ```

=== "Database connection refused"

    The API cannot reach PostgreSQL. See [Runbook 2](#2-database-issues).

=== "Port conflict"

    Another process is using port 8001.

    ```bash
    sudo ss -tlnp | grep 8001
    # Kill the conflicting process, then restart
    docker compose --profile full restart api
    ```

=== "Out of memory"

    The container was OOM-killed. See [Runbook 3](#3-high-memory-oom).

**Recovery**:

```bash
# Restart the API container
cd ~/apps/orion
docker compose --profile full restart api

# Wait 10 seconds, then verify
sleep 10
docker compose --profile full ps api
curl -s http://localhost:8001/health
```
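A fixed `sleep 10` can be too short on a loaded host. As a sketch (not an existing script), a bounded poll retries a check command until it succeeds or the attempt budget runs out:

```shell
# wait_for TRIES CMD... -- run CMD until it succeeds, up to TRIES attempts.
wait_for() {
  local tries="$1"; shift
  local i
  for i in $(seq 1 "$tries"); do
    if "$@" >/dev/null 2>&1; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep 1
  done
  echo "still unhealthy after $tries attempt(s)"
  return 1
}

# Real usage: wait_for 30 curl -sf http://localhost:8001/health
# Demo with a command that always succeeds:
wait_for 3 true
```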
---

### 2. Database Issues

!!! danger "SEV-1"

    Database unavailability brings down the entire platform.

**Symptoms**: API logs show `connection refused`, `could not connect to server`, or `OperationalError`. The health check fails with database errors.

**Diagnose**:

```bash
cd ~/apps/orion

# Check PostgreSQL container
docker compose --profile full ps db
docker compose --profile full logs --tail=50 db

# Test connection from inside the network
docker compose --profile full exec db pg_isready -U orion_user -d orion_db

# Check disk space (PostgreSQL needs space for WAL)
df -h
docker system df
```

**Common causes and fixes**:

=== "Container stopped"

    ```bash
    cd ~/apps/orion
    docker compose --profile full up -d db
    sleep 5
    docker compose --profile full exec db pg_isready -U orion_user -d orion_db
    # Once healthy, restart the API
    docker compose --profile full restart api celery-worker celery-beat
    ```

=== "Too many connections"

    ```bash
    # Check active connections
    docker compose --profile full exec db \
      psql -U orion_user -d orion_db -c \
      "SELECT count(*) FROM pg_stat_activity;"

    # Kill idle connections
    docker compose --profile full exec db \
      psql -U orion_user -d orion_db -c \
      "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '10 minutes';"
    ```

=== "Disk full (WAL or data)"

    See [Runbook 6: Disk Full](#6-disk-full).

=== "Data corruption (last resort)"

    If PostgreSQL refuses to start with corruption errors:

    ```bash
    # Stop everything
    cd ~/apps/orion
    docker compose --profile full down

    # Restore from backup (see Runbook 8)
    bash ~/apps/orion/scripts/restore.sh orion ~/backups/orion/daily/<latest>.sql.gz
    ```
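For the "too many connections" case, a rough heuristic for judging headroom. The default `max_connections` of 100 is an assumption, not a documented value for this server; confirm with `SHOW max_connections;`:

```shell
# conn_headroom COUNT [MAX] -- classify the pg_stat_activity count
# against max_connections (assumed default: 100).
conn_headroom() {
  local count="$1" max="${2:-100}"
  if [ "$count" -ge "$max" ]; then
    echo "at limit: kill idle connections now"
  elif [ $((count * 100 / max)) -ge 80 ]; then
    echo "near limit: investigate"
  else
    echo "ok ($count/$max)"
  fi
}

# Demo; in practice feed it the count returned by the psql query above
conn_headroom 42
```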
**Check for slow queries**:

```bash
docker compose --profile full exec db \
  psql -U orion_user -d orion_db -c \
  "SELECT pid, now() - query_start AS duration, left(query, 80)
   FROM pg_stat_activity
   WHERE state != 'idle'
   ORDER BY duration DESC
   LIMIT 10;"
```

**Kill a stuck query**:

```bash
docker compose --profile full exec db \
  psql -U orion_user -d orion_db -c \
  "SELECT pg_terminate_backend(<PID>);"
```
---

### 3. High Memory / OOM

!!! warning "SEV-2 (can escalate to SEV-1 if the OOM killer fires)"

    The server has 4 GB RAM. Normal usage is ~2.4 GB. Above 3.2 GB is critical.

**Symptoms**: Containers restarting unexpectedly. `dmesg` shows OOM killer activity. Grafana memory graphs spiking.

**Diagnose**:

```bash
# System memory
free -h

# Per-container memory usage
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"

# Check for OOM kills
sudo dmesg | grep -i "oom\|killed" | tail -10

# Top processes by memory
ps aux --sort=-%mem | head -15
```
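To turn the thresholds above into a quick check, a hypothetical helper (not part of the repo) that classifies the used-memory figure from `free -m`:

```shell
# classify_mem USED_MIB -- map used memory against the documented
# thresholds: ~2.4 GB normal baseline, above 3.2 GB critical.
# Feed it: free -m | awk '/^Mem:/{print $3}'
classify_mem() {
  local used_mib="$1"
  if [ "$used_mib" -ge 3277 ]; then      # 3.2 GiB in MiB
    echo "critical"
  elif [ "$used_mib" -ge 2458 ]; then    # 2.4 GiB in MiB
    echo "elevated"
  else
    echo "normal"
  fi
}

# Demo on a sample value
classify_mem 2100
```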
**Immediate relief**:

```bash
# Clear Docker build cache
docker builder prune -f

# Remove unused images
docker image prune -f

# Remove stopped containers
docker container prune -f

# Nuclear option: remove all unused Docker data
docker system prune -f
```

**If a specific container is the culprit**:

```bash
cd ~/apps/orion

# Restart the offending container
docker compose --profile full restart <container-name>

# If the API is leaking memory, a restart is the fastest fix
docker compose --profile full restart api
```

**If CI jobs are running** (they add ~550 MB temporarily):

```bash
# Check if a Gitea Actions runner job is active
sudo systemctl status gitea-runner
# Wait for the job to finish, or stop the runner temporarily
sudo systemctl stop gitea-runner
```

!!! tip "Long-term fix"

    If OOM events become frequent, upgrade to CAX21 (8 GB RAM, ~7.50 EUR/mo) via **Hetzner Cloud Console > Server > Rescale**. The upgrade takes about 2 minutes and preserves all data.

---
### 4. Caddy / SSL / Domain Issues

!!! warning "SEV-2"

    Caddy handles TLS termination and routing for all domains. If Caddy is down, all external access is lost even though the API may be running fine internally.

**Symptoms**: Sites return connection refused on port 443. SSL certificate errors in the browser. A specific domain not working.

**Diagnose**:

```bash
# Check Caddy status
sudo systemctl status caddy

# View Caddy logs
sudo journalctl -u caddy --since "30 minutes ago" --no-pager

# Test internal API directly (bypasses Caddy)
curl -s http://localhost:8001/health

# Test SSL certificates
curl -vI https://wizard.lu 2>&1 | grep -E "SSL|subject|expire"
curl -vI https://api.wizard.lu 2>&1 | grep -E "SSL|subject|expire"
```

**Common causes and fixes**:

=== "Caddy stopped"

    ```bash
    sudo systemctl start caddy
    sudo systemctl status caddy
    ```

=== "Caddyfile syntax error"

    ```bash
    # Validate configuration
    sudo caddy validate --config /etc/caddy/Caddyfile

    # If invalid, check recent changes
    sudo nano /etc/caddy/Caddyfile

    # After fixing, reload (not restart; reload preserves certificates)
    sudo systemctl reload caddy
    ```

=== "SSL certificate issue"

    Caddy auto-renews certificates. If renewal fails, it is usually a port 80 or DNS issue.

    ```bash
    # Ensure port 80 is open (needed for the ACME HTTP challenge)
    sudo ufw status | grep 80

    # Check Caddy certificate storage
    sudo ls -la /var/lib/caddy/.local/share/caddy/certificates/

    # Force certificate renewal by restarting Caddy
    sudo systemctl restart caddy
    ```

=== "DNS not pointing to server"

    ```bash
    dig wizard.lu +short
    # Should return 91.99.65.229

    # If wrong, update DNS at the registrar and wait for propagation
    # Temporary workaround: add an entry to /etc/hosts on your local machine
    ```

**Caddyfile reference** (at `/etc/caddy/Caddyfile`):

```bash
sudo cat /etc/caddy/Caddyfile
```
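The live file is the source of truth, but for orientation, a hypothetical sketch of what the API entry likely looks like (Caddy v2 syntax; only the API upstream port 8001 is documented above, and the frontend domains would follow the same pattern with their own upstreams):

```caddyfile
api.wizard.lu {
    reverse_proxy localhost:8001
}
```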
---

### 5. Celery Worker Issues

!!! warning "SEV-2"

    Celery processes background tasks (imports, emails, scheduled jobs). If it is down, no background work happens, but the platform remains browsable.

**Symptoms**: Background tasks not executing. Flower shows no active workers. Emails not being sent.

**Diagnose**:

```bash
cd ~/apps/orion

# Check worker and beat containers
docker compose --profile full ps celery-worker celery-beat

# View worker logs
docker compose --profile full logs --tail=50 celery-worker
docker compose --profile full logs --tail=50 celery-beat

# Check Redis (the broker)
docker compose --profile full exec redis redis-cli ping
docker compose --profile full exec redis redis-cli llen celery

# Check Flower for worker status
curl -s http://localhost:5555/api/workers | python3 -m json.tool
```

**Common causes and fixes**:

=== "Worker crashed / import error"

    ```bash
    # Check for Python errors in worker logs
    docker compose --profile full logs celery-worker 2>&1 | grep -i "error\|exception" | tail -10

    # Restart worker and beat
    cd ~/apps/orion
    docker compose --profile full restart celery-worker celery-beat
    ```

=== "Redis down"

    ```bash
    # Check Redis container
    docker compose --profile full ps redis
    docker compose --profile full logs --tail=20 redis

    # Restart Redis, then the workers
    cd ~/apps/orion
    docker compose --profile full restart redis
    sleep 5
    docker compose --profile full restart celery-worker celery-beat
    ```

=== "Task queue backed up"

    ```bash
    # Check queue length
    docker compose --profile full exec redis redis-cli llen celery

    # If the queue is extremely large and tasks are stale, purge it
    docker compose --profile full exec api \
      celery -A app.core.celery_app purge -f

    # Restart the worker to start fresh
    docker compose --profile full restart celery-worker
    ```

=== "Beat scheduler out of sync"

    ```bash
    # Remove the beat schedule file and restart
    docker compose --profile full exec celery-beat rm -f /app/celerybeat-schedule
    docker compose --profile full restart celery-beat
    ```
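Whether to purge a backed-up queue is a judgment call. As a rough rule of thumb (the thresholds here are assumptions, not documented limits):

```shell
# queue_action DEPTH -- suggest a response to the `llen celery` result.
queue_action() {
  local depth="$1"
  if [ "$depth" -gt 10000 ]; then
    echo "stale backlog likely: consider purge"
  elif [ "$depth" -gt 100 ]; then
    echo "backed up: check worker throughput first"
  else
    echo "normal"
  fi
}

# Demo; feed it: docker compose --profile full exec redis redis-cli llen celery
queue_action 42
```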
---

### 6. Disk Full

!!! warning "SEV-2 (becomes SEV-1 if PostgreSQL cannot write WAL)"

    The server has a 37 GB disk. Docker images, logs, and database WAL can fill it quickly.

**Symptoms**: Write errors in logs. PostgreSQL panics. Docker cannot pull images. `No space left on device` errors.

**Diagnose**:

```bash
# Overall disk usage
df -h /

# Docker disk usage breakdown
docker system df

# Largest directories
sudo du -sh /var/lib/docker/* 2>/dev/null | sort -rh | head -10
du -sh ~/backups/* 2>/dev/null
du -sh ~/apps/orion/logs/* 2>/dev/null
```

**Immediate cleanup**:

```bash
# 1. Remove old Docker images and build cache (usually frees 2-5 GB).
#    Caution: --volumes also deletes volumes not attached to any container,
#    so only use that flag while the database container is up.
docker system prune -af --volumes

# 2. Truncate application logs
cd ~/apps/orion
truncate -s 0 logs/*.log 2>/dev/null

# 3. Remove old backups beyond the retention policy
find ~/backups -name "*.sql.gz" -mtime +14 -delete

# 4. Clean systemd journal logs (keep the last 3 days)
sudo journalctl --vacuum-time=3d

# 5. Clean the apt cache
sudo apt clean
```

**After freeing space**:

```bash
# Verify space recovered
df -h /

# Restart any containers that failed due to the full disk
cd ~/apps/orion
docker compose --profile full up -d
```
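The 70% alert threshold can also be checked ad hoc from a shell session. A hypothetical helper (not part of the repo), fed the usage percentage from `df`:

```shell
# disk_status PCT -- classify root filesystem usage; thresholds mirror
# the Grafana alert (70%) plus an assumed 90% critical level.
# Feed it: df --output=pcent / | tail -1 | tr -dc '0-9'
disk_status() {
  local pct="$1"
  if [ "$pct" -ge 90 ]; then
    echo "critical: run the cleanup steps now"
  elif [ "$pct" -ge 70 ]; then
    echo "warning: schedule cleanup"
  else
    echo "ok (${pct}% used)"
  fi
}

# Demo on a sample value
disk_status 65
```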
!!! tip "Prevention"

    Set up a Grafana alert for disk usage > 70%. The node-exporter dashboard (ID 1860) includes disk usage panels. If the server persistently runs low on disk, upgrade to CAX21 (80 GB disk).

---
### 7. Full Stack Restart (After Reboot)

!!! info "SEV-2"

    After a server reboot (planned or unplanned), all services need to come back up in the correct order.

**When to use**: After a Hetzner maintenance reboot, manual reboot, or kernel upgrade.

**Step-by-step recovery**:

```bash
# 1. Verify Docker is running
sudo systemctl status docker
# If not: sudo systemctl start docker

# 2. Start Gitea (needed for CI, not for the app itself)
cd ~/gitea && docker compose up -d
sleep 5

# 3. Start the Orion stack (db and redis start first due to depends_on)
cd ~/apps/orion
docker compose --profile full up -d
sleep 15

# 4. Verify all containers are healthy
docker compose --profile full ps

# 5. Verify API health
curl -s http://localhost:8001/health | python3 -m json.tool

# 6. Start Caddy (should auto-start, but verify)
sudo systemctl status caddy
# If not running: sudo systemctl start caddy

# 7. Start the Gitea Actions runner
sudo systemctl status gitea-runner
# If not running: sudo systemctl start gitea-runner

# 8. Verify external access
curl -s https://api.wizard.lu/health
curl -I https://wizard.lu
curl -I https://omsflow.lu
curl -I https://rewardflow.lu

# 9. Verify monitoring
curl -I https://grafana.wizard.lu
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep -c '"health":"up"'

# 10. Verify the backup timer is active
systemctl list-timers orion-backup.timer
```
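The external checks in step 8 can be summarized with a small helper (hypothetical, not an existing script) that flags any domain not returning a 2xx/3xx status:

```shell
# check_status DOMAIN CODE -- report pass/fail for an HTTP status code.
# Real usage:
#   check_status wizard.lu "$(curl -s -o /dev/null -w '%{http_code}' https://wizard.lu)"
check_status() {
  local domain="$1" code="$2"
  case "$code" in
    2*|3*) echo "$domain OK ($code)" ;;
    *)     echo "$domain FAIL ($code)" ;;
  esac
}

# Demo on sample codes
check_status wizard.lu 200
check_status api.wizard.lu 502
```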
!!! note "Boot order"

    Docker containers with `restart: always` auto-start after Docker starts. Caddy and the Gitea runner are systemd services with `WantedBy=multi-user.target` and also auto-start. In practice, you mainly need to verify rather than start things manually.

---
### 8. Restore from Backup (Disaster Recovery)

!!! danger "SEV-1"

    Use this runbook when the database is corrupted or data is lost and you need to restore from a backup.

**Prerequisites**: Identify the backup to restore from.

```bash
# List available local backups
ls -lh ~/backups/orion/daily/
ls -lh ~/backups/orion/weekly/

# If local backups are gone, list what is available in R2
source ~/apps/orion/.env
aws s3 ls s3://orion-backups/orion/daily/ \
  --endpoint-url "https://${R2_ACCOUNT_ID:-$(grep R2_ACCOUNT_ID ~/apps/orion/.env | cut -d= -f2)}.r2.cloudflarestorage.com" \
  --profile r2
```

**Download from R2 (if local backups are unavailable)**:

```bash
aws s3 sync s3://orion-backups/ ~/backups/ \
  --endpoint-url "https://<ACCOUNT_ID>.r2.cloudflarestorage.com" \
  --profile r2
```
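Before restoring, it is worth sanity-checking the chosen archive. A hypothetical helper (not part of the repo) that verifies gzip integrity and a non-trivial size (the 100-byte floor is an arbitrary assumption):

```shell
# verify_backup FILE -- fail on a corrupt or implausibly small archive.
verify_backup() {
  local f="$1"
  gzip -t "$f" 2>/dev/null || { echo "corrupt gzip: $f"; return 1; }
  local size
  size=$(wc -c < "$f")
  if [ "$size" -le 100 ]; then
    echo "suspiciously small: $f ($size bytes)"
    return 1
  fi
  echo "looks sane: $f ($size bytes)"
}

# Demo on a synthetic file; real usage:
#   verify_backup ~/backups/orion/daily/<backup-file>.sql.gz
seq 1 1000 | gzip > /tmp/demo-backup.sql.gz
verify_backup /tmp/demo-backup.sql.gz
```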
**Restore using the restore script**:

```bash
# Restore the Orion database
bash ~/apps/orion/scripts/restore.sh orion ~/backups/orion/daily/<backup-file>.sql.gz
```

The restore script will:

1. Stop the application containers (API, Celery) while keeping the database running
2. Drop and recreate the `orion_db` database
3. Restore from the `.sql.gz` backup file
4. Run `alembic upgrade heads` to apply any pending migrations
5. Restart all containers

**Verify after restore**:

```bash
cd ~/apps/orion

# Check API health
curl -s http://localhost:8001/health | python3 -m json.tool

# Verify data integrity (check row counts of key tables)
docker compose --profile full exec db \
  psql -U orion_user -d orion_db -c \
  "SELECT 'platforms' AS tbl, count(*) FROM platforms
   UNION ALL SELECT 'users', count(*) FROM users
   UNION ALL SELECT 'stores', count(*) FROM stores;"

# Verify external access
curl -s https://api.wizard.lu/health
```

**Restore Gitea (if needed)**:

```bash
bash ~/apps/orion/scripts/restore.sh gitea ~/backups/gitea/daily/<backup-file>.sql.gz
```

**Full server rebuild from a Hetzner snapshot** (worst case):

1. Go to **Hetzner Cloud Console > Servers > Snapshots**
2. Select the most recent snapshot and click **Rebuild from snapshot**
3. After the rebuild, SSH in and verify all services per [Runbook 7](#7-full-stack-restart-after-reboot)

---
## Post-Incident Report Template

After resolving any SEV-1 or SEV-2 incident, create a post-incident report. Save reports in a shared location for the team.

```markdown
# Post-Incident Report: [Brief Title]

**Date**: YYYY-MM-DD
**Severity**: SEV-1 / SEV-2
**Duration**: HH:MM (from detection to resolution)
**Author**: [Name]

## Incident Summary

[1-2 sentence description of what happened and the user impact.]

## Timeline (UTC)

| Time  | Event                            |
|-------|----------------------------------|
| HH:MM | Alert triggered / issue reported |
| HH:MM | Responder acknowledged           |
| HH:MM | Root cause identified            |
| HH:MM | Fix applied                      |
| HH:MM | Service fully restored           |

## Root Cause

[What caused the incident. Be specific -- e.g., "OOM killer terminated the API
container because a Celery import task loaded 50k products into memory at once."]

## Resolution

[What was done to fix it. Include exact commands if relevant.]

## Impact

- **Users affected**: [number or scope]
- **Data lost**: [none / describe]
- **Downtime**: [duration]

## Action Items

| Action                 | Owner  | Due Date   | Status   |
|------------------------|--------|------------|----------|
| [Preventive measure 1] | [Name] | YYYY-MM-DD | [ ] Open |
| [Preventive measure 2] | [Name] | YYYY-MM-DD | [ ] Open |

## Lessons Learned

[What went well, what could be improved in the response process.]
```

---
## Useful Monitoring URLs

| Service | URL | Purpose |
|---------|-----|---------|
| **Grafana** | [grafana.wizard.lu](https://grafana.wizard.lu) | Dashboards for host and container metrics |
| **Prometheus** | `http://localhost:9090` (SSH tunnel) | Raw metrics queries, target health |
| **Prometheus Targets** | `http://localhost:9090/targets` | Check which scrape targets are up/down |
| **API Health** | [api.wizard.lu/health](https://api.wizard.lu/health) | Application health check (DB, Redis) |
| **API Liveness** | [api.wizard.lu/health/live](https://api.wizard.lu/health/live) | Basic liveness probe |
| **API Readiness** | [api.wizard.lu/health/ready](https://api.wizard.lu/health/ready) | Readiness probe (includes dependencies) |
| **API Metrics** | [api.wizard.lu/metrics](https://api.wizard.lu/metrics) | Prometheus-format application metrics |
| **Flower** | [flower.wizard.lu](https://flower.wizard.lu) | Celery task monitoring, worker status |
| **Gitea** | [git.wizard.lu](https://git.wizard.lu) | Git repository and CI pipeline status |
| **Main Platform** | [wizard.lu](https://wizard.lu) | Main storefront |
| **OMS Platform** | [omsflow.lu](https://omsflow.lu) | OMS storefront |
| **Loyalty+ Platform** | [rewardflow.lu](https://rewardflow.lu) | Loyalty+ storefront |
| **Hetzner Console** | [console.hetzner.cloud](https://console.hetzner.cloud) | Server management, snapshots, rescue mode |
| **Hetzner Status** | [status.hetzner.com](https://status.hetzner.com) | Hetzner infrastructure status |

!!! tip "SSH tunnel for internal services"

    Prometheus and other internal services are not exposed externally. To access them from your local machine:

    ```bash
    # Forward the server's localhost:9090 to localhost:9090 on your machine
    ssh -L 9090:localhost:9090 samir@91.99.65.229

    # Then open http://localhost:9090 in your browser
    ```

---
## Quick Reference: Essential Commands

```bash
# SSH into the server
ssh samir@91.99.65.229

# Container status
cd ~/apps/orion && docker compose --profile full ps

# Container resource usage
docker stats --no-stream

# Follow all logs
cd ~/apps/orion && docker compose --profile full logs -f

# Restart a single service
cd ~/apps/orion && docker compose --profile full restart <service>

# Full stack rebuild
cd ~/apps/orion && docker compose --profile full up -d --build

# Caddy status / logs
sudo systemctl status caddy
sudo journalctl -u caddy -f

# System resources
free -h && df -h / && uptime

# Manual deploy
cd ~/apps/orion && bash scripts/deploy.sh

# Manual backup
bash ~/apps/orion/scripts/backup.sh --upload

# Run migrations
cd ~/apps/orion && docker compose --profile full exec -e PYTHONPATH=/app api python -m alembic upgrade heads
```