feat(infra): add launch readiness quick wins
- Add mem_limit to all 6 app containers (db: 512m, redis: 128m, api: 512m, celery-worker: 512m, celery-beat: 128m, flower: 128m)
- Restrict Flower port to localhost (127.0.0.1:5555:5555)
- Add PostgreSQL and Redis health checks to /health/ready endpoint with individual check details (name, status, latency)
- Add scaling guide with metrics, thresholds, Hetzner pricing
- Add server verification script (12 infrastructure checks)
- Update hetzner-server-setup.md with progress and pending tasks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -132,6 +132,22 @@ Complete step-by-step guide for deploying Orion on a Hetzner Cloud VPS.
**Steps 1–24 fully deployed and operational.**

!!! success "Progress — 2026-02-16 (continued)"

    **Launch readiness — code changes:**

    - **Memory limits** added to all 6 app containers in `docker-compose.yml` (db: 512m, redis: 128m, api: 512m, celery-worker: 512m, celery-beat: 128m, flower: 128m)
    - **Flower port** restricted to localhost only (`127.0.0.1:5555:5555`) — access via Caddy reverse proxy
    - **Infrastructure health checks** — `/health/ready` now checks PostgreSQL (`SELECT 1`) and Redis (`ping`) with individual check details and latency
    - **Scaling guide** — practical playbook at `docs/deployment/scaling-guide.md` (metrics, thresholds, Hetzner pricing, timeline)
    - **Server verification script** — `scripts/verify-server.sh` checks all 12 infrastructure components

    **Pending server-side tasks:**

    - [ ] Deploy fail2ban Caddy auth jail (documented in Step 20; config ready but not yet applied)
    - [ ] Change Flower password from the default (`FLOWER_PASSWORD` in `.env`)
    - [ ] Verify unattended-upgrades is active (`sudo unattended-upgrades --dry-run`)
    - [ ] Run `scripts/verify-server.sh` on the server to validate all infrastructure

## Installed Software Versions
docs/deployment/scaling-guide.md (new file, 267 lines)

@@ -0,0 +1,267 @@

# Scaling Guide

Practical playbook for scaling Orion from a single CAX11 server to a multi-server architecture.

---
## Current Setup

| Component | Spec |
|-----------|------|
| Server | Hetzner CAX11 (ARM64) |
| vCPU | 2 |
| RAM | 4 GB |
| Disk | 40 GB SSD |
| Cost | ~4.50 EUR/mo |

### Container Memory Budget

| Container | Limit | Purpose |
|-----------|-------|---------|
| db | 512 MB | PostgreSQL 15 |
| redis | 128 MB | Task broker + cache |
| api | 512 MB | FastAPI (Uvicorn) |
| celery-worker | 512 MB | Background tasks |
| celery-beat | 128 MB | Task scheduler |
| flower | 128 MB | Celery monitoring |
| **App subtotal** | **1,920 MB** | |
| prometheus | 256 MB | Metrics (15-day retention) |
| grafana | 192 MB | Dashboards |
| node-exporter | 64 MB | Host metrics |
| cadvisor | 128 MB | Container metrics |
| alertmanager | 32 MB | Alert routing |
| **Monitoring subtotal** | **672 MB** | |
| **Total containers** | **2,592 MB** | |
| OS + Caddy + Gitea + CI | ~1,400 MB | Remaining headroom |
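The budget can be sanity-checked with plain shell arithmetic; the per-container limits below are taken from the table:

```shell
# Per-container limits from docker-compose.yml, in MB
app=$((512 + 128 + 512 + 512 + 128 + 128))       # db, redis, api, celery-worker, celery-beat, flower
monitoring=$((256 + 192 + 64 + 128 + 32))        # prometheus, grafana, node-exporter, cadvisor, alertmanager
total=$((app + monitoring))
echo "app=${app} MB, monitoring=${monitoring} MB, total=${total} MB"
# 4096 - 2592 = 1504 MB left for OS, Caddy, Gitea and CI on a 4 GB CAX11
```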
---

## Key Metrics to Watch

Monitor these in Grafana (or via `curl` to the Prometheus query API).
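Without Grafana, any of the queries below can be sent directly to the Prometheus HTTP API (`/api/v1/query`). A minimal sketch, assuming Prometheus on its default port 9090 and using `python3` only to URL-encode the PromQL expression:

```shell
PROM_URL="http://localhost:9090"   # adjust if Prometheus is exposed elsewhere

# Build an instant-query URL for the Prometheus HTTP API
build_query() {
  local encoded
  encoded=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1], safe=""))' "$1")
  printf '%s/api/v1/query?query=%s\n' "$PROM_URL" "$encoded"
}

build_query 'up'
# On the server, fetch a result with:
# curl -s "$(build_query '(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100')" | jq '.data.result'
```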
### RAM Usage

```promql
# Host memory usage percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Per-container memory usage (MB)
container_memory_usage_bytes{name=~"orion.*"} / 1024 / 1024
```

**Threshold**: Alert at >85% host RAM. Scale at sustained >80%.

### CPU Usage

```promql
# Host CPU usage (5-minute average)
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Per-container CPU
rate(container_cpu_usage_seconds_total{name=~"orion.*"}[5m]) * 100
```

**Threshold**: Alert at >80% for 5 minutes. Scale at sustained >70%.

### Disk Usage

```promql
# Disk usage percentage
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100
```

**Threshold**: Alert at >80%. Critical at >90%. Scale disk or clean up.
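As a sketch, the same 80% cutoff can be wired into a cron-able shell check (assumes GNU `df`):

```shell
# Current root-filesystem usage as a bare integer percentage (GNU df)
disk_pct() { df --output=pcent / | tail -n 1 | tr -dc '0-9'; }

# Succeeds at or below the 80% alert threshold, fails above it
check_disk() { [ "$1" -le 80 ]; }

if ! check_disk "$(disk_pct)"; then
  echo "disk above 80%: clean logs/backups or upgrade disk"
fi
```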
### API Latency

```promql
# P95 response time (if using prometheus_client histograms)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```

**Threshold**: Alert at P95 >2s. Investigate at P95 >1s.

### Database Connections

```promql
# Active PostgreSQL connections (requires a pg_stat_activity exporter)
pg_stat_activity_count
```

**Threshold**: Default pool is 10 + 20 overflow = 30 max. Alert at >20 active.

### Redis Memory

```promql
# Redis used memory
redis_memory_used_bytes
```

**Threshold**: Alert at >100 MB (of the 128 MB limit). Scale the Redis limit or add an eviction policy.

---
## When to Scale

```
Is RAM consistently >80%?
├── YES → Upgrade server (CAX11 → CAX21)
└── NO
    Is API P95 latency >2s?
    ├── YES → Is it DB queries?
    │   ├── YES → Add PgBouncer or increase pool size
    │   └── NO → Add Uvicorn workers or upgrade CPU
    └── NO
        Is disk >80%?
        ├── YES → Clean logs/backups or upgrade disk
        └── NO
            Are Celery tasks queuing >100 for >10min?
            ├── YES → Add celery-worker replicas
            └── NO → No scaling needed
```

---
## Scaling Actions

### 1. Server Upgrade (Vertical Scaling)

The fastest path. Hetzner allows live rescaling with a ~2-minute restart.

```bash
# In Hetzner Cloud Console:
# Servers > your server > Rescale > select new plan > Rescale
```

After the rescale, update the memory limits in `docker-compose.yml` to use the additional RAM, then restart:

```bash
cd ~/apps/orion
docker compose --profile full up -d
```
### 2. Add PgBouncer (Connection Pooling)

When database connections become a bottleneck (>20 active connections):

```yaml
# Add to docker-compose.yml
pgbouncer:
  image: edoburu/pgbouncer:latest
  restart: always
  environment:
    DATABASE_URL: postgresql://orion_user:secure_password@db:5432/orion_db
    POOL_MODE: transaction
    MAX_CLIENT_CONN: 100
    DEFAULT_POOL_SIZE: 20
  mem_limit: 64m
  networks:
    - backend
```

Update `DATABASE_URL` in the API and Celery services to point to PgBouncer instead of `db` directly.
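For example, a hypothetical `.env` change (the `pgbouncer` service name matches the compose snippet above; credentials are placeholders):

```bash
# Before: postgresql://orion_user:secure_password@db:5432/orion_db
DATABASE_URL=postgresql://orion_user:secure_password@pgbouncer:5432/orion_db
```

Then recreate the app containers (e.g. `docker compose up -d api celery-worker celery-beat`) so they pick up the new value.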
### 3. Redis Hardening

Set a `maxmemory` policy to prevent OOM:

```yaml
# In docker-compose.yml, add a command to the redis service
redis:
  command: redis-server --maxmemory 100mb --maxmemory-policy allkeys-lru
```
### 4. Separate Database Server

When the database needs its own resources (typically >50 stores):

1. Create a new Hetzner server (CAX11 or CAX21) for PostgreSQL
2. Move the `db` service to the new server
3. Update `DATABASE_URL` to point to the DB server's IP
4. Set up `pg_hba.conf` to allow connections from the app server
5. Keep Redis on the app server (latency-sensitive)
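Step 4 might look like the following sketch (hypothetical app-server IP `10.0.0.2`, e.g. on a Hetzner private network; the path shown is the Debian/Ubuntu package default):

```
# /etc/postgresql/15/main/pg_hba.conf on the DB server
# Allow the app server to connect to orion_db as orion_user
host    orion_db    orion_user    10.0.0.2/32    scram-sha-256
```

Reload PostgreSQL afterwards (`sudo systemctl reload postgresql`) and make sure `listen_addresses` in `postgresql.conf` covers the private interface, not just localhost.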
### 5. Multi-Worker API

Scale Uvicorn workers for higher request throughput:

```yaml
# In docker-compose.yml, update the api command
api:
  command: uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4
```

Rule of thumb: `workers = 2 * CPU cores + 1`. On a CAX21 (4 vCPU) that is 9 workers max, but start with 4.
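The rule of thumb as shell arithmetic (`nproc` reports the server's core count):

```shell
cores=$(nproc)
workers=$((2 * cores + 1))
echo "suggested uvicorn worker ceiling: ${workers}"
# On a CAX21 (4 vCPU): 2*4+1 = 9; start with --workers 4 and watch RAM first
```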
### 6. Celery Worker Replicas

For heavy background task loads, scale horizontally:

```bash
docker compose --profile full up -d --scale celery-worker=3
```

Each replica adds ~512 MB RAM. Ensure the server has headroom.
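Before scaling, a quick headroom estimate (512 MB per replica, per the memory-limit table):

```shell
replicas=3
per_worker_mb=512                                # mem_limit of one celery-worker
extra_mb=$(( (replicas - 1) * per_worker_mb ))   # RAM beyond the single default worker
echo "scaling to ${replicas} workers needs up to ${extra_mb} MB extra RAM"
```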
---

## Hetzner ARM (CAX) Pricing

All prices are monthly, excl. VAT. ARM servers offer the best price/performance for Docker workloads.

| Plan | vCPU | RAM | Disk | Price | Suitable For |
|------|------|-----|------|-------|--------------|
| CAX11 | 2 | 4 GB | 40 GB | ~4.50 EUR | 1 client, up to 24 stores |
| CAX21 | 4 | 8 GB | 80 GB | ~7.50 EUR | 2-3 clients, up to 75 stores |
| CAX31 | 8 | 16 GB | 160 GB | ~14.50 EUR | 5-10 clients, up to 200 stores |
| CAX41 | 16 | 32 GB | 320 GB | ~27.50 EUR | 10-25 clients, up to 500 stores |

!!! tip "Upgrade path"

    Hetzner allows upgrading to a larger plan with a ~2-minute restart. No data migration needed. Always upgrade vertically first before adding horizontal complexity.
---

## Timeline

### Launch (Now)

- **Server**: CAX11 (4 GB)
- **Clients**: 1
- **Stores**: up to 24
- **Actions**: Memory limits set, monitoring active, alerts configured

### Early Growth (1-3 months)

- **Monitor**: RAM usage, API latency, disk growth
- **Trigger**: RAM consistently >80% or disk >70%
- **Action**: Upgrade to CAX21 (8 GB, ~7.50 EUR/mo)
- **Increase**: memory limits for db (1 GB), api (1 GB), celery-worker (1 GB)

### Growth (3-6 months)

- **Trigger**: 3+ clients, >75 stores, or DB queries slowing down
- **Actions**:
    - Add PgBouncer for connection pooling
    - Increase Uvicorn workers to 4
    - Consider a Redis maxmemory policy
- **Server**: CAX21 or CAX31 depending on load

### Scale (6-12 months)

- **Trigger**: 10+ clients, >200 stores
- **Actions**:
    - Separate database to its own server
    - Scale Celery workers (2-3 replicas)
    - Upgrade app server to CAX31
    - Consider CDN for static assets

### Enterprise (12+ months)

- **Trigger**: 25+ clients, >500 stores, SLA requirements
- **Actions**:
    - Multi-server architecture (app, DB, Redis, workers)
    - PostgreSQL read replicas
    - Redis Sentinel for HA
    - Load balancer for API
    - Consider Kubernetes if operational complexity is justified