# Scaling Guide

Practical playbook for scaling Orion from a single CAX11 server to a multi-server architecture.

---

## Current Setup

| Component | Spec |
|-----------|------|
| Server | Hetzner CAX11 (ARM64) |
| vCPU | 2 |
| RAM | 4 GB |
| Disk | 40 GB SSD |
| Cost | ~4.50 EUR/mo |

### Container Memory Budget

| Container | Limit | Purpose |
|-----------|-------|---------|
| db | 256 MB | PostgreSQL 15 |
| redis | 128 MB | Task broker (maxmemory 100mb, allkeys-lru) |
| api | 512 MB | FastAPI (Uvicorn) |
| celery-worker | 768 MB | Background tasks (concurrency=2) |
| celery-beat | 128 MB | Task scheduler |
| flower | 192 MB | Celery monitoring |
| **App subtotal** | **1,984 MB** | |
| prometheus | 256 MB | Metrics (15-day retention) |
| grafana | 192 MB | Dashboards |
| node-exporter | 64 MB | Host metrics |
| cadvisor | 128 MB | Container metrics |
| redis-exporter | 32 MB | Redis metrics |
| alertmanager | 32 MB | Alert routing |
| **Monitoring subtotal** | **704 MB** | |
| **Total containers** | **2,688 MB** | |
| OS + Caddy + Gitea + CI | ~1,300 MB | Remaining headroom |

---

## Key Metrics to Watch

Monitor these in Grafana (or via `curl` against the Prometheus query API).

### RAM Usage

```promql
# Host memory usage percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Per-container memory usage (MB)
container_memory_usage_bytes{name=~"orion.*"} / 1024 / 1024
```

**Threshold**: Alert at >85% host RAM. Scale at sustained >80%.

### CPU Usage

```promql
# Host CPU usage percentage (5-minute average)
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Per-container CPU usage (percent of one core)
rate(container_cpu_usage_seconds_total{name=~"orion.*"}[5m]) * 100
```

**Threshold**: Alert at >80% for 5 minutes. Scale at sustained >70%.

### Disk Usage

```promql
# Disk usage percentage
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100
```

**Threshold**: Alert at >80%. Critical at >90%. Scale disk or clean up.
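The RAM, CPU, and disk thresholds above can be expressed as a small lookup, which may help when scripting checks outside Alertmanager. This is a minimal sketch: the `THRESHOLDS` layout and the `classify` function are hypothetical names, not part of Orion, and the sketch ignores the "sustained" and "for 5 minutes" conditions, which still need a time window in the alert rule itself.

```python
# Percent thresholds copied from the RAM, CPU, and disk sections above.
# The dict layout and function name are illustrative assumptions.
THRESHOLDS = {
    "ram": {"scale": 80.0, "alert": 85.0},
    "cpu": {"scale": 70.0, "alert": 80.0},
    "disk": {"alert": 80.0, "critical": 90.0},
}


def classify(metric: str, percent: float) -> str:
    """Return the most severe level whose threshold is exceeded, or "ok"."""
    levels = THRESHOLDS[metric]
    # Collect every breached level, then pick the one with the highest limit.
    breached = [(limit, name) for name, limit in levels.items() if percent > limit]
    return max(breached)[1] if breached else "ok"


if __name__ == "__main__":
    for metric, value in [("ram", 82.0), ("cpu", 65.0), ("disk", 91.5)]:
        print(f"{metric}: {value}% -> {classify(metric, value)}")
```

The input values would come from the PromQL queries above; feeding them in by hand is enough to sanity-check the thresholds before wiring up alerts.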
### API Latency

```promql
# P95 response time (if using prometheus_client histograms)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```

**Threshold**: Alert at P95 >2s. Investigate at P95 >1s.

### Database Connections

```promql
# Active PostgreSQL connections (requires a PostgreSQL exporter that exposes pg_stat_activity)
pg_stat_activity_count
```

**Threshold**: The default pool is 10 connections + 20 overflow = 30 max. Alert at >20 active.

### Redis Memory

```promql
# Redis used memory (bytes)
redis_memory_used_bytes
```

**Threshold**: Alert at >100 MB (of the 128 MB container limit). Raise the Redis limits or review the eviction policy (currently `allkeys-lru`).

---

## When to Scale

```
Is RAM consistently >80%?
├── YES → Upgrade server (CAX11 → CAX21)
└── NO
    Is API P95 latency >2s?
    ├── YES → Is it DB queries?
    │         ├── YES → Add PgBouncer or increase pool size
    │         └── NO  → Add Uvicorn workers or upgrade CPU
    └── NO
        Is disk >80%?
        ├── YES → Clean logs/backups or upgrade disk
        └── NO
            Are Celery tasks queuing >100 for >10min?
            ├── YES → Add celery-worker replicas
            └── NO  → No scaling needed
```

---

## Scaling Actions

### 1. Server Upgrade (Vertical Scaling)

The fastest path. Hetzner upgrades a server in place with a ~2 minute restart.

```bash
# In Hetzner Cloud Console:
# Servers > your server > Rescale > select new plan > Rescale
```

After the rescale, raise the memory limits in `docker-compose.yml` to use the additional RAM, then restart:

```bash
cd ~/apps/orion
docker compose --profile full up -d
```

### 2. Add PgBouncer (Connection Pooling)

When database connections become a bottleneck (>20 active connections):

```yaml
# Add to docker-compose.yml
pgbouncer:
  image: edoburu/pgbouncer:latest
  restart: always
  environment:
    # Upstream PostgreSQL instance that PgBouncer pools connections to
    DATABASE_URL: postgresql://orion_user:secure_password@db:5432/orion_db
    POOL_MODE: transaction
    MAX_CLIENT_CONN: 100
    DEFAULT_POOL_SIZE: 20
  mem_limit: 64m
  networks:
    - backend
```

Update `DATABASE_URL` in the API and Celery services to point to PgBouncer instead of `db` directly.

### 3. Redis Hardening

Redis `maxmemory` is already configured in `docker-compose.yml`:

```yaml
redis:
  command: redis-server --maxmemory 100mb --maxmemory-policy allkeys-lru
```

If Redis usage grows beyond 80%, the `RedisHighMemoryUsage` alert fires. Increase `maxmemory` and `mem_limit` together.

### 4. Separate Database Server

When the database needs its own resources (typically >50 stores):

1. Create a new Hetzner server (CAX11 or CAX21) for PostgreSQL
2. Move the `db` service to the new server
3. Update `DATABASE_URL` to point to the DB server's IP
4. Set up `pg_hba.conf` to allow connections from the app server
5. Keep Redis on the app server (latency-sensitive)

### 5. Multi-Worker API

Scale Uvicorn workers for higher request throughput:

```yaml
# In docker-compose.yml, update the api command
api:
  command: uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4
```

Rule of thumb: `workers = 2 * CPU cores + 1`. On CAX21 (4 vCPU) that gives 9 workers max, but start with 4.

### 6. Celery Worker Replicas

For heavy background task loads, scale horizontally:

```bash
docker compose --profile full up -d --scale celery-worker=3
```

Each replica adds ~512 MB RAM and can use up to its 768 MB limit. Ensure the server has headroom.

---

## Hetzner ARM (CAX) Pricing

All prices are monthly, excl. VAT. ARM servers offer the best price/performance for Docker workloads.

| Plan | vCPU | RAM | Disk | Price | Suitable For |
|------|------|-----|------|-------|-------------|
| CAX11 | 2 | 4 GB | 40 GB | ~4.50 EUR | 1 client, up to 24 stores |
| CAX21 | 4 | 8 GB | 80 GB | ~7.50 EUR | 2-3 clients, up to 75 stores |
| CAX31 | 8 | 16 GB | 160 GB | ~14.50 EUR | 5-10 clients, up to 200 stores |
| CAX41 | 16 | 32 GB | 320 GB | ~27.50 EUR | 10-25 clients, up to 500 stores |

!!! tip "Upgrade path"
    Hetzner allows upgrading to a larger plan with a ~2 minute restart. No data migration needed. Upgrade vertically before adding horizontal complexity.
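
The pricing table and the Uvicorn rule of thumb can be combined into a quick sizing helper. The sketch below is illustrative only: `Plan`, `PLANS`, `smallest_plan`, and `max_uvicorn_workers` are hypothetical names, and the client/store limits are simply the upper bounds from the "Suitable For" column, not measured capacities.

```python
from typing import NamedTuple, Optional


class Plan(NamedTuple):
    name: str
    vcpu: int
    ram_gb: int
    max_clients: int
    max_stores: int


# Upper bounds taken from the "Suitable For" column of the pricing table.
PLANS = [
    Plan("CAX11", 2, 4, 1, 24),
    Plan("CAX21", 4, 8, 3, 75),
    Plan("CAX31", 8, 16, 10, 200),
    Plan("CAX41", 16, 32, 25, 500),
]


def smallest_plan(clients: int, stores: int) -> Optional[Plan]:
    """Return the smallest plan covering both counts, or None if none fits."""
    for plan in PLANS:  # PLANS is ordered from smallest to largest
        if clients <= plan.max_clients and stores <= plan.max_stores:
            return plan
    return None


def max_uvicorn_workers(vcpu: int) -> int:
    """Upper bound from the rule of thumb: workers = 2 * CPU cores + 1."""
    return 2 * vcpu + 1


if __name__ == "__main__":
    plan = smallest_plan(clients=2, stores=60)
    if plan is not None:
        print(plan.name, "max workers:", max_uvicorn_workers(plan.vcpu))
```

A `None` result means the load exceeds the largest single plan in the table, which is the point where a multi-server setup has to be considered.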
---

## Timeline

### Launch (Now)

- **Server**: CAX11 (4 GB)
- **Clients**: 1
- **Stores**: up to 24
- **Actions**: Memory limits set, monitoring active, alerts configured

### Early Growth (1-3 months)

- **Monitor**: RAM usage, API latency, disk growth
- **Trigger**: RAM consistently >80% or disk >70%
- **Action**: Upgrade to CAX21 (8 GB, ~7.50 EUR/mo)
- **Increase**: memory limits for db (1 GB), api (1 GB), celery-worker (1 GB)

### Growth (3-6 months)

- **Trigger**: 3+ clients, >75 stores, or DB queries slowing down
- **Actions**:
    - Add PgBouncer for connection pooling
    - Increase Uvicorn workers to 4
    - Consider Redis maxmemory policy
- **Server**: CAX21 or CAX31 depending on load

### Scale (6-12 months)

- **Trigger**: 10+ clients, >200 stores
- **Actions**:
    - Separate database to its own server
    - Scale Celery workers (2-3 replicas)
    - Upgrade app server to CAX31
    - Consider CDN for static assets

### Enterprise (12+ months)

- **Trigger**: 25+ clients, >500 stores, SLA requirements
- **Actions**:
    - Multi-server architecture (app, DB, Redis, workers)
    - PostgreSQL read replicas
    - Redis Sentinel for HA
    - Load balancer for API
    - Consider Kubernetes if operational complexity is justified
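
Whatever stage of the timeline applies, the next step still follows the decision tree in "When to Scale". A minimal sketch of that tree as a function is shown here; the `Snapshot` fields and `next_action` are hypothetical names, and `db_is_bottleneck` stands in for a manual judgment (for example, from slow query logs) that the tree does not define.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Point-in-time readings; field names are illustrative assumptions."""
    ram_percent: float
    api_p95_seconds: float
    db_is_bottleneck: bool
    disk_percent: float
    celery_queue_length: int
    celery_queue_minutes: float


def next_action(s: Snapshot) -> str:
    """Walk the "When to Scale" tree top to bottom and return the first match."""
    if s.ram_percent > 80:
        return "Upgrade server"
    if s.api_p95_seconds > 2:
        if s.db_is_bottleneck:
            return "Add PgBouncer or increase pool size"
        return "Add Uvicorn workers or upgrade CPU"
    if s.disk_percent > 80:
        return "Clean logs/backups or upgrade disk"
    if s.celery_queue_length > 100 and s.celery_queue_minutes > 10:
        return "Add celery-worker replicas"
    return "No scaling needed"


if __name__ == "__main__":
    reading = Snapshot(62.0, 2.4, True, 55.0, 12, 0.0)
    print(next_action(reading))
```

The checks run in the same order as the tree, so a RAM problem is always addressed before latency, disk, or queue depth.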