# Scaling Guide

Practical playbook for scaling Orion from a single CAX11 server to a multi-server architecture.


## Current Setup

| Component | Spec |
|-----------|------|
| Server | Hetzner CAX11 (ARM64) |
| vCPU | 2 |
| RAM | 4 GB |
| Disk | 40 GB SSD |
| Cost | ~4.50 EUR/mo |

## Container Memory Budget

| Container | Limit | Purpose |
|-----------|-------|---------|
| db | 512 MB | PostgreSQL 15 |
| redis | 128 MB | Task broker + cache |
| api | 512 MB | FastAPI (Uvicorn) |
| celery-worker | 512 MB | Background tasks |
| celery-beat | 256 MB | Task scheduler |
| flower | 256 MB | Celery monitoring |
| **App subtotal** | **2,176 MB** | |
| prometheus | 256 MB | Metrics (15-day retention) |
| grafana | 192 MB | Dashboards |
| node-exporter | 64 MB | Host metrics |
| cadvisor | 128 MB | Container metrics |
| alertmanager | 32 MB | Alert routing |
| **Monitoring subtotal** | **672 MB** | |
| **Total containers** | **2,848 MB** | |
| OS + Caddy + Gitea + CI | ~1,150 MB | Remaining headroom |
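The subtotals above can be sanity-checked with a few lines of Python (limits copied from the table; useful to re-run whenever a limit changes in `docker-compose.yml`):

```python
# Per-container memory limits in MB, copied from the budget table above.
app = {"db": 512, "redis": 128, "api": 512,
       "celery-worker": 512, "celery-beat": 256, "flower": 256}
monitoring = {"prometheus": 256, "grafana": 192, "node-exporter": 64,
              "cadvisor": 128, "alertmanager": 32}

app_total = sum(app.values())               # 2,176 MB
monitoring_total = sum(monitoring.values())  # 672 MB
total = app_total + monitoring_total         # 2,848 MB
```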

## Key Metrics to Watch

Monitor these in Grafana (or via curl to Prometheus query API).
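For ad-hoc checks outside Grafana, the queries below can be run against the Prometheus instant-query HTTP API. A minimal stdlib-only sketch, assuming Prometheus is reachable at `localhost:9090` (the default port; adjust for your setup):

```python
import json
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:9090"  # assumption: default Prometheus address

def prom_query_url(expr: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{PROM_URL}/api/v1/query?" + urllib.parse.urlencode({"query": expr})

def prom_query(expr: str):
    """Run an instant query; requires a live Prometheus to actually call."""
    with urllib.request.urlopen(prom_query_url(expr)) as resp:
        return json.load(resp)["data"]["result"]

url = prom_query_url(
    "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100"
)
```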

### RAM Usage

```promql
# Host memory usage percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Per-container memory usage
container_memory_usage_bytes{name=~"orion.*"} / 1024 / 1024
```

Threshold: Alert at >85% host RAM. Scale at sustained >80%.

### CPU Usage

```promql
# Host CPU usage (5-minute average)
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Per-container CPU
rate(container_cpu_usage_seconds_total{name=~"orion.*"}[5m]) * 100
```

Threshold: Alert at >80% for 5 minutes. Scale at sustained >70%.

### Disk Usage

```promql
# Disk usage percentage
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100
```

Threshold: Alert at >80%. Critical at >90%. Scale disk or clean up.

### API Latency

```promql
# P95 response time (if using prometheus_client histograms)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```

Threshold: Alert at P95 >2s. Investigate at P95 >1s.

### Database Connections

```promql
# Active PostgreSQL connections (requires pg_stat_activity export)
pg_stat_activity_count
```

Threshold: Default pool is 10 + 20 overflow = 30 max. Alert at >20 active.
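The pool arithmetic can be wrapped in a small helper, e.g. for a health-check script. The defaults below are the pool sizes quoted above (SQLAlchemy-style `pool_size` + `max_overflow`); adjust them to your actual engine configuration:

```python
def pool_status(active: int, pool_size: int = 10, max_overflow: int = 20,
                alert_at: int = 20) -> dict:
    """Headroom check for a DB connection pool using the defaults quoted above."""
    capacity = pool_size + max_overflow  # 30 with the defaults
    return {
        "capacity": capacity,
        "alert": active > alert_at,       # guide's alert threshold
        "saturated": active >= capacity,  # new connections will queue or fail
    }
```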

### Redis Memory

```promql
# Redis used memory
redis_memory_used_bytes
```

Threshold: Alert at >100 MB (of 128 MB limit). Scale Redis limit or add eviction policy.


## When to Scale

```text
Is RAM consistently >80%?
├── YES → Upgrade server (CAX11 → CAX21)
└── NO
    Is API P95 latency >2s?
    ├── YES → Is it DB queries?
    │   ├── YES → Add PgBouncer or increase pool size
    │   └── NO → Add Uvicorn workers or upgrade CPU
    └── NO
        Is disk >80%?
        ├── YES → Clean logs/backups or upgrade disk
        └── NO
            Are Celery tasks queuing >100 for >10min?
            ├── YES → Add celery-worker replicas
            └── NO → No scaling needed
```
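The decision tree can also be expressed as a plain function, handy for a cron-driven check. The thresholds are this guide's; the function itself is an illustrative sketch, not shipped code:

```python
def scaling_action(ram_pct: float, p95_seconds: float, latency_is_db_bound: bool,
                   disk_pct: float, queued_tasks: int, queue_minutes: float) -> str:
    """Walk the scaling decision tree above and return the recommended action."""
    if ram_pct > 80:
        return "upgrade server (CAX11 -> CAX21)"
    if p95_seconds > 2:
        if latency_is_db_bound:
            return "add PgBouncer or increase pool size"
        return "add Uvicorn workers or upgrade CPU"
    if disk_pct > 80:
        return "clean logs/backups or upgrade disk"
    if queued_tasks > 100 and queue_minutes > 10:
        return "add celery-worker replicas"
    return "no scaling needed"
```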

## Scaling Actions

### 1. Server Upgrade (Vertical Scaling)

The fastest path. Hetzner allows live upgrades with a ~2 minute restart.

```text
# In Hetzner Cloud Console:
# Servers > your server > Rescale > select new plan > Rescale
```

After rescale, update memory limits in docker-compose.yml to use the additional RAM, then restart:

```bash
cd ~/apps/orion
docker compose --profile full up -d
```

### 2. Add PgBouncer (Connection Pooling)

When database connections become a bottleneck (>20 active connections):

```yaml
# Add to docker-compose.yml
pgbouncer:
  image: edoburu/pgbouncer:latest
  restart: always
  environment:
    DATABASE_URL: postgresql://orion_user:secure_password@db:5432/orion_db
    POOL_MODE: transaction
    MAX_CLIENT_CONN: 100
    DEFAULT_POOL_SIZE: 20
  mem_limit: 64m
  networks:
    - backend
```

Update `DATABASE_URL` in the API and Celery services to point to PgBouncer instead of `db` directly.

### 3. Redis Hardening

Set a maxmemory policy to prevent OOM:

```yaml
# In docker-compose.yml, add command to redis service
redis:
  command: redis-server --maxmemory 100mb --maxmemory-policy allkeys-lru
```

### 4. Separate Database Server

When the database needs its own resources (typically >50 stores):

  1. Create a new Hetzner server (CAX11 or CAX21) for PostgreSQL
  2. Move the db service to the new server
  3. Update DATABASE_URL to point to the DB server's IP
  4. Set up pg_hba.conf to allow connections from the app server
  5. Keep Redis on the app server (latency-sensitive)

### 5. Multi-Worker API

Scale Uvicorn workers for higher request throughput:

```yaml
# In docker-compose.yml, update api command
api:
  command: uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4
```

Rule of thumb: workers = 2 * CPU cores + 1. On CAX21 (4 vCPU): 9 workers max, but start with 4.
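As a one-liner for the rule of thumb (an upper bound, not a target — start with 4 and grow with observed load):

```python
def uvicorn_workers(vcpus: int) -> int:
    """Rule-of-thumb Uvicorn worker ceiling: 2 * cores + 1."""
    return 2 * vcpus + 1
```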

### 6. Celery Worker Replicas

For heavy background task loads, scale horizontally:

```bash
docker compose --profile full up -d --scale celery-worker=3
```

Each replica adds ~512 MB RAM. Ensure the server has headroom.
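A quick headroom check before scaling, using the numbers from the budget table above (the non-worker container total is 2,848 − 512 = 2,336 MB; the ~1,150 MB OS/Caddy/Gitea/CI reserve is this guide's estimate):

```python
def worker_replicas_fit(replicas: int, host_ram_mb: int = 4096,
                        per_worker_mb: int = 512,
                        other_containers_mb: int = 2336,  # budget total minus one worker
                        os_reserve_mb: int = 1150) -> bool:
    """True if N celery-worker replicas fit inside the host RAM budget."""
    needed = other_containers_mb + replicas * per_worker_mb + os_reserve_mb
    return needed <= host_ram_mb
```

On a 4 GB CAX11 only the single default worker fits; three replicas need a larger plan.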


## Hetzner ARM (CAX) Pricing

All prices are monthly, excl. VAT. ARM servers offer the best price/performance for Docker workloads.

| Plan | vCPU | RAM | Disk | Price | Suitable For |
|------|------|-----|------|-------|--------------|
| CAX11 | 2 | 4 GB | 40 GB | ~4.50 EUR | 1 client, up to 24 stores |
| CAX21 | 4 | 8 GB | 80 GB | ~7.50 EUR | 2-3 clients, up to 75 stores |
| CAX31 | 8 | 16 GB | 160 GB | ~14.50 EUR | 5-10 clients, up to 200 stores |
| CAX41 | 16 | 32 GB | 320 GB | ~27.50 EUR | 10-25 clients, up to 500 stores |

!!! tip "Upgrade path"
    Hetzner allows upgrading to a larger plan with a ~2 minute restart. No data migration needed. Always upgrade vertically first before adding horizontal complexity.


## Timeline

### Launch (Now)

- Server: CAX11 (4 GB)
- Clients: 1
- Stores: up to 24
- Actions: Memory limits set, monitoring active, alerts configured

### Early Growth (1-3 months)

- Monitor: RAM usage, API latency, disk growth
- Trigger: RAM consistently >80% or disk >70%
- Action: Upgrade to CAX21 (8 GB, ~7.50 EUR/mo)
- Increase: memory limits for db (1 GB), api (1 GB), celery-worker (1 GB)

### Growth (3-6 months)

- Trigger: 3+ clients, >75 stores, or DB queries slowing down
- Actions:
    - Add PgBouncer for connection pooling
    - Increase Uvicorn workers to 4
    - Consider Redis maxmemory policy
- Server: CAX21 or CAX31 depending on load

### Scale (6-12 months)

- Trigger: 10+ clients, >200 stores
- Actions:
    - Separate database to its own server
    - Scale Celery workers (2-3 replicas)
    - Upgrade app server to CAX31
    - Consider CDN for static assets

### Enterprise (12+ months)

- Trigger: 25+ clients, >500 stores, SLA requirements
- Actions:
    - Multi-server architecture (app, DB, Redis, workers)
    - PostgreSQL read replicas
    - Redis Sentinel for HA
    - Load balancer for API
    - Consider Kubernetes if operational complexity is justified