# Scaling Guide

Practical playbook for scaling Orion from a single CAX11 server to a multi-server architecture.


## Current Setup

| Component | Spec |
|-----------|------|
| Server | Hetzner CAX11 (ARM64) |
| vCPU | 2 |
| RAM | 4 GB |
| Disk | 40 GB SSD |
| Cost | ~4.50 EUR/mo |

## Container Memory Budget

| Container | Limit | Purpose |
|-----------|-------|---------|
| db | 512 MB | PostgreSQL 15 |
| redis | 128 MB | Task broker + cache |
| api | 512 MB | FastAPI (Uvicorn) |
| celery-worker | 512 MB | Background tasks |
| celery-beat | 256 MB | Task scheduler |
| flower | 256 MB | Celery monitoring |
| **App subtotal** | **2,176 MB** | |
| prometheus | 256 MB | Metrics (15-day retention) |
| grafana | 192 MB | Dashboards |
| node-exporter | 64 MB | Host metrics |
| cadvisor | 128 MB | Container metrics |
| alertmanager | 32 MB | Alert routing |
| **Monitoring subtotal** | **672 MB** | |
| **Total containers** | **2,848 MB** | |
| OS + Caddy + Gitea + CI | ~1,150 MB | Remaining headroom |
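The subtotals above can be sanity-checked with a few lines of Python (limits copied from the table; useful to re-run whenever a limit changes in `docker-compose.yml`):

```python
# Per-container memory limits in MB, copied from the budget table above.
app = {"db": 512, "redis": 128, "api": 512,
       "celery-worker": 512, "celery-beat": 256, "flower": 256}
monitoring = {"prometheus": 256, "grafana": 192, "node-exporter": 64,
              "cadvisor": 128, "alertmanager": 32}

app_total = sum(app.values())               # 2,176 MB
monitoring_total = sum(monitoring.values())  # 672 MB
total = app_total + monitoring_total         # 2,848 MB
```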

## Key Metrics to Watch

Monitor these in Grafana (or via curl to Prometheus query API).
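For ad-hoc checks outside Grafana, the queries below can be run against the Prometheus instant-query HTTP API. A minimal stdlib-only sketch, assuming Prometheus is reachable at `localhost:9090` (the default port; adjust for your setup):

```python
import json
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:9090"  # assumption: default Prometheus address

def prom_query_url(expr: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{PROM_URL}/api/v1/query?" + urllib.parse.urlencode({"query": expr})

def prom_query(expr: str):
    """Run an instant query; requires a live Prometheus to actually call."""
    with urllib.request.urlopen(prom_query_url(expr)) as resp:
        return json.load(resp)["data"]["result"]

url = prom_query_url(
    "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100"
)
```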

### RAM Usage

```promql
# Host memory usage percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Per-container memory usage
container_memory_usage_bytes{name=~"orion.*"} / 1024 / 1024
```

Threshold: Alert at >85% host RAM. Scale at sustained >80%.

### CPU Usage

```promql
# Host CPU usage (5-minute average)
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Per-container CPU
rate(container_cpu_usage_seconds_total{name=~"orion.*"}[5m]) * 100
```

Threshold: Alert at >80% for 5 minutes. Scale at sustained >70%.

### Disk Usage

```promql
# Disk usage percentage
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100
```

Threshold: Alert at >80%. Critical at >90%. Scale disk or clean up.

### API Latency

```promql
# P95 response time (if using prometheus_client histograms)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```

Threshold: Alert at P95 >2s. Investigate at P95 >1s.

### Database Connections

```promql
# Active PostgreSQL connections (requires pg_stat_activity export)
pg_stat_activity_count
```

Threshold: Default pool is 10 + 20 overflow = 30 max. Alert at >20 active.
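The pool arithmetic can be wrapped in a small helper, e.g. for a health-check script. The defaults below are the pool sizes quoted above (SQLAlchemy-style `pool_size` + `max_overflow`); adjust them to your actual engine configuration:

```python
def pool_status(active: int, pool_size: int = 10, max_overflow: int = 20,
                alert_at: int = 20) -> dict:
    """Headroom check for a DB connection pool using the defaults quoted above."""
    capacity = pool_size + max_overflow  # 30 with the defaults
    return {
        "capacity": capacity,
        "alert": active > alert_at,       # guide's alert threshold
        "saturated": active >= capacity,  # new connections will queue or fail
    }
```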

### Redis Memory

```promql
# Redis used memory
redis_memory_used_bytes
```

Threshold: Alert at >100 MB (of 128 MB limit). Scale Redis limit or add eviction policy.


## When to Scale

```text
Is RAM consistently >80%?
├── YES → Upgrade server (CAX11 → CAX21)
└── NO
    Is API P95 latency >2s?
    ├── YES → Is it DB queries?
    │   ├── YES → Add PgBouncer or increase pool size
    │   └── NO → Add Uvicorn workers or upgrade CPU
    └── NO
        Is disk >80%?
        ├── YES → Clean logs/backups or upgrade disk
        └── NO
            Are Celery tasks queuing >100 for >10min?
            ├── YES → Add celery-worker replicas
            └── NO → No scaling needed
```
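The decision tree can also be expressed as a plain function, handy for a cron-driven check. The thresholds are this guide's; the function itself is an illustrative sketch, not shipped code:

```python
def scaling_action(ram_pct: float, p95_seconds: float, latency_is_db_bound: bool,
                   disk_pct: float, queued_tasks: int, queue_minutes: float) -> str:
    """Walk the scaling decision tree above and return the recommended action."""
    if ram_pct > 80:
        return "upgrade server (CAX11 -> CAX21)"
    if p95_seconds > 2:
        if latency_is_db_bound:
            return "add PgBouncer or increase pool size"
        return "add Uvicorn workers or upgrade CPU"
    if disk_pct > 80:
        return "clean logs/backups or upgrade disk"
    if queued_tasks > 100 and queue_minutes > 10:
        return "add celery-worker replicas"
    return "no scaling needed"
```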

## Scaling Actions

### 1. Server Upgrade (Vertical Scaling)

The fastest path. Hetzner allows live upgrades with a ~2 minute restart.

```text
# In Hetzner Cloud Console:
# Servers > your server > Rescale > select new plan > Rescale
```

After rescale, update memory limits in docker-compose.yml to use the additional RAM, then restart:

```bash
cd ~/apps/orion
docker compose --profile full up -d
```

### 2. Add PgBouncer (Connection Pooling)

When database connections become a bottleneck (>20 active connections):

```yaml
# Add to docker-compose.yml
pgbouncer:
  image: edoburu/pgbouncer:latest
  restart: always
  environment:
    DATABASE_URL: postgresql://orion_user:secure_password@db:5432/orion_db
    POOL_MODE: transaction
    MAX_CLIENT_CONN: 100
    DEFAULT_POOL_SIZE: 20
  mem_limit: 64m
  networks:
    - backend
```

Update `DATABASE_URL` in the API and Celery services to point to PgBouncer instead of `db` directly.

### 3. Redis Hardening

Set a maxmemory policy to prevent OOM:

```yaml
# In docker-compose.yml, add command to redis service
redis:
  command: redis-server --maxmemory 100mb --maxmemory-policy allkeys-lru
```

### 4. Separate Database Server

When the database needs its own resources (typically >50 stores):

  1. Create a new Hetzner server (CAX11 or CAX21) for PostgreSQL
  2. Move the db service to the new server
  3. Update DATABASE_URL to point to the DB server's IP
  4. Set up pg_hba.conf to allow connections from the app server
  5. Keep Redis on the app server (latency-sensitive)

### 5. Multi-Worker API

Scale Uvicorn workers for higher request throughput:

```yaml
# In docker-compose.yml, update api command
api:
  command: uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4
```

Rule of thumb: workers = 2 * CPU cores + 1. On CAX21 (4 vCPU): 9 workers max, but start with 4.
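As a one-liner for the rule of thumb (an upper bound, not a target — start with 4 and grow with observed load):

```python
def uvicorn_workers(vcpus: int) -> int:
    """Rule-of-thumb Uvicorn worker ceiling: 2 * cores + 1."""
    return 2 * vcpus + 1
```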

### 6. Celery Worker Replicas

For heavy background task loads, scale horizontally:

```bash
docker compose --profile full up -d --scale celery-worker=3
```

Each replica adds ~512 MB RAM. Ensure the server has headroom.
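A quick headroom check before scaling, using the numbers from the budget table above (the non-worker container total is 2,848 − 512 = 2,336 MB; the ~1,150 MB OS/Caddy/Gitea/CI reserve is this guide's estimate):

```python
def worker_replicas_fit(replicas: int, host_ram_mb: int = 4096,
                        per_worker_mb: int = 512,
                        other_containers_mb: int = 2336,  # budget total minus one worker
                        os_reserve_mb: int = 1150) -> bool:
    """True if N celery-worker replicas fit inside the host RAM budget."""
    needed = other_containers_mb + replicas * per_worker_mb + os_reserve_mb
    return needed <= host_ram_mb
```

On a 4 GB CAX11 only the single default worker fits; three replicas need a larger plan.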


## Hetzner ARM (CAX) Pricing

All prices are monthly, excl. VAT. ARM servers offer the best price/performance for Docker workloads.

| Plan | vCPU | RAM | Disk | Price | Suitable For |
|------|------|-----|------|-------|--------------|
| CAX11 | 2 | 4 GB | 40 GB | ~4.50 EUR | 1 client, up to 24 stores |
| CAX21 | 4 | 8 GB | 80 GB | ~7.50 EUR | 2-3 clients, up to 75 stores |
| CAX31 | 8 | 16 GB | 160 GB | ~14.50 EUR | 5-10 clients, up to 200 stores |
| CAX41 | 16 | 32 GB | 320 GB | ~27.50 EUR | 10-25 clients, up to 500 stores |

!!! tip "Upgrade path"
    Hetzner allows upgrading to a larger plan with a ~2 minute restart. No data migration needed. Always upgrade vertically first before adding horizontal complexity.


## Timeline

### Launch (Now)

- Server: CAX11 (4 GB)
- Clients: 1
- Stores: up to 24
- Actions: Memory limits set, monitoring active, alerts configured

### Early Growth (1-3 months)

- Monitor: RAM usage, API latency, disk growth
- Trigger: RAM consistently >80% or disk >70%
- Action: Upgrade to CAX21 (8 GB, ~7.50 EUR/mo)
- Increase: memory limits for db (1 GB), api (1 GB), celery-worker (1 GB)

### Growth (3-6 months)

- Trigger: 3+ clients, >75 stores, or DB queries slowing down
- Actions:
    - Add PgBouncer for connection pooling
    - Increase Uvicorn workers to 4
    - Consider Redis maxmemory policy
- Server: CAX21 or CAX31 depending on load

### Scale (6-12 months)

- Trigger: 10+ clients, >200 stores
- Actions:
    - Separate database to its own server
    - Scale Celery workers (2-3 replicas)
    - Upgrade app server to CAX31
    - Consider CDN for static assets

### Enterprise (12+ months)

- Trigger: 25+ clients, >500 stores, SLA requirements
- Actions:
    - Multi-server architecture (app, DB, Redis, workers)
    - PostgreSQL read replicas
    - Redis Sentinel for HA
    - Load balancer for API
    - Consider Kubernetes if operational complexity is justified