orion/docs/deployment/scaling-guide.md

# Scaling Guide

Practical playbook for scaling Orion from a single CAX11 server to a multi-server architecture.

---

## Current Setup

| Component | Spec |
|-----------|------|
| Server | Hetzner CAX11 (ARM64) |
| vCPU | 2 |
| RAM | 4 GB |
| Disk | 40 GB SSD |
| Cost | ~4.50 EUR/mo |

### Container Memory Budget

| Container | Limit | Purpose |
|-----------|-------|---------|
| db | 512 MB | PostgreSQL 15 |
| redis | 128 MB | Task broker + cache |
| api | 512 MB | FastAPI (Uvicorn) |
| celery-worker | 512 MB | Background tasks |
| celery-beat | 256 MB | Task scheduler |
| flower | 256 MB | Celery monitoring |
| **App subtotal** | **2,176 MB** | |
| prometheus | 256 MB | Metrics (15-day retention) |
| grafana | 192 MB | Dashboards |
| node-exporter | 64 MB | Host metrics |
| cadvisor | 128 MB | Container metrics |
| alertmanager | 32 MB | Alert routing |
| **Monitoring subtotal** | **672 MB** | |
| **Total containers** | **2,848 MB** | |
| OS + Caddy + Gitea + CI | ~1,150 MB | Remaining headroom |
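
The arithmetic behind the table can be checked with a short script. This is a minimal sketch: the limits are copied from the table, while the `budget` helper and the nominal 4,096 MB total are assumptions made for illustration. The nominal headroom it reports (1,248 MB) is somewhat higher than the ~1,150 MB listed above, presumably because not all of the 4 GB is available to userspace.

```python
# Container memory limits in MB, copied from the table above.
APP_LIMITS_MB = {
    "db": 512,
    "redis": 128,
    "api": 512,
    "celery-worker": 512,
    "celery-beat": 256,
    "flower": 256,
}
MONITORING_LIMITS_MB = {
    "prometheus": 256,
    "grafana": 192,
    "node-exporter": 64,
    "cadvisor": 128,
    "alertmanager": 32,
}


def budget(total_ram_mb: int) -> dict:
    """Sum the container limits and report what is left for the host.

    `total_ram_mb` is the nominal server RAM (assumption: 4096 for a CAX11).
    """
    app = sum(APP_LIMITS_MB.values())
    monitoring = sum(MONITORING_LIMITS_MB.values())
    containers = app + monitoring
    return {
        "app": app,
        "monitoring": monitoring,
        "containers": containers,
        "headroom": total_ram_mb - containers,
    }


if __name__ == "__main__":
    for name, value in budget(4096).items():
        print(f"{name:>12}: {value} MB")
```

Re-running the helper with a different total (for example 8192 for a CAX21) shows how much room a rescale frees up before any limits are raised.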

---

## Key Metrics to Watch

Monitor these in Grafana, or query them directly with `curl` against the Prometheus HTTP API.

### RAM Usage

```promql
# Host memory usage percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Per-container memory usage in MiB (includes page cache; container_memory_working_set_bytes excludes inactive file cache)
container_memory_usage_bytes{name=~"orion.*"} / 1024 / 1024
```

**Threshold**: Alert at >85% host RAM. Scale at sustained >80%.

### CPU Usage

```promql
# Host CPU usage (5-minute average)
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Per-container CPU as a percentage of one core (values above 100 mean more than one core)
rate(container_cpu_usage_seconds_total{name=~"orion.*"}[5m]) * 100
```

**Threshold**: Alert at >80% for 5 minutes. Scale at sustained >70%.

### Disk Usage

```promql
# Disk usage percentage
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100
```

**Threshold**: Alert at >80%. Critical at >90%. Scale disk or clean up.

### API Latency

```promql
# P95 response time across all series (assumes the API exports a prometheus_client histogram with this name)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```

**Threshold**: Alert at P95 >2s. Investigate at P95 >1s.

### Database Connections

```promql
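# PostgreSQL connections in all states (requires postgres_exporter, which exposes pg_stat_activity);
# add {state="active"} to count only connections that are running a query
sum(pg_stat_activity_count)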
```

**Threshold**: Default pool is 10 + 20 overflow = 30 max. Alert at >20 active.

### Redis Memory

```promql
# Redis used memory in bytes (requires redis_exporter)
redis_memory_used_bytes
```

**Threshold**: Alert at >100 MB (of the 128 MB limit). Raise the Redis limit or set a `maxmemory` eviction policy.
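
The queries in this section can also be run from a script instead of `curl`. The sketch below uses only the Python standard library against the Prometheus instant-query endpoint (`/api/v1/query`). The base URL `http://localhost:9090` and the helper names (`build_query_url`, `first_value`, `query`, `classify`) are assumptions for illustration, not part of the Orion codebase.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# Assumption: Prometheus is reachable on its default port from where this runs.
PROMETHEUS_URL = "http://localhost:9090"

# Host RAM usage query, copied from the RAM Usage section above.
RAM_EXPR = "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100"


def build_query_url(expr: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build an instant-query URL with the PromQL expression URL-encoded."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def first_value(payload: dict) -> Optional[float]:
    """Return the first sample of an instant-vector response, or None if empty."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if not result:
        return None
    # Each sample is [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])


def query(expr: str, base_url: str = PROMETHEUS_URL) -> Optional[float]:
    """Run an instant query against Prometheus and return the first value."""
    with urllib.request.urlopen(build_query_url(expr, base_url), timeout=10) as resp:
        return first_value(json.load(resp))


def classify(value: float, warn: float, alert: float) -> str:
    """Map a metric value onto the two-level thresholds used in this guide."""
    if value > alert:
        return "alert"
    if value > warn:
        return "warn"
    return "ok"
```

For host RAM, `classify(query(RAM_EXPR), warn=80, alert=85)` mirrors the thresholds stated above; the other metrics follow the same pattern with their own limits.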

---

## When to Scale

```
Is RAM consistently >80%?
├── YES → Upgrade server (CAX11 → CAX21)
└── NO
    Is API P95 latency >2s?
    ├── YES → Is it DB queries?
    │   ├── YES → Add PgBouncer or increase pool size
    │   └── NO → Add Uvicorn workers or upgrade CPU
    └── NO
        Is disk >80%?
        ├── YES → Clean logs/backups or upgrade disk
        └── NO
            Are Celery tasks queuing >100 for >10min?
            ├── YES → Add celery-worker replicas
            └── NO → No scaling needed
```
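
The same decision tree can be written as a small function, which is handy for a cron job or a runbook script. This is a sketch only: the `Snapshot` fields and the `next_action` name are hypothetical, and whether slow requests are database-bound (`db_bound`) still has to be judged by a person or a separate check.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical container for the metrics the tree looks at."""
    ram_pct: float               # sustained host RAM usage, percent
    api_p95_s: float             # API P95 latency, seconds
    db_bound: bool               # True if slow requests trace back to DB queries
    disk_pct: float              # root filesystem usage, percent
    celery_queue_len: int        # tasks currently waiting
    celery_queue_minutes: float  # how long the queue has stayed above 100


def next_action(s: Snapshot) -> str:
    """Walk the decision tree top to bottom and return the first match."""
    if s.ram_pct > 80:
        return "Upgrade server (CAX11 -> CAX21)"
    if s.api_p95_s > 2:
        if s.db_bound:
            return "Add PgBouncer or increase pool size"
        return "Add Uvicorn workers or upgrade CPU"
    if s.disk_pct > 80:
        return "Clean logs/backups or upgrade disk"
    if s.celery_queue_len > 100 and s.celery_queue_minutes > 10:
        return "Add celery-worker replicas"
    return "No scaling needed"
```

The order of the checks matters: RAM pressure is handled first because a server upgrade often resolves latency and queueing symptoms as a side effect.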

---

## Scaling Actions

### 1. Server Upgrade (Vertical Scaling)

The fastest path. Hetzner rescales a server in place, but the server must be powered off first, so plan for roughly 2 minutes of downtime.

```bash
# In Hetzner Cloud Console:
# Servers > your server > Rescale > select new plan > Rescale
```

After rescale, update memory limits in `docker-compose.yml` to use the additional RAM, then restart:

```bash
cd ~/apps/orion
docker compose --profile full up -d
```

### 2. Add PgBouncer (Connection Pooling)

When database connections become a bottleneck (>20 active connections):

```yaml
# Add under services: in docker-compose.yml (the credentials below are placeholders)
pgbouncer:
  image: edoburu/pgbouncer:latest
  restart: always
  environment:
    DATABASE_URL: postgresql://orion_user:secure_password@db:5432/orion_db
    POOL_MODE: transaction
    MAX_CLIENT_CONN: 100
    DEFAULT_POOL_SIZE: 20
  mem_limit: 64m
  networks:
    - backend
```

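Update `DATABASE_URL` in the `api` and Celery services to point to `pgbouncer` instead of `db`. In `transaction` pool mode, session-level features such as session advisory locks, `SET` without `LOCAL`, and `LISTEN` do not carry across transactions, so check that the application does not rely on them.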
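
Repointing the URL only changes the host (and possibly the port); credentials and database name stay the same. A minimal sketch of that rewrite with the standard library follows. The `via_pgbouncer` helper is hypothetical, and the port is an assumption: confirm which port the PgBouncer container actually listens on before using it.

```python
from urllib.parse import urlsplit, urlunsplit


def via_pgbouncer(database_url: str, host: str = "pgbouncer", port: int = 5432) -> str:
    """Return database_url with its host and port replaced.

    Assumption: `port` matches the PgBouncer listen port in your setup.
    """
    parts = urlsplit(database_url)
    # Keep "user:password" (if present) and swap everything after the last "@".
    userinfo, _, _ = parts.netloc.rpartition("@")
    netloc = f"{userinfo}@{host}:{port}" if userinfo else f"{host}:{port}"
    return urlunsplit(parts._replace(netloc=netloc))


if __name__ == "__main__":
    # Placeholder credentials, as in the compose snippet above.
    print(via_pgbouncer("postgresql://orion_user:secure_password@db:5432/orion_db"))
```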

### 3. Redis Hardening

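Set `maxmemory` and an eviction policy so Redis reacts before the container hits its 128 MB limit. Because this Redis instance is also the Celery broker, prefer `volatile-lru`, which only evicts keys that have a TTL; `allkeys-lru` can evict queued tasks under memory pressure:

```yaml
# In docker-compose.yml, add command to redis service
redis:
  command: redis-server --maxmemory 100mb --maxmemory-policy volatile-lru
```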

### 4. Separate Database Server

When the database needs its own resources (typically beyond 200 stores, in line with the Timeline section):

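1. Create a new Hetzner server (CAX11 or CAX21) for PostgreSQL
2. Move the `db` service to the new server and migrate the data (for example with `pg_dump` and `pg_restore`)
3. Update `DATABASE_URL` on the app server to point to the DB server's IP, preferably a private-network address
4. Configure `listen_addresses` and `pg_hba.conf` so PostgreSQL accepts connections from the app server only
5. Keep Redis on the app server (latency-sensitive)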

### 5. Multi-Worker API

Scale Uvicorn workers for higher request throughput:

```yaml
# In docker-compose.yml, update api command
api:
  command: uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4
```

Rule of thumb: `workers = 2 * CPU cores + 1`. On CAX21 (4 vCPU): 9 workers max, but start with 4.
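
The rule of thumb gives a CPU-based upper bound, but every Uvicorn worker is a separate process inside the same `api` container, so the memory limit caps the count as well. A sketch of both constraints is below; `workers_for` is a hypothetical helper and the 200 MB per-worker figure is purely illustrative, so measure a real worker before relying on it.

```python
def max_workers(cpu_cores: int) -> int:
    """Upper bound from the rule of thumb: 2 * cores + 1."""
    return 2 * cpu_cores + 1


def workers_for(cpu_cores: int, limit_mb: int, per_worker_mb: int) -> int:
    """Cap the CPU-based bound by what fits in the container memory limit.

    `per_worker_mb` is an assumption; measure it for your own application.
    """
    by_memory = limit_mb // per_worker_mb
    return max(1, min(max_workers(cpu_cores), by_memory))


if __name__ == "__main__":
    print(max_workers(4))             # CPU-based ceiling on a CAX21
    print(workers_for(4, 1024, 200))  # same server, 1 GB api limit
```

With these illustrative numbers the memory limit, not the CPU count, is the binding constraint, which is why starting at 4 workers and raising the `api` limit alongside is a reasonable approach.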

### 6. Celery Worker Replicas

For heavy background task loads, scale horizontally:

```bash
docker compose --profile full up -d --scale celery-worker=3
```

Each replica can use up to 512 MB (the `celery-worker` limit). Check that the server has that much headroom per replica before scaling.
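
A quick way to check headroom is to divide the uncommitted RAM by the per-replica limit. The sketch below assumes nominal RAM totals (4,096 MB and 8,192 MB) and treats the 2,848 MB of container limits plus the ~1,150 MB host allowance from the memory budget as already committed; `max_extra_replicas` is a hypothetical helper.

```python
def max_extra_replicas(total_ram_mb: int, committed_mb: int, per_replica_mb: int = 512) -> int:
    """How many additional celery-worker replicas fit in the uncommitted RAM."""
    free_mb = total_ram_mb - committed_mb
    return max(free_mb // per_replica_mb, 0)


# Figures from the Container Memory Budget table.
COMMITTED_MB = 2848 + 1150  # container limits + OS/Caddy/Gitea/CI allowance

if __name__ == "__main__":
    print("CAX11:", max_extra_replicas(4096, COMMITTED_MB))
    print("CAX21:", max_extra_replicas(8192, COMMITTED_MB))
```

On these numbers a CAX11 has no room for extra replicas at the current limits, so `--scale celery-worker=3` realistically requires a CAX21 or larger.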

---

## Hetzner ARM (CAX) Pricing

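All prices are monthly, excl. VAT, and approximate; check Hetzner's pricing page for current figures. ARM plans are usually the cheapest per vCPU and GB of RAM, provided every image in the stack is published for ARM64.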

| Plan | vCPU | RAM | Disk | Price | Suitable For |
|------|------|-----|------|-------|-------------|
| CAX11 | 2 | 4 GB | 40 GB | ~4.50 EUR | 1 client, up to 24 stores |
| CAX21 | 4 | 8 GB | 80 GB | ~7.50 EUR | 2-3 clients, up to 75 stores |
| CAX31 | 8 | 16 GB | 160 GB | ~14.50 EUR | 5-10 clients, up to 200 stores |
| CAX41 | 16 | 32 GB | 320 GB | ~27.50 EUR | 10-25 clients, up to 500 stores |

!!! tip "Upgrade path"
    Hetzner allows upgrading to a larger plan with a ~2 minute restart and no data migration. Upgrade vertically before adding horizontal complexity.
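
The "Suitable For" column can be turned into a simple lookup for capacity planning. This is a sketch built only from the store counts in the table; the `smallest_plan` helper is hypothetical, and real sizing should be confirmed against the metrics described earlier.

```python
from typing import Optional

# (plan, vCPU, RAM in GB, max stores), copied from the pricing table above.
PLANS = [
    ("CAX11", 2, 4, 24),
    ("CAX21", 4, 8, 75),
    ("CAX31", 8, 16, 200),
    ("CAX41", 16, 32, 500),
]


def smallest_plan(stores: int) -> Optional[str]:
    """Return the smallest plan whose store ceiling covers `stores`.

    Returns None when the count exceeds every single-server plan, which is
    the point where a multi-server architecture comes into play.
    """
    for name, _vcpu, _ram_gb, max_stores in PLANS:
        if stores <= max_stores:
            return name
    return None


if __name__ == "__main__":
    for count in (10, 60, 180, 450, 800):
        print(count, "stores ->", smallest_plan(count))
```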

---

## Timeline

### Launch (Now)

- **Server**: CAX11 (4 GB)
- **Clients**: 1
- **Stores**: up to 24
- **Actions**: Memory limits set, monitoring active, alerts configured

### Early Growth (1-3 months)

- **Monitor**: RAM usage, API latency, disk growth
- **Trigger**: RAM consistently >80% or disk >70%
- **Action**: Upgrade to CAX21 (8 GB, ~7.50 EUR/mo)
- **Increase**: memory limits for db (1 GB), api (1 GB), celery-worker (1 GB)

### Growth (3-6 months)

- **Trigger**: 3+ clients, >75 stores, or DB queries slowing down
- **Actions**:
    - Add PgBouncer for connection pooling
    - Increase Uvicorn workers to 4
    - Consider Redis maxmemory policy
- **Server**: CAX21 or CAX31 depending on load

### Scale (6-12 months)

- **Trigger**: 10+ clients, >200 stores
- **Actions**:
    - Separate database to its own server
    - Scale Celery workers (2-3 replicas)
    - Upgrade app server to CAX31
    - Consider CDN for static assets

### Enterprise (12+ months)

- **Trigger**: 25+ clients, >500 stores, SLA requirements
- **Actions**:
    - Multi-server architecture (app, DB, Redis, workers)
    - PostgreSQL read replicas
    - Redis Sentinel for HA
    - Load balancer for API
    - Consider Kubernetes if operational complexity is justified