feat(infra): add launch readiness quick wins
- Add mem_limit to all 6 app containers (db: 512m, redis: 128m, api: 512m, celery-worker: 512m, celery-beat: 128m, flower: 128m)
- Restrict Flower port to localhost (127.0.0.1:5555:5555)
- Add PostgreSQL and Redis health checks to /health/ready endpoint with individual check details (name, status, latency)
- Add scaling guide with metrics, thresholds, Hetzner pricing
- Add server verification script (12 infrastructure checks)
- Update hetzner-server-setup.md with progress and pending tasks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -132,6 +132,22 @@ Complete step-by-step guide for deploying Orion on a Hetzner Cloud VPS.
**Steps 1–24 fully deployed and operational.**

!!! success "Progress — 2026-02-16 (continued)"

    **Launch readiness — code changes:**

    - **Memory limits** added to all 6 app containers in `docker-compose.yml` (db: 512m, redis: 128m, api: 512m, celery-worker: 512m, celery-beat: 128m, flower: 128m)
    - **Flower port** restricted to localhost only (`127.0.0.1:5555:5555`) — access via Caddy reverse proxy
    - **Infrastructure health checks** — `/health/ready` now checks PostgreSQL (`SELECT 1`) and Redis (`ping`) with individual check details and latency
    - **Scaling guide** — practical playbook at `docs/deployment/scaling-guide.md` (metrics, thresholds, Hetzner pricing, timeline)
    - **Server verification script** — `scripts/verify-server.sh` checks all 12 infrastructure components

    **Pending server-side tasks:**

    - [ ] Deploy fail2ban Caddy auth jail (documented in Step 20; config ready but not yet applied)
    - [ ] Change Flower password from the default (`FLOWER_PASSWORD` in `.env`)
    - [ ] Verify unattended-upgrades is active (`sudo unattended-upgrades --dry-run`)
    - [ ] Run `scripts/verify-server.sh` on the server to validate all infrastructure

## Installed Software Versions
docs/deployment/scaling-guide.md (new file, 267 lines)

@@ -0,0 +1,267 @@

# Scaling Guide

Practical playbook for scaling Orion from a single CAX11 server to a multi-server architecture.

---
## Current Setup

| Component | Spec |
|-----------|------|
| Server | Hetzner CAX11 (ARM64) |
| vCPU | 2 |
| RAM | 4 GB |
| Disk | 40 GB SSD |
| Cost | ~4.50 EUR/mo |

### Container Memory Budget

| Container | Limit | Purpose |
|-----------|-------|---------|
| db | 512 MB | PostgreSQL 15 |
| redis | 128 MB | Task broker + cache |
| api | 512 MB | FastAPI (Uvicorn) |
| celery-worker | 512 MB | Background tasks |
| celery-beat | 128 MB | Task scheduler |
| flower | 128 MB | Celery monitoring |
| **App subtotal** | **1,920 MB** | |
| prometheus | 256 MB | Metrics (15-day retention) |
| grafana | 192 MB | Dashboards |
| node-exporter | 64 MB | Host metrics |
| cadvisor | 128 MB | Container metrics |
| alertmanager | 32 MB | Alert routing |
| **Monitoring subtotal** | **672 MB** | |
| **Total containers** | **2,592 MB** | |
| OS + Caddy + Gitea + CI | ~1,400 MB | Remaining headroom |
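The budget can be sanity-checked with plain shell arithmetic; the per-container limits below are taken from the table:

```shell
# Per-container limits from docker-compose.yml, in MB
app=$((512 + 128 + 512 + 512 + 128 + 128))       # db, redis, api, celery-worker, celery-beat, flower
monitoring=$((256 + 192 + 64 + 128 + 32))        # prometheus, grafana, node-exporter, cadvisor, alertmanager
total=$((app + monitoring))
echo "app=${app} MB, monitoring=${monitoring} MB, total=${total} MB"
# 4096 - 2592 = 1504 MB left for OS, Caddy, Gitea and CI on a 4 GB CAX11
```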
---

## Key Metrics to Watch

Monitor these in Grafana (or via `curl` to the Prometheus query API).
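Without Grafana, any of the queries below can be sent directly to the Prometheus HTTP API (`/api/v1/query`). A minimal sketch, assuming Prometheus on its default port 9090 and using `python3` only to URL-encode the PromQL expression:

```shell
PROM_URL="http://localhost:9090"   # adjust if Prometheus is exposed elsewhere

# Build an instant-query URL for the Prometheus HTTP API
build_query() {
  local encoded
  encoded=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1], safe=""))' "$1")
  printf '%s/api/v1/query?query=%s\n' "$PROM_URL" "$encoded"
}

build_query 'up'
# On the server, fetch a result with:
# curl -s "$(build_query '(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100')" | jq '.data.result'
```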
### RAM Usage

```promql
# Host memory usage percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Per-container memory usage (MB)
container_memory_usage_bytes{name=~"orion.*"} / 1024 / 1024
```

**Threshold**: Alert at >85% host RAM. Scale at sustained >80%.

### CPU Usage

```promql
# Host CPU usage (5-minute average)
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Per-container CPU
rate(container_cpu_usage_seconds_total{name=~"orion.*"}[5m]) * 100
```

**Threshold**: Alert at >80% for 5 minutes. Scale at sustained >70%.

### Disk Usage

```promql
# Disk usage percentage
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100
```

**Threshold**: Alert at >80%. Critical at >90%. Scale disk or clean up.
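As a sketch, the same 80% cutoff can be wired into a cron-able shell check (assumes GNU `df`):

```shell
# Current root-filesystem usage as a bare integer percentage (GNU df)
disk_pct() { df --output=pcent / | tail -n 1 | tr -dc '0-9'; }

# Succeeds at or below the 80% alert threshold, fails above it
check_disk() { [ "$1" -le 80 ]; }

if ! check_disk "$(disk_pct)"; then
  echo "disk above 80%: clean logs/backups or upgrade disk"
fi
```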
### API Latency

```promql
# P95 response time (if using prometheus_client histograms)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```

**Threshold**: Alert at P95 >2s. Investigate at P95 >1s.

### Database Connections

```promql
# Active PostgreSQL connections (requires a pg_stat_activity exporter)
pg_stat_activity_count
```

**Threshold**: Default pool is 10 + 20 overflow = 30 max. Alert at >20 active.

### Redis Memory

```promql
# Redis used memory
redis_memory_used_bytes
```

**Threshold**: Alert at >100 MB (of the 128 MB limit). Scale the Redis limit or add an eviction policy.

---
## When to Scale

```
Is RAM consistently >80%?
├── YES → Upgrade server (CAX11 → CAX21)
└── NO
    Is API P95 latency >2s?
    ├── YES → Is it DB queries?
    │   ├── YES → Add PgBouncer or increase pool size
    │   └── NO → Add Uvicorn workers or upgrade CPU
    └── NO
        Is disk >80%?
        ├── YES → Clean logs/backups or upgrade disk
        └── NO
            Are Celery tasks queuing >100 for >10min?
            ├── YES → Add celery-worker replicas
            └── NO → No scaling needed
```

---
## Scaling Actions

### 1. Server Upgrade (Vertical Scaling)

The fastest path. Hetzner allows live rescaling with a ~2-minute restart.

```bash
# In Hetzner Cloud Console:
# Servers > your server > Rescale > select new plan > Rescale
```

After the rescale, update the memory limits in `docker-compose.yml` to use the additional RAM, then restart:

```bash
cd ~/apps/orion
docker compose --profile full up -d
```
### 2. Add PgBouncer (Connection Pooling)

When database connections become a bottleneck (>20 active connections):

```yaml
# Add to docker-compose.yml
pgbouncer:
  image: edoburu/pgbouncer:latest
  restart: always
  environment:
    DATABASE_URL: postgresql://orion_user:secure_password@db:5432/orion_db
    POOL_MODE: transaction
    MAX_CLIENT_CONN: 100
    DEFAULT_POOL_SIZE: 20
  mem_limit: 64m
  networks:
    - backend
```

Update `DATABASE_URL` in the API and Celery services to point to PgBouncer instead of `db` directly.
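For example, a hypothetical `.env` change (the `pgbouncer` service name matches the compose snippet above; credentials are placeholders):

```bash
# Before: postgresql://orion_user:secure_password@db:5432/orion_db
DATABASE_URL=postgresql://orion_user:secure_password@pgbouncer:5432/orion_db
```

Then recreate the app containers (e.g. `docker compose up -d api celery-worker celery-beat`) so they pick up the new value.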
### 3. Redis Hardening

Set a `maxmemory` policy to prevent OOM:

```yaml
# In docker-compose.yml, add a command to the redis service
redis:
  command: redis-server --maxmemory 100mb --maxmemory-policy allkeys-lru
```
### 4. Separate Database Server

When the database needs its own resources (typically >50 stores):

1. Create a new Hetzner server (CAX11 or CAX21) for PostgreSQL
2. Move the `db` service to the new server
3. Update `DATABASE_URL` to point to the DB server's IP
4. Set up `pg_hba.conf` to allow connections from the app server
5. Keep Redis on the app server (latency-sensitive)
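Step 4 might look like the following sketch (hypothetical app-server IP `10.0.0.2`, e.g. on a Hetzner private network; the path shown is the Debian/Ubuntu package default):

```
# /etc/postgresql/15/main/pg_hba.conf on the DB server
# Allow the app server to connect to orion_db as orion_user
host    orion_db    orion_user    10.0.0.2/32    scram-sha-256
```

Reload PostgreSQL afterwards (`sudo systemctl reload postgresql`) and make sure `listen_addresses` in `postgresql.conf` covers the private interface, not just localhost.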
### 5. Multi-Worker API

Scale Uvicorn workers for higher request throughput:

```yaml
# In docker-compose.yml, update the api command
api:
  command: uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4
```

Rule of thumb: `workers = 2 * CPU cores + 1`. On a CAX21 (4 vCPU) that is 9 workers max, but start with 4.
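The rule of thumb as shell arithmetic (`nproc` reports the server's core count):

```shell
cores=$(nproc)
workers=$((2 * cores + 1))
echo "suggested uvicorn worker ceiling: ${workers}"
# On a CAX21 (4 vCPU): 2*4+1 = 9; start with --workers 4 and watch RAM first
```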
### 6. Celery Worker Replicas

For heavy background task loads, scale horizontally:

```bash
docker compose --profile full up -d --scale celery-worker=3
```

Each replica adds ~512 MB RAM. Ensure the server has headroom.
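Before scaling, a quick headroom estimate (512 MB per replica, per the memory-limit table):

```shell
replicas=3
per_worker_mb=512                                # mem_limit of one celery-worker
extra_mb=$(( (replicas - 1) * per_worker_mb ))   # RAM beyond the single default worker
echo "scaling to ${replicas} workers needs up to ${extra_mb} MB extra RAM"
```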
---

## Hetzner ARM (CAX) Pricing

All prices are monthly, excl. VAT. ARM servers offer the best price/performance for Docker workloads.

| Plan | vCPU | RAM | Disk | Price | Suitable For |
|------|------|-----|------|-------|--------------|
| CAX11 | 2 | 4 GB | 40 GB | ~4.50 EUR | 1 client, up to 24 stores |
| CAX21 | 4 | 8 GB | 80 GB | ~7.50 EUR | 2-3 clients, up to 75 stores |
| CAX31 | 8 | 16 GB | 160 GB | ~14.50 EUR | 5-10 clients, up to 200 stores |
| CAX41 | 16 | 32 GB | 320 GB | ~27.50 EUR | 10-25 clients, up to 500 stores |

!!! tip "Upgrade path"

    Hetzner allows upgrading to a larger plan with a ~2-minute restart. No data migration needed. Always upgrade vertically first before adding horizontal complexity.
---

## Timeline

### Launch (Now)

- **Server**: CAX11 (4 GB)
- **Clients**: 1
- **Stores**: up to 24
- **Actions**: Memory limits set, monitoring active, alerts configured

### Early Growth (1-3 months)

- **Monitor**: RAM usage, API latency, disk growth
- **Trigger**: RAM consistently >80% or disk >70%
- **Action**: Upgrade to CAX21 (8 GB, ~7.50 EUR/mo)
- **Increase**: memory limits for db (1 GB), api (1 GB), celery-worker (1 GB)

### Growth (3-6 months)

- **Trigger**: 3+ clients, >75 stores, or DB queries slowing down
- **Actions**:
    - Add PgBouncer for connection pooling
    - Increase Uvicorn workers to 4
    - Consider a Redis maxmemory policy
- **Server**: CAX21 or CAX31 depending on load

### Scale (6-12 months)

- **Trigger**: 10+ clients, >200 stores
- **Actions**:
    - Separate database to its own server
    - Scale Celery workers (2-3 replicas)
    - Upgrade app server to CAX31
    - Consider CDN for static assets

### Enterprise (12+ months)

- **Trigger**: 25+ clients, >500 stores, SLA requirements
- **Actions**:
    - Multi-server architecture (app, DB, Redis, workers)
    - PostgreSQL read replicas
    - Redis Sentinel for HA
    - Load balancer for API
    - Consider Kubernetes if operational complexity is justified