chore(ops): prune build cache in deploy.sh + document rescale & disk maintenance

deploy.sh already pruned old images but never the build cache, which is the
larger half of disk creep from CI rebuilds (root fs hit 83% on prod). Add
`docker builder prune -f --filter "until=168h"` alongside the existing image
prune so cleanup runs on every deploy and stays version-controlled, with no
host cron.

Docs (hetzner-server-setup.md, Maintenance section):
- New "Rescaling / Upgrading the Server" — when/why, same-arch (Arm/CAX) +
  CPU-RAM-only vs irreversible disk-expand constraints, poweroff→rescale→
  power-on→verify steps, and the Arm-capacity-unavailable-in-DC caveat.
- New "Disk Maintenance (Docker Pruning)" — emergency manual prune + the
  automated deploy.sh approach.
- Fixed stale Resource Budget: cadvisor 128→192 MB (matches compose),
  total 672→736 MB, and "live-upgrade" wording (rescale needs a power-off).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 23:36:53 +02:00
parent ac7850b880
commit 64c8a0ec2c
2 changed files with 131 additions and 5 deletions


@@ -1513,11 +1513,11 @@ Prometheus + Grafana monitoring stack with host and container metrics.
 | prometheus | 256 MB | Metrics storage (15-day retention, 2 GB max) |
 | grafana | 192 MB | Dashboards (SQLite backend) |
 | node-exporter | 64 MB | Host CPU/RAM/disk metrics |
-| cadvisor | 128 MB | Per-container resource metrics |
+| cadvisor | 192 MB | Per-container resource metrics |
 | redis-exporter | 32 MB | Redis memory, connections, command stats |
-| **Total new** | **672 MB** | |
+| **Total new** | **736 MB** | |
-Existing stack ~1.8 GB + 672 MB new = ~2.5 GB. Leaves ~1.6 GB for OS. If too tight, live-upgrade to CAX21 (8 GB/80 GB, ~7.50 EUR/mo) via **Cloud Console > Server > Rescale** (~2 min restart).
+Existing stack ~1.8 GB + 736 MB new = ~2.5 GB. Leaves ~1.5 GB for OS. If too tight, rescale to CAX21 (4 vCPU / 8 GB, ~7.99 EUR/mo). Note that this requires a brief **power-off** (it is not a live resize); see [Rescaling / Upgrading the Server](#rescaling-upgrading-the-server-cpu-ram) for the full procedure and the Arm-capacity caveat.
 ### 18.1 DNS Record
@@ -3021,6 +3021,129 @@ cd ~/apps/orion && bash scripts/deploy.sh
 The script handles stashing local changes, pulling, rebuilding containers, running migrations, and health checks.
### Rescaling / Upgrading the Server (CPU & RAM)
When the box runs short on RAM/CPU, rescale it to a larger plan. The disk is
handled separately (see [Disk Maintenance](#disk-maintenance-docker-pruning))
— rescaling for CPU/RAM does **not** grow the disk, and it usually doesn't need
to.
**When to rescale (the signals that justify it):**
- `free -h` shows swap heavily/fully used and almost no free RAM.
- Container memory limits are routinely maxed (`orion-alertmanager-1`,
`orion-cadvisor-1` near 100% of their `mem_limit` — visible in
`docker stats --no-stream`).
- `HostHighCpuUsage` fires repeatedly during CI bursts (Gitea Actions builds
eat ~half the host CPU on a 2-vCPU box).
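The swap signal in the list above can be checked mechanically. The following is a minimal sketch, assuming the standard Linux `/proc/meminfo` layout; the function name `swap_used_pct` is hypothetical and not part of the repo.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): print the percentage of swap in use
# as a whole number, reading /proc/meminfo-formatted text from stdin.
swap_used_pct() {
  awk '
    /^SwapTotal:/ { total = $2 }
    /^SwapFree:/  { free  = $2 }
    END {
      if (total == 0) { print 0; exit }   # no swap configured
      printf "%d\n", (total - free) * 100 / total
    }'
}
```

On the server it could be called as `swap_used_pct < /proc/meminfo`. Which percentage counts as "heavily used" is a judgment call; the source does not fix a number.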
**Two important constraints (shown on the Rescale screen):**
1. **Same architecture only.** Our servers are **Arm64 / Ampere (CAX series)**.
You can rescale only between CAX plans (CAX11 → CAX21 → CAX31 → CAX41). All
the x86 plans (CX / CPX / CCX) will be **greyed out** — that's expected, not
an error.
2. **"CPU and RAM only" vs "expand disk".** Always choose **CPU and RAM only**
unless you specifically need more disk. Expanding the disk is
**irreversible** — you can never downgrade to a plan with a smaller disk
afterwards. CPU+RAM-only rescales stay reversible.
**Procedure:**
```bash
# 1. On the server: shut the host down cleanly so Docker stops the containers gracefully.
# (Do NOT force-off from the console; let docker stop properly.)
sudo poweroff # your SSH session will drop (expected)
```
```text
# 2. In console.hetzner.cloud → Servers → ubuntu-4gb-nbg1-1 → Rescale:
# - wait until the server shows as "off"
# - select the target CAX plan (e.g. CAX21: 4 vCPU / 8 GB)
# - leave the option on "CPU and RAM only"
# - confirm the rescale (takes ~2 min)
# 3. Power the server back on.
```
```bash
# 4. After it boots (~1-2 min), SSH back in and verify:
nproc # vCPU count matches new plan
free -h # new RAM total; swap no longer maxed
docker compose -f docker-compose.yml --profile full ps # all containers healthy
docker stats --no-stream # real headroom now
```
Containers auto-start on boot via their restart policies, so you normally don't
need to bring anything up manually — just confirm they came back healthy and the
public sites return 200 (`curl -sI https://rewardflow.lu`).
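The `nproc` / `free -h` checks from step 4 can be folded into one pass/fail helper. This is only a sketch: `verify_plan` and its arguments are hypothetical names, and the RAM floor is passed in by the caller because the kernel reports slightly less than the plan's nominal memory.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): confirm the host matches the
# target plan after a rescale.
# Usage: verify_plan <expected_vcpus> <min_ram_mib> [actual_vcpus] [actual_ram_mib]
# The two optional arguments override the live readings (useful for a dry run).
verify_plan() {
  local want_cpu=$1 min_mib=$2
  local have_cpu=${3:-$(nproc)}
  local have_mib=${4:-$(awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo)}

  if [ "$have_cpu" -ne "$want_cpu" ]; then
    echo "FAIL: ${have_cpu} vCPU, expected ${want_cpu}"
    return 1
  fi
  if [ "$have_mib" -lt "$min_mib" ]; then
    echo "FAIL: ${have_mib} MiB RAM, expected at least ${min_mib}"
    return 1
  fi
  echo "OK: ${have_cpu} vCPU, ${have_mib} MiB RAM"
}
```

For a CAX21 target, a call might look like `verify_plan 4 7000`. The 7000 MiB floor is an assumed value, chosen to sit a little under the nominal 8 GB.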
!!! warning "Arm capacity can be unavailable in-datacenter"
Hetzner's Arm/Ampere (CAX) stock is limited and fluctuates per location. If
the target CAX plans are greyed out with a tooltip like *"not available,
please choose another location or type"*, it means there's **no Arm capacity
in nbg1 (Nuremberg) right now** — nothing is wrong on your end. **Power the
server back on immediately** (don't leave production down waiting) and
**retry the rescale later** — capacity comes and goes over hours. Rescaling
cannot change the datacenter; if nbg1 stays dry for days, the only
alternative is a snapshot → new server in another DC → restore (a full
migration — avoid unless truly necessary).
**Post-rescale follow-ups:** once on a bigger box, bump the monitoring
containers' memory limits in `docker-compose.yml` to give them headroom (commit
to the repo and `git pull` on the server — don't hand-edit, since the file is
git-tracked):
- `cadvisor` `mem_limit: 192m` → `256m` (chronic flapping/restarts at 192m
cause `TargetDown` floods)
- `alertmanager` `mem_limit: 32m` → `64m`
Then update the [Resource Budget](#resource-budget-4-gb-server) table to the new
totals, and lift any silences placed during the pre-rescale alert flood (see
Alertmanager UI at `http://localhost:9093/#/silences`).
### Disk Maintenance (Docker Pruning)
The root filesystem fills over time because **Gitea CI rebuilds images on every
run**, leaving behind dangling image layers and (more significantly) Docker
**build cache**. Left unchecked this trips the `HostHighDiskUsage` alert
(threshold 80%). In one typical incident, about 14 GB of unused images and
about 9 GB of build cache were reclaimable on a 38 GB disk.
**Emergency manual prune** (run on the server when `HostHighDiskUsage` fires):
```bash
df -h / # check current usage
docker builder prune -f # remove dangling build cache (the big one); add -a to remove all unused cache
docker image prune -af # remove all images not used by any container, running or stopped
docker system df # confirm what was reclaimed
df -h / # verify usage dropped
```
This is non-destructive: running containers, volumes, and the database are
untouched. Pruned images are re-pulled/rebuilt on the next deploy.
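The manual steps above can also be wrapped so they only run when the disk is actually full. This is a minimal sketch, not something the repo ships: `usage_pct`, `prune_if_full` and the `DOCKER` override are hypothetical names, and the default of 80 simply mirrors the `HostHighDiskUsage` threshold.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the usage of a mount point as a bare integer (no % sign).
usage_pct() {
  df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Prune only when root fs usage is at or above the threshold (default 80).
# DOCKER can be overridden, e.g. DOCKER=echo, to print the commands instead
# of running them.
prune_if_full() {
  local threshold=${1:-80} pct
  pct=$(usage_pct /)
  if [ "$pct" -ge "$threshold" ]; then
    "${DOCKER:-docker}" builder prune -f
    "${DOCKER:-docker}" image prune -af
  else
    echo "root fs at ${pct}%, below ${threshold}%: nothing to do"
  fi
}
```

Running `DOCKER=echo prune_if_full` first shows what would be executed without touching any images or cache.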
**The proper (automated) way:** `scripts/deploy.sh` prunes old images at
the end of every deploy:
```bash
docker image prune -f --filter "until=72h"
```
Image pruning alone leaves the build cache, which is the larger offender, to
grow unbounded, so `scripts/deploy.sh` prunes the build cache alongside it
(keeping the last 7 days so CI stays fast):
```bash
# in scripts/deploy.sh, step 6 ("Clean up old Docker images + build cache"):
docker image prune -f --filter "until=72h" > /dev/null 2>&1 || true
docker builder prune -f --filter "until=168h" > /dev/null 2>&1 || true
```
Because cleanup then runs on every deploy and lives in version control, there's
no host-level cron to remember. (A weekly `/etc/cron.weekly/docker-prune` is an
alternative, but the deploy-script approach is preferred — it's
version-controlled and scoped to this repo.)
### View logs ### View logs
```bash ```bash


@@ -88,9 +88,12 @@ if ! curl -sf "$HEALTH_URL" > /dev/null 2>&1; then
 exit 4
 fi
-# ── 6. Clean up old Docker images ───────────────────────────────────────────
-log "Pruning unused Docker images …"
+# ── 6. Clean up old Docker images + build cache ─────────────────────────────
+# Image prune alone leaves CI build cache to grow unbounded (the larger half of
+# disk creep). Prune both; keep the last week of cache so CI stays fast.
+log "Pruning unused Docker images and build cache …"
 docker image prune -f --filter "until=72h" > /dev/null 2>&1 || true
+docker builder prune -f --filter "until=168h" > /dev/null 2>&1 || true
 log "Deploy complete."
 exit 0