chore(ops): prune build cache in deploy.sh + document rescale & disk maintenance

deploy.sh already pruned old images but never build cache, the larger half of disk creep from CI rebuilds (root fs hit 83% on prod). Add `docker builder prune --filter until=168h` alongside the existing image prune so cleanup happens every deploy, version-controlled, no host cron.

Docs (hetzner-server-setup.md, Maintenance section):
- New "Rescaling / Upgrading the Server": when/why, same-arch (Arm/CAX) + CPU-RAM-only vs irreversible disk-expand constraints, poweroff→rescale→power-on→verify steps, and the Arm-capacity-unavailable-in-DC caveat.
- New "Disk Maintenance (Docker Pruning)": emergency manual prune + the automated deploy.sh approach.
- Fixed stale Resource Budget: cadvisor 128→192 MB (matches compose), total 672→736 MB, and "live-upgrade" wording (rescale needs a power-off).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 23:36:53 +02:00
parent ac7850b880
commit 64c8a0ec2c
2 changed files with 131 additions and 5 deletions
--- a/docs/deployment/hetzner-server-setup.md
+++ b/docs/deployment/hetzner-server-setup.md
@@ -1513,11 +1513,11 @@ Prometheus + Grafana monitoring stack with host and container metrics.
 | prometheus | 256 MB | Metrics storage (15-day retention, 2 GB max) |
 | grafana | 192 MB | Dashboards (SQLite backend) |
 | node-exporter | 64 MB | Host CPU/RAM/disk metrics |
-| cadvisor | 128 MB | Per-container resource metrics |
+| cadvisor | 192 MB | Per-container resource metrics |
 | redis-exporter | 32 MB | Redis memory, connections, command stats |
-| **Total new** | **672 MB** | |
+| **Total new** | **736 MB** | |
-Existing stack ~1.8 GB + 672 MB new = ~2.5 GB. Leaves ~1.6 GB for OS. If too tight, live-upgrade to CAX21 (8 GB/80 GB, ~7.50 EUR/mo) via **Cloud Console > Server > Rescale** (~2 min restart).
+Existing stack ~1.8 GB + 736 MB new = ~2.5 GB. Leaves ~1.5 GB for OS. If too tight, rescale to CAX21 (4 vCPU / 8 GB, ~7.99 EUR/mo) — note this requires a brief **power-off** (it is not a live resize); see [Rescaling / Upgrading the Server](#rescaling-upgrading-the-server-cpu-ram) for the full procedure and the Arm-capacity caveat.
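The corrected total can be re-derived from the five rows visible in this hunk. A minimal sketch in shell arithmetic, assuming those rows are the complete table (anything above the hunk is not shown here); the variable and function names are illustrative only:

```bash
#!/usr/bin/env bash
# Memory limits in MB, copied from the table rows shown in the hunk.
# Assumption: these five rows are the whole table.
prometheus=256
grafana=192
node_exporter=64
cadvisor=192        # was 128 before this commit
redis_exporter=32

# Prints the sum of the per-container limits in MB.
budget_total_mb() {
  echo $(( prometheus + grafana + node_exporter + cadvisor + redis_exporter ))
}

budget_total_mb   # prints 736
```

Added to the existing ~1.8 GB stack, this gives roughly 2.5 GB, which is consistent with the ~1.5 GB left for the OS on a 4 GB box.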
 ### 18.1 DNS Record
@@ -3021,6 +3021,129 @@ cd ~/apps/orion && bash scripts/deploy.sh
 The script handles stashing local changes, pulling, rebuilding containers, running migrations, and health checks.
 ### Rescaling / Upgrading the Server (CPU & RAM)
 When the box runs short on RAM/CPU, rescale it to a larger plan. The disk is
 handled separately (see [Disk Maintenance](#disk-maintenance-docker-pruning))
 — rescaling for CPU/RAM does **not** grow the disk, and it usually doesn't need
 to.
 **When to rescale (the signals that justify it):**
 - `free -h` shows swap heavily/fully used and almost no free RAM.
 - Container memory limits are routinely maxed (`orion-alertmanager-1`,
  `orion-cadvisor-1` near 100% of their `mem_limit` — visible in
  `docker stats --no-stream`).
 - `HostHighCpuUsage` fires repeatedly during CI bursts (Gitea Actions builds
  eat ~half the host CPU on a 2-vCPU box).
 **Two important constraints (shown on the Rescale screen):**
 1. **Same architecture only.** Our servers are **Arm64 / Ampere (CAX series)**.
   You can rescale only between CAX plans (CAX11 → CAX21 → CAX31 → CAX41). All
   the x86 plans (CX / CPX / CCX) will be **greyed out** — that's expected, not
   an error.
 2. **"CPU and RAM only" vs "expand disk".** Always choose **CPU and RAM only**
   unless you specifically need more disk. Expanding the disk is
   **irreversible** — you can never downgrade to a plan with a smaller disk
   afterwards. CPU+RAM-only rescales stay reversible.
 **Procedure:**
 ```bash
 # 1. On the server: shut down cleanly so Docker can stop the containers gracefully.
 #    (Do NOT force-off from the console; let the OS shut down properly.)
 sudo poweroff          # your SSH session will drop (expected)
 ```
 ```text
 # 2. In console.hetzner.cloud → Servers → ubuntu-4gb-nbg1-1 → Rescale:
 #    - wait until the server shows as "off"
 #    - select the target CAX plan (e.g. CAX21: 4 vCPU / 8 GB)
 #    - leave the option on "CPU and RAM only"
 #    - confirm the rescale (takes ~2 min)
 # 3. Power the server back on.
 ```
 ```bash
 # 4. After it boots (~1-2 min), SSH back in and verify:
 nproc                                                  # vCPU count matches new plan
 free -h                                                # new RAM total; swap no longer maxed
 docker compose -f docker-compose.yml --profile full ps # all containers healthy
 docker stats --no-stream                               # real headroom now
 ```
 Containers auto-start on boot via their restart policies, so you normally don't
 need to bring anything up manually — just confirm they came back healthy and the
 public sites return 200 (`curl -sI https://rewardflow.lu`).
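The verification step could also be scripted. A minimal sketch, assuming hypothetical helpers named `check_vcpus` and `mem_total_mib` (neither exists in the repo) and a Linux host that exposes `/proc/meminfo`; the expected count of 4 is the CAX21 figure quoted above:

```bash
#!/usr/bin/env bash
# Hypothetical post-rescale check; not part of the repo.

# Succeeds when the online vCPU count equals the expected plan size.
check_vcpus() {
  local expected="$1"
  [ "$(nproc)" -eq "$expected" ]
}

# Prints total RAM in MiB as the kernel reports it
# (usually slightly below the plan's nominal size).
mem_total_mib() {
  awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo
}

# Example, after rescaling to CAX21 (4 vCPU / 8 GB):
#   check_vcpus 4 && echo "vCPUs OK, RAM: $(mem_total_mib) MiB"
```

The RAM figure is printed rather than asserted because the kernel reserves some memory, so an exact comparison against the plan size would fail.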
 !!! warning "Arm capacity can be unavailable in-datacenter"
    Hetzner's Arm/Ampere (CAX) stock is limited and fluctuates per location. If
    the target CAX plans are greyed out with a tooltip like *"not available,
    please choose another location or type"*, it means there's **no Arm capacity
    in nbg1 (Nuremberg) right now** — nothing is wrong on your end. **Power the
    server back on immediately** (don't leave production down waiting) and
    **retry the rescale later** — capacity comes and goes over hours. Rescaling
    cannot change the datacenter; if nbg1 stays dry for days, the only
    alternative is a snapshot → new server in another DC → restore (a full
    migration — avoid unless truly necessary).
 **Post-rescale follow-ups:** once on a bigger box, bump the monitoring
 containers' memory limits in `docker-compose.yml` to give them headroom (commit
 to the repo and `git pull` on the server — don't hand-edit, since the file is
 git-tracked):
 - `cadvisor` `mem_limit: 192m` → `256m` (chronic flapping/restarts at 192m
  cause `TargetDown` floods)
 - `alertmanager` `mem_limit: 32m` → `64m`
 Then update the [Resource Budget](#resource-budget-4-gb-server) table to the new
 totals, and lift any silences placed during the pre-rescale alert flood (see
 Alertmanager UI at `http://localhost:9093/#/silences`).
 ### Disk Maintenance (Docker Pruning)
 The root filesystem fills over time because **Gitea CI rebuilds images on every
 run**, leaving behind dangling image layers and (more significantly) Docker
 **build cache**. Left unchecked this trips the `HostHighDiskUsage` alert
 (threshold 80%). On a typical incident the split was ~14 GB unused images +
 ~9 GB build cache reclaimable on a 38 GB disk.
 **Emergency manual prune** (run on the server when `HostHighDiskUsage` fires):
 ```bash
 df -h /                  # check current usage
 docker builder prune -f  # remove dangling build cache (the big one); add -a for all unused cache
 docker image prune -af   # remove all images not used by any container
 docker system df         # confirm what was reclaimed
 df -h /                  # verify usage dropped
 ```
 This is non-destructive: running containers, volumes, and the database are
 untouched. Pruned images are re-pulled/rebuilt on the next deploy.
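Before pruning by hand, it may help to confirm that the root filesystem is actually over the alert threshold. A minimal sketch, assuming hypothetical helpers `root_usage_pct` and `over_threshold` (not part of the repo); the default of 80 mirrors the `HostHighDiskUsage` threshold mentioned above:

```bash
#!/usr/bin/env bash
# Hypothetical helpers; not part of scripts/deploy.sh.

# Prints the root filesystem's used percentage as a bare integer (e.g. 83).
root_usage_pct() {
  df -P / | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeeds when usage (arg 1) is at or above the threshold (arg 2, default 80).
over_threshold() {
  [ "$1" -ge "${2:-80}" ]
}

# Example:
#   if over_threshold "$(root_usage_pct)"; then docker system df; fi
```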
 **The proper (automated) way:** `scripts/deploy.sh` already prunes old images at
 the end of every deploy:
 ```bash
 docker image prune -f --filter "until=72h"
 ```
 …but it does **not** prune build cache, which is the larger offender. To stop
 unbounded growth, add a build-cache prune alongside it in
 `scripts/deploy.sh` (keeps the last 7 days so CI stays fast):
 ```bash
 # in scripts/deploy.sh, step 6 ("Clean up old Docker images"):
 docker image prune -f --filter "until=72h" > /dev/null 2>&1 || true
 docker builder prune -f --filter "until=168h" > /dev/null 2>&1 || true   # <- add this
 ```
 Because cleanup then runs on every deploy and lives in version control, there's
 no host-level cron to remember. (A weekly `/etc/cron.weekly/docker-prune` is an
 alternative, but the deploy-script approach is preferred — it's
 version-controlled and scoped to this repo.)
 ### View logs
 ```bash
--- a/scripts/deploy.sh
+++ b/scripts/deploy.sh
@@ -88,9 +88,12 @@ if ! curl -sf "$HEALTH_URL" > /dev/null 2>&1; then
    exit 4
 fi
-# ── 6. Clean up old Docker images ───────────────────────────────────────────
+# ── 6. Clean up old Docker images + build cache ─────────────────────────────
-log "Pruning unused Docker images …"
+# Image prune alone leaves CI build cache to grow unbounded (the larger half of
+# disk creep). Prune both; keep the last week of cache so CI stays fast.
+log "Pruning unused Docker images and build cache …"
 docker image prune -f --filter "until=72h" > /dev/null 2>&1 || true
+docker builder prune -f --filter "until=168h" > /dev/null 2>&1 || true
 log "Deploy complete."
 exit 0
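
Both prune commands discard their output, so a deploy log does not show how much space each run frees. A minimal sketch of one way to report it, assuming hypothetical helpers `root_used_kb`, `reclaimed_mib`, and `prune_and_report` (none of them exist in `scripts/deploy.sh`), and using `echo` where the script has its own `log` function:

```bash
#!/usr/bin/env bash
# Hypothetical helpers; not part of scripts/deploy.sh.

# Prints the used space on the root filesystem in KiB.
root_used_kb() {
  df -Pk / | awk 'NR == 2 { print $3 }'
}

# Prints the difference between two KiB readings (before, after) in whole MiB.
reclaimed_mib() {
  echo $(( ($1 - $2) / 1024 ))
}

# Runs the same two prunes as step 6 and reports the space freed.
prune_and_report() {
  local before after
  before=$(root_used_kb)
  docker image prune -f --filter "until=72h" > /dev/null 2>&1 || true
  docker builder prune -f --filter "until=168h" > /dev/null 2>&1 || true
  after=$(root_used_kb)
  echo "Reclaimed ~$(reclaimed_mib "$before" "$after") MiB on /"
}

# Example: call prune_and_report in place of the two bare prune lines.
```

The figure is only approximate: anything else writing to the root filesystem during the prune shifts the reading, and it can even come out negative.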