chore(ops): prune build cache in deploy.sh + document rescale & disk maintenance

deploy.sh already pruned old images but never the build cache, which is the larger half of disk creep from CI rebuilds (root fs hit 83% on prod). Add `docker builder prune --filter until=168h` alongside the existing image prune so cleanup happens on every deploy, is version-controlled, and needs no host cron.

Docs (hetzner-server-setup.md, Maintenance section):

- New "Rescaling / Upgrading the Server": when/why, same-arch (Arm/CAX) and CPU-RAM-only vs irreversible disk-expand constraints, poweroff → rescale → power-on → verify steps, and the Arm-capacity-unavailable-in-DC caveat.
- New "Disk Maintenance (Docker Pruning)": emergency manual prune plus the automated deploy.sh approach.
- Fixed stale Resource Budget: cadvisor 128→192 MB (matches compose), total 672→736 MB, and the "live-upgrade" wording (rescale needs a power-off).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 23:36:53 +02:00
parent ac7850b880
commit 64c8a0ec2c
2 changed files with 131 additions and 5 deletions
--- a/docs/deployment/hetzner-server-setup.md
+++ b/docs/deployment/hetzner-server-setup.md
@@ -1513,11 +1513,11 @@ Prometheus + Grafana monitoring stack with host and container metrics.
 | prometheus | 256 MB | Metrics storage (15-day retention, 2 GB max) |
 | grafana | 192 MB | Dashboards (SQLite backend) |
 | node-exporter | 64 MB | Host CPU/RAM/disk metrics |
-| cadvisor | 128 MB | Per-container resource metrics |
+| cadvisor | 192 MB | Per-container resource metrics |
 | redis-exporter | 32 MB | Redis memory, connections, command stats |
-| **Total new** | **672 MB** | |
+| **Total new** | **736 MB** | |

-Existing stack ~1.8 GB + 672 MB new = ~2.5 GB. Leaves ~1.6 GB for OS. If too tight, live-upgrade to CAX21 (8 GB/80 GB, ~7.50 EUR/mo) via **Cloud Console > Server > Rescale** (~2 min restart).
+Existing stack ~1.8 GB + 736 MB new = ~2.5 GB. Leaves ~1.5 GB for OS. If too tight, rescale to CAX21 (4 vCPU / 8 GB, ~7.99 EUR/mo) — note this requires a brief **power-off** (it is not a live resize); see [Rescaling / Upgrading the Server](#rescaling-upgrading-the-server-cpu-ram) for the full procedure and the Arm-capacity caveat.

 ### 18.1 DNS Record

@@ -3021,6 +3021,129 @@ cd ~/apps/orion && bash scripts/deploy.sh

 The script handles stashing local changes, pulling, rebuilding containers, running migrations, and health checks.

+### Rescaling / Upgrading the Server (CPU & RAM)
+
+When the box runs short on RAM/CPU, rescale it to a larger plan. The disk is
+handled separately (see [Disk Maintenance](#disk-maintenance-docker-pruning))
+— rescaling for CPU/RAM does **not** grow the disk, and it usually doesn't need
+to.
+
+**When to rescale (the signals that justify it):**
+
+- `free -h` shows swap heavily/fully used and almost no free RAM.
+- Container memory limits are routinely maxed (`orion-alertmanager-1`,
+  `orion-cadvisor-1` near 100% of their `mem_limit` — visible in
+  `docker stats --no-stream`).
+- `HostHighCpuUsage` fires repeatedly during CI bursts (Gitea Actions builds
+  eat ~half the host CPU on a 2-vCPU box).
+
+**Two important constraints (shown on the Rescale screen):**
+
+1. **Same architecture only.** Our servers are **Arm64 / Ampere (CAX series)**.
+   You can rescale only between CAX plans (CAX11 → CAX21 → CAX31 → CAX41). All
+   the x86 plans (CX / CPX / CCX) will be **greyed out** — that's expected, not
+   an error.
+2. **"CPU and RAM only" vs "expand disk".** Always choose **CPU and RAM only**
+   unless you specifically need more disk. Expanding the disk is
+   **irreversible** — you can never downgrade to a plan with a smaller disk
+   afterwards. CPU+RAM-only rescales stay reversible.
+
+**Procedure:**
+
+```bash
+# 1. On the server: power off from the OS so systemd stops Docker and the
+#    containers gracefully. (Do NOT force-off from the console.)
+sudo poweroff          # your SSH session will drop; this is expected
+```
+
+```text
+# 2. In console.hetzner.cloud → Servers → ubuntu-4gb-nbg1-1 → Rescale:
+#    - wait until the server shows as "off"
+#    - select the target CAX plan (e.g. CAX21: 4 vCPU / 8 GB)
+#    - leave the option on "CPU and RAM only"
+#    - confirm the rescale (takes ~2 min)
+# 3. Power the server back on.
+```
+
+```bash
+# 4. After it boots (~1-2 min), SSH back in and verify:
+nproc                                                  # vCPU count matches new plan
+free -h                                                # new RAM total; swap no longer maxed
+docker compose -f docker-compose.yml --profile full ps # all containers healthy
+docker stats --no-stream                               # real headroom now
+```
+
+Containers auto-start on boot via their restart policies, so you normally don't
+need to bring anything up manually — just confirm they came back healthy and the
+public sites return 200 (`curl -sI https://rewardflow.lu`).
+
+!!! warning "Arm capacity can be unavailable in-datacenter"
+    Hetzner's Arm/Ampere (CAX) stock is limited and fluctuates per location. If
+    the target CAX plans are greyed out with a tooltip like *"not available,
+    please choose another location or type"*, it means there's **no Arm capacity
+    in nbg1 (Nuremberg) right now** — nothing is wrong on your end. **Power the
+    server back on immediately** (don't leave production down waiting) and
+    **retry the rescale later** — capacity comes and goes over hours. Rescaling
+    cannot change the datacenter; if nbg1 stays dry for days, the only
+    alternative is a snapshot → new server in another DC → restore (a full
+    migration — avoid unless truly necessary).
+
+**Post-rescale follow-ups:** once on a bigger box, bump the monitoring
+containers' memory limits in `docker-compose.yml` to give them headroom (commit
+to the repo and `git pull` on the server — don't hand-edit, since the file is
+git-tracked):
+
+- `cadvisor` `mem_limit: 192m` → `256m` (chronic flapping/restarts at 192m
+  cause `TargetDown` floods)
+- `alertmanager` `mem_limit: 32m` → `64m`
+
+Then update the [Resource Budget](#resource-budget-4-gb-server) table to the new
+totals, and lift any silences placed during the pre-rescale alert flood (see
+Alertmanager UI at `http://localhost:9093/#/silences`).
+
+### Disk Maintenance (Docker Pruning)
+
+The root filesystem fills over time because **Gitea CI rebuilds images on every
+run**, leaving behind dangling image layers and (more significantly) Docker
+**build cache**. Left unchecked this trips the `HostHighDiskUsage` alert
+(threshold 80%). On a typical incident the split was ~14 GB unused images +
+~9 GB build cache reclaimable on a 38 GB disk.
+
+**Emergency manual prune** (run on the server when `HostHighDiskUsage` fires):
+
+```bash
+df -h /                  # check current usage
+docker builder prune -f  # remove dangling build cache (the big one); add -a for all unused cache
+docker image prune -af   # remove all images not referenced by any container
+docker system df         # confirm what was reclaimed
+df -h /                  # verify usage dropped
+```
+
+This is non-destructive: running containers, volumes, and the database are
+untouched. Pruned images are re-pulled/rebuilt on the next deploy.
+
+**The proper (automated) way:** `scripts/deploy.sh` already prunes old images at
+the end of every deploy:
+
+```bash
+docker image prune -f --filter "until=72h"
+```
+
+Image pruning alone does **not** touch build cache, which is the larger
+offender. To stop unbounded growth, the same step also runs a build-cache prune
+(keeps the last 7 days so CI stays fast):
+
+```bash
+# in scripts/deploy.sh, step 6 ("Clean up old Docker images + build cache"):
+docker image prune -f --filter "until=72h" > /dev/null 2>&1 || true
+docker builder prune -f --filter "until=168h" > /dev/null 2>&1 || true
+```
+
+Because cleanup then runs on every deploy and lives in version control, there's
+no host-level cron to remember. (A weekly `/etc/cron.weekly/docker-prune` is an
+alternative, but the deploy-script approach is preferred — it's
+version-controlled and scoped to this repo.)
+
 ### View logs

 ```bash
--- a/scripts/deploy.sh
+++ b/scripts/deploy.sh
@@ -88,9 +88,12 @@ if ! curl -sf "$HEALTH_URL" > /dev/null 2>&1; then
    exit 4
 fi

-# ── 6. Clean up old Docker images ───────────────────────────────────────────
-log "Pruning unused Docker images …"
+# ── 6. Clean up old Docker images + build cache ─────────────────────────────
+# Image prune alone leaves CI build cache to grow unbounded (the larger half of
+# disk creep). Prune both; keep the last week of cache so CI stays fast.
+log "Pruning unused Docker images and build cache …"
 docker image prune -f --filter "until=72h" > /dev/null 2>&1 || true
+docker builder prune -f --filter "until=168h" > /dev/null 2>&1 || true

 log "Deploy complete."
 exit 0
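
The emergency procedure in the new Disk Maintenance section starts with a manual `df -h /` check against the `HostHighDiskUsage` threshold (80%). A minimal sketch of that check as shell functions follows. The names `parse_usage_pct`, `over_threshold`, and `THRESHOLD` are hypothetical and not part of `scripts/deploy.sh`; the sketch also assumes the POSIX `df -P` output format, where the fifth column of the data row is the capacity percentage.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not part of scripts/deploy.sh.
# THRESHOLD mirrors the HostHighDiskUsage alert threshold (80%) named in the docs.
THRESHOLD="${THRESHOLD:-80}"

# Read `df -P <path>` output on stdin and print the usage of the first
# data row as a bare integer (for example 83).
parse_usage_pct() {
    awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Return 0 when the given percentage is at or above THRESHOLD, 1 otherwise.
over_threshold() {
    [ "$1" -ge "$THRESHOLD" ]
}
```

One possible use is `over_threshold "$(df -P / | parse_usage_pct)" && echo "prune needed"`, which only reports the condition; the prune commands themselves stay exactly as documented in the diff above.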