From 64c8a0ec2c5380d37a6cc8d9e5a24828ba28a1ee Mon Sep 17 00:00:00 2001
From: Samir Boulahtit
Date: Mon, 1 Jun 2026 23:36:53 +0200
Subject: [PATCH] chore(ops): prune build cache in deploy.sh + document rescale & disk maintenance
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

deploy.sh already pruned old images but never build cache — the larger
half of disk creep from CI rebuilds (root fs hit 83% on prod). Add
`docker builder prune --filter until=168h` alongside the existing image
prune so cleanup happens every deploy, version-controlled, no host cron.

Docs (hetzner-server-setup.md, Maintenance section):
- New "Rescaling / Upgrading the Server" — when/why, same-arch (Arm/CAX) +
  CPU-RAM-only vs irreversible disk-expand constraints, poweroff→rescale→
  power-on→verify steps, and the Arm-capacity-unavailable-in-DC caveat.
- New "Disk Maintenance (Docker Pruning)" — emergency manual prune + the
  automated deploy.sh approach.
- Fixed stale Resource Budget: cadvisor 128→192 MB (matches compose), total
  672→736 MB, and "live-upgrade" wording (rescale needs a power-off).

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 docs/deployment/hetzner-server-setup.md | 129 +++++++++++++++++++++++-
 scripts/deploy.sh                       |   7 +-
 2 files changed, 131 insertions(+), 5 deletions(-)

diff --git a/docs/deployment/hetzner-server-setup.md b/docs/deployment/hetzner-server-setup.md
index d5676921..9762a1c2 100644
--- a/docs/deployment/hetzner-server-setup.md
+++ b/docs/deployment/hetzner-server-setup.md
@@ -1513,11 +1513,11 @@ Prometheus + Grafana monitoring stack with host and container metrics.
 | prometheus | 256 MB | Metrics storage (15-day retention, 2 GB max) |
 | grafana | 192 MB | Dashboards (SQLite backend) |
 | node-exporter | 64 MB | Host CPU/RAM/disk metrics |
-| cadvisor | 128 MB | Per-container resource metrics |
+| cadvisor | 192 MB | Per-container resource metrics |
 | redis-exporter | 32 MB | Redis memory, connections, command stats |
-| **Total new** | **672 MB** | |
+| **Total new** | **736 MB** | |
 
-Existing stack ~1.8 GB + 672 MB new = ~2.5 GB. Leaves ~1.6 GB for OS. If too tight, live-upgrade to CAX21 (8 GB/80 GB, ~7.50 EUR/mo) via **Cloud Console > Server > Rescale** (~2 min restart).
+Existing stack ~1.8 GB + 736 MB new = ~2.5 GB. Leaves ~1.5 GB for OS. If too tight, rescale to CAX21 (4 vCPU / 8 GB, ~7.99 EUR/mo) — note this requires a brief **power-off** (it is not a live resize); see [Rescaling / Upgrading the Server](#rescaling-upgrading-the-server-cpu-ram) for the full procedure and the Arm-capacity caveat.
 
 ### 18.1 DNS Record
 
@@ -3021,6 +3021,129 @@ cd ~/apps/orion && bash scripts/deploy.sh
 
 The script handles stashing local changes, pulling, rebuilding containers, running migrations, and health checks.
 
+### Rescaling / Upgrading the Server (CPU & RAM)
+
+When the box runs short on RAM/CPU, rescale it to a larger plan. The disk is
+handled separately (see [Disk Maintenance](#disk-maintenance-docker-pruning))
+— rescaling for CPU/RAM does **not** grow the disk, and it usually doesn't need
+to.
+
+**When to rescale (the signals that justify it):**
+
+- `free -h` shows swap heavily/fully used and almost no free RAM.
+- Container memory limits are routinely maxed (`orion-alertmanager-1`,
+  `orion-cadvisor-1` near 100% of their `mem_limit` — visible in
+  `docker stats --no-stream`).
+- `HostHighCpuUsage` fires repeatedly during CI bursts (Gitea Actions builds
+  eat ~half the host CPU on a 2-vCPU box).
+
+**Two important constraints (shown on the Rescale screen):**
+
+1. **Same architecture only.** Our servers are **Arm64 / Ampere (CAX series)**.
+   You can rescale only between CAX plans (CAX11 → CAX21 → CAX31 → CAX41). All
+   the x86 plans (CX / CPX / CCX) will be **greyed out** — that's expected, not
+   an error.
+2. **"CPU and RAM only" vs "expand disk".** Always choose **CPU and RAM only**
+   unless you specifically need more disk. Expanding the disk is
+   **irreversible** — you can never downgrade to a plan with a smaller disk
+   afterwards. CPU+RAM-only rescales stay reversible.
+
+**Procedure:**
+
+```bash
+# 1. On the server — stop the stack cleanly so containers shut down gracefully.
+#    (Do NOT force-off from the console; let docker stop properly.)
+sudo poweroff  # your SSH session will drop — expected
+```
+
+```text
+# 2. In console.hetzner.cloud → Servers → ubuntu-4gb-nbg1-1 → Rescale:
+#    - wait until the server shows as "off"
+#    - select the target CAX plan (e.g. CAX21: 4 vCPU / 8 GB)
+#    - leave the option on "CPU and RAM only"
+#    - confirm the rescale (takes ~2 min)
+# 3. Power the server back on.
+```
+
+```bash
+# 4. After it boots (~1-2 min), SSH back in and verify:
+nproc  # vCPU count matches new plan
+free -h  # new RAM total; swap no longer maxed
+docker compose -f docker-compose.yml --profile full ps  # all containers healthy
+docker stats --no-stream  # real headroom now
+```
+
+Containers auto-start on boot via their restart policies, so you normally don't
+need to bring anything up manually — just confirm they came back healthy and the
+public sites return 200 (`curl -sI https://rewardflow.lu`).
+
+!!! warning "Arm capacity can be unavailable in-datacenter"
+    Hetzner's Arm/Ampere (CAX) stock is limited and fluctuates per location. If
+    the target CAX plans are greyed out with a tooltip like *"not available,
+    please choose another location or type"*, it means there's **no Arm capacity
+    in nbg1 (Nuremberg) right now** — nothing is wrong on your end. **Power the
+    server back on immediately** (don't leave production down waiting) and
+    **retry the rescale later** — capacity comes and goes over hours. Rescaling
+    cannot change the datacenter; if nbg1 stays dry for days, the only
+    alternative is a snapshot → new server in another DC → restore (a full
+    migration — avoid unless truly necessary).
+
+**Post-rescale follow-ups:** once on a bigger box, bump the monitoring
+containers' memory limits in `docker-compose.yml` to give them headroom (commit
+to the repo and `git pull` on the server — don't hand-edit, since the file is
+git-tracked):
+
+- `cadvisor` `mem_limit: 192m` → `256m` (chronic flapping/restarts at 192m
+  cause `TargetDown` floods)
+- `alertmanager` `mem_limit: 32m` → `64m`
+
+Then update the [Resource Budget](#resource-budget-4-gb-server) table to the new
+totals, and lift any silences placed during the pre-rescale alert flood (see
+Alertmanager UI at `http://localhost:9093/#/silences`).
+
+### Disk Maintenance (Docker Pruning)
+
+The root filesystem fills over time because **Gitea CI rebuilds images on every
+run**, leaving behind dangling image layers and (more significantly) Docker
+**build cache**. Left unchecked this trips the `HostHighDiskUsage` alert
+(threshold 80%). On a typical incident the split was ~14 GB unused images +
+~9 GB build cache reclaimable on a 38 GB disk.
+
+**Emergency manual prune** (run on the server when `HostHighDiskUsage` fires):
+
+```bash
+df -h /  # check current usage
+docker builder prune -f  # remove all reclaimable build cache (the big one)
+docker image prune -af  # remove all images not used by a running container
+docker system df  # confirm what was reclaimed
+df -h /  # verify usage dropped
+```
+
+This is non-destructive: running containers, volumes, and the database are
+untouched. Pruned images are re-pulled/rebuilt on the next deploy.
+
+**The proper (automated) way:** `scripts/deploy.sh` already prunes old images at
+the end of every deploy:
+
+```bash
+docker image prune -f --filter "until=72h"
+```
+
+…but it does **not** prune build cache, which is the larger offender. To stop
+unbounded growth, add a build-cache prune alongside it in
+`scripts/deploy.sh` (keeps the last 7 days so CI stays fast):
+
+```bash
+# in scripts/deploy.sh, step 6 ("Clean up old Docker images"):
+docker image prune -f --filter "until=72h" > /dev/null 2>&1 || true
+docker builder prune -f --filter "until=168h" > /dev/null 2>&1 || true  # <- add this
+```
+
+Because cleanup then runs on every deploy and lives in version control, there's
+no host-level cron to remember. (A weekly `/etc/cron.weekly/docker-prune` is an
+alternative, but the deploy-script approach is preferred — it's
+version-controlled and scoped to this repo.)
+
 ### View logs
 
 ```bash
diff --git a/scripts/deploy.sh b/scripts/deploy.sh
index 90fc4bf6..4a55bc82 100755
--- a/scripts/deploy.sh
+++ b/scripts/deploy.sh
@@ -88,9 +88,12 @@ if ! curl -sf "$HEALTH_URL" > /dev/null 2>&1; then
     exit 4
 fi
 
-# ── 6. Clean up old Docker images ───────────────────────────────────────────
-log "Pruning unused Docker images …"
+# ── 6. Clean up old Docker images + build cache ─────────────────────────────
+# Image prune alone leaves CI build cache to grow unbounded (the larger half of
+# disk creep). Prune both; keep the last week of cache so CI stays fast.
+log "Pruning unused Docker images and build cache …"
 docker image prune -f --filter "until=72h" > /dev/null 2>&1 || true
+docker builder prune -f --filter "until=168h" > /dev/null 2>&1 || true
 
 log "Deploy complete."
 exit 0
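
Possible follow-up, not part of this patch: the emergency manual prune documented above could be gated on root-fs usage so it only runs once the `HostHighDiskUsage` threshold (80%) is actually crossed. The sketch below is illustrative only. `THRESHOLD`, `disk_used_pct`, `needs_prune` and `emergency_prune` are hypothetical names that do not exist in the repo, and it assumes a POSIX `df` and `awk` on the host.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this patch): gate the emergency prune on
# root-fs usage. All names here are illustrative; 80 mirrors the
# HostHighDiskUsage alert threshold mentioned in the docs.

THRESHOLD="${THRESHOLD:-80}"

# Print the used percentage of the filesystem holding "$1" as a bare integer.
disk_used_pct() {
    # -P forces the POSIX one-line-per-filesystem layout; column 5 is capacity.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed (exit 0) when usage "$1" is at or above limit "$2".
needs_prune() {
    [ "$1" -ge "$2" ]
}

# Run the same two prune commands as the manual procedure, only when needed.
emergency_prune() {
    local used
    used="$(disk_used_pct /)"
    if needs_prune "$used" "$THRESHOLD"; then
        echo "root fs at ${used}%, pruning build cache and unused images"
        docker builder prune -f
        docker image prune -af
    else
        echo "root fs at ${used}%, below ${THRESHOLD}%, nothing to do"
    fi
}
```

Sourcing the file and calling `emergency_prune` on the server would run the prune only when needed. Nothing executes on source, so the two pure helpers can be checked without Docker installed.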