chore(ops): prune build cache in deploy.sh + document rescale & disk maintenance
All checks were successful
deploy.sh already pruned old images but never build cache, the larger half of disk creep from CI rebuilds (root fs hit 83% on prod). Add `docker builder prune --filter until=168h` alongside the existing image prune so cleanup happens every deploy, version-controlled, no host cron.

Docs (hetzner-server-setup.md, Maintenance section):

- New "Rescaling / Upgrading the Server": when/why, same-arch (Arm/CAX) + CPU-RAM-only vs irreversible disk-expand constraints, poweroff → rescale → power-on → verify steps, and the Arm-capacity-unavailable-in-DC caveat.
- New "Disk Maintenance (Docker Pruning)": emergency manual prune + the automated deploy.sh approach.
- Fixed stale Resource Budget: cadvisor 128→192 MB (matches compose), total 672→736 MB, and "live-upgrade" wording (rescale needs a power-off).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -1513,11 +1513,11 @@ Prometheus + Grafana monitoring stack with host and container metrics.
 | prometheus | 256 MB | Metrics storage (15-day retention, 2 GB max) |
 | grafana | 192 MB | Dashboards (SQLite backend) |
 | node-exporter | 64 MB | Host CPU/RAM/disk metrics |
-| cadvisor | 128 MB | Per-container resource metrics |
+| cadvisor | 192 MB | Per-container resource metrics |
 | redis-exporter | 32 MB | Redis memory, connections, command stats |
-| **Total new** | **672 MB** | |
+| **Total new** | **736 MB** | |

-Existing stack ~1.8 GB + 672 MB new = ~2.5 GB. Leaves ~1.6 GB for OS. If too tight, live-upgrade to CAX21 (8 GB/80 GB, ~7.50 EUR/mo) via **Cloud Console > Server > Rescale** (~2 min restart).
+Existing stack ~1.8 GB + 736 MB new = ~2.5 GB. Leaves ~1.5 GB for OS. If too tight, rescale to CAX21 (4 vCPU / 8 GB, ~7.99 EUR/mo). Note that this requires a brief **power-off** (it is not a live resize); see [Rescaling / Upgrading the Server](#rescaling-upgrading-the-server-cpu-ram) for the full procedure and the Arm-capacity caveat.

 ### 18.1 DNS Record

@@ -3021,6 +3021,129 @@ cd ~/apps/orion && bash scripts/deploy.sh
The script handles stashing local changes, pulling, rebuilding containers, running migrations, and health checks.

### Rescaling / Upgrading the Server (CPU & RAM)

When the box runs short on RAM/CPU, rescale it to a larger plan. The disk is
handled separately (see [Disk Maintenance](#disk-maintenance-docker-pruning));
rescaling for CPU/RAM does **not** grow the disk, and it usually doesn't need
to.

**When to rescale (the signals that justify it):**

- `free -h` shows swap heavily/fully used and almost no free RAM.
- Container memory limits are routinely maxed (`orion-alertmanager-1`,
  `orion-cadvisor-1` near 100% of their `mem_limit`, visible in
  `docker stats --no-stream`).
- `HostHighCpuUsage` fires repeatedly during CI bursts (Gitea Actions builds
  eat ~half the host CPU on a 2-vCPU box).

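The first signal in the list above (swap exhausted, little RAM available) can be checked mechanically. What follows is a minimal sketch, not something from the repo: the function names `mem_signals` and `should_rescale` and the 80 % swap / 200 MB thresholds are illustrative assumptions, and it assumes the procps `free -m` layout in which "available" is the seventh field of the `Mem:` row.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): turn `free -m` output into a yes/no
# "memory pressure" signal. Thresholds below are illustrative assumptions.

# Read `free -m` output on stdin; print "<swap_used_pct> <available_mb>".
# Assumes the procps layout where "available" is the 7th field of the Mem: row.
mem_signals() {
  awk '
    /^Mem:/  { avail = $7 }
    /^Swap:/ { total = $2; used = $3 }
    END {
      pct = (total > 0) ? int(used * 100 / total) : 0
      print pct, avail + 0
    }'
}

# Succeed (exit 0) when swap use is at/above the limit AND available RAM is
# at/below the limit. Defaults: 80 % swap, 200 MB available (assumed values).
should_rescale() {
  local swap_pct=$1 avail_mb=$2
  local swap_limit=${3:-80} avail_limit=${4:-200}
  [ "$swap_pct" -ge "$swap_limit" ] && [ "$avail_mb" -le "$avail_limit" ]
}

# Only query the host when `free` exists (it is absent on some systems).
if command -v free >/dev/null 2>&1; then
  signals=$(free -m | mem_signals)
  swap_pct=${signals% *}
  avail_mb=${signals#* }
  if should_rescale "$swap_pct" "$avail_mb"; then
    echo "memory pressure: swap ${swap_pct}% used, ${avail_mb} MB available"
  else
    echo "ok: swap ${swap_pct}% used, ${avail_mb} MB available"
  fi
fi
```

A single reading like this is only a hint; the runbook's wording ("routinely", "repeatedly") suggests looking for sustained pressure before rescaling.
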
**Two important constraints (shown on the Rescale screen):**

1. **Same architecture only.** Our servers are **Arm64 / Ampere (CAX series)**.
   You can rescale only between CAX plans (CAX11 → CAX21 → CAX31 → CAX41). All
   the x86 plans (CX / CPX / CCX) will be **greyed out**; that's expected, not
   an error.
2. **"CPU and RAM only" vs "expand disk".** Always choose **CPU and RAM only**
   unless you specifically need more disk. Expanding the disk is
   **irreversible**: you can never downgrade to a plan with a smaller disk
   afterwards. CPU+RAM-only rescales stay reversible.

**Procedure:**

```bash
# 1. On the server: stop the stack cleanly so containers shut down gracefully.
#    (Do NOT force-off from the console; let docker stop properly.)
sudo poweroff          # your SSH session will drop (expected)
```

```text
# 2. In console.hetzner.cloud → Servers → ubuntu-4gb-nbg1-1 → Rescale:
#      - wait until the server shows as "off"
#      - select the target CAX plan (e.g. CAX21: 4 vCPU / 8 GB)
#      - leave the option on "CPU and RAM only"
#      - confirm the rescale (takes ~2 min)
# 3. Power the server back on.
```

```bash
# 4. After it boots (~1-2 min), SSH back in and verify:
nproc                  # vCPU count matches new plan
free -h                # new RAM total; swap no longer maxed
docker compose -f docker-compose.yml --profile full ps   # all containers healthy
docker stats --no-stream                                 # real headroom now
```

Containers auto-start on boot via their restart policies, so you normally don't
need to bring anything up manually; just confirm they came back healthy and the
public sites return 200 (`curl -sI https://rewardflow.lu`).

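The "confirm they came back healthy" step can be scripted. This is a hedged sketch rather than part of the repo: `flag_unhealthy`, `site_status`, and the `--check` flag are hypothetical names, and the status strings it matches (`Up ...`, `(unhealthy)`, `(health: starting)`) are the ones `docker ps` normally prints.

```bash
#!/usr/bin/env bash
# Hypothetical post-rescale check (not in the repo). Function names and the
# "--check" flag are assumptions made for this sketch.

# Read "name|status" lines (as printed by
#   docker ps -a --format '{{.Names}}|{{.Status}}'
# ) on stdin and print the names that are not up, or are up but not healthy.
flag_unhealthy() {
  awk -F'|' '$2 !~ /^Up / || $2 ~ /\((unhealthy|health: starting)\)/ { print $1 }'
}

# Print the HTTP status code of a HEAD request to the given URL.
site_status() {
  curl -s -o /dev/null -I -w '%{http_code}' "$1"
}

# Touch docker and the network only when explicitly asked to.
if [ "${1:-}" = "--check" ]; then
  bad=$(docker ps -a --format '{{.Names}}|{{.Status}}' | flag_unhealthy)
  [ -z "$bad" ] || echo "not healthy: $bad"
  echo "rewardflow.lu -> HTTP $(site_status https://rewardflow.lu)"
fi
```

Run with `--check` on the server after boot. Containers still in `health: starting` are reported too, so a second run a minute later may come back clean.
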
!!! warning "Arm capacity can be unavailable in-datacenter"
    Hetzner's Arm/Ampere (CAX) stock is limited and fluctuates per location. If
    the target CAX plans are greyed out with a tooltip like *"not available,
    please choose another location or type"*, it means there's **no Arm capacity
    in nbg1 (Nuremberg) right now**; nothing is wrong on your end. **Power the
    server back on immediately** (don't leave production down waiting) and
    **retry the rescale later**, since capacity comes and goes over hours.
    Rescaling cannot change the datacenter; if nbg1 stays dry for days, the only
    alternative is a snapshot → new server in another DC → restore (a full
    migration; avoid unless truly necessary).

**Post-rescale follow-ups:** once on a bigger box, bump the monitoring
containers' memory limits in `docker-compose.yml` to give them headroom (commit
to the repo and `git pull` on the server; don't hand-edit, since the file is
git-tracked):

- `cadvisor` `mem_limit: 192m` → `256m` (chronic flapping/restarts at 192m
  cause `TargetDown` floods)
- `alertmanager` `mem_limit: 32m` → `64m`

Then update the [Resource Budget](#resource-budget-4-gb-server) table to the new
totals, and lift any silences placed during the pre-rescale alert flood (see
Alertmanager UI at `http://localhost:9093/#/silences`).

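To see which containers actually need a higher `mem_limit` after the rescale, the `docker stats` output can be filtered by memory percentage. A minimal sketch under stated assumptions: `near_limit` is a hypothetical helper, the 90 % cut-off is arbitrary, and the percentage is only meaningful relative to `mem_limit` for containers that have one set (otherwise Docker reports it against host memory).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): list containers whose memory use is
# at or above a percentage of their limit. The 90 % default is an assumption.

# Read "name|mem%" lines (as printed by
#   docker stats --no-stream --format '{{.Name}}|{{.MemPerc}}'
# ) on stdin; print the names at or above the threshold given as $1.
near_limit() {
  awk -F'|' -v thr="${1:-90}" '{ sub(/%/, "", $2); if ($2 + 0 >= thr + 0) print $1 }'
}

# Query the daemon only when docker is installed.
if command -v docker >/dev/null 2>&1; then
  docker stats --no-stream --format '{{.Name}}|{{.MemPerc}}' 2>/dev/null | near_limit 90
fi
```

Any name this prints is a candidate for a limit bump in `docker-compose.yml`, applied through the repo as described above.
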
### Disk Maintenance (Docker Pruning)

The root filesystem fills over time because **Gitea CI rebuilds images on every
run**, leaving behind dangling image layers and (more significantly) Docker
**build cache**. Left unchecked this trips the `HostHighDiskUsage` alert
(threshold 80%). In a typical incident the split was ~14 GB unused images +
~9 GB build cache reclaimable on a 38 GB disk.

**Emergency manual prune** (run on the server when `HostHighDiskUsage` fires):

```bash
df -h /                          # check current usage
docker builder prune -f          # remove dangling build cache (the big one); -a removes all unused cache
docker image prune -af           # remove all images not used by any container
docker system df                 # confirm what was reclaimed
df -h /                          # verify usage dropped
```

This is non-destructive: running containers, volumes, and the database are
untouched. Pruned images are re-pulled/rebuilt on the next deploy.

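The manual prune can be wrapped so that it only acts when the disk is actually above the alert threshold. This is a sketch, not the repo's tooling: `parse_df_pct`, `over_threshold`, and the `--prune` flag are hypothetical, and the default of 80 simply mirrors the `HostHighDiskUsage` threshold mentioned in this section.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper (not in the repo): prune only when the root filesystem
# is at or above a threshold. Names and the "--prune" flag are assumptions.

# Read `df -P` output on stdin; print the Capacity column of the first
# filesystem row as a bare number (e.g. "83").
parse_df_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when usage ($1) is at or above the threshold ($2, default 80,
# mirroring the HostHighDiskUsage threshold from this section).
over_threshold() {
  [ "$1" -ge "${2:-80}" ]
}

pct=$(df -P / | parse_df_pct)
echo "root filesystem at ${pct}%"

# Prune only when asked to AND the disk is actually above the threshold.
if [ "${1:-}" = "--prune" ] && over_threshold "$pct"; then
  docker builder prune -f
  docker image prune -af
  df -h /
fi
```

Without `--prune` the script only reports usage, which makes it safe to run as a first look during an incident.
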
**The proper (automated) way:** `scripts/deploy.sh` already prunes old images at
the end of every deploy:

```bash
docker image prune -f --filter "until=72h"
```

…but it does **not** prune build cache, which is the larger offender. To stop
unbounded growth, add a build-cache prune alongside it in
`scripts/deploy.sh` (keeps the last 7 days so CI stays fast):

```bash
# in scripts/deploy.sh, step 6 ("Clean up old Docker images"):
docker image prune -f --filter "until=72h"     > /dev/null 2>&1 || true
docker builder prune -f --filter "until=168h"  > /dev/null 2>&1 || true   # <- add this
```

Because cleanup then runs on every deploy and lives in version control, there's
no host-level cron to remember. (A weekly `/etc/cron.weekly/docker-prune` is an
alternative, but the deploy-script approach is preferred because it's
version-controlled and scoped to this repo.)

### View logs

```bash
@@ -88,9 +88,12 @@ if ! curl -sf "$HEALTH_URL" > /dev/null 2>&1; then
     exit 4
 fi

-# ── 6. Clean up old Docker images ───────────────────────────────────────────
-log "Pruning unused Docker images …"
+# ── 6. Clean up old Docker images + build cache ─────────────────────────────
+# Image prune alone leaves CI build cache to grow unbounded (the larger half of
+# disk creep). Prune both; keep the last week of cache so CI stays fast.
+log "Pruning unused Docker images and build cache …"
 docker image prune -f --filter "until=72h" > /dev/null 2>&1 || true
+docker builder prune -f --filter "until=168h" > /dev/null 2>&1 || true

 log "Deploy complete."
 exit 0