docs(ops): add CI-runner offload (2a) + Gitea migration (2c) runbooks
Some checks failed
CI / docs (push) Blocked by required conditions
CI / deploy (push) Blocked by required conditions
CI / ruff (push) Successful in 2m7s
CI / validate (push) Successful in 39s
CI / dependency-scanning (push) Successful in 45s
CI / pytest (push) Failing after 3h3m22s
Document two ways to take CI/Gitea load off the production box, since the
HostHighCpuUsage floods are caused by act_runner running ruff/pytest/validate
on the prod server (not by Gitea hosting, which is light):

- 2a "Offloading CI to a Separate Server": move just the act_runner to a
  cheap x86 box (no data migration, no DNS, no downtime). Includes the
  smaller build-burst caveat (deploy still builds on prod) + the
  registry-pull path.
- 2c "Migrating Gitea to a Separate Server": full separation runbook
  (pg_dump + data-volume tar/restore, DNS cutover, Caddy/SSL, rollback).
  Notes the box becomes stateful/critical (backups + hardening).

mkdocs --strict clean; arch validation 0 new findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -3144,6 +3144,236 @@ no host-level cron to remember. (A weekly `/etc/cron.weekly/docker-prune` is an
alternative, but the deploy-script approach is preferred — it's
version-controlled and scoped to this repo.)

### Offloading CI to a Separate Server (2a, recommended)

**Why:** the Gitea Actions runner (`act_runner`, systemd `gitea-runner.service`)
runs the CI jobs from `.gitea/workflows/ci.yml` (`ruff`, `pytest`, which spins
up its own postgres service container, and `validate`) **on the production
box**. Those jobs cause the ~47% CPU spike on every push that trips
`HostHighCpuUsage` and competes with the app for RAM. Gitea *itself* (git
hosting) is light (~0% CPU, ~5% RAM); the **runner** is the resource hog.

Moving just the runner to a separate, cheap server eliminates the prod CPU
bursts with **no data migration, no DNS change, and no downtime**, and often
removes the need for a rescale entirely. The runner box can be **x86** (it only
lints/tests; it doesn't need to match prod's Arm architecture) and stateless
(rebuildable in minutes), so a **CX22 (2 vCPU / 4 GB, ~3.79 EUR/mo)** is the
minimum and a **CX32 (4 vCPU / 8 GB, ~6.80 EUR/mo)** is comfortable for CI
bursts. x86 has no capacity-wait (see "Why x86 is more abundant"; Arm/Ampere is
a limited pool).

**Steps:**

1. **Provision + harden** a new x86 server (Ubuntu 24.04): follow Steps 2–6
    (non-root user, SSH hardening, UFW, **Docker**; the runner executes jobs in
    containers, so Docker is required).
2. **Get a runner registration token** in Gitea: Site Administration → Actions →
    Runners → *Create new Runner* → copy the token.
3. **Install act_runner** (amd64 build for x86), matching the version in
    [Step 15](#step-15-gitea-actions-runner):

    ```bash
    mkdir -p ~/gitea-runner && cd ~/gitea-runner
    VERSION=0.2.13   # keep in sync with the prod runner (Step 15)
    wget -O act_runner \
      "https://gitea.com/gitea/act_runner/releases/download/v${VERSION}/act_runner-${VERSION}-linux-amd64"
    chmod +x act_runner
    ```

4. **Register with the SAME labels** as the current runner. `ci.yml` uses
    `runs-on: ubuntu-latest`, so the label mapping must be replicated or jobs
    won't be picked up:

    ```bash
    # replace <RUNNER_TOKEN> with the token copied in step 2
    ./act_runner register --no-interactive \
      --instance https://git.wizard.lu \
      --token '<RUNNER_TOKEN>' \
      --name ci-runner-1 \
      --labels 'ubuntu-latest:docker://docker.gitea.com/runner-images:ubuntu-latest,ubuntu-22.04:docker://docker.gitea.com/runner-images:ubuntu-22.04,ubuntu-20.04:docker://docker.gitea.com/runner-images:ubuntu-20.04'
    ```

5. **Generate config + install as a systemd service** (mirror prod's
    `gitea-runner.service`, adjusting `User`/paths for the new box):

    ```bash
    ./act_runner generate-config > config.yaml
    # User= and the /home/samir paths below must match the account on the new box
    sudo tee /etc/systemd/system/gitea-runner.service >/dev/null <<'UNIT'
    [Unit]
    Description=Gitea Actions Runner
    After=network.target
    [Service]
    Type=simple
    User=samir
    WorkingDirectory=/home/samir/gitea-runner
    ExecStart=/home/samir/gitea-runner/act_runner daemon --config /home/samir/gitea-runner/config.yaml
    Restart=always
    RestartSec=10
    [Install]
    WantedBy=multi-user.target
    UNIT
    sudo systemctl daemon-reload && sudo systemctl enable --now gitea-runner.service
    ```

6. **Verify** the new runner shows **online/idle** in Gitea's Runners list.
7. **Smoke-test:** push a trivial commit to `master` and confirm the jobs land on
    `ci-runner-1` (not the prod runner), and the deploy still completes. While
    both runners are online with the same labels, Gitea may hand a job to
    either one; if jobs keep landing on prod, stop the prod runner temporarily
    (`sudo systemctl stop gitea-runner.service`) and push again. The CD
    deploy step uses `appleboy/ssh-action` with the SSH key stored in **Gitea
    repo secrets** (not on the runner host), so the new runner picks it up
    automatically, with **no key to copy**.
8. **Decommission the prod runner** once the new one is proven:

    ```bash
    # on the production box:
    sudo systemctl disable --now gitea-runner.service
    ```

    Optionally remove it from Gitea's Runners list. Watch prod `docker stats`
    during the next CI run; the CPU burst should be gone.

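The download URL in step 3 and the label string in step 4 both follow a regular pattern, so they can be generated rather than typed by hand. The sketch below is one way to do that. The helper names `runner_url` and `runner_labels` are hypothetical (they are not part of `act_runner`), and it assumes every label maps to a `docker.gitea.com/runner-images:<tag>` image exactly as in step 4.

```bash
# Hypothetical helpers, not part of act_runner.

runner_url() {
    # $1 = act_runner version (e.g. 0.2.13), $2 = architecture (e.g. amd64)
    printf 'https://gitea.com/gitea/act_runner/releases/download/v%s/act_runner-%s-linux-%s\n' \
        "$1" "$1" "$2"
}

runner_labels() {
    # Each argument is an image tag; emit "tag:docker://<image>:tag" joined by commas.
    labels=""
    for tag in "$@"; do
        labels="${labels:+$labels,}${tag}:docker://docker.gitea.com/runner-images:${tag}"
    done
    printf '%s\n' "$labels"
}

runner_url 0.2.13 amd64
runner_labels ubuntu-latest ubuntu-22.04 ubuntu-20.04
```

The two printed lines should match the `wget` URL and the `--labels` value used above, which makes it easy to diff the new runner's registration against prod's before going live.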
!!! note "One smaller burst remains on prod"
    The deploy job still runs `docker compose up -d --build` **on prod** (via
    SSH), so the api image is still *built* on the production box: a smaller
    burst than the full CI suite. To remove that too, build images on the runner
    and have prod `pull` instead of `--build`: build → push to **Gitea's built-in
    container registry** → change `deploy.sh` from `--build` to `pull`. That's a
    larger CI rework (and the runner must build **arm64** images via
    `docker buildx build --platform linux/arm64` while prod stays Arm); defer
    unless the build burst alone is still a problem.

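If the build burst does become a problem, the build → push → pull flow from the note could look roughly like the following. This is a dry-run sketch that only prints the commands. The image name `orion-api`, the compose service name `api`, and the idea of tagging with the commit SHA are assumptions, not taken from the existing `deploy.sh` or `ci.yml`.

```bash
# Dry-run sketch: prints the commands, runs nothing.
REGISTRY="git.wizard.lu"   # Gitea's built-in registry is served from the Gitea hostname
OWNER="sboulahtit"
IMAGE="orion-api"          # hypothetical image name

image_ref() {
    # Gitea registry references have the form <host>/<owner>/<image>:<tag>
    printf '%s/%s/%s:%s\n' "$REGISTRY" "$OWNER" "$IMAGE" "$1"
}

print_plan() {
    # $1 = tag to build and deploy (assumed here to be the commit SHA)
    ref="$(image_ref "$1")"
    echo "# on the CI runner (x86, cross-building for Arm):"
    echo "docker buildx build --platform linux/arm64 -t $ref --push ."
    echo "# on prod, replacing 'docker compose up -d --build':"
    echo "docker compose pull api"
    echo "docker compose up -d --no-build api"
}

print_plan "abc1234"
```

For the `pull` to fetch anything, the `api` service in prod's compose file would need an `image:` entry pointing at the same reference, and both the runner and prod would need a `docker login git.wizard.lu` beforehand.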
### Migrating Gitea to a Separate Server (2c)

**When:** after 2a, if you want full separation: production box = app only;
a separate box = Gitea + CI. This buys architectural cleanliness (a prod incident
no longer touches git/CI, and vice versa) and moves the `gitea` + `gitea-db`
containers off prod. **Trade-off:** it's a real data migration, and the new box
becomes **stateful and critical** (it is the source of truth and, if the runner
is co-located, the deploy path to prod), so it must be backed up, monitored, and
hardened like prod. Do it in a **planned maintenance window** (Gitea + CI are
unavailable during cutover). Co-locate it on the **same box as the 2a runner**.

Current Gitea layout (for reference): `~/gitea/docker-compose.yml` defines two
containers: `gitea` (`gitea/gitea:latest`, web on `127.0.0.1:3000`, git SSH on
host `2222`) and `gitea-db` (`postgres:15`). Data lives in two named volumes:
`gitea_gitea-data` (repos, LFS, config, actions artifacts) and
`gitea_gitea-db-data` (the postgres DB). Backups are under `~/backups/gitea/`.

!!! note "Backup coverage & rollback: read before you cut over"
    **What's already safe (code):** This Gitea instance hosts a *single* repo
    (`sboulahtit/orion`) with **no** issues, PRs, releases, wikis, LFS, or
    attachments, so a normal local clone is a **complete backup of all code
    history**. Before migrating, run `git fetch --all --tags` on your laptop (or
    keep a `git clone --mirror`) so every branch/tag is local. Worst case, you
    could recreate the repo from your laptop and `git push`, with zero code loss.

    **The one thing a clone does NOT cover: the 4 CI secrets.** Gitea Actions
    secrets are **write-only**: you cannot read their values back from the UI or
    API. The four (from `.gitea/workflows/ci.yml` → the `deploy` job) are:

    | Secret | Value | Sensitive? |
    |---|---|---|
    | `DEPLOY_HOST` | prod IP (`91.99.65.229`) | no (known) |
    | `DEPLOY_USER` | `samir` | no (known) |
    | `DEPLOY_PATH` | `~/apps/orion` | no (known) |
    | `DEPLOY_SSH_KEY` | **private** SSH deploy key | **yes** (the only real one) |

    So only `DEPLOY_SSH_KEY` matters, and its **public** half is already in
    prod's `~/.ssh/authorized_keys`. Two ways it's covered:

    1. **Automatic (primary path):** the proper restore preserves all four. The
       encrypted values live in the `secret` table (captured by `pg_dump`) and
       are decrypted by `SECRET_KEY` inside `app.ini` (which lives in the
       `gitea-data` volume). **You must restore the DB *and* the `gitea-data`
       volume from the *same* instance together**, because the encrypted
       secrets are useless without their matching `SECRET_KEY`. Never restore
       one without the other.
    2. **Belt-and-suspenders (manual):** before cutover, confirm you still hold
       the `DEPLOY_SSH_KEY` *private* key off-box. If you ever rebuild from the
       local clone alone, re-add the four under *new Gitea → repo → Settings →
       Actions → Secrets*; the three known ones are trivial, and for the key
       either reuse the private key you saved or **regenerate**:
       `ssh-keygen -t ed25519 -f deploy_key`, append `deploy_key.pub` to prod's
       `~/.ssh/authorized_keys`, then paste `deploy_key` as the new
       `DEPLOY_SSH_KEY`.

    **One-shot backup (recommended right before cutover):** run
    `docker exec -u git -w /tmp gitea gitea dump` (Gitea refuses to run as
    root, hence `-u git`), then `docker cp` the resulting `gitea-dump-*.zip`
    out of the container's `/tmp` and copy it off the box. That single archive
    bundles repos + DB + config (`app.ini`/`SECRET_KEY`), so it inherently
    includes the encrypted secrets *and* the key to decrypt them, which makes
    it the cleanest restore artifact.

    **Rollback:** the migration keeps the old volumes intact (step 12 uses
    `docker compose down`, **not** `down -v`). If anything goes sideways,
    re-point `git.wizard.lu` DNS back to the prod IP and `docker compose up -d`
    the old stack; it's untouched. Keep the old volumes until the new box is
    fully verified.

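Because the DB dump and the data-volume archive are only useful as a pair, a small guard before starting the restore can catch a half-finished transfer. The following is a minimal sketch. The function name `check_restore_pair` is hypothetical, and the file names are the ones produced in step 3 of the steps below.

```bash
# Hypothetical pre-restore guard: refuse to proceed unless BOTH artifacts
# (DB dump + data-volume archive) are present and non-empty.
check_restore_pair() {
    # $1 = directory holding the transferred artifacts
    for f in gitea-db.sql gitea-data.tgz; do
        if [ ! -s "$1/$f" ]; then
            echo "missing or empty: $1/$f" >&2
            return 1
        fi
    done
    echo "both artifacts present in $1"
}

# usage on the new box, before restoring anything:
#   check_restore_pair "$PWD" && docker compose up -d gitea-db
```

This only checks presence, not that the two files came from the same dump run; keeping them in one dated directory per run is a simple way to preserve that pairing.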
**Steps:**

1. **Stage the stack on the new box.** Copy `~/gitea/docker-compose.yml` over.
    **Reuse the exact existing env values** (especially `GITEA__database__PASSWD`
    / `POSTGRES_PASSWORD`: copy them from the current file; do not regenerate, or
    the restored DB won't authenticate). Keep `ROOT_URL`/`DOMAIN`/`SSH_DOMAIN`
    as `git.wizard.lu`.
2. **Announce downtime / stop writes** on the old Gitea.
3. **Dump the data on the old box:**

    ```bash
    cd ~/gitea
    # stop Gitea first so the DB dump and the data volume are mutually consistent
    docker compose stop gitea
    docker exec gitea-db pg_dump -U gitea gitea > /tmp/gitea-db.sql
    docker run --rm -v gitea_gitea-data:/data -v /tmp:/backup alpine \
      tar czf /backup/gitea-data.tgz -C /data .
    ```

4. **Transfer** `/tmp/gitea-db.sql` + `/tmp/gitea-data.tgz` to the new box
    (`scp`/`rsync`).
5. **Restore the DB** on the new box:

    ```bash
    # run from the compose directory, with gitea-db.sql in the current directory
    docker compose up -d gitea-db   # wait until healthy
    docker exec -i gitea-db psql -U gitea -d gitea < gitea-db.sql
    ```

6. **Restore the data volume** on the new box:

    ```bash
    # run from the directory holding gitea-data.tgz; the volume name assumes the
    # compose project is still named "gitea" (i.e. the stack lives in ~/gitea)
    docker run --rm -v gitea_gitea-data:/data -v "$PWD":/backup alpine \
      tar xzf /backup/gitea-data.tgz -C /data
    ```

7. **Start Gitea:** `docker compose up -d gitea` and check
    `docker compose logs gitea`.
8. **Firewall:** open `2222/tcp` (git SSH) on the new box's UFW; keep `3000`
    bound to localhost (Caddy proxies it).
9. **Reverse proxy + SSL** on the new box: install Caddy (Step 14) and add the
    `git.wizard.lu` block (same as prod):

    ```caddy
    git.wizard.lu {
        tls {
            issuer acme
        }
        reverse_proxy localhost:3000
    }
    ```

10. **DNS cutover:** point `git.wizard.lu` A/AAAA at the new box's IP (TTL 300 →
    ~5 min). Once propagated, Caddy on the new box auto-issues the TLS cert.
11. **No remote/runner URL changes needed.** The hostname `git.wizard.lu`
    stays the same (only the IP moved), so your `gitea` git remote and the
    runner's `--instance https://git.wizard.lu` keep working after DNS flips.
12. **Decommission Gitea on prod** (keep volumes + backups for a rollback
    window):

    ```bash
    cd ~/gitea && docker compose down   # leaves volumes intact
    ```

    Remove the `git.wizard.lu` block from prod's Caddyfile and reload Caddy;
    optionally close `2222/tcp` on prod's UFW.
13. **Set up backups on the new box** (Step 17); it's now stateful/critical.
14. **Verify:** web UI loads with valid SSL, clone/push over SSH (`:2222`)
    works, a push triggers CI, and repos/actions history are intact. (See the
    "Backup coverage & rollback" callout above if anything needs reverting.)

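The transfer in step 4 is the one point where the two artifacts leave Docker's control, so it may be worth checksumming them on the old box and verifying on the new one before restoring. A minimal sketch follows, assuming GNU or BusyBox `sha256sum` is available on both boxes. The function names and the `SHA256SUMS` manifest file name are hypothetical.

```bash
# Hypothetical helpers for verifying the step 4 transfer.

write_manifest() {
    # Run on the OLD box. $1 = directory holding both artifacts (e.g. /tmp).
    ( cd "$1" && sha256sum gitea-db.sql gitea-data.tgz > SHA256SUMS )
}

verify_manifest() {
    # Run on the NEW box after copying the artifacts AND SHA256SUMS.
    # Exits non-zero if any file is missing or its checksum differs.
    ( cd "$1" && sha256sum -c SHA256SUMS )
}

# usage:
#   old box:  write_manifest /tmp     (then transfer SHA256SUMS with the two files)
#   new box:  verify_manifest "$PWD" && echo "safe to restore"
```

Running `verify_manifest` before step 5 means a truncated `gitea-data.tgz` is caught while the old stack is still one `docker compose up -d` away from being restored.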
### View logs
```bash