# Hetzner Cloud Server Setup

Complete step-by-step guide for deploying Orion on a Hetzner Cloud VPS.

!!! info "Server Details"
    - **Provider**: Hetzner Cloud
    - **OS**: Ubuntu 24.04.3 LTS (upgraded to 24.04.4 after updates)
    - **Architecture**: aarch64 (ARM64)
    - **IP**: `91.99.65.229`
    - **IPv6**: `2a01:4f8:1c1a:b39c::1`
    - **Disk**: 37 GB
    - **RAM**: 4 GB
    - **Auth**: SSH key (configured via Hetzner Console)
    - **Setup date**: 2026-02-11

!!! success "Progress — 2026-02-12"
    **Completed (Steps 1–16):**

    - Non-root user `samir` with SSH key
    - Server hardened (UFW firewall, SSH root login disabled, fail2ban)
    - Docker 29.2.1 & Docker Compose 5.0.2 installed
    - Gitea running at `https://git.wizard.lu` (user: `sboulahtit`, repo: `orion`)
    - Repository cloned to `~/apps/orion`
    - Production `.env` configured with generated secrets
    - Full Docker stack deployed (API, PostgreSQL, Redis, Celery worker/beat, Flower)
    - Database migrated (76 tables) and seeded (admin, platforms, CMS, email templates)
    - API verified at `https://api.wizard.lu/health`
    - DNS A records configured and propagated for `wizard.lu` and subdomains
    - Caddy 2.10.2 reverse proxy with auto-SSL (Let's Encrypt)
    - Temporary firewall rules removed (ports 3000, 8001)
    - Gitea Actions runner v0.2.13 registered and running as systemd service
    - SSH key added to Gitea for local push via SSH
    - Git remote updated: `ssh://git@git.wizard.lu:2222/sboulahtit/orion.git`
    - ProxyHeadersMiddleware added for correct HTTPS behind Caddy
    - Fixed TierLimitExceededException import and Pydantic @field_validator bugs
    - `wizard.lu` serving frontend with CSS over HTTPS (mixed content fixed)
    - `/merchants` and `/admin` redirect fix (CMS catch-all was intercepting)

!!! success "Progress — 2026-02-13"
    **Completed:**

    - CI fully green: ruff (lint), pytest, architecture, docs all passing
    - Pinned ruff==0.8.4 in requirements-dev.txt (CI/local version mismatch was root cause of recurring I001 errors)
    - Pre-commit hooks configured and installed (ruff auto-fix, architecture validation, trailing whitespace, end-of-file)
    - AAAA (IPv6) records added for all wizard.lu domains
    - mkdocs build clean (zero warnings) — all 32 orphan pages added to nav
    - Pre-commit documented in `docs/development/code-quality.md`
    - **Step 16: Continuous deployment** — auto-deploy on push to master via `scripts/deploy.sh` + Gitea Actions

    **Next steps:**

    - [x] Step 17: Backups
    - [x] Step 18: Monitoring & observability

    **Deferred (not urgent, do when all platforms ready):**

    - [x] ~~DNS A + AAAA records for platform domains (`omsflow.lu`, `rewardflow.lu`)~~
    - [x] ~~Uncomment platform domains in Caddyfile after DNS propagation~~

!!! success "Progress — 2026-02-14"
    **Completed:**

    - **Wizamart → Orion rename** — 1,086 occurrences replaced across 184 files (database identifiers, email addresses, domains, config, templates, docs, seed data)
    - Template renamed: `homepage-wizamart.html` → `homepage-orion.html`
    - **Production DB rebuilt from scratch** with Orion naming (`orion_db`, `orion_user`)
    - Platform domains configured in seed data: wizard.lu (main), omsflow.lu, rewardflow.lu (loyalty)
    - Docker volume explicitly named `orion_postgres_data`
    - `.dockerignore` added — prevents `.env` from being baked into Docker images
    - `env_file: .env` added to `docker-compose.yml` — containers load host env vars properly
    - `CapacitySnapshot` model import fixed (moved from billing to monitoring in `alembic/env.py`)
    - All services verified healthy at `https://api.wizard.lu/health`
    - **Step 17: Backups** — automated pg_dump scripts (daily + weekly rotation), R2 offsite upload, restore helper
    - **Step 18: Monitoring** — Prometheus, Grafana, node-exporter, cAdvisor added to docker-compose; `/metrics` endpoint activated via `prometheus_client`

!!! success "Progress — 2026-02-15"
    **Completed:**

    - **Step 17 server-side**: Hetzner backups enabled (5 of 7 daily images, last 6.22 GB)
    - **Step 18 server-side**: Full monitoring stack deployed — Prometheus (4/4 targets up), Grafana at `https://grafana.wizard.lu` with Node Exporter Full (#1860) and Docker/cAdvisor (#193) dashboards
    - **Domain rename**: `oms.lu` → `omsflow.lu`, `loyalty.lu` → `rewardflow.lu` across entire codebase (19 + 13 files)
    - **Platform domains live**: all three platforms serving HTTPS via Caddy with auto-SSL
        - `https://wizard.lu` (main)
        - `https://omsflow.lu` (OMS)
        - `https://rewardflow.lu` (Loyalty+)
    - Platform `domain` column updated in production DB
    - RAM usage ~2.4 GB on 4 GB server (stable, CI jobs add ~550 MB temporarily)
    - **Systemd backup timer** (`orion-backup.timer`) — daily at 03:00 UTC, tested manually
    - **Cloudflare R2 offsite backups** — `orion-backups` bucket, `awscli` configured with `--profile r2`, `--upload` flag added to systemd timer
    - `python3-pip` and `awscli` installed on server (pip user install, PATH added to `.bashrc` and systemd service)

    **Steps 1–18 fully complete.** All infrastructure operational.

!!! success "Progress — 2026-02-15 (continued)"
    **Completed (Steps 19–24):**

    - **Step 19: Prometheus Alerting** — alert rules (host, container, API, Celery, targets) + Alertmanager with email routing
    - **Step 20: Security Hardening** — Docker network segmentation (frontend/backend/monitoring), fail2ban config, unattended-upgrades
    - **Step 21: Cloudflare Domain Proxy** — origin certificates, WAF, bot protection, rate limiting (documented, user deploys)
    - **Step 22: Incident Response** — 8 runbooks with copy-paste commands, severity levels, decision tree
    - **Step 23: Environment Reference** — all 55+ env vars documented with defaults and production requirements
    - **Step 24: Documentation Updates** — hetzner docs, launch readiness, mkdocs nav updated

    **Steps 1–24 fully complete.** Enterprise infrastructure hardening done.

!!! success "Progress — 2026-02-24"
    **Completed:**

    - **Step 25: Google Wallet Integration** — Google Cloud project "Orion" created, Wallet API enabled, service account configured
        - Google Pay Merchant ID: `BCR2DN5TW2CNXDAG`
        - Google Wallet Issuer ID: `3388000000023089598`
        - Service account: `wallet-service@orion-488322.iam.gserviceaccount.com` (admin role in Pay & Wallet Console)
        - Service account JSON key generated
    - Dependencies added to `requirements.txt`: `google-auth>=2.0.0`, `PyJWT>=2.0.0` (commit `d36783a`)
    - Loyalty env vars added to `.env.example` and `docs/deployment/environment.md`
    - `LOYALTY_GOOGLE_ISSUER_ID` and `LOYALTY_GOOGLE_SERVICE_ACCOUNT_JSON` added to `app/core/config.py` Settings class
    - **End-to-end integration wired:**
        - Enrollment auto-creates Google Wallet class + object (`card_service` → `wallet_service.create_wallet_objects`)
        - Stamp/points operations auto-sync to Google Wallet (`stamp_service`/`points_service` → `wallet_service.sync_card_to_wallets`)
        - Storefront API returns wallet URLs (`GET /loyalty/card`, `POST /loyalty/enroll`)
        - "Add to Google Wallet" button wired in storefront dashboard and enrollment success page (Alpine.js conditional rendering)
    - Google Wallet is a platform-wide config (env vars only) — merchants don't need to configure anything

    **Next steps:**

    - [ ] Upload service account JSON to Hetzner server
    - [ ] Set `LOYALTY_GOOGLE_ISSUER_ID` and `LOYALTY_GOOGLE_SERVICE_ACCOUNT_JSON` in production `.env`
    - [ ] Restart app and test end-to-end: enroll → add pass → stamp → verify pass updates
    - [ ] Submit for Google production approval when ready
    - [ ] Apple Wallet setup (APNs push, certificates, pass images)

!!! success "Progress — 2026-02-16"
    **Completed:**

    - **Step 21: Cloudflare Domain Proxy** — all three domains active on Cloudflare (Full setup):
        - `wizard.lu` — DNS records configured (6 A + 6 AAAA), old CNAME records removed, NS switched at Netim, SSL/TLS set to Full (Strict), Always Use HTTPS enabled, AI crawlers blocked
        - `omsflow.lu` — DNS records configured (2 A + 2 AAAA), NS switched at Netim, SSL/TLS Full (Strict) + Always Use HTTPS
        - `rewardflow.lu` — DNS records configured (2 A + 2 AAAA), NS switched at Netim, SSL/TLS Full (Strict) + Always Use HTTPS
        - `git.wizard.lu` stays DNS-only (grey cloud) for SSH access on port 2222
        - DNSSEC disabled at registrar (will re-enable via Cloudflare later)
        - Registrar: Netim (`netim.com`)
        - Origin certificates generated (non-wildcard, specific subdomains) and installed on server
        - Caddyfile updated: origin certs for proxied domains, `tls { issuer acme }` for `git.wizard.lu`
        - Access logging enabled for fail2ban (`/var/log/caddy/access.log`)
        - All domains verified working: `wizard.lu`, `omsflow.lu`, `rewardflow.lu`, `api.wizard.lu`, `git.wizard.lu`
    - **Step 19: SendGrid SMTP** — fully configured and tested:
        - SendGrid account created (free trial, 60-day limit)
        - `wizard.lu` domain authenticated (5 CNAME + 1 TXT in Cloudflare DNS)
        - Link branding enabled
        - API key `orion-production` created
        - Alertmanager SMTP configured (`alerts@wizard.lu` → SendGrid)
        - App email configured (`EMAIL_PROVIDER=sendgrid`, `noreply@wizard.lu`)
        - Test alert sent and received successfully

    - **Cloudflare security** — configured on all three domains:
        - Bot Fight Mode enabled
        - DDoS protection active (default)
        - Rate limiting: 100 req/10s on `/api/` paths, block for 10s

    **Steps 1–24 fully deployed and operational.**

!!! success "Progress — 2026-02-17"
    **Launch readiness — fully deployed and verified (44/44 checks pass):**

    - **Memory limits** on all 6 app containers (db: 256m, redis: 128m, api: 512m, celery-worker: 768m, celery-beat: 128m, flower: 192m) — rebalanced after celery-worker OOM kills (concurrency reduced from 4 to 2)
    - **Flower port** restricted to localhost only (`127.0.0.1:5555:5555`) — access via Caddy reverse proxy
    - **Flower password** changed from default
    - **Infrastructure health checks** — `/health/ready` now checks PostgreSQL (`SELECT 1`) and Redis (`ping`) with individual check details and latency
    - **fail2ban Caddy auth jail** deployed — bans IPs after 10 failed auth attempts
    - **Unattended upgrades** verified active
    - **Scaling guide** — practical playbook at `docs/deployment/scaling-guide.md`
    - **Server verification script** — `scripts/verify-server.sh` (44/44 PASS, 0 FAIL, 0 WARN)

    **Server is launch-ready for first client (24 stores).**

## Installed Software Versions

| Software | Version |
|---|---|
| Ubuntu | 24.04.4 LTS |
| Kernel | 6.8.0-100-generic (aarch64) |
| Docker | 29.2.1 |
| Docker Compose | 5.0.2 |
| PostgreSQL | 15 (container) |
| Redis | 7-alpine (container) |
| Python | 3.11-slim (container) |
| Gitea | latest (container) |
| Caddy | 2.10.2 |
| act_runner | 0.2.13 |

---

## Step 1: Initial Server Access

```bash
ssh root@91.99.65.229
```

## Step 2: Create Non-Root User

Create a dedicated user with sudo privileges and copy the SSH key:

```bash
# Create user
adduser samir
usermod -aG sudo samir

# Copy SSH keys to new user
rsync --archive --chown=samir:samir ~/.ssh /home/samir
```

Verify by connecting as the new user (from a **new terminal**):

```bash
ssh samir@91.99.65.229
```

## Step 3: System Update & Essential Packages

```bash
sudo apt update && sudo apt upgrade -y

sudo apt install -y \
    curl \
    git \
    wget \
    ufw \
    fail2ban \
    htop \
    unzip \
    make
```

Reboot if a kernel upgrade is pending:

```bash
sudo reboot
```

## Step 4: Firewall Configuration (UFW)

```bash
sudo ufw allow OpenSSH
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw enable
```

Verify:

```bash
sudo ufw status
```

Expected output:

```
Status: active

To                         Action      From
--                         ------      ----
OpenSSH                    ALLOW       Anywhere
80/tcp                     ALLOW       Anywhere
443/tcp                    ALLOW       Anywhere
```

!!! warning "Hetzner Cloud blocks outbound TCP 25 and 465 by default"
    Hetzner Cloud applies an **upstream egress block on TCP ports 25 and 465** to every Cloud Server, as part of their published anti-spam policy. This block sits *above* UFW/iptables on the VM — `ufw status` won't show it, and `iptables -L OUTPUT` looks completely clean. The symptom is that SYN packets to those ports simply time out at the network layer while every other port (including 587) works.

    If your monitoring stack (Step 19) or any other service needs to send via port 465 (SMTPS / implicit TLS), you must request the unblock from Hetzner support:

    1. **Test first** — confirm it's actually the Hetzner block, not something on your VM:
        ```bash
        timeout 5 nc -4 -zv <mail-host> 465   # silent timeout → likely Hetzner upstream
        timeout 5 nc -4 -zv <mail-host> 587   # succeeds → general egress is fine, only 465 is blocked
        ```
    2. **Submit unblock request** via [console.hetzner.cloud](https://console.hetzner.cloud) → Support → New ticket. Hetzner's docs invite this explicitly: *"Outgoing traffic to ports 25 and 465 are blocked by default on all Cloud Servers. Send us a request to unblock these ports."*

        Sample ticket text:

        ```
        Hi,

        Please unblock outbound TCP port 465 for my Cloud server:
        Project: <project name>
        Server: <server name>
        IPv4: <server IPv4>

        Reason: legitimate SMTP submission via my mail provider's documented
        SMTPS endpoint. Confirmed via UFW, iptables, nftables, and Hetzner
        Cloud Firewall that no rule on my side blocks the port; the block
        is upstream.

        Volume: monitoring alert emails, ~10/day.
        Thanks.
        ```

        Hetzner usually auto-approves within minutes for legitimate use cases.

    Real prod incident this caused: 5 hours of "is my SMTP password wrong?" debugging on 2026-05-30 before discovering the egress block. Don't repeat that — if you see a port-465 connection time out from a Cloud Server, suspect the upstream block first.

## Step 5: Harden SSH

!!! warning "Before doing this step"
    Make sure you can SSH as `samir` from another terminal first!
    If you lock yourself out, you'll need to use Hetzner's console rescue mode.

```bash
sudo sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin no/' /etc/ssh/sshd_config
sudo sed -i 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
sudo systemctl restart ssh   # Note: Ubuntu 24.04 uses 'ssh' not 'sshd'
```

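After the restart it may be worth confirming that both directives are actually active in the file. The helper below is a minimal sketch; `check_sshd_option` is a hypothetical name, not part of the Orion repository, and it only inspects the file it is given.

```bash
# Hypothetical helper: succeed only if FILE contains an active (uncommented)
# line of the form "KEY VALUE" (case-insensitive, as sshd keywords are).
check_sshd_option() {
  local file="$1" key="$2" value="$3"
  grep -Eiq "^[[:space:]]*${key}[[:space:]]+${value}[[:space:]]*$" "$file"
}

# Example usage on the server:
#   check_sshd_option /etc/ssh/sshd_config PermitRootLogin no \
#     && echo "root login disabled"
#   check_sshd_option /etc/ssh/sshd_config PasswordAuthentication no \
#     && echo "password auth disabled"
```

Drop-in files under `/etc/ssh/sshd_config.d/` can override the main file, so if a check passes but behaviour differs, `sudo sshd -T` prints the effective configuration.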
## Step 6: Install Docker & Docker Compose

```bash
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker samir
```

Log out and back in for the group change to take effect:

```bash
exit
# Then: ssh samir@91.99.65.229
```

Verify:

```bash
docker --version
docker compose version
```

## Step 7: Gitea (Self-Hosted Git)

Create the Gitea directory and compose file:

```bash
mkdir -p ~/gitea && cd ~/gitea
```

Create `docker-compose.yml` with `nano ~/gitea/docker-compose.yml`:

```yaml
services:
  gitea:
    image: gitea/gitea:latest
    container_name: gitea
    restart: always
    environment:
      - USER_UID=1000
      - USER_GID=1000
      - GITEA__database__DB_TYPE=postgres
      - GITEA__database__HOST=gitea-db:5432
      - GITEA__database__NAME=gitea
      - GITEA__database__USER=gitea
      - GITEA__database__PASSWD=<GENERATED_PASSWORD>
      - GITEA__server__ROOT_URL=http://91.99.65.229:3000/
      - GITEA__server__SSH_DOMAIN=91.99.65.229
      - GITEA__server__DOMAIN=91.99.65.229
      - GITEA__actions__ENABLED=true
    volumes:
      - gitea-data:/data
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
    ports:
      - "127.0.0.1:3000:3000"  # Localhost only — Caddy proxies git.wizard.lu
      - "2222:22"              # SSH must stay public (Caddy can't proxy SSH)
    depends_on:
      gitea-db:
        condition: service_healthy

  gitea-db:
    image: postgres:15
    container_name: gitea-db
    restart: always
    environment:
      POSTGRES_DB: gitea
      POSTGRES_USER: gitea
      POSTGRES_PASSWORD: <GENERATED_PASSWORD>
    volumes:
      - gitea-db-data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U gitea"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  gitea-data:
  gitea-db-data:
```

Generate the database password with `openssl rand -hex 16` and replace `<GENERATED_PASSWORD>` in both places.

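The two placeholders must receive the same value, which is easy to get wrong when pasting by hand. As a rough sketch, assuming GNU `sed` and a hypothetical helper name `fill_gitea_password`, the substitution could be done in one pass:

```bash
# Hypothetical helper: generate one password and substitute it for every
# <GENERATED_PASSWORD> placeholder in FILE, so Gitea and its database agree.
fill_gitea_password() {
  local file="$1"
  local password
  password="$(openssl rand -hex 16)"   # 32 hex characters, safe inside sed
  sed -i "s/<GENERATED_PASSWORD>/${password}/g" "$file"
}

# Example usage:
#   fill_gitea_password ~/gitea/docker-compose.yml
```

Because the value is hex only, it needs no escaping in the `sed` expression or in the YAML file.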
Start Gitea:

```bash
docker compose up -d
docker compose ps
```

!!! note "Initial setup only"
    For initial setup, temporarily change the Gitea port to `"3000:3000"` (without `127.0.0.1`) so you can access the web UI from your browser at `http://91.99.65.229:3000`. After Caddy is configured (Step 14), change it back to `"127.0.0.1:3000:3000"` and access via `https://git.wizard.lu` instead.

Visit `http://91.99.65.229:3000` and complete the setup wizard. Create an admin account (e.g. `sboulahtit`).

Then create a repository (e.g. `orion`).

## Step 8: Push Repository to Gitea

### Add SSH Key to Gitea

Before pushing via SSH, add your local machine's public key to Gitea:

1. Copy your public key:

    ```bash
    cat ~/.ssh/id_ed25519.pub
    # Or if using RSA: cat ~/.ssh/id_rsa.pub
    ```

2. In the Gitea web UI: click your avatar → **Settings** → **SSH / GPG Keys** → **Add Key** → paste the key.

3. Add the Gitea SSH host to known hosts:

    ```bash
    ssh-keyscan -p 2222 git.wizard.lu >> ~/.ssh/known_hosts
    ```

### Add Remote and Push

From your **local machine**:

```bash
cd /home/samir/Documents/PycharmProjects/letzshop-product-import
git remote add gitea ssh://git@git.wizard.lu:2222/sboulahtit/orion.git
git push gitea master
```

!!! note "Remote URL updated"
    The remote was initially set to `http://91.99.65.229:3000/...` during setup.
    After Caddy was configured, it was updated to use the domain with SSH:
    `ssh://git@git.wizard.lu:2222/sboulahtit/orion.git`

    To update an existing remote:
    ```bash
    git remote set-url gitea ssh://git@git.wizard.lu:2222/sboulahtit/orion.git
    ```

## Step 9: Clone Repository on Server

```bash
mkdir -p ~/apps
cd ~/apps
git clone http://localhost:3000/sboulahtit/orion.git
cd orion
```

## Step 10: Configure Production Environment

```bash
cp .env.example .env
nano .env
```

### Critical Production Values

Generate secrets:

```bash
openssl rand -hex 32   # For JWT_SECRET_KEY
openssl rand -hex 16   # For database password
openssl rand -hex 16   # For Redis password
```

| Variable | How to Generate / What to Set |
|---|---|
| `DEBUG` | `False` |
| `DATABASE_URL` | `postgresql://orion_user:YOUR_DB_PW@db:5432/orion_db` |
| `JWT_SECRET_KEY` | Output of `openssl rand -hex 32` |
| `ADMIN_PASSWORD` | Strong password |
| `USE_CELERY` | `true` |
| `REDIS_PASSWORD` | Output of `openssl rand -hex 16` |
| `REDIS_URL` | `redis://redis:6379/0` (docker-compose handles auth via `REDIS_PASSWORD`) |
| `STRIPE_SECRET_KEY` | Your Stripe secret key (configure later) |
| `STRIPE_PUBLISHABLE_KEY` | Your Stripe publishable key (configure later) |
| `STRIPE_WEBHOOK_SECRET` | Your Stripe webhook secret (configure later) |
| `STORAGE_BACKEND` | `r2` (if using Cloudflare R2, configure later) |

Also update the PostgreSQL password in `docker-compose.yml` (lines 9 and 40) to match.

!!! danger "Docker bypasses UFW"
    Docker manipulates iptables directly. Never use bare port mappings like `"5432:5432"` in docker-compose — they expose services to the internet regardless of UFW rules. See [Step 20.0](#200-docker-port-binding-critical-docker-bypasses-ufw) for details.

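A quick scan of a compose file can catch bare port mappings before they are deployed. The following is a minimal sketch, not an existing project script; `find_public_ports` is a hypothetical name and the pattern only covers the short `"HOST:CONTAINER"` list syntax.

```bash
# Hypothetical helper: print (with line numbers) every short-syntax port
# mapping in FILE that starts with a bare port, i.e. has no bind address.
# A mapping such as "127.0.0.1:3000:3000" does not match, because the text
# after the optional quote is not "<digits>:<digits>".
find_public_ports() {
  local file="$1"
  grep -En '^[[:space:]]*-[[:space:]]*"?[0-9]+:[0-9]+' "$file"
}

# Example usage (exit status 1 means nothing was found):
#   find_public_ports docker-compose.yml
#   find_public_ports ~/gitea/docker-compose.yml   # expect only "2222:22"
```

Intentionally public ports, such as Gitea's SSH mapping, will still be listed, so the output is a review list and not a pass/fail verdict.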
## Step 11: Deploy with Docker Compose

```bash
cd ~/apps/orion

# Create directories with correct permissions for the container user
mkdir -p logs uploads exports
sudo chown -R 1000:1000 logs uploads exports

# Start infrastructure first
docker compose up -d db redis

# Wait for health checks to pass
docker compose ps

# Build and start the full stack
docker compose --profile full up -d --build
```

Verify all services are running:

```bash
docker compose --profile full ps
```

Expected: `api` (healthy), `db` (healthy), `redis` (healthy), `celery-worker` (healthy), `celery-beat` (running), `flower` (running).

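Checking `docker compose ps` by hand works, but the "wait for health checks" step can also be scripted. The retry loop below is a sketch under the assumption that a 2-second poll interval is acceptable; `wait_until` is a hypothetical helper, not something shipped with the repository.

```bash
# Hypothetical helper: run CMD every 2 seconds until it succeeds, giving up
# (exit status 1) once TIMEOUT seconds have elapsed.
wait_until() {
  local timeout_s="$1"
  shift
  local elapsed=0
  until "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep 2
    elapsed=$((elapsed + 2))
  done
}

# Example usage, between starting db/redis and building the full stack:
#   wait_until 60 docker compose exec -T db pg_isready -U orion_user -d orion_db
#   wait_until 60 docker compose exec -T redis redis-cli ping
```

The `-T` flag disables pseudo-TTY allocation so the commands also work from non-interactive scripts.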
## Step 12: Initialize Database

!!! note "PYTHONPATH required"
    The seed scripts need `PYTHONPATH=/app` set explicitly when running inside the container.
    Use `run --rm` (not `exec`) if the api service is not yet running.

### First-time initialization

```bash
# Run migrations (use 'heads' for multi-branch Alembic)
docker compose --profile full run --rm -e PYTHONPATH=/app api alembic upgrade heads

# Seed production data (order matters)
docker compose --profile full run --rm -e PYTHONPATH=/app api python scripts/seed/init_production.py
docker compose --profile full run --rm -e PYTHONPATH=/app api python scripts/seed/init_log_settings.py
docker compose --profile full run --rm -e PYTHONPATH=/app api python scripts/seed/create_default_content_pages.py
docker compose --profile full run --rm -e PYTHONPATH=/app api python scripts/seed/seed_email_templates.py
```

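Since the seed order matters, a loop that stops at the first failure avoids seeding later scripts on top of a broken earlier step. This is a sketch only: `run_seeds` and the `RUN` override are hypothetical, and the script list simply repeats the first-time commands above.

```bash
# Hypothetical wrapper: run the seed scripts in the documented order and stop
# at the first failure. RUN is an optional command prefix (an assumption made
# for dry runs); by default it is the same docker compose invocation as above.
run_seeds() {
  local script
  for script in \
    scripts/seed/init_production.py \
    scripts/seed/init_log_settings.py \
    scripts/seed/create_default_content_pages.py \
    scripts/seed/seed_email_templates.py
  do
    echo "Seeding: $script"
    ${RUN:-docker compose --profile full run --rm -e PYTHONPATH=/app api python} "$script" || return 1
  done
}

# Example usage:
#   RUN=echo run_seeds   # dry run, only prints what would be executed
#   run_seeds            # real run inside ~/apps/orion
```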
### Full reset procedure (nuclear — deletes all data)

Use this to reset the database from scratch. Stop workers first to avoid task conflicts.

```bash
# 1. Stop everything
docker compose --profile full down

# 2. Pull latest code
git pull origin master

# 3. Rebuild ALL images (picks up latest code)
docker compose --profile full build

# 4. Start infrastructure only
docker compose up -d db redis

# 5. Wait for healthy
docker compose exec db pg_isready -U orion_user -d orion_db
docker compose exec redis redis-cli ping

# 6. Drop and recreate schema
docker compose --profile full run --rm -e PYTHONPATH=/app api python -c "
from app.core.config import settings
from sqlalchemy import create_engine, text
e = create_engine(settings.database_url)
c = e.connect()
c.execute(text('DROP SCHEMA IF EXISTS public CASCADE'))
c.execute(text('CREATE SCHEMA public'))
c.commit()
c.close()
print('Schema reset complete')
"

# 7. Run migrations
docker compose --profile full run --rm -e PYTHONPATH=/app api alembic upgrade heads

# 8. Seed in order
docker compose --profile full run --rm -e PYTHONPATH=/app api python scripts/seed/init_production.py
docker compose --profile full run --rm -e PYTHONPATH=/app api python scripts/seed/init_log_settings.py
docker compose --profile full run --rm -e PYTHONPATH=/app api python scripts/seed/create_default_content_pages.py
docker compose --profile full run --rm -e PYTHONPATH=/app api python scripts/seed/seed_email_templates_core.py
docker compose --profile full run --rm -e PYTHONPATH=/app api python scripts/seed/seed_email_templates_loyalty.py
docker compose --profile full run --rm -e PYTHONPATH=/app api python scripts/seed/seed_demo.py

# 9. Start all services
docker compose --profile full up -d

# 10. Verify
docker compose --profile full ps
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"
```

!!! note "After the reset"
    `init_production.py` re-creates the four admin users with their **default** passwords (see `init_production.py:280-300`). Any admin-side configuration that lives in the `admin_settings` table (e.g. the manual SMTP overrides under `/admin/settings`) is wiped and must be re-applied. The `/health` endpoint reads `.build-info` which is only regenerated by `scripts/deploy.sh` or `scripts/deploy-api-only.sh` (see [Step 16.5](#165-manual-deploy)), so after a manual reset it will report the **previous** commit; harmless but worth knowing.

### Seeded Data Summary

| Data | Count |
|---|---|
| Admin users | 3 (super admin + OMS admin + Loyalty admin) |
| Platforms | 3 (OMS, Wizard, Loyalty) |
| Platform modules | 57 |
| Admin settings | 15 |
| Subscription tiers | 12 (4 per platform: Essential, Professional, Business, Enterprise) |
| Log settings | 6 |
| CMS pages | 30 (platform homepages + marketing pages + store defaults) |
| Email templates | 28 (4 languages: en, fr, de, lb) |

---

## Step 13: DNS Configuration

Before setting up Caddy, point your domain's DNS to the server.

### wizard.lu (Main Platform) — Completed

| Type | Name | Value | TTL |
|---|---|---|---|
| A | `@` | `91.99.65.229` | 300 |
| A | `www` | `91.99.65.229` | 300 |
| A | `api` | `91.99.65.229` | 300 |
| A | `git` | `91.99.65.229` | 300 |
| A | `flower` | `91.99.65.229` | 300 |

### omsflow.lu (OMS Platform) — Completed

| Type | Name | Value | TTL |
|---|---|---|---|
| A | `@` | `91.99.65.229` | 300 |
| A | `www` | `91.99.65.229` | 300 |
| AAAA | `@` | `2a01:4f8:1c1a:b39c::1` | 300 |
| AAAA | `www` | `2a01:4f8:1c1a:b39c::1` | 300 |

### rewardflow.lu (Loyalty+ Platform) — Completed

| Type | Name | Value | TTL |
|---|---|---|---|
| A | `@` | `91.99.65.229` | 300 |
| A | `www` | `91.99.65.229` | 300 |
| AAAA | `@` | `2a01:4f8:1c1a:b39c::1` | 300 |
| AAAA | `www` | `2a01:4f8:1c1a:b39c::1` | 300 |

### IPv6 (AAAA) Records — Completed

AAAA records are included in the DNS tables above for all domains. To verify your IPv6 address:

```bash
ip -6 addr show eth0 | grep 'scope global'
```

It should match the value in the Hetzner Cloud Console (Networking tab). Then create AAAA records mirroring each A record above, e.g.:

| Type | Name (wizard.lu) | Value | TTL |
|---|---|---|---|
| AAAA | `@` | `2a01:4f8:1c1a:b39c::1` | 300 |
| AAAA | `www` | `2a01:4f8:1c1a:b39c::1` | 300 |
| AAAA | `api` | `2a01:4f8:1c1a:b39c::1` | 300 |
| AAAA | `git` | `2a01:4f8:1c1a:b39c::1` | 300 |
| AAAA | `flower` | `2a01:4f8:1c1a:b39c::1` | 300 |

Repeat for `omsflow.lu`, `rewardflow.lu`, and `hostwizard.lu`.

**hostwizard.lu DNS Records:**

| Type | Name | Value | TTL |
|---|---|---|---|
| A | `@` | `91.99.65.229` | 300 |
| A | `www` | `91.99.65.229` | 300 |
| AAAA | `@` | `2a01:4f8:1c1a:b39c::1` | 300 |
| AAAA | `www` | `2a01:4f8:1c1a:b39c::1` | 300 |

!!! tip "DNS propagation"
    Set TTL to 300 (5 minutes) initially. DNS changes can take up to 24 hours to propagate globally, but usually complete within 30 minutes. Verify with: `dig api.wizard.lu +short`

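The `dig` check can be wrapped so that it returns a clear yes/no for each record. The sketch below assumes `dig` is installed; `dns_points_to` and the `RESOLVE` override are hypothetical names introduced here for illustration.

```bash
# Hypothetical helper: succeed if HOST resolves to exactly EXPECTED.
# RESOLVE is an optional lookup command (an assumption, useful for AAAA
# lookups or for testing); it defaults to `dig +short` as in the tip above.
dns_points_to() {
  local host="$1" expected="$2"
  ${RESOLVE:-dig +short} "$host" | grep -Fxq "$expected"
}

# Example usage:
#   dns_points_to api.wizard.lu 91.99.65.229 && echo "A record propagated"
#   RESOLVE="dig +short AAAA" dns_points_to wizard.lu 2a01:4f8:1c1a:b39c::1
```

Once a domain is proxied through Cloudflare (Step 21), lookups are expected to return Cloudflare addresses instead of the server IP, so this check mainly applies to DNS-only records.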
## Step 14: Reverse Proxy with Caddy

Install Caddy:

```bash
sudo apt install -y debian-keyring debian-archive-keyring apt-transport-https
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' \
    | sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' \
    | sudo tee /etc/apt/sources.list.d/caddy-stable.list
sudo apt update && sudo apt install caddy
```

### Caddyfile Configuration

Edit `/etc/caddy/Caddyfile`:

```caddy
# ─── Platform 1: Main (wizard.lu) ───────────────────────────
wizard.lu {
    reverse_proxy localhost:8001
}

www.wizard.lu {
    redir https://wizard.lu{uri} permanent
}

# ─── Platform 2: OMS (omsflow.lu) ───────────────────────────────
omsflow.lu {
    reverse_proxy localhost:8001
}

www.omsflow.lu {
    redir https://omsflow.lu{uri} permanent
}

# ─── Platform 3: Loyalty+ (rewardflow.lu) ──────────────────
rewardflow.lu {
    reverse_proxy localhost:8001
}

www.rewardflow.lu {
    redir https://rewardflow.lu{uri} permanent
}

# ─── Platform 4: HostWizard (hostwizard.lu) ──────────────────
hostwizard.lu {
    reverse_proxy localhost:8001
}

www.hostwizard.lu {
    redir https://hostwizard.lu{uri} permanent
}

# ─── Services ───────────────────────────────────────────────
api.wizard.lu {
    reverse_proxy localhost:8001
}

git.wizard.lu {
    reverse_proxy localhost:3000
}

flower.wizard.lu {
    reverse_proxy localhost:5555
}
```

!!! info "How multi-platform routing works"
    All platform domains (`wizard.lu`, `omsflow.lu`, `rewardflow.lu`, `hostwizard.lu`) point to the **same FastAPI backend** on port 8001. The `PlatformContextMiddleware` reads the `Host` header to detect which platform the request is for. Caddy preserves the Host header by default, so no extra configuration is needed.

    The `domain` column in the `platforms` database table must match:

    | Platform | code | domain |
    |---|---|---|
    | Main | `main` | `wizard.lu` |
    | OMS | `oms` | `omsflow.lu` |
    | Loyalty+ | `loyalty` | `rewardflow.lu` |
    | HostWizard | `hosting` | `hostwizard.lu` |

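To make the Host-based lookup concrete, the table above can be expressed as a small mapping function. This is purely illustrative: the real detection happens inside the application's `PlatformContextMiddleware`, whose implementation is not shown in this guide, and `platform_for_host` is a hypothetical name.

```bash
# Illustrative only: map a Host header value to the platform code from the
# table above. Returns exit status 1 for hosts that are not platform roots.
platform_for_host() {
  local host="${1%%:*}"   # drop an optional ":port" suffix
  case "$host" in
    wizard.lu)     echo main ;;
    omsflow.lu)    echo oms ;;
    rewardflow.lu) echo loyalty ;;
    hostwizard.lu) echo hosting ;;
    *)             return 1 ;;
  esac
}

# Example usage:
#   platform_for_host omsflow.lu      # prints: oms
#   platform_for_host wizard.lu:443   # prints: main
```

If a domain in the database differs from the hostname Caddy serves, a lookup of this kind finds no match, which is why the `domain` column has to stay in sync with the Caddyfile.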
Start Caddy:

```bash
sudo systemctl restart caddy
```

Caddy automatically provisions Let's Encrypt SSL certificates for all configured domains.

Verify:

```bash
curl -I https://wizard.lu
curl -I https://api.wizard.lu/health
curl -I https://git.wizard.lu
```

After Caddy is working, lock down Gitea's port to localhost in `~/gitea/docker-compose.yml`:

```yaml
ports:
  - "127.0.0.1:3000:3000"  # Localhost only — Caddy proxies git.wizard.lu
  - "2222:22"              # SSH must stay public (Caddy can't proxy SSH)
```

Then restart Gitea: `cd ~/gitea && docker compose up -d gitea`

!!! warning "Do NOT rely on UFW for Docker ports"
    Docker bypasses UFW entirely. The only way to restrict Docker port access is to bind to `127.0.0.1` in the port mapping. See [Step 20.0](#200-docker-port-binding-critical-docker-bypasses-ufw).

Update Gitea's configuration to use its new domain. In `~/gitea/docker-compose.yml`, change:

```yaml
- GITEA__server__ROOT_URL=https://git.wizard.lu/
- GITEA__server__SSH_DOMAIN=git.wizard.lu
- GITEA__server__DOMAIN=git.wizard.lu
```

Then restart Gitea:

```bash
cd ~/gitea && docker compose up -d gitea
```

### Multi-Tenant Store Routing
|
||
|
||
Stores on each platform use two routing modes:
|
||
|
||
- **Standard (subdomain)**: `acme.omsflow.lu` — included in the base subscription
|
||
- **Premium (custom domain)**: `acme.lu` — available with premium subscription tiers
|
||
|
||
Both modes are handled by the `StoreContextMiddleware` which reads the `Host` header, so Caddy just needs to forward requests and preserve the header.
|
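As a rough illustration of that Host-based decision, the sketch below classifies a hostname as a platform root, a store subdomain, or a custom domain. `classify_host` is a hypothetical helper written for this guide, not code from `StoreContextMiddleware`, and it assumes the platform domains are passed in as arguments rather than read from the database.

```bash
# Hypothetical helper: mirrors the routing decision described above.
# Usage: classify_host HOST PLATFORM_DOMAIN...
classify_host() (
    host=$1
    shift
    for domain in "$@"; do
        case "$host" in
            "$domain"|"www.$domain")
                # Platform root (or its www alias)
                echo "platform $domain"
                exit 0
                ;;
            *".$domain")
                # Anything else under the platform domain is a store subdomain
                sub=${host%".$domain"}
                echo "subdomain $sub on $domain"
                exit 0
                ;;
        esac
    done
    # No platform domain matched: treat it as a custom store domain
    echo "custom $host"
)
```

For example, `classify_host acme.omsflow.lu omsflow.lu rewardflow.lu hostwizard.lu` prints `subdomain acme on omsflow.lu`, while `classify_host acme.lu omsflow.lu` prints `custom acme.lu`.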
||
|
||
#### Wildcard Subdomains (for store subdomains)
|
||
|
||
Each non-main platform uses a wildcard Caddy block so any store subdomain (e.g. `acme.omsflow.lu`, `parcelproxy.hostwizard.lu`) is automatically routed without per-store Caddy changes.
|
||
|
||
The wildcard blocks use the same origin cert as the platform root domain. The origin cert must include `*.<platform_domain>` — see [Step 21.4](#214-generate-origin-certificates) for how to generate it.
|
||
|
||
```caddy
|
||
*.omsflow.lu {
|
||
tls /etc/caddy/certs/omsflow.lu/cert.pem /etc/caddy/certs/omsflow.lu/key.pem
|
||
reverse_proxy localhost:8001
|
||
}
|
||
|
||
*.rewardflow.lu {
|
||
tls /etc/caddy/certs/rewardflow.lu/cert.pem /etc/caddy/certs/rewardflow.lu/key.pem
|
||
reverse_proxy localhost:8001
|
||
}
|
||
|
||
*.hostwizard.lu {
|
||
tls /etc/caddy/certs/hostwizard.lu/cert.pem /etc/caddy/certs/hostwizard.lu/key.pem
|
||
reverse_proxy localhost:8001
|
||
}
|
||
```
|
||
|
||
!!! warning "No wildcard for wizard.lu"
|
||
`wizard.lu` cannot use a wildcard block because `git.wizard.lu` is DNS-only (grey cloud in Cloudflare) and uses a Let's Encrypt cert. A wildcard origin cert would conflict. For wizard.lu subdomains, add each one explicitly (same pattern as `api.wizard.lu`, `flower.wizard.lu`).
|
||
|
||
**Cloudflare DNS**: Add a wildcard DNS record for each platform:
|
||
|
||
- `*.omsflow.lu` → A `91.99.65.229` (proxied, orange cloud)
|
||
- `*.rewardflow.lu` → A `91.99.65.229` (proxied, orange cloud)
|
||
- `*.hostwizard.lu` → A `91.99.65.229` (proxied, orange cloud)
|
||
|
||
With this in place, adding a new store subdomain only requires a database entry (via admin UI) — no DNS or Caddy changes needed.
|
||
|
||
#### Runbook: Add a Store Subdomain
|
||
|
||
When a merchant creates a store with subdomain `acme` on the OMS platform:
|
||
|
||
1. **Database**: Create the store via admin UI — set `subdomain = "acme"` and link to the platform. The `StoreContextMiddleware` will resolve `acme.omsflow.lu` automatically.
|
||
2. **DNS**: Already covered by the wildcard `*.omsflow.lu` record in Cloudflare.
|
||
3. **Caddy**: Already covered by the `*.omsflow.lu` block.
|
||
4. **SSL**: Already covered by the wildcard origin cert.
|
||
5. **Verify**: `curl -I https://acme.omsflow.lu/`
|
||
|
||
No infrastructure changes needed — it's fully self-service.
|
||
|
||
#### Runbook: Add a Custom Store Domain
|
||
|
||
When a premium store brings their own domain (e.g. `wizamart.com`), the domain must be added to **your** Cloudflare account. This ensures you control SSL, WAF, and DNS — critical since you are responsible for the infrastructure.
|
||
|
||
**Step 1: Add domain to Cloudflare**
|
||
|
||
1. In [Cloudflare Dashboard](https://dash.cloudflare.com), click **Add a site** > enter `wizamart.com`
|
||
2. Select the Free plan
|
||
3. Cloudflare assigns NS records — give these to the store owner to update at their registrar
|
||
4. Wait for NS propagation (can take up to 24h, usually minutes)
|
||
|
||
**Step 2: Configure DNS in Cloudflare**
|
||
|
||
Add A records (proxied, orange cloud):
|
||
|
||
| Type | Name | Content | Proxy |
|
||
|------|------|---------|-------|
|
||
| A | `wizamart.com` | `91.99.65.229` | Proxied |
|
||
| A | `www` | `91.99.65.229` | Proxied |
|
||
|
||
**Step 3: Generate origin certificate**
|
||
|
||
1. In Cloudflare: **SSL/TLS** > **Origin Server** > **Create Certificate**
|
||
2. Hostnames: `wizamart.com, www.wizamart.com`
|
||
3. Validity: 15 years
|
||
4. Download cert and key (key shown only once!)
|
||
|
||
**Step 4: Install cert on server**
|
||
|
||
```bash
|
||
sudo mkdir -p /etc/caddy/certs/wizamart.com
|
||
sudo nano /etc/caddy/certs/wizamart.com/cert.pem # paste certificate
|
||
sudo nano /etc/caddy/certs/wizamart.com/key.pem # paste private key
|
||
sudo chown -R caddy:caddy /etc/caddy/certs/wizamart.com
|
||
sudo chmod 600 /etc/caddy/certs/wizamart.com/key.pem
|
||
```
|
||
|
||
**Step 5: Add to Caddyfile**
|
||
|
||
```bash
|
||
sudo nano /etc/caddy/Caddyfile
|
||
```
|
||
|
||
Add:
|
||
|
||
```caddy
|
||
# ─── Custom store domain: wizamart.com ────────────────────────
|
||
wizamart.com {
|
||
tls /etc/caddy/certs/wizamart.com/cert.pem /etc/caddy/certs/wizamart.com/key.pem
|
||
reverse_proxy localhost:8001
|
||
}
|
||
|
||
www.wizamart.com {
|
||
tls /etc/caddy/certs/wizamart.com/cert.pem /etc/caddy/certs/wizamart.com/key.pem
|
||
redir https://wizamart.com{uri} permanent
|
||
}
|
||
```
|
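Because the two blocks differ only in the domain name, they can be generated instead of typed by hand. The function below is a minimal sketch; `render_custom_domain_block` is a hypothetical name, and the certificate paths assume the `/etc/caddy/certs/<domain>/` layout from Step 4.

```bash
# Hypothetical helper: print the Caddyfile blocks for one custom store domain.
# Usage: render_custom_domain_block DOMAIN
render_custom_domain_block() (
    domain=$1
    cert_dir=/etc/caddy/certs/$domain
    cat <<EOF
# Custom store domain: ${domain}
${domain} {
    tls ${cert_dir}/cert.pem ${cert_dir}/key.pem
    reverse_proxy localhost:8001
}

www.${domain} {
    tls ${cert_dir}/cert.pem ${cert_dir}/key.pem
    redir https://${domain}{uri} permanent
}
EOF
)
```

Appending the output with `render_custom_domain_block wizamart.com | sudo tee -a /etc/caddy/Caddyfile` and then running `caddy validate --config /etc/caddy/Caddyfile` before reloading would catch syntax errors early.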
||
|
||
**Step 6: Reload Caddy**
|
||
|
||
```bash
|
||
sudo systemctl reload caddy
|
||
sudo systemctl status caddy
|
||
```
|
||
|
||
**Step 7: Configure Cloudflare settings**
|
||
|
||
In Cloudflare dashboard for `wizamart.com`:
|
||
|
||
| Setting | Location | Value |
|
||
|---|---|---|
|
||
| SSL mode | SSL/TLS > Overview | Full (Strict) |
|
||
| Always Use HTTPS | SSL/TLS > Edge Certificates | On |
|
||
| Bot Fight Mode | Security > Settings | On |
|
||
|
||
**Step 8: Register domain in database**
|
||
|
||
Via admin UI: create a `StoreDomain` record linking `wizamart.com` to the store and platform.
|
||
|
||
**Step 9: Verify**
|
||
|
||
```bash
|
||
curl -I https://wizamart.com/
|
||
```
|
||
|
||
The `StoreContextMiddleware` will detect `wizamart.com` as a custom domain, look it up in the `store_domains` table, and route to the correct store.
|
||
|
||
#### Runbook: Add a New Platform Domain
|
||
|
||
When adding an entirely new platform (e.g. `newplatform.lu`):
|
||
|
||
**Step 1: Cloudflare**
|
||
|
||
1. Add `newplatform.lu` as a site in Cloudflare
|
||
2. Configure NS at registrar
|
||
3. Add DNS records (all proxied):
|
||
- `newplatform.lu` → A `91.99.65.229`
|
||
- `www.newplatform.lu` → A `91.99.65.229`
|
||
- `*.newplatform.lu` → A `91.99.65.229` (for store subdomains)
|
||
|
||
**Step 2: Generate origin certificate**
|
||
|
||
In Cloudflare: **SSL/TLS** > **Origin Server** > **Create Certificate**
|
||
|
||
Hostnames: `newplatform.lu, www.newplatform.lu, *.newplatform.lu`
|
||
|
||
**Step 3: Install cert and update Caddyfile**
|
||
|
||
```bash
|
||
sudo mkdir -p /etc/caddy/certs/newplatform.lu
|
||
sudo nano /etc/caddy/certs/newplatform.lu/cert.pem
|
||
sudo nano /etc/caddy/certs/newplatform.lu/key.pem
|
||
sudo chown -R caddy:caddy /etc/caddy/certs/newplatform.lu
|
||
sudo chmod 600 /etc/caddy/certs/newplatform.lu/key.pem
|
||
```
|
||
|
||
Add to Caddyfile:
|
||
|
||
```caddy
|
||
# ─── Platform: NewPlatform (newplatform.lu) ───────────────────
|
||
newplatform.lu {
|
||
tls /etc/caddy/certs/newplatform.lu/cert.pem /etc/caddy/certs/newplatform.lu/key.pem
|
||
reverse_proxy localhost:8001
|
||
}
|
||
|
||
www.newplatform.lu {
|
||
tls /etc/caddy/certs/newplatform.lu/cert.pem /etc/caddy/certs/newplatform.lu/key.pem
|
||
redir https://newplatform.lu{uri} permanent
|
||
}
|
||
|
||
*.newplatform.lu {
|
||
tls /etc/caddy/certs/newplatform.lu/cert.pem /etc/caddy/certs/newplatform.lu/key.pem
|
||
reverse_proxy localhost:8001
|
||
}
|
||
```
|
||
|
||
```bash
|
||
sudo systemctl reload caddy
|
||
```
|
||
|
||
**Step 4: Cloudflare settings**
|
||
|
||
Same as other platforms — see [Step 21.6](#216-cloudflare-settings-per-domain).
|
||
|
||
**Step 5: Database**
|
||
|
||
Create the platform record via `init_production.py` or admin UI with `domain = "newplatform.lu"`.
|
||
|
||
**Step 6: Verify**
|
||
|
||
```bash
|
||
curl -I https://newplatform.lu/
|
||
curl -I https://teststore.newplatform.lu/
|
||
```
|
||
|
||
#### Future: Scaling Custom Domains Beyond 50
|
||
|
||
The manual Caddyfile approach (one block per custom domain + Cloudflare origin cert) works well up to ~50 custom store domains. Beyond that, managing Caddyfile blocks and origin certs becomes tedious. Two scaling strategies:
|
||
|
||
##### Option 1: Caddy On-Demand TLS (simple, no Cloudflare protection)
|
||
|
||
A single catch-all Caddy block replaces all custom domain blocks. Caddy auto-provisions Let's Encrypt certs on first request and validates domains against the database:
|
||
|
||
```caddy
|
||
# Global options block: must sit at the very top of the Caddyfile
{
    on_demand_tls {
        ask http://localhost:8001/api/v1/internal/verify-domain
    }
}

https:// {
    tls {
        on_demand
    }
    reverse_proxy localhost:8001
}
|
||
```
|
||
|
||
The `/verify-domain` endpoint checks the `store_domains` table — returns 200 (provision cert) or 404 (reject). Adding a new custom domain becomes a database insert only, no Caddy or Cloudflare changes needed.
|
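Caddy calls the `ask` URL with the requested hostname in a `domain` query parameter and issues a certificate only if the response status is 2xx. The decision itself is a plain membership test, sketched here with a text file standing in for the `store_domains` table. `verify_domain_status` and the allowlist file are assumptions for illustration; the real endpoint lives in the application.

```bash
# Hypothetical stand-in for the verify-domain check.
# ALLOWLIST is a text file with one registered custom domain per line.
# Usage: verify_domain_status ALLOWLIST DOMAIN
verify_domain_status() (
    allowlist=$1
    domain=$2
    # -F: literal match, -x: whole line only, so "wizamart.co" does not
    # match an entry for "wizamart.com"
    if grep -Fxq -- "$domain" "$allowlist"; then
        echo 200   # 2xx: Caddy may obtain a certificate
    else
        echo 404   # anything else: Caddy refuses the handshake
    fi
)
```

Exact matching matters here: a prefix or substring match would let an attacker trigger certificate issuance for look-alike hostnames.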
||
|
||
**Limitation**: Custom domains point directly to the server (no Cloudflare proxy), so they lose DDoS protection, WAF, bot mitigation, and CDN caching. A DDoS attack on any custom domain impacts all sites on the server.
|
||
|
||
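Let's Encrypt rate limits (50 certificates per registered domain per week, 300 new orders per account per 3 hours) are unlikely to be an issue: each custom domain is its own registered domain, so the per-domain limit applies to each one separately. Caddy handles 5,000+ certs comfortably in memory (~50-100MB).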
|
||
|
||
##### Option 2: Cloudflare for SaaS (recommended for production scale)
|
||
|
||
[Cloudflare for SaaS](https://developers.cloudflare.com/cloudflare-for-platforms/cloudflare-for-saas/) (Custom Hostnames) is how Shopify, Webflow, and major SaaS platforms handle custom domains at scale. Every custom domain gets full Cloudflare protection:
|
||
|
||
1. Configure a **fallback origin** in your Cloudflare account (e.g., `customers.omsflow.lu → 91.99.65.229`)
|
||
2. Customer sets a CNAME: `wizamart.com → customers.omsflow.lu`
|
||
3. Cloudflare proxies `wizamart.com` through your account — DDoS, WAF, bot protection, CDN all included
|
||
4. SSL is automatic (no Let's Encrypt, no origin certs to manage)
|
||
5. Adding a domain = database insert + one Cloudflare API call (automatable)
|
||
|
||
**Cost**: Available on Cloudflare Pro ($20/month) with per-hostname pricing (~$0.10/month each at volume). At 5,000 domains × $0.10 = ~$500/month for full Cloudflare protection on every customer domain.
|
||
|
||
**Hybrid approach**: Platform domains (`*.omsflow.lu`, `rewardflow.lu`, etc.) stay on the current Cloudflare setup with origin certs. Only customer custom domains use Cloudflare for SaaS.
|
||
|
||
##### Infrastructure scaling at 1,000+ sites
|
||
|
||
At this scale, the 4GB Hetzner VPS becomes the bottleneck before Caddy does. Plan for:
|
||
|
||
- Horizontal scaling: multiple app servers behind a load balancer
|
||
- Dedicated PostgreSQL server
|
||
- Dedicated Redis instance
|
||
- CDN for static assets (Cloudflare, already in place)
|
||
|
||
## Step 15: Gitea Actions Runner
|
||
|
||
!!! warning "ARM64 architecture"
|
||
This server is ARM64. Download the `arm64` binary, not `amd64`.
|
||
|
||
Download and install:
|
||
|
||
```bash
|
||
mkdir -p ~/gitea-runner && cd ~/gitea-runner
|
||
|
||
# Download act_runner v0.2.13 (ARM64)
|
||
wget https://gitea.com/gitea/act_runner/releases/download/v0.2.13/act_runner-0.2.13-linux-arm64
|
||
chmod +x act_runner-0.2.13-linux-arm64
|
||
ln -s act_runner-0.2.13-linux-arm64 act_runner
|
||
```
|
||
|
||
Register the runner (get token from **Site Administration > Actions > Runners > Create new Runner**):
|
||
|
||
```bash
|
||
./act_runner register \
|
||
--instance https://git.wizard.lu \
|
||
--token YOUR_RUNNER_TOKEN
|
||
```
|
||
|
||
Accept the default runner name and labels when prompted.
|
||
|
||
Create a systemd service for persistent operation:
|
||
|
||
```bash
|
||
sudo nano /etc/systemd/system/gitea-runner.service
|
||
```
|
||
|
||
```ini
|
||
[Unit]
|
||
Description=Gitea Actions Runner
|
||
After=network.target
|
||
|
||
[Service]
|
||
Type=simple
|
||
User=samir
|
||
WorkingDirectory=/home/samir/gitea-runner
|
||
ExecStart=/home/samir/gitea-runner/act_runner daemon --config /home/samir/gitea-runner/config.yaml
|
||
Restart=always
|
||
RestartSec=10
|
||
|
||
[Install]
|
||
WantedBy=multi-user.target
|
||
```
|
||
|
||
Enable and start:
|
||
|
||
```bash
|
||
sudo systemctl daemon-reload
|
||
sudo systemctl enable --now gitea-runner
|
||
sudo systemctl status gitea-runner
|
||
```
|
||
|
||
Verify the runner shows as **Online** in Gitea: **Site Administration > Actions > Runners**.
|
||
|
||
### 15.1 Runner Configuration
|
||
|
||
Generate a config file to override defaults, notably `shutdown_timeout`, whose `0s` default kills running jobs as soon as the runner restarts. The 3h job timeout is left at its default.

```bash
cd ~/gitea-runner
./act_runner generate-config > config.yaml
# runner.timeout already defaults to 3h, so it is left unchanged
|
||
sed -i 's/shutdown_timeout: 0s/shutdown_timeout: 300s/' config.yaml
|
||
sudo systemctl restart gitea-runner
|
||
```
|
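`sed` exits successfully even when its pattern matches nothing, so it is worth confirming that the edit actually landed. The helper below is a small sketch for reading a value back out of `config.yaml`; `config_value` is a hypothetical name, and it assumes each setting sits on its own `key: value` line, as in the generated file.

```bash
# Hypothetical helper: print the value of the first "KEY: value" line.
# Usage: config_value FILE KEY
config_value() (
    file=$1
    key=$2
    # Anchoring on leading whitespace keeps "timeout" from matching
    # "shutdown_timeout" or "fetch_timeout".
    sed -n "s/^[[:space:]]*${key}:[[:space:]]*//p" "$file" | head -n 1
)
```

After the commands above, `config_value ~/gitea-runner/config.yaml shutdown_timeout` is expected to print `300s`; any other value means the `sed` pattern did not match the generated file.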
||
|
||
Key settings in `config.yaml`:
|
||
|
||
| Setting | Default | Recommended | Why |
|
||
|---|---|---|---|
|
||
| `runner.timeout` | 3h | 3h | 2,484 unit tests take ~2.5h on the CAX11 (2 vCPU ARM). Keep the default |
|
||
| `runner.shutdown_timeout` | 0s | 300s | Wait for running jobs to finish on restart — `0s` kills jobs immediately |
|
||
| `runner.fetch_timeout` | 5s | 5s | OK as-is |
|
||
|
||
!!! tip "CI also has per-job and per-test timeouts"
|
||
The `.gitea/workflows/ci.yml` sets `timeout-minutes: 150` on the pytest job and `--timeout=120` per individual test. These work together with the runner timeout to catch different failure modes.
|
||
|
||
### 15.2 Swap for CI Stability
|
||
|
||
The CI runner spins up Docker-in-Docker containers for each job. On a 4GB server that also runs the full app stack, this can exhaust available RAM, at which point the kernel's OOM killer can silently kill the pytest process. Adding 1GB of swap gives the kernel headroom and makes this far less likely.
|
||
|
||
!!! note "No extra cost"
|
||
Swap uses existing SSD disk space, not additional Hetzner resources.
|
||
|
||
```bash
|
||
sudo fallocate -l 1G /swapfile
|
||
sudo chmod 600 /swapfile
|
||
sudo mkswap /swapfile
|
||
sudo swapon /swapfile
|
||
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
|
||
|
||
# Verify
|
||
free -h
|
||
```
|
||
|
||
The output should show `Swap: 1.0Gi` in the `total` column.
|
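The same check can be scripted, for example in a provisioning script. This sketch reads `SwapTotal` from `/proc/meminfo`; `swap_total_kib` is a hypothetical helper, and the optional argument exists only so the parser can be pointed at a test file.

```bash
# Hypothetical helper: print the configured swap size in KiB.
# Usage: swap_total_kib [MEMINFO_FILE]   (defaults to /proc/meminfo)
swap_total_kib() (
    meminfo=${1:-/proc/meminfo}
    # Lines look like: "SwapTotal:       1048572 kB"
    awk '$1 == "SwapTotal:" { print $2 }' "$meminfo"
)
```

A guard such as `[ "$(swap_total_kib)" -ge 1000000 ] && echo "swap OK"` would confirm that roughly 1 GiB of swap is active.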
||
|
||
## Step 16: Continuous Deployment
|
||
|
||
Automate deployment on every successful push to master. The Gitea Actions runner and the app both run on the same server, so the deploy job SSHes from the CI Docker container to `172.17.0.1` (Docker bridge gateway — see note in 16.2).
|
||
|
||
```
|
||
push to master
|
||
├── ruff ──────┐
|
||
├── pytest ────┤
|
||
└── validate ──┤
|
||
└── deploy (SSH → scripts/deploy.sh)
|
||
├── git stash / pull / pop
|
||
├── docker compose up -d --build
|
||
├── alembic upgrade heads
|
||
└── health check (retries)
|
||
```
|
||
|
||
### 16.1 Generate Deploy SSH Key (on server)
|
||
|
||
```bash
|
||
ssh-keygen -t ed25519 -C "gitea-deploy@wizard.lu" -f ~/.ssh/deploy_ed25519 -N ""
|
||
cat ~/.ssh/deploy_ed25519.pub >> ~/.ssh/authorized_keys
|
||
```
|
||
|
||
### 16.2 Add Gitea Secrets
|
||
|
||
In **Repository Settings > Actions > Secrets**, add:
|
||
|
||
| Secret | Value |
|
||
|---|---|
|
||
| `DEPLOY_SSH_KEY` | Contents of `~/.ssh/deploy_ed25519` (private key) |
|
||
| `DEPLOY_HOST` | `172.17.0.1` (Docker bridge gateway — **not** `127.0.0.1`) |
|
||
| `DEPLOY_USER` | `samir` |
|
||
| `DEPLOY_PATH` | `/home/samir/apps/orion` |
|
||
|
||
!!! important "Why `172.17.0.1` and not `127.0.0.1`?"
|
||
CI jobs run inside Docker containers where `127.0.0.1` is the container, not the host. `172.17.0.1` is the Docker bridge gateway that routes to the host. Ensure the firewall allows SSH from the Docker bridge network: `sudo ufw allow from 172.17.0.0/16 to any port 22`. When Gitea and Orion are on separate servers, replace with the Orion server's IP.
|
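To see which gateway a job container actually uses, its routing table can be inspected from inside the container. The helper below is a sketch that pulls the default gateway out of `ip route` output; `default_gateway` is a hypothetical name, and it assumes the `ip` tool is available in the job image.

```bash
# Hypothetical helper: read `ip route` output on stdin and print the
# default gateway address.
# Usage: ip route | default_gateway
default_gateway() {
    # The relevant line looks like: "default via 172.17.0.1 dev eth0"
    awk '$1 == "default" && $2 == "via" { print $3; exit }'
}
```

Running `ip route | default_gateway` in a CI step should print `172.17.0.1` when the job container is attached to Docker's default bridge. A different address would suggest the runner places jobs on another Docker network, in which case the UFW rule above may need a wider source range.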
||
|
||
### 16.3 Deploy Script
|
||
|
||
The deploy script lives at `scripts/deploy.sh` in the repository. It:
|
||
|
||
1. Stashes local changes (preserves `.env`)
|
||
2. Pulls latest code (`--ff-only`)
|
||
3. Pops stash to restore local changes
|
||
4. Writes `.build-info` (commit SHA + deploy timestamp)
|
||
5. Rebuilds and restarts Docker containers (`docker compose -f docker-compose.yml --profile full up -d --build`)
|
||
6. Runs database migrations (`alembic upgrade heads`)
|
||
7. Health checks `http://localhost:8001/health` with 12 retries (60s total)
|
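The retry behaviour in step 7 can be pictured as a small loop. This is a sketch of the general pattern, not the contents of `scripts/deploy.sh`; the `retry` function name and the 5-second delay are assumptions chosen so that 12 attempts fit the 60-second budget.

```bash
# Hypothetical helper: run COMMAND until it succeeds or ATTEMPTS is used up.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARG...]
retry() (
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            exit 0          # command succeeded, stop retrying
        fi
        # Sleep between attempts, but not after the final one
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    exit 1                  # every attempt failed
)
```

With this in place, `retry 12 5 curl -fsS http://localhost:8001/health` would poll the health endpoint and return non-zero if all 12 attempts fail.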
||
|
||
!!! warning "Always use `-f docker-compose.yml` on the production server"
|
||
The explicit `-f` flag prevents `docker-compose.override.yml` (which exposes db/redis ports for local dev) from being loaded. This flag must never be removed from `deploy.sh`, and any manual `docker compose` commands on the server must also include it. See [Docker Deployment — Dev vs Prod](docker.md#dev-vs-prod-compose-architecture) for details.
|
||
|
||
Exit codes: `0` success, `1` git pull failed, `2` docker compose failed, `3` migration failed, `4` health check failed.
|
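When running the script by hand, a small lookup makes the exit code readable. `deploy_exit_reason` is a hypothetical convenience function; the messages simply restate the codes listed above.

```bash
# Hypothetical helper: translate a deploy.sh exit code into a message.
# Usage: deploy_exit_reason CODE
deploy_exit_reason() {
    case "$1" in
        0) echo "success" ;;
        1) echo "git pull failed" ;;
        2) echo "docker compose failed" ;;
        3) echo "migration failed" ;;
        4) echo "health check failed" ;;
        *) echo "unknown exit code $1" ;;
    esac
}
```

For instance, `bash scripts/deploy.sh; echo "deploy: $(deploy_exit_reason $?)"` prints the reason right after the script finishes.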
||
|
||
### 16.4 CI Workflow
|
||
|
||
The deploy job in `.gitea/workflows/ci.yml` runs only on master push, after `ruff`, `pytest`, and `validate` pass:
|
||
|
||
```yaml
|
||
deploy:
|
||
runs-on: ubuntu-latest
|
||
if: github.event_name == 'push' && github.ref == 'refs/heads/master'
|
||
needs: [ruff, pytest, validate]
|
||
steps:
|
||
- name: Deploy to production
|
||
uses: appleboy/ssh-action@v1
|
||
with:
|
||
host: ${{ secrets.DEPLOY_HOST }}
|
||
username: ${{ secrets.DEPLOY_USER }}
|
||
key: ${{ secrets.DEPLOY_SSH_KEY }}
|
||
port: 22
|
||
command_timeout: 10m
|
||
script: cd ${{ secrets.DEPLOY_PATH }} && bash scripts/deploy.sh
|
||
```
|
||
|
||
### 16.5 Manual Deploy
|
||
|
||
There are two manual paths; pick the one that matches the change you're shipping.
|
||
|
||
#### 16.5a — Code-only fix (default for ad-hoc manual deploys)
|
||
|
||
For frontend / template / api-only changes that don't touch the Dockerfile,
|
||
requirements.txt, docker-compose.yml, or alembic migrations. Rebuilds and
|
||
restarts **only** the api container — db, redis, celery-worker, celery-beat,
|
||
flower stay running.
|
||
|
||
```bash
|
||
cd ~/apps/orion && bash scripts/deploy-api-only.sh
|
||
```
|
||
|
||
What it does (`scripts/deploy-api-only.sh`):
|
||
|
||
1. Refuses if working tree is dirty (no silent stash → no risk of pop conflicts).
|
||
2. `git pull --ff-only`.
|
||
3. **Writes `.build-info`** — this is the critical step that ensures the
|
||
`?v=<commit-sha>` cache-bust query on every shared JS/CSS URL flips to the
|
||
new SHA. Without this, browsers happily keep serving the previous
|
||
deploy's cached assets even though the new code is in the image.
|
||
4. `docker compose -f docker-compose.yml --profile full up -d --build api`.
|
||
5. Health-check with a 30s budget (tight, since the DB/Redis weren't touched).
|
||
|
||
Exit codes: `0` success, `1` git pull / dirty tree, `2` docker build/up
|
||
failed, `3` health check failed.
|
||
|
||
#### 16.5b — Full deploy (use when CI is down)
|
||
|
||
Use this when you've also changed migrations, the Dockerfile,
|
||
requirements.txt, or docker-compose.yml itself — anything that needs the
|
||
full restart-everything + migrate cycle that the CI runs. Restarts EVERY
|
||
service in the `full` profile (db, redis, api, celery-worker, celery-beat,
|
||
flower) and runs `alembic upgrade heads`.
|
||
|
||
```bash
|
||
cd ~/apps/orion && bash scripts/deploy.sh
|
||
```
|
||
|
||
Heavier — brief DB downtime, Redis is blown away (sessions / rate-limit
|
||
counters / cached anything), in-flight Celery tasks killed — so don't use
|
||
it for code-only fixes.
|
||
|
||
### 16.6 Verify
|
||
|
||
```bash
|
||
# All app containers running
|
||
cd ~/apps/orion && docker compose -f docker-compose.yml --profile full ps
|
||
|
||
# API health (via Caddy with SSL)
|
||
curl https://api.wizard.lu/health
|
||
|
||
# Main platform
|
||
curl -I https://wizard.lu
|
||
|
||
# Gitea
|
||
curl -I https://git.wizard.lu
|
||
|
||
# Flower
|
||
curl -I https://flower.wizard.lu
|
||
|
||
# Gitea runner status
|
||
sudo systemctl status gitea-runner
|
||
```
|
||
|
||
## Step 17: Backups
|
||
|
||
Three layers of backup protection: Hetzner server snapshots, automated PostgreSQL dumps with local rotation, and offsite sync to Cloudflare R2.
|
||
|
||
### 17.1 Enable Hetzner Server Backups
|
||
|
||
In the Hetzner Cloud Console:
|
||
|
||
1. Go to **Servers** > select your server > **Backups**
|
||
2. Click **Enable backups** (~20% of server cost, ~1.20 EUR/mo for CAX11)
|
||
3. Hetzner then takes automatic daily backups and keeps the 7 most recent
|
||
|
||
This covers full-disk recovery (OS, Docker volumes, config files) but is coarse-grained. Database-level backups (below) give finer restore granularity.
|
||
|
||
### 17.2 Cloudflare R2 Setup (Offsite Backup Storage)
|
||
|
||
R2 provides S3-compatible object storage with a generous free tier (10 GB storage, 10 million reads/month).
|
||
|
||
**Create Cloudflare account and R2 bucket:**
|
||
|
||
1. Sign up at [cloudflare.com](https://dash.cloudflare.com/sign-up) (free account)
|
||
2. Go to **R2 Object Storage** > **Create bucket**
|
||
3. Name: `orion-backups`, region: automatic
|
||
4. Go to **R2** > **Manage R2 API Tokens** > **Create API token**
|
||
- Permissions: Object Read & Write
|
||
- Specify bucket: `orion-backups`
|
||
5. Note the **Account ID**, **Access Key ID**, and **Secret Access Key**
|
||
|
||
**Install and configure AWS CLI on the server:**
|
||
|
||
```bash
|
||
# awscli is not available via apt on Ubuntu 24.04; install via pip
|
||
sudo apt install -y python3-pip
|
||
pip3 install awscli --break-system-packages
|
||
|
||
# Add ~/.local/bin to PATH (pip installs binaries there)
|
||
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
|
||
source ~/.bashrc
|
||
|
||
aws configure --profile r2
|
||
# Access Key ID: <from step 5>
|
||
# Secret Access Key: <from step 5>
|
||
# Default region name: auto
|
||
# Default output format: json
|
||
```
|
||
|
||
**Test connectivity:**
|
||
|
||
```bash
|
||
aws s3 ls --endpoint-url https://<ACCOUNT_ID>.r2.cloudflarestorage.com --profile r2
|
||
```
|
||
|
||
Add the R2 backup bucket name to your production `.env`:
|
||
|
||
```bash
|
||
R2_BACKUP_BUCKET=orion-backups
|
||
```
|
||
|
||
### 17.3 Backup Script
|
||
|
||
The backup script at `scripts/backup.sh` handles:
|
||
|
||
- `pg_dump` of Orion DB (via `docker exec orion-db-1`)
|
||
- `pg_dump` of Gitea DB (via `docker exec gitea-db`)
|
||
- On Sundays: copies daily backup to `weekly/` subdirectory
|
||
- Rotation: keeps 7 daily, 4 weekly backups
|
||
- Optional `--upload` flag: syncs to Cloudflare R2
|
||
|
||
```bash
|
||
# Create backup directories
|
||
mkdir -p ~/backups/{orion,gitea}/{daily,weekly}
|
||
|
||
# Run a manual backup
|
||
bash ~/apps/orion/scripts/backup.sh
|
||
|
||
# Run with R2 upload
|
||
bash ~/apps/orion/scripts/backup.sh --upload
|
||
|
||
# Verify backup integrity
|
||
ls -lh ~/backups/orion/daily/
|
||
gunzip -t ~/backups/orion/daily/*.sql.gz
|
||
```
|
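The rotation rule (keep the newest N dumps, delete the rest) can be sketched as follows. This is an illustration of the idea, not the code in `scripts/backup.sh`; `prune_old_backups` is a hypothetical name, and it assumes timestamped file names such as `orion_20260214_030000.sql.gz` that sort chronologically.

```bash
# Hypothetical helper: keep the newest KEEP *.sql.gz files in DIR.
# Usage: prune_old_backups DIR KEEP
prune_old_backups() (
    dir=$1
    keep=$2
    cd "$dir" || exit 1
    # Newest first (timestamps sort lexicographically), then skip the
    # first KEEP entries and delete everything after them.
    ls -1 | grep '\.sql\.gz$' | sort -r | tail -n +"$((keep + 1))" |
    while IFS= read -r file; do
        rm -f -- "$file"
    done
)
```

Calling `prune_old_backups ~/backups/orion/daily 7` and `prune_old_backups ~/backups/orion/weekly 4` would match the retention described above.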
||
|
||
### 17.4 Systemd Timer (Daily at 03:00)
|
||
|
||
Create the service unit:
|
||
|
||
```bash
|
||
sudo nano /etc/systemd/system/orion-backup.service
|
||
```
|
||
|
||
```ini
|
||
[Unit]
|
||
Description=Orion database backup
|
||
After=docker.service
|
||
|
||
[Service]
|
||
Type=oneshot
|
||
User=samir
|
||
Environment="PATH=/home/samir/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
|
||
ExecStart=/usr/bin/bash /home/samir/apps/orion/scripts/backup.sh --upload
|
||
StandardOutput=journal
|
||
StandardError=journal
|
||
```
|
||
|
||
Create the timer:
|
||
|
||
```bash
|
||
sudo nano /etc/systemd/system/orion-backup.timer
|
||
```
|
||
|
||
```ini
|
||
[Unit]
|
||
Description=Run Orion backup daily at 03:00
|
||
|
||
[Timer]
|
||
OnCalendar=*-*-* 03:00:00
|
||
Persistent=true
|
||
|
||
[Install]
|
||
WantedBy=timers.target
|
||
```
|
||
|
||
Enable and start:
|
||
|
||
```bash
|
||
sudo systemctl daemon-reload
|
||
sudo systemctl enable --now orion-backup.timer
|
||
|
||
# Verify timer is active
|
||
systemctl list-timers orion-backup.timer
|
||
|
||
# Test manually
|
||
sudo systemctl start orion-backup.service
|
||
journalctl -u orion-backup.service --no-pager
|
||
```
|
||
|
||
### 17.5 Restore Procedure
|
||
|
||
The restore script at `scripts/restore.sh` handles the full restore cycle:
|
||
|
||
```bash
|
||
# Restore Orion database
|
||
bash ~/apps/orion/scripts/restore.sh orion ~/backups/orion/daily/orion_20260214_030000.sql.gz
|
||
|
||
# Restore Gitea database
|
||
bash ~/apps/orion/scripts/restore.sh gitea ~/backups/gitea/daily/gitea_20260214_030000.sql.gz
|
||
```
|
||
|
||
The script will:
|
||
|
||
1. Stop app containers (keep DB running)
|
||
2. Drop and recreate the database
|
||
3. Restore from the `.sql.gz` backup
|
||
4. Run Alembic migrations (Orion only)
|
||
5. Restart all containers
|
||
|
||
To restore from R2 (if local backups are lost):
|
||
|
||
```bash
|
||
# Download from R2
|
||
aws s3 sync s3://orion-backups/ ~/backups/ \
|
||
--endpoint-url https://<ACCOUNT_ID>.r2.cloudflarestorage.com \
|
||
--profile r2
|
||
|
||
# Then restore as usual
|
||
bash ~/apps/orion/scripts/restore.sh orion ~/backups/orion/daily/<latest>.sql.gz
|
||
```
|
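The `<latest>` placeholder can be resolved by sorting the timestamped file names. `latest_backup` below is a hypothetical helper that assumes the same `*_YYYYMMDD_HHMMSS.sql.gz` naming used in the restore examples above.

```bash
# Hypothetical helper: print the path of the newest *.sql.gz file in DIR.
# Returns non-zero if the directory holds no backups.
# Usage: latest_backup DIR
latest_backup() (
    dir=$1
    # Timestamped names sort chronologically, so the last one is the newest
    name=$(ls -1 "$dir" | grep '\.sql\.gz$' | sort | tail -n 1)
    [ -n "$name" ] || exit 1
    printf '%s/%s\n' "$dir" "$name"
)
```

The restore call then becomes `bash ~/apps/orion/scripts/restore.sh orion "$(latest_backup ~/backups/orion/daily)"`.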
||
|
||
### 17.6 Verification
|
||
|
||
```bash
|
||
# Backup files exist
|
||
ls -lh ~/backups/orion/daily/
|
||
ls -lh ~/backups/gitea/daily/
|
||
|
||
# Backup integrity
|
||
gunzip -t ~/backups/orion/daily/*.sql.gz
|
||
|
||
# Timer is scheduled
|
||
systemctl list-timers orion-backup.timer
|
||
|
||
# R2 sync (if configured)
|
||
aws s3 ls s3://orion-backups/ --endpoint-url https://<ACCOUNT_ID>.r2.cloudflarestorage.com --profile r2 --recursive
|
||
```
|
||
|
||
---
|
||
|
||
## Step 18: Monitoring & Observability
|
||
|
||
Prometheus + Grafana monitoring stack with host and container metrics.
|
||
|
||
### Architecture
|
||
|
||
```
|
||
┌──────────────┐ scrape ┌─────────────────┐
|
||
│ Prometheus │◄────────────────│ Orion API │ /metrics
|
||
│ :9090 │◄────────────────│ node-exporter │ :9100
|
||
│ │◄────────────────│ cAdvisor │ :8080
|
||
└──────┬───────┘ └─────────────────┘
|
||
│ query
|
||
┌──────▼───────┐
|
||
│ Grafana │──── https://grafana.wizard.lu
|
||
│ :3001 │
|
||
└──────────────┘
|
||
```
|
||
|
||
### Resource Budget (4 GB Server)
|
||
|
||
| Container | RAM Limit | Purpose |
|
||
|---|---|---|
|
||
| prometheus | 256 MB | Metrics storage (15-day retention, 2 GB max) |
|
||
| grafana | 192 MB | Dashboards (SQLite backend) |
|
||
| node-exporter | 64 MB | Host CPU/RAM/disk metrics |
|
||
| cadvisor | 192 MB | Per-container resource metrics |
|
||
| redis-exporter | 32 MB | Redis memory, connections, command stats |
|
||
| **Total new** | **736 MB** | |
|
||
|
||
Existing stack ~1.8 GB + 736 MB new = ~2.5 GB. Leaves ~1.5 GB for OS. If too tight, rescale to CAX21 (4 vCPU / 8 GB, ~7.99 EUR/mo) — note this requires a brief **power-off** (it is not a live resize); see [Rescaling / Upgrading the Server](#rescaling-upgrading-the-server-cpu-ram) for the full procedure and the Arm-capacity caveat.
|
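The 736 MB total is the sum of the five limits in the table, which is easy to re-check whenever a limit changes. `sum_mb` is a hypothetical helper used only for this arithmetic.

```bash
# Hypothetical helper: add up a list of memory limits given in MB.
# Usage: sum_mb MB...
sum_mb() (
    total=0
    for mb in "$@"; do
        total=$((total + mb))
    done
    echo "$total"
)
```

`sum_mb 256 192 64 192 32` prints `736`; adding the roughly 1800 MB used by the existing stack gives about 2536 MB, which is the ~2.5 GB quoted above.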
||
|
||
### 18.1 DNS Record
|
||
|
||
Add A and AAAA records for `grafana.wizard.lu`:
|
||
|
||
| Type | Name | Value | TTL |
|
||
|---|---|---|---|
|
||
| A | `grafana` | `91.99.65.229` | 300 |
|
||
| AAAA | `grafana` | `2a01:4f8:1c1a:b39c::1` | 300 |
|
||
|
||
### 18.2 Caddy Configuration
|
||
|
||
Add to `/etc/caddy/Caddyfile`:
|
||
|
||
```caddy
|
||
grafana.wizard.lu {
|
||
reverse_proxy localhost:3001
|
||
}
|
||
```
|
||
|
||
Reload Caddy:
|
||
|
||
```bash
|
||
sudo systemctl reload caddy
|
||
```
|
||
|
||
### 18.3 Production Environment
|
||
|
||
Add to `~/apps/orion/.env`:
|
||
|
||
```bash
|
||
ENABLE_METRICS=true
|
||
GRAFANA_URL=https://grafana.wizard.lu
|
||
GRAFANA_ADMIN_USER=admin
|
||
GRAFANA_ADMIN_PASSWORD=<strong-password>
|
||
```
|
||
|
||
### 18.4 Deploy
|
||
|
||
```bash
|
||
cd ~/apps/orion
|
||
docker compose -f docker-compose.yml --profile full up -d --build
|
||
```
|
||
|
||
Verify all containers are running:
|
||
|
||
```bash
|
||
docker compose -f docker-compose.yml --profile full ps
|
||
docker stats --no-stream
|
||
```
|
||
|
||
### 18.5 Grafana First Login
|
||
|
||
1. Open `https://grafana.wizard.lu`
|
||
2. Log in with `admin` / `<password from .env>`
|
||
3. Change the default password when prompted
|
||
|
||
**Import community dashboards:**
|
||
|
||
- **Node Exporter Full**: Dashboards > Import > ID `1860` > Select Prometheus datasource
|
||
- **Docker / cAdvisor**: Dashboards > Import > ID `193` > Select Prometheus datasource
|
||
|
||
### 18.6 Verification
|
||
|
||
```bash
|
||
# Prometheus metrics from Orion API
|
||
curl -s https://api.wizard.lu/metrics | head -5
|
||
|
||
# Health endpoints
|
||
curl -s https://api.wizard.lu/health/live
|
||
curl -s https://api.wizard.lu/health/ready
|
||
|
||
# Prometheus targets (all should be "up")
|
||
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep health
|
||
|
||
# Grafana accessible
|
||
curl -I https://grafana.wizard.lu
|
||
|
||
# RAM usage within limits
|
||
docker stats --no-stream
|
||
```
|
||
|
||
---
|
||
|
||
## Step 19: Prometheus Alerting
|
||
|
||
Alert rules and Alertmanager for email notifications when things go wrong.
|
||
|
||
### 19.1 Architecture
|
||
|
||
```
|
||
┌──────────────┐ evaluates ┌───────────────────┐
|
||
│ Prometheus │─────────────►│ alert.rules.yml │
|
||
│ :9090 │ │ (host, container, │
|
||
│ │ │ API, Celery) │
|
||
└──────┬───────┘ └───────────────────┘
|
||
│ fires alerts
|
||
┌──────▼───────┐
|
||
│ Alertmanager │──── email ──► admin@wizard.lu
|
||
│ :9093 │
|
||
└──────────────┘
|
||
```
|
||
|
||
### 19.2 Alert Rules
|
||
|
||
Alert rules are defined in `monitoring/prometheus/alert.rules.yml`:
|
||
|
||
| Group | Alert | Condition | Severity |
|
||
|---|---|---|---|
|
||
| Host | HostHighCpuUsage | CPU >80% for 5m | warning |
|
||
| Host | HostHighMemoryUsage | Memory >85% for 5m | warning |
|
||
| Host | HostHighDiskUsage | Disk >80% | warning |
|
||
| Host | HostDiskFullPrediction | Disk full within 4h | critical |
|
||
| Containers | ContainerHighRestartCount | >3 restarts/hour | critical |
|
||
| Containers | ContainerOomKilled | Any OOM kill | critical |
|
||
| Containers | ContainerHighCpu | >80% CPU for 5m | warning |
|
||
| API | ApiHighErrorRate | 5xx rate >1% for 5m | critical |
|
||
| API | ApiHighLatency | P95 >2s for 5m | warning |
|
||
| API | ApiHealthCheckDown | Health check failing 1m | critical |
|
||
| Celery | CeleryQueueBacklog | >100 tasks for 10m | warning |
|
||
| Prometheus | TargetDown | Any target down 2m | critical |
|
||
### 19.3 Alertmanager Configuration

Alertmanager config is in `monitoring/alertmanager/alertmanager.yml`:

- **Critical alerts**: repeat every 1 hour
- **Warning alerts**: repeat every 4 hours
- **Grouping**: by `alertname` + `severity`, with a 30s group wait and a 5m group interval
- **Inhibition**: a warning is suppressed while a critical alert is already firing for the same alert

!!! warning "Configure SMTP before deploying"
    Edit `monitoring/alertmanager/alertmanager.yml` and fill in the SMTP settings (host, username, password, recipient email). Alertmanager will start but won't send emails until SMTP is configured.

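
The inhibition and repeat settings can be modelled in a few lines. This is a sketch only: it assumes inhibition matches on an identical `alertname` label, and the function names and sample alert names are hypothetical. Alertmanager applies the real logic from its `inhibit_rules` and `route` configuration.

```python
# Repeat intervals from the list above, in seconds.
REPEAT_INTERVAL = {"critical": 1 * 3600, "warning": 4 * 3600}


def notifiable(alerts: list[dict]) -> list[dict]:
    """Drop warnings whose alertname also has a firing critical alert."""
    critical_names = {a["alertname"] for a in alerts if a["severity"] == "critical"}
    return [
        a for a in alerts
        if not (a["severity"] == "warning" and a["alertname"] in critical_names)
    ]


def should_repeat(severity: str, seconds_since_last_notification: float) -> bool:
    """True once the repeat interval for this severity has elapsed."""
    return seconds_since_last_notification >= REPEAT_INTERVAL[severity]


if __name__ == "__main__":
    firing = [
        {"alertname": "DiskUsage", "severity": "warning"},
        {"alertname": "DiskUsage", "severity": "critical"},
        {"alertname": "ApiHighLatency", "severity": "warning"},
    ]
    for alert in notifiable(firing):
        print(alert["alertname"], alert["severity"])
```
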
### 19.4 Docker Compose Changes

The `docker-compose.yml` includes:

- `alertmanager` service: `prom/alertmanager:latest`, profiles: [full], port 127.0.0.1:9093, mem_limit: 32m
- `prometheus` volumes: mounts `alert.rules.yml` as read-only
- `prometheus.yml`: `alerting:` section pointing to alertmanager:9093, `rule_files:` for alert rules, new scrape job for alertmanager

### 19.5 Alertmanager SMTP Setup (SendGrid)

Alertmanager needs SMTP to send email notifications. SendGrid handles both transactional emails and marketing campaigns under one account — set it up once and use it for everything.

**Free trial**: 100 emails/day for 60 days. Covers alerting + transactional emails through launch. After 60 days, upgrade to a paid plan (Essentials starts at ~$20/mo for 50K emails/mo).

!!! info "Live prod uses mail1.myservices.hosting:465, not SendGrid"
    The current prod env migrated away from SendGrid to the mailbox-hosting provider's SMTP relay (`mail1.myservices.hosting`) earlier in 2026. Both the app's `/admin/settings` SMTP block and `monitoring/alertmanager/alertmanager.yml` point at it. The SendGrid steps in this section are kept as a working reference for greenfield deploys; if you're rehydrating the existing prod, use the mailbox-hosting setup instead.

    Quick summary of the live alertmanager SMTP block (don't commit the real password — `alertmanager.yml` is gitignored, only `.example` ships in repo):

    ```yaml
    global:
      smtp_smarthost: 'mail1.myservices.hosting:465'  # implicit TLS, not 587
      smtp_from: 'alerts@wizard.lu'
      smtp_auth_username: 'support@wizard.lu'
      smtp_auth_password: '<from /admin/settings SMTP block>'
      smtp_require_tls: true
    ```

    Two prerequisites for this to work:

    1. **Hetzner outbound TCP 465 must be unblocked** (see warning in Step 4 — Cloud Servers block 25 and 465 by default; submit a one-paragraph ticket to lift it, auto-approved in minutes).
    2. **Port 465 = implicit TLS** (TLS-on-connect, not STARTTLS). Alertmanager's email integration handles this natively when the smarthost port is `465`; you only need `smtp_require_tls: true`, no extra `smtp_tls_config` block.

    Verification with swaks (the trailing `sed` redacts the base64-encoded credential from the transcript, whether it is sent on its own line or after `AUTH PLAIN`):

    ```bash
    swaks --to admin@wizard.lu \
      --from alerts@wizard.lu \
      --server mail1.myservices.hosting:465 \
      --auth PLAIN \
      --auth-user support@wizard.lu \
      --tls-on-connect \
      --header "Subject: smoke test" \
      2>&1 | sed -E 's/^( ~> (AUTH PLAIN )?)[A-Za-z0-9+\/=]{12,}$/\1[REDACTED]/'
    ```

    Expected: `235 Authentication successful` then `250 2.0.0 Ok: queued`. If you see `535 Authentication failed: The provided authorization grant is invalid, expired, or revoked` on port **587**, that's the provider's PLAIN backend being OAuth-wired — switch to port 465 instead, which routes through the password backend.

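
The difference between the two ports is when TLS starts. A minimal sketch with Python's standard `smtplib`, assuming only the two well-known submission ports matter here; the helper names `smtp_mode` and `open_smtp` are hypothetical, and `open_smtp` needs outbound network access to do anything useful.

```python
import smtplib
import ssl


def smtp_mode(port: int) -> str:
    """Map the well-known submission ports to their TLS style."""
    if port == 465:
        return "implicit-tls"  # TLS handshake happens before any SMTP command
    if port == 587:
        return "starttls"      # plain connect, then upgrade with STARTTLS
    return "unknown"


def open_smtp(host: str, port: int, timeout: float = 10.0) -> smtplib.SMTP:
    """Open a connection using the TLS style that matches the port."""
    context = ssl.create_default_context()
    mode = smtp_mode(port)
    if mode == "implicit-tls":
        return smtplib.SMTP_SSL(host, port, timeout=timeout, context=context)
    client = smtplib.SMTP(host, port, timeout=timeout)
    if mode == "starttls":
        client.starttls(context=context)
    # For any other port the connection is returned unencrypted.
    return client


if __name__ == "__main__":
    for port in (465, 587):
        print(port, smtp_mode(port))
```
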
**1. Create SendGrid account:**

1. Sign up at [sendgrid.com](https://sendgrid.com/) (free plan)
2. Complete **Sender Authentication**: go to **Settings** > **Sender Authentication** > **Domain Authentication**
3. Authenticate your sending domain (`wizard.lu`) — SendGrid provides CNAME records to add to DNS
4. Create an API key: **Settings** > **API Keys** > **Create API Key** (Full Access)
5. Save the API key — you'll need it for both Alertmanager and the app's `EMAIL_PROVIDER`

!!! info "SendGrid SMTP credentials"
    SendGrid uses a single credential pattern for SMTP:

    - **Server**: `smtp.sendgrid.net`
    - **Port**: `587` (STARTTLS)
    - **Username**: literally the string `apikey` (not your email)
    - **Password**: your API key (starts with `SG.`)

**2. Update alertmanager config on the server:**

```bash
nano ~/apps/orion/monitoring/alertmanager/alertmanager.yml
```

Replace the SMTP placeholders:

```yaml
global:
  smtp_smarthost: 'smtp.sendgrid.net:587'
  smtp_from: 'alerts@wizard.lu'
  smtp_auth_username: 'apikey'
  smtp_auth_password: 'SG.your-sendgrid-api-key-here'
  smtp_require_tls: true
```

Update the `to:` addresses in both receivers to your actual email.

**3. Update app email config** in `~/apps/orion/.env`:

```bash
# SendGrid for all application emails (password reset, order confirmation, etc.)
EMAIL_PROVIDER=sendgrid
SENDGRID_API_KEY=SG.your-sendgrid-api-key-here
EMAIL_FROM_ADDRESS=noreply@wizard.lu
EMAIL_FROM_NAME=Orion
```

**4. Restart services:**

```bash
cd ~/apps/orion
docker compose --profile full restart alertmanager api
curl -s http://localhost:9093/-/healthy  # Should return OK
```

`docker compose restart` reuses the existing containers, so it reloads the mounted `alertmanager.yml` but does not apply changed environment variables. If the new `EMAIL_*` settings are not picked up, recreate the container with `docker compose --profile full up -d --no-deps --force-recreate api`.

**5. Test by triggering a test alert (optional):**

```bash
# Send a test alert to alertmanager (v2 API)
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels":{"alertname":"TestAlert","severity":"warning"},"annotations":{"summary":"Test alert - please ignore"},"startsAt":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","endsAt":"'"$(date -u -d '+5 minutes' +%Y-%m-%dT%H:%M:%SZ)"'"}]'
```

Check your inbox within 30 seconds. Then list the alerts Alertmanager currently holds; the test alert is listed until its `endsAt` time (5 minutes after sending) has passed:

```bash
curl -s http://localhost:9093/api/v2/alerts | python3 -m json.tool
```

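
The one-line `curl` payload is easier to read when built programmatically. The sketch below constructs the same JSON body with hypothetical helper names (`build_test_alert`, `post_alert`). It assumes the Alertmanager v2 endpoint used above, and `post_alert` only works on the server where port 9093 is reachable.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

ALERTMANAGER_URL = "http://localhost:9093/api/v2/alerts"  # same endpoint as the curl call
TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def build_test_alert(now: datetime, minutes: int = 5) -> list[dict]:
    """Build the test alert payload; the alert expires `minutes` after `now` (UTC)."""
    return [{
        "labels": {"alertname": "TestAlert", "severity": "warning"},
        "annotations": {"summary": "Test alert - please ignore"},
        "startsAt": now.strftime(TIME_FORMAT),
        "endsAt": (now + timedelta(minutes=minutes)).strftime(TIME_FORMAT),
    }]


def post_alert(payload: list[dict], url: str = ALERTMANAGER_URL) -> int:
    """POST the payload to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as resp:
        return resp.status


if __name__ == "__main__":
    # Print the payload only; call post_alert(...) on the server to send it.
    print(json.dumps(build_test_alert(datetime.now(timezone.utc)), indent=2))
```
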
!!! tip "Alternative SMTP providers"
    Any SMTP service works if you prefer a different provider:

    - **Amazon SES**: `email-smtp.eu-west-1.amazonaws.com:587` — cheapest at scale ($0.10/1K emails)
    - **Mailgun**: `smtp.mailgun.org:587` — transactional only, no built-in marketing
    - **Gmail**: `smtp.gmail.com:587` with an App Password (not recommended for production)

### 19.6 Deploy

```bash
cd ~/apps/orion
docker compose --profile full up -d
```

### 19.7 Verification

```bash
# Alertmanager healthy
curl -s http://localhost:9093/-/healthy

# Alert rules loaded
curl -s http://localhost:9090/api/v1/rules | python3 -m json.tool | head -20

# Active alerts (should be empty if all is well)
curl -s http://localhost:9090/api/v1/alerts | python3 -m json.tool

# Alertmanager target in Prometheus
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep alertmanager
```

### 19.8 Multi-Domain Email Strategy

SendGrid supports multiple authenticated domains on a single account. This enables sending emails from client domains (e.g., `orders@acme.lu`) without clients needing their own SendGrid plan.

**Current setup:**

- `wizard.lu` authenticated — used for platform emails (`alerts@`, `noreply@`)

**Future: client domain onboarding**

When a client wants emails sent from their domain (e.g., `acme.lu`):

1. In SendGrid: **Settings** > **Sender Authentication** > **Authenticate a Domain** → add `acme.lu`
2. SendGrid provides CNAME + TXT records
3. Client adds the DNS records to their domain
4. Verify in SendGrid

This is the professional approach — emails come from the client's domain with proper SPF/DKIM, not from `wizard.lu`. Build an admin flow to automate this as part of store onboarding.

!!! note "Volume planning"
    The free trial allows 100 emails/day. Once clients start sending marketing campaigns, upgrade to a paid SendGrid plan based on total volume across all client domains.


---

## Step 19b: Sentry Error Tracking

Application-level error tracking with [Sentry](https://sentry.io). While Prometheus monitors infrastructure metrics (CPU, memory, HTTP error rates), Sentry captures **individual exceptions** with full stack traces, request context, and breadcrumbs — making it possible to debug production errors without SSH access.

!!! info "How Sentry fits into the monitoring stack"
    ```
    ┌──────────────────────────────────────────────────────────────┐
    │                     Observability Stack                      │
    ├──────────────────┬──────────────────┬────────────────────────┤
    │ Prometheus       │ Grafana          │ Sentry                 │
    │ Infrastructure   │ Dashboards       │ Application errors     │
    │ metrics & alerts │ & visualization  │ & performance traces   │
    ├──────────────────┴──────────────────┴────────────────────────┤
    │ Prometheus: "API 5xx rate is 3%"                             │
    │ Sentry:     "TypeError in /api/v1/orders/checkout line 42    │
    │              request_id=abc123, user_id=7, store=acme"       │
    └──────────────────────────────────────────────────────────────┘
    ```

### What's Already Wired

The codebase already initializes Sentry in two places — you just need to provide the DSN:

| Component | File | Integrations |
|---|---|---|
| FastAPI (API server) | `main.py:42-58` | `FastApiIntegration`, `SqlalchemyIntegration` |
| Celery (background workers) | `app/core/celery_config.py:31-39` | `CeleryIntegration` |

Both read from the same `SENTRY_DSN` environment variable. When unset, Sentry is silently skipped.

### 19b.1 Create Sentry Project

1. Sign up at [sentry.io](https://sentry.io) (free Developer plan: **5K errors/month**, 1 user)
2. Create a new project:
    - **Platform**: Python → FastAPI
    - **Project name**: `orion` (or `rewardflow`)
    - **Team**: default
3. Copy the **DSN** from the project settings — it looks like:

    ```
    https://abc123def456@o123456.ingest.de.sentry.io/7891011
    ```

!!! tip "Sentry pricing"
    | Plan | Errors/month | Cost | Notes |
    |---|---|---|---|
    | **Developer** (free) | 5,000 | $0 | 1 user, 30-day retention |
    | **Team** | 50,000 | $26/mo | Unlimited users, 90-day retention |
    | **Business** | 50,000 | $80/mo | SSO, audit logs, 90-day retention |

    The free plan is sufficient for launch. Upgrade to Team if you exceed 5K errors/month or need multiple team members.

### 19b.2 Configure Environment

Add to `~/apps/orion/.env` on the server:

```bash
# Sentry Error Tracking
SENTRY_DSN=https://your-key@o123456.ingest.de.sentry.io/your-project-id
SENTRY_ENVIRONMENT=production
SENTRY_TRACES_SAMPLE_RATE=0.1
```

| Variable | Default | Description |
|---|---|---|
| `SENTRY_DSN` | `None` (disabled) | Project DSN from Sentry dashboard |
| `SENTRY_ENVIRONMENT` | `development` | Tags errors by environment (`production`, `staging`) |
| `SENTRY_TRACES_SAMPLE_RATE` | `0.1` | Fraction of requests traced for performance (0.1 = 10%) |

!!! warning "Traces sample rate"
    `0.1` (10%) is a good starting point. At high traffic, lower to `0.01` (1%) to stay within the free plan's span limits. For initial launch with low traffic, you can temporarily set `1.0` (100%) for full visibility.

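
To pick a sample rate, it helps to estimate how many requests would actually be traced. The sketch below is simple arithmetic with hypothetical helper names and made-up traffic figures; it assumes the variable name and the `0.1` default from the table above and says nothing about how Sentry counts spans against a plan.

```python
def rate_from_env(env: dict[str, str]) -> float:
    """Read SENTRY_TRACES_SAMPLE_RATE, falling back to the documented default of 0.1."""
    rate = float(env.get("SENTRY_TRACES_SAMPLE_RATE", "0.1"))
    if not 0.0 <= rate <= 1.0:
        raise ValueError("SENTRY_TRACES_SAMPLE_RATE must be between 0.0 and 1.0")
    return rate


def traced_requests(requests_per_day: int, sample_rate: float) -> int:
    """Expected number of traced requests per day at the given sample rate."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    return round(requests_per_day * sample_rate)


if __name__ == "__main__":
    # Illustrative traffic levels, not measurements from this deployment.
    for daily_requests, rate in ((1_000, 1.0), (20_000, 0.1), (50_000, 0.01)):
        print(daily_requests, rate, traced_requests(daily_requests, rate))
```
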
### 19b.3 Deploy

Restart the API and Celery containers to pick up the new env vars:

```bash
cd ~/apps/orion
docker compose --profile full restart api celery-worker celery-beat
```

Note that `docker compose restart` reuses the existing containers and does not apply changed environment variables. If the log line below does not appear, recreate the containers instead with `docker compose --profile full up -d --no-deps --force-recreate api celery-worker celery-beat`.

Check the API logs to confirm Sentry initialized:

```bash
docker compose --profile full logs api --tail 20 | grep -i sentry
```

You should see:

```
Sentry initialized for environment: production
```

### 19b.4 Verify

**1. Trigger a test error** by hitting the API with a request that will fail:

```bash
curl -s https://api.wizard.lu/api/v1/nonexistent-endpoint-sentry-test
```

**2. Check Sentry dashboard:**

- Go to [sentry.io](https://sentry.io) → your project → **Issues**
- You should see a `404 Not Found` or similar error appear within seconds
- Click into it to see the full stack trace, request headers, and breadcrumbs

By default, Sentry's FastAPI integration reports unhandled exceptions and 5xx responses as issues, so a plain 404 may not show up under **Issues**. If nothing appears, trigger a request that raises an unhandled server-side exception instead.

**3. Verify Celery integration** — check that the Celery worker also reports to Sentry:

```bash
docker compose --profile full logs celery-worker --tail 10 | grep -i sentry
```

### 19b.5 Sentry Features to Configure

After verifying the basic setup, configure these in the Sentry web UI:

**Alerts (Sentry → Alerts → Create Alert):**

| Alert | Condition | Action |
|---|---|---|
| New issue spike | >10 events in 1 hour | Email notification |
| First seen error | Any new issue | Email notification |
| Unresolved high-volume | >50 events in 24h | Email notification |

**Release tracking** — Sentry automatically tags errors with the release version via `release=f"orion@{settings.version}"` in `main.py`. This lets you see which deploy introduced a bug.

**Frontend errors** (optional, post-launch) — if you want JS errors from the admin frontend, add the Sentry browser SDK to your base template. Not needed for launch since most errors will be server-side.

### 19b.6 What Sentry Captures

With the current integration, Sentry automatically captures:

| Data | Source | Example |
|---|---|---|
| Python exceptions | FastAPI + Celery | `TypeError`, `ValidationError`, unhandled 500s |
| Request context | `FastApiIntegration` | URL, method, headers, query params, user IP |
| DB query breadcrumbs | `SqlalchemyIntegration` | SQL queries leading up to the error |
| Celery task failures | `CeleryIntegration` | Task name, args, retry count, worker hostname |
| User info | `send_default_pii=True` | User email and IP (if authenticated) |
| Performance traces | `traces_sample_rate` | End-to-end request timing, DB query duration |

!!! note "Privacy"
    `send_default_pii=True` is set in both `main.py` and `celery_config.py`. This sends user emails and IP addresses to Sentry for debugging context. If GDPR compliance requires stricter data handling, set this to `False` and configure Sentry's [Data Scrubbing](https://docs.sentry.io/security-legal-pii/scrubbing/) rules.


---

## Step 19c: Redis Monitoring (Redis Exporter)

Add direct Redis monitoring to Prometheus. Without this, Redis can die silently — Celery tasks stop processing and emails stop sending, but no alert fires.

### Why Not Just cAdvisor?

cAdvisor tells you "the Redis container is running." The Redis exporter tells you "Redis is running, responding to commands, using 45MB memory, has 3 clients connected, and command latency is 0.2ms." It also catches scenarios where the container is running but Redis itself is unhealthy (maxmemory reached, connection limit hit).

### Resource Impact

| Container | RAM | CPU | Image Size |
|---|---|---|---|
| redis-exporter | ~5 MB | negligible | ~10 MB |

### 19c.1 Docker Compose

The `redis-exporter` service has been added to `docker-compose.yml`:

```yaml
  redis-exporter:
    image: oliver006/redis_exporter:latest
    restart: always
    profiles:
      - full
    ports:
      - "127.0.0.1:9121:9121"
    environment:
      REDIS_ADDR: redis://redis:6379
    depends_on:
      redis:
        condition: service_healthy
    mem_limit: 32m
    networks:
      - backend
      - monitoring
```

It joins both `backend` (to reach Redis) and `monitoring` (so Prometheus can scrape it).

### 19c.2 Prometheus Scrape Target

Added to `monitoring/prometheus.yml`:

```yaml
  - job_name: "redis"
    static_configs:
      - targets: ["redis-exporter:9121"]
        labels:
          service: "redis"
```

### 19c.3 Alert Rules

Four Redis-specific alerts added to `monitoring/prometheus/alert.rules.yml`:

| Alert | Condition | Severity | What It Means |
|---|---|---|---|
| `RedisDown` | `redis_up == 0` for 1m | critical | Redis is unreachable — all background tasks stalled |
| `RedisHighMemoryUsage` | >80% of maxmemory for 5m | warning | Queue backlog or memory leak |
| `RedisHighConnectionCount` | >50 clients for 5m | warning | Possible connection leak |
| `RedisRejectedConnections` | Any rejected in 5m | critical | Redis is refusing new connections |

### 19c.4 Deploy

```bash
cd ~/apps/orion
git pull
docker compose --profile full up -d
```

Verify the exporter is running and Prometheus can scrape it:

```bash
# Exporter health
curl -s http://localhost:9121/health

# Redis metrics flowing
curl -s http://localhost:9121/metrics | grep redis_up

# Prometheus target status (should show "redis" as UP)
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep -A2 '"redis"'
```

### 19c.5 Grafana Dashboard

Import the community Redis dashboard:

1. Open `https://grafana.wizard.lu`
2. **Dashboards** → **Import** → ID `763` → Select Prometheus datasource
3. You'll see: memory usage, connected clients, commands/sec, hit rate, key count

### 19c.6 Verification

```bash
# Redis is being monitored
curl -s http://localhost:9121/metrics | grep redis_up
# redis_up 1

# Memory usage
curl -s http://localhost:9121/metrics | grep redis_memory_used_bytes
# redis_memory_used_bytes 1.234e+07 (≈12 MB)

# Connected clients
curl -s http://localhost:9121/metrics | grep redis_connected_clients
# redis_connected_clients 4 (API + celery-worker + celery-beat + flower)

# Alert rules loaded
curl -s http://localhost:9090/api/v1/rules | python3 -m json.tool | grep -i redis
```

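
The three `grep` checks can be combined into one pass over the exporter output. The sketch below parses unlabelled samples from the Prometheus text format and applies the thresholds from the alert table in 19c.3. The helper names are hypothetical, the parser deliberately skips labelled series, and `redis_memory_max_bytes` is assumed to be `0` when no `maxmemory` is configured.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse unlabelled samples from Prometheus text exposition output."""
    values: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, rest = line.partition(" ")
        if "{" in name or not rest:
            continue  # labelled series are out of scope for this sketch
        values[name] = float(rest.split()[0])
    return values


def redis_problems(metrics: dict[str, float], max_clients: int = 50) -> list[str]:
    """Return human-readable problems based on the 19c.3 thresholds."""
    problems = []
    if metrics.get("redis_up") != 1:
        problems.append("redis_up is not 1")
    if metrics.get("redis_connected_clients", 0) > max_clients:
        problems.append("too many connected clients")
    max_bytes = metrics.get("redis_memory_max_bytes", 0)
    if max_bytes > 0 and metrics.get("redis_memory_used_bytes", 0) / max_bytes > 0.8:
        problems.append("memory above 80% of maxmemory")
    return problems


SAMPLE_OUTPUT = """\
# HELP redis_up Information about the Redis instance
# TYPE redis_up gauge
redis_up 1
redis_memory_used_bytes 1.234e+07
redis_memory_max_bytes 0
redis_connected_clients 4
"""

if __name__ == "__main__":
    print(redis_problems(parse_metrics(SAMPLE_OUTPUT)))
```
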

---

## Step 20: Security Hardening

Docker network segmentation, fail2ban configuration, and automatic security updates.

### 20.0 Docker Port Binding (Critical — Docker Bypasses UFW)

!!! danger "Docker bypasses UFW firewall rules"
    Docker manipulates iptables directly, bypassing UFW entirely. A port mapping like `"5432:5432"` exposes PostgreSQL to the **public internet** even if UFW only allows ports 22, 80, and 443. This was flagged by the German Federal Office for Information Security (BSI/CERT-Bund) in March 2026.

**Rules for port mappings in `docker-compose.yml`:**

1. **No port mapping** for services that only talk to other containers (PostgreSQL, Redis) — they communicate via Docker's internal network using service names (`db:5432`, `redis:6379`)
2. **Bind to `127.0.0.1`** for services that need host access but not internet access (API via Caddy, Flower, Prometheus, Grafana, etc.)
3. **Never use bare port mappings** like `"5432:5432"` or `"6380:6379"` — these bind to `0.0.0.0` (all interfaces)

| Service | Correct | Wrong |
|---|---|---|
| PostgreSQL | *(no ports section)* | `"5432:5432"` |
| Redis | *(no ports section)* | `"6380:6379"` |
| API | `"127.0.0.1:8001:8000"` | `"8001:8000"` |
| Flower | `"127.0.0.1:5555:5555"` | `"5555:5555"` |

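
The rules above can be checked mechanically. The sketch below classifies Compose short-syntax port strings; the function name is hypothetical and only the `HOST_IP:HOST_PORT:CONTAINER_PORT` short form is handled, so it is a lint aid rather than a full Compose parser.

```python
LOOPBACK_HOSTS = ("127.0.0.1", "[::1]")


def is_publicly_bound(mapping: str) -> bool:
    """True when a short-syntax port mapping binds on all interfaces."""
    parts = mapping.split(":")
    if len(parts) <= 2:
        # "8000" or "8001:8000": no host IP given, so Docker binds to 0.0.0.0.
        return True
    host_ip = ":".join(parts[:-2])  # re-join so bracketed IPv6 addresses survive
    return host_ip not in LOOPBACK_HOSTS


if __name__ == "__main__":
    for mapping in ("5432:5432", "6380:6379", "127.0.0.1:8001:8000", "127.0.0.1:5555:5555"):
        status = "PUBLIC" if is_publicly_bound(mapping) else "localhost only"
        print(f"{mapping:<22} {status}")
```
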
**Gitea stack** (`~/gitea/docker-compose.yml`) also needs this fix:

```yaml
    # BEFORE (vulnerable):
    ports:
      - "3000:3000"
      - "2222:22"

    # AFTER (secure):
    ports:
      - "127.0.0.1:3000:3000"  # Caddy proxies git.wizard.lu
      - "2222:22"              # SSH must stay public (Caddy can't proxy SSH)
```

Port 2222 stays public because Caddy cannot proxy SSH — this is acceptable since SSH is designed for internet exposure. Port 3000 (Gitea web UI) must be localhost-only since Caddy reverse proxies `git.wizard.lu` to it.

**After deploying, verify no services are exposed:**

```bash
# Should return nothing: PostgreSQL and Redis must not listen on any public interface
sudo ss -tlnp | grep -E '(0\.0\.0\.0|\[::\]|\*):(5432|6379|6380)'

# Should show 127.0.0.1 only for app services
sudo ss -tlnp | grep -E '(8001|5555|9090|3001)'

# Gitea web UI should be localhost only, SSH stays public
sudo ss -tlnp | grep 3000  # should show 127.0.0.1
sudo ss -tlnp | grep 2222  # will show 0.0.0.0 (expected for SSH)
```

### 20.0b Redis Authentication (Defense-in-Depth)

Redis is isolated on Docker's internal network with no exposed ports, but as defense-in-depth a password is configured via the `REDIS_PASSWORD` environment variable.

Add to `~/apps/orion/.env`:

```bash
# Generate a strong password
openssl rand -hex 16
# Add to .env
REDIS_PASSWORD=<generated-password>
```

The `docker-compose.yml` passes this to `redis-server --requirepass` and includes it in all `REDIS_URL` connection strings automatically.

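
For reference, the same kind of password can be generated from Python, and the resulting connection string looks as follows. This is a sketch under assumptions: the URL layout is the common `redis://:password@host:port/db` form, the host `redis` and port `6379` come from the service name used in this guide, and the database index `0` and both helper names are hypothetical. The actual `REDIS_URL` values are built by `docker-compose.yml`.

```python
import secrets
from urllib.parse import quote


def generate_redis_password(num_bytes: int = 16) -> str:
    """Equivalent of `openssl rand -hex 16`: 16 random bytes as 32 hex characters."""
    return secrets.token_hex(num_bytes)


def redis_url(password: str, host: str = "redis", port: int = 6379, db: int = 0) -> str:
    """Build a password-only Redis URL; the password is percent-encoded for safety."""
    return f"redis://:{quote(password, safe='')}@{host}:{port}/{db}"


if __name__ == "__main__":
    password = generate_redis_password()
    print(f"REDIS_PASSWORD={password}")
    print(redis_url(password))
```
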
### 20.0c Prometheus /metrics Endpoint Restriction

The `/metrics` endpoint is restricted to localhost and Docker internal networks at the application level. External requests to `https://api.wizard.lu/metrics` receive a `403 Forbidden` response. Prometheus scrapes from the Docker monitoring network (172.x.x.x) and is unaffected.

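
A restriction of this kind usually boils down to an address check. The sketch below uses the standard `ipaddress` module and is not the application's actual code: the function name is hypothetical, and the allowed ranges are an assumption (loopback plus `172.16.0.0/12`, the private block that covers the 172.x.x.x addresses mentioned above).

```python
import ipaddress

# Assumed allow-list: loopback (IPv4 and IPv6) plus the 172.16.0.0/12 private block.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("127.0.0.0/8"),
    ipaddress.ip_network("::1/128"),
    ipaddress.ip_network("172.16.0.0/12"),
]


def metrics_allowed(client_ip: str) -> bool:
    """True when the client address falls inside one of the allowed networks."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # not an IP address at all: deny
    return any(address in network for network in ALLOWED_NETWORKS)


if __name__ == "__main__":
    for ip in ("127.0.0.1", "172.18.0.5", "91.99.65.229"):
        print(ip, "allowed" if metrics_allowed(ip) else "403 Forbidden")
```
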
### 20.1 Docker Network Segmentation

Three isolated networks replace the default flat network:

| Network | Purpose | Services |
|---|---|---|
| `orion_frontend` | External-facing | api |
| `orion_backend` | Database + workers | db, redis, api, celery-worker, celery-beat, flower |
| `orion_monitoring` | Metrics collection | api, prometheus, grafana, node-exporter, cadvisor, alertmanager |

The `api` service is on all three networks because it needs to:

- Serve HTTP traffic (frontend)
- Connect to database and Redis (backend)
- Expose `/metrics` to Prometheus (monitoring)

This is already configured in the updated `docker-compose.yml`. After deploying, verify:

```bash
docker network ls | grep orion
# Expected: orion_frontend, orion_backend, orion_monitoring
```

### 20.2 fail2ban Configuration

fail2ban is already installed (Step 3) but needs jail configuration. All commands below are copy-pasteable.

**SSH jail** — bans an IP for 24 hours after 3 failed SSH attempts within 10 minutes (`findtime = 600`):

```bash
sudo tee /etc/fail2ban/jail.local << 'EOF'
[sshd]
enabled = true
port = ssh
filter = sshd
logpath = /var/log/auth.log
maxretry = 3
bantime = 86400
findtime = 600
EOF
```

**Caddy access logging** — fail2ban needs a log file to watch. Add a global `log` directive to your Caddyfile:

```bash
sudo nano /etc/caddy/Caddyfile
```

Add this block at the **very top** of the Caddyfile, before any site blocks. It sets where Caddy writes its logs. Caddy emits access-log entries only for sites that enable the `log` directive, so if the file stays empty after the test request below, add a bare `log` line inside each site block:

```caddy
{
    log {
        output file /var/log/caddy/access.log {
            roll_size 100MiB
            roll_keep 5
        }
        format json
    }
}
```

Create the log directory and restart Caddy:

```bash
sudo mkdir -p /var/log/caddy
sudo chown caddy:caddy /var/log/caddy
sudo systemctl restart caddy
sudo systemctl status caddy

# Verify logging works (make a request, then check)
curl -s https://wizard.lu > /dev/null
sudo tail -1 /var/log/caddy/access.log | python3 -m json.tool | head -5
```

**Caddy auth filter** — matches 401/403 responses in Caddy's JSON logs:

```bash
sudo tee /etc/fail2ban/filter.d/caddy-auth.conf << 'EOF'
[Definition]
failregex = ^.*"remote_ip":"<HOST>".*"status":(401|403).*$
ignoreregex =
EOF
```

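
Before relying on the filter, it is worth testing the expression against a log line. The sketch below emulates fail2ban's `<HOST>` placeholder with a simple named group (fail2ban's real host pattern is more thorough) and runs the regex against a hand-written, shortened line in the shape of Caddy's JSON access log. The sample IP is from the documentation range and the helper name is hypothetical.

```python
import re
from typing import Optional

FAILREGEX = r'^.*"remote_ip":"<HOST>".*"status":(401|403).*$'

# Stand-in for fail2ban's <HOST> substitution: IPv4/IPv6 characters only.
PATTERN = re.compile(FAILREGEX.replace("<HOST>", r"(?P<host>[0-9A-Fa-f:.]+)"))


def offending_ip(log_line: str) -> Optional[str]:
    """Return the client IP if the line is a 401/403 response, else None."""
    match = PATTERN.search(log_line)
    return match.group("host") if match else None


SAMPLE_401 = (
    '{"level":"info","ts":1770890000.123,"logger":"http.log.access",'
    '"msg":"handled request","request":{"remote_ip":"203.0.113.7",'
    '"remote_port":"51234","method":"POST","uri":"/api/v1/auth/login"},'
    '"duration":0.012,"size":43,"status":401}'
)
SAMPLE_200 = SAMPLE_401.replace('"status":401', '"status":200')

if __name__ == "__main__":
    print(offending_ip(SAMPLE_401))
    print(offending_ip(SAMPLE_200))
```

On the server itself, `fail2ban-regex /var/log/caddy/access.log /etc/fail2ban/filter.d/caddy-auth.conf` runs the same kind of check with fail2ban's own matcher.
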
**Caddy jail** — bans an IP for 1 hour after 10 failed auth attempts within 10 minutes (`findtime = 600`):

```bash
sudo tee /etc/fail2ban/jail.d/caddy.conf << 'EOF'
[caddy-auth]
enabled = true
port = http,https
filter = caddy-auth
logpath = /var/log/caddy/access.log
maxretry = 10
bantime = 3600
findtime = 600
EOF
```

**Restart and verify:**

```bash
sudo systemctl restart fail2ban

# Both jails should be listed
sudo fail2ban-client status

# SSH jail details
sudo fail2ban-client status sshd

# Caddy jail details (will show 0 bans initially)
sudo fail2ban-client status caddy-auth
```

### 20.3 Unattended Security Upgrades

Install and enable automatic security updates:

```bash
sudo apt install -y unattended-upgrades apt-listchanges
sudo dpkg-reconfigure -plow unattended-upgrades
```

This enables security-only updates with automatic reboot disabled (safe default). Verify:

```bash
sudo unattended-upgrades --dry-run 2>&1 | head -10
cat /etc/apt/apt.conf.d/20auto-upgrades
```

Expected `20auto-upgrades` content:

```
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
```

### 20.4 Clean Up Legacy Docker Network

After deploying with network segmentation, the old default network may remain:

```bash
# Check if orion_default still exists
docker network ls | grep orion_default

# Remove it (safe — no containers should be using it)
docker network rm orion_default 2>/dev/null || echo "Already removed"
```

### 20.5 Verification

```bash
# fail2ban jails active (should show sshd and caddy-auth)
sudo fail2ban-client status

# SSH jail details
sudo fail2ban-client status sshd

# Docker networks (should show 3: frontend, backend, monitoring)
docker network ls | grep orion

# Unattended upgrades configured
sudo unattended-upgrades --dry-run 2>&1 | head

# Caddy access log being written
sudo tail -1 /var/log/caddy/access.log
```


---

## Step 21: Cloudflare Domain Proxy

Move DNS to Cloudflare for WAF, DDoS protection, and CDN. This step involves DNS propagation — do it during a maintenance window.

!!! warning "DNS changes affect all services"
    Moving nameservers involves propagation delay (minutes to hours). Plan for brief interruption. Do this step last, after Steps 19–20 are verified.

### 21.1 Pre-Migration: Record Email DNS

Before changing nameservers, document all email-related DNS records:

```bash
# Run for each domain (wizard.lu, omsflow.lu, rewardflow.lu, hostwizard.lu)
dig wizard.lu MX +short
dig wizard.lu TXT +short
dig _dmarc.wizard.lu TXT +short
dig default._domainkey.wizard.lu TXT +short  # DKIM selector may vary
```

Save the output — you'll need to verify these exist after Cloudflare import.

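
Comparing the saved output with the records after the Cloudflare import is a set difference. The sketch below assumes the `dig` answers have already been collected into dictionaries keyed by record name and type. The helper name and all sample values are hypothetical placeholders, not the real records for these domains.

```python
def missing_records(
    before: dict[str, set[str]], after: dict[str, set[str]]
) -> dict[str, set[str]]:
    """Return every record value present before the migration but absent afterwards."""
    missing: dict[str, set[str]] = {}
    for name, values in before.items():
        lost = values - after.get(name, set())
        if lost:
            missing[name] = lost
    return missing


if __name__ == "__main__":
    # Placeholder values for illustration only.
    before = {
        "wizard.lu MX": {"10 mail.example.net."},
        "_dmarc.wizard.lu TXT": {'"v=DMARC1; p=none"'},
    }
    after = {
        "wizard.lu MX": {"10 mail.example.net."},
    }
    for name, values in missing_records(before, after).items():
        print(f"MISSING {name}: {sorted(values)}")
```
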
### 21.2 Add Domains to Cloudflare

1. Log in to [Cloudflare Dashboard](https://dash.cloudflare.com)
2. **Add a site** for each domain: `wizard.lu`, `omsflow.lu`, `rewardflow.lu`, `hostwizard.lu`
3. Select **Free** plan → choose **Full setup** (nameserver-based, not CNAME/partial)
4. Block AI crawlers on all pages
5. Cloudflare auto-scans and imports existing DNS records — **review carefully**:
    - Delete any stale CNAME records (leftover from partial setup)
    - Add missing A/AAAA records manually (Cloudflare scan may miss some)
    - Verify MX/SPF/DKIM/DMARC records are present before changing NS
    - Email records (MX, TXT) must stay as **DNS-only (grey cloud)** — never proxy MX records
6. Set proxy status:
    - **Orange cloud (proxied)**: `@`, `www`, `api`, `flower`, `grafana` — gets WAF + CDN
    - **Grey cloud (DNS only)**: `git` — needs direct access for SSH on port 2222

### 21.3 Change Nameservers

At your domain registrar (Netim), update NS records to Cloudflare's assigned nameservers. Cloudflare shows the exact pair during activation (e.g., `name1.ns.cloudflare.com`, `name2.ns.cloudflare.com`).

Disable DNSSEC at the registrar before switching NS — re-enable later via Cloudflare.

### 21.4 Generate Origin Certificates

Cloudflare Origin Certificates (free, 15-year validity) avoid ACME challenge issues when traffic is proxied:

1. In Cloudflare: **SSL/TLS** > **Origin Server** > **Create Certificate**
2. Generate for each domain:
    - `wizard.lu`: `wizard.lu, api.wizard.lu, www.wizard.lu, flower.wizard.lu, grafana.wizard.lu` (**specific subdomains, no wildcard**)
    - `omsflow.lu`: `omsflow.lu, www.omsflow.lu, *.omsflow.lu` (wildcard for store subdomains)
    - `rewardflow.lu`: `rewardflow.lu, www.rewardflow.lu, *.rewardflow.lu` (wildcard for store subdomains)
    - `hostwizard.lu`: `hostwizard.lu, www.hostwizard.lu, *.hostwizard.lu` (wildcard for store subdomains)
3. Download the certificate and private key (private key is shown only once)

!!! warning "Do NOT use wildcard origin certs for wizard.lu"
    A `*.wizard.lu` wildcard cert will match `git.wizard.lu`, which is DNS-only (grey cloud) and uses a Let's Encrypt cert. A wildcard origin cert would conflict. Use specific subdomains instead. For new wizard.lu subdomains, add them explicitly to this cert and to the Caddyfile.

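
The overlap comes from how wildcard names match: `*` stands for exactly one leftmost label. A minimal sketch of that rule follows. The function name is hypothetical and the logic covers only leftmost single-label wildcards, which is the case relevant here.

```python
def cert_matches(pattern: str, hostname: str) -> bool:
    """Single-label wildcard matching as used for TLS certificate names."""
    pattern_labels = pattern.lower().split(".")
    host_labels = hostname.lower().split(".")
    if len(pattern_labels) != len(host_labels):
        return False  # '*' never spans more than one label, nor zero labels
    if pattern_labels[0] == "*":
        return pattern_labels[1:] == host_labels[1:]
    return pattern_labels == host_labels


if __name__ == "__main__":
    for host in ("git.wizard.lu", "api.wizard.lu", "wizard.lu"):
        print(host, cert_matches("*.wizard.lu", host))
```

This is why the `wizard.lu` certificate lists `api`, `www`, `flower`, and `grafana` explicitly: a `*.wizard.lu` entry would also cover `git.wizard.lu`.
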
Install on the server:

```bash
sudo mkdir -p /etc/caddy/certs/{wizard.lu,omsflow.lu,rewardflow.lu,hostwizard.lu}

# For each domain, create cert.pem and key.pem:
sudo nano /etc/caddy/certs/wizard.lu/cert.pem  # paste certificate
sudo nano /etc/caddy/certs/wizard.lu/key.pem   # paste private key
# Repeat for omsflow.lu, rewardflow.lu, and hostwizard.lu

sudo chown -R caddy:caddy /etc/caddy/certs/
sudo chmod 600 /etc/caddy/certs/*/key.pem
```

### 21.5 Update Caddyfile

For Cloudflare-proxied domains, use explicit TLS with origin certs. Keep auto-HTTPS for `git.wizard.lu` (DNS-only, grey cloud):

```caddy
{
    log {
        output file /var/log/caddy/access.log {
            roll_size 100MiB
            roll_keep 5
        }
        format json
    }
}

# ─── Platform 1: Main (wizard.lu) ───────────────────────────
wizard.lu {
    tls /etc/caddy/certs/wizard.lu/cert.pem /etc/caddy/certs/wizard.lu/key.pem
    reverse_proxy localhost:8001
}

www.wizard.lu {
    tls /etc/caddy/certs/wizard.lu/cert.pem /etc/caddy/certs/wizard.lu/key.pem
    redir https://wizard.lu{uri} permanent
}

# ─── Platform 2: OMS (omsflow.lu) ───────────────────────────
omsflow.lu {
    tls /etc/caddy/certs/omsflow.lu/cert.pem /etc/caddy/certs/omsflow.lu/key.pem
    reverse_proxy localhost:8001
}

www.omsflow.lu {
    tls /etc/caddy/certs/omsflow.lu/cert.pem /etc/caddy/certs/omsflow.lu/key.pem
    redir https://omsflow.lu{uri} permanent
}

# ─── Platform 3: Loyalty+ (rewardflow.lu) ──────────────────
rewardflow.lu {
    tls /etc/caddy/certs/rewardflow.lu/cert.pem /etc/caddy/certs/rewardflow.lu/key.pem
    reverse_proxy localhost:8001
}

www.rewardflow.lu {
    tls /etc/caddy/certs/rewardflow.lu/cert.pem /etc/caddy/certs/rewardflow.lu/key.pem
    redir https://rewardflow.lu{uri} permanent
}

# ─── Platform 4: HostWizard (hostwizard.lu) ──────────────────
hostwizard.lu {
    tls /etc/caddy/certs/hostwizard.lu/cert.pem /etc/caddy/certs/hostwizard.lu/key.pem
    reverse_proxy localhost:8001
}

www.hostwizard.lu {
    tls /etc/caddy/certs/hostwizard.lu/cert.pem /etc/caddy/certs/hostwizard.lu/key.pem
    redir https://hostwizard.lu{uri} permanent
}

# ─── Store subdomains (wildcard — all except wizard.lu) ──────
*.omsflow.lu {
    tls /etc/caddy/certs/omsflow.lu/cert.pem /etc/caddy/certs/omsflow.lu/key.pem
    reverse_proxy localhost:8001
}

*.rewardflow.lu {
    tls /etc/caddy/certs/rewardflow.lu/cert.pem /etc/caddy/certs/rewardflow.lu/key.pem
    reverse_proxy localhost:8001
}

*.hostwizard.lu {
    tls /etc/caddy/certs/hostwizard.lu/cert.pem /etc/caddy/certs/hostwizard.lu/key.pem
    reverse_proxy localhost:8001
}

# ─── Services (wizard.lu origin cert) ───────────────────────
api.wizard.lu {
    tls /etc/caddy/certs/wizard.lu/cert.pem /etc/caddy/certs/wizard.lu/key.pem
    reverse_proxy localhost:8001
}

flower.wizard.lu {
    tls /etc/caddy/certs/wizard.lu/cert.pem /etc/caddy/certs/wizard.lu/key.pem
    reverse_proxy localhost:5555
}

grafana.wizard.lu {
    tls /etc/caddy/certs/wizard.lu/cert.pem /etc/caddy/certs/wizard.lu/key.pem
    reverse_proxy localhost:3001
}

# ─── DNS-only domain (Let's Encrypt, not proxied by Cloudflare) ─
git.wizard.lu {
    tls {
        issuer acme
    }
    reverse_proxy localhost:3000
}
```

Restart Caddy:

```bash
sudo systemctl restart caddy
sudo systemctl status caddy
```

|
||
### 21.6 Cloudflare Settings (per domain)
|
||
|
||
Configure these in the Cloudflare dashboard for each domain (`wizard.lu`, `omsflow.lu`, `rewardflow.lu`, `hostwizard.lu`):
|
||
|
||
| Setting | Location | Value |
|---|---|---|
| SSL mode | SSL/TLS > Overview | Full (Strict) |
| Always Use HTTPS | SSL/TLS > Edge Certificates | On |
| Bot Fight Mode | Security > Settings | On |
| DDoS protection | Security > Security rules > DDoS | Active (enabled by default) |
| AI crawlers | Security (during setup) | Blocked on all pages |

**Rate limiting rule** (Security > Security rules > Create rule):
|
||
|
||
| Field | Value |
|---|---|
| Match | URI Path contains `/api/` |
| Characteristics | IP |
| Rate | 100 requests per 10 seconds |
| Action | Block |
| Duration | 10 seconds |

### 21.7 Production Environment
|
||
|
||
Add to `~/apps/orion/.env`:
|
||
|
||
```bash
|
||
CLOUDFLARE_ENABLED=true
|
||
```
|
||
|
||
### 21.8 Verification
|
||
|
||
```bash
|
||
# CF proxy active (look for the cf-ray header; -s hides curl's progress meter)
curl -sI https://wizard.lu | grep -i cf-ray

# DNS resolves to Cloudflare IPs (not 91.99.65.229)
dig wizard.lu +short

# All domains responding
curl -I https://omsflow.lu
curl -I https://rewardflow.lu
curl -I https://hostwizard.lu
curl -I https://api.wizard.lu/health

# git.wizard.lu still on Let's Encrypt (not CF)
curl -I https://git.wizard.lu
```
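
The same checks can be scripted so they return an exit code instead of relying on reading the output by eye. This is a minimal sketch under stated assumptions: `has_cf_ray` and `resolves_to_origin` are hypothetical helper names, they only inspect text on stdin, and the origin IP is the one listed in the server details.

```bash
# Hypothetical helper: read HTTP response headers on stdin and succeed only
# if a cf-ray header is present (header names are case-insensitive).
has_cf_ray() {
    grep -qi '^cf-ray:'
}

# Hypothetical helper: succeed if any address on stdin equals the origin IP,
# which would mean the record is DNS-only (grey cloud) rather than proxied.
resolves_to_origin() {
    local origin_ip="$1"
    grep -qxF "$origin_ip"
}
```

Typical use would be `curl -sI https://wizard.lu | has_cf_ray && echo proxied` for a proxied domain, and `dig +short git.wizard.lu | resolves_to_origin 91.99.65.229 && echo DNS-only` for the Gitea host.
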
|
||
|
||
!!! info "`git.wizard.lu` stays DNS-only"
    The Gitea instance uses SSH on port 2222 for git operations. Cloudflare proxy only supports HTTP/HTTPS, so `git.wizard.lu` must remain DNS-only (grey cloud) with Let's Encrypt auto-SSL via Caddy.

---
|
||
|
||
## Step 22: Incident Response Runbook
|
||
|
||
A comprehensive incident response runbook is available at [Incident Response](incident-response.md). It includes:
|
||
|
||
- **Severity levels**: SEV-1 (platform down, <15min), SEV-2 (feature broken, <1h), SEV-3 (minor, <4h)
|
||
- **Quick diagnosis decision tree**: SSH → Docker → containers → Caddy → DNS
|
||
- **8 runbooks** with copy-paste commands for common incidents
|
||
- **Post-incident report template**
|
||
- **Monitoring URLs** quick reference
|
||
|
||
---
|
||
|
||
## Step 23: Environment Reference
|
||
|
||
A complete environment variables reference is available at [Environment Variables](environment.md). It documents all 55+ configuration variables from `app/core/config.py`, grouped by category with defaults and production requirements.
|
||
|
||
---
|
||
|
||
## Step 24: Documentation Updates
|
||
|
||
This document has been updated with Steps 19–24. Additional documentation changes:
|
||
|
||
- `docs/deployment/incident-response.md` — new incident response runbook
|
||
- `docs/deployment/environment.md` — complete env var reference (was empty)
|
||
- `docs/deployment/launch-readiness.md` — updated with Feb 2026 infrastructure status
|
||
- `mkdocs.yml` — incident-response.md added to nav
|
||
|
||
---
|
||
|
||
## Step 25: Google Wallet Integration
|
||
|
||
Enable loyalty card passes in Google Wallet so customers can add their loyalty card to their Android phone.
|
||
|
||
### Prerequisites
|
||
|
||
- Google account (personal Gmail is fine)
|
||
- Loyalty module deployed and working
|
||
|
||
### 25.1 Google Pay & Wallet Console
|
||
|
||
Register as a Google Wallet Issuer:
|
||
|
||
1. Go to [pay.google.com/business/console](https://pay.google.com/business/console)
|
||
2. Enter your business name (e.g., "Letzshop" or your company name) — this is for Google's review, customers don't see it on passes
|
||
3. Note your **Issuer ID** from the Google Wallet API section
|
||
|
||
!!! info "Issuer ID"
    The Issuer ID is a long numeric string (e.g., `3388000000023089598`). You'll find it under Google Wallet API → Manage in the Pay & Wallet Console.

### 25.2 Google Cloud Project
|
||
|
||
1. Go to [console.cloud.google.com](https://console.cloud.google.com)
2. Create a new project (e.g., "Orion")
3. Enable the **Google Wallet API**:
    - Navigate to "APIs & Services" → "Library"
    - Search for "Google Wallet API" and enable it

### 25.3 Service Account
|
||
|
||
Create a service account for API access:
|
||
|
||
1. Go to "APIs & Services" → "Credentials" → "Create Credentials"
|
||
2. Select **Google Wallet API** as the API
|
||
3. Select **Application data** (not user data — your backend calls the API directly)
|
||
4. Name the service account (e.g., `wallet-service`)
|
||
5. Click "Done"
|
||
|
||
Download the JSON key:
|
||
|
||
1. Go to "IAM & Admin" → "Service Accounts"
|
||
2. Click on the service account you created
|
||
3. Go to **Keys** tab → **Add Key** → **Create new key** → **JSON**
|
||
4. Save the downloaded `.json` file securely
|
||
|
||
### 25.4 Link Service Account to Issuer
|
||
|
||
1. Go back to [pay.google.com/business/console](https://pay.google.com/business/console)
|
||
2. In the **left sidebar**, click **Users** (not inside the Wallet API section)
|
||
3. Invite the service account email (e.g., `wallet-service@orion-488322.iam.gserviceaccount.com`)
|
||
4. Assign **Admin** role
|
||
5. Verify it appears in the users list
|
||
|
||
!!! warning "Common mistake"
    The "Users" link is in the **left sidebar** of the Pay & Wallet Console, not inside the "Google Wallet API" → "Manage" section. The Manage page has "Setup test accounts", which is a different feature.

### 25.5 Deploy to Server
|
||
|
||
Upload the service account JSON key to the Hetzner server:
|
||
|
||
```bash
|
||
# From your local machine
|
||
scp /path/to/orion-488322-xxxxx.json samir@91.99.65.229:~/apps/orion/google-wallet-sa.json
|
||
```
|
||
|
||
Add the environment variables to the production `.env`:
|
||
|
||
```bash
|
||
ssh samir@91.99.65.229
|
||
cd ~/apps/orion
|
||
nano .env
|
||
```
|
||
|
||
Add:
|
||
|
||
```bash
|
||
# Loyalty Module
|
||
LOYALTY_DEFAULT_LOGO_URL=https://rewardflow.lu/static/modules/loyalty/shared/img/default-logo-200.png
|
||
LOYALTY_GOOGLE_WALLET_ORIGINS=["https://rewardflow.lu"]
|
||
|
||
# Google Wallet (Loyalty Module)
|
||
LOYALTY_GOOGLE_ISSUER_ID=3388000000023089598
|
||
LOYALTY_GOOGLE_SERVICE_ACCOUNT_JSON=/app/google-wallet-sa.json
|
||
```
|
||
|
||
!!! note "Docker path"
    The path must be the file's location **inside the container**, not on the host. If the file is in `~/apps/orion/` on the host, it maps to `/app/` inside the container (check the volumes in your `docker-compose.yml`).

Mount the JSON file in `docker-compose.yml` if not already covered by the app volume:

```yaml
services:
  api:
    volumes:
      - ./google-wallet-sa.json:/app/google-wallet-sa.json:ro
```

Restart the application:
|
||
|
||
```bash
|
||
docker compose --profile full up -d --build
|
||
```
|
||
|
||
### 25.6 Platform-Level Configuration
|
||
|
||
Google Wallet is a **platform-wide setting** — all merchants on the platform share the same Issuer ID and service account. Merchants don't need to configure anything; wallet integration activates automatically when the env vars are set.
|
||
|
||
The two required env vars:
|
||
|
||
```bash
|
||
# In production .env
|
||
LOYALTY_GOOGLE_ISSUER_ID=3388000000023089598
|
||
LOYALTY_GOOGLE_SERVICE_ACCOUNT_JSON=/app/google-wallet-sa.json
|
||
```
|
||
|
||
When both are set, every loyalty program on the platform automatically gets Google Wallet support: enrollment creates wallet passes, stamp/points operations sync to passes, and the storefront shows "Add to Google Wallet" buttons.
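
Since activation depends on both variables being present, a quick pre-flight check on the env file can catch a half-finished edit before the stack is restarted. The function below is a sketch only: `check_wallet_env` is a hypothetical name, and it assumes plain `KEY=value` lines as shown above (no quoting or `export` prefixes).

```bash
# Hypothetical pre-flight check: confirm both Google Wallet variables are
# present and non-empty in an env file before restarting the stack.
check_wallet_env() {
    local env_file="$1"
    local var status=0
    for var in LOYALTY_GOOGLE_ISSUER_ID LOYALTY_GOOGLE_SERVICE_ACCOUNT_JSON; do
        # Match "VAR=value" with at least one character after the equals sign.
        if ! grep -Eq "^${var}=.+" "$env_file"; then
            echo "not set: $var"
            status=1
        fi
    done
    return "$status"
}
```

On the server this would be run as `check_wallet_env ~/apps/orion/.env`; it only checks that the values exist, not that the Issuer ID or key file are valid.
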
|
||
|
||
### 25.7 Verify Configuration
|
||
|
||
Check the API health and wallet service status:
|
||
|
||
```bash
|
||
# Check the app logs for wallet service initialization
docker compose --profile full logs api | grep -i "wallet\|loyalty"

# Confirm the API is up (to check wallet URLs, enroll a customer and inspect that response)
curl -s https://api.wizard.lu/health | python3 -m json.tool
```
|
||
|
||
### 25.8 Testing Google Wallet Passes
|
||
|
||
Google provides a **demo mode** — passes work in test without full production approval:
|
||
|
||
1. Console admins and developers (your Google account) can always test passes
|
||
2. For additional testers, add their Gmail addresses in Pay & Wallet Console → Google Wallet API → Manage → **Setup test accounts**
|
||
3. Use `walletobjects.sandbox` scope for initial testing (the code uses `wallet_object.issuer` which covers both)
|
||
|
||
**End-to-end test flow:**
|
||
|
||
1. Create a loyalty program via the store panel and set the Google Wallet Issuer ID in Settings → Digital Wallet
2. Enroll a customer (via store or storefront self-enrollment)
    - The system automatically creates a Google Wallet `LoyaltyClass` (for the program) and `LoyaltyObject` (for the card)
3. Open the storefront loyalty dashboard: the "Add to Google Wallet" button appears
4. Click the button (or open the URL on an Android device): the pass is added to Google Wallet
5. Add a stamp or points: the pass in Google Wallet auto-updates (no push needed, Google syncs)

### 25.9 Local Development Setup
|
||
|
||
You can test the full Google Wallet integration from your local machine:
|
||
|
||
```bash
|
||
# In your local .env
|
||
LOYALTY_GOOGLE_ISSUER_ID=3388000000023089598
|
||
LOYALTY_GOOGLE_SERVICE_ACCOUNT_JSON=/path/to/orion-488322-xxxxx.json
|
||
```
|
||
|
||
The `GoogleWalletService` calls Google's REST API directly over HTTPS — no special network configuration needed. The same service account JSON works on both local and server environments.
|
||
|
||
**Local testing checklist:**
|
||
|
||
- [x] Service account JSON downloaded and path set in env
|
||
- [x] `LOYALTY_GOOGLE_ISSUER_ID` set in env
|
||
- [ ] Start the app locally: `python3 -m uvicorn main:app --reload`
|
||
- [ ] Enroll a customer → check logs for "Created Google Wallet class" and "Created Google Wallet object"
|
||
- [ ] Open storefront dashboard → "Add to Google Wallet" button should appear
|
||
- [ ] Open the wallet URL on Android → pass added to Google Wallet
|
||
- [ ] Add stamps → check logs for "Updated Google Wallet object", verify pass updates
|
||
|
||
### 25.10 How It Works (Architecture)
|
||
|
||
The integration is fully automatic — no manual API calls needed after initial setup.
|
||
|
||
```
┌─────────────┐     ┌──────────────┐     ┌─────────────────────┐
│  Merchant   │────▶│  Orion API   │────▶│  Google Wallet API  │
│ sets issuer │     │              │     │                     │
│ ID in UI    │     │              │     │                     │
└─────────────┘     └──────────────┘     └─────────────────────┘

┌─────────────┐     ┌──────────────┐     ┌─────────────────────┐
│  Customer   │────▶│  Orion API   │────▶│  Google Wallet API  │
│  enrolls    │     │              │     │                     │
│             │     │create_class +│     │ POST /loyaltyClass  │
│             │     │create_object │     │ POST /loyaltyObject │
│             │◀────│  save_url    │     │                     │
│             │     └──────────────┘     └─────────────────────┘
│  taps "Add  │
│  to Wallet" │────▶ Google Wallet app adds pass automatically
└─────────────┘

┌─────────────┐     ┌──────────────┐     ┌─────────────────────┐
│ Staff adds  │────▶│  Orion API   │────▶│  Google Wallet API  │
│ stamp/pts   │     │              │     │                     │
│             │     │update_object │     │ PATCH /loyaltyObject│
└─────────────┘     └──────────────┘     └─────────────────────┘
                                          Pass auto-updates on
                                          customer's phone
```

**Automatic triggers:**
|
||
|
||
| Event | Wallet Action | Service Call |
|-------|---------------|--------------|
| Customer enrolls | Create class (if first) + create object | `wallet_service.create_wallet_objects()` |
| Stamp added/redeemed/voided | Update object with new balance | `wallet_service.sync_card_to_wallets()` |
| Points earned/redeemed/voided/adjusted | Update object with new balance | `wallet_service.sync_card_to_wallets()` |
| Customer opens dashboard | Generate save URL (JWT, 1h expiry) | `wallet_service.get_add_to_wallet_urls()` |

No push notifications needed — Google syncs object changes automatically.
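
When debugging a save URL, it can be useful to look at the claims inside its JWT. The sketch below assumes the URL has the form `https://pay.google.com/gp/v/save/<JWT>` and that GNU `base64` is available; `jwt_payload` is a hypothetical helper, and it decodes the payload without verifying the signature.

```bash
# Hypothetical debugging helper: print the JSON payload of a JWT, given either
# the bare token or a save URL that ends in the token. No signature check.
jwt_payload() {
    local token="${1##*/}"          # drop everything up to the last slash
    local seg
    # Take the middle segment and convert base64url to standard base64.
    seg=$(printf '%s' "$token" | cut -d. -f2 | tr '_-' '/+')
    case $(( ${#seg} % 4 )) in      # restore the stripped base64 padding
        2) seg="${seg}==" ;;
        3) seg="${seg}=" ;;
    esac
    printf '%s' "$seg" | base64 -d
}
```

Piping the result through `python3 -m json.tool` gives a readable view of the issuer and the embedded loyalty object reference.
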
|
||
|
||
### 25.11 Next Steps
|
||
|
||
After Google Wallet is verified working:
|
||
|
||
1. **Submit for Google production approval** — required before non-test users can add passes
|
||
2. **Apple Wallet** — separate setup requiring Apple Developer account, APNs certificates, and pass signing certificates (see [Loyalty Module docs](../modules/loyalty/index.md#configuration))
|
||
|
||
---
|
||
|
||
## Network Architecture Diagram
|
||
|
||
```
|
||
INTERNET
|
||
│
|
||
┌────────────┼────────────┐
|
||
│ │ │
|
||
┌────▼────┐ ┌────▼────┐ ┌─────▼─────┐
|
||
│Cloudflare│ │Cloudflare│ │ Cloudflare │
|
||
│wizard.lu │ │omsflow │ │rewardflow │
|
||
│(proxied) │ │(proxied)│ │ (proxied) │
|
||
└────┬─────┘ └────┬────┘ └─────┬─────┘
|
||
│ │ │
|
||
└────────────┼────────────┘
|
||
│ HTTPS :443
|
||
│
|
||
┌─────────────────────┼─────────────────────┐
|
||
│ HETZNER SERVER (91.99.65.229) │
|
||
│ │ │
|
||
│ ┌──────────────────▼──────────────────┐ │
|
||
│ │ CADDY (reverse proxy) │ │
|
||
│ │ :80 / :443 │ │
|
||
│ └──┬──────┬──────┬──────┬──────┬──────┘ │
|
||
│ │ │ │ │ │ │
|
||
│ │ │ │ │ │ │
|
||
┌─────┼─────┼──────┼──────┼──────┼──────┼──────┐ │
|
||
│ │ DOCKER NETWORKS │ │ │ │ │
|
||
│ │ │ │ │ │ │ │ │
|
||
│ ┌──▼──────▼──────▼──┐ │ │ │ │ │
|
||
│ │ orion_frontend │ │ │ │ │ │
|
||
│ │ │ │ │ │ │ │
|
||
│ │ ┌───────────────┐ │ │ │ │ │ │
|
||
│ │ │ API :8000 │◄┼──┘ │ │ │ │
|
||
│ │ │ 127.0.0.1:8001│ │ Caddy ►│:5555 │:3001 │ │
|
||
│ │ └───────┬───────┘ │ │ │ │ │
|
||
│ └─────────┼─────────┘ │ │ │ │
|
||
│ │ │ │ │ │
|
||
│ ┌─────────▼─────────────────────────────────┐ │
|
||
│ │ orion_backend │ │
|
||
│ │ │ │
|
||
│ │ ┌──────────┐ ┌───────────┐ │ │
|
||
│ │ │ PostgreSQL│ │ Redis │ │ │
|
||
│ │ │ :5432 │ │ :6379 │ │ │
|
||
│ │ │ (no host │ │ (no host │ │ │
|
||
│ │ │ port) │ │ port) │ │ │
|
||
│ │ └─────▲─────┘ └──▲───▲───┘ │ │
|
||
│ │ │ │ │ │ │
|
||
│ │ ┌─────┴───────────┴┐ ┌┴────────────────┐ │ │
|
||
│ │ │ celery-worker │ │ celery-beat │ │ │
|
||
│ │ │ (no host port) │ │ (no host port) │ │ │
|
||
│ │ └──────────────────┘ └──────────────────┘ │ │
|
||
│ │ │ │
|
||
│ │ ┌──────────────────┐ │ │
|
||
│ │ │ Flower │◄── Caddy :5555 │ │
|
||
│ │ │ 127.0.0.1:5555 │ flower.wizard.lu │ │
|
||
│ │ └──────────────────┘ │ │
|
||
│ └───────────────────────────────────────────┘ │
|
||
│ │
|
||
│ ┌───────────────────────────────────────────┐ │
|
||
│ │ orion_monitoring │ │
|
||
│ │ │ │
|
||
│ │ ┌──────────────┐ scrapes ┌───────────┐ │ │
|
||
│ │ │ Prometheus │◄──────────│ API │ │ │
|
||
│ │ │ 127.0.0.1: │ │ /metrics │ │ │
|
||
│ │ │ 9090 │ └───────────┘ │ │
|
||
│ │ │ │◄─── node-exporter :9100 │ │
|
||
│ │ │ │◄─── cadvisor :8080 │ │
|
||
│ │ │ │◄─── redis-exporter :9121│ │
|
||
│ │ │ │◄─── alertmanager :9093 │ │
|
||
│ │ └──────┬───────┘ │ │
|
||
│ │ │ query │ │
|
||
│ │ ┌──────▼───────┐ │ │
|
||
│ │ │ Grafana │◄── Caddy :3001 │ │
|
||
│ │ │ 127.0.0.1: │ grafana.wizard.lu │ │
|
||
│ │ │ 3001 │ │ │
|
||
│ │ └──────────────┘ │ │
|
||
│ │ │ │
|
||
│ │ ┌──────────────┐ ┌──────────────┐ │ │
|
||
│ │ │ Alertmanager │────►│ SendGrid │ │ │
|
||
│ │ │ 127.0.0.1: │ │ SMTP :587 │ │ │
|
||
│ │ │ 9093 │ │ (external) │ │ │
|
||
│ │ └──────────────┘ └──────────────┘ │ │
|
||
│ └───────────────────────────────────────────┘ │
|
||
└──────────────────────────────────────────────────┘
|
||
│
|
||
┌─────┼──────────────────────────────────────┐
|
||
│ GITEA STACK (~/gitea) │
|
||
│ │
|
||
│ ┌──────────────────┐ ┌────────────────┐ │
|
||
│ │ Gitea │ │ gitea-db │ │
|
||
│ │ 127.0.0.1:3000 │◄─│ PostgreSQL │ │
|
||
│ │ :2222 (public) │ │ (internal) │ │
|
||
│ └──────────────────┘ └────────────────┘ │
|
||
│ ▲ ▲ │
|
||
│ │ │ │
|
||
│ Caddy:3000 SSH :2222 │
|
||
│ git.wizard.lu (public, for git push) │
|
||
└────────────────────────────────────────────┘
|
||
|
||

─── EXPOSED TO INTERNET ───────────────────────────
  :443   → Caddy (via Cloudflare proxy)
  :80    → Caddy (redirects to HTTPS)
  :22    → SSH (server access)
  :2222  → Gitea SSH (git push/pull)

─── LOCALHOST ONLY (127.0.0.1) ────────────────────
  :8001  → API (Caddy → API)
  :3000  → Gitea web (Caddy → Gitea)
  :3001  → Grafana (Caddy → Grafana)
  :5555  → Flower (Caddy → Flower)
  :9090  → Prometheus
  :9093  → Alertmanager
  :9100  → Node Exporter
  :9121  → Redis Exporter
  :8080  → cAdvisor

─── DOCKER INTERNAL ONLY (no host port) ───────────
  :5432  → PostgreSQL
  :6379  → Redis (password protected)
```
|
||
|
||
## Domain & Port Reference
|
||
|
||
| Service | Internal Port | External Port | Domain (via Caddy) |
|---|---|---|---|
| Orion API | 8000 | 127.0.0.1:8001 | `api.wizard.lu` |
| Main Platform | 8000 | 127.0.0.1:8001 | `wizard.lu` |
| OMS Platform | 8000 | 127.0.0.1:8001 | `omsflow.lu` |
| Loyalty+ Platform | 8000 | 127.0.0.1:8001 | `rewardflow.lu` |
| HostWizard Platform | 8000 | 127.0.0.1:8001 | `hostwizard.lu` |
| PostgreSQL | 5432 | none (Docker internal) | (internal only) |
| Redis | 6379 | none (Docker internal) | (internal only) |
| Flower | 5555 | 5555 (localhost) | `flower.wizard.lu` |
| Gitea | 3000 | 3000 (localhost) | `git.wizard.lu` |
| Prometheus | 9090 | 9090 (localhost) | (internal only) |
| Grafana | 3000 | 3001 (localhost) | `grafana.wizard.lu` |
| Node Exporter | 9100 | 9100 (localhost) | (internal only) |
| cAdvisor | 8080 | 8080 (localhost) | (internal only) |
| Redis Exporter | 9121 | 9121 (localhost) | (internal only) |
| Alertmanager | 9093 | 9093 (localhost) | (internal only) |
| Caddy | n/a | 80, 443 | (reverse proxy) |

!!! note "Single backend, multiple domains"
    All platform domains route to the same FastAPI backend. The `PlatformContextMiddleware` identifies the platform from the `Host` header. See [Multi-Platform Architecture](../architecture/multi-platform-cms.md) for details.

## Directory Structure on Server
|
||
|
||
```
~/
├── apps/
│   └── orion/                      # Orion application
│       ├── .env                    # Production environment
│       ├── docker-compose.yml      # App stack (API, DB, Redis, Celery, monitoring)
│       ├── monitoring/             # Prometheus + Grafana config
│       ├── logs/                   # Application logs
│       ├── uploads/                # User uploads
│       └── exports/                # Export files
├── backups/
│   ├── orion/
│   │   ├── daily/                  # 7-day retention
│   │   └── weekly/                 # 4-week retention
│   └── gitea/
│       ├── daily/
│       └── weekly/
├── gitea/
│   └── docker-compose.yml          # Gitea + PostgreSQL
└── gitea-runner/                   # CI/CD runner (act_runner v0.2.13)
    ├── act_runner                  # symlink → act_runner-0.2.13-linux-arm64
    ├── act_runner-0.2.13-linux-arm64
    └── .runner                     # registration config
```

## Troubleshooting
|
||
|
||
### Permission denied on logs
|
||
|
||
The Docker container runs as `appuser` (UID 1000). Host-mounted volumes need matching ownership:
|
||
|
||
```bash
|
||
sudo chown -R 1000:1000 logs uploads exports
|
||
```
|
||
|
||
### Celery workers restarting
|
||
|
||
Check logs for import errors:
|
||
|
||
```bash
|
||
docker compose --profile full logs celery-worker --tail 30
|
||
```
|
||
|
||
Common cause: stale task module references in `app/core/celery_config.py`.
|
||
|
||
### SSH service name on Ubuntu 24.04
|
||
|
||
Ubuntu 24.04 uses `ssh` not `sshd`:
|
||
|
||
```bash
|
||
sudo systemctl restart ssh # correct
|
||
sudo systemctl restart sshd # will fail
|
||
```
|
||
|
||
### git pull fails with local changes
|
||
|
||
If `docker-compose.yml` was edited on the server (e.g. passwords), stash before pulling:
|
||
|
||
```bash
|
||
git stash
|
||
git pull
|
||
git stash pop
|
||
```
|
||
|
||
## Maintenance
|
||
|
||
### Deploy updates
|
||
|
||
Deployments happen automatically when pushing to master (see [Step 16](#step-16-continuous-deployment)). For manual deploys:
|
||
|
||
```bash
|
||
cd ~/apps/orion && bash scripts/deploy.sh
|
||
```
|
||
|
||
The script handles stashing local changes, pulling, rebuilding containers, running migrations, and health checks.
|
||
|
||
### Rescaling / Upgrading the Server (CPU & RAM)
|
||
|
||
When the box runs short on RAM/CPU, rescale it to a larger plan. The disk is
handled separately (see [Disk Maintenance](#disk-maintenance-docker-pruning)):
rescaling for CPU/RAM does **not** grow the disk, and it usually doesn't need
to.

**When to rescale (the signals that justify it):**
|
||
|
||
- `free -h` shows swap heavily/fully used and almost no free RAM.
- Container memory limits are routinely maxed (`orion-alertmanager-1`,
  `orion-cadvisor-1` near 100% of their `mem_limit`, visible in
  `docker stats --no-stream`).
- `HostHighCpuUsage` fires repeatedly during CI bursts (Gitea Actions builds
  eat ~half the host CPU on a 2-vCPU box).
|
||
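
The swap signal above can be turned into a number that is easy to log or compare over time. This is a minimal sketch: `swap_used_pct` is a hypothetical helper that reads the standard `SwapTotal` and `SwapFree` fields from `/proc/meminfo`; what percentage counts as "heavily used" is left to judgment.

```bash
# Hypothetical helper: percentage of swap in use, read from a meminfo-style
# file (defaults to /proc/meminfo). Prints 0 when no swap is configured.
swap_used_pct() {
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { free = $2 }
         END {
             if (total > 0) {
                 printf "%d\n", (total - free) * 100 / total
             } else {
                 print 0
             }
         }' "${1:-/proc/meminfo}"
}
```

Running `swap_used_pct` with no argument reports the live value on the server; passing a file path makes it usable against a saved snapshot.
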
**Two important constraints (shown on the Rescale screen):**
|
||
|
||
1. **Same architecture only.** Our servers are **Arm64 / Ampere (CAX series)**.
   You can rescale only between CAX plans (CAX11 → CAX21 → CAX31 → CAX41). All
   the x86 plans (CX / CPX / CCX) will be **greyed out**; that's expected, not
   an error.
2. **"CPU and RAM only" vs "expand disk".** Always choose **CPU and RAM only**
   unless you specifically need more disk. Expanding the disk is
   **irreversible**: you can never downgrade to a plan with a smaller disk
   afterwards. CPU+RAM-only rescales stay reversible.

**Procedure:**
|
||
|
||
```bash
|
||
# 1. On the server — stop the stack cleanly so containers shut down gracefully.
|
||
# (Do NOT force-off from the console; let docker stop properly.)
|
||
sudo poweroff # your SSH session will drop — expected
|
||
```
|
||
|
||
```text
|
||
# 2. In console.hetzner.cloud → Servers → ubuntu-4gb-nbg1-1 → Rescale:
|
||
# - wait until the server shows as "off"
|
||
# - select the target CAX plan (e.g. CAX21: 4 vCPU / 8 GB)
|
||
# - leave the option on "CPU and RAM only"
|
||
# - confirm the rescale (takes ~2 min)
|
||
# 3. Power the server back on.
|
||
```
|
||
|
||
```bash
|
||
# 4. After it boots (~1-2 min), SSH back in and verify:
|
||
nproc # vCPU count matches new plan
|
||
free -h # new RAM total; swap no longer maxed
|
||
docker compose -f docker-compose.yml --profile full ps # all containers healthy
|
||
docker stats --no-stream # real headroom now
|
||
```
|
||
|
||
Containers auto-start on boot via their restart policies, so you normally don't
need to bring anything up manually. Just confirm they came back healthy and the
public sites return 200 (`curl -sI https://rewardflow.lu`).

!!! warning "Arm capacity can be unavailable in-datacenter"
    Hetzner's Arm/Ampere (CAX) stock is limited and fluctuates per location. If
    the target CAX plans are greyed out with a tooltip like *"not available,
    please choose another location or type"*, it means there's **no Arm capacity
    in nbg1 (Nuremberg) right now**; nothing is wrong on your end. **Power the
    server back on immediately** (don't leave production down waiting) and
    **retry the rescale later**, since capacity comes and goes over hours.
    Rescaling cannot change the datacenter; if nbg1 stays dry for days, the only
    alternative is a snapshot → new server in another DC → restore (a full
    migration, to be avoided unless truly necessary).

**Post-rescale follow-ups:** once on a bigger box, bump the monitoring
containers' memory limits in `docker-compose.yml` to give them headroom (commit
to the repo and `git pull` on the server; don't hand-edit, since the file is
git-tracked):

- `cadvisor` `mem_limit: 192m` → `256m` (chronic flapping/restarts at 192m
  cause `TargetDown` floods)
- `alertmanager` `mem_limit: 32m` → `64m`

Then update the [Resource Budget](#resource-budget-4-gb-server) table to the new
totals, and lift any silences placed during the pre-rescale alert flood (see
the Alertmanager UI at `http://localhost:9093/#/silences`).

### Disk Maintenance (Docker Pruning)
|
||
|
||
The root filesystem fills over time because **Gitea CI rebuilds images on every
run**, leaving behind dangling image layers and (more significantly) Docker
**build cache**. Left unchecked this trips the `HostHighDiskUsage` alert
(threshold 80%). On a typical incident the split was ~14 GB unused images +
~9 GB build cache reclaimable on a 38 GB disk.

**Emergency manual prune** (run on the server when `HostHighDiskUsage` fires):
|
||
|
||
```bash
|
||
df -h /                     # check current usage
docker builder prune -f     # remove dangling build cache, the big one (add -a for all unused cache)
docker image prune -af      # remove all images not used by any container
docker system df            # confirm what was reclaimed
df -h /                     # verify usage dropped
```
|
||
|
||
This is non-destructive: running containers, volumes, and the database are
untouched. Pruned images are re-pulled/rebuilt on the next deploy.
|
||
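
To decide whether a prune is due without waiting for the alert, the usage check can be scripted. The helpers below are a sketch, not part of `scripts/deploy.sh`: `fs_used_pct` and `needs_prune` are hypothetical names, and treating "at or above 80%" as the trigger is an assumption based on the alert threshold quoted above.

```bash
# Hypothetical helpers: read filesystem usage and compare it with the
# 80% HostHighDiskUsage threshold mentioned above.
fs_used_pct() {
    # -P forces POSIX output so each filesystem stays on one line.
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

needs_prune() {
    local used="$1" threshold="${2:-80}"
    [ "$used" -ge "$threshold" ]
}
```

A guarded prune would then look like `needs_prune "$(fs_used_pct /)" && docker builder prune -f`.
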
**The proper (automated) way:** `scripts/deploy.sh` already prunes old images at
the end of every deploy:

```bash
|
||
docker image prune -f --filter "until=72h"
|
||
```
|
||
|
||
However, it does **not** prune build cache, which is the larger offender. To stop
unbounded growth, add a build-cache prune alongside it in
`scripts/deploy.sh` (keeps the last 7 days so CI stays fast):

```bash
|
||
# in scripts/deploy.sh, step 6 ("Clean up old Docker images"):
|
||
docker image prune -f --filter "until=72h" > /dev/null 2>&1 || true
|
||
docker builder prune -f --filter "until=168h" > /dev/null 2>&1 || true # <- add this
|
||
```
|
||
|
||
Because cleanup then runs on every deploy and lives in version control, there's
no host-level cron to remember. (A weekly `/etc/cron.weekly/docker-prune` is an
alternative, but the deploy-script approach is preferred because it's
version-controlled and scoped to this repo.)

### Offloading CI to a Separate Server (2a — recommended)
|
||
|
||
**Why:** the Gitea Actions runner (`act_runner`, systemd `gitea-runner.service`)
runs the CI jobs from `.gitea/workflows/ci.yml` (`ruff`, `pytest`, which spins
up its own postgres service container, and `validate`) **on the production
box**. Those jobs are the ~47% CPU spike on every push that trips
`HostHighCpuUsage` and competes with the app for RAM. Gitea *itself* (git
hosting) is light (~0% CPU, ~5% RAM); the **runner** is the resource hog.

Moving just the runner to a separate, cheap server eliminates the prod CPU
bursts with **no data migration, no DNS change, and no downtime**, often
removing the need for a rescale entirely. The runner box can be **x86** (it only
lints/tests; it doesn't need to match prod's Arm architecture) and stateless
(rebuildable in minutes), so a **CX22 (2 vCPU / 4 GB, ~3.79 EUR/mo)** is the
minimum and a **CX32 (4 vCPU / 8 GB, ~6.80 EUR/mo)** is comfortable for CI
bursts. x86 has no capacity wait (see "Why x86 is more abundant"; Arm/Ampere is
a limited pool).

**Steps:**
|
||
|
||
1. **Provision + harden** a new x86 server (Ubuntu 24.04): follow Steps 2–6
    (non-root user, SSH hardening, UFW, **Docker**; the runner executes jobs in
    containers, so Docker is required).
2. **Get a runner registration token** in Gitea: Site Administration → Actions →
    Runners → *Create new Runner* → copy the token.
3. **Install act_runner** (amd64 build for x86), matching the version in
    [Step 15](#step-15-gitea-actions-runner):

    ```bash
    mkdir -p ~/gitea-runner && cd ~/gitea-runner
    VERSION=0.2.13
    wget -O act_runner \
      "https://gitea.com/gitea/act_runner/releases/download/v${VERSION}/act_runner-${VERSION}-linux-amd64"
    chmod +x act_runner
    ```

4. **Register with the SAME labels** as the current runner. `ci.yml` uses
    `runs-on: ubuntu-latest`, so the label mapping must be replicated or jobs
    won't be picked up:

    ```bash
    ./act_runner register --no-interactive \
      --instance https://git.wizard.lu \
      --token <RUNNER_TOKEN> \
      --name ci-runner-1 \
      --labels 'ubuntu-latest:docker://docker.gitea.com/runner-images:ubuntu-latest,ubuntu-22.04:docker://docker.gitea.com/runner-images:ubuntu-22.04,ubuntu-20.04:docker://docker.gitea.com/runner-images:ubuntu-20.04'
    ```

5. **Generate config + install as a systemd service** (mirror prod's
    `gitea-runner.service`, adjusting `User`/paths for the new box):

    ```bash
    ./act_runner generate-config > config.yaml
    sudo tee /etc/systemd/system/gitea-runner.service >/dev/null <<'UNIT'
    [Unit]
    Description=Gitea Actions Runner
    After=network.target
    [Service]
    Type=simple
    User=samir
    WorkingDirectory=/home/samir/gitea-runner
    ExecStart=/home/samir/gitea-runner/act_runner daemon --config /home/samir/gitea-runner/config.yaml
    Restart=always
    RestartSec=10
    [Install]
    WantedBy=multi-user.target
    UNIT
    sudo systemctl daemon-reload && sudo systemctl enable --now gitea-runner.service
    ```

6. **Verify** the new runner shows **online/idle** in Gitea's Runners list.
7. **Smoke-test:** push a trivial commit to `master` and confirm the jobs land on
    `ci-runner-1` (not the prod runner), and the deploy still completes. The CD
    deploy step uses `appleboy/ssh-action` with the SSH key stored in **Gitea
    repo secrets** (not on the runner host), so the new runner picks it up
    automatically and there is **no key to copy**.
8. **Decommission the prod runner** once the new one is proven:

    ```bash
    # on the production box:
    sudo systemctl disable --now gitea-runner.service
    ```

    Optionally remove it from Gitea's Runners list. Watch prod `docker stats`
    during the next CI run; the CPU burst should be gone.
|
||
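
Because a missing label silently leaves jobs queued, it is worth checking the label string before registering. The function below is a minimal sketch: `has_runner_label` is a hypothetical helper that parses the same comma-separated `name:docker://image` string passed to `--labels` in step 4.

```bash
# Hypothetical helper: succeed if a comma-separated act_runner label string
# (the value passed to --labels) defines the given label name.
has_runner_label() {
    local wanted="$1" labels="$2" entry
    local IFS=','
    for entry in $labels; do
        # Each entry looks like "name:docker://image"; keep the part before
        # the first colon and compare it with the requested label.
        if [ "${entry%%:*}" = "$wanted" ]; then
            return 0
        fi
    done
    return 1
}
```

For this repo the important check is `has_runner_label ubuntu-latest "$LABELS"`, since that is the value `ci.yml` uses in `runs-on`.
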
!!! note "One smaller burst remains on prod"
    The deploy job still runs `docker compose up -d --build` **on prod** (via
    SSH), so the api image is still *built* on the production box, a smaller
    burst than the full CI suite. To remove that too, build images on the runner
    and have prod `pull` instead of `--build`: build → push to **Gitea's built-in
    container registry** → change `deploy.sh` from `--build` to `pull`. That's a
    larger CI rework (and while prod stays Arm, the x86 runner must build
    **arm64** images via `docker buildx build --platform linux/arm64`), so defer
    it unless the build burst alone is still a problem.
|
||
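For orientation, the registry-pull path described in the note could look roughly like the sketch below. It is an outline under assumptions, not the project's pipeline: the image name `orion-api` and the helper names are placeholders, the runner is assumed to have run `docker login git.wizard.lu` already, and the compose file on prod is assumed to reference the pushed image via `image:`.

```bash
# Assumed names: the registry host is the Gitea hostname from this guide; the
# image name and tag are placeholders, not values taken from the repository.
image_ref() {
    # Gitea's container registry addresses images as <host>/<owner>/<image>:<tag>
    printf '%s/%s/%s:%s\n' "$1" "$2" "$3" "$4"
}

build_and_push() {
    ref="$(image_ref git.wizard.lu sboulahtit orion-api "$1")"
    # Cross-build for the Arm production box and push in one step
    docker buildx build --platform linux/arm64 -t "$ref" --push .
}

# On the runner:  build_and_push "$(git rev-parse --short HEAD)"
# On prod, deploy.sh would then run `docker compose pull` followed by
# `docker compose up -d`, instead of `docker compose up -d --build`.
```

On an x86 runner the arm64 build runs under emulation, so expect it to be noticeably slower than a native build.
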
### Migrating Gitea to a Separate Server (2c)

**When:** after 2a, if you want full separation: production box = app only;
a separate box = Gitea + CI. This buys architectural cleanliness (a prod incident no
longer touches git/CI, and vice versa) and moves the `gitea` + `gitea-db`
containers off prod. **Trade-off:** it's a real data migration, and the new box
becomes **stateful and critical** (it is the source of truth and, if the runner is
co-located, the deploy path to prod), so it must be backed up, monitored, and
hardened like prod. Do it in a **planned maintenance window** (Gitea + CI are
unavailable during cutover). Co-locate it on the **same box as the 2a runner**.

Current Gitea layout (for reference): `~/gitea/docker-compose.yml` defines two
containers: `gitea` (`gitea/gitea:latest`, web on `127.0.0.1:3000`, git SSH on
host `2222`) and `gitea-db` (`postgres:15`). Data lives in two named volumes:
`gitea_gitea-data` (repos, LFS, config, actions artifacts) and
`gitea_gitea-db-data` (the postgres DB). Backups are under `~/backups/gitea/`.

!!! note "Backup coverage & rollback: read before you cut over"
    **What's already safe (code):** This Gitea instance hosts a *single* repo
    (`sboulahtit/orion`) with **no** issues, PRs, releases, wikis, LFS, or
    attachments, so a normal local clone is a **complete backup of all code
    history**. Before migrating, run `git fetch --all --tags` on your laptop (or
    keep a `git clone --mirror`) so every branch/tag is local. Worst case, you
    could recreate the repo from your laptop and `git push`, with zero code loss.

    **The one thing a clone does NOT cover: the 4 CI secrets.** Gitea Actions
    secrets are **write-only**: you cannot read their values back from the UI or
    API. The four (from `.gitea/workflows/ci.yml` → the `deploy` job) are:

    | Secret | Value | Sensitive? |
    |---|---|---|
    | `DEPLOY_HOST` | prod IP (`91.99.65.229`) | no (known) |
    | `DEPLOY_USER` | `samir` | no (known) |
    | `DEPLOY_PATH` | `~/apps/orion` | no (known) |
    | `DEPLOY_SSH_KEY` | **private** SSH deploy key | **yes** (the only real one) |

    So only `DEPLOY_SSH_KEY` matters, and its **public** half is already in
    prod's `~/.ssh/authorized_keys`. Two ways it's covered:

    1. **Automatic (primary path):** the proper restore preserves all four. The
       encrypted values live in the `secret` table (captured by `pg_dump`) and
       are decrypted by `SECRET_KEY` inside `app.ini` (which lives in the
       `gitea-data` volume). **You must restore the DB *and* the `gitea-data`
       volume from the *same* instance together**: the encrypted secrets are
       useless without their matching `SECRET_KEY`. Never restore one without
       the other.
    2. **Belt-and-suspenders (manual):** before cutover, confirm you still hold
       the `DEPLOY_SSH_KEY` *private* key off-box. If you ever rebuild from the
       local clone alone, re-add the four under *new Gitea → repo → Settings →
       Actions → Secrets*; the three known ones are trivial, and for the key
       either reuse the private key you saved or **regenerate**:
       `ssh-keygen -t ed25519 -f deploy_key`, append `deploy_key.pub` to prod's
       `~/.ssh/authorized_keys`, then paste `deploy_key` as the new
       `DEPLOY_SSH_KEY`.

    **One-shot backup (recommended right before cutover):** run
    `docker exec -u git -w /tmp gitea gitea dump -c /data/gitea/conf/app.ini`
    (`-u git` because Gitea refuses to run as root; the config path shown is the
    default in the official image), then copy the resulting `gitea-dump-*.zip`
    out of the container with `docker cp` and off the box. That single archive
    bundles repos + DB + config (`app.ini`/`SECRET_KEY`), so it inherently
    includes the encrypted secrets *and* the key to decrypt them, which makes it
    the cleanest restore artifact.

    **Rollback:** the migration keeps the old volumes intact (step 12 uses
    `docker compose down`, **not** `down -v`). If anything goes sideways,
    re-point `git.wizard.lu` DNS back to the prod IP and `docker compose up -d`
    the old stack; it's untouched. Keep the old volumes until the new box is
    fully verified.

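The "never restore one without the other" rule can be enforced with a small guard before any restore. This is a minimal sketch: `backup_pair_ok` and the manifest name `gitea-backup.sha256` are hypothetical, while the two archive names are the ones used in the migration steps of this section.

```bash
# Hypothetical helper: succeeds only if both halves of the backup exist and are
# non-empty, so a DB dump is never restored without its matching data volume.
backup_pair_ok() {
    dir="${1:-.}"
    for f in gitea-db.sql gitea-data.tgz; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            return 1
        fi
    done
}

# Old box, right after the dump:
#   (cd /tmp && sha256sum gitea-db.sql gitea-data.tgz > gitea-backup.sha256)
# New box, after the transfer (copy the manifest along with the archives):
#   backup_pair_ok . && sha256sum -c gitea-backup.sha256
```

The checksum manifest also catches a truncated `scp`/`rsync` transfer, which a plain existence check would miss.
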
**Steps:**

1. **Stage the stack on the new box.** Copy `~/gitea/docker-compose.yml` over.
    **Reuse the exact existing env values** (especially `GITEA__database__PASSWD`
    / `POSTGRES_PASSWORD`: copy them from the current file; do not regenerate, or
    the restored DB won't authenticate). Keep `ROOT_URL`/`DOMAIN`/`SSH_DOMAIN`
    as `git.wizard.lu`.
2. **Announce downtime / stop writes** on the old Gitea.
3. **Dump the data on the old box:**

    ```bash
    cd ~/gitea
    docker compose stop gitea   # quiesce first so the DB dump and the data volume match
    docker exec gitea-db pg_dump -U gitea gitea > /tmp/gitea-db.sql   # gitea-db keeps running
    docker run --rm -v gitea_gitea-data:/data -v /tmp:/backup alpine \
        tar czf /backup/gitea-data.tgz -C /data .
    ```

4. **Transfer** `/tmp/gitea-db.sql` + `/tmp/gitea-data.tgz` to the new box
    (`scp`/`rsync`).
5. **Restore the DB** on the new box:

    ```bash
    docker compose up -d gitea-db   # wait until healthy
    docker exec -i gitea-db psql -U gitea -d gitea < gitea-db.sql
    ```

6. **Restore the data volume** on the new box:

    ```bash
    # run from the directory that holds gitea-data.tgz
    docker run --rm -v gitea_gitea-data:/data -v "$PWD":/backup alpine \
        tar xzf /backup/gitea-data.tgz -C /data
    ```

7. **Start Gitea:** `docker compose up -d gitea` and check
    `docker compose logs gitea`.
8. **Firewall:** open `2222/tcp` (git SSH) on the new box's UFW; keep `3000`
    bound to localhost (Caddy proxies it).
9. **Reverse proxy + SSL** on the new box: install Caddy (Step 14) and add the
    `git.wizard.lu` block (same as prod):

    ```caddy
    git.wizard.lu {
        tls {
            issuer acme
        }
        reverse_proxy localhost:3000
    }
    ```

10. **DNS cutover:** point `git.wizard.lu` A/AAAA at the new box's IP (TTL 300 →
    ~5 min). Once propagated, Caddy on the new box auto-issues the TLS cert.
11. **No remote/runner URL changes needed:** the hostname `git.wizard.lu`
    stays the same (only the IP moved), so your `gitea` git remote and the
    runner's `--instance https://git.wizard.lu` keep working after DNS flips.
12. **Decommission Gitea on prod** (keep volumes + backups for a rollback
    window):

    ```bash
    cd ~/gitea && docker compose down   # leaves volumes intact
    ```

    Remove the `git.wizard.lu` block from prod's Caddyfile and reload Caddy;
    optionally close `2222/tcp` on prod's UFW.
13. **Set up backups on the new box** (Step 17), since it's now stateful/critical.
14. **Verify:** web UI loads with valid SSL, clone/push over SSH (`:2222`)
    works, a push triggers CI, and repos/actions history are intact. (See the
    "Backup coverage & rollback" callout above if anything needs reverting.)

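The DNS and verification steps above can be partly scripted. The sketch below is illustrative only: `resolves_to` and `verify_cutover` are hypothetical helpers, the new box's IP is something you supply, and the SSH user `git` is an assumption based on the default Gitea container setup. It does not check that a push triggers CI; that part remains manual.

```bash
# Hypothetical helper: reads `dig +short` output on stdin and succeeds if the
# expected address appears as one of the answers (exact, whole-line match).
resolves_to() {
    grep -qxF "$1"
}

verify_cutover() {
    new_ip="$1"   # IPv4 address of the new box (placeholder, supply your own)
    dig +short git.wizard.lu A | resolves_to "$new_ip" \
        || { echo "DNS has not flipped yet" >&2; return 1; }
    # -f makes curl fail on HTTP errors; a certificate problem also exits non-zero
    curl -fsS -o /dev/null https://git.wizard.lu/ \
        || { echo "HTTPS check failed" >&2; return 1; }
    # Lists remote refs over git SSH on port 2222 for the repo named in this guide
    git ls-remote ssh://git@git.wizard.lu:2222/sboulahtit/orion.git >/dev/null \
        || { echo "git SSH check failed" >&2; return 1; }
    echo "cutover looks good"
}

# Usage:  verify_cutover <new-box-ip>
```
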
### View logs

```bash
# Follow all logs in real-time
docker compose --profile full logs -f

# Follow a specific service
docker compose --profile full logs -f api
docker compose --profile full logs -f celery-worker
docker compose --profile full logs -f celery-beat
docker compose --profile full logs -f flower

# View last N lines (useful for debugging crashes)
docker compose --profile full logs --tail=50 api
docker compose --profile full logs --tail=100 celery-worker

# Filter logs for errors (case-insensitive, extended regex)
docker compose --profile full logs api | grep -iE "error|exception|failed"
```

### Check container status

```bash
# Overview of all containers (health, uptime, ports)
docker compose --profile full ps

# Watch for containers stuck in "Restarting", which indicates a crash loop
# Healthy containers show: Up Xs (healthy)
```

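Spotting a crash loop by eye gets tedious with many containers. A minimal sketch of a filter follows; the helper name `unhealthy_containers` is hypothetical, and it relies only on the status text Docker prints (`Restarting ...`, `(unhealthy)`).

```bash
# Hypothetical helper: reads "name|status" lines on stdin and prints the names
# of containers that are restarting or whose healthcheck reports unhealthy.
unhealthy_containers() {
    awk -F '|' '$2 ~ /^Restarting/ || $2 ~ /\(unhealthy\)/ { print $1 }'
}

# Usage:
#   docker ps --format '{{.Names}}|{{.Status}}' | unhealthy_containers
```

An empty result means nothing is restarting or failing its healthcheck at that moment; containers without a healthcheck are only caught when they are restarting.
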
### Restart services

```bash
# Restart a single service
docker compose --profile full restart api

# Restart everything
docker compose --profile full restart

# Full rebuild (after code changes)
docker compose --profile full up -d --build
```

### Quick access URLs

After Caddy is configured:

| Service | URL |
|---|---|
| Main Platform | `https://wizard.lu` |
| API Swagger docs | `https://api.wizard.lu/docs` |
| API ReDoc | `https://api.wizard.lu/redoc` |
| Admin panel | `https://wizard.lu/admin/login` |
| Health check | `https://api.wizard.lu/health` |
| Prometheus metrics | `https://api.wizard.lu/metrics` |
| Gitea | `https://git.wizard.lu` |
| Flower | `https://flower.wizard.lu` |
| Grafana | `https://grafana.wizard.lu` |
| OMS Platform | `https://omsflow.lu` |
| Loyalty+ Platform | `https://rewardflow.lu` |

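A quick sweep over some of these URLs can confirm that Caddy and the services behind it respond. This is a rough sketch: `check_urls` is a hypothetical helper, and it treats only HTTP 200 as healthy, so endpoints that redirect or require authentication will be reported as failing and should be left out or handled separately.

```bash
# Hypothetical helper: prints "<status> <url>" for each URL given and returns
# non-zero if any of them did not answer with HTTP 200.
check_urls() {
    rc=0
    for url in "$@"; do
        code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")"
        echo "$code $url"
        [ "$code" = "200" ] || rc=1
    done
    return "$rc"
}

# Usage:
#   check_urls https://api.wizard.lu/health https://git.wizard.lu
```
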
Direct IP access (temporary, until firewall rules are removed):

| Service | URL |
|---|---|
| API | `http://91.99.65.229:8001/docs` |
| Gitea | `http://91.99.65.229:3000` |
| Flower | `http://91.99.65.229:5555` |