diff --git a/docs/architecture/observability.md b/docs/architecture/observability.md index 5935da81..ee1a8405 100644 --- a/docs/architecture/observability.md +++ b/docs/architecture/observability.md @@ -2,6 +2,38 @@ The Orion platform includes a comprehensive observability framework for monitoring application health, collecting metrics, and tracking errors. This is part of the Framework Layer - infrastructure that modules depend on. +## Production Stack + +The full monitoring stack runs as Docker containers alongside the application: + +| Container | Image | Port | Purpose | +|---|---|---|---| +| prometheus | `prom/prometheus` | 9090 (localhost) | Metrics storage, 15-day retention | +| grafana | `grafana/grafana` | 3001 (localhost) | Dashboards at `https://grafana.wizard.lu` | +| node-exporter | `prom/node-exporter` | 9100 (localhost) | Host CPU/RAM/disk metrics | +| cadvisor | `gcr.io/cadvisor/cadvisor` | 8080 (localhost) | Per-container resource metrics | + +All monitoring containers run under `profiles: [full]` in `docker-compose.yml` with memory limits (256 + 192 + 64 + 128 = 640 MB total). 
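The memory budget quoted above can be sanity-checked with a few lines of Python. This is a minimal sketch: the pairing of each limit with a container follows the table order and is an assumption, since the source lists only the four values and their sum.

```python
from typing import Mapping

# Memory limits in MB. The mapping of values to containers is an assumption
# based on table order; only the four numbers and the total come from the docs.
MEMORY_LIMITS_MB = {
    "prometheus": 256,
    "grafana": 192,
    "node-exporter": 64,
    "cadvisor": 128,
}


def total_memory_mb(limits: Mapping[str, int]) -> int:
    """Return the combined memory limit of all containers, in MB."""
    return sum(limits.values())


def largest_consumer(limits: Mapping[str, int]) -> str:
    """Return the name of the container with the highest memory limit."""
    return max(limits, key=limits.get)


if __name__ == "__main__":
    print(f"total: {total_memory_mb(MEMORY_LIMITS_MB)} MB")
    print(f"largest: {largest_consumer(MEMORY_LIMITS_MB)}")
```

Re-running a check like this after changing a limit in `docker-compose.yml` helps keep the documented total in step with the configuration.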
+ +``` +┌──────────────┐ scrape ┌─────────────────┐ +│ Prometheus │◄────────────────│ Orion API │ /metrics +│ :9090 │◄────────────────│ node-exporter │ :9100 +│ │◄────────────────│ cAdvisor │ :8080 +└──────┬───────┘ └─────────────────┘ + │ query +┌──────▼───────┐ +│ Grafana │──── https://grafana.wizard.lu +│ :3001 │ +└──────────────┘ +``` + +Configuration files: + +- `monitoring/prometheus.yml` — scrape targets (orion-api, node-exporter, cadvisor, self) +- `monitoring/grafana/provisioning/datasources/datasource.yml` — auto-provisions Prometheus +- `monitoring/grafana/provisioning/dashboards/dashboard.yml` — file-based dashboard provider + ## Overview ``` @@ -326,47 +358,47 @@ init_observability( ### Application Lifespan +Observability is initialized in `app/core/lifespan.py` and the health router is mounted in `main.py`: + ```python -# main.py -from contextlib import asynccontextmanager -from fastapi import FastAPI -from app.core.observability import ( - init_observability, - shutdown_observability, - health_router, - register_module_health_checks, -) +# app/core/lifespan.py +from app.core.config import settings +from app.core.observability import init_observability, shutdown_observability @asynccontextmanager async def lifespan(app: FastAPI): # Startup init_observability( - enable_metrics=True, - sentry_dsn=settings.SENTRY_DSN, - environment=settings.ENVIRONMENT, - flower_url=settings.FLOWER_URL, - grafana_url=settings.GRAFANA_URL, + enable_metrics=settings.enable_metrics, + sentry_dsn=settings.sentry_dsn, + environment=settings.sentry_environment, + flower_url=settings.flower_url, + grafana_url=settings.grafana_url, ) - register_module_health_checks() - yield - # Shutdown shutdown_observability() - -app = FastAPI(lifespan=lifespan) -app.include_router(health_router) ``` +```python +# main.py +from app.core.observability import health_router +app.include_router(health_router) # /metrics, /health/live, /health/ready, /health/tools +``` + +Note: `/health` is defined 
separately in `main.py` with a richer response (DB check, feature list, docs links). The `health_router` provides the Kubernetes-style probes and Prometheus endpoint. + ### Environment Variables -| Variable | Description | Default | -|----------|-------------|---------| -| `SENTRY_DSN` | Sentry DSN for error tracking | None (disabled) | -| `ENVIRONMENT` | Environment name | "development" | -| `ENABLE_METRICS` | Enable Prometheus metrics | False | -| `FLOWER_URL` | Flower dashboard URL | None | -| `GRAFANA_URL` | Grafana dashboard URL | None | +| Variable | Config field | Description | Default | +|----------|-------------|-------------|---------| +| `ENABLE_METRICS` | `enable_metrics` | Enable Prometheus metrics collection | `False` | +| `GRAFANA_URL` | `grafana_url` | Grafana dashboard URL | `https://grafana.wizard.lu` | +| `GRAFANA_ADMIN_USER` | — | Grafana admin username (docker-compose only) | `admin` | +| `GRAFANA_ADMIN_PASSWORD` | — | Grafana admin password (docker-compose only) | `changeme` | +| `SENTRY_DSN` | `sentry_dsn` | Sentry DSN for error tracking | `None` (disabled) | +| `SENTRY_ENVIRONMENT` | `sentry_environment` | Environment name for Sentry | `development` | +| `FLOWER_URL` | `flower_url` | Flower dashboard URL | `http://localhost:5555` | ## Kubernetes Integration @@ -424,6 +456,7 @@ spec: ## Related Documentation +- [Hetzner Server Setup — Step 18](../deployment/hetzner-server-setup.md#step-18-monitoring-observability) - Production monitoring deployment - [Module System](module-system.md) - Module health check integration - [Background Tasks](background-tasks.md) - Celery monitoring with Flower - [Deployment](../deployment/index.md) - Production deployment with monitoring diff --git a/docs/deployment/docker.md b/docs/deployment/docker.md index 99826c0c..793f6232 100644 --- a/docs/deployment/docker.md +++ b/docs/deployment/docker.md @@ -36,11 +36,20 @@ make docker-down ### Current Services -| Service | Port | Purpose | -|---------|------|---------| 
-| db | 5432 | PostgreSQL database | -| redis | 6379 | Cache and queue broker | -| api | 8000 | FastAPI application | +| Service | Port | Profile | Purpose | +|---------|------|---------|---------| +| db | 5432 | (default) | PostgreSQL database | +| redis | 6379 | (default) | Cache and queue broker | +| api | 8000 | full | FastAPI application | +| celery-worker | — | full | Background task processing | +| celery-beat | — | full | Periodic task scheduler | +| flower | 5555 | full | Celery monitoring dashboard | +| prometheus | 9090 | full | Metrics storage (15-day retention) | +| grafana | 3001 | full | Dashboards (`https://grafana.wizard.lu`) | +| node-exporter | 9100 | full | Host CPU/RAM/disk metrics | +| cadvisor | 8080 | full | Per-container resource metrics | + +Use `docker compose --profile full up -d` to start all services, or `docker compose up -d` for just db + redis (local development). --- @@ -368,44 +377,60 @@ docker compose -f docker-compose.prod.yml up -d api ## Backups -### Database Backup +Automated backup scripts run a daily `pg_dump` with rotation and an optional offsite sync to Cloudflare R2: ```bash -# Create backup -docker compose -f docker-compose.prod.yml exec db pg_dump -U orion_user orion_db | gzip > backup_$(date +%Y%m%d).sql.gz +# Run backup (Orion + Gitea databases) +bash scripts/backup.sh -# Restore backup -gunzip -c backup_20240115.sql.gz | docker compose -f docker-compose.prod.yml exec -T db psql -U orion_user -d orion_db +# Run backup with R2 upload +bash scripts/backup.sh --upload + +# Restore from backup +bash scripts/restore.sh orion ~/backups/orion/daily/orion_20260214_030000.sql.gz +bash scripts/restore.sh gitea ~/backups/gitea/daily/gitea_20260214_030000.sql.gz ``` -### Volume Backup +Backups are stored in `~/backups/{orion,gitea}/{daily,weekly}/`. Daily backups are kept for 7 days and weekly backups for 4 weeks. A systemd timer runs the backup daily at 03:00. 
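The retention rule above can be sketched in Python as follows. This illustrates the idea only and is not the contents of `scripts/backup.sh`: the filename pattern is inferred from the restore examples, and treating a backup as expired only when it is strictly older than the window is an assumption.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, List, Optional

# Pattern inferred from names such as orion_20260214_030000.sql.gz (assumption).
BACKUP_NAME = re.compile(r"^(?P<db>[a-z]+)_(?P<stamp>\d{8}_\d{6})\.sql\.gz$")

DAILY_RETENTION = timedelta(days=7)    # "7-day daily" retention
WEEKLY_RETENTION = timedelta(weeks=4)  # "4-week weekly" retention


def backup_time(filename: str) -> Optional[datetime]:
    """Parse the timestamp embedded in a backup filename, or return None."""
    match = BACKUP_NAME.match(filename)
    if match is None:
        return None
    return datetime.strptime(match.group("stamp"), "%Y%m%d_%H%M%S")


def expired(filenames: Iterable[str], now: datetime, keep: timedelta) -> List[str]:
    """Return backups strictly older than the retention window, oldest first.

    Files that do not match the backup naming pattern are ignored.
    """
    stale = []
    for name in filenames:
        stamp = backup_time(name)
        if stamp is not None and now - stamp > keep:
            stale.append((stamp, name))
    return [name for _, name in sorted(stale)]
```

For example, with `now` set to 2026-02-14 03:00, a daily backup stamped `20260206_030000` is eight days old and is reported as expired, while one stamped `20260207_030000` is exactly seven days old and is kept.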
+ +See [Hetzner Server Setup — Step 17](hetzner-server-setup.md#step-17-backups) for full setup instructions. + +### Manual Database Backup ```bash -# Backup all volumes -docker run --rm \ - -v orion_postgres_data:/data \ - -v $(pwd)/backups:/backup \ - alpine tar czf /backup/postgres_$(date +%Y%m%d).tar.gz /data +# One-off backup +docker compose exec db pg_dump -U orion_user orion_db | gzip > backup_$(date +%Y%m%d).sql.gz + +# Restore +gunzip -c backup_20240115.sql.gz | docker compose exec -T db psql -U orion_user -d orion_db ``` --- ## Monitoring +The monitoring stack (Prometheus, Grafana, node-exporter, cAdvisor) runs under `profiles: [full]`. See [Hetzner Server Setup — Step 18](hetzner-server-setup.md#step-18-monitoring-observability) for full setup and [Observability Framework](../architecture/observability.md) for the application-level metrics architecture. + ### Resource Usage ```bash -docker stats +docker stats --no-stream ``` ### Health Checks ```bash # Check service health -docker compose -f docker-compose.prod.yml ps +docker compose --profile full ps -# Test API health +# API health curl -s http://localhost:8000/health | jq + +# Prometheus metrics +curl -s http://localhost:8000/metrics | head -5 + +# Prometheus targets +curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep health ``` --- diff --git a/docs/deployment/hetzner-server-setup.md b/docs/deployment/hetzner-server-setup.md index b894c1b2..fef4a6e2 100644 --- a/docs/deployment/hetzner-server-setup.md +++ b/docs/deployment/hetzner-server-setup.md @@ -444,6 +444,8 @@ Before setting up Caddy, point your domain's DNS to the server. |---|---|---|---| | A | `@` | `91.99.65.229` | 300 | | A | `www` | `91.99.65.229` | 300 | +| AAAA | `@` | `2a01:4f8:1c1a:b39c::1` | 300 | +| AAAA | `www` | `2a01:4f8:1c1a:b39c::1` | 300 | ### rewardflow.lu (Loyalty+ Platform) — TODO @@ -451,6 +453,8 @@ Before setting up Caddy, point your domain's DNS to the server. 
|---|---|---|---| | A | `@` | `91.99.65.229` | 300 | | A | `www` | `91.99.65.229` | 300 | +| AAAA | `@` | `2a01:4f8:1c1a:b39c::1` | 300 | +| AAAA | `www` | `2a01:4f8:1c1a:b39c::1` | 300 | ### IPv6 (AAAA) Records — TODO diff --git a/docs/deployment/production.md b/docs/deployment/production.md index 0dce6b6f..1ae0d09f 100644 --- a/docs/deployment/production.md +++ b/docs/deployment/production.md @@ -328,7 +328,10 @@ sudo systemctl restart orion orion-celery ## Backups -### Database Backup Script +!!! tip "Docker deployment" + For Docker-based deployments, use the automated backup scripts (`scripts/backup.sh` and `scripts/restore.sh`) with a systemd timer. See [Hetzner Server Setup — Step 17](hetzner-server-setup.md#step-17-backups). + +### Database Backup Script (VPS without Docker) ```bash sudo nano /home/orion/backup.sh @@ -373,6 +376,9 @@ sudo -u orion crontab -e ## Monitoring +!!! tip "Docker deployment" + For Docker-based deployments, a full Prometheus + Grafana + node-exporter + cAdvisor stack is included in `docker-compose.yml`. See [Hetzner Server Setup — Step 18](hetzner-server-setup.md#step-18-monitoring-observability) and [Observability Framework](../architecture/observability.md). + ### Basic Health Check ```bash