docs: update observability and deployment docs to match production stack
Some checks failed
Update observability.md with production container table, actual init code, and correct env var names. Update docker.md with full 10-service table and backup/monitoring cross-references. Add explicit AAAA records to DNS tables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -2,6 +2,38 @@

The Orion platform includes a comprehensive observability framework for monitoring application health, collecting metrics, and tracking errors. This is part of the Framework Layer - infrastructure that modules depend on.

## Production Stack

The full monitoring stack runs as Docker containers alongside the application:

| Container | Image | Port | Purpose |
|---|---|---|---|
| prometheus | `prom/prometheus` | 9090 (localhost) | Metrics storage, 15-day retention |
| grafana | `grafana/grafana` | 3001 (localhost) | Dashboards at `https://grafana.wizard.lu` |
| node-exporter | `prom/node-exporter` | 9100 (localhost) | Host CPU/RAM/disk metrics |
| cadvisor | `gcr.io/cadvisor/cadvisor` | 8080 (localhost) | Per-container resource metrics |

All monitoring containers run under `profiles: [full]` in `docker-compose.yml` with memory limits (256 + 192 + 64 + 128 = 640 MB total).

```
┌──────────────┐     scrape      ┌─────────────────┐
│  Prometheus  │◄────────────────│  Orion API      │ /metrics
│  :9090       │◄────────────────│  node-exporter  │ :9100
│              │◄────────────────│  cAdvisor       │ :8080
└──────┬───────┘                 └─────────────────┘
       │ query
┌──────▼───────┐
│  Grafana     │──── https://grafana.wizard.lu
│  :3001       │
└──────────────┘
```

Configuration files:

- `monitoring/prometheus.yml` — scrape targets (orion-api, node-exporter, cadvisor, self)
- `monitoring/grafana/provisioning/datasources/datasource.yml` — auto-provisions Prometheus
- `monitoring/grafana/provisioning/dashboards/dashboard.yml` — file-based dashboard provider

## Overview

```
@@ -326,47 +358,47 @@ init_observability(

### Application Lifespan

Observability is initialized in `app/core/lifespan.py` and the health router is mounted in `main.py`:

```python
# main.py
from contextlib import asynccontextmanager
from fastapi import FastAPI
from app.core.observability import (
    init_observability,
    shutdown_observability,
    health_router,
    register_module_health_checks,
)
# app/core/lifespan.py
from app.core.config import settings
from app.core.observability import init_observability, shutdown_observability

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup
    init_observability(
        enable_metrics=True,
        sentry_dsn=settings.SENTRY_DSN,
        environment=settings.ENVIRONMENT,
        flower_url=settings.FLOWER_URL,
        grafana_url=settings.GRAFANA_URL,
        enable_metrics=settings.enable_metrics,
        sentry_dsn=settings.sentry_dsn,
        environment=settings.sentry_environment,
        flower_url=settings.flower_url,
        grafana_url=settings.grafana_url,
    )
    register_module_health_checks()

    yield

    # Shutdown
    shutdown_observability()

app = FastAPI(lifespan=lifespan)
app.include_router(health_router)
```

```python
# main.py
from app.core.observability import health_router
app.include_router(health_router) # /metrics, /health/live, /health/ready, /health/tools
```

Note: `/health` is defined separately in `main.py` with a richer response (DB check, feature list, docs links). The `health_router` provides the Kubernetes-style probes and Prometheus endpoint.
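
As a rough illustration of how the routes named above could be exercised from a script, the sketch below builds the probe URLs and records the HTTP status of each one. The paths come from the note and the `include_router` comment; the base URL `http://localhost:8000` is the one used by the `curl` health checks in the deployment docs. The helper names `probe_urls` and `probe` are hypothetical and are not part of the Orion codebase.

```python
from typing import Dict, List, Optional
from urllib.error import HTTPError, URLError
from urllib.parse import urljoin
from urllib.request import urlopen

# /health is defined in main.py; the rest are served by health_router.
PROBE_PATHS = ["/health", "/health/live", "/health/ready", "/health/tools", "/metrics"]


def probe_urls(base_url: str) -> List[str]:
    """Return the full URL for every probe path under base_url."""
    root = base_url.rstrip("/") + "/"
    return [urljoin(root, path.lstrip("/")) for path in PROBE_PATHS]


def probe(base_url: str, timeout: float = 2.0) -> Dict[str, Optional[int]]:
    """Map each probe URL to its HTTP status, or None if it is unreachable."""
    results: Dict[str, Optional[int]] = {}
    for url in probe_urls(base_url):
        try:
            with urlopen(url, timeout=timeout) as response:
                results[url] = response.status
        except HTTPError as exc:
            # The server answered, but with an error status (for example 503).
            results[url] = exc.code
        except (URLError, OSError):
            # Connection refused, DNS failure, timeout, and similar.
            results[url] = None
    return results
```

Calling `probe("http://localhost:8000")` against a running API returns one entry per endpoint; a `None` value means the request never reached the application.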

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `SENTRY_DSN` | Sentry DSN for error tracking | None (disabled) |
| `ENVIRONMENT` | Environment name | "development" |
| `ENABLE_METRICS` | Enable Prometheus metrics | False |
| `FLOWER_URL` | Flower dashboard URL | None |
| `GRAFANA_URL` | Grafana dashboard URL | None |
| Variable | Config field | Description | Default |
|----------|-------------|-------------|---------|
| `ENABLE_METRICS` | `enable_metrics` | Enable Prometheus metrics collection | `False` |
| `GRAFANA_URL` | `grafana_url` | Grafana dashboard URL | `https://grafana.wizard.lu` |
| `GRAFANA_ADMIN_USER` | — | Grafana admin username (docker-compose only) | `admin` |
| `GRAFANA_ADMIN_PASSWORD` | — | Grafana admin password (docker-compose only) | `changeme` |
| `SENTRY_DSN` | `sentry_dsn` | Sentry DSN for error tracking | `None` (disabled) |
| `SENTRY_ENVIRONMENT` | `sentry_environment` | Environment name for Sentry | `development` |
| `FLOWER_URL` | `flower_url` | Flower dashboard URL | `http://localhost:5555` |
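
To make the variable-to-field mapping in the newer table concrete, here is a minimal sketch that reads the five application-level variables and applies the listed defaults. It uses only the standard library; the real `settings` object in `app.core.config` is not shown in this change and may be built differently. The class name `ObservabilitySettings` and the helper `_as_bool` are hypothetical. `GRAFANA_ADMIN_USER` and `GRAFANA_ADMIN_PASSWORD` are left out because the table marks them as docker-compose only.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(raw: str) -> bool:
    """Interpret common truthy spellings; anything else is False."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ObservabilitySettings:
    # Field names and defaults mirror the "Config field" and "Default" columns.
    enable_metrics: bool = False
    grafana_url: str = "https://grafana.wizard.lu"
    sentry_dsn: Optional[str] = None
    sentry_environment: str = "development"
    flower_url: str = "http://localhost:5555"

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "ObservabilitySettings":
        """Build settings from an environment mapping, falling back to defaults."""
        return cls(
            enable_metrics=_as_bool(env.get("ENABLE_METRICS", "false")),
            grafana_url=env.get("GRAFANA_URL", cls.grafana_url),
            # An unset or empty DSN leaves Sentry disabled.
            sentry_dsn=env.get("SENTRY_DSN") or None,
            sentry_environment=env.get("SENTRY_ENVIRONMENT", cls.sentry_environment),
            flower_url=env.get("FLOWER_URL", cls.flower_url),
        )
```

Passing an explicit mapping, for example `ObservabilitySettings.from_env({"ENABLE_METRICS": "true"})`, keeps the sketch easy to exercise without touching the process environment.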

## Kubernetes Integration

@@ -424,6 +456,7 @@ spec:

## Related Documentation

- [Hetzner Server Setup — Step 18](../deployment/hetzner-server-setup.md#step-18-monitoring-observability) - Production monitoring deployment
- [Module System](module-system.md) - Module health check integration
- [Background Tasks](background-tasks.md) - Celery monitoring with Flower
- [Deployment](../deployment/index.md) - Production deployment with monitoring

@@ -36,11 +36,20 @@ make docker-down

### Current Services

| Service | Port | Purpose |
|---------|------|---------|
| db | 5432 | PostgreSQL database |
| redis | 6379 | Cache and queue broker |
| api | 8000 | FastAPI application |
| Service | Port | Profile | Purpose |
|---------|------|---------|---------|
| db | 5432 | (default) | PostgreSQL database |
| redis | 6379 | (default) | Cache and queue broker |
| api | 8000 | full | FastAPI application |
| celery-worker | — | full | Background task processing |
| celery-beat | — | full | Scheduled task scheduler |
| flower | 5555 | full | Celery monitoring dashboard |
| prometheus | 9090 | full | Metrics storage (15-day retention) |
| grafana | 3001 | full | Dashboards (`https://grafana.wizard.lu`) |
| node-exporter | 9100 | full | Host CPU/RAM/disk metrics |
| cadvisor | 8080 | full | Per-container resource metrics |

Use `docker compose --profile full up -d` to start all services, or `docker compose up -d` for just db + redis (local development).
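
The profile rule behind those two commands can be modelled in a few lines: a service with no profile always starts, while a service assigned to a profile starts only when that profile is active. The sketch below encodes the Profile column of the newer table; the names `SERVICES` and `started_services` are illustrative, and the function only mirrors the selection rule, it does not call Docker.

```python
from typing import Dict, List, Optional, Set

# Service -> profile, copied from the table; None stands for "(default)".
SERVICES: Dict[str, Optional[str]] = {
    "db": None,
    "redis": None,
    "api": "full",
    "celery-worker": "full",
    "celery-beat": "full",
    "flower": "full",
    "prometheus": "full",
    "grafana": "full",
    "node-exporter": "full",
    "cadvisor": "full",
}


def started_services(active_profiles: Set[str]) -> List[str]:
    """Return the services Compose would start for the given active profiles."""
    return [
        name
        for name, profile in SERVICES.items()
        if profile is None or profile in active_profiles
    ]
```

With no active profile the result is `db` and `redis`; with `{"full"}` it is all ten services.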

---

@@ -368,44 +377,60 @@ docker compose -f docker-compose.prod.yml up -d api

## Backups

### Database Backup
Automated backup scripts handle daily pg_dump with rotation and optional Cloudflare R2 offsite sync:

```bash
# Create backup
docker compose -f docker-compose.prod.yml exec db pg_dump -U orion_user orion_db | gzip > backup_$(date +%Y%m%d).sql.gz
# Run backup (Orion + Gitea databases)
bash scripts/backup.sh

# Restore backup
gunzip -c backup_20240115.sql.gz | docker compose -f docker-compose.prod.yml exec -T db psql -U orion_user -d orion_db
# Run backup with R2 upload
bash scripts/backup.sh --upload

# Restore from backup
bash scripts/restore.sh orion ~/backups/orion/daily/orion_20260214_030000.sql.gz
bash scripts/restore.sh gitea ~/backups/gitea/daily/gitea_20260214_030000.sql.gz
```

### Volume Backup
Backups are stored in `~/backups/{orion,gitea}/{daily,weekly}/` with 7-day daily and 4-week weekly retention. A systemd timer runs the backup daily at 03:00.
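
The retention rule can be sketched as a pure function over file names. The example below assumes the `<name>_<YYYYMMDD>_<HHMMSS>.sql.gz` pattern seen in the restore commands and treats a backup as expired once it is older than the retention window. It is not the logic of `scripts/backup.sh`, which is not shown in this change; `backup_timestamp` and `expired` are hypothetical helpers.

```python
from datetime import datetime, timedelta
from typing import Iterable, List

DAILY_RETENTION = timedelta(days=7)
WEEKLY_RETENTION = timedelta(weeks=4)


def backup_timestamp(filename: str) -> datetime:
    """Parse 'orion_20260214_030000.sql.gz' into a datetime."""
    stem = filename.split(".", 1)[0]
    _, date_part, time_part = stem.rsplit("_", 2)
    return datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")


def expired(filenames: Iterable[str], now: datetime, keep: timedelta) -> List[str]:
    """Return the backups whose timestamp is older than now - keep, sorted by name."""
    cutoff = now - keep
    return sorted(name for name in filenames if backup_timestamp(name) < cutoff)
```

Running `expired(...)` with `DAILY_RETENTION` over the contents of a `daily/` directory, and with `WEEKLY_RETENTION` over `weekly/`, yields the files a rotation step would delete under these assumptions.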

See [Hetzner Server Setup — Step 17](hetzner-server-setup.md#step-17-backups) for full setup instructions.

### Manual Database Backup

```bash
# Backup all volumes
docker run --rm \
  -v orion_postgres_data:/data \
  -v $(pwd)/backups:/backup \
  alpine tar czf /backup/postgres_$(date +%Y%m%d).tar.gz /data
# One-off backup
docker compose exec db pg_dump -U orion_user orion_db | gzip > backup_$(date +%Y%m%d).sql.gz

# Restore
gunzip -c backup_20240115.sql.gz | docker compose exec -T db psql -U orion_user -d orion_db
```

---

## Monitoring

The monitoring stack (Prometheus, Grafana, node-exporter, cAdvisor) runs under `profiles: [full]`. See [Hetzner Server Setup — Step 18](hetzner-server-setup.md#step-18-monitoring-observability) for full setup and [Observability Framework](../architecture/observability.md) for the application-level metrics architecture.

### Resource Usage

```bash
docker stats
docker stats --no-stream
```

### Health Checks

```bash
# Check service health
docker compose -f docker-compose.prod.yml ps
docker compose --profile full ps

# Test API health
# API health
curl -s http://localhost:8000/health | jq

# Prometheus metrics
curl -s http://localhost:8000/metrics | head -5

# Prometheus targets
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep health
```
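
Grepping the targets output for `health` works for a quick look. For scripted checks, a small parser over the same JSON may be easier to act on. The sketch below assumes the usual shape of the Prometheus `/api/v1/targets` response, with a `data.activeTargets` list whose entries carry `scrapeUrl` and `health`; the function names are hypothetical.

```python
import json
from typing import Dict, List


def target_health(payload: str) -> Dict[str, str]:
    """Map each active target's scrape URL to its reported health."""
    document = json.loads(payload)
    targets = document.get("data", {}).get("activeTargets", [])
    return {
        target.get("scrapeUrl", "?"): target.get("health", "unknown")
        for target in targets
    }


def unhealthy_targets(payload: str) -> List[str]:
    """Return the scrape URLs whose health is anything other than 'up'."""
    return sorted(url for url, health in target_health(payload).items() if health != "up")
```

The payload can be the output of the `curl` command above, for instance piped in through `sys.stdin.read()`; an empty list from `unhealthy_targets` means every active target is being scraped successfully.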

---

@@ -444,6 +444,8 @@ Before setting up Caddy, point your domain's DNS to the server.
|---|---|---|---|
| A | `@` | `91.99.65.229` | 300 |
| A | `www` | `91.99.65.229` | 300 |
| AAAA | `@` | `2a01:4f8:1c1a:b39c::1` | 300 |
| AAAA | `www` | `2a01:4f8:1c1a:b39c::1` | 300 |

### rewardflow.lu (Loyalty+ Platform) — TODO

@@ -451,6 +453,8 @@ Before setting up Caddy, point your domain's DNS to the server.
|---|---|---|---|
| A | `@` | `91.99.65.229` | 300 |
| A | `www` | `91.99.65.229` | 300 |
| AAAA | `@` | `2a01:4f8:1c1a:b39c::1` | 300 |
| AAAA | `www` | `2a01:4f8:1c1a:b39c::1` | 300 |
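
A quick sanity check for tables like these is to confirm that every value matches its record type: IPv4 addresses under `A`, IPv6 addresses under `AAAA`. The sketch below does that with the standard `ipaddress` module, using the four rows shown above; `RECORDS`, `record_type_for`, and `mismatched` are illustrative names.

```python
import ipaddress
from typing import List, Tuple

# (type, name, value) rows copied from the DNS table.
RECORDS: List[Tuple[str, str, str]] = [
    ("A", "@", "91.99.65.229"),
    ("A", "www", "91.99.65.229"),
    ("AAAA", "@", "2a01:4f8:1c1a:b39c::1"),
    ("AAAA", "www", "2a01:4f8:1c1a:b39c::1"),
]


def record_type_for(address: str) -> str:
    """Return the DNS record type that fits the address family."""
    return "AAAA" if ipaddress.ip_address(address).version == 6 else "A"


def mismatched(records: List[Tuple[str, str, str]]) -> List[Tuple[str, str, str]]:
    """Return rows whose declared type does not fit their address."""
    return [row for row in records if record_type_for(row[2]) != row[0]]
```

`ipaddress.ip_address` raises `ValueError` on a malformed address, so a typo in the table surfaces as an exception rather than a silent mismatch.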

### IPv6 (AAAA) Records — TODO


@@ -328,7 +328,10 @@ sudo systemctl restart orion orion-celery

## Backups

### Database Backup Script
!!! tip "Docker deployment"
    For Docker-based deployments, use the automated backup scripts (`scripts/backup.sh` and `scripts/restore.sh`) with systemd timer. See [Hetzner Server Setup — Step 17](hetzner-server-setup.md#step-17-backups).

### Database Backup Script (VPS without Docker)

```bash
sudo nano /home/orion/backup.sh
@@ -373,6 +376,9 @@ sudo -u orion crontab -e

## Monitoring

!!! tip "Docker deployment"
    For Docker-based deployments, a full Prometheus + Grafana + node-exporter + cAdvisor stack is included in `docker-compose.yml`. See [Hetzner Server Setup — Step 18](hetzner-server-setup.md#step-18-monitoring-observability) and [Observability Framework](../architecture/observability.md).

### Basic Health Check

```bash