docs: update observability and deployment docs to match production stack
Some checks failed
Update observability.md with production container table, actual init code, and correct env var names. Update docker.md with full 10-service table and backup/monitoring cross-references. Add explicit AAAA records to DNS tables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -2,6 +2,38 @@
|
|||||||
|
|
||||||
The Orion platform includes a comprehensive observability framework for monitoring application health, collecting metrics, and tracking errors. This is part of the Framework Layer - infrastructure that modules depend on.
|
The Orion platform includes a comprehensive observability framework for monitoring application health, collecting metrics, and tracking errors. This is part of the Framework Layer - infrastructure that modules depend on.
|
||||||
|
|
||||||
|
## Production Stack
|
||||||
|
|
||||||
|
The full monitoring stack runs as Docker containers alongside the application:
|
||||||
|
|
||||||
|
| Container | Image | Port | Purpose |
|
||||||
|
|---|---|---|---|
|
||||||
|
| prometheus | `prom/prometheus` | 9090 (localhost) | Metrics storage, 15-day retention |
|
||||||
|
| grafana | `grafana/grafana` | 3001 (localhost) | Dashboards at `https://grafana.wizard.lu` |
|
||||||
|
| node-exporter | `prom/node-exporter` | 9100 (localhost) | Host CPU/RAM/disk metrics |
|
||||||
|
| cadvisor | `gcr.io/cadvisor/cadvisor` | 8080 (localhost) | Per-container resource metrics |
|
||||||
|
|
||||||
|
All monitoring containers run under `profiles: [full]` in `docker-compose.yml` with memory limits (256 + 192 + 64 + 128 = 640 MB total).
|
||||||
|
|
||||||
|
```
|
||||||
|
┌──────────────┐ scrape ┌─────────────────┐
|
||||||
|
│ Prometheus │◄────────────────│ Orion API │ /metrics
|
||||||
|
│ :9090 │◄────────────────│ node-exporter │ :9100
|
||||||
|
│ │◄────────────────│ cAdvisor │ :8080
|
||||||
|
└──────┬───────┘ └─────────────────┘
|
||||||
|
│ query
|
||||||
|
┌──────▼───────┐
|
||||||
|
│ Grafana │──── https://grafana.wizard.lu
|
||||||
|
│ :3001 │
|
||||||
|
└──────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
Configuration files:
|
||||||
|
|
||||||
|
- `monitoring/prometheus.yml` — scrape targets (orion-api, node-exporter, cadvisor, self)
|
||||||
|
- `monitoring/grafana/provisioning/datasources/datasource.yml` — auto-provisions Prometheus
|
||||||
|
- `monitoring/grafana/provisioning/dashboards/dashboard.yml` — file-based dashboard provider
|
||||||
|
|
||||||
## Overview
|
## Overview
|
||||||
|
|
||||||
```
|
```
|
||||||
@@ -326,47 +358,47 @@ init_observability(
|
|||||||
|
|
||||||
### Application Lifespan
|
### Application Lifespan
|
||||||
|
|
||||||
|
Observability is initialized in `app/core/lifespan.py` and the health router is mounted in `main.py`:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
# main.py
|
# app/core/lifespan.py
|
||||||
from contextlib import asynccontextmanager
|
from app.core.config import settings
|
||||||
from fastapi import FastAPI
|
from app.core.observability import init_observability, shutdown_observability
|
||||||
from app.core.observability import (
|
|
||||||
init_observability,
|
|
||||||
shutdown_observability,
|
|
||||||
health_router,
|
|
||||||
register_module_health_checks,
|
|
||||||
)
|
|
||||||
|
|
||||||
@asynccontextmanager
|
@asynccontextmanager
|
||||||
async def lifespan(app: FastAPI):
|
async def lifespan(app: FastAPI):
|
||||||
# Startup
|
# Startup
|
||||||
init_observability(
|
init_observability(
|
||||||
enable_metrics=True,
|
enable_metrics=settings.enable_metrics,
|
||||||
sentry_dsn=settings.SENTRY_DSN,
|
sentry_dsn=settings.sentry_dsn,
|
||||||
environment=settings.ENVIRONMENT,
|
environment=settings.sentry_environment,
|
||||||
flower_url=settings.FLOWER_URL,
|
flower_url=settings.flower_url,
|
||||||
grafana_url=settings.GRAFANA_URL,
|
grafana_url=settings.grafana_url,
|
||||||
)
|
)
|
||||||
register_module_health_checks()
|
|
||||||
|
|
||||||
yield
|
yield
|
||||||
|
|
||||||
# Shutdown
|
# Shutdown
|
||||||
shutdown_observability()
|
shutdown_observability()
|
||||||
|
|
||||||
app = FastAPI(lifespan=lifespan)
|
|
||||||
app.include_router(health_router)
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
```python
|
||||||
|
# main.py
|
||||||
|
from app.core.observability import health_router
|
||||||
|
app.include_router(health_router) # /metrics, /health/live, /health/ready, /health/tools
|
||||||
|
```
|
||||||
|
|
||||||
|
Note: `/health` is defined separately in `main.py` with a richer response (DB check, feature list, docs links). The `health_router` provides the Kubernetes-style probes and Prometheus endpoint.
|
||||||
|
|
||||||
### Environment Variables
|
### Environment Variables
|
||||||
|
|
||||||
| Variable | Description | Default |
|
| Variable | Config field | Description | Default |
|
||||||
|----------|-------------|---------|
|
|----------|-------------|-------------|---------|
|
||||||
| `SENTRY_DSN` | Sentry DSN for error tracking | None (disabled) |
|
| `ENABLE_METRICS` | `enable_metrics` | Enable Prometheus metrics collection | `False` |
|
||||||
| `ENVIRONMENT` | Environment name | "development" |
|
| `GRAFANA_URL` | `grafana_url` | Grafana dashboard URL | `https://grafana.wizard.lu` |
|
||||||
| `ENABLE_METRICS` | Enable Prometheus metrics | False |
|
| `GRAFANA_ADMIN_USER` | — | Grafana admin username (docker-compose only) | `admin` |
|
||||||
| `FLOWER_URL` | Flower dashboard URL | None |
|
| `GRAFANA_ADMIN_PASSWORD` | — | Grafana admin password (docker-compose only) | `changeme` |
|
||||||
| `GRAFANA_URL` | Grafana dashboard URL | None |
|
| `SENTRY_DSN` | `sentry_dsn` | Sentry DSN for error tracking | `None` (disabled) |
|
||||||
|
| `SENTRY_ENVIRONMENT` | `sentry_environment` | Environment name for Sentry | `development` |
|
||||||
|
| `FLOWER_URL` | `flower_url` | Flower dashboard URL | `http://localhost:5555` |
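
The variable-to-field mapping in the updated table can be sketched with the standard library alone. This is an illustration, not the project's `app.core.config`: the `ObservabilitySettings` class, `load_settings`, and the truthy-string parsing in `_as_bool` are assumptions, and only the variable names, field names, and defaults come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    """Treat common truthy strings as True; everything else is False (assumed rule)."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ObservabilitySettings:
    # Hypothetical stand-in for the fields read in app/core/lifespan.py.
    enable_metrics: bool
    grafana_url: str
    sentry_dsn: Optional[str]
    sentry_environment: str
    flower_url: str


def load_settings(env: Mapping[str, str] = os.environ) -> ObservabilitySettings:
    """Read the documented variables, falling back to the documented defaults."""
    return ObservabilitySettings(
        enable_metrics=_as_bool(env.get("ENABLE_METRICS", "false")),
        grafana_url=env.get("GRAFANA_URL", "https://grafana.wizard.lu"),
        # An unset or empty DSN leaves Sentry disabled.
        sentry_dsn=env.get("SENTRY_DSN") or None,
        sentry_environment=env.get("SENTRY_ENVIRONMENT", "development"),
        flower_url=env.get("FLOWER_URL", "http://localhost:5555"),
    )
```

Passing a plain dict instead of `os.environ` makes the defaults easy to check in isolation. `GRAFANA_ADMIN_USER` and `GRAFANA_ADMIN_PASSWORD` are left out because the table marks them as docker-compose only.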
|
||||||
|
|
||||||
## Kubernetes Integration
|
## Kubernetes Integration
|
||||||
|
|
||||||
@@ -424,6 +456,7 @@ spec:
|
|||||||
|
|
||||||
## Related Documentation
|
## Related Documentation
|
||||||
|
|
||||||
|
- [Hetzner Server Setup — Step 18](../deployment/hetzner-server-setup.md#step-18-monitoring-observability) - Production monitoring deployment
|
||||||
- [Module System](module-system.md) - Module health check integration
|
- [Module System](module-system.md) - Module health check integration
|
||||||
- [Background Tasks](background-tasks.md) - Celery monitoring with Flower
|
- [Background Tasks](background-tasks.md) - Celery monitoring with Flower
|
||||||
- [Deployment](../deployment/index.md) - Production deployment with monitoring
|
- [Deployment](../deployment/index.md) - Production deployment with monitoring
|
||||||
|
|||||||
@@ -36,11 +36,20 @@ make docker-down
|
|||||||
|
|
||||||
### Current Services
|
### Current Services
|
||||||
|
|
||||||
| Service | Port | Purpose |
|
| Service | Port | Profile | Purpose |
|
||||||
|---------|------|---------|
|
|---------|------|---------|---------|
|
||||||
| db | 5432 | PostgreSQL database |
|
| db | 5432 | (default) | PostgreSQL database |
|
||||||
| redis | 6379 | Cache and queue broker |
|
| redis | 6379 | (default) | Cache and queue broker |
|
||||||
| api | 8000 | FastAPI application |
|
| api | 8000 | full | FastAPI application |
|
||||||
|
| celery-worker | — | full | Background task processing |
|
||||||
|
| celery-beat | — | full | Scheduled task scheduler |
|
||||||
|
| flower | 5555 | full | Celery monitoring dashboard |
|
||||||
|
| prometheus | 9090 | full | Metrics storage (15-day retention) |
|
||||||
|
| grafana | 3001 | full | Dashboards (`https://grafana.wizard.lu`) |
|
||||||
|
| node-exporter | 9100 | full | Host CPU/RAM/disk metrics |
|
||||||
|
| cadvisor | 8080 | full | Per-container resource metrics |
|
||||||
|
|
||||||
|
Use `docker compose --profile full up -d` to start all services, or `docker compose up -d` for just db + redis (local development).
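
The effect of the profile flag can be modelled in a few lines of Python. This is a sketch of the selection rule only (services without a profile always start, profiled services start when their profile is active); the `SERVICES` mapping is copied from the updated table and `services_started` is a hypothetical helper, not part of the repository.

```python
from typing import Iterable, Optional

# Service -> profile, from the updated "Current Services" table.
# None stands for the "(default)" entries, which have no profile.
SERVICES: dict[str, Optional[str]] = {
    "db": None,
    "redis": None,
    "api": "full",
    "celery-worker": "full",
    "celery-beat": "full",
    "flower": "full",
    "prometheus": "full",
    "grafana": "full",
    "node-exporter": "full",
    "cadvisor": "full",
}


def services_started(active_profiles: Iterable[str] = ()) -> list[str]:
    """Return the services that start for the given set of active profiles."""
    active = set(active_profiles)
    return [
        name
        for name, profile in SERVICES.items()
        if profile is None or profile in active
    ]


print(services_started())               # db and redis only
print(len(services_started(["full"])))  # all ten services
```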
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -368,44 +377,60 @@ docker compose -f docker-compose.prod.yml up -d api
|
|||||||
|
|
||||||
## Backups
|
## Backups
|
||||||
|
|
||||||
### Database Backup
|
Automated backup scripts handle daily pg_dump with rotation and optional Cloudflare R2 offsite sync:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Create backup
|
# Run backup (Orion + Gitea databases)
|
||||||
docker compose -f docker-compose.prod.yml exec db pg_dump -U orion_user orion_db | gzip > backup_$(date +%Y%m%d).sql.gz
|
bash scripts/backup.sh
|
||||||
|
|
||||||
# Restore backup
|
# Run backup with R2 upload
|
||||||
gunzip -c backup_20240115.sql.gz | docker compose -f docker-compose.prod.yml exec -T db psql -U orion_user -d orion_db
|
bash scripts/backup.sh --upload
|
||||||
|
|
||||||
|
# Restore from backup
|
||||||
|
bash scripts/restore.sh orion ~/backups/orion/daily/orion_20260214_030000.sql.gz
|
||||||
|
bash scripts/restore.sh gitea ~/backups/gitea/daily/gitea_20260214_030000.sql.gz
|
||||||
```
|
```
|
||||||
|
|
||||||
### Volume Backup
|
Backups are stored in `~/backups/{orion,gitea}/{daily,weekly}/` with 7-day daily and 4-week weekly retention. A systemd timer runs the backup daily at 03:00.
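
As a rough illustration of the retention windows, the sketch below selects backups that have aged out. It assumes rotation is purely age-based and that the timestamp in the file name (as in `orion_20260214_030000.sql.gz`) is authoritative; the diff does not show `scripts/backup.sh`, so the real script may work differently. `backup_time` and `expired` are hypothetical helpers.

```python
from datetime import datetime, timedelta

DAILY_KEEP = timedelta(days=7)    # 7-day daily retention
WEEKLY_KEEP = timedelta(weeks=4)  # 4-week weekly retention


def backup_time(filename: str) -> datetime:
    """Parse the timestamp from a name such as orion_20260214_030000.sql.gz."""
    stem = filename.split(".", 1)[0]               # orion_20260214_030000
    _, date_part, time_part = stem.rsplit("_", 2)  # tolerate underscores in the db name
    return datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")


def expired(filenames: list[str], now: datetime, keep: timedelta) -> list[str]:
    """Return the backups strictly older than the retention window, sorted by name."""
    cutoff = now - keep
    return sorted(f for f in filenames if backup_time(f) < cutoff)
```

Calling `expired(daily_files, datetime.now(), DAILY_KEEP)` would list the daily backups that are candidates for deletion; the same function covers the weekly directory with `WEEKLY_KEEP`.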
|
||||||
|
|
||||||
|
See [Hetzner Server Setup — Step 17](hetzner-server-setup.md#step-17-backups) for full setup instructions.
|
||||||
|
|
||||||
|
### Manual Database Backup
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Backup all volumes
|
# One-off backup
|
||||||
docker run --rm \
|
docker compose exec db pg_dump -U orion_user orion_db | gzip > backup_$(date +%Y%m%d).sql.gz
|
||||||
-v orion_postgres_data:/data \
|
|
||||||
-v $(pwd)/backups:/backup \
|
# Restore
|
||||||
alpine tar czf /backup/postgres_$(date +%Y%m%d).tar.gz /data
|
gunzip -c backup_20240115.sql.gz | docker compose exec -T db psql -U orion_user -d orion_db
|
||||||
```
|
```
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Monitoring
|
## Monitoring
|
||||||
|
|
||||||
|
The monitoring stack (Prometheus, Grafana, node-exporter, cAdvisor) runs under `profiles: [full]`. See [Hetzner Server Setup — Step 18](hetzner-server-setup.md#step-18-monitoring-observability) for full setup and [Observability Framework](../architecture/observability.md) for the application-level metrics architecture.
|
||||||
|
|
||||||
### Resource Usage
|
### Resource Usage
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
docker stats
|
docker stats --no-stream
|
||||||
```
|
```
|
||||||
|
|
||||||
### Health Checks
|
### Health Checks
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Check service health
|
# Check service health
|
||||||
docker compose -f docker-compose.prod.yml ps
|
docker compose --profile full ps
|
||||||
|
|
||||||
# Test API health
|
# API health
|
||||||
curl -s http://localhost:8000/health | jq
|
curl -s http://localhost:8000/health | jq
|
||||||
|
|
||||||
|
# Prometheus metrics
|
||||||
|
curl -s http://localhost:8000/metrics | head -5
|
||||||
|
|
||||||
|
# Prometheus targets
|
||||||
|
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep health
|
||||||
```
|
```
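
The `grep health` check on the Prometheus targets endpoint can also be done structurally. The sketch below assumes the standard `/api/v1/targets` response shape, where `data.activeTargets` is a list of objects carrying `scrapeUrl` and `health`; the helper name `unhealthy_targets` and the sample payload are made up for illustration.

```python
import json


def unhealthy_targets(payload: dict) -> list[str]:
    """Return the scrape URLs of active targets whose health is not "up"."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [t["scrapeUrl"] for t in targets if t.get("health") != "up"]


# Illustrative payload only; real responses contain more fields per target.
sample = json.loads("""
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"scrapeUrl": "http://api:8000/metrics", "health": "up"},
      {"scrapeUrl": "http://cadvisor:8080/metrics", "health": "down"}
    ]
  }
}
""")

print(unhealthy_targets(sample))
```

In practice the payload would come from `curl -s http://localhost:9090/api/v1/targets`, piped into a script that calls `json.load(sys.stdin)`.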
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|||||||
@@ -444,6 +444,8 @@ Before setting up Caddy, point your domain's DNS to the server.
|
|||||||
|---|---|---|---|
|
|---|---|---|---|
|
||||||
| A | `@` | `91.99.65.229` | 300 |
|
| A | `@` | `91.99.65.229` | 300 |
|
||||||
| A | `www` | `91.99.65.229` | 300 |
|
| A | `www` | `91.99.65.229` | 300 |
|
||||||
|
| AAAA | `@` | `2a01:4f8:1c1a:b39c::1` | 300 |
|
||||||
|
| AAAA | `www` | `2a01:4f8:1c1a:b39c::1` | 300 |
|
||||||
|
|
||||||
### rewardflow.lu (Loyalty+ Platform) — TODO
|
### rewardflow.lu (Loyalty+ Platform) — TODO
|
||||||
|
|
||||||
@@ -451,6 +453,8 @@ Before setting up Caddy, point your domain's DNS to the server.
|
|||||||
|---|---|---|---|
|
|---|---|---|---|
|
||||||
| A | `@` | `91.99.65.229` | 300 |
|
| A | `@` | `91.99.65.229` | 300 |
|
||||||
| A | `www` | `91.99.65.229` | 300 |
|
| A | `www` | `91.99.65.229` | 300 |
|
||||||
|
| AAAA | `@` | `2a01:4f8:1c1a:b39c::1` | 300 |
|
||||||
|
| AAAA | `www` | `2a01:4f8:1c1a:b39c::1` | 300 |
|
||||||
|
|
||||||
### IPv6 (AAAA) Records — TODO
|
### IPv6 (AAAA) Records — TODO
|
||||||
|
|
||||||
|
|||||||
@@ -328,7 +328,10 @@ sudo systemctl restart orion orion-celery
|
|||||||
|
|
||||||
## Backups
|
## Backups
|
||||||
|
|
||||||
### Database Backup Script
|
!!! tip "Docker deployment"
|
||||||
|
For Docker-based deployments, use the automated backup scripts (`scripts/backup.sh` and `scripts/restore.sh`) with systemd timer. See [Hetzner Server Setup — Step 17](hetzner-server-setup.md#step-17-backups).
|
||||||
|
|
||||||
|
### Database Backup Script (VPS without Docker)
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
sudo nano /home/orion/backup.sh
|
sudo nano /home/orion/backup.sh
|
||||||
@@ -373,6 +376,9 @@ sudo -u orion crontab -e
|
|||||||
|
|
||||||
## Monitoring
|
## Monitoring
|
||||||
|
|
||||||
|
!!! tip "Docker deployment"
|
||||||
|
For Docker-based deployments, a full Prometheus + Grafana + node-exporter + cAdvisor stack is included in `docker-compose.yml`. See [Hetzner Server Setup — Step 18](hetzner-server-setup.md#step-18-monitoring-observability) and [Observability Framework](../architecture/observability.md).
|
||||||
|
|
||||||
### Basic Health Check
|
### Basic Health Check
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
|
|||||||