orion/docs/architecture/observability.md
Samir Boulahtit 677e5211f9
docs: update observability and deployment docs to match production stack
Update observability.md with production container table, actual init code,
and correct env var names. Update docker.md with full 10-service table and
backup/monitoring cross-references. Add explicit AAAA records to DNS tables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 16:44:05 +01:00


# Observability Framework
The Orion platform includes an observability framework for monitoring application health, collecting metrics, and tracking errors. It is part of the Framework Layer: infrastructure that modules depend on.
## Production Stack
The full monitoring stack runs as Docker containers alongside the application:
| Container | Image | Port | Purpose |
|---|---|---|---|
| prometheus | `prom/prometheus` | 9090 (localhost) | Metrics storage, 15-day retention |
| grafana | `grafana/grafana` | 3001 (localhost) | Dashboards at `https://grafana.wizard.lu` |
| node-exporter | `prom/node-exporter` | 9100 (localhost) | Host CPU/RAM/disk metrics |
| cadvisor | `gcr.io/cadvisor/cadvisor` | 8080 (localhost) | Per-container resource metrics |
All monitoring containers run under `profiles: [full]` in `docker-compose.yml` with memory limits (256 + 192 + 64 + 128 = 640 MB total).
```
┌──────────────┐     scrape      ┌─────────────────┐
│  Prometheus  │◄────────────────│  Orion API      │ /metrics
│  :9090       │◄────────────────│  node-exporter  │ :9100
│              │◄────────────────│  cAdvisor       │ :8080
└──────┬───────┘                 └─────────────────┘
       │ query
┌──────▼───────┐
│  Grafana     │──── https://grafana.wizard.lu
│  :3001       │
└──────────────┘
```
Configuration files:
- `monitoring/prometheus.yml` — scrape targets (orion-api, node-exporter, cadvisor, self)
- `monitoring/grafana/provisioning/datasources/datasource.yml` — auto-provisions Prometheus
- `monitoring/grafana/provisioning/dashboards/dashboard.yml` — file-based dashboard provider
## Overview
```
┌─────────────────────────────────────────────────────────────────────────┐
│                         OBSERVABILITY FRAMEWORK                         │
│                        app/core/observability.py                        │
│                                                                         │
│  ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐    │
│  │  Health Checks  │     │   Prometheus    │     │     Sentry      │    │
│  │    Registry     │     │     Metrics     │     │   Integration   │    │
│  └────────┬────────┘     └────────┬────────┘     └────────┬────────┘    │
│           │                       │                       │             │
│           ▼                       ▼                       ▼             │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │                          API Endpoints                          │    │
│  │        /health │ /health/live │ /health/ready │ /metrics        │    │
│  └─────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────┘
                     ┌───────────────────────────────┐
                     │        External Tools         │
                     │ Flower │ Grafana │ Prometheus │
                     └───────────────────────────────┘
```
## Health Checks
### Health Check Registry
Components register health checks that are aggregated into a single endpoint.
```python
from app.core.observability import (
    health_registry,
    HealthCheckResult,
    HealthStatus,
)

# Register using decorator
@health_registry.register("database")
def check_database() -> HealthCheckResult:
    try:
        db.execute("SELECT 1")
        return HealthCheckResult(
            name="database",
            status=HealthStatus.HEALTHY,
            message="Database connection OK"
        )
    except Exception as e:
        return HealthCheckResult(
            name="database",
            status=HealthStatus.UNHEALTHY,
            message=str(e)
        )

# Or register directly
health_registry.register_check("redis", check_redis_connection)
```
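How the registry stores and runs checks is not shown in this document. The following is a minimal sketch of one possible shape, assuming checks are zero-argument callables keyed by name and that a check which raises counts as unhealthy. `SketchRegistry` and `run_all` are hypothetical names used for illustration, not the actual `health_registry` API, and plain strings stand in for `HealthCheckResult`.

```python
from typing import Callable, Dict


class SketchRegistry:
    """Hypothetical stand-in for a health check registry (not Orion's class)."""

    def __init__(self) -> None:
        self._checks: Dict[str, Callable[[], str]] = {}

    def register(self, name: str):
        """Decorator form: store the function under `name` and return it unchanged."""
        def decorator(func: Callable[[], str]) -> Callable[[], str]:
            self._checks[name] = func
            return func
        return decorator

    def register_check(self, name: str, func: Callable[[], str]) -> None:
        """Direct form: same effect as the decorator."""
        self._checks[name] = func

    def run_all(self) -> Dict[str, str]:
        """Run every check; an exception is reported as 'unhealthy' (assumption)."""
        results: Dict[str, str] = {}
        for name, func in self._checks.items():
            try:
                results[name] = func()
            except Exception:
                results[name] = "unhealthy"
        return results


registry = SketchRegistry()


@registry.register("database")
def check_database() -> str:
    return "healthy"


def check_cache() -> str:
    raise ConnectionError("cache unreachable")


registry.register_check("cache", check_cache)
print(registry.run_all())
```

Catching exceptions inside `run_all` keeps one failing dependency from breaking the whole aggregated response.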
### Health Status Levels
| Status | Description | HTTP Response |
|--------|-------------|---------------|
| `HEALTHY` | All systems operational | 200 |
| `DEGRADED` | Partial functionality available | 200 |
| `UNHEALTHY` | Critical failure | 200 (check response body) |
### HealthCheckResult Fields
| Field | Type | Description |
|-------|------|-------------|
| `name` | str | Check identifier |
| `status` | HealthStatus | Health status level |
| `message` | str | Optional status message |
| `latency_ms` | float | Check execution time |
| `details` | dict | Additional diagnostic data |
| `checked_at` | datetime | Timestamp of check |
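
As a rough illustration of the fields above, the result type could be modelled as a dataclass. This is a sketch under assumptions, not the definition in `app/core/observability.py`: the default values, the UTC timestamp, and the `run_timed` helper are hypothetical, and the lowercase enum values simply mirror the strings shown in the `/health` response.

```python
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Callable, Dict


class HealthStatus(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class HealthCheckResult:
    name: str
    status: HealthStatus
    message: str = ""                                  # assumed default
    latency_ms: float = 0.0                            # assumed default
    details: Dict[str, Any] = field(default_factory=dict)
    checked_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def run_timed(check: Callable[[], HealthCheckResult]) -> HealthCheckResult:
    """Hypothetical helper: run a check and record its duration in latency_ms."""
    start = time.perf_counter()
    result = check()
    result.latency_ms = (time.perf_counter() - start) * 1000.0
    return result


def check_disk() -> HealthCheckResult:
    return HealthCheckResult(
        name="disk",
        status=HealthStatus.HEALTHY,
        details={"free_gb": 12},
    )


result = run_timed(check_disk)
print(result.name, result.status.value, result.details)
```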
## API Endpoints
### GET /health
Aggregated health check endpoint. Returns combined status from all registered checks.
**Response:**
```json
{
  "status": "healthy",
  "timestamp": "2026-01-27T10:30:00Z",
  "checks": [
    {
      "name": "database",
      "status": "healthy",
      "message": "Connection OK",
      "latency_ms": 2.5,
      "details": {}
    },
    {
      "name": "module:billing",
      "status": "healthy",
      "message": "",
      "latency_ms": 0.1,
      "details": {}
    }
  ]
}
```
**Overall Status Logic:**
- If any check is `UNHEALTHY` → overall is `UNHEALTHY`
- If any check is `DEGRADED` and none `UNHEALTHY` → overall is `DEGRADED`
- Otherwise → `HEALTHY`
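
The three rules above can be sketched as a small function. The enum here is a local stand-in for the framework's `HealthStatus`, and `overall_status` is a hypothetical name; treating an empty list of checks as healthy is an assumption that follows from the "otherwise" rule.

```python
from enum import Enum
from typing import Iterable


class HealthStatus(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


def overall_status(statuses: Iterable[HealthStatus]) -> HealthStatus:
    """Combine individual check statuses; the worst status wins."""
    seen = set(statuses)
    if HealthStatus.UNHEALTHY in seen:
        return HealthStatus.UNHEALTHY
    if HealthStatus.DEGRADED in seen:
        return HealthStatus.DEGRADED
    return HealthStatus.HEALTHY


print(overall_status([HealthStatus.HEALTHY, HealthStatus.DEGRADED]).value)
```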
### GET /health/live
Kubernetes liveness probe. Returns 200 if the application is running.
**Response:**
```json
{"status": "alive"}
```
### GET /health/ready
Kubernetes readiness probe. Returns ready status based on health checks.
**Response:**
```json
{
  "status": "ready",
  "health": "healthy"
}
```
Or if unhealthy:
```json
{
  "status": "not_ready",
  "health": "unhealthy"
}
```
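The two responses above suggest a simple mapping from overall health to readiness. The sketch below shows one way to express it. The document does not say how a `degraded` status is treated, so the `degraded_is_ready` flag is an explicit assumption, and `readiness_body` is a hypothetical name.

```python
from typing import Dict


def readiness_body(health: str, degraded_is_ready: bool = True) -> Dict[str, str]:
    """Build a readiness response body from the overall health string.

    Whether "degraded" counts as ready is NOT specified by the source;
    the flag makes that assumption visible and easy to flip.
    """
    if health == "healthy" or (health == "degraded" and degraded_is_ready):
        status = "ready"
    else:
        status = "not_ready"
    return {"status": status, "health": health}


print(readiness_body("healthy"))
print(readiness_body("unhealthy"))
```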
### GET /metrics
Prometheus metrics endpoint. Returns metrics in Prometheus text format.
**Response:**
```
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/products",status="200"} 1234
...
```
### GET /health/tools
Returns URLs to external monitoring tools.
**Response:**
```json
{
  "flower": "http://flower.example.com:5555",
  "grafana": "http://grafana.example.com:3000",
  "prometheus": null
}
```
## Prometheus Metrics
### MetricsRegistry
The metrics registry wraps `prometheus_client` and falls back to no-op metrics when the library isn't installed.
```python
from app.core.observability import metrics_registry

# Counter - tracks cumulative values
request_counter = metrics_registry.counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"]
)
request_counter.labels(method="GET", endpoint="/api/products", status="200").inc()

# Histogram - tracks distributions (latency, sizes)
request_latency = metrics_registry.histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)
request_latency.labels(endpoint="/api/products").observe(0.045)

# Gauge - tracks current values
active_connections = metrics_registry.gauge(
    "active_connections",
    "Number of active connections",
    ["pool"]
)
active_connections.labels(pool="database").set(42)
```
### Enabling Metrics
Metrics are disabled by default. Enable during initialization:
```python
from app.core.observability import init_observability

init_observability(
    enable_metrics=True,
    # ... other options
)
```
### Dummy Metrics
When `prometheus_client` isn't installed or metrics are disabled, the registry returns dummy metrics that silently ignore all operations. This allows code to use metrics without checking if they're enabled.
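A dummy metric of this kind is essentially a null object. The sketch below shows the idea with a hypothetical `NoOpMetric` class; the real fallback in `app/core/observability.py` may be structured differently.

```python
class NoOpMetric:
    """Hypothetical no-op metric: accepts the usual calls and does nothing."""

    def labels(self, *args, **kwargs) -> "NoOpMetric":
        # Return self so chained calls such as .labels(...).inc() keep working.
        return self

    def inc(self, amount: float = 1) -> None:
        pass

    def dec(self, amount: float = 1) -> None:
        pass

    def set(self, value: float) -> None:
        pass

    def observe(self, value: float) -> None:
        pass


# Call sites look the same whether or not metrics are enabled.
requests_total = NoOpMetric()
requests_total.labels(method="GET", endpoint="/api/products", status="200").inc()
requests_total.labels(endpoint="/api/products").observe(0.045)
```

Because every method accepts the same arguments as its real counterpart, instrumented code needs no `if metrics_enabled:` branches.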
## Sentry Integration
### Configuration
```python
from app.core.observability import sentry, init_observability

# Initialize via init_observability
init_observability(
    sentry_dsn="https://key@sentry.io/project",
    environment="production",
)

# Or initialize directly
sentry.init(
    dsn="https://key@sentry.io/project",
    environment="production",
    release="1.0.0"
)
```
### Capturing Errors
```python
import logging

from app.core.observability import sentry

logger = logging.getLogger(__name__)

try:
    risky_operation()
except Exception as e:
    event_id = sentry.capture_exception(e)
    logger.error("Operation failed, Sentry event: %s", event_id)

# Capture messages
sentry.capture_message("User reached rate limit", level="warning")
```
### Without Sentry
If `sentry_sdk` isn't installed or no DSN is provided, all capture calls silently return `None`.
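One way to get this behaviour is a thin wrapper that only forwards to `sentry_sdk` once it has been initialized with a DSN. The sketch below assumes that design; `SentryFacade` is a hypothetical name and is not the actual `sentry` object exported by `app.core.observability`.

```python
from typing import Optional

try:
    import sentry_sdk  # optional dependency
except ImportError:
    sentry_sdk = None


class SentryFacade:
    """Hypothetical wrapper that degrades to a no-op without SDK or DSN."""

    def __init__(self) -> None:
        self._enabled = False

    def init(self, dsn: Optional[str] = None, environment: str = "development") -> bool:
        """Initialize Sentry if possible; return whether capturing is active."""
        if sentry_sdk is None or not dsn:
            self._enabled = False
            return False
        sentry_sdk.init(dsn=dsn, environment=environment)
        self._enabled = True
        return True

    def capture_exception(self, exc: BaseException) -> Optional[str]:
        """Forward to sentry_sdk when enabled, otherwise return None."""
        if not self._enabled:
            return None
        return sentry_sdk.capture_exception(exc)


facade = SentryFacade()
facade.init(dsn=None)
event_id = facade.capture_exception(ValueError("boom"))
print(event_id)  # None, because no DSN was configured
```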
## Module Health Checks
Modules can provide health checks that are automatically registered.
### Defining Module Health Check
```python
# In module definition
import stripe

from app.modules.base import ModuleDefinition


def check_billing_health() -> dict:
    """Check billing service health."""
    try:
        # Check Stripe connection
        stripe.Account.retrieve()
        return {"status": "healthy", "message": "Stripe connected"}
    except Exception as e:
        return {"status": "unhealthy", "message": str(e)}


billing_module = ModuleDefinition(
    code="billing",
    name="Billing",
    health_check=check_billing_health,
    # ...
)
```
### Registering Module Health Checks
```python
from app.core.observability import register_module_health_checks
# Call after modules are loaded (e.g., in app lifespan)
register_module_health_checks()
```
This registers health checks as `module:{code}` (e.g., `module:billing`).
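Module checks return plain dicts, so registration presumably wraps each one under its `module:{code}` name. The sketch below shows what such an adapter might look like. `adapt_module_check` is a hypothetical helper, and both the "healthy" default for a missing `status` key and the handling of exceptions are assumptions.

```python
from typing import Callable, Dict, Tuple

CheckFn = Callable[[], Dict[str, str]]


def adapt_module_check(code: str, check: CheckFn) -> Tuple[str, CheckFn]:
    """Wrap a module's dict-returning check under the name `module:{code}`."""
    name = f"module:{code}"

    def wrapped() -> Dict[str, str]:
        try:
            raw = check()
        except Exception as exc:
            # Assumption: a check that raises is reported as unhealthy.
            return {"name": name, "status": "unhealthy", "message": str(exc)}
        return {
            "name": name,
            "status": raw.get("status", "healthy"),  # assumed default
            "message": raw.get("message", ""),
        }

    return name, wrapped


def check_billing_health() -> Dict[str, str]:
    return {"status": "healthy", "message": "Stripe connected"}


name, wrapped = adapt_module_check("billing", check_billing_health)
print(name, wrapped())
```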
## External Tools
### Flower (Celery Monitoring)
Configure Flower URL to include in `/health/tools`:
```python
init_observability(
    flower_url="http://flower:5555",
)
```
### Grafana
Configure Grafana URL:
```python
init_observability(
    grafana_url="http://grafana:3000",
)
```
## Initialization
### Application Lifespan
Observability is initialized in `app/core/lifespan.py` and the health router is mounted in `main.py`:
```python
# app/core/lifespan.py
from contextlib import asynccontextmanager

from fastapi import FastAPI

from app.core.config import settings
from app.core.observability import init_observability, shutdown_observability


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup
    init_observability(
        enable_metrics=settings.enable_metrics,
        sentry_dsn=settings.sentry_dsn,
        environment=settings.sentry_environment,
        flower_url=settings.flower_url,
        grafana_url=settings.grafana_url,
    )
    yield
    # Shutdown
    shutdown_observability()
```
```python
# main.py
from app.core.observability import health_router
app.include_router(health_router) # /metrics, /health/live, /health/ready, /health/tools
```
Note: `/health` is defined separately in `main.py` with a richer response (DB check, feature list, docs links). The `health_router` provides the Kubernetes-style probes and Prometheus endpoint.
### Environment Variables
| Variable | Config field | Description | Default |
|----------|-------------|-------------|---------|
| `ENABLE_METRICS` | `enable_metrics` | Enable Prometheus metrics collection | `False` |
| `GRAFANA_URL` | `grafana_url` | Grafana dashboard URL | `https://grafana.wizard.lu` |
| `GRAFANA_ADMIN_USER` | — | Grafana admin username (docker-compose only) | `admin` |
| `GRAFANA_ADMIN_PASSWORD` | — | Grafana admin password (docker-compose only) | `changeme` |
| `SENTRY_DSN` | `sentry_dsn` | Sentry DSN for error tracking | `None` (disabled) |
| `SENTRY_ENVIRONMENT` | `sentry_environment` | Environment name for Sentry | `development` |
| `FLOWER_URL` | `flower_url` | Flower dashboard URL | `http://localhost:5555` |
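
To make the variable-to-field mapping concrete, here is a standard-library sketch that reads the application-level variables with the defaults from the table. The real `settings` object in `app/core/config.py` is not shown in this document and may be built differently; `ObservabilitySettings`, `load_settings`, and the accepted boolean spellings are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    """Assumed boolean parsing: 1/true/yes/on, case-insensitive."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ObservabilitySettings:
    enable_metrics: bool
    grafana_url: str
    sentry_dsn: Optional[str]
    sentry_environment: str
    flower_url: str


def load_settings(env: Mapping[str, str] = os.environ) -> ObservabilitySettings:
    """Read observability settings from a mapping, applying the table defaults."""
    return ObservabilitySettings(
        enable_metrics=_as_bool(env.get("ENABLE_METRICS", "false")),
        grafana_url=env.get("GRAFANA_URL", "https://grafana.wizard.lu"),
        sentry_dsn=env.get("SENTRY_DSN") or None,  # empty string means disabled
        sentry_environment=env.get("SENTRY_ENVIRONMENT", "development"),
        flower_url=env.get("FLOWER_URL", "http://localhost:5555"),
    )


settings = load_settings({"ENABLE_METRICS": "true"})
print(settings)
```

Passing the environment as a mapping keeps the loader easy to test without touching the process environment.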
## Kubernetes Integration
### Deployment Configuration
```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: app
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10
```
### Prometheus ServiceMonitor
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
spec:
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
```
## Best Practices
### Do
- Register health checks for critical dependencies (database, cache, external APIs)
- Use appropriate metric types (counter for counts, histogram for latency)
- Include meaningful labels but avoid high cardinality
- Set up alerts based on health status changes
### Don't
- Create health checks that are slow or have side effects
- Add high-cardinality labels to metrics (e.g., user IDs)
- Ignore Sentry errors in production
- Skip readiness probes in Kubernetes deployments
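
On label cardinality: a common way to keep an `endpoint` label bounded is to collapse identifiers in the request path before using it as a label value. The sketch below illustrates that idea; `endpoint_label` is a hypothetical helper, not part of the framework, and it only recognizes numeric and UUID segments.

```python
import re

_UUID = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def endpoint_label(path: str) -> str:
    """Replace numeric and UUID path segments with the placeholder {id}."""
    segments = []
    for segment in path.split("/"):
        if segment.isdigit() or _UUID.match(segment):
            segments.append("{id}")
        else:
            segments.append(segment)
    return "/".join(segments)


print(endpoint_label("/api/users/123/orders"))  # /api/users/{id}/orders
```

With this normalization, a million distinct user IDs still produce a single time series per route rather than one per user.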
## Related Documentation
- [Hetzner Server Setup — Step 18](../deployment/hetzner-server-setup.md#step-18-monitoring-observability) - Production monitoring deployment
- [Module System](module-system.md) - Module health check integration
- [Background Tasks](background-tasks.md) - Celery monitoring with Flower
- [Deployment](../deployment/index.md) - Production deployment with monitoring