Observability Framework
The Orion platform includes a comprehensive observability framework for monitoring application health, collecting metrics, and tracking errors. This is part of the Framework Layer - infrastructure that modules depend on.
Production Stack
The full monitoring stack runs as Docker containers alongside the application:
| Container | Image | Port | Purpose |
|---|---|---|---|
| prometheus | prom/prometheus | 9090 (localhost) | Metrics storage, 15-day retention |
| grafana | grafana/grafana | 3001 (localhost) | Dashboards at https://grafana.wizard.lu |
| node-exporter | prom/node-exporter | 9100 (localhost) | Host CPU/RAM/disk metrics |
| cadvisor | gcr.io/cadvisor/cadvisor | 8080 (localhost) | Per-container resource metrics |
All monitoring containers run under profiles: [full] in docker-compose.yml with memory limits (256 + 192 + 64 + 128 = 640 MB total).
┌──────────────┐     scrape      ┌─────────────────┐
│  Prometheus  │◄────────────────│  Orion API      │ /metrics
│  :9090       │◄────────────────│  node-exporter  │ :9100
│              │◄────────────────│  cAdvisor       │ :8080
└──────┬───────┘                 └─────────────────┘
       │ query
┌──────▼───────┐
│  Grafana     │──── https://grafana.wizard.lu
│  :3001       │
└──────────────┘
Configuration files:
- monitoring/prometheus.yml: scrape targets (orion-api, node-exporter, cadvisor, self)
- monitoring/grafana/provisioning/datasources/datasource.yml: auto-provisions Prometheus
- monitoring/grafana/provisioning/dashboards/dashboard.yml: file-based dashboard provider
Overview
┌─────────────────────────────────────────────────────────────────────────┐
│                         OBSERVABILITY FRAMEWORK                         │
│                        app/core/observability.py                        │
│                                                                         │
│  ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐    │
│  │  Health Checks  │     │   Prometheus    │     │     Sentry      │    │
│  │    Registry     │     │     Metrics     │     │   Integration   │    │
│  └────────┬────────┘     └────────┬────────┘     └────────┬────────┘    │
│           │                       │                       │             │
│           ▼                       ▼                       ▼             │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │                          API Endpoints                          │    │
│  │        /health │ /health/live │ /health/ready │ /metrics        │    │
│  └─────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
                     ┌───────────────────────────────┐
                     │        External Tools         │
                     │ Flower │ Grafana │ Prometheus │
                     └───────────────────────────────┘
Health Checks
Health Check Registry
Components register health checks that are aggregated into a single endpoint.
from app.core.observability import (
    health_registry,
    HealthCheckResult,
    HealthStatus,
)

# Register using decorator
@health_registry.register("database")
def check_database() -> HealthCheckResult:
    try:
        # `db` is assumed to be an existing database session or connection
        db.execute("SELECT 1")
        return HealthCheckResult(
            name="database",
            status=HealthStatus.HEALTHY,
            message="Database connection OK",
        )
    except Exception as e:
        return HealthCheckResult(
            name="database",
            status=HealthStatus.UNHEALTHY,
            message=str(e),
        )

# Or register directly
health_registry.register_check("redis", check_redis_connection)
Health Status Levels
| Status | Description | HTTP Response |
|---|---|---|
| HEALTHY | All systems operational | 200 |
| DEGRADED | Partial functionality available | 200 |
| UNHEALTHY | Critical failure | 200 (check response body) |
HealthCheckResult Fields
| Field | Type | Description |
|---|---|---|
| name | str | Check identifier |
| status | HealthStatus | Health status level |
| message | str | Optional status message |
| latency_ms | float | Check execution time |
| details | dict | Additional diagnostic data |
| checked_at | datetime | Timestamp of check |
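The fields above could be modelled roughly as follows. This is a minimal sketch, not the actual definition in app/core/observability.py: the "healthy" and "unhealthy" string values are taken from the JSON responses in this document, "degraded" is assumed by analogy, and the defaults and the to_dict helper are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any


class HealthStatus(str, Enum):
    # "degraded" is an assumed value; the other two appear in the JSON examples.
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class HealthCheckResult:
    name: str
    status: HealthStatus
    message: str = ""
    latency_ms: float = 0.0
    details: dict[str, Any] = field(default_factory=dict)
    checked_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def to_dict(self) -> dict[str, Any]:
        # Hypothetical helper: mirrors one entry of the "checks" array.
        return {
            "name": self.name,
            "status": self.status.value,
            "message": self.message,
            "latency_ms": self.latency_ms,
            "details": self.details,
        }
```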
API Endpoints
GET /health
Aggregated health check endpoint. Returns combined status from all registered checks.
Response:
{
  "status": "healthy",
  "timestamp": "2026-01-27T10:30:00Z",
  "checks": [
    {
      "name": "database",
      "status": "healthy",
      "message": "Connection OK",
      "latency_ms": 2.5,
      "details": {}
    },
    {
      "name": "module:billing",
      "status": "healthy",
      "message": "",
      "latency_ms": 0.1,
      "details": {}
    }
  ]
}
Overall Status Logic:
- If any check is UNHEALTHY → overall is UNHEALTHY
- If any check is DEGRADED and none UNHEALTHY → overall is DEGRADED
- Otherwise → HEALTHY
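The three rules can be sketched as a small function. The function name and the locally defined enum are illustrative only, and treating an empty list of checks as healthy is an assumption the rules do not state.

```python
from enum import Enum
from typing import Iterable


class HealthStatus(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


def overall_status(statuses: Iterable[HealthStatus]) -> HealthStatus:
    """Combine individual check statuses: worst status wins."""
    seen = set(statuses)
    if HealthStatus.UNHEALTHY in seen:
        return HealthStatus.UNHEALTHY
    if HealthStatus.DEGRADED in seen:
        return HealthStatus.DEGRADED
    # Also reached when no checks are registered (an assumption).
    return HealthStatus.HEALTHY
```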
GET /health/live
Kubernetes liveness probe. Returns 200 whenever the application process is running; it does not run the registered health checks.
Response:
{"status": "alive"}
GET /health/ready
Kubernetes readiness probe. Returns ready status based on health checks.
Response:
{
  "status": "ready",
  "health": "healthy"
}
Or if unhealthy:
{
  "status": "not_ready",
  "health": "unhealthy"
}
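One way to derive the readiness body from the overall health string is sketched below. The function name is hypothetical, and how the real endpoint treats a degraded state is not specified here; this sketch assumes only "unhealthy" makes the service not ready.

```python
def readiness_payload(health: str) -> dict[str, str]:
    """Build a /health/ready-style body from an overall health string."""
    # Assumption: "degraded" still counts as ready; the real rule may differ.
    status = "not_ready" if health == "unhealthy" else "ready"
    return {"status": status, "health": health}
```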
GET /metrics
Prometheus metrics endpoint. Returns metrics in Prometheus text format.
Response:
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/products",status="200"} 1234
...
GET /health/tools
Returns URLs to external monitoring tools.
Response:
{
  "flower": "http://flower.example.com:5555",
  "grafana": "http://grafana.example.com:3000",
  "prometheus": null
}
Prometheus Metrics
MetricsRegistry
The metrics registry provides a wrapper around prometheus_client with fallback when the library isn't installed.
from app.core.observability import metrics_registry

# Counter - tracks cumulative values
request_counter = metrics_registry.counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"],
)
request_counter.labels(method="GET", endpoint="/api/products", status="200").inc()

# Histogram - tracks distributions (latency, sizes)
request_latency = metrics_registry.histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0],
)
request_latency.labels(endpoint="/api/products").observe(0.045)

# Gauge - tracks current values
active_connections = metrics_registry.gauge(
    "active_connections",
    "Number of active connections",
    ["pool"],
)
active_connections.labels(pool="database").set(42)
Enabling Metrics
Metrics are disabled by default. Enable during initialization:
from app.core.observability import init_observability

init_observability(
    enable_metrics=True,
    # ... other options
)
Dummy Metrics
When prometheus_client isn't installed or metrics are disabled, the registry returns dummy metrics that silently ignore all operations. This allows code to use metrics without checking if they're enabled.
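The fallback can be pictured as a null-object pattern. The sketch below is an illustration under assumptions, not the registry's actual code: DummyMetric and make_counter are hypothetical names, and only the Counter case is shown.

```python
class DummyMetric:
    """No-op stand-in that accepts the same calls as a real metric."""

    def labels(self, *args, **kwargs) -> "DummyMetric":
        # Return self so chained calls like .labels(...).inc() keep working.
        return self

    def inc(self, amount: float = 1) -> None:
        pass

    def dec(self, amount: float = 1) -> None:
        pass

    def set(self, value: float) -> None:
        pass

    def observe(self, value: float) -> None:
        pass


def make_counter(name: str, description: str, labelnames: list[str], enabled: bool):
    """Return a prometheus_client Counter when possible, else a DummyMetric."""
    if not enabled:
        return DummyMetric()
    try:
        from prometheus_client import Counter
    except ImportError:
        # Library not installed: fall back to the no-op metric.
        return DummyMetric()
    return Counter(name, description, labelnames)
```

Because the dummy object returns itself from labels(), call sites look identical whether metrics are enabled or not.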
Sentry Integration
Configuration
from app.core.observability import sentry, init_observability

# Initialize via init_observability
init_observability(
    sentry_dsn="https://key@sentry.io/project",
    environment="production",
)

# Or initialize directly
sentry.init(
    dsn="https://key@sentry.io/project",
    environment="production",
    release="1.0.0",
)
Capturing Errors
import logging

from app.core.observability import sentry

logger = logging.getLogger(__name__)

try:
    risky_operation()  # placeholder for any call that may raise
except Exception as e:
    event_id = sentry.capture_exception(e)
    logger.error(f"Operation failed, Sentry event: {event_id}")

# Capture messages
sentry.capture_message("User reached rate limit", level="warning")
Without Sentry
If sentry_sdk isn't installed or no DSN is provided, all capture calls silently return None, so calling code needs no guard.
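A wrapper with this behaviour might look like the sketch below. SentryClient is a hypothetical class for illustration; the real sentry object in app/core/observability.py may be structured differently. The sentry_sdk calls used (init, capture_exception, capture_message) are the SDK's public functions.

```python
from typing import Optional


class SentryClient:
    """Hypothetical wrapper that degrades to a no-op when Sentry is unavailable."""

    def __init__(self) -> None:
        self._sdk = None  # set only after a successful init()

    def init(self, dsn: Optional[str], environment: str = "development") -> bool:
        if not dsn:
            return False  # no DSN: stay disabled
        try:
            import sentry_sdk
        except ImportError:
            return False  # SDK missing: stay disabled
        sentry_sdk.init(dsn=dsn, environment=environment)
        self._sdk = sentry_sdk
        return True

    def capture_exception(self, exc: BaseException) -> Optional[str]:
        if self._sdk is None:
            return None
        return self._sdk.capture_exception(exc)

    def capture_message(self, message: str, level: str = "info") -> Optional[str]:
        if self._sdk is None:
            return None
        return self._sdk.capture_message(message, level=level)
```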
Module Health Checks
Modules can provide health checks that are automatically registered.
Defining Module Health Check
# In module definition
import stripe

from app.modules.base import ModuleDefinition


def check_billing_health() -> dict:
    """Check billing service health."""
    try:
        # Check Stripe connection
        stripe.Account.retrieve()
        return {"status": "healthy", "message": "Stripe connected"}
    except Exception as e:
        return {"status": "unhealthy", "message": str(e)}


billing_module = ModuleDefinition(
    code="billing",
    name="Billing",
    health_check=check_billing_health,
    # ...
)
Registering Module Health Checks
from app.core.observability import register_module_health_checks
# Call after modules are loaded (e.g., in app lifespan)
register_module_health_checks()
This registers health checks as module:{code} (e.g., module:billing).
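The naming scheme can be illustrated with a plain dictionary standing in for the registry. Everything here is a sketch under assumptions: register_module_checks is a hypothetical name, and converting a raised exception into an unhealthy result is a design choice of this example, not confirmed behaviour of register_module_health_checks.

```python
from typing import Callable, Dict


def register_module_checks(
    registry: Dict[str, Callable[[], dict]],
    modules: Dict[str, Callable[[], dict]],
) -> None:
    """Register each module's health check under the key 'module:{code}'."""
    for code, check in modules.items():

        # Bind `check` as a default argument so each wrapper keeps its own function.
        def safe_check(check: Callable[[], dict] = check) -> dict:
            # Assumption: a check that raises is reported as unhealthy
            # instead of breaking the aggregated endpoint.
            try:
                return check()
            except Exception as exc:
                return {"status": "unhealthy", "message": str(exc)}

        registry[f"module:{code}"] = safe_check
```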
External Tools
Flower (Celery Monitoring)
Configure Flower URL to include in /health/tools:
init_observability(
    flower_url="http://flower:5555",
)
Grafana
Configure Grafana URL:
init_observability(
    grafana_url="http://grafana:3000",
)
Initialization
Application Lifespan
Observability is initialized in app/core/lifespan.py and the health router is mounted in main.py:
# app/core/lifespan.py
from contextlib import asynccontextmanager

from fastapi import FastAPI

from app.core.config import settings
from app.core.observability import init_observability, shutdown_observability


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup
    init_observability(
        enable_metrics=settings.enable_metrics,
        sentry_dsn=settings.sentry_dsn,
        environment=settings.sentry_environment,
        flower_url=settings.flower_url,
        grafana_url=settings.grafana_url,
    )
    yield
    # Shutdown
    shutdown_observability()


# main.py
from app.core.observability import health_router

app.include_router(health_router)  # /metrics, /health/live, /health/ready, /health/tools
Note: /health is defined separately in main.py with a richer response (DB check, feature list, docs links). The health_router provides the Kubernetes-style probes and Prometheus endpoint.
Environment Variables
| Variable | Config field | Description | Default |
|---|---|---|---|
| ENABLE_METRICS | enable_metrics | Enable Prometheus metrics collection | False |
| GRAFANA_URL | grafana_url | Grafana dashboard URL | https://grafana.wizard.lu |
| GRAFANA_ADMIN_USER | - | Grafana admin username (docker-compose only) | admin |
| GRAFANA_ADMIN_PASSWORD | - | Grafana admin password (docker-compose only) | changeme |
| SENTRY_DSN | sentry_dsn | Sentry DSN for error tracking | None (disabled) |
| SENTRY_ENVIRONMENT | sentry_environment | Environment name for Sentry | development |
| FLOWER_URL | flower_url | Flower dashboard URL | http://localhost:5555 |
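For illustration, the application-level variables and their defaults can be read with the standard library alone. This is a sketch, not the project's settings class in app/core/config.py: ObservabilitySettings and load_settings are hypothetical names, and the accepted truthy strings for ENABLE_METRICS are an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ObservabilitySettings:
    enable_metrics: bool
    grafana_url: str
    sentry_dsn: Optional[str]
    sentry_environment: str
    flower_url: str


def load_settings(env: Mapping[str, str] = os.environ) -> ObservabilitySettings:
    """Read the variables from the table, falling back to the listed defaults."""
    # Assumption: "1", "true" and "yes" (any case) count as enabled.
    metrics_flag = env.get("ENABLE_METRICS", "false").strip().lower()
    return ObservabilitySettings(
        enable_metrics=metrics_flag in {"1", "true", "yes"},
        grafana_url=env.get("GRAFANA_URL", "https://grafana.wizard.lu"),
        sentry_dsn=env.get("SENTRY_DSN") or None,  # empty string disables Sentry
        sentry_environment=env.get("SENTRY_ENVIRONMENT", "development"),
        flower_url=env.get("FLOWER_URL", "http://localhost:5555"),
    )
```

The GRAFANA_ADMIN_USER and GRAFANA_ADMIN_PASSWORD variables are left out because the table marks them as docker-compose only.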
Kubernetes Integration
Deployment Configuration
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: app
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10
Prometheus ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
spec:
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
Best Practices
Do
- Register health checks for critical dependencies (database, cache, external APIs)
- Use appropriate metric types (counter for counts, histogram for latency)
- Include meaningful labels but avoid high cardinality
- Set up alerts based on health status changes
Don't
- Create health checks that are slow or have side effects
- Add high-cardinality labels to metrics (e.g., user IDs)
- Ignore Sentry errors in production
- Skip readiness probes in Kubernetes deployments
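The advice on label cardinality can be made concrete with a small helper that collapses variable path segments before they are used as an endpoint label. This is a sketch under assumptions: endpoint_label is a hypothetical function, and it only recognises purely numeric and UUID segments as identifiers.

```python
import re

# Assumption: IDs appear as purely numeric or UUID path segments.
_UUID = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def endpoint_label(path: str) -> str:
    """Collapse variable path segments so the label set stays small."""
    segments = []
    for segment in path.split("/"):
        if segment.isdigit() or _UUID.match(segment):
            segments.append("{id}")
        else:
            segments.append(segment)
    return "/".join(segments)
```

With this helper, /api/products/42 and /api/products/43 both map to /api/products/{id}, so they share one time series instead of creating one per product.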
Related Documentation
- Hetzner Server Setup — Step 18 - Production monitoring deployment
- Module System - Module health check integration
- Background Tasks - Celery monitoring with Flower
- Deployment - Production deployment with monitoring