orion/docs/architecture/observability.md
Samir Boulahtit 677e5211f9
docs: update observability and deployment docs to match production stack
Update observability.md with production container table, actual init code,
and correct env var names. Update docker.md with full 10-service table and
backup/monitoring cross-references. Add explicit AAAA records to DNS tables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 16:44:05 +01:00


Observability Framework

The Orion platform includes an observability framework for monitoring application health, collecting metrics, and tracking errors. It is part of the Framework Layer, the infrastructure that modules depend on.

Production Stack

The full monitoring stack runs as Docker containers alongside the application:

| Container | Image | Port | Purpose |
|-----------|-------|------|---------|
| prometheus | prom/prometheus | 9090 (localhost) | Metrics storage, 15-day retention |
| grafana | grafana/grafana | 3001 (localhost) | Dashboards at https://grafana.wizard.lu |
| node-exporter | prom/node-exporter | 9100 (localhost) | Host CPU/RAM/disk metrics |
| cadvisor | gcr.io/cadvisor/cadvisor | 8080 (localhost) | Per-container resource metrics |

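All monitoring containers are assigned to profiles: [full] in docker-compose.yml, so they start only when that profile is enabled (for example with docker compose --profile full up -d). Each container has a memory limit; together the limits total 256 + 192 + 64 + 128 = 640 MB.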

┌──────────────┐     scrape      ┌─────────────────┐
│  Prometheus  │◄────────────────│  Orion API       │ /metrics
│  :9090       │◄────────────────│  node-exporter   │ :9100
│              │◄────────────────│  cAdvisor        │ :8080
└──────┬───────┘                 └─────────────────┘
       │ query
┌──────▼───────┐
│   Grafana    │──── https://grafana.wizard.lu
│   :3001      │
└──────────────┘

Configuration files:

  • monitoring/prometheus.yml — scrape targets (orion-api, node-exporter, cadvisor, self)
  • monitoring/grafana/provisioning/datasources/datasource.yml — auto-provisions Prometheus
  • monitoring/grafana/provisioning/dashboards/dashboard.yml — file-based dashboard provider

Overview

┌─────────────────────────────────────────────────────────────────────────┐
│                     OBSERVABILITY FRAMEWORK                              │
│                    app/core/observability.py                            │
│                                                                          │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐        │
│  │ Health Checks   │  │ Prometheus      │  │ Sentry          │        │
│  │ Registry        │  │ Metrics         │  │ Integration     │        │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘        │
│           │                    │                    │                   │
│           ▼                    ▼                    ▼                   │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                      API Endpoints                               │   │
│  │  /health  │  /health/live  │  /health/ready  │  /metrics        │   │
│  └─────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
                    ┌───────────────────────────────┐
                    │      External Tools           │
                    │  Flower │ Grafana │ Prometheus│
                    └───────────────────────────────┘

Health Checks

Health Check Registry

Components register health checks that are aggregated into a single endpoint.

from app.core.observability import (
    health_registry,
    HealthCheckResult,
    HealthStatus,
)

# Register using decorator
@health_registry.register("database")
def check_database() -> HealthCheckResult:
    try:
        # `db` stands for the application's database session or connection (not shown here)
        db.execute("SELECT 1")
        return HealthCheckResult(
            name="database",
            status=HealthStatus.HEALTHY,
            message="Database connection OK"
        )
    except Exception as e:
        return HealthCheckResult(
            name="database",
            status=HealthStatus.UNHEALTHY,
            message=str(e)
        )

# Or register directly (check_redis_connection is a check function defined elsewhere)
health_registry.register_check("redis", check_redis_connection)
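
To make the two registration styles concrete, here is a minimal sketch of how such a registry could work internally. It is not the code in app/core/observability.py: the class name SketchHealthRegistry, the run_all method, and the shape of the returned dictionaries are assumptions made for illustration.

```python
import time
from typing import Any, Callable, Dict

CheckFn = Callable[[], Any]


class SketchHealthRegistry:
    """Hypothetical stand-in for health_registry; not the Orion implementation."""

    def __init__(self) -> None:
        self._checks: Dict[str, CheckFn] = {}

    def register(self, name: str) -> Callable[[CheckFn], CheckFn]:
        """Decorator form: @registry.register("database")."""
        def decorator(fn: CheckFn) -> CheckFn:
            self.register_check(name, fn)
            return fn  # hand the function back unchanged
        return decorator

    def register_check(self, name: str, fn: CheckFn) -> None:
        """Direct form: registry.register_check("redis", check_redis)."""
        self._checks[name] = fn

    def run_all(self) -> Dict[str, Dict[str, Any]]:
        """Run every check, timing each one and trapping exceptions."""
        results: Dict[str, Dict[str, Any]] = {}
        for name, fn in self._checks.items():
            started = time.perf_counter()
            try:
                outcome, error = fn(), None
            except Exception as exc:  # a failing check must not break the endpoint
                outcome, error = None, str(exc)
            latency_ms = (time.perf_counter() - started) * 1000.0
            results[name] = {"result": outcome, "error": error, "latency_ms": latency_ms}
        return results


registry = SketchHealthRegistry()


@registry.register("database")
def check_database() -> str:
    return "ok"


def check_redis() -> str:
    raise ConnectionError("redis unreachable")


registry.register_check("redis", check_redis)
report = registry.run_all()
```

Trapping exceptions inside run_all is a design choice of this sketch: a check that raises is reported as failed instead of turning the whole health endpoint into a 500.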

Health Status Levels

| Status | Description | HTTP Response |
|--------|-------------|---------------|
| HEALTHY | All systems operational | 200 |
| DEGRADED | Partial functionality available | 200 |
| UNHEALTHY | Critical failure | 200 (check response body) |
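
Because all three levels are listed with HTTP 200, a caller cannot tell them apart by status code and has to read the body. A minimal sketch of that check follows; the helper names overall_status and is_healthy are hypothetical, and the body is assumed to carry a top-level "status" field as in the /health response documented in this page.

```python
import json


def overall_status(body: str) -> str:
    """Extract the top-level status from a /health JSON body."""
    payload = json.loads(body)
    return str(payload.get("status", "unknown")).lower()


def is_healthy(body: str) -> bool:
    """True only for "healthy"; degraded and unhealthy both count as not healthy."""
    return overall_status(body) == "healthy"


unhealthy_body = '{"status": "unhealthy", "checks": []}'
healthy_body = '{"status": "healthy", "checks": []}'
```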

HealthCheckResult Fields

| Field | Type | Description |
|-------|------|-------------|
| name | str | Check identifier |
| status | HealthStatus | Health status level |
| message | str | Optional status message |
| latency_ms | float | Check execution time |
| details | dict | Additional diagnostic data |
| checked_at | datetime | Timestamp of check |
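
The fields above map naturally onto a dataclass. The sketch below is one plausible shape, not the actual definition: the default values, the lowercase enum values, and the to_dict helper are assumptions, chosen so that the output matches the per-check objects in the /health response.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Dict


class HealthStatus(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class HealthCheckResult:
    name: str
    status: HealthStatus
    message: str = ""          # assumed default: empty message
    latency_ms: float = 0.0    # assumed default: filled in by whoever runs the check
    details: Dict[str, Any] = field(default_factory=dict)
    checked_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def to_dict(self) -> Dict[str, Any]:
        """Serialize to the per-check shape used in the /health response."""
        return {
            "name": self.name,
            "status": self.status.value,
            "message": self.message,
            "latency_ms": self.latency_ms,
            "details": self.details,
        }


result = HealthCheckResult(
    name="database",
    status=HealthStatus.HEALTHY,
    message="Connection OK",
    latency_ms=2.5,
)
```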

API Endpoints

GET /health

Aggregated health check endpoint. Returns combined status from all registered checks.

Response:

{
    "status": "healthy",
    "timestamp": "2026-01-27T10:30:00Z",
    "checks": [
        {
            "name": "database",
            "status": "healthy",
            "message": "Connection OK",
            "latency_ms": 2.5,
            "details": {}
        },
        {
            "name": "module:billing",
            "status": "healthy",
            "message": "",
            "latency_ms": 0.1,
            "details": {}
        }
    ]
}

Overall Status Logic:

  • If any check is UNHEALTHY → overall is UNHEALTHY
  • If any check is DEGRADED and none UNHEALTHY → overall is DEGRADED
  • Otherwise → HEALTHY
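
The three rules above amount to "worst status wins". A minimal sketch, assuming a HealthStatus enum with the three documented levels (the function name aggregate is hypothetical):

```python
from enum import Enum
from typing import Iterable


class HealthStatus(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


def aggregate(statuses: Iterable[HealthStatus]) -> HealthStatus:
    """Combine individual check statuses into the overall status."""
    seen = set(statuses)
    if HealthStatus.UNHEALTHY in seen:
        return HealthStatus.UNHEALTHY
    if HealthStatus.DEGRADED in seen:
        return HealthStatus.DEGRADED
    # No failing or degraded checks (including the case of no checks at all).
    return HealthStatus.HEALTHY
```

Treating an empty set of checks as HEALTHY follows from the "Otherwise" rule; whether the real implementation does the same is not stated here.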

GET /health/live

Kubernetes liveness probe. Returns 200 whenever the application process is running.

Response:

{"status": "alive"}

GET /health/ready

Kubernetes readiness probe. Reports ready or not_ready based on the registered health checks.

Response:

{
    "status": "ready",
    "health": "healthy"
}

Or if unhealthy:

{
    "status": "not_ready",
    "health": "unhealthy"
}

GET /metrics

Prometheus metrics endpoint. Returns metrics in Prometheus text format.

Response:

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/products",status="200"} 1234
...
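
To illustrate how a sample line in this text format breaks down into name, labels, and value, here is a deliberately simplified parser. It is a sketch for reading the example above, not a full implementation of the exposition format: it ignores optional timestamps and does not unescape label values. The function name parse_sample is hypothetical.

```python
import re
from typing import Dict, Tuple

# metric_name{label="value",...} value
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str) -> Tuple[str, Dict[str, str], float]:
    """Split one sample line into (metric name, labels, value)."""
    match = _SAMPLE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(_LABEL.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


name, labels, value = parse_sample(
    'http_requests_total{method="GET",endpoint="/api/products",status="200"} 1234'
)
```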

GET /health/tools

Returns URLs to external monitoring tools.

Response:

{
    "flower": "http://flower.example.com:5555",
    "grafana": "http://grafana.example.com:3000",
    "prometheus": null
}

Prometheus Metrics

MetricsRegistry

The metrics registry wraps prometheus_client and falls back to no-op metrics when the library isn't installed.

from app.core.observability import metrics_registry

# Counter - tracks cumulative values
request_counter = metrics_registry.counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"]
)
request_counter.labels(method="GET", endpoint="/api/products", status="200").inc()

# Histogram - tracks distributions (latency, sizes)
request_latency = metrics_registry.histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)
request_latency.labels(endpoint="/api/products").observe(0.045)

# Gauge - tracks current values
active_connections = metrics_registry.gauge(
    "active_connections",
    "Number of active connections",
    ["pool"]
)
active_connections.labels(pool="database").set(42)
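
Latency histograms are usually fed from a timer rather than a hard-coded value. The sketch below shows one way to do that with a context manager that works with any object exposing observe(). The names timed and RecordingHistogram are hypothetical; RecordingHistogram only stands in for a labelled histogram such as request_latency.labels(endpoint=...) so the example runs on its own.

```python
import time
from contextlib import contextmanager
from typing import Iterator, List, Protocol


class Observer(Protocol):
    def observe(self, amount: float) -> None: ...


@contextmanager
def timed(metric: Observer) -> Iterator[None]:
    """Observe the wall-clock duration of the wrapped block, in seconds."""
    started = time.perf_counter()
    try:
        yield
    finally:
        # Record the duration even if the block raised.
        metric.observe(time.perf_counter() - started)


class RecordingHistogram:
    """Test double that keeps observed values in a list."""

    def __init__(self) -> None:
        self.samples: List[float] = []

    def observe(self, amount: float) -> None:
        self.samples.append(amount)


histogram = RecordingHistogram()
with timed(histogram):
    sum(range(1000))  # stand-in for handling a request
```

prometheus_client histograms also provide their own time() helper; whether the Orion wrapper exposes it is not covered here.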

Enabling Metrics

Metrics are disabled by default. Enable them during initialization:

from app.core.observability import init_observability

init_observability(
    enable_metrics=True,
    # ... other options
)

Dummy Metrics

When prometheus_client isn't installed or metrics are disabled, the registry returns dummy metrics that silently ignore all operations. This allows code to use metrics without checking if they're enabled.
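
This is the null-object pattern. A minimal sketch of how such a fallback could look, assuming a hypothetical DummyMetric class and make_counter factory (the real registry's internals may differ):

```python
from typing import Any, Sequence

try:
    from prometheus_client import Counter
except ImportError:  # library not installed: only dummy metrics are available
    Counter = None


class DummyMetric:
    """Accepts the usual metric calls and does nothing."""

    def labels(self, *args: Any, **kwargs: Any) -> "DummyMetric":
        return self  # allow chaining: metric.labels(...).inc()

    def inc(self, amount: float = 1.0) -> None:
        pass

    def dec(self, amount: float = 1.0) -> None:
        pass

    def set(self, value: float) -> None:
        pass

    def observe(self, amount: float) -> None:
        pass


def make_counter(name: str, documentation: str,
                 labelnames: Sequence[str] = (), enabled: bool = False) -> Any:
    """Return a real Counter when possible, otherwise a DummyMetric."""
    if not enabled or Counter is None:
        return DummyMetric()
    return Counter(name, documentation, labelnames)


requests_total = make_counter("http_requests_total", "Total HTTP requests", ["method"])
requests_total.labels(method="GET").inc()  # safe no-op while metrics are disabled
```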

Sentry Integration

Configuration

from app.core.observability import sentry, init_observability

# Initialize via init_observability
init_observability(
    sentry_dsn="https://key@sentry.io/project",
    environment="production",
)

# Or initialize directly
sentry.init(
    dsn="https://key@sentry.io/project",
    environment="production",
    release="1.0.0"
)

Capturing Errors

import logging

from app.core.observability import sentry

logger = logging.getLogger(__name__)

try:
    risky_operation()  # placeholder for any call that may raise
except Exception as e:
    event_id = sentry.capture_exception(e)
    logger.error("Operation failed, Sentry event: %s", event_id)

# Capture messages
sentry.capture_message("User reached rate limit", level="warning")

Without Sentry

If sentry_sdk isn't installed or no DSN is provided, all capture calls silently return None.
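
A minimal sketch of how a wrapper can provide that behavior, assuming a hypothetical SentryFacade class (the actual sentry object in app/core/observability.py may be structured differently). It uses only sentry_sdk.init, sentry_sdk.capture_exception, and sentry_sdk.capture_message, and only when the SDK is importable and a DSN was given.

```python
from typing import Optional

try:
    import sentry_sdk
except ImportError:  # SDK not installed: the facade stays disabled
    sentry_sdk = None


class SentryFacade:
    """Hypothetical wrapper that degrades to a no-op without SDK or DSN."""

    def __init__(self) -> None:
        self._enabled = False

    def init(self, dsn: Optional[str] = None, environment: str = "development",
             release: Optional[str] = None) -> bool:
        """Initialize Sentry; returns False when it stays disabled."""
        if sentry_sdk is None or not dsn:
            return False
        sentry_sdk.init(dsn=dsn, environment=environment, release=release)
        self._enabled = True
        return True

    def capture_exception(self, exc: BaseException) -> Optional[str]:
        if not self._enabled:
            return None
        return sentry_sdk.capture_exception(exc)

    def capture_message(self, message: str, level: str = "info") -> Optional[str]:
        if not self._enabled:
            return None
        return sentry_sdk.capture_message(message, level=level)


facade = SentryFacade()
initialized = facade.init(dsn=None)  # no DSN: nothing is sent anywhere
event_id = facade.capture_exception(ValueError("example"))
```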

Module Health Checks

Modules can provide health checks that are automatically registered.

Defining Module Health Check

# In module definition
import stripe

from app.modules.base import ModuleDefinition

def check_billing_health() -> dict:
    """Check billing service health."""
    try:
        # Check Stripe connection
        stripe.Account.retrieve()
        return {"status": "healthy", "message": "Stripe connected"}
    except Exception as e:
        return {"status": "unhealthy", "message": str(e)}

billing_module = ModuleDefinition(
    code="billing",
    name="Billing",
    health_check=check_billing_health,
    # ...
)

Registering Module Health Checks

from app.core.observability import register_module_health_checks

# Call after modules are loaded (e.g., in app lifespan)
register_module_health_checks()

This registers health checks as module:{code} (e.g., module:billing).
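
The naming rule can be sketched as follows. ModuleStub and register_module_checks are hypothetical stand-ins for ModuleDefinition and register_module_health_checks, and a plain dict plays the role of the health registry; the sketch only shows the module:{code} naming and the skipping of modules without a health check.

```python
from typing import Callable, Dict, Iterable, List, NamedTuple, Optional

HealthFn = Callable[[], dict]


class ModuleStub(NamedTuple):
    """Stand-in for ModuleDefinition with only the fields used here."""
    code: str
    health_check: Optional[HealthFn] = None


def register_module_checks(modules: Iterable[ModuleStub],
                           registry: Dict[str, HealthFn]) -> List[str]:
    """Register each module's health check under the name module:{code}."""
    registered: List[str] = []
    for module in modules:
        if module.health_check is None:
            continue  # modules without a health check are skipped
        name = f"module:{module.code}"
        registry[name] = module.health_check
        registered.append(name)
    return registered


checks: Dict[str, HealthFn] = {}
names = register_module_checks(
    [
        ModuleStub("billing", lambda: {"status": "healthy", "message": "Stripe connected"}),
        ModuleStub("catalog"),  # no health check defined
    ],
    checks,
)
```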

External Tools

Flower (Celery Monitoring)

Configure the Flower URL to include it in /health/tools:

init_observability(
    flower_url="http://flower:5555",
)

Grafana

Configure the Grafana URL to include it in /health/tools:

init_observability(
    grafana_url="http://grafana:3000",
)

Initialization

Application Lifespan

Observability is initialized in app/core/lifespan.py and the health router is mounted in main.py:

# app/core/lifespan.py
from contextlib import asynccontextmanager

from fastapi import FastAPI

from app.core.config import settings
from app.core.observability import init_observability, shutdown_observability

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup
    init_observability(
        enable_metrics=settings.enable_metrics,
        sentry_dsn=settings.sentry_dsn,
        environment=settings.sentry_environment,
        flower_url=settings.flower_url,
        grafana_url=settings.grafana_url,
    )
    yield
    # Shutdown
    shutdown_observability()

# main.py
from app.core.observability import health_router
app.include_router(health_router)  # /metrics, /health/live, /health/ready, /health/tools

Note: /health is defined separately in main.py with a richer response (DB check, feature list, docs links). The health_router provides the Kubernetes-style probes and Prometheus endpoint.

Environment Variables

| Variable | Config field | Description | Default |
|----------|--------------|-------------|---------|
| ENABLE_METRICS | enable_metrics | Enable Prometheus metrics collection | False |
| GRAFANA_URL | grafana_url | Grafana dashboard URL | https://grafana.wizard.lu |
| GRAFANA_ADMIN_USER | | Grafana admin username (docker-compose only) | admin |
| GRAFANA_ADMIN_PASSWORD | | Grafana admin password (docker-compose only) | changeme |
| SENTRY_DSN | sentry_dsn | Sentry DSN for error tracking | None (disabled) |
| SENTRY_ENVIRONMENT | sentry_environment | Environment name for Sentry | development |
| FLOWER_URL | flower_url | Flower dashboard URL | http://localhost:5555 |
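
For illustration, the application-level variables and their defaults can be read with nothing but the standard library. The sketch below is not app/core/config.py: the ObservabilitySettings class, the from_env method, and the set of strings accepted as true are assumptions. The two GRAFANA_ADMIN_* variables are left out because they are consumed by docker-compose only.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(raw: str) -> bool:
    """Interpret common truthy spellings; everything else is False."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ObservabilitySettings:
    enable_metrics: bool
    grafana_url: str
    sentry_dsn: Optional[str]
    sentry_environment: str
    flower_url: str

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "ObservabilitySettings":
        """Build settings from environment variables, applying the documented defaults."""
        return cls(
            enable_metrics=_as_bool(env.get("ENABLE_METRICS", "false")),
            grafana_url=env.get("GRAFANA_URL", "https://grafana.wizard.lu"),
            sentry_dsn=env.get("SENTRY_DSN") or None,  # empty string also disables Sentry
            sentry_environment=env.get("SENTRY_ENVIRONMENT", "development"),
            flower_url=env.get("FLOWER_URL", "http://localhost:5555"),
        )


defaults = ObservabilitySettings.from_env({})
custom = ObservabilitySettings.from_env({"ENABLE_METRICS": "true", "SENTRY_DSN": ""})
```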

Kubernetes Integration

Deployment Configuration

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: app
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10

Prometheus ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
spec:
  endpoints:
    - port: http
      path: /metrics
      interval: 15s

Best Practices

Do

  • Register health checks for critical dependencies (database, cache, external APIs)
  • Use appropriate metric types (counter for counts, histogram for latency)
  • Include meaningful labels but avoid high cardinality
  • Set up alerts based on health status changes

Don't

  • Create health checks that are slow or have side effects
  • Add high-cardinality labels to metrics (e.g., user IDs)
  • Ignore Sentry errors in production
  • Skip readiness probes in Kubernetes deployments
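
One common way to keep label cardinality low is to record a normalized route instead of the raw request path, so that /api/products/1 and /api/products/2 share one time series. A minimal sketch, assuming IDs appear as numeric or UUID path segments (the endpoint_label helper and the {id} and {uuid} placeholders are hypothetical):

```python
import re

_UUID = re.compile(
    r"/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}(?=/|$)"
)
_NUMERIC = re.compile(r"/\d+(?=/|$)")


def endpoint_label(path: str) -> str:
    """Collapse per-resource identifiers so the label has few distinct values."""
    path = path.split("?", 1)[0]       # drop the query string
    path = _UUID.sub("/{uuid}", path)  # replace UUID segments first
    return _NUMERIC.sub("/{id}", path) # then purely numeric segments


label = endpoint_label("/api/users/17/orders/9001?page=2")
```

The result would then be passed as the endpoint label, for example request_counter.labels(method="GET", endpoint=label, status="200").inc().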