# Observability Framework

The Orion platform includes a comprehensive observability framework for monitoring application health, collecting metrics, and tracking errors. This is part of the Framework Layer - infrastructure that modules depend on.

## Overview

```
┌─────────────────────────────────────────────────────────────────────────┐
│                         OBSERVABILITY FRAMEWORK                         │
│                        app/core/observability.py                        │
│                                                                         │
│  ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐    │
│  │  Health Checks  │     │   Prometheus    │     │     Sentry      │    │
│  │    Registry     │     │     Metrics     │     │   Integration   │    │
│  └────────┬────────┘     └────────┬────────┘     └────────┬────────┘    │
│           │                       │                       │             │
│           ▼                       ▼                       ▼             │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │                          API Endpoints                          │    │
│  │        /health │ /health/live │ /health/ready │ /metrics        │    │
│  └─────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
                     ┌───────────────────────────────┐
                     │        External Tools         │
                     │ Flower │ Grafana │ Prometheus │
                     └───────────────────────────────┘
```

## Health Checks

### Health Check Registry

Components register health checks that are aggregated into a single endpoint.

```python
from app.core.observability import (
    health_registry,
    HealthCheckResult,
    HealthStatus,
)

# `db` is assumed to be the application's database session, defined elsewhere.

# Register using decorator
@health_registry.register("database")
def check_database() -> HealthCheckResult:
    try:
        db.execute("SELECT 1")
        return HealthCheckResult(
            name="database",
            status=HealthStatus.HEALTHY,
            message="Database connection OK"
        )
    except Exception as e:
        return HealthCheckResult(
            name="database",
            status=HealthStatus.UNHEALTHY,
            message=str(e)
        )

# Or register directly (`check_redis_connection` is a check function defined elsewhere)
health_registry.register_check("redis", check_redis_connection)
```

### Health Status Levels

| Status | Description | HTTP Response |
|--------|-------------|---------------|
| `HEALTHY` | All systems operational | 200 |
| `DEGRADED` | Partial functionality available | 200 |
| `UNHEALTHY` | Critical failure | 200 (check response body) |

### HealthCheckResult Fields

| Field | Type | Description |
|-------|------|-------------|
| `name` | str | Check identifier |
| `status` | HealthStatus | Health status level |
| `message` | str | Optional status message |
| `latency_ms` | float | Check execution time in milliseconds |
| `details` | dict | Additional diagnostic data |
| `checked_at` | datetime | Timestamp of check |

## API Endpoints

### GET /health

Aggregated health check endpoint. Returns the combined status of all registered checks.

**Response:**

```json
{
  "status": "healthy",
  "timestamp": "2026-01-27T10:30:00Z",
  "checks": [
    {
      "name": "database",
      "status": "healthy",
      "message": "Database connection OK",
      "latency_ms": 2.5,
      "details": {}
    },
    {
      "name": "module:billing",
      "status": "healthy",
      "message": "",
      "latency_ms": 0.1,
      "details": {}
    }
  ]
}
```

**Overall Status Logic:**

- If any check is `UNHEALTHY` → overall is `UNHEALTHY`
- If any check is `DEGRADED` and none is `UNHEALTHY` → overall is `DEGRADED`
- Otherwise → `HEALTHY`

### GET /health/live

Kubernetes liveness probe. Returns 200 if the application is running.

**Response:**

```json
{"status": "alive"}
```

### GET /health/ready

Kubernetes readiness probe.
Returns ready status based on health checks.

**Response:**

```json
{
  "status": "ready",
  "health": "healthy"
}
```

Or if unhealthy:

```json
{
  "status": "not_ready",
  "health": "unhealthy"
}
```

### GET /metrics

Prometheus metrics endpoint. Returns metrics in Prometheus text format.

**Response:**

```
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/products",status="200"} 1234
...
```

### GET /health/tools

Returns URLs to external monitoring tools.

**Response:**

```json
{
  "flower": "http://flower.example.com:5555",
  "grafana": "http://grafana.example.com:3000",
  "prometheus": null
}
```

## Prometheus Metrics

### MetricsRegistry

The metrics registry wraps `prometheus_client` and provides a fallback for when the library isn't installed.

```python
from app.core.observability import metrics_registry

# Counter - tracks cumulative values
request_counter = metrics_registry.counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"]
)
request_counter.labels(method="GET", endpoint="/api/products", status="200").inc()

# Histogram - tracks distributions (latency, sizes)
request_latency = metrics_registry.histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)
request_latency.labels(endpoint="/api/products").observe(0.045)

# Gauge - tracks current values
active_connections = metrics_registry.gauge(
    "active_connections",
    "Number of active connections",
    ["pool"]
)
active_connections.labels(pool="database").set(42)
```

### Enabling Metrics

Metrics are disabled by default. Enable them during initialization:

```python
from app.core.observability import init_observability

init_observability(
    enable_metrics=True,
    # ... other options
)
```

### Dummy Metrics

When `prometheus_client` isn't installed or metrics are disabled, the registry returns dummy metrics that silently ignore all operations.
This allows code to use metrics without checking whether they're enabled.

## Sentry Integration

### Configuration

```python
from app.core.observability import sentry, init_observability

# Initialize via init_observability
init_observability(
    sentry_dsn="https://key@sentry.io/project",
    environment="production",
)

# Or initialize directly
sentry.init(
    dsn="https://key@sentry.io/project",
    environment="production",
    release="1.0.0"
)
```

### Capturing Errors

```python
import logging

from app.core.observability import sentry

logger = logging.getLogger(__name__)

try:
    risky_operation()
except Exception as e:
    event_id = sentry.capture_exception(e)
    logger.error(f"Operation failed, Sentry event: {event_id}")

# Capture messages
sentry.capture_message("User reached rate limit", level="warning")
```

### Without Sentry

If `sentry_sdk` isn't installed or no DSN is provided, all capture calls silently return `None`.

## Module Health Checks

Modules can provide health checks that are automatically registered.

### Defining Module Health Check

```python
# In module definition
import stripe

from app.modules.base import ModuleDefinition

def check_billing_health() -> dict:
    """Check billing service health."""
    try:
        # Check Stripe connection
        stripe.Account.retrieve()
        return {"status": "healthy", "message": "Stripe connected"}
    except Exception as e:
        return {"status": "unhealthy", "message": str(e)}

billing_module = ModuleDefinition(
    code="billing",
    name="Billing",
    health_check=check_billing_health,
    # ...
)
```

### Registering Module Health Checks

```python
from app.core.observability import register_module_health_checks

# Call after modules are loaded (e.g., in app lifespan)
register_module_health_checks()
```

This registers each module's health check under the name `module:{code}` (e.g., `module:billing`).
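
To make the registration step more concrete, here is a minimal, self-contained sketch of how a dict-returning module check could be adapted and stored under a `module:{code}` name. It is an illustration under assumptions, not the framework's actual code: `HealthStatus` and `HealthCheckResult` are local stand-ins for the real types in `app.core.observability`, and `wrap_module_check`, `register_module_checks`, and the two sample checks are hypothetical names.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict


class HealthStatus(str, Enum):
    # Stand-in for the framework's status enum (string values are assumed).
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class HealthCheckResult:
    # Stand-in carrying a subset of the documented fields.
    name: str
    status: HealthStatus
    message: str = ""
    details: dict = field(default_factory=dict)


def wrap_module_check(
    code: str, check: Callable[[], dict]
) -> Callable[[], HealthCheckResult]:
    """Adapt a dict-returning module check (hypothetical helper)."""
    name = f"module:{code}"

    def wrapped() -> HealthCheckResult:
        try:
            raw = check()
            return HealthCheckResult(
                name=name,
                status=HealthStatus(raw.get("status", "healthy")),
                message=raw.get("message", ""),
            )
        except Exception as exc:
            # A check that raises, or returns an unknown status string,
            # is reported as unhealthy instead of crashing the endpoint.
            return HealthCheckResult(
                name=name, status=HealthStatus.UNHEALTHY, message=str(exc)
            )

    return wrapped


def register_module_checks(
    checks_by_code: Dict[str, Callable[[], dict]],
) -> Dict[str, Callable[[], HealthCheckResult]]:
    """Build a name -> check mapping keyed as module:{code} (hypothetical)."""
    return {
        f"module:{code}": wrap_module_check(code, check)
        for code, check in checks_by_code.items()
    }


def billing_ok() -> dict:
    return {"status": "healthy", "message": "Stripe connected"}


def inventory_broken() -> dict:
    raise RuntimeError("warehouse API timeout")


registry = register_module_checks(
    {"billing": billing_ok, "inventory": inventory_broken}
)
results = {name: check() for name, check in registry.items()}
```

Wrapping each check in a `try`/`except` means one misbehaving module shows up as an unhealthy entry and does not prevent the other checks from being reported.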

## External Tools

### Flower (Celery Monitoring)

Configure the Flower URL to include in `/health/tools`:

```python
init_observability(
    flower_url="http://flower:5555",
)
```

### Grafana

Configure the Grafana URL:

```python
init_observability(
    grafana_url="http://grafana:3000",
)
```

## Initialization

### Application Lifespan

```python
# main.py
from contextlib import asynccontextmanager
from fastapi import FastAPI
from app.core.observability import (
    init_observability,
    shutdown_observability,
    health_router,
    register_module_health_checks,
)

# `settings` is assumed to be the application's settings object, imported elsewhere.

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup
    init_observability(
        enable_metrics=True,
        sentry_dsn=settings.SENTRY_DSN,
        environment=settings.ENVIRONMENT,
        flower_url=settings.FLOWER_URL,
        grafana_url=settings.GRAFANA_URL,
    )
    register_module_health_checks()
    yield
    # Shutdown
    shutdown_observability()

app = FastAPI(lifespan=lifespan)
app.include_router(health_router)
```

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `SENTRY_DSN` | Sentry DSN for error tracking | None (disabled) |
| `ENVIRONMENT` | Environment name | "development" |
| `ENABLE_METRICS` | Enable Prometheus metrics | False |
| `FLOWER_URL` | Flower dashboard URL | None |
| `GRAFANA_URL` | Grafana dashboard URL | None |

## Kubernetes Integration

### Deployment Configuration

```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: app
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10
```

### Prometheus ServiceMonitor

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
spec:
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
```

## Best Practices

### Do

- Register health checks for critical dependencies (database, cache, external APIs)
- Use appropriate metric types (counter for counts, histogram for latency)
- Include meaningful labels but avoid high cardinality
- Set up alerts based on health status changes

### Don't

- Create health checks that are slow or have side effects
- Add high-cardinality labels to metrics (e.g., user IDs)
- Ignore Sentry errors in production
- Skip readiness probes in Kubernetes deployments

## Related Documentation

- [Module System](module-system.md) - Module health check integration
- [Background Tasks](background-tasks.md) - Celery monitoring with Flower
- [Deployment](../deployment/index.md) - Production deployment with monitoring