# Observability Framework

The Orion platform includes an observability framework for monitoring application health, collecting metrics, and tracking errors. It belongs to the Framework Layer: infrastructure that modules depend on.

## Overview

```
┌─────────────────────────────────────────────────────────────────────────┐
│                     OBSERVABILITY FRAMEWORK                              │
│                    app/core/observability.py                            │
│                                                                          │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐        │
│  │ Health Checks   │  │ Prometheus      │  │ Sentry          │        │
│  │ Registry        │  │ Metrics         │  │ Integration     │        │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘        │
│           │                    │                    │                   │
│           ▼                    ▼                    ▼                   │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                      API Endpoints                               │   │
│  │  /health  │  /health/live  │  /health/ready  │  /metrics        │   │
│  └─────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
                    ┌───────────────────────────────┐
                    │      External Tools           │
                    │  Flower │ Grafana │ Prometheus│
                    └───────────────────────────────┘
```

## Health Checks

### Health Check Registry

Components register health checks with a shared registry, and the `/health` endpoint aggregates their results.

```python
from app.core.observability import (
    health_registry,
    HealthCheckResult,
    HealthStatus,
)

# Register using decorator
@health_registry.register("database")
def check_database() -> HealthCheckResult:
    try:
        # `db` stands for the application's database session, defined elsewhere
        db.execute("SELECT 1")
        return HealthCheckResult(
            name="database",
            status=HealthStatus.HEALTHY,
            message="Database connection OK"
        )
    except Exception as e:
        return HealthCheckResult(
            name="database",
            status=HealthStatus.UNHEALTHY,
            message=str(e)
        )

# Or register directly
health_registry.register_check("redis", check_redis_connection)
```

### Health Status Levels

| Status | Description | HTTP Response |
|--------|-------------|---------------|
| `HEALTHY` | All systems operational | 200 |
| `DEGRADED` | Partial functionality available | 200 |
| `UNHEALTHY` | Critical failure | 200 (check response body) |

### HealthCheckResult Fields

| Field | Type | Description |
|-------|------|-------------|
| `name` | str | Check identifier |
| `status` | HealthStatus | Health status level |
| `message` | str | Optional status message |
| `latency_ms` | float | Check execution time |
| `details` | dict | Additional diagnostic data |
| `checked_at` | datetime | Timestamp of check |
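
The fields above could be modelled roughly as follows. This is a minimal sketch, not the actual definition in `app/core/observability.py`: the default values, the string values of the enum, and the `run_timed` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from time import perf_counter
from typing import Any, Callable


class HealthStatus(str, Enum):
    # String values are assumed from the JSON responses shown in this document.
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class HealthCheckResult:
    name: str
    status: HealthStatus
    message: str = ""  # assumed default
    latency_ms: float = 0.0  # assumed default
    details: dict[str, Any] = field(default_factory=dict)
    checked_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def run_timed(name: str, check: Callable[[], HealthCheckResult]) -> HealthCheckResult:
    """Hypothetical helper: run a check, record latency, map exceptions to UNHEALTHY."""
    started = perf_counter()
    try:
        result = check()
    except Exception as exc:
        result = HealthCheckResult(name=name, status=HealthStatus.UNHEALTHY, message=str(exc))
    result.latency_ms = (perf_counter() - started) * 1000
    return result
```

Measuring latency in one place like this keeps individual checks free of timing code.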

## API Endpoints

### GET /health

Aggregated health check endpoint. Returns combined status from all registered checks.

**Response:**
```json
{
    "status": "healthy",
    "timestamp": "2026-01-27T10:30:00Z",
    "checks": [
        {
            "name": "database",
            "status": "healthy",
            "message": "Connection OK",
            "latency_ms": 2.5,
            "details": {}
        },
        {
            "name": "module:billing",
            "status": "healthy",
            "message": "",
            "latency_ms": 0.1,
            "details": {}
        }
    ]
}
```

**Overall Status Logic:**
- If any check is `UNHEALTHY` → overall is `UNHEALTHY`
- If any check is `DEGRADED` and none `UNHEALTHY` → overall is `DEGRADED`
- Otherwise → `HEALTHY`
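
The three rules above amount to a "worst status wins" reduction. A minimal sketch, assuming a `HealthStatus` enum with the string values shown in the JSON response (the function name `overall_status` is hypothetical):

```python
from enum import Enum
from typing import Iterable


class HealthStatus(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


def overall_status(statuses: Iterable[HealthStatus]) -> HealthStatus:
    """Reduce individual check statuses to one overall status."""
    seen = set(statuses)
    if HealthStatus.UNHEALTHY in seen:
        return HealthStatus.UNHEALTHY
    if HealthStatus.DEGRADED in seen:
        return HealthStatus.DEGRADED
    # Also covers the case where no checks are registered.
    return HealthStatus.HEALTHY
```

Note that with this reading of the rules, an empty registry reports `HEALTHY`.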

### GET /health/live

Kubernetes liveness probe. Returns 200 if the application is running.

**Response:**
```json
{"status": "alive"}
```

### GET /health/ready

Kubernetes readiness probe. Reports readiness based on the aggregated health status.

**Response:**
```json
{
    "status": "ready",
    "health": "healthy"
}
```

If the aggregated status is unhealthy:
```json
{
    "status": "not_ready",
    "health": "unhealthy"
}
```
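
One way to derive the two readiness payloads from the aggregated status is sketched below. Two points are assumptions, not documented behavior: that `degraded` still counts as ready, and that a not-ready response uses HTTP 503. The status code matters because a Kubernetes `httpGet` probe treats any code from 200 to 399 as success, so the body alone cannot mark a pod as not ready.

```python
def readiness_response(overall: str) -> tuple[int, dict[str, str]]:
    """Hypothetical helper: map overall health to (HTTP status code, JSON body)."""
    # Assumption: only "unhealthy" blocks readiness; "degraded" keeps serving traffic.
    ready = overall != "unhealthy"
    # Assumption: 503 signals not-ready so that an httpGet probe fails.
    status_code = 200 if ready else 503
    body = {"status": "ready" if ready else "not_ready", "health": overall}
    return status_code, body
```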

### GET /metrics

Prometheus metrics endpoint. Returns metrics in the Prometheus text exposition format.

**Response:**
```
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/products",status="200"} 1234
...
```

### GET /health/tools

Returns URLs to external monitoring tools.

**Response:**
```json
{
    "flower": "http://flower.example.com:5555",
    "grafana": "http://grafana.example.com:3000",
    "prometheus": null
}
```

## Prometheus Metrics

### MetricsRegistry

The metrics registry wraps `prometheus_client` and falls back to no-op metrics when the library isn't installed.

```python
from app.core.observability import metrics_registry

# Counter - tracks cumulative values
request_counter = metrics_registry.counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"]
)
request_counter.labels(method="GET", endpoint="/api/products", status="200").inc()

# Histogram - tracks distributions (latency, sizes)
request_latency = metrics_registry.histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)
request_latency.labels(endpoint="/api/products").observe(0.045)

# Gauge - tracks current values
active_connections = metrics_registry.gauge(
    "active_connections",
    "Number of active connections",
    ["pool"]
)
active_connections.labels(pool="database").set(42)
```

### Enabling Metrics

Metrics are disabled by default. Enable them during initialization:

```python
from app.core.observability import init_observability

init_observability(
    enable_metrics=True,
    # ... other options
)
```

### Dummy Metrics

When `prometheus_client` isn't installed or metrics are disabled, the registry returns dummy metrics that silently ignore all operations. This lets code record metrics without first checking whether they're enabled.
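
The fallback is a form of the null object pattern. The sketch below shows the idea under stated assumptions: `DummyMetric` and `make_counter` are hypothetical names, and the real registry may be structured differently.

```python
class DummyMetric:
    """No-op stand-in that accepts the same calls as a prometheus_client metric."""

    def labels(self, *args, **kwargs) -> "DummyMetric":
        # Return self so chained calls like metric.labels(...).inc() keep working.
        return self

    def inc(self, amount: float = 1) -> None:
        pass

    def dec(self, amount: float = 1) -> None:
        pass

    def set(self, value: float) -> None:
        pass

    def observe(self, amount: float) -> None:
        pass


def make_counter(name: str, documentation: str, labelnames=(), enabled: bool = True):
    """Hypothetical factory: a real Counter when possible, otherwise a DummyMetric."""
    if not enabled:
        return DummyMetric()
    try:
        from prometheus_client import Counter
    except ImportError:
        return DummyMetric()
    return Counter(name, documentation, labelnames)
```

Because `labels()` returns the dummy itself, call sites look identical whether or not metrics are active.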

## Sentry Integration

### Configuration

```python
from app.core.observability import sentry, init_observability

# Initialize via init_observability
init_observability(
    sentry_dsn="https://key@sentry.io/project",
    environment="production",
)

# Or initialize directly
sentry.init(
    dsn="https://key@sentry.io/project",
    environment="production",
    release="1.0.0"
)
```

### Capturing Errors

```python
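import logging

from app.core.observability import sentry

logger = logging.getLogger(__name__)

try:
    risky_operation()
except Exception as e:
    # capture_exception returns the Sentry event ID, or None if Sentry is disabled
    event_id = sentry.capture_exception(e)
    logger.error("Operation failed, Sentry event: %s", event_id)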

# Capture messages
sentry.capture_message("User reached rate limit", level="warning")
```

### Without Sentry

If `sentry_sdk` isn't installed or no DSN is provided, all capture calls silently return `None`.
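
A wrapper with this behavior could look roughly like the sketch below. The class name `SentryClient` and its internals are assumptions; only `sentry_sdk.init`, `sentry_sdk.capture_exception`, and `sentry_sdk.capture_message` are real SDK calls.

```python
from typing import Optional

try:
    import sentry_sdk
except ImportError:  # the SDK is optional
    sentry_sdk = None


class SentryClient:
    """Hypothetical wrapper that degrades to a no-op when Sentry is unavailable."""

    def __init__(self) -> None:
        self._enabled = False

    def init(self, dsn: Optional[str] = None, **options) -> None:
        # Stay disabled if the SDK is missing or no DSN was given.
        if sentry_sdk is None or not dsn:
            return
        sentry_sdk.init(dsn=dsn, **options)
        self._enabled = True

    def capture_exception(self, exc: BaseException) -> Optional[str]:
        if not self._enabled:
            return None
        return sentry_sdk.capture_exception(exc)

    def capture_message(self, message: str, level: str = "info") -> Optional[str]:
        if not self._enabled:
            return None
        return sentry_sdk.capture_message(message, level=level)
```

Callers can then log the returned event ID when present without branching on whether Sentry is configured.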

## Module Health Checks

Modules can provide health checks that are automatically registered.

### Defining a Module Health Check

```python
# In module definition
import stripe

from app.modules.base import ModuleDefinition

def check_billing_health() -> dict:
    """Check billing service health."""
    try:
        # Check Stripe connection
        stripe.Account.retrieve()
        return {"status": "healthy", "message": "Stripe connected"}
    except Exception as e:
        return {"status": "unhealthy", "message": str(e)}

billing_module = ModuleDefinition(
    code="billing",
    name="Billing",
    health_check=check_billing_health,
    # ...
)
```

### Registering Module Health Checks

```python
from app.core.observability import register_module_health_checks

# Call after modules are loaded (e.g., in app lifespan)
register_module_health_checks()
```

This registers each module's health check under the name `module:{code}` (e.g., `module:billing`).
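
Since module checks return plain dicts, registration needs some adapter that adds the prefixed name. The following is a minimal sketch of that step; `adapt_module_check` is a hypothetical helper, and treating a missing `status` key or a raised exception as unhealthy is an assumption.

```python
from typing import Callable


def module_check_name(code: str) -> str:
    """Build the registry name for a module's health check."""
    return f"module:{code}"


def adapt_module_check(code: str, check: Callable[[], dict]) -> Callable[[], dict]:
    """Wrap a module's dict-returning check so its result carries the prefixed name."""
    name = module_check_name(code)

    def wrapped() -> dict:
        try:
            raw = check()
        except Exception as exc:
            # Assumption: a check that raises is reported as unhealthy.
            raw = {"status": "unhealthy", "message": str(exc)}
        return {
            "name": name,
            # Assumption: a result without a status is treated as unhealthy.
            "status": raw.get("status", "unhealthy"),
            "message": raw.get("message", ""),
        }

    return wrapped
```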

## External Tools

### Flower (Celery Monitoring)

Configure the Flower URL to include it in the `/health/tools` response:

```python
init_observability(
    flower_url="http://flower:5555",
)
```

### Grafana

Configure the Grafana URL to include it in the `/health/tools` response:

```python
init_observability(
    grafana_url="http://grafana:3000",
)
```

## Initialization

### Application Lifespan

```python
# main.py
from contextlib import asynccontextmanager
from fastapi import FastAPI
from app.core.observability import (
    init_observability,
    shutdown_observability,
    health_router,
    register_module_health_checks,
)

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup (`settings` is the application's configuration object, imported elsewhere)
    init_observability(
        enable_metrics=True,
        sentry_dsn=settings.SENTRY_DSN,
        environment=settings.ENVIRONMENT,
        flower_url=settings.FLOWER_URL,
        grafana_url=settings.GRAFANA_URL,
    )
    register_module_health_checks()

    yield

    # Shutdown
    shutdown_observability()

app = FastAPI(lifespan=lifespan)
app.include_router(health_router)
```

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `SENTRY_DSN` | Sentry DSN for error tracking | None (disabled) |
| `ENVIRONMENT` | Environment name | "development" |
| `ENABLE_METRICS` | Enable Prometheus metrics | False |
| `FLOWER_URL` | Flower dashboard URL | None |
| `GRAFANA_URL` | Grafana dashboard URL | None |
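
As an illustration of how these variables and defaults might be read, here is a small sketch using only the standard library. The `ObservabilitySettings` class, the `load_settings` function, and the set of accepted truthy strings are assumptions; the platform's actual settings object may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ObservabilitySettings:
    sentry_dsn: Optional[str]
    environment: str
    enable_metrics: bool
    flower_url: Optional[str]
    grafana_url: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> ObservabilitySettings:
    """Read the variables from the table above, applying the documented defaults."""
    truthy = {"1", "true", "yes", "on"}  # assumed spellings for a boolean flag
    return ObservabilitySettings(
        sentry_dsn=env.get("SENTRY_DSN") or None,  # empty string counts as unset
        environment=env.get("ENVIRONMENT", "development"),
        enable_metrics=env.get("ENABLE_METRICS", "").strip().lower() in truthy,
        flower_url=env.get("FLOWER_URL") or None,
        grafana_url=env.get("GRAFANA_URL") or None,
    )
```

Passing the environment as a mapping keeps the loader easy to test without touching the real process environment.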

## Kubernetes Integration

### Deployment Configuration

```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: app
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10
```

### Prometheus ServiceMonitor

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
spec:
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
```

## Best Practices

### Do

- Register health checks for critical dependencies (database, cache, external APIs)
- Use appropriate metric types (counter for counts, histogram for latency)
- Include meaningful labels but avoid high cardinality
- Set up alerts based on health status changes

### Don't

- Create health checks that are slow or have side effects
- Add high-cardinality labels to metrics (e.g., user IDs)
- Ignore Sentry errors in production
- Skip readiness probes in Kubernetes deployments
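
To make the cardinality advice concrete: raw request paths often embed IDs, so using them directly as an `endpoint` label creates one time series per resource. A minimal sketch of one mitigation, collapsing numeric and UUID path segments into a placeholder (the helper name and the `{id}` placeholder are illustrative choices):

```python
import re

# Matches a path segment that is all digits or a canonical UUID.
_ID_SEGMENT = re.compile(
    r"/(?:\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})(?=/|$)"
)


def endpoint_label(path: str) -> str:
    """Replace ID-like path segments so the label set stays small."""
    return _ID_SEGMENT.sub("/{id}", path)
```

Where the web framework exposes the matched route template, using that as the label value is usually a simpler option than pattern matching.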

## Related Documentation

- [Module System](module-system.md) - Module health check integration
- [Background Tasks](background-tasks.md) - Celery monitoring with Flower
- [Deployment](../deployment/index.md) - Production deployment with monitoring