
# Observability Framework

The Orion platform includes an observability framework for monitoring application health, collecting metrics, and tracking errors. It is part of the Framework Layer: infrastructure that modules depend on.

## Production Stack

The full monitoring stack runs as Docker containers alongside the application:

| Container | Image | Port | Purpose |
|---|---|---|---|
| prometheus | `prom/prometheus` | 9090 (localhost) | Metrics storage, 15-day retention |
| grafana | `grafana/grafana` | 3001 (localhost) | Dashboards at `https://grafana.wizard.lu` |
| node-exporter | `prom/node-exporter` | 9100 (localhost) | Host CPU/RAM/disk metrics |
| cadvisor | `gcr.io/cadvisor/cadvisor` | 8080 (localhost) | Per-container resource metrics |

All monitoring containers run under `profiles: [full]` in `docker-compose.yml` with memory limits (256 + 192 + 64 + 128 = 640 MB total).

```
┌──────────────┐     scrape      ┌─────────────────┐
│  Prometheus  │◄────────────────│  Orion API       │ /metrics
│  :9090       │◄────────────────│  node-exporter   │ :9100
│              │◄────────────────│  cAdvisor        │ :8080
└──────┬───────┘                 └─────────────────┘
       │ query
┌──────▼───────┐
│   Grafana    │──── https://grafana.wizard.lu
│   :3001      │
└──────────────┘
```

Configuration files:

- `monitoring/prometheus.yml` — scrape targets (orion-api, node-exporter, cadvisor, self)
- `monitoring/grafana/provisioning/datasources/datasource.yml` — auto-provisions Prometheus
- `monitoring/grafana/provisioning/dashboards/dashboard.yml` — file-based dashboard provider

## Overview

```
┌─────────────────────────────────────────────────────────────────────────┐
│                     OBSERVABILITY FRAMEWORK                              │
│                    app/core/observability.py                            │
│                                                                          │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐        │
│  │ Health Checks   │  │ Prometheus      │  │ Sentry          │        │
│  │ Registry        │  │ Metrics         │  │ Integration     │        │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘        │
│           │                    │                    │                   │
│           ▼                    ▼                    ▼                   │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                      API Endpoints                               │   │
│  │  /health  │  /health/live  │  /health/ready  │  /metrics        │   │
│  └─────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
                    ┌───────────────────────────────┐
                    │      External Tools           │
                    │  Flower │ Grafana │ Prometheus│
                    └───────────────────────────────┘
```

## Health Checks

### Health Check Registry

Components register health checks that are aggregated into a single endpoint.

```python
from app.core.observability import (
    health_registry,
    HealthCheckResult,
    HealthStatus,
)

# Register using decorator
@health_registry.register("database")
def check_database() -> HealthCheckResult:
    try:
        # `db` stands for the application's database session (not shown here)
        db.execute("SELECT 1")
        return HealthCheckResult(
            name="database",
            status=HealthStatus.HEALTHY,
            message="Database connection OK"
        )
    except Exception as e:
        return HealthCheckResult(
            name="database",
            status=HealthStatus.UNHEALTHY,
            message=str(e)
        )

# Or register directly
health_registry.register_check("redis", check_redis_connection)
```
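
To make the two registration styles concrete, here is a minimal sketch of how a registry could support both the decorator and the direct call. `SketchHealthRegistry`, `names`, and `run_all` are hypothetical names used for illustration only; the real `health_registry` in `app/core/observability.py` may be structured differently.

```python
from typing import Any, Callable, Dict, List


class SketchHealthRegistry:
    """Illustrative registry: maps a check name to a zero-argument callable."""

    def __init__(self) -> None:
        self._checks: Dict[str, Callable[[], Any]] = {}

    def register_check(self, name: str, check: Callable[[], Any]) -> None:
        # Direct registration: store the callable under its name.
        self._checks[name] = check

    def register(self, name: str) -> Callable[[Callable[[], Any]], Callable[[], Any]]:
        # Decorator registration: delegate to register_check and return
        # the function unchanged so it can still be called directly.
        def decorator(check: Callable[[], Any]) -> Callable[[], Any]:
            self.register_check(name, check)
            return check

        return decorator

    def names(self) -> List[str]:
        return list(self._checks)

    def run_all(self) -> Dict[str, Any]:
        # Run every registered check and collect results by name.
        return {name: check() for name, check in self._checks.items()}
```

The decorator form is a thin wrapper over the direct form, so both paths end up in the same dictionary and are aggregated identically.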

### Health Status Levels

| Status | Description | HTTP Response |
|--------|-------------|---------------|
| `HEALTHY` | All systems operational | 200 |
| `DEGRADED` | Partial functionality available | 200 |
| `UNHEALTHY` | Critical failure | 200 (check response body) |
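
Because an unhealthy result still comes back with HTTP 200, a caller has to inspect the body rather than the status code. The sketch below assumes the body carries a top-level `status` field with lowercase values, as in the `GET /health` response documented under API Endpoints; `is_serving` is a hypothetical helper, not part of the framework.

```python
import json


def is_serving(body: str) -> bool:
    """Return True only if the payload reports a healthy or degraded status."""
    payload = json.loads(body)
    # A missing or unknown status is treated as a failure (an assumption
    # of this sketch, chosen so that malformed responses do not pass).
    return payload.get("status") in {"healthy", "degraded"}
```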

### HealthCheckResult Fields

| Field | Type | Description |
|-------|------|-------------|
| `name` | str | Check identifier |
| `status` | HealthStatus | Health status level |
| `message` | str | Optional status message |
| `latency_ms` | float | Check execution time |
| `details` | dict | Additional diagnostic data |
| `checked_at` | datetime | Timestamp of check |
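
As an illustration of how these fields fit together, the sketch below models the result as a dataclass and fills `latency_ms` by timing the check. The default values and the `run_timed` helper are assumptions made for this example; the actual class in `app/core/observability.py` may differ.

```python
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Callable, Dict


class HealthStatus(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class HealthCheckResult:
    name: str
    status: HealthStatus
    message: str = ""  # assumed default
    latency_ms: float = 0.0  # assumed default
    details: Dict[str, Any] = field(default_factory=dict)
    checked_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def run_timed(name: str, check: Callable[[], HealthCheckResult]) -> HealthCheckResult:
    """Run a check, record its latency, and turn a crash into UNHEALTHY."""
    start = time.perf_counter()
    try:
        result = check()
    except Exception as exc:
        result = HealthCheckResult(name=name, status=HealthStatus.UNHEALTHY, message=str(exc))
    result.latency_ms = (time.perf_counter() - start) * 1000
    return result
```

Timing in one place keeps individual checks free of boilerplate and guarantees that a check which raises still produces a well-formed result.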

## API Endpoints

### GET /health

Aggregated health check endpoint. Returns combined status from all registered checks.

**Response:**
```json
{
    "status": "healthy",
    "timestamp": "2026-01-27T10:30:00Z",
    "checks": [
        {
            "name": "database",
            "status": "healthy",
            "message": "Connection OK",
            "latency_ms": 2.5,
            "details": {}
        },
        {
            "name": "module:billing",
            "status": "healthy",
            "message": "",
            "latency_ms": 0.1,
            "details": {}
        }
    ]
}
```

**Overall Status Logic:**
- If any check is `UNHEALTHY` → overall is `UNHEALTHY`
- If any check is `DEGRADED` and none `UNHEALTHY` → overall is `DEGRADED`
- Otherwise → `HEALTHY`
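
The three rules above can be sketched as a small function. The `HealthStatus` enum is redefined here only to keep the example self-contained, and `aggregate_status` is a hypothetical name; treating an empty list of checks as `HEALTHY` follows from the "Otherwise" rule.

```python
from enum import Enum
from typing import Iterable


class HealthStatus(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


def aggregate_status(statuses: Iterable[HealthStatus]) -> HealthStatus:
    """Reduce individual check statuses to one overall status (worst wins)."""
    seen = set(statuses)
    if HealthStatus.UNHEALTHY in seen:
        return HealthStatus.UNHEALTHY
    if HealthStatus.DEGRADED in seen:
        return HealthStatus.DEGRADED
    return HealthStatus.HEALTHY
```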

### GET /health/live

Kubernetes liveness probe. Returns 200 if the application is running.

**Response:**
```json
{"status": "alive"}
```

### GET /health/ready

Kubernetes readiness probe. Returns ready status based on health checks.

**Response:**
```json
{
    "status": "ready",
    "health": "healthy"
}
```

Or, if unhealthy:
```json
{
    "status": "not_ready",
    "health": "unhealthy"
}
```

### GET /metrics

Prometheus metrics endpoint. Returns metrics in Prometheus text format.

**Response:**
```
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/products",status="200"} 1234
...
```

### GET /health/tools

Returns URLs to external monitoring tools.

**Response:**
```json
{
    "flower": "http://flower.example.com:5555",
    "grafana": "http://grafana.example.com:3000",
    "prometheus": null
}
```

## Prometheus Metrics

### MetricsRegistry

The metrics registry wraps `prometheus_client` and falls back to no-op metrics when the library isn't installed.

```python
from app.core.observability import metrics_registry

# Counter - tracks cumulative values
request_counter = metrics_registry.counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"]
)
request_counter.labels(method="GET", endpoint="/api/products", status="200").inc()

# Histogram - tracks distributions (latency, sizes)
request_latency = metrics_registry.histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)
request_latency.labels(endpoint="/api/products").observe(0.045)

# Gauge - tracks current values
active_connections = metrics_registry.gauge(
    "active_connections",
    "Number of active connections",
    ["pool"]
)
active_connections.labels(pool="database").set(42)
```

### Enabling Metrics

Metrics are disabled by default. Enable during initialization:

```python
from app.core.observability import init_observability

init_observability(
    enable_metrics=True,
    # ... other options
)
```

### Dummy Metrics

When `prometheus_client` isn't installed or metrics are disabled, the registry returns dummy metrics that silently ignore all operations. This allows code to use metrics without checking if they're enabled.
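
A no-op metric of this kind can be very small. The sketch below shows one possible shape; `DummyMetric` is a hypothetical class name, and the real fallback in `app/core/observability.py` may expose a different set of methods.

```python
class DummyMetric:
    """No-op stand-in that accepts the same calls as a real metric."""

    def labels(self, *args: object, **kwargs: object) -> "DummyMetric":
        # Return self so chained calls like metric.labels(...).inc() work.
        return self

    def inc(self, amount: float = 1) -> None:
        pass

    def dec(self, amount: float = 1) -> None:
        pass

    def set(self, value: float) -> None:
        pass

    def observe(self, value: float) -> None:
        pass
```

Returning `self` from `labels` is what lets call sites stay identical whether metrics are enabled or not.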

## Sentry Integration

### Configuration

```python
from app.core.observability import sentry, init_observability

# Initialize via init_observability
init_observability(
    sentry_dsn="https://key@sentry.io/project",
    environment="production",
)

# Or initialize directly
sentry.init(
    dsn="https://key@sentry.io/project",
    environment="production",
    release="1.0.0"
)
```

### Capturing Errors

```python
import logging

from app.core.observability import sentry

logger = logging.getLogger(__name__)

try:
    risky_operation()  # placeholder for any call that may raise
except Exception as e:
    event_id = sentry.capture_exception(e)
    logger.error("Operation failed, Sentry event: %s", event_id)

# Capture messages
sentry.capture_message("User reached rate limit", level="warning")
```

### Without Sentry

If `sentry_sdk` isn't installed or no DSN is provided, all capture calls silently return `None`.
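
One way to get this behaviour is an optional import combined with a guard in front of each capture call. The sketch below is an assumption about how such a guard could look; `capture_exception_safe` and its `dsn_configured` parameter are hypothetical, while `sentry_sdk.capture_exception` is the SDK's own function.

```python
from typing import Optional

try:
    import sentry_sdk
except ImportError:  # the SDK is optional
    sentry_sdk = None


def capture_exception_safe(error: BaseException, dsn_configured: bool) -> Optional[str]:
    """Forward the error to Sentry if possible, otherwise return None."""
    if sentry_sdk is None or not dsn_configured:
        return None
    # Returns the Sentry event ID, or None if the event was dropped.
    return sentry_sdk.capture_exception(error)
```

Callers should therefore treat the returned event ID as optional, for example when including it in a log line.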

## Module Health Checks

Modules can provide health checks that are automatically registered.

### Defining Module Health Check

```python
# In module definition
import stripe

from app.modules.base import ModuleDefinition

def check_billing_health() -> dict:
    """Check billing service health."""
    try:
        # Check Stripe connection
        stripe.Account.retrieve()
        return {"status": "healthy", "message": "Stripe connected"}
    except Exception as e:
        return {"status": "unhealthy", "message": str(e)}

billing_module = ModuleDefinition(
    code="billing",
    name="Billing",
    health_check=check_billing_health,
    # ...
)
```

### Registering Module Health Checks

```python
from app.core.observability import register_module_health_checks

# Call after modules are loaded (e.g., in app lifespan)
register_module_health_checks()
```

This registers health checks as `module:{code}` (e.g., `module:billing`).
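
The registration step can be pictured as wrapping each module's dict-returning function and prefixing its name. The following sketch is illustrative only: `adapt_module_check` is a hypothetical helper, and treating a missing `status` key or a raised exception as unhealthy is an assumption of this example rather than documented behaviour.

```python
from typing import Any, Callable, Dict


def adapt_module_check(code: str, check: Callable[[], Dict[str, Any]]) -> Callable[[], Dict[str, Any]]:
    """Wrap a module health check so its result is named `module:{code}`."""
    name = f"module:{code}"

    def wrapped() -> Dict[str, Any]:
        try:
            raw = check()
        except Exception as exc:
            # A crashing check is reported rather than propagated.
            raw = {"status": "unhealthy", "message": str(exc)}
        return {
            "name": name,
            "status": raw.get("status", "unhealthy"),
            "message": raw.get("message", ""),
        }

    return wrapped
```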

## External Tools

### Flower (Celery Monitoring)

Configure the Flower URL to expose it in `/health/tools`:

```python
init_observability(
    flower_url="http://flower:5555",
)
```

### Grafana

Configure the Grafana URL:

```python
init_observability(
    grafana_url="http://grafana:3000",
)
```

## Initialization

### Application Lifespan

Observability is initialized in `app/core/lifespan.py` and the health router is mounted in `main.py`:

```python
# app/core/lifespan.py
from contextlib import asynccontextmanager

from fastapi import FastAPI

from app.core.config import settings
from app.core.observability import init_observability, shutdown_observability

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup
    init_observability(
        enable_metrics=settings.enable_metrics,
        sentry_dsn=settings.sentry_dsn,
        environment=settings.sentry_environment,
        flower_url=settings.flower_url,
        grafana_url=settings.grafana_url,
    )
    yield
    # Shutdown
    shutdown_observability()
```

```python
# main.py
from app.core.observability import health_router
app.include_router(health_router)  # /metrics, /health/live, /health/ready, /health/tools
```

Note: `/health` is defined separately in `main.py` with a richer response (DB check, feature list, docs links). The `health_router` provides the Kubernetes-style probes and Prometheus endpoint.

### Environment Variables

| Variable | Config field | Description | Default |
|----------|-------------|-------------|---------|
| `ENABLE_METRICS` | `enable_metrics` | Enable Prometheus metrics collection | `False` |
| `GRAFANA_URL` | `grafana_url` | Grafana dashboard URL | `https://grafana.wizard.lu` |
| `GRAFANA_ADMIN_USER` | — | Grafana admin username (docker-compose only) | `admin` |
| `GRAFANA_ADMIN_PASSWORD` | — | Grafana admin password (docker-compose only) | `changeme` |
| `SENTRY_DSN` | `sentry_dsn` | Sentry DSN for error tracking | `None` (disabled) |
| `SENTRY_ENVIRONMENT` | `sentry_environment` | Environment name for Sentry | `development` |
| `FLOWER_URL` | `flower_url` | Flower dashboard URL | `http://localhost:5555` |
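
To show how the application-level variables and defaults in this table map onto config fields, here is a standard-library sketch. `ObservabilitySettings` and `load_settings` are hypothetical names, the set of accepted truthy strings is an assumption, and the project's real `app.core.config.settings` object may be built differently. The two `GRAFANA_ADMIN_*` variables are left out because they are consumed by docker-compose only.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

_TRUTHY = {"1", "true", "yes", "on"}  # assumed accepted spellings


@dataclass(frozen=True)
class ObservabilitySettings:
    enable_metrics: bool
    grafana_url: str
    sentry_dsn: Optional[str]
    sentry_environment: str
    flower_url: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> ObservabilitySettings:
    """Read observability settings from the environment, applying the documented defaults."""
    env = os.environ if env is None else env
    return ObservabilitySettings(
        enable_metrics=env.get("ENABLE_METRICS", "false").strip().lower() in _TRUTHY,
        grafana_url=env.get("GRAFANA_URL", "https://grafana.wizard.lu"),
        sentry_dsn=env.get("SENTRY_DSN") or None,  # empty string disables Sentry
        sentry_environment=env.get("SENTRY_ENVIRONMENT", "development"),
        flower_url=env.get("FLOWER_URL", "http://localhost:5555"),
    )
```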

## Kubernetes Integration

### Deployment Configuration

```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: app
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10
```

### Prometheus ServiceMonitor

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
spec:
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
```

## Best Practices

### Do

- Register health checks for critical dependencies (database, cache, external APIs)
- Use appropriate metric types (counter for counts, histogram for latency)
- Include meaningful labels but avoid high cardinality
- Set up alerts based on health status changes

### Don't

- Create health checks that are slow or have side effects
- Add high-cardinality labels to metrics (e.g., user IDs)
- Ignore Sentry errors in production
- Skip readiness probes in Kubernetes deployments
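
One common way to keep label cardinality low is to collapse identifiers in request paths before using them as an `endpoint` label. The sketch below is a hypothetical helper (`normalize_endpoint` is not part of the framework) that replaces numeric and UUID path segments with a fixed placeholder.

```python
import re

# Matches a path segment that is entirely digits or a canonical UUID.
_ID_SEGMENT = re.compile(
    r"/(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})(?=/|$)"
)


def normalize_endpoint(path: str) -> str:
    """Replace ID-like path segments so the label has a bounded set of values."""
    return _ID_SEGMENT.sub("/{id}", path)
```

With this in place, `/api/products/1` and `/api/products/2` share one time series instead of creating a new series per product. Where the web framework exposes the matched route template, using that template as the label is usually the more robust option.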

## Related Documentation

- [Hetzner Server Setup — Step 18](../deployment/hetzner-server-setup.md#step-18-monitoring-observability) - Production monitoring deployment
- [Module System](module-system.md) - Module health check integration
- [Background Tasks](background-tasks.md) - Celery monitoring with Flower
- [Deployment](../deployment/index.md) - Production deployment with monitoring