sboulahtit/orion

Samir Boulahtit 677e5211f9

docs: update observability and deployment docs to match production stack

Update observability.md with production container table, actual init code,
and correct env var names. Update docker.md with full 10-service table and
backup/monitoring cross-references. Add explicit AAAA records to DNS tables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-15 16:44:05 +01:00

Observability Framework

The Orion platform includes an observability framework for monitoring application health, collecting metrics, and tracking errors. It is part of the Framework Layer: infrastructure that modules depend on.

Production Stack

The full monitoring stack runs as Docker containers alongside the application:

Container	Image	Port	Purpose
prometheus	`prom/prometheus`	9090 (localhost)	Metrics storage, 15-day retention
grafana	`grafana/grafana`	3001 (localhost)	Dashboards at `https://grafana.wizard.lu`
node-exporter	`prom/node-exporter`	9100 (localhost)	Host CPU/RAM/disk metrics
cadvisor	`gcr.io/cadvisor/cadvisor`	8080 (localhost)	Per-container resource metrics

All monitoring containers run under profiles: [full] in docker-compose.yml with memory limits (256 + 192 + 64 + 128 = 640 MB total).

┌──────────────┐     scrape      ┌─────────────────┐
│  Prometheus  │◄────────────────│  Orion API       │ /metrics
│  :9090       │◄────────────────│  node-exporter   │ :9100
│              │◄────────────────│  cAdvisor        │ :8080
└──────┬───────┘                 └─────────────────┘
       │ query
┌──────▼───────┐
│   Grafana    │──── https://grafana.wizard.lu
│   :3001      │
└──────────────┘

Configuration files:

monitoring/prometheus.yml — scrape targets (orion-api, node-exporter, cadvisor, self)
monitoring/grafana/provisioning/datasources/datasource.yml — auto-provisions Prometheus
monitoring/grafana/provisioning/dashboards/dashboard.yml — file-based dashboard provider

Overview

┌─────────────────────────────────────────────────────────────────────────┐
│                     OBSERVABILITY FRAMEWORK                              │
│                    app/core/observability.py                            │
│                                                                          │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐        │
│  │ Health Checks   │  │ Prometheus      │  │ Sentry          │        │
│  │ Registry        │  │ Metrics         │  │ Integration     │        │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘        │
│           │                    │                    │                   │
│           ▼                    ▼                    ▼                   │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                      API Endpoints                               │   │
│  │  /health  │  /health/live  │  /health/ready  │  /metrics        │   │
│  └─────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
                    ┌───────────────────────────────┐
                    │      External Tools           │
                    │  Flower │ Grafana │ Prometheus│
                    └───────────────────────────────┘

Health Checks

Health Check Registry

Components register health checks that are aggregated into a single endpoint.

from app.core.observability import (
    health_registry,
    HealthCheckResult,
    HealthStatus,
)

# Register using decorator (db stands for an existing database session or connection)
@health_registry.register("database")
def check_database() -> HealthCheckResult:
    try:
        db.execute("SELECT 1")
        return HealthCheckResult(
            name="database",
            status=HealthStatus.HEALTHY,
            message="Database connection OK"
        )
    except Exception as e:
        return HealthCheckResult(
            name="database",
            status=HealthStatus.UNHEALTHY,
            message=str(e)
        )

# Or register directly
health_registry.register_check("redis", check_redis_connection)
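To make the two registration styles concrete, here is a minimal, self-contained sketch of how such a registry could work. `SketchHealthRegistry` and its `run_all` method are hypothetical names used for illustration; the actual code in app/core/observability.py may be structured differently.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class HealthStatus(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class HealthCheckResult:
    name: str
    status: HealthStatus
    message: str = ""


CheckFn = Callable[[], HealthCheckResult]


class SketchHealthRegistry:
    """Hypothetical stand-in for health_registry; not the Orion implementation."""

    def __init__(self) -> None:
        self._checks: Dict[str, CheckFn] = {}

    def register_check(self, name: str, fn: CheckFn) -> None:
        # Direct registration: store the callable under its name.
        self._checks[name] = fn

    def register(self, name: str) -> Callable[[CheckFn], CheckFn]:
        # Decorator form: register the function, then return it unchanged.
        def decorator(fn: CheckFn) -> CheckFn:
            self.register_check(name, fn)
            return fn

        return decorator

    def run_all(self) -> List[HealthCheckResult]:
        results: List[HealthCheckResult] = []
        for name, fn in self._checks.items():
            try:
                results.append(fn())
            except Exception as exc:
                # A crashing check is reported as unhealthy instead of raising.
                results.append(HealthCheckResult(name, HealthStatus.UNHEALTHY, str(exc)))
        return results


registry = SketchHealthRegistry()


@registry.register("database")
def check_database() -> HealthCheckResult:
    return HealthCheckResult("database", HealthStatus.HEALTHY, "Database connection OK")


def check_redis_connection() -> HealthCheckResult:
    raise ConnectionError("redis unreachable")


registry.register_check("redis", check_redis_connection)
results = registry.run_all()
```

Wrapping each call in `try`/`except` is a design choice of this sketch: one failing dependency then shows up as an unhealthy entry rather than breaking the whole aggregation.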

Health Status Levels

Status	Description	HTTP Response
`HEALTHY`	All systems operational	200
`DEGRADED`	Partial functionality available	200
`UNHEALTHY`	Critical failure	200 (check response body)

HealthCheckResult Fields

Field	Type	Description
`name`	str	Check identifier
`status`	HealthStatus	Health status level
`message`	str	Optional status message
`latency_ms`	float	Check execution time
`details`	dict	Additional diagnostic data
`checked_at`	datetime	Timestamp of check
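The field table above maps naturally onto a dataclass. The sketch below is an assumption about shape only: the default values and the `timed` helper are hypothetical and are not taken from the Orion source.

```python
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Callable, Dict


class HealthStatus(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class HealthCheckResult:
    name: str
    status: HealthStatus
    message: str = ""  # optional status message
    latency_ms: float = 0.0  # filled in by timed() below
    details: Dict[str, Any] = field(default_factory=dict)
    # Assumed default: timestamp in UTC at construction time.
    checked_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def timed(fn: Callable[[], HealthCheckResult]) -> HealthCheckResult:
    """Run a check and record how long it took, in milliseconds."""
    start = time.perf_counter()
    result = fn()
    result.latency_ms = (time.perf_counter() - start) * 1000.0
    return result


result = timed(lambda: HealthCheckResult("database", HealthStatus.HEALTHY, "Connection OK"))
```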

API Endpoints

GET /health

Aggregated health check endpoint. Returns combined status from all registered checks.

Response:

{
    "status": "healthy",
    "timestamp": "2026-01-27T10:30:00Z",
    "checks": [
        {
            "name": "database",
            "status": "healthy",
            "message": "Connection OK",
            "latency_ms": 2.5,
            "details": {}
        },
        {
            "name": "module:billing",
            "status": "healthy",
            "message": "",
            "latency_ms": 0.1,
            "details": {}
        }
    ]
}

Overall Status Logic:

If any check is UNHEALTHY → overall is UNHEALTHY
If any check is DEGRADED and none UNHEALTHY → overall is DEGRADED
Otherwise → HEALTHY
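The three rules above amount to "worst status wins". A minimal sketch, assuming the three status values shown in the table (the function name `overall_status` is hypothetical):

```python
from enum import Enum
from typing import Iterable


class HealthStatus(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


def overall_status(statuses: Iterable[HealthStatus]) -> HealthStatus:
    """Combine individual check statuses into one overall status."""
    seen = list(statuses)
    if any(s is HealthStatus.UNHEALTHY for s in seen):
        return HealthStatus.UNHEALTHY
    if any(s is HealthStatus.DEGRADED for s in seen):
        return HealthStatus.DEGRADED
    # No failing or degraded checks (including the case of no checks at all).
    return HealthStatus.HEALTHY


mixed = overall_status([HealthStatus.HEALTHY, HealthStatus.DEGRADED])
```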

GET /health/live

Kubernetes liveness probe. Returns 200 if the application is running.

Response:

{"status": "alive"}

GET /health/ready

Kubernetes readiness probe. Reports readiness based on the registered health checks.

Response:

{
    "status": "ready",
    "health": "healthy"
}

Or if unhealthy:

{
    "status": "not_ready",
    "health": "unhealthy"
}
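The two responses above can be derived from the overall health status. The sketch below is illustrative only: the document does not say how a degraded status is treated, so `degraded_is_ready` is a hypothetical switch, and `readiness_payload` is not a function from the Orion codebase.

```python
from typing import Dict


def readiness_payload(overall: str, degraded_is_ready: bool = True) -> Dict[str, str]:
    """Build the /health/ready response body from the overall health status."""
    # Assumption: "degraded" counts as ready unless the caller says otherwise.
    ready = overall == "healthy" or (overall == "degraded" and degraded_is_ready)
    return {"status": "ready" if ready else "not_ready", "health": overall}


ok = readiness_payload("healthy")
down = readiness_payload("unhealthy")
```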

GET /metrics

Prometheus metrics endpoint. Returns metrics in Prometheus text format.

Response:

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/products",status="200"} 1234
...
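For readers unfamiliar with the text format, the sketch below renders one counter by hand to show how the HELP line, TYPE line, and labelled samples fit together. In a real application this output would normally come from prometheus_client (for example its `generate_latest` function); `render_counter` is a hypothetical helper that only escapes label values.

```python
from typing import Dict, List, Tuple, Union

Number = Union[int, float]


def _escape(value: str) -> str:
    # Label values escape backslash, double quote, and newline.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(
    name: str,
    help_text: str,
    samples: List[Tuple[Dict[str, str], Number]],
) -> str:
    """Render a counter and its labelled samples in Prometheus text format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{key}="{_escape(val)}"' for key, val in labels.items())
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


sample = render_counter(
    "http_requests_total",
    "Total HTTP requests",
    [({"method": "GET", "endpoint": "/api/products", "status": "200"}, 1234)],
)
```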

GET /health/tools

Returns URLs to external monitoring tools.

Response:

{
    "flower": "http://flower.example.com:5555",
    "grafana": "http://grafana.example.com:3000",
    "prometheus": null
}

Prometheus Metrics

MetricsRegistry

The metrics registry wraps prometheus_client and falls back to no-op dummy metrics when the library isn't installed.

from app.core.observability import metrics_registry

# Counter - tracks cumulative values
request_counter = metrics_registry.counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"]
)
request_counter.labels(method="GET", endpoint="/api/products", status="200").inc()

# Histogram - tracks distributions (latency, sizes)
request_latency = metrics_registry.histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)
request_latency.labels(endpoint="/api/products").observe(0.045)

# Gauge - tracks current values
active_connections = metrics_registry.gauge(
    "active_connections",
    "Number of active connections",
    ["pool"]
)
active_connections.labels(pool="database").set(42)

Enabling Metrics

Metrics are disabled by default. Enable them during initialization:

from app.core.observability import init_observability

init_observability(
    enable_metrics=True,
    # ... other options
)

Dummy Metrics

When prometheus_client isn't installed or metrics are disabled, the registry returns dummy metrics that silently ignore all operations. This allows code to use metrics without checking if they're enabled.
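This is the null-object pattern. A minimal sketch of how it could look, assuming a registry that hands out either a real prometheus_client metric or a no-op stand-in; `DummyMetric` and `SketchMetricsRegistry` are hypothetical names, not the Orion classes.

```python
from typing import Any, Sequence


class DummyMetric:
    """Accepts the usual metric calls and does nothing."""

    def labels(self, *args: Any, **kwargs: Any) -> "DummyMetric":
        # Returning self keeps chained calls such as .labels(...).inc() working.
        return self

    def inc(self, amount: float = 1) -> None:
        pass

    def dec(self, amount: float = 1) -> None:
        pass

    def set(self, value: float) -> None:
        pass

    def observe(self, value: float) -> None:
        pass


class SketchMetricsRegistry:
    def __init__(self, enabled: bool) -> None:
        try:
            import prometheus_client  # optional dependency
        except ImportError:
            prometheus_client = None
        # Use the real library only when it is installed and metrics are enabled.
        self._prom = prometheus_client if enabled else None

    def counter(self, name: str, description: str, labelnames: Sequence[str] = ()) -> Any:
        if self._prom is None:
            return DummyMetric()
        return self._prom.Counter(name, description, list(labelnames))


registry = SketchMetricsRegistry(enabled=False)
request_counter = registry.counter("http_requests_total", "Total HTTP requests", ["method"])
request_counter.labels(method="GET").inc()  # silently ignored
```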

Sentry Integration

Configuration

from app.core.observability import sentry, init_observability

# Initialize via init_observability
init_observability(
    sentry_dsn="https://key@sentry.io/project",
    environment="production",
)

# Or initialize directly
sentry.init(
    dsn="https://key@sentry.io/project",
    environment="production",
    release="1.0.0"
)

Capturing Errors

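import logging

from app.core.observability import sentry

logger = logging.getLogger(__name__)

try:
    risky_operation()  # placeholder for application code that may raise
except Exception as e:
    # capture_exception returns the Sentry event ID (None if Sentry is disabled)
    event_id = sentry.capture_exception(e)
    logger.error("Operation failed, Sentry event: %s", event_id)

# Capture messages
sentry.capture_message("User reached rate limit", level="warning")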

Without Sentry

If sentry_sdk isn't installed or no DSN is provided, all capture calls silently return None.
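One way to get this behaviour is a thin wrapper that only forwards to sentry_sdk once it has been initialized. The sketch below is an assumption about the approach, not the Orion implementation; `SketchSentry` is a hypothetical class, while `sentry_sdk.init`, `sentry_sdk.capture_exception`, and `sentry_sdk.capture_message` are the SDK's own functions.

```python
from typing import Optional

try:
    import sentry_sdk  # optional dependency
except ImportError:
    sentry_sdk = None


class SketchSentry:
    def __init__(self) -> None:
        self._enabled = False

    def init(self, dsn: Optional[str], environment: str = "development") -> bool:
        # Stay disabled when the SDK is missing or no DSN was supplied.
        if sentry_sdk is None or not dsn:
            return False
        sentry_sdk.init(dsn=dsn, environment=environment)
        self._enabled = True
        return True

    def capture_exception(self, exc: BaseException) -> Optional[str]:
        if not self._enabled:
            return None
        return sentry_sdk.capture_exception(exc)

    def capture_message(self, message: str, level: str = "info") -> Optional[str]:
        if not self._enabled:
            return None
        return sentry_sdk.capture_message(message, level=level)


sentry = SketchSentry()
initialized = sentry.init(dsn=None)
event_id = sentry.capture_exception(ValueError("boom"))
```

Callers can then log `event_id` unconditionally; it is simply `None` when nothing was sent.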

Module Health Checks

Modules can provide health checks that are automatically registered.

Defining Module Health Check

# In module definition
import stripe

from app.modules.base import ModuleDefinition

def check_billing_health() -> dict:
    """Check billing service health."""
    try:
        # Check Stripe connection
        stripe.Account.retrieve()
        return {"status": "healthy", "message": "Stripe connected"}
    except Exception as e:
        return {"status": "unhealthy", "message": str(e)}

billing_module = ModuleDefinition(
    code="billing",
    name="Billing",
    health_check=check_billing_health,
    # ...
)

Registering Module Health Checks

from app.core.observability import register_module_health_checks

# Call after modules are loaded (e.g., in app lifespan)
register_module_health_checks()

This registers health checks as module:{code} (e.g., module:billing).
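The naming scheme can be sketched as a loop over loaded modules that skips any module without a health check. `SketchModule` and `register_module_checks` are hypothetical stand-ins; the real `register_module_health_checks()` takes no arguments and presumably discovers modules itself.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional

ModuleCheck = Callable[[], Dict[str, str]]


@dataclass
class SketchModule:
    code: str
    health_check: Optional[ModuleCheck] = None


def register_module_checks(
    modules: Iterable[SketchModule],
    checks: Dict[str, ModuleCheck],
) -> None:
    """Register each module's health check under the name module:{code}."""
    for module in modules:
        if module.health_check is None:
            continue  # modules without a health check are skipped
        checks[f"module:{module.code}"] = module.health_check


checks: Dict[str, ModuleCheck] = {}
register_module_checks(
    [
        SketchModule("billing", lambda: {"status": "healthy", "message": "Stripe connected"}),
        SketchModule("catalog"),  # no health check defined
    ],
    checks,
)
```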

External Tools

Flower (Celery Monitoring)

Configure Flower URL to include in /health/tools:

init_observability(
    flower_url="http://flower:5555",
)

Grafana

Configure Grafana URL:

init_observability(
    grafana_url="http://grafana:3000",
)

Initialization

Application Lifespan

Observability is initialized in app/core/lifespan.py and the health router is mounted in main.py:

# app/core/lifespan.py
from contextlib import asynccontextmanager

from fastapi import FastAPI

from app.core.config import settings
from app.core.observability import init_observability, shutdown_observability

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup
    init_observability(
        enable_metrics=settings.enable_metrics,
        sentry_dsn=settings.sentry_dsn,
        environment=settings.sentry_environment,
        flower_url=settings.flower_url,
        grafana_url=settings.grafana_url,
    )
    yield
    # Shutdown
    shutdown_observability()

# main.py
from app.core.observability import health_router
app.include_router(health_router)  # /metrics, /health/live, /health/ready, /health/tools

Note: /health is defined separately in main.py with a richer response (DB check, feature list, docs links). The health_router provides the Kubernetes-style probes and Prometheus endpoint.

Environment Variables

Variable	Config field	Description	Default
`ENABLE_METRICS`	`enable_metrics`	Enable Prometheus metrics collection	`False`
`GRAFANA_URL`	`grafana_url`	Grafana dashboard URL	`https://grafana.wizard.lu`
`GRAFANA_ADMIN_USER`	—	Grafana admin username (docker-compose only)	`admin`
`GRAFANA_ADMIN_PASSWORD`	—	Grafana admin password (docker-compose only)	`changeme`
`SENTRY_DSN`	`sentry_dsn`	Sentry DSN for error tracking	`None` (disabled)
`SENTRY_ENVIRONMENT`	`sentry_environment`	Environment name for Sentry	`development`
`FLOWER_URL`	`flower_url`	Flower dashboard URL	`http://localhost:5555`
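As a rough illustration of how the application-level variables in the table map to config fields and defaults, here is a standard-library sketch. It is not the Orion settings class (which lives in app/core/config.py and may use a settings library); `ObservabilitySettings`, `load_settings`, and the accepted boolean spellings are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(raw: str) -> bool:
    # Assumed truthy spellings; the real settings loader may accept others.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ObservabilitySettings:
    enable_metrics: bool = False
    grafana_url: str = "https://grafana.wizard.lu"
    sentry_dsn: Optional[str] = None
    sentry_environment: str = "development"
    flower_url: str = "http://localhost:5555"


def load_settings(env: Mapping[str, str] = os.environ) -> ObservabilitySettings:
    """Read observability settings from environment variables with defaults."""
    return ObservabilitySettings(
        enable_metrics=_as_bool(env.get("ENABLE_METRICS", "false")),
        grafana_url=env.get("GRAFANA_URL", "https://grafana.wizard.lu"),
        sentry_dsn=env.get("SENTRY_DSN") or None,  # empty string means disabled
        sentry_environment=env.get("SENTRY_ENVIRONMENT", "development"),
        flower_url=env.get("FLOWER_URL", "http://localhost:5555"),
    )


defaults = load_settings({})
production = load_settings({"ENABLE_METRICS": "true", "SENTRY_ENVIRONMENT": "production"})
```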

Kubernetes Integration

Deployment Configuration

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: app
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10

Prometheus ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
spec:
  endpoints:
    - port: http
      path: /metrics
      interval: 15s

Best Practices

Do

Register health checks for critical dependencies (database, cache, external APIs)
Use appropriate metric types (counter for counts, histogram for latency)
Include meaningful labels but avoid high cardinality
Set up alerts based on health status changes

Don't

Create health checks that are slow or have side effects
Add high-cardinality labels to metrics (e.g., user IDs)
Ignore Sentry errors in production
Skip readiness probes in Kubernetes deployments
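One way to keep a slow dependency from stalling the health endpoint is to run each check with a deadline. The sketch below assumes dict-style checks like the module health check shown earlier; `run_with_timeout` is a hypothetical helper and is not part of the framework described here.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Dict


def run_with_timeout(check: Callable[[], Dict[str, str]], timeout_s: float) -> Dict[str, str]:
    """Run a health check, reporting unhealthy if it fails or exceeds timeout_s."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(check)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return {"status": "unhealthy", "message": f"check exceeded {timeout_s}s"}
    except Exception as exc:
        return {"status": "unhealthy", "message": str(exc)}
    finally:
        # Do not block on a stuck check; the worker thread finishes on its own.
        executor.shutdown(wait=False)


def slow_check() -> Dict[str, str]:
    time.sleep(0.3)
    return {"status": "healthy", "message": "late"}


timed_out = run_with_timeout(slow_check, timeout_s=0.05)
fast = run_with_timeout(lambda: {"status": "healthy", "message": "ok"}, timeout_s=1.0)
```

Note that the timed-out check keeps running in its worker thread until it returns; the timeout only bounds how long the caller waits.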

Related Documentation

Hetzner Server Setup (Step 18) - Production monitoring deployment
Module System - Module health check integration
Background Tasks - Celery monitoring with Flower
Deployment - Production deployment with monitoring
