Observability Framework

The Orion platform includes an observability framework for monitoring application health, collecting metrics, and tracking errors. It belongs to the Framework Layer: the infrastructure that modules depend on.

Overview

┌─────────────────────────────────────────────────────────────────────────┐
│                     OBSERVABILITY FRAMEWORK                              │
│                    app/core/observability.py                            │
│                                                                          │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐        │
│  │ Health Checks   │  │ Prometheus      │  │ Sentry          │        │
│  │ Registry        │  │ Metrics         │  │ Integration     │        │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘        │
│           │                    │                    │                   │
│           ▼                    ▼                    ▼                   │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                      API Endpoints                               │   │
│  │  /health  │  /health/live  │  /health/ready  │  /metrics        │   │
│  └─────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
                    ┌───────────────────────────────┐
                    │      External Tools           │
                    │  Flower │ Grafana │ Prometheus│
                    └───────────────────────────────┘

Health Checks

Health Check Registry

Components register health checks that are aggregated into a single endpoint.

from app.core.observability import (
    health_registry,
    HealthCheckResult,
    HealthStatus,
)

# Register using the decorator.
# `db` is assumed to be an existing database session or connection in scope.
@health_registry.register("database")
def check_database() -> HealthCheckResult:
    try:
        db.execute("SELECT 1")
        return HealthCheckResult(
            name="database",
            status=HealthStatus.HEALTHY,
            message="Database connection OK"
        )
    except Exception as e:
        return HealthCheckResult(
            name="database",
            status=HealthStatus.UNHEALTHY,
            message=str(e)
        )

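# Or register directly. `check_redis_connection` is assumed to be a callable
# defined elsewhere that returns a HealthCheckResult.
health_registry.register_check("redis", check_redis_connection)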
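To make the two registration styles concrete, here is a minimal sketch of how such a registry could work. `SketchHealthRegistry` and `run_all` are hypothetical names used for illustration only; the real implementation in app/core/observability.py may differ.

```python
from typing import Callable, Dict


class SketchHealthRegistry:
    """Illustrative stand-in, not the actual Orion health registry."""

    def __init__(self) -> None:
        self._checks: Dict[str, Callable[[], object]] = {}

    def register_check(self, name: str, func: Callable[[], object]) -> None:
        # Direct registration: store the callable under its name.
        self._checks[name] = func

    def register(self, name: str):
        # Decorator registration: delegates to register_check and
        # returns the function unchanged so it stays callable.
        def decorator(func: Callable[[], object]) -> Callable[[], object]:
            self.register_check(name, func)
            return func

        return decorator

    def run_all(self) -> Dict[str, object]:
        # Run every registered check and collect results by name.
        return {name: check() for name, check in self._checks.items()}


registry = SketchHealthRegistry()


@registry.register("database")
def check_database() -> str:
    return "healthy"


registry.register_check("redis", lambda: "healthy")
results = registry.run_all()
```

Both styles end up in the same dictionary, which is what lets a single endpoint aggregate them.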

Health Status Levels

Status      Description                       HTTP Response
HEALTHY     All systems operational           200
DEGRADED    Partial functionality available   200
UNHEALTHY   Critical failure                  200 (check response body)

HealthCheckResult Fields

Field        Type           Description
name         str            Check identifier
status       HealthStatus   Health status level
message      str            Optional status message
latency_ms   float          Check execution time
details      dict           Additional diagnostic data
checked_at   datetime       Timestamp of check
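The fields above map naturally onto a dataclass. The following is a sketch only: the default values, the lowercase enum values (taken from the JSON responses in this document), and the `timed` helper are assumptions, not the framework's confirmed definitions.

```python
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Callable, Dict


class HealthStatus(str, Enum):
    # Illustrative re-declaration; values mirror the JSON responses.
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class HealthCheckResult:
    # Illustrative re-declaration; defaults are assumptions.
    name: str
    status: HealthStatus
    message: str = ""
    latency_ms: float = 0.0
    details: Dict[str, Any] = field(default_factory=dict)
    checked_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def timed(check: Callable[[], HealthCheckResult]) -> HealthCheckResult:
    """Hypothetical helper: run a check and record how long it took."""
    start = time.perf_counter()
    result = check()
    result.latency_ms = (time.perf_counter() - start) * 1000.0
    return result


example = timed(
    lambda: HealthCheckResult(
        name="database",
        status=HealthStatus.HEALTHY,
        message="Database connection OK",
    )
)
```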

API Endpoints

GET /health

Aggregated health check endpoint. Returns combined status from all registered checks.

Response:

{
    "status": "healthy",
    "timestamp": "2026-01-27T10:30:00Z",
    "checks": [
        {
            "name": "database",
            "status": "healthy",
            "message": "Connection OK",
            "latency_ms": 2.5,
            "details": {}
        },
        {
            "name": "module:billing",
            "status": "healthy",
            "message": "",
            "latency_ms": 0.1,
            "details": {}
        }
    ]
}

Overall Status Logic:

  • If any check is UNHEALTHY → overall is UNHEALTHY
  • If any check is DEGRADED and none UNHEALTHY → overall is DEGRADED
  • Otherwise → HEALTHY
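The three rules above can be sketched as a small function. The enum is re-declared here so the example is self-contained, and `overall_status` is a hypothetical name; treating an empty list of checks as healthy follows from the "otherwise" rule but is not confirmed by the source.

```python
from enum import Enum
from typing import Iterable


class HealthStatus(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


def overall_status(statuses: Iterable[HealthStatus]) -> HealthStatus:
    """Combine individual check statuses into one overall status."""
    collected = list(statuses)
    # Rule 1: any unhealthy check makes the whole system unhealthy.
    if any(s is HealthStatus.UNHEALTHY for s in collected):
        return HealthStatus.UNHEALTHY
    # Rule 2: no unhealthy checks, but at least one degraded.
    if any(s is HealthStatus.DEGRADED for s in collected):
        return HealthStatus.DEGRADED
    # Rule 3: everything else is healthy.
    return HealthStatus.HEALTHY
```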

GET /health/live

Kubernetes liveness probe. Returns 200 if the application is running.

Response:

{"status": "alive"}

GET /health/ready

Kubernetes readiness probe. Returns ready status based on health checks.

Response:

{
    "status": "ready",
    "health": "healthy"
}

Or if unhealthy:

{
    "status": "not_ready",
    "health": "unhealthy"
}
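The two responses shown for this endpoint can be derived from the overall health status roughly as follows. This is a sketch: `readiness_payload` is a hypothetical helper, and the source does not say how a DEGRADED status is treated, so that choice is exposed as a parameter rather than decided here.

```python
from typing import Dict


def readiness_payload(overall: str, degraded_is_ready: bool = True) -> Dict[str, str]:
    """Build a readiness body from the overall health status string."""
    ready_states = {"healthy"}
    # Assumption: whether "degraded" still counts as ready is not
    # documented, so the caller decides.
    if degraded_is_ready:
        ready_states.add("degraded")
    status = "ready" if overall in ready_states else "not_ready"
    return {"status": status, "health": overall}
```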

GET /metrics

Prometheus metrics endpoint. Returns metrics in Prometheus text format.

Response:

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/products",status="200"} 1234
...

GET /health/tools

Returns URLs to external monitoring tools.

Response:

{
    "flower": "http://flower.example.com:5555",
    "grafana": "http://grafana.example.com:3000",
    "prometheus": null
}

Prometheus Metrics

MetricsRegistry

The metrics registry wraps prometheus_client and provides a fallback for when the library isn't installed.

from app.core.observability import metrics_registry

# Counter - tracks cumulative values
request_counter = metrics_registry.counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"]
)
request_counter.labels(method="GET", endpoint="/api/products", status="200").inc()

# Histogram - tracks distributions (latency, sizes)
request_latency = metrics_registry.histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)
request_latency.labels(endpoint="/api/products").observe(0.045)

# Gauge - tracks current values
active_connections = metrics_registry.gauge(
    "active_connections",
    "Number of active connections",
    ["pool"]
)
active_connections.labels(pool="database").set(42)

Enabling Metrics

Metrics are disabled by default. Enable them during initialization:

from app.core.observability import init_observability

init_observability(
    enable_metrics=True,
    # ... other options
)

Dummy Metrics

When prometheus_client isn't installed or metrics are disabled, the registry returns dummy metrics that silently ignore all operations. This allows code to use metrics without checking if they're enabled.
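The dummy-metric behavior is an instance of the null object pattern. A minimal sketch, assuming hypothetical names `DummyMetric` and `counter_or_dummy` (the real registry's internals are not shown in this document):

```python
from typing import Sequence


class DummyMetric:
    """Accepts the usual metric calls and does nothing."""

    def labels(self, *args, **kwargs) -> "DummyMetric":
        # Returning self lets chained calls like .labels(...).inc() work.
        return self

    def inc(self, amount: float = 1) -> None:
        pass

    def dec(self, amount: float = 1) -> None:
        pass

    def set(self, value: float) -> None:
        pass

    def observe(self, value: float) -> None:
        pass


def counter_or_dummy(
    name: str,
    documentation: str,
    labelnames: Sequence[str] = (),
    enabled: bool = False,
):
    """Return a real Counter when possible, otherwise a DummyMetric."""
    if not enabled:
        return DummyMetric()
    try:
        from prometheus_client import Counter
    except ImportError:
        return DummyMetric()
    return Counter(name, documentation, list(labelnames))


metric = counter_or_dummy("http_requests_total", "Total HTTP requests", ["method"])
metric.labels(method="GET").inc()  # silently ignored while metrics are disabled
```

Because the dummy exposes the same methods as a real metric, calling code never needs an "is metrics enabled?" branch.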

Sentry Integration

Configuration

from app.core.observability import sentry, init_observability

# Initialize via init_observability
init_observability(
    sentry_dsn="https://key@sentry.io/project",
    environment="production",
)

# Or initialize directly
sentry.init(
    dsn="https://key@sentry.io/project",
    environment="production",
    release="1.0.0"
)

Capturing Errors

import logging

from app.core.observability import sentry

logger = logging.getLogger(__name__)

try:
    risky_operation()  # placeholder for any call that may raise
except Exception as e:
    event_id = sentry.capture_exception(e)
    logger.error("Operation failed, Sentry event: %s", event_id)

# Capture messages
sentry.capture_message("User reached rate limit", level="warning")

Without Sentry

If sentry_sdk isn't installed or no DSN is provided, all capture calls silently return None.
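One way to get this behavior is a thin wrapper that only forwards to sentry_sdk once it has been initialized. The class below is a sketch under that assumption; `SketchSentry` is a hypothetical name and not the actual `sentry` object exported by app.core.observability.

```python
from typing import Optional

try:
    import sentry_sdk
except ImportError:  # the SDK is optional
    sentry_sdk = None


class SketchSentry:
    """Illustrative wrapper: a no-op unless the SDK and a DSN are present."""

    def __init__(self) -> None:
        self._enabled = False

    def init(
        self,
        dsn: Optional[str] = None,
        environment: str = "development",
        release: Optional[str] = None,
    ) -> bool:
        # Stay disabled when the SDK is missing or no DSN was given.
        if sentry_sdk is None or not dsn:
            return False
        sentry_sdk.init(dsn=dsn, environment=environment, release=release)
        self._enabled = True
        return True

    def capture_exception(self, exc: BaseException) -> Optional[str]:
        if not self._enabled:
            return None
        return sentry_sdk.capture_exception(exc)

    def capture_message(self, message: str, level: str = "info") -> Optional[str]:
        if not self._enabled:
            return None
        return sentry_sdk.capture_message(message, level=level)


sketch_sentry = SketchSentry()
sketch_sentry.init(dsn=None)  # no DSN, so the wrapper stays disabled
```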

Module Health Checks

Modules can provide health checks that are automatically registered.

Defining Module Health Check

# In module definition
import stripe

from app.modules.base import ModuleDefinition

def check_billing_health() -> dict:
    """Check billing service health."""
    try:
        # Check Stripe connection
        stripe.Account.retrieve()
        return {"status": "healthy", "message": "Stripe connected"}
    except Exception as e:
        return {"status": "unhealthy", "message": str(e)}

billing_module = ModuleDefinition(
    code="billing",
    name="Billing",
    health_check=check_billing_health,
    # ...
)

Registering Module Health Checks

from app.core.observability import register_module_health_checks

# Call after modules are loaded (e.g., in app lifespan)
register_module_health_checks()

This registers health checks as module:{code} (e.g., module:billing).
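The naming convention can be illustrated with a small sketch. `SketchModule` and `collect_module_checks` are hypothetical stand-ins for `ModuleDefinition` and the internals of `register_module_health_checks`, which this document does not show; skipping modules without a health check is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional


@dataclass
class SketchModule:
    # Stand-in carrying only the fields this example needs.
    code: str
    health_check: Optional[Callable[[], dict]] = None


def collect_module_checks(
    modules: Iterable[SketchModule],
) -> Dict[str, Callable[[], dict]]:
    """Map each module's health check to the name module:{code}."""
    checks: Dict[str, Callable[[], dict]] = {}
    for module in modules:
        if module.health_check is None:
            continue  # assumption: modules without a check are skipped
        checks[f"module:{module.code}"] = module.health_check
    return checks


modules = [
    SketchModule("billing", lambda: {"status": "healthy", "message": "Stripe connected"}),
    SketchModule("catalog"),  # defines no health check
]
module_checks = collect_module_checks(modules)
```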

External Tools

Flower (Celery Monitoring)

Configure the Flower URL so that it is included in /health/tools:

init_observability(
    flower_url="http://flower:5555",
)

Grafana

Configure the Grafana URL:

init_observability(
    grafana_url="http://grafana:3000",
)

Initialization

Application Lifespan

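# main.py
# `settings` below is assumed to be the application's configuration object;
# import it from wherever the project defines it.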
from contextlib import asynccontextmanager
from fastapi import FastAPI
from app.core.observability import (
    init_observability,
    shutdown_observability,
    health_router,
    register_module_health_checks,
)

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup
    init_observability(
        enable_metrics=True,
        sentry_dsn=settings.SENTRY_DSN,
        environment=settings.ENVIRONMENT,
        flower_url=settings.FLOWER_URL,
        grafana_url=settings.GRAFANA_URL,
    )
    register_module_health_checks()

    yield

    # Shutdown
    shutdown_observability()

app = FastAPI(lifespan=lifespan)
app.include_router(health_router)

Environment Variables

Variable         Description                      Default
SENTRY_DSN       Sentry DSN for error tracking    None (disabled)
ENVIRONMENT      Environment name                 "development"
ENABLE_METRICS   Enable Prometheus metrics        False
FLOWER_URL       Flower dashboard URL             None
GRAFANA_URL      Grafana dashboard URL            None
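As an illustration of how these variables and defaults might be loaded, here is a sketch using only the standard library. `ObservabilitySettings` is a hypothetical class, and the set of strings accepted as "true" for ENABLE_METRICS is an assumption; the project's real settings object may parse values differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

_TRUTHY = {"1", "true", "yes", "on"}  # assumed spellings of "enabled"


@dataclass(frozen=True)
class ObservabilitySettings:
    sentry_dsn: Optional[str] = None
    environment: str = "development"
    enable_metrics: bool = False
    flower_url: Optional[str] = None
    grafana_url: Optional[str] = None

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "ObservabilitySettings":
        # Empty strings are treated the same as unset variables.
        return cls(
            sentry_dsn=env.get("SENTRY_DSN") or None,
            environment=env.get("ENVIRONMENT") or "development",
            enable_metrics=env.get("ENABLE_METRICS", "").strip().lower() in _TRUTHY,
            flower_url=env.get("FLOWER_URL") or None,
            grafana_url=env.get("GRAFANA_URL") or None,
        )


defaults = ObservabilitySettings.from_env({})
```

Passing the mapping explicitly, as in `from_env({})`, keeps the loader easy to test without touching the real process environment.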

Kubernetes Integration

Deployment Configuration

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: app
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10

Prometheus ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
spec:
  endpoints:
    - port: http
      path: /metrics
      interval: 15s

Best Practices

Do

  • Register health checks for critical dependencies (database, cache, external APIs)
  • Use appropriate metric types (counter for counts, histogram for latency)
  • Include meaningful labels but avoid high cardinality
  • Set up alerts based on health status changes

Don't

  • Create health checks that are slow or have side effects
  • Add high-cardinality labels to metrics (e.g., user IDs)
  • Ignore Sentry errors in production
  • Skip readiness probes in Kubernetes deployments
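The advice on label cardinality can be made concrete with a small sketch: collapse per-resource IDs in a request path before using it as the `endpoint` label, so the label set stays bounded. `endpoint_label` is a hypothetical helper, and it only handles purely numeric path segments; other ID formats would need their own patterns.

```python
import re

# Matches a path segment made only of digits, e.g. "/123",
# when it is followed by another "/" or the end of the path.
_NUMERIC_SEGMENT = re.compile(r"/\d+(?=/|$)")


def endpoint_label(path: str) -> str:
    """Replace numeric ID segments with a fixed placeholder."""
    return _NUMERIC_SEGMENT.sub("/{id}", path)
```

With this, `/api/products/1` and `/api/products/2` share one time series instead of creating one per product.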