Samir Boulahtit 7dbdbd4c7e docs: add observability, creating modules guide, and unified migration plan

- Add observability framework documentation (health checks, metrics, Sentry)
- Add developer guide for creating modules
- Add comprehensive module migration plan with Celery task integration
- Update architecture overview with module system and observability sections
- Update module-system.md with links to new docs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-27 22:41:19 +01:00

Observability Framework

The Wizamart platform includes an observability framework for monitoring application health, collecting metrics, and tracking errors. It is part of the Framework Layer: the infrastructure that modules depend on.

Overview

┌─────────────────────────────────────────────────────────────────────────┐
│                     OBSERVABILITY FRAMEWORK                              │
│                    app/core/observability.py                            │
│                                                                          │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐        │
│  │ Health Checks   │  │ Prometheus      │  │ Sentry          │        │
│  │ Registry        │  │ Metrics         │  │ Integration     │        │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘        │
│           │                    │                    │                   │
│           ▼                    ▼                    ▼                   │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                      API Endpoints                               │   │
│  │  /health  │  /health/live  │  /health/ready  │  /metrics        │   │
│  └─────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
                    ┌───────────────────────────────┐
                    │      External Tools           │
                    │  Flower │ Grafana │ Prometheus│
                    └───────────────────────────────┘

Health Checks

Health Check Registry

Components register health checks with the registry, which aggregates their results into a single endpoint (GET /health).

from app.core.observability import (
    health_registry,
    HealthCheckResult,
    HealthStatus,
)

# Register using decorator
@health_registry.register("database")
def check_database() -> HealthCheckResult:
    try:
        # `db` is assumed to be an open database session or connection
        db.execute("SELECT 1")
        return HealthCheckResult(
            name="database",
            status=HealthStatus.HEALTHY,
            message="Database connection OK"
        )
    except Exception as e:
        return HealthCheckResult(
            name="database",
            status=HealthStatus.UNHEALTHY,
            message=str(e)
        )

# Or register an existing callable directly
# (check_redis_connection is assumed to be defined elsewhere)
health_registry.register_check("redis", check_redis_connection)
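
The two registration styles can be pictured with a minimal, self-contained sketch. `SketchHealthRegistry`, its `run_all` method, and the choice to turn a raised exception into an UNHEALTHY result are assumptions made for illustration; they are not taken from `app/core/observability.py`, and the real registry may behave differently.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class HealthStatus(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class HealthCheckResult:
    name: str
    status: HealthStatus
    message: str = ""


class SketchHealthRegistry:
    """Hypothetical stand-in for health_registry, for illustration only."""

    def __init__(self) -> None:
        self._checks: Dict[str, Callable[[], HealthCheckResult]] = {}

    def register(self, name: str):
        # Decorator form: store the function under `name` and return it unchanged.
        def decorator(func: Callable[[], HealthCheckResult]):
            self._checks[name] = func
            return func
        return decorator

    def register_check(self, name: str, func: Callable[[], HealthCheckResult]) -> None:
        # Direct form: same effect as the decorator.
        self._checks[name] = func

    def run_all(self) -> List[HealthCheckResult]:
        results = []
        for name, check in self._checks.items():
            try:
                results.append(check())
            except Exception as exc:
                # Assumption: a check that raises is reported as UNHEALTHY.
                results.append(HealthCheckResult(name, HealthStatus.UNHEALTHY, str(exc)))
        return results


registry = SketchHealthRegistry()


@registry.register("cache")
def check_cache() -> HealthCheckResult:
    return HealthCheckResult("cache", HealthStatus.HEALTHY, "Cache OK")


def check_queue() -> HealthCheckResult:
    raise RuntimeError("broker unreachable")


registry.register_check("queue", check_queue)
results = registry.run_all()
```

Catching exceptions inside the registry keeps one misbehaving check from breaking the whole aggregated response, which is one plausible reason to centralize checks this way.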

Health Status Levels

Status	Description	HTTP Response
`HEALTHY`	All systems operational	200
`DEGRADED`	Partial functionality available	200
`UNHEALTHY`	Critical failure	200 (check response body)

HealthCheckResult Fields

Field	Type	Description
`name`	str	Check identifier
`status`	HealthStatus	Health status level
`message`	str	Optional status message
`latency_ms`	float	Check execution time
`details`	dict	Additional diagnostic data
`checked_at`	datetime	Timestamp of check
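
As a rough illustration of how `latency_ms` and `checked_at` could be populated, the sketch below times a probe function and records a UTC timestamp. `SketchResult` and `timed_check` are hypothetical names, and the defaults shown are assumptions rather than the framework's actual behavior.

```python
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable, Dict


@dataclass
class SketchResult:
    """Hypothetical mirror of the fields listed above."""
    name: str
    status: str
    message: str = ""
    latency_ms: float = 0.0
    details: Dict[str, Any] = field(default_factory=dict)
    checked_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def timed_check(name: str, probe: Callable[[], Dict[str, Any]]) -> SketchResult:
    """Run `probe`, measure how long it took, and wrap the outcome."""
    start = time.perf_counter()
    try:
        details = probe()
        status, message = "healthy", ""
    except Exception as exc:
        details, status, message = {}, "unhealthy", str(exc)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return SketchResult(name, status, message, latency_ms, details)


result = timed_check("disk", lambda: {"free_gb": 120})
```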

API Endpoints

GET /health

Aggregated health check endpoint. Returns combined status from all registered checks.

Response:

{
    "status": "healthy",
    "timestamp": "2026-01-27T10:30:00Z",
    "checks": [
        {
            "name": "database",
            "status": "healthy",
            "message": "Connection OK",
            "latency_ms": 2.5,
            "details": {}
        },
        {
            "name": "module:billing",
            "status": "healthy",
            "message": "",
            "latency_ms": 0.1,
            "details": {}
        }
    ]
}

Overall Status Logic:

If any check is UNHEALTHY → overall is UNHEALTHY
If any check is DEGRADED and none UNHEALTHY → overall is DEGRADED
Otherwise → HEALTHY
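
The three rules above amount to "worst status wins". A minimal sketch, assuming the enum values match the lowercase strings seen in the JSON response (the function name `overall_status` is hypothetical):

```python
from enum import Enum
from typing import Iterable


class HealthStatus(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


def overall_status(statuses: Iterable[HealthStatus]) -> HealthStatus:
    """Combine individual check statuses into one overall status."""
    seen = set(statuses)
    if HealthStatus.UNHEALTHY in seen:
        return HealthStatus.UNHEALTHY
    if HealthStatus.DEGRADED in seen:
        return HealthStatus.DEGRADED
    # Covers both "all healthy" and "no checks registered".
    return HealthStatus.HEALTHY
```

Note that with this logic an empty set of checks reports HEALTHY, which follows from the "Otherwise" rule.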

GET /health/live

Kubernetes liveness probe. Returns 200 if the application is running.

Response:

{"status": "alive"}

GET /health/ready

Kubernetes readiness probe. Reports whether the application is ready to serve traffic, based on the registered health checks.

Response:

{
    "status": "ready",
    "health": "healthy"
}

Or if unhealthy:

{
    "status": "not_ready",
    "health": "unhealthy"
}
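
The mapping from overall health to the readiness payload might look like the sketch below. Two points are assumptions that this document does not settle: whether a degraded application still counts as ready, and which HTTP status accompanies `not_ready`. Kubernetes `httpGet` probes treat any status from 200 to 399 as success, so the sketch uses 503 for the not-ready case; check the actual endpoint before relying on this.

```python
from typing import Dict, Tuple


def readiness_response(health: str) -> Tuple[int, Dict[str, str]]:
    """Build a (status_code, body) pair for a readiness probe.

    `health` is the overall status string, e.g. "healthy".
    """
    # Assumption: degraded still serves traffic, so it counts as ready.
    ready = health in ("healthy", "degraded")
    # Assumption: 503 signals "not ready" to an httpGet probe.
    status_code = 200 if ready else 503
    body = {"status": "ready" if ready else "not_ready", "health": health}
    return status_code, body
```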

GET /metrics

Prometheus metrics endpoint. Returns metrics in Prometheus text format.

Response:

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/products",status="200"} 1234
...
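
To make the text format concrete, the sketch below renders a counter by hand using only the standard library. It is purely illustrative: in practice `prometheus_client` produces this output, and the helper names `escape_label_value` and `render_counter` are hypothetical.

```python
from typing import Dict, List, Tuple


def escape_label_value(value: str) -> str:
    # The text format requires escaping backslash, double quote, and newline.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(
    name: str,
    help_text: str,
    samples: List[Tuple[Dict[str, str], int]],
) -> str:
    """Render one counter family as HELP/TYPE lines plus one line per sample."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in labels.items()
        )
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


text = render_counter(
    "http_requests_total",
    "Total HTTP requests",
    [({"method": "GET", "endpoint": "/api/products", "status": "200"}, 1234)],
)
```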

GET /health/tools

Returns URLs to external monitoring tools.

Response:

{
    "flower": "http://flower.example.com:5555",
    "grafana": "http://grafana.example.com:3000",
    "prometheus": null
}

Prometheus Metrics

MetricsRegistry

The metrics registry wraps prometheus_client and falls back to no-op metrics when the library isn't installed.

from app.core.observability import metrics_registry

# Counter - tracks cumulative values
request_counter = metrics_registry.counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"]
)
request_counter.labels(method="GET", endpoint="/api/products", status="200").inc()

# Histogram - tracks distributions (latency, sizes)
request_latency = metrics_registry.histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)
request_latency.labels(endpoint="/api/products").observe(0.045)

# Gauge - tracks current values
active_connections = metrics_registry.gauge(
    "active_connections",
    "Number of active connections",
    ["pool"]
)
active_connections.labels(pool="database").set(42)

Enabling Metrics

Metrics are disabled by default. Enable them during initialization:

from app.core.observability import init_observability

init_observability(
    enable_metrics=True,
    # ... other options
)

Dummy Metrics

When prometheus_client isn't installed or metrics are disabled, the registry returns dummy metrics that silently ignore all operations. This allows code to use metrics without checking if they're enabled.
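
One way to picture the dummy-metric fallback is a null object whose methods accept any arguments and do nothing. `NoOpMetric` and `SketchMetricsRegistry` are hypothetical names for this sketch; the real registry's internals are not shown in this document.

```python
from typing import Sequence


class NoOpMetric:
    """Accepts the usual metric calls and silently discards them."""

    def labels(self, *args, **kwargs) -> "NoOpMetric":
        return self  # Allows chaining: metric.labels(...).inc()

    def inc(self, amount: float = 1) -> None:
        pass

    def observe(self, value: float) -> None:
        pass

    def set(self, value: float) -> None:
        pass


class SketchMetricsRegistry:
    """Hypothetical registry: real counters when possible, no-ops otherwise."""

    def __init__(self, enabled: bool) -> None:
        self.enabled = enabled

    def counter(self, name: str, description: str, labelnames: Sequence[str] = ()):
        if not self.enabled:
            return NoOpMetric()
        try:
            from prometheus_client import Counter
        except ImportError:
            return NoOpMetric()
        return Counter(name, description, list(labelnames))


disabled = SketchMetricsRegistry(enabled=False)
requests_total = disabled.counter("http_requests_total", "Total HTTP requests", ["method"])
requests_total.labels(method="GET").inc()  # No error, no effect.
```

Because the no-op object mirrors the call shape of a real metric, calling code stays free of `if metrics_enabled:` branches.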

Sentry Integration

Configuration

from app.core.observability import sentry, init_observability

# Initialize via init_observability
init_observability(
    sentry_dsn="https://key@sentry.io/project",
    environment="production",
)

# Or initialize directly
sentry.init(
    dsn="https://key@sentry.io/project",
    environment="production",
    release="1.0.0"
)

Capturing Errors

import logging

from app.core.observability import sentry

logger = logging.getLogger(__name__)

try:
    risky_operation()
except Exception as e:
    event_id = sentry.capture_exception(e)
    logger.error(f"Operation failed, Sentry event: {event_id}")

# Capture messages
sentry.capture_message("User reached rate limit", level="warning")

Without Sentry

If sentry_sdk isn't installed or DSN isn't provided, all capture calls silently return None.
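
The silent fallback could be structured roughly as follows. `SketchSentry` is a hypothetical wrapper written for illustration; only `sentry_sdk.init`, `sentry_sdk.capture_exception`, and `sentry_sdk.capture_message` are real SDK calls, and they are reached only when the SDK is installed and a DSN was supplied.

```python
from typing import Optional

try:
    import sentry_sdk
except ImportError:  # SDK not installed: every capture call becomes a no-op.
    sentry_sdk = None


class SketchSentry:
    """Hypothetical wrapper that degrades to no-ops without SDK or DSN."""

    def __init__(self) -> None:
        self._enabled = False

    def init(
        self,
        dsn: Optional[str],
        environment: str = "development",
        release: Optional[str] = None,
    ) -> None:
        if sentry_sdk is None or not dsn:
            return
        sentry_sdk.init(dsn=dsn, environment=environment, release=release)
        self._enabled = True

    def capture_exception(self, exc: BaseException) -> Optional[str]:
        if not self._enabled:
            return None
        return sentry_sdk.capture_exception(exc)

    def capture_message(self, message: str, level: str = "info") -> Optional[str]:
        if not self._enabled:
            return None
        return sentry_sdk.capture_message(message, level=level)


sketch_sentry = SketchSentry()
sketch_sentry.init(dsn=None)  # No DSN: stays disabled.
event_id = sketch_sentry.capture_exception(ValueError("example"))
```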

Module Health Checks

Modules can provide health checks that are automatically registered.

Defining Module Health Check

# In module definition
import stripe

from app.modules.base import ModuleDefinition

def check_billing_health() -> dict:
    """Check billing service health."""
    try:
        # Check Stripe connection
        stripe.Account.retrieve()
        return {"status": "healthy", "message": "Stripe connected"}
    except Exception as e:
        return {"status": "unhealthy", "message": str(e)}

billing_module = ModuleDefinition(
    code="billing",
    name="Billing",
    health_check=check_billing_health,
    # ...
)

Registering Module Health Checks

from app.core.observability import register_module_health_checks

# Call after modules are loaded (e.g., in app lifespan)
register_module_health_checks()

This registers health checks as module:{code} (e.g., module:billing).
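
The naming convention can be sketched as a loop over loaded modules. `SketchModule` and `register_module_checks` are hypothetical stand-ins, and skipping modules that define no health check is an assumption rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

HealthCheck = Callable[[], dict]


@dataclass
class SketchModule:
    """Hypothetical, trimmed-down stand-in for ModuleDefinition."""
    code: str
    health_check: Optional[HealthCheck] = None


def register_module_checks(
    modules: List[SketchModule],
    checks: Dict[str, HealthCheck],
) -> List[str]:
    """Register each module's health check under the name module:{code}."""
    registered = []
    for module in modules:
        if module.health_check is None:
            continue  # Assumption: modules without a check are skipped.
        name = f"module:{module.code}"
        checks[name] = module.health_check
        registered.append(name)
    return registered


checks: Dict[str, HealthCheck] = {}
names = register_module_checks(
    [
        SketchModule("billing", lambda: {"status": "healthy", "message": "Stripe connected"}),
        SketchModule("catalog"),  # No health check defined.
    ],
    checks,
)
```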

External Tools

Flower (Celery Monitoring)

Configure the Flower URL so that it is returned by /health/tools:

init_observability(
    flower_url="http://flower:5555",
)

Grafana

Configure the Grafana URL in the same way:

init_observability(
    grafana_url="http://grafana:3000",
)

Initialization

Application Lifespan

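# main.py
# `settings` is assumed to be the application's settings object,
# imported from the project's configuration module.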
from contextlib import asynccontextmanager
from fastapi import FastAPI
from app.core.observability import (
    init_observability,
    shutdown_observability,
    health_router,
    register_module_health_checks,
)

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup
    init_observability(
        enable_metrics=True,
        sentry_dsn=settings.SENTRY_DSN,
        environment=settings.ENVIRONMENT,
        flower_url=settings.FLOWER_URL,
        grafana_url=settings.GRAFANA_URL,
    )
    register_module_health_checks()

    yield

    # Shutdown
    shutdown_observability()

app = FastAPI(lifespan=lifespan)
app.include_router(health_router)

Environment Variables

Variable	Description	Default
`SENTRY_DSN`	Sentry DSN for error tracking	None (disabled)
`ENVIRONMENT`	Environment name	"development"
`ENABLE_METRICS`	Enable Prometheus metrics	False
`FLOWER_URL`	Flower dashboard URL	None
`GRAFANA_URL`	Grafana dashboard URL	None
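
A sketch of how these variables might be read and mapped onto the `init_observability` keyword arguments is shown below. The helper `load_observability_settings` is hypothetical, and the set of strings accepted as "true" for `ENABLE_METRICS` is an assumption; the project's real settings loader may parse values differently.

```python
import os
from typing import Any, Dict, Mapping, Optional


def load_observability_settings(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Read the variables from the table above, applying the listed defaults."""

    def optional(name: str) -> Optional[str]:
        # Treat unset and empty values the same way: feature disabled.
        value = env.get(name, "").strip()
        return value or None

    # Assumption: these spellings count as "enabled".
    truthy = ("1", "true", "yes", "on")
    return {
        "sentry_dsn": optional("SENTRY_DSN"),
        "environment": env.get("ENVIRONMENT", "development"),
        "enable_metrics": env.get("ENABLE_METRICS", "false").strip().lower() in truthy,
        "flower_url": optional("FLOWER_URL"),
        "grafana_url": optional("GRAFANA_URL"),
    }


defaults = load_observability_settings({})
production = load_observability_settings(
    {"ENVIRONMENT": "production", "ENABLE_METRICS": "True", "FLOWER_URL": "http://flower:5555"}
)
```

The returned dictionary uses the same keys as the keyword arguments in the lifespan example, so it could be passed along as `init_observability(**values)` if the real signature matches.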

Kubernetes Integration

Deployment Configuration

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: app
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10

Prometheus ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
spec:
  endpoints:
    - port: http
      path: /metrics
      interval: 15s

Best Practices

Do

Register health checks for critical dependencies (database, cache, external APIs)
Use appropriate metric types (counter for counts, histogram for latency)
Include meaningful labels but avoid high cardinality
Set up alerts based on health status changes

Don't

Create health checks that are slow or have side effects
Add high-cardinality labels to metrics (e.g., user IDs)
Ignore Sentry errors in production
Skip readiness probes in Kubernetes deployments

Related Documentation

Module System - Module health check integration
Background Tasks - Celery monitoring with Flower
Deployment - Production deployment with monitoring
