Observability Framework
The Wizamart platform includes an observability framework for monitoring application health, collecting metrics, and tracking errors. It is part of the Framework Layer, the infrastructure that modules depend on.
Overview
┌─────────────────────────────────────────────────────────────────────────┐
│                         OBSERVABILITY FRAMEWORK                         │
│                        app/core/observability.py                        │
│                                                                         │
│    ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐    │
│    │  Health Checks  │    │   Prometheus    │    │     Sentry      │    │
│    │    Registry     │    │     Metrics     │    │   Integration   │    │
│    └────────┬────────┘    └────────┬────────┘    └────────┬────────┘    │
│             │                      │                      │             │
│             ▼                      ▼                      ▼             │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                          API Endpoints                          │   │
│   │        /health │ /health/live │ /health/ready │ /metrics        │   │
│   └─────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
                     ┌───────────────────────────────┐
                     │        External Tools         │
                     │ Flower │ Grafana │ Prometheus │
                     └───────────────────────────────┘
Health Checks
Health Check Registry
Components register health checks that are aggregated into a single endpoint.
from app.core.observability import (
    health_registry,
    HealthCheckResult,
    HealthStatus,
)

# Register using the decorator.
# `db` is assumed to be an existing database session or connection.
@health_registry.register("database")
def check_database() -> HealthCheckResult:
    try:
        db.execute("SELECT 1")
        return HealthCheckResult(
            name="database",
            status=HealthStatus.HEALTHY,
            message="Database connection OK",
        )
    except Exception as e:
        return HealthCheckResult(
            name="database",
            status=HealthStatus.UNHEALTHY,
            message=str(e),
        )

# Or register directly (check_redis_connection is defined elsewhere)
health_registry.register_check("redis", check_redis_connection)
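To make the two registration styles concrete, here is a minimal sketch of how such a registry could work. It is not the actual `app.core.observability` implementation: `SketchRegistry`, `SketchResult`, `Status`, and `run_all` are hypothetical names used only for illustration, and turning a raised exception into an unhealthy result is a design choice of this sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class Status(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class SketchResult:
    name: str
    status: Status
    message: str = ""


class SketchRegistry:
    """Hypothetical stand-in for the real health registry."""

    def __init__(self) -> None:
        self._checks: Dict[str, Callable[[], SketchResult]] = {}

    def register_check(self, name: str, func: Callable[[], SketchResult]) -> None:
        # Direct registration: store the callable under its name.
        self._checks[name] = func

    def register(self, name: str):
        # Decorator registration: delegate to register_check and return
        # the function unchanged so it stays callable on its own.
        def decorator(func):
            self.register_check(name, func)
            return func
        return decorator

    def run_all(self) -> List[SketchResult]:
        results = []
        for name, func in self._checks.items():
            try:
                results.append(func())
            except Exception as exc:
                # Assumption of this sketch: a check that raises is
                # reported as unhealthy instead of failing the endpoint.
                results.append(SketchResult(name, Status.UNHEALTHY, str(exc)))
        return results


registry = SketchRegistry()


@registry.register("always_ok")
def check_ok() -> SketchResult:
    return SketchResult("always_ok", Status.HEALTHY, "OK")


def check_broken() -> SketchResult:
    raise RuntimeError("connection refused")


registry.register_check("broken", check_broken)
results = registry.run_all()
```

Both styles end up in the same dictionary, so the aggregated endpoint does not need to know how a check was registered.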
Health Status Levels
| Status | Description | HTTP Response |
|---|---|---|
| HEALTHY | All systems operational | 200 |
| DEGRADED | Partial functionality available | 200 |
| UNHEALTHY | Critical failure | 200 (check response body) |
HealthCheckResult Fields
| Field | Type | Description |
|---|---|---|
| name | str | Check identifier |
| status | HealthStatus | Health status level |
| message | str | Optional status message |
| latency_ms | float | Check execution time |
| details | dict | Additional diagnostic data |
| checked_at | datetime | Timestamp of check |
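The `latency_ms` and `checked_at` fields suggest that each check is timed when it runs. One way to fill them is sketched below. `timed_check` is a hypothetical helper, and whether the real registry measures latency this way is an assumption.

```python
import time
from datetime import datetime, timezone
from typing import Callable, Tuple


def timed_check(func: Callable[[], str]) -> Tuple[str, float, datetime]:
    """Run a check and return (status, latency_ms, checked_at)."""
    checked_at = datetime.now(timezone.utc)
    start = time.perf_counter()
    status = func()
    # perf_counter returns seconds; convert the difference to milliseconds.
    latency_ms = (time.perf_counter() - start) * 1000.0
    return status, latency_ms, checked_at


def slow_check() -> str:
    time.sleep(0.02)  # simulate a dependency call of roughly 20 ms
    return "healthy"


status, latency_ms, checked_at = timed_check(slow_check)
```

`time.perf_counter` is used instead of wall-clock time because it is monotonic, so the measured latency cannot go negative if the system clock is adjusted.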
API Endpoints
GET /health
Aggregated health check endpoint. Returns combined status from all registered checks.
Response:
{
  "status": "healthy",
  "timestamp": "2026-01-27T10:30:00Z",
  "checks": [
    {
      "name": "database",
      "status": "healthy",
      "message": "Connection OK",
      "latency_ms": 2.5,
      "details": {}
    },
    {
      "name": "module:billing",
      "status": "healthy",
      "message": "",
      "latency_ms": 0.1,
      "details": {}
    }
  ]
}
Overall Status Logic:
- If any check is UNHEALTHY → overall is UNHEALTHY
- If any check is DEGRADED and none is UNHEALTHY → overall is DEGRADED
- Otherwise → HEALTHY
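The precedence rule above can be sketched as a small function. `Status` and `overall_status` are local stand-ins written for this example, not the framework's own `HealthStatus` or aggregation code.

```python
from enum import Enum
from typing import Iterable


class Status(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


def overall_status(statuses: Iterable[Status]) -> Status:
    """Combine individual check statuses, worst status wins."""
    seen = set(statuses)
    if Status.UNHEALTHY in seen:
        return Status.UNHEALTHY
    if Status.DEGRADED in seen:
        return Status.DEGRADED
    # No unhealthy or degraded checks (including the empty case).
    return Status.HEALTHY


mixed = overall_status([Status.HEALTHY, Status.DEGRADED, Status.UNHEALTHY])
degraded = overall_status([Status.HEALTHY, Status.DEGRADED])
healthy = overall_status([Status.HEALTHY, Status.HEALTHY])
```

Note that in this sketch an empty list of checks falls through to HEALTHY, which matches the "Otherwise" rule as written.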
GET /health/live
Kubernetes liveness probe. Returns 200 if the application is running.
Response:
{"status": "alive"}
GET /health/ready
Kubernetes readiness probe. Reports ready or not_ready according to the aggregated health status.
Response:
{
  "status": "ready",
  "health": "healthy"
}
Or if unhealthy:
{
  "status": "not_ready",
  "health": "unhealthy"
}
GET /metrics
Prometheus metrics endpoint. Returns metrics in Prometheus text format.
Response:
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/products",status="200"} 1234
...
GET /health/tools
Returns URLs to external monitoring tools.
Response:
{
  "flower": "http://flower.example.com:5555",
  "grafana": "http://grafana.example.com:3000",
  "prometheus": null
}
Prometheus Metrics
MetricsRegistry
The metrics registry is a wrapper around prometheus_client that falls back to no-op metrics when the library isn't installed.
from app.core.observability import metrics_registry

# Counter - tracks cumulative values
request_counter = metrics_registry.counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"],
)
request_counter.labels(method="GET", endpoint="/api/products", status="200").inc()

# Histogram - tracks distributions (latency, sizes)
request_latency = metrics_registry.histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0],
)
request_latency.labels(endpoint="/api/products").observe(0.045)

# Gauge - tracks current values
active_connections = metrics_registry.gauge(
    "active_connections",
    "Number of active connections",
    ["pool"],
)
active_connections.labels(pool="database").set(42)
Enabling Metrics
Metrics are disabled by default. Enable them during initialization:
from app.core.observability import init_observability

init_observability(
    enable_metrics=True,
    # ... other options
)
Dummy Metrics
When prometheus_client isn't installed or metrics are disabled, the registry returns dummy metrics that silently ignore all operations. This allows code to use metrics without checking if they're enabled.
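The dummy-metric behaviour is an instance of the null-object pattern. The sketch below shows the idea. `DummyMetric` and `make_counter` are hypothetical names, and the real registry may be structured differently. The only library call used is `prometheus_client.Counter(name, documentation, labelnames)`, and it is reached only when metrics are enabled and the package is importable.

```python
class DummyMetric:
    """No-op metric: accepts the usual metric calls and does nothing."""

    def labels(self, *args, **kwargs) -> "DummyMetric":
        # Return self so chained calls like metric.labels(...).inc() work.
        return self

    def inc(self, amount: float = 1) -> None:
        pass

    def dec(self, amount: float = 1) -> None:
        pass

    def set(self, value: float) -> None:
        pass

    def observe(self, value: float) -> None:
        pass


def make_counter(name: str, description: str, labelnames, enabled: bool):
    """Return a real Counter when possible, otherwise a DummyMetric."""
    if not enabled:
        return DummyMetric()
    try:
        from prometheus_client import Counter
    except ImportError:
        # Library missing: fall back to the no-op object.
        return DummyMetric()
    return Counter(name, description, labelnames)


counter = make_counter(
    "sketch_requests_total", "Example counter", ["method"], enabled=False
)
# Safe to call even though metrics are disabled.
counter.labels(method="GET").inc()
```

Because the dummy object exposes the same methods as a real metric, calling code never needs an `if metrics_enabled:` guard.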
Sentry Integration
Configuration
from app.core.observability import sentry, init_observability

# Initialize via init_observability
init_observability(
    sentry_dsn="https://key@sentry.io/project",
    environment="production",
)

# Or initialize directly
sentry.init(
    dsn="https://key@sentry.io/project",
    environment="production",
    release="1.0.0",
)
Capturing Errors
import logging

from app.core.observability import sentry

logger = logging.getLogger(__name__)

try:
    risky_operation()
except Exception as e:
    event_id = sentry.capture_exception(e)
    logger.error(f"Operation failed, Sentry event: {event_id}")

# Capture messages
sentry.capture_message("User reached rate limit", level="warning")
Without Sentry
If sentry_sdk isn't installed or no DSN is provided, all capture calls silently return None.
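A guarded wrapper along these lines would give that behaviour. `SafeSentry` is a hypothetical class written for this example, not the framework's `sentry` object. It relies only on `sentry_sdk.init`, `sentry_sdk.capture_exception`, and `sentry_sdk.capture_message`, and it never touches them unless the package is importable and a DSN was supplied.

```python
from typing import Optional

try:
    import sentry_sdk
except ImportError:  # the SDK is optional
    sentry_sdk = None


class SafeSentry:
    """Hypothetical wrapper that degrades to no-ops without Sentry."""

    def __init__(self) -> None:
        self._enabled = False

    def init(self, dsn: Optional[str] = None, **options) -> bool:
        # Stay disabled when the SDK is missing or no DSN is configured.
        if sentry_sdk is None or not dsn:
            return False
        sentry_sdk.init(dsn=dsn, **options)
        self._enabled = True
        return True

    def capture_exception(self, exc: BaseException) -> Optional[str]:
        if not self._enabled:
            return None
        return sentry_sdk.capture_exception(exc)

    def capture_message(self, message: str, level: str = "info") -> Optional[str]:
        if not self._enabled:
            return None
        return sentry_sdk.capture_message(message, level=level)


safe_sentry = SafeSentry()
initialized = safe_sentry.init(dsn=None, environment="development")
event_id = safe_sentry.capture_exception(ValueError("example"))
```

With no DSN, `init` leaves the wrapper disabled and every capture call returns None, so application code can log the event ID unconditionally.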
Module Health Checks
Modules can provide health checks that are automatically registered.
Defining Module Health Check
# In module definition
import stripe

from app.modules.base import ModuleDefinition

def check_billing_health() -> dict:
    """Check billing service health."""
    try:
        # Check Stripe connection
        stripe.Account.retrieve()
        return {"status": "healthy", "message": "Stripe connected"}
    except Exception as e:
        return {"status": "unhealthy", "message": str(e)}

billing_module = ModuleDefinition(
    code="billing",
    name="Billing",
    health_check=check_billing_health,
    # ...
)
Registering Module Health Checks
from app.core.observability import register_module_health_checks
# Call after modules are loaded (e.g., in app lifespan)
register_module_health_checks()
This registers health checks as module:{code} (e.g., module:billing).
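Since module health checks return plain dictionaries while the registry works with named results, some adapter has to sit in between. The sketch below shows one plausible shape for it. `wrap_module_check` is a hypothetical helper, and the real `register_module_health_checks` may behave differently, for example in how it handles a check that raises.

```python
from typing import Callable, Dict, Tuple


def wrap_module_check(
    code: str, check: Callable[[], dict]
) -> Tuple[str, Callable[[], Dict[str, str]]]:
    """Return (registry_name, wrapped_check) for a module health check."""
    name = f"module:{code}"

    def wrapped() -> Dict[str, str]:
        try:
            raw = check()
        except Exception as exc:
            # Assumption of this sketch: a raising check counts as unhealthy.
            raw = {"status": "unhealthy", "message": str(exc)}
        return {
            "name": name,
            "status": raw["status"],
            "message": raw.get("message", ""),
        }

    return name, wrapped


def billing_ok() -> dict:
    return {"status": "healthy", "message": "Stripe connected"}


def billing_down() -> dict:
    raise ConnectionError("Stripe unreachable")


ok_name, ok_check = wrap_module_check("billing", billing_ok)
down_name, down_check = wrap_module_check("payments", billing_down)
```

The prefix keeps module checks from colliding with infrastructure checks such as `database` or `redis` in the aggregated response.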
External Tools
Flower (Celery Monitoring)
Configure the Flower URL to include it in the /health/tools response:
init_observability(
    flower_url="http://flower:5555",
)
Grafana
Configure the Grafana URL:
init_observability(
    grafana_url="http://grafana:3000",
)
Initialization
Application Lifespan
# main.py
from contextlib import asynccontextmanager

from fastapi import FastAPI

from app.core.observability import (
    init_observability,
    shutdown_observability,
    health_router,
    register_module_health_checks,
)

# `settings` is the application's configuration object; import it from
# wherever the project defines it.

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup
    init_observability(
        enable_metrics=True,
        sentry_dsn=settings.SENTRY_DSN,
        environment=settings.ENVIRONMENT,
        flower_url=settings.FLOWER_URL,
        grafana_url=settings.GRAFANA_URL,
    )
    register_module_health_checks()
    yield
    # Shutdown
    shutdown_observability()

app = FastAPI(lifespan=lifespan)
app.include_router(health_router)
Environment Variables
| Variable | Description | Default |
|---|---|---|
| SENTRY_DSN | Sentry DSN for error tracking | None (disabled) |
| ENVIRONMENT | Environment name | "development" |
| ENABLE_METRICS | Enable Prometheus metrics | False |
| FLOWER_URL | Flower dashboard URL | None |
| GRAFANA_URL | Grafana dashboard URL | None |
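As an illustration of how these variables and defaults might be read, here is a small standard-library sketch. `ObservabilitySettings` and `load_settings` are hypothetical, and the set of strings accepted as true for `ENABLE_METRICS` is an assumption, since the table only gives the default.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ObservabilitySettings:
    sentry_dsn: Optional[str] = None
    environment: str = "development"
    enable_metrics: bool = False
    flower_url: Optional[str] = None
    grafana_url: Optional[str] = None


def load_settings(env: Mapping[str, str] = os.environ) -> ObservabilitySettings:
    """Build settings from environment variables, applying the defaults."""
    truthy = {"1", "true", "yes", "on"}  # assumed spellings of "enabled"
    return ObservabilitySettings(
        # Empty strings are treated the same as unset variables.
        sentry_dsn=env.get("SENTRY_DSN") or None,
        environment=env.get("ENVIRONMENT") or "development",
        enable_metrics=env.get("ENABLE_METRICS", "").strip().lower() in truthy,
        flower_url=env.get("FLOWER_URL") or None,
        grafana_url=env.get("GRAFANA_URL") or None,
    )


defaults = load_settings({})
configured = load_settings(
    {
        "SENTRY_DSN": "https://key@sentry.io/project",
        "ENVIRONMENT": "production",
        "ENABLE_METRICS": "true",
        "FLOWER_URL": "http://flower:5555",
    }
)
```

Passing the mapping as a parameter keeps the function easy to test without modifying the real process environment.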
Kubernetes Integration
Deployment Configuration
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: app
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10
Prometheus ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
spec:
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
Best Practices
Do
- Register health checks for critical dependencies (database, cache, external APIs)
- Use appropriate metric types (counter for counts, histogram for latency)
- Include meaningful labels but avoid high cardinality
- Set up alerts based on health status changes
Don't
- Create health checks that are slow or have side effects
- Add high-cardinality labels to metrics (e.g., user IDs)
- Ignore Sentry errors in production
- Skip readiness probes in Kubernetes deployments
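The advice on label cardinality can be made concrete with a small example. If raw request paths are used as the `endpoint` label, every product or user ID creates a new time series. One common mitigation is to collapse variable path segments before labelling. `normalize_endpoint` below is a hypothetical helper shown as a sketch, not something the framework is known to provide.

```python
import re

# Matches a canonical 8-4-4-4-12 hexadecimal UUID.
_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def normalize_endpoint(path: str) -> str:
    """Replace numeric and UUID path segments with fixed placeholders."""
    parts = []
    for segment in path.split("/"):
        if segment.isdigit():
            parts.append("{id}")
        elif _UUID.match(segment):
            parts.append("{uuid}")
        else:
            parts.append(segment)
    return "/".join(parts)


by_id = normalize_endpoint("/api/products/42")
by_uuid = normalize_endpoint(
    "/api/orders/123e4567-e89b-12d3-a456-426614174000/items"
)
unchanged = normalize_endpoint("/api/products")
```

When the web framework exposes the matched route template, using that template as the label is usually more reliable than pattern matching on the raw path.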
Related Documentation
- Module System - Module health check integration
- Background Tasks - Celery monitoring with Flower
- Deployment - Production deployment with monitoring