orion/docs/architecture/observability.md
Samir Boulahtit 7dbdbd4c7e docs: add observability, creating modules guide, and unified migration plan
- Add observability framework documentation (health checks, metrics, Sentry)
- Add developer guide for creating modules
- Add comprehensive module migration plan with Celery task integration
- Update architecture overview with module system and observability sections
- Update module-system.md with links to new docs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-27 22:41:19 +01:00

# Observability Framework
The Wizamart platform includes a comprehensive observability framework for monitoring application health, collecting metrics, and tracking errors. This is part of the Framework Layer - infrastructure that modules depend on.
## Overview
```
┌─────────────────────────────────────────────────────────────────────────┐
│                         OBSERVABILITY FRAMEWORK                         │
│                        app/core/observability.py                        │
│                                                                         │
│   ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐   │
│   │  Health Checks  │     │   Prometheus    │     │     Sentry      │   │
│   │    Registry     │     │     Metrics     │     │   Integration   │   │
│   └────────┬────────┘     └────────┬────────┘     └────────┬────────┘   │
│            │                       │                       │            │
│            ▼                       ▼                       ▼            │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                          API Endpoints                          │   │
│   │        /health │ /health/live │ /health/ready │ /metrics        │   │
│   └─────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────┘
                     ┌───────────────────────────────┐
                     │        External Tools         │
                     │ Flower │ Grafana │ Prometheus │
                     └───────────────────────────────┘
```
## Health Checks
### Health Check Registry
Components register health checks that are aggregated into a single endpoint.
```python
from app.core.observability import (
    health_registry,
    HealthCheckResult,
    HealthStatus,
)

# Register using decorator
@health_registry.register("database")
def check_database() -> HealthCheckResult:
    try:
        # `db` stands for the application's database session or connection
        db.execute("SELECT 1")
        return HealthCheckResult(
            name="database",
            status=HealthStatus.HEALTHY,
            message="Database connection OK"
        )
    except Exception as e:
        return HealthCheckResult(
            name="database",
            status=HealthStatus.UNHEALTHY,
            message=str(e)
        )

# Or register directly
health_registry.register_check("redis", check_redis_connection)
```
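To illustrate how a registry can support both the decorator and the direct form, here is a minimal sketch. `SketchHealthRegistry` and `run_all` are hypothetical names, results are plain dicts rather than `HealthCheckResult` objects, and treating a raised exception as unhealthy is an assumption; the real `health_registry` may work differently.

```python
import time
from typing import Callable, Dict, List


class SketchHealthRegistry:
    """Hypothetical stand-in for health_registry, for illustration only."""

    def __init__(self) -> None:
        self._checks: Dict[str, Callable[[], dict]] = {}

    def register(self, name: str):
        """Decorator form: @registry.register("database")."""
        def decorator(func: Callable[[], dict]) -> Callable[[], dict]:
            self.register_check(name, func)
            return func
        return decorator

    def register_check(self, name: str, func: Callable[[], dict]) -> None:
        """Direct form: registry.register_check("redis", check_fn)."""
        self._checks[name] = func

    def run_all(self) -> List[dict]:
        """Run every check, timing it and converting exceptions to unhealthy."""
        results = []
        for name, func in self._checks.items():
            started = time.perf_counter()
            try:
                outcome = func()
            except Exception as exc:  # assumption: a crash counts as unhealthy
                outcome = {"status": "unhealthy", "message": str(exc)}
            latency_ms = (time.perf_counter() - started) * 1000
            results.append({"name": name, "latency_ms": latency_ms, **outcome})
        return results


registry = SketchHealthRegistry()


@registry.register("cache")
def check_cache() -> dict:
    return {"status": "healthy", "message": "ok"}


def check_queue() -> dict:
    raise ConnectionError("broker unreachable")


registry.register_check("queue", check_queue)
results = registry.run_all()
```

Timing each check inside the registry keeps `latency_ms` consistent across checks without every check author measuring it by hand.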
### Health Status Levels
| Status | Description | HTTP Response |
|--------|-------------|---------------|
| `HEALTHY` | All systems operational | 200 |
| `DEGRADED` | Partial functionality available | 200 |
| `UNHEALTHY` | Critical failure | 200 (check response body) |
### HealthCheckResult Fields
| Field | Type | Description |
|-------|------|-------------|
| `name` | str | Check identifier |
| `status` | HealthStatus | Health status level |
| `message` | str | Optional status message |
| `latency_ms` | float | Check execution time |
| `details` | dict | Additional diagnostic data |
| `checked_at` | datetime | Timestamp of check |
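The field table maps naturally onto a dataclass. The sketch below is an assumption about the shape, not the actual definition in `app/core/observability.py`: the defaults, the lowercase enum values, and the `to_dict` helper are illustrative.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Dict


class HealthStatus(str, Enum):
    # Assumed values, chosen to match the lowercase strings in the JSON responses
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class HealthCheckResult:
    name: str
    status: HealthStatus
    message: str = ""
    latency_ms: float = 0.0
    details: Dict[str, Any] = field(default_factory=dict)
    checked_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def to_dict(self) -> Dict[str, Any]:
        """Hypothetical helper: serialize to the per-check JSON shape."""
        return {
            "name": self.name,
            "status": self.status.value,
            "message": self.message,
            "latency_ms": self.latency_ms,
            "details": self.details,
        }


result = HealthCheckResult(
    name="database",
    status=HealthStatus.HEALTHY,
    message="Connection OK",
    latency_ms=2.5,
)
```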
## API Endpoints
### GET /health
Aggregated health check endpoint. Returns combined status from all registered checks.
**Response:**
```json
{
  "status": "healthy",
  "timestamp": "2026-01-27T10:30:00Z",
  "checks": [
    {
      "name": "database",
      "status": "healthy",
      "message": "Connection OK",
      "latency_ms": 2.5,
      "details": {}
    },
    {
      "name": "module:billing",
      "status": "healthy",
      "message": "",
      "latency_ms": 0.1,
      "details": {}
    }
  ]
}
```
**Overall Status Logic:**
- If any check is `UNHEALTHY` → overall is `UNHEALTHY`
- If any check is `DEGRADED` and none `UNHEALTHY` → overall is `DEGRADED`
- Otherwise → `HEALTHY`
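The three rules above can be sketched as a small function. `overall_status` is a hypothetical name, and statuses are passed as the lowercase strings used in the JSON response rather than as `HealthStatus` members.

```python
from typing import Iterable


def overall_status(statuses: Iterable[str]) -> str:
    """Combine individual check statuses into one overall status."""
    seen = set(statuses)
    if "unhealthy" in seen:
        return "unhealthy"  # any failure dominates
    if "degraded" in seen:
        return "degraded"   # no failure, but at least one partial outage
    return "healthy"        # also the result when no checks are registered


mixed = overall_status(["healthy", "degraded", "healthy"])
failing = overall_status(["degraded", "unhealthy"])
empty = overall_status([])
```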
### GET /health/live
Kubernetes liveness probe. Returns 200 if the application is running.
**Response:**
```json
{"status": "alive"}
```
### GET /health/ready
Kubernetes readiness probe. Returns ready status based on health checks.
**Response:**
```json
{
  "status": "ready",
  "health": "healthy"
}
```
Or if unhealthy:
```json
{
  "status": "not_ready",
  "health": "unhealthy"
}
```
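The mapping from overall health to readiness could look like the sketch below. Two points are assumptions that this document does not confirm: that `degraded` still counts as ready, and that `not_ready` is served with a 503. The status code matters because a Kubernetes `httpGet` probe treats any response from 200 to 399 as success, so a 200 carrying `not_ready` in the body would not take the pod out of rotation. `readiness_response` is a hypothetical name.

```python
from typing import Dict, Tuple


def readiness_response(overall: str) -> Tuple[int, Dict[str, str]]:
    """Return (HTTP status code, JSON body) for the readiness probe."""
    ready = overall != "unhealthy"        # assumption: degraded still serves traffic
    status_code = 200 if ready else 503   # assumption: 503 signals not ready
    body = {"status": "ready" if ready else "not_ready", "health": overall}
    return status_code, body


ok = readiness_response("healthy")
down = readiness_response("unhealthy")
```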
### GET /metrics
Prometheus metrics endpoint. Returns metrics in Prometheus text format.
**Response:**
```
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/products",status="200"} 1234
...
```
### GET /health/tools
Returns URLs to external monitoring tools.
**Response:**
```json
{
  "flower": "http://flower.example.com:5555",
  "grafana": "http://grafana.example.com:3000",
  "prometheus": null
}
```
## Prometheus Metrics
### MetricsRegistry
The metrics registry wraps `prometheus_client` and falls back to no-op metrics when the library isn't installed.
```python
from app.core.observability import metrics_registry

# Counter - tracks cumulative values
request_counter = metrics_registry.counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"]
)
request_counter.labels(method="GET", endpoint="/api/products", status="200").inc()

# Histogram - tracks distributions (latency, sizes)
request_latency = metrics_registry.histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)
request_latency.labels(endpoint="/api/products").observe(0.045)

# Gauge - tracks current values
active_connections = metrics_registry.gauge(
    "active_connections",
    "Number of active connections",
    ["pool"]
)
active_connections.labels(pool="database").set(42)
```
### Enabling Metrics
Metrics are disabled by default. Enable them during initialization:
```python
from app.core.observability import init_observability
init_observability(
    enable_metrics=True,
    # ... other options
)
```
### Dummy Metrics
When `prometheus_client` isn't installed or metrics are disabled, the registry returns dummy metrics that silently ignore all operations. This allows code to use metrics without checking if they're enabled.
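A dummy metric of this kind is an instance of the null-object pattern. The sketch below shows the idea; `DummyMetric` is a hypothetical name and the actual class in the framework may expose a different set of methods.

```python
class DummyMetric:
    """No-op stand-in that accepts the usual metric calls and does nothing."""

    def labels(self, *args, **kwargs) -> "DummyMetric":
        # Return self so chained calls such as .labels(...).inc() keep working
        return self

    def inc(self, amount: float = 1) -> None:
        pass

    def dec(self, amount: float = 1) -> None:
        pass

    def set(self, value: float) -> None:
        pass

    def observe(self, value: float) -> None:
        pass


counter = DummyMetric()
counter.labels(method="GET", endpoint="/api/products", status="200").inc()
counter.labels(endpoint="/api/products").observe(0.045)
```

Because `labels()` returns the same object, call sites look identical whether metrics are enabled or not.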
## Sentry Integration
### Configuration
```python
from app.core.observability import sentry, init_observability

# Initialize via init_observability
init_observability(
    sentry_dsn="https://key@sentry.io/project",
    environment="production",
)

# Or initialize directly
sentry.init(
    dsn="https://key@sentry.io/project",
    environment="production",
    release="1.0.0"
)
```
### Capturing Errors
```python
import logging

from app.core.observability import sentry

logger = logging.getLogger(__name__)

try:
    risky_operation()
except Exception as e:
    event_id = sentry.capture_exception(e)
    logger.error(f"Operation failed, Sentry event: {event_id}")

# Capture messages
sentry.capture_message("User reached rate limit", level="warning")
```
### Without Sentry
If `sentry_sdk` isn't installed or no DSN is provided, all capture calls silently return `None`.
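One way to get this behavior is a thin wrapper that only forwards to `sentry_sdk` once it has been initialized with a DSN. The sketch below is illustrative: `SketchSentry` is a hypothetical class, not the framework's `sentry` object.

```python
from typing import Optional

try:
    import sentry_sdk  # optional dependency
except ImportError:
    sentry_sdk = None


class SketchSentry:
    """Hypothetical wrapper whose capture calls are no-ops until initialized."""

    def __init__(self) -> None:
        self._enabled = False

    def init(self, dsn: Optional[str], environment: str = "development") -> None:
        if sentry_sdk is None or not dsn:
            return  # stay disabled; capture calls will return None
        sentry_sdk.init(dsn=dsn, environment=environment)
        self._enabled = True

    def capture_exception(self, exc: BaseException) -> Optional[str]:
        if not self._enabled:
            return None
        return sentry_sdk.capture_exception(exc)

    def capture_message(self, message: str, level: str = "info") -> Optional[str]:
        if not self._enabled:
            return None
        return sentry_sdk.capture_message(message, level=level)


sketch = SketchSentry()
sketch.init(dsn=None)  # no DSN configured
event_id = sketch.capture_exception(ValueError("boom"))
```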
## Module Health Checks
Modules can provide health checks that are automatically registered.
### Defining Module Health Check
```python
# In module definition
import stripe

from app.modules.base import ModuleDefinition


def check_billing_health() -> dict:
    """Check billing service health."""
    try:
        # Check Stripe connection
        stripe.Account.retrieve()
        return {"status": "healthy", "message": "Stripe connected"}
    except Exception as e:
        return {"status": "unhealthy", "message": str(e)}


billing_module = ModuleDefinition(
    code="billing",
    name="Billing",
    health_check=check_billing_health,
    # ...
)
```
### Registering Module Health Checks
```python
from app.core.observability import register_module_health_checks
# Call after modules are loaded (e.g., in app lifespan)
register_module_health_checks()
```
This registers health checks as `module:{code}` (e.g., `module:billing`).
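Module health checks return plain dicts, so registration has to adapt them to the registry's naming and result shape. The sketch below shows one way that adaptation could work. `wrap_module_check` is a hypothetical helper, and treating a missing `status` key or a raised exception as unhealthy is an assumption, not documented behavior.

```python
from typing import Callable, Dict


def wrap_module_check(code: str, check: Callable[[], dict]) -> Callable[[], Dict[str, str]]:
    """Wrap a module's dict-returning check under the name module:{code}."""
    name = f"module:{code}"

    def wrapped() -> Dict[str, str]:
        try:
            raw = check()
        except Exception as exc:  # assumption: a crashing check is unhealthy
            raw = {"status": "unhealthy", "message": str(exc)}
        return {
            "name": name,
            # assumption: a result without a status is treated as unhealthy
            "status": raw.get("status", "unhealthy"),
            "message": raw.get("message", ""),
        }

    return wrapped


billing_check = wrap_module_check(
    "billing", lambda: {"status": "healthy", "message": "Stripe connected"}
)
billing_result = billing_check()
```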
## External Tools
### Flower (Celery Monitoring)
Configure the Flower URL to include it in `/health/tools`:
```python
init_observability(
    flower_url="http://flower:5555",
)
```
### Grafana
Configure the Grafana URL:
```python
init_observability(
    grafana_url="http://grafana:3000",
)
```
## Initialization
### Application Lifespan
```python
# main.py
from contextlib import asynccontextmanager

from fastapi import FastAPI

from app.core.observability import (
    init_observability,
    shutdown_observability,
    health_router,
    register_module_health_checks,
)

# `settings` is the application's configuration object; import it from
# wherever the project defines it.


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup
    init_observability(
        enable_metrics=True,
        sentry_dsn=settings.SENTRY_DSN,
        environment=settings.ENVIRONMENT,
        flower_url=settings.FLOWER_URL,
        grafana_url=settings.GRAFANA_URL,
    )
    register_module_health_checks()
    yield
    # Shutdown
    shutdown_observability()


app = FastAPI(lifespan=lifespan)
app.include_router(health_router)
```
### Environment Variables
| Variable | Description | Default |
|----------|-------------|---------|
| `SENTRY_DSN` | Sentry DSN for error tracking | None (disabled) |
| `ENVIRONMENT` | Environment name | "development" |
| `ENABLE_METRICS` | Enable Prometheus metrics | False |
| `FLOWER_URL` | Flower dashboard URL | None |
| `GRAFANA_URL` | Grafana dashboard URL | None |
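As a sketch of how these variables could be turned into `init_observability` keyword arguments, the helper below reads them from a mapping such as `os.environ`. `observability_kwargs` is a hypothetical function, and the set of strings accepted as true for `ENABLE_METRICS` is an assumption; the project's settings class may parse them differently.

```python
import os
from typing import Any, Dict, Mapping

_TRUTHY = {"1", "true", "yes", "on"}  # assumption: accepted spellings of "enabled"


def observability_kwargs(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Build keyword arguments for init_observability from environment variables."""
    return {
        "enable_metrics": env.get("ENABLE_METRICS", "false").strip().lower() in _TRUTHY,
        "sentry_dsn": env.get("SENTRY_DSN") or None,   # empty string means disabled
        "environment": env.get("ENVIRONMENT", "development"),
        "flower_url": env.get("FLOWER_URL") or None,
        "grafana_url": env.get("GRAFANA_URL") or None,
    }


defaults = observability_kwargs({})
production = observability_kwargs(
    {"ENABLE_METRICS": "true", "ENVIRONMENT": "production", "FLOWER_URL": "http://flower:5555"}
)
```

Passing the mapping explicitly keeps the helper easy to test without touching the real process environment.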
## Kubernetes Integration
### Deployment Configuration
```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: app
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10
```
### Prometheus ServiceMonitor
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
spec:
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
```
## Best Practices
### Do
- Register health checks for critical dependencies (database, cache, external APIs)
- Use appropriate metric types (counter for counts, histogram for latency)
- Include meaningful labels but avoid high cardinality
- Set up alerts based on health status changes
### Don't
- Create health checks that are slow or have side effects
- Add high-cardinality labels to metrics (e.g., user IDs)
- Ignore Sentry errors in production
- Skip readiness probes in Kubernetes deployments
## Related Documentation
- [Module System](module-system.md) - Module health check integration
- [Background Tasks](background-tasks.md) - Celery monitoring with Flower
- [Deployment](../deployment/index.md) - Production deployment with monitoring