Files
orion/docs/architecture/observability.md
Samir Boulahtit e9253fbd84 refactor: rename Wizamart to Orion across entire codebase
Replace all ~1,086 occurrences of Wizamart/wizamart/WIZAMART/WizaMart
with Orion/orion/ORION across 184 files. This includes database
identifiers, email addresses, domain references, R2 bucket names,
DNS prefixes, encryption salt, Celery app name, config defaults,
Docker configs, CI configs, documentation, seed data, and templates.

Renames homepage-wizamart.html template to homepage-orion.html.
Fixes duplicate file_pattern key in api.yaml architecture rule.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 16:46:56 +01:00

430 lines
12 KiB
Markdown

# Observability Framework
The Orion platform includes a comprehensive observability framework for monitoring application health, collecting metrics, and tracking errors. This is part of the Framework Layer - infrastructure that modules depend on.
## Overview
```
┌─────────────────────────────────────────────────────────────────────────┐
│ OBSERVABILITY FRAMEWORK │
│ app/core/observability.py │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Health Checks │ │ Prometheus │ │ Sentry │ │
│ │ Registry │ │ Metrics │ │ Integration │ │
│ └────────┬────────┘ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ API Endpoints │ │
│ │ /health │ /health/live │ /health/ready │ /metrics │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
┌───────────────────────────────┐
│ External Tools │
│ Flower │ Grafana │ Prometheus│
└───────────────────────────────┘
```
## Health Checks
### Health Check Registry
Components register health checks that are aggregated into a single endpoint.
```python
from app.core.observability import (
health_registry,
HealthCheckResult,
HealthStatus,
)
# Register using decorator
@health_registry.register("database")
def check_database() -> HealthCheckResult:
try:
db.execute("SELECT 1")
return HealthCheckResult(
name="database",
status=HealthStatus.HEALTHY,
message="Database connection OK"
)
except Exception as e:
return HealthCheckResult(
name="database",
status=HealthStatus.UNHEALTHY,
message=str(e)
)
# Or register directly
health_registry.register_check("redis", check_redis_connection)
```
### Health Status Levels
| Status | Description | HTTP Response |
|--------|-------------|---------------|
| `HEALTHY` | All systems operational | 200 |
| `DEGRADED` | Partial functionality available | 200 |
| `UNHEALTHY` | Critical failure | 200 (check response body) |
### HealthCheckResult Fields
| Field | Type | Description |
|-------|------|-------------|
| `name` | str | Check identifier |
| `status` | HealthStatus | Health status level |
| `message` | str | Optional status message |
| `latency_ms` | float | Check execution time |
| `details` | dict | Additional diagnostic data |
| `checked_at` | datetime | Timestamp of check |
## API Endpoints
### GET /health
Aggregated health check endpoint. Returns combined status from all registered checks.
**Response:**
```json
{
"status": "healthy",
"timestamp": "2026-01-27T10:30:00Z",
"checks": [
{
"name": "database",
"status": "healthy",
"message": "Connection OK",
"latency_ms": 2.5,
"details": {}
},
{
"name": "module:billing",
"status": "healthy",
"message": "",
"latency_ms": 0.1,
"details": {}
}
]
}
```
**Overall Status Logic:**
- If any check is `UNHEALTHY` → overall is `UNHEALTHY`
- If any check is `DEGRADED` and none `UNHEALTHY` → overall is `DEGRADED`
- Otherwise → `HEALTHY`
### GET /health/live
Kubernetes liveness probe. Returns 200 if application is running.
**Response:**
```json
{"status": "alive"}
```
### GET /health/ready
Kubernetes readiness probe. Returns ready status based on health checks.
**Response:**
```json
{
"status": "ready",
"health": "healthy"
}
```
Or if unhealthy:
```json
{
"status": "not_ready",
"health": "unhealthy"
}
```
### GET /metrics
Prometheus metrics endpoint. Returns metrics in Prometheus text format.
**Response:**
```
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/products",status="200"} 1234
...
```
### GET /health/tools
Returns URLs to external monitoring tools.
**Response:**
```json
{
"flower": "http://flower.example.com:5555",
"grafana": "http://grafana.example.com:3000",
"prometheus": null
}
```
## Prometheus Metrics
### MetricsRegistry
The metrics registry provides a wrapper around `prometheus_client` with fallback when the library isn't installed.
```python
from app.core.observability import metrics_registry
# Counter - tracks cumulative values
request_counter = metrics_registry.counter(
"http_requests_total",
"Total HTTP requests",
["method", "endpoint", "status"]
)
request_counter.labels(method="GET", endpoint="/api/products", status="200").inc()
# Histogram - tracks distributions (latency, sizes)
request_latency = metrics_registry.histogram(
"http_request_duration_seconds",
"HTTP request latency",
["endpoint"],
buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)
request_latency.labels(endpoint="/api/products").observe(0.045)
# Gauge - tracks current values
active_connections = metrics_registry.gauge(
"active_connections",
"Number of active connections",
["pool"]
)
active_connections.labels(pool="database").set(42)
```
### Enabling Metrics
Metrics are disabled by default. Enable during initialization:
```python
from app.core.observability import init_observability
init_observability(
enable_metrics=True,
# ... other options
)
```
### Dummy Metrics
When `prometheus_client` isn't installed or metrics are disabled, the registry returns dummy metrics that silently ignore all operations. This allows code to use metrics without checking if they're enabled.
## Sentry Integration
### Configuration
```python
from app.core.observability import sentry, init_observability
# Initialize via init_observability
init_observability(
sentry_dsn="https://key@sentry.io/project",
environment="production",
)
# Or initialize directly
sentry.init(
dsn="https://key@sentry.io/project",
environment="production",
release="1.0.0"
)
```
### Capturing Errors
```python
from app.core.observability import sentry
try:
risky_operation()
except Exception as e:
event_id = sentry.capture_exception(e)
logger.error(f"Operation failed, Sentry event: {event_id}")
# Capture messages
sentry.capture_message("User reached rate limit", level="warning")
```
### Without Sentry
If `sentry_sdk` isn't installed or DSN isn't provided, all capture calls silently return `None`.
## Module Health Checks
Modules can provide health checks that are automatically registered.
### Defining Module Health Check
```python
# In module definition
from app.modules.base import ModuleDefinition
def check_billing_health() -> dict:
"""Check billing service health."""
try:
# Check Stripe connection
stripe.Account.retrieve()
return {"status": "healthy", "message": "Stripe connected"}
except Exception as e:
return {"status": "unhealthy", "message": str(e)}
billing_module = ModuleDefinition(
code="billing",
name="Billing",
health_check=check_billing_health,
# ...
)
```
### Registering Module Health Checks
```python
from app.core.observability import register_module_health_checks
# Call after modules are loaded (e.g., in app lifespan)
register_module_health_checks()
```
This registers health checks as `module:{code}` (e.g., `module:billing`).
## External Tools
### Flower (Celery Monitoring)
Configure Flower URL to include in `/health/tools`:
```python
init_observability(
flower_url="http://flower:5555",
)
```
### Grafana
Configure Grafana URL:
```python
init_observability(
grafana_url="http://grafana:3000",
)
```
## Initialization
### Application Lifespan
```python
# main.py
from contextlib import asynccontextmanager
from fastapi import FastAPI
from app.core.observability import (
init_observability,
shutdown_observability,
health_router,
register_module_health_checks,
)
@asynccontextmanager
async def lifespan(app: FastAPI):
# Startup
init_observability(
enable_metrics=True,
sentry_dsn=settings.SENTRY_DSN,
environment=settings.ENVIRONMENT,
flower_url=settings.FLOWER_URL,
grafana_url=settings.GRAFANA_URL,
)
register_module_health_checks()
yield
# Shutdown
shutdown_observability()
app = FastAPI(lifespan=lifespan)
app.include_router(health_router)
```
### Environment Variables
| Variable | Description | Default |
|----------|-------------|---------|
| `SENTRY_DSN` | Sentry DSN for error tracking | None (disabled) |
| `ENVIRONMENT` | Environment name | "development" |
| `ENABLE_METRICS` | Enable Prometheus metrics | False |
| `FLOWER_URL` | Flower dashboard URL | None |
| `GRAFANA_URL` | Grafana dashboard URL | None |
## Kubernetes Integration
### Deployment Configuration
```yaml
apiVersion: apps/v1
kind: Deployment
spec:
template:
spec:
containers:
- name: app
livenessProbe:
httpGet:
path: /health/live
port: 8000
initialDelaySeconds: 10
periodSeconds: 30
readinessProbe:
httpGet:
path: /health/ready
port: 8000
initialDelaySeconds: 5
periodSeconds: 10
```
### Prometheus ServiceMonitor
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
spec:
endpoints:
- port: http
path: /metrics
interval: 15s
```
## Best Practices
### Do
- Register health checks for critical dependencies (database, cache, external APIs)
- Use appropriate metric types (counter for counts, histogram for latency)
- Include meaningful labels but avoid high cardinality
- Set up alerts based on health status changes
### Don't
- Create health checks that are slow or have side effects
- Add high-cardinality labels to metrics (e.g., user IDs)
- Ignore Sentry errors in production
- Skip readiness probes in Kubernetes deployments
## Related Documentation
- [Module System](module-system.md) - Module health check integration
- [Background Tasks](background-tasks.md) - Celery monitoring with Flower
- [Deployment](../deployment/index.md) - Production deployment with monitoring