# Observability Framework

The Wizamart platform includes a comprehensive observability framework for monitoring application health, collecting metrics, and tracking errors. This is part of the Framework Layer - infrastructure that modules depend on.

## Overview

```
┌─────────────────────────────────────────────────────────────────────────┐
│                         OBSERVABILITY FRAMEWORK                         │
│                        app/core/observability.py                        │
│                                                                         │
│   ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐   │
│   │  Health Checks  │     │   Prometheus    │     │     Sentry      │   │
│   │    Registry     │     │     Metrics     │     │   Integration   │   │
│   └────────┬────────┘     └────────┬────────┘     └────────┬────────┘   │
│            │                       │                       │            │
│            ▼                       ▼                       ▼            │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                          API Endpoints                          │   │
│   │     /health  │  /health/live  │  /health/ready  │  /metrics     │   │
│   └─────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
                     ┌───────────────────────────────┐
                     │        External Tools         │
                     │ Flower │ Grafana │ Prometheus │
                     └───────────────────────────────┘
```

## Health Checks

### Health Check Registry

Components register health checks, which are aggregated into a single endpoint.

```python
from app.core.observability import (
    health_registry,
    HealthCheckResult,
    HealthStatus,
)

# Register using decorator
# (`db` stands for a database session or connection defined elsewhere)
@health_registry.register("database")
def check_database() -> HealthCheckResult:
    try:
        db.execute("SELECT 1")
        return HealthCheckResult(
            name="database",
            status=HealthStatus.HEALTHY,
            message="Database connection OK"
        )
    except Exception as e:
        return HealthCheckResult(
            name="database",
            status=HealthStatus.UNHEALTHY,
            message=str(e)
        )

# Or register directly
# (`check_redis_connection` stands for a check function defined elsewhere)
health_registry.register_check("redis", check_redis_connection)
```

### Health Status Levels

| Status | Description | HTTP Response |
|--------|-------------|---------------|
| `HEALTHY` | All systems operational | 200 |
| `DEGRADED` | Partial functionality available | 200 |
| `UNHEALTHY` | Critical failure | 200 (check response body) |

### HealthCheckResult Fields

| Field | Type | Description |
|-------|------|-------------|
| `name` | str | Check identifier |
| `status` | HealthStatus | Health status level |
| `message` | str | Optional status message |
| `latency_ms` | float | Check execution time |
| `details` | dict | Additional diagnostic data |
| `checked_at` | datetime | Timestamp of check |

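As a rough illustration of how these fields fit together, the sketch below models the result as a dataclass and fills in `latency_ms` by timing the check. It is a standalone approximation, not the code in `app/core/observability.py`: the field defaults, the lowercase enum values (inferred from the JSON responses in this document), and the `run_check` helper are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from time import perf_counter
from typing import Any, Callable


class HealthStatus(str, Enum):
    # Lowercase values are assumed from the JSON examples in this document.
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class HealthCheckResult:
    name: str
    status: HealthStatus
    message: str = ""
    latency_ms: float = 0.0
    details: dict[str, Any] = field(default_factory=dict)
    checked_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def run_check(name: str, check: Callable[[], HealthCheckResult]) -> HealthCheckResult:
    """Hypothetical helper: run a check, record its latency, and trap exceptions."""
    start = perf_counter()
    try:
        result = check()
    except Exception as exc:  # a crashing check is reported as unhealthy
        result = HealthCheckResult(name=name, status=HealthStatus.UNHEALTHY, message=str(exc))
    result.latency_ms = (perf_counter() - start) * 1000
    return result


def failing_check() -> HealthCheckResult:
    raise ConnectionError("connection refused")


result = run_check("database", failing_check)
print(result.status.value, result.message)
```

Trapping exceptions in one place means an individual check that forgets its own `try`/`except` still produces a well-formed result instead of breaking the whole endpoint.
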
## API Endpoints

### GET /health

Aggregated health check endpoint. Returns combined status from all registered checks.

**Response:**
```json
{
  "status": "healthy",
  "timestamp": "2026-01-27T10:30:00Z",
  "checks": [
    {
      "name": "database",
      "status": "healthy",
      "message": "Connection OK",
      "latency_ms": 2.5,
      "details": {}
    },
    {
      "name": "module:billing",
      "status": "healthy",
      "message": "",
      "latency_ms": 0.1,
      "details": {}
    }
  ]
}
```

**Overall Status Logic:**
- If any check is `UNHEALTHY` → overall is `UNHEALTHY`
- If any check is `DEGRADED` and none `UNHEALTHY` → overall is `DEGRADED`
- Otherwise → `HEALTHY`

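The three rules above can be sketched as a small function. This is an illustrative, standalone version rather than the framework's own implementation; the `HealthStatus` enum is redefined locally with assumed lowercase values, and `aggregate_status` is a hypothetical name.

```python
from enum import Enum
from typing import Iterable


class HealthStatus(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


def aggregate_status(statuses: Iterable[HealthStatus]) -> HealthStatus:
    """Combine individual check statuses using the three rules above."""
    seen = set(statuses)
    if HealthStatus.UNHEALTHY in seen:
        return HealthStatus.UNHEALTHY
    if HealthStatus.DEGRADED in seen:
        return HealthStatus.DEGRADED
    # "Otherwise" branch; under this sketch it also covers an empty registry.
    return HealthStatus.HEALTHY


print(aggregate_status([HealthStatus.HEALTHY, HealthStatus.DEGRADED]).value)
```

Treating an empty set of checks as healthy is a choice made for this sketch; the source does not say how the framework handles that case.
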
### GET /health/live

Kubernetes liveness probe. Returns 200 if the application is running.

**Response:**
```json
{"status": "alive"}
```

### GET /health/ready

Kubernetes readiness probe. Returns the ready status based on the registered health checks.

**Response:**
```json
{
  "status": "ready",
  "health": "healthy"
}
```

Or if unhealthy:
```json
{
  "status": "not_ready",
  "health": "unhealthy"
}
```

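The two readiness responses suggest a simple mapping from overall health to readiness. The sketch below shows one way that mapping could look; `readiness_body` is a hypothetical helper, and how the real endpoint treats a `degraded` state is not stated in this document, so that branch is an assumption.

```python
def readiness_body(overall_health: str) -> dict[str, str]:
    """Build a readiness payload shaped like the responses above.

    Assumption: only "unhealthy" makes the app not ready; "degraded" still
    counts as ready. The source does not say how degraded is handled.
    """
    ready = overall_health != "unhealthy"
    return {"status": "ready" if ready else "not_ready", "health": overall_health}


print(readiness_body("healthy"))
print(readiness_body("unhealthy"))
```
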
### GET /metrics

Prometheus metrics endpoint. Returns metrics in Prometheus text format.

**Response:**
```
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/products",status="200"} 1234
...
```

### GET /health/tools

Returns URLs to external monitoring tools.

**Response:**
```json
{
  "flower": "http://flower.example.com:5555",
  "grafana": "http://grafana.example.com:3000",
  "prometheus": null
}
```

## Prometheus Metrics

### MetricsRegistry

The metrics registry wraps `prometheus_client` and falls back to no-op metrics when the library isn't installed.

```python
from app.core.observability import metrics_registry

# Counter - tracks cumulative values
request_counter = metrics_registry.counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"]
)
request_counter.labels(method="GET", endpoint="/api/products", status="200").inc()

# Histogram - tracks distributions (latency, sizes)
request_latency = metrics_registry.histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)
request_latency.labels(endpoint="/api/products").observe(0.045)

# Gauge - tracks current values
active_connections = metrics_registry.gauge(
    "active_connections",
    "Number of active connections",
    ["pool"]
)
active_connections.labels(pool="database").set(42)
```

### Enabling Metrics

Metrics are disabled by default. Enable them during initialization:

```python
from app.core.observability import init_observability

init_observability(
    enable_metrics=True,
    # ... other options
)
```

### Dummy Metrics

When `prometheus_client` isn't installed or metrics are disabled, the registry returns dummy metrics that silently ignore all operations. This allows code to use metrics without checking whether they're enabled.

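To make the fallback concrete, here is a minimal sketch of the no-op pattern. It is not the registry's actual code: `DummyMetric` and `make_counter` are hypothetical names, and only `prometheus_client.Counter` is a real library call.

```python
class DummyMetric:
    """No-op stand-in that accepts the metric calls used in this document."""

    def labels(self, *args, **kwargs) -> "DummyMetric":
        return self  # allows chained calls such as .labels(...).inc()

    def inc(self, amount: float = 1) -> None:
        pass

    def dec(self, amount: float = 1) -> None:
        pass

    def set(self, value: float) -> None:
        pass

    def observe(self, value: float) -> None:
        pass


try:
    from prometheus_client import Counter
    PROMETHEUS_AVAILABLE = True
except ImportError:  # the library is optional
    PROMETHEUS_AVAILABLE = False


def make_counter(name: str, description: str, labels: list[str], enabled: bool = False):
    """Hypothetical factory: a real counter when possible, a dummy otherwise."""
    if enabled and PROMETHEUS_AVAILABLE:
        return Counter(name, description, labels)
    return DummyMetric()


requests_total = make_counter("http_requests_total", "Total HTTP requests", ["method"])
requests_total.labels(method="GET").inc()  # safe even though metrics are disabled
```

Because `labels()` returns the same dummy object, call sites look identical whether or not metrics are active, which is the property the paragraph above describes.
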
## Sentry Integration

### Configuration

```python
from app.core.observability import sentry, init_observability

# Initialize via init_observability
init_observability(
    sentry_dsn="https://key@sentry.io/project",
    environment="production",
)

# Or initialize directly
sentry.init(
    dsn="https://key@sentry.io/project",
    environment="production",
    release="1.0.0"
)
```

### Capturing Errors

```python
import logging

from app.core.observability import sentry

logger = logging.getLogger(__name__)

try:
    risky_operation()  # placeholder for any call that may raise
except Exception as e:
    event_id = sentry.capture_exception(e)
    logger.error(f"Operation failed, Sentry event: {event_id}")

# Capture messages
sentry.capture_message("User reached rate limit", level="warning")
```

### Without Sentry

If `sentry_sdk` isn't installed or no DSN is provided, all capture calls silently return `None`.

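A minimal sketch of how such a guard might work is shown below. `SentryClient` is a hypothetical wrapper written for illustration, not the class exported by `app/core/observability.py`; the `sentry_sdk.init`, `capture_exception`, and `capture_message` calls are the SDK's own functions.

```python
from typing import Optional

try:
    import sentry_sdk
except ImportError:  # sentry_sdk is optional
    sentry_sdk = None


class SentryClient:
    """Hypothetical wrapper whose capture calls are safe without Sentry."""

    def __init__(self) -> None:
        self._enabled = False

    def init(self, dsn: Optional[str] = None, **options) -> bool:
        if sentry_sdk is None or not dsn:
            return False  # stay disabled; capture calls become no-ops
        sentry_sdk.init(dsn=dsn, **options)
        self._enabled = True
        return True

    def capture_exception(self, exc: BaseException) -> Optional[str]:
        if not self._enabled:
            return None
        return sentry_sdk.capture_exception(exc)

    def capture_message(self, message: str, level: str = "info") -> Optional[str]:
        if not self._enabled:
            return None
        return sentry_sdk.capture_message(message, level=level)


sentry = SentryClient()
sentry.init(dsn=None)
print(sentry.capture_exception(ValueError("boom")))  # None, since no DSN was given
```
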
## Module Health Checks

Modules can provide health checks that are automatically registered.

### Defining Module Health Check

```python
# In module definition
import stripe

from app.modules.base import ModuleDefinition

def check_billing_health() -> dict:
    """Check billing service health."""
    try:
        # Check Stripe connection
        stripe.Account.retrieve()
        return {"status": "healthy", "message": "Stripe connected"}
    except Exception as e:
        return {"status": "unhealthy", "message": str(e)}

billing_module = ModuleDefinition(
    code="billing",
    name="Billing",
    health_check=check_billing_health,
    # ...
)
```

### Registering Module Health Checks

```python
from app.core.observability import register_module_health_checks

# Call after modules are loaded (e.g., in app lifespan)
register_module_health_checks()
```

This registers each module's health check as `module:{code}` (e.g., `module:billing`).

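The naming scheme can be illustrated with a small standalone sketch. The plain `registry` dict and the `register_module_checks` function are hypothetical stand-ins for the real registry and for `register_module_health_checks`, whose internals are not shown in this document.

```python
from typing import Callable

# Hypothetical stand-in for the real registry: maps check name to a callable.
registry: dict[str, Callable[[], dict]] = {}


def register_module_checks(module_checks: dict[str, Callable[[], dict]]) -> None:
    """Register each module's health_check under the name module:{code}."""
    for code, check in module_checks.items():
        # Default arguments bind the current loop values to this wrapper.
        def wrapped(check: Callable[[], dict] = check, code: str = code) -> dict:
            try:
                result = check()
            except Exception as exc:  # a crashing check should not break /health
                result = {"status": "unhealthy", "message": str(exc)}
            return {"name": f"module:{code}", **result}

        registry[f"module:{code}"] = wrapped


def check_billing_health() -> dict:
    return {"status": "healthy", "message": "Stripe connected"}


register_module_checks({"billing": check_billing_health})
print(registry["module:billing"]())
```
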
## External Tools

### Flower (Celery Monitoring)

Set the Flower URL so that it is included in `/health/tools`:

```python
init_observability(
    flower_url="http://flower:5555",
)
```

### Grafana

Set the Grafana URL in the same way:

```python
init_observability(
    grafana_url="http://grafana:3000",
)
```

## Initialization

### Application Lifespan

```python
# main.py
from contextlib import asynccontextmanager
from fastapi import FastAPI
from app.core.observability import (
    init_observability,
    shutdown_observability,
    health_router,
    register_module_health_checks,
)

# `settings` is the application's settings object, imported from wherever
# the project defines it.

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup
    init_observability(
        enable_metrics=True,
        sentry_dsn=settings.SENTRY_DSN,
        environment=settings.ENVIRONMENT,
        flower_url=settings.FLOWER_URL,
        grafana_url=settings.GRAFANA_URL,
    )
    register_module_health_checks()

    yield

    # Shutdown
    shutdown_observability()

app = FastAPI(lifespan=lifespan)
app.include_router(health_router)
```

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `SENTRY_DSN` | Sentry DSN for error tracking | None (disabled) |
| `ENVIRONMENT` | Environment name | "development" |
| `ENABLE_METRICS` | Enable Prometheus metrics | False |
| `FLOWER_URL` | Flower dashboard URL | None |
| `GRAFANA_URL` | Grafana dashboard URL | None |

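One way to read these variables with the documented defaults is sketched below, using only the standard library. The `ObservabilitySettings` class, the `load_settings` function, and the set of strings treated as "true" for `ENABLE_METRICS` are assumptions for illustration; the project's real settings object may be built differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ObservabilitySettings:
    sentry_dsn: Optional[str] = None
    environment: str = "development"
    enable_metrics: bool = False
    flower_url: Optional[str] = None
    grafana_url: Optional[str] = None


def load_settings(env: Mapping[str, str] = os.environ) -> ObservabilitySettings:
    """Read the variables from the table, falling back to the documented defaults."""
    # Assumption: these strings count as "true" for ENABLE_METRICS.
    truthy = {"1", "true", "yes", "on"}
    return ObservabilitySettings(
        sentry_dsn=env.get("SENTRY_DSN") or None,
        environment=env.get("ENVIRONMENT", "development"),
        enable_metrics=env.get("ENABLE_METRICS", "").strip().lower() in truthy,
        flower_url=env.get("FLOWER_URL") or None,
        grafana_url=env.get("GRAFANA_URL") or None,
    )


obs_settings = load_settings({"ENABLE_METRICS": "true", "FLOWER_URL": "http://flower:5555"})
print(obs_settings.enable_metrics, obs_settings.environment, obs_settings.sentry_dsn)
```

Passing the environment as a mapping keeps the loader easy to exercise in tests without touching the real process environment.
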
## Kubernetes Integration

### Deployment Configuration

```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: app
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10
```

### Prometheus ServiceMonitor

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
spec:
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
```

## Best Practices

### Do

- Register health checks for critical dependencies (database, cache, external APIs)
- Use appropriate metric types (counter for counts, histogram for latency)
- Include meaningful labels but avoid high cardinality
- Set up alerts based on health status changes

### Don't

- Create health checks that are slow or have side effects
- Add high-cardinality labels to metrics (e.g., user IDs)
- Ignore Sentry errors in production
- Skip readiness probes in Kubernetes deployments

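The advice on label cardinality can be made concrete with a small helper that collapses ID-like path segments before a path is used as an `endpoint` label. This is an illustrative sketch: `endpoint_label` is a hypothetical function, and the assumption that IDs are plain integers or UUIDs may not match every route in the application.

```python
import re

# Assumed ID formats: plain integers and UUIDs. Extend as needed.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def endpoint_label(path: str) -> str:
    """Replace ID-like path segments with a placeholder to bound label cardinality."""
    segments = path.split("?", 1)[0].split("/")  # drop any query string first
    return "/".join("{id}" if _ID_SEGMENT.match(s) else s for s in segments)


print(endpoint_label("/api/products/12345"))
print(endpoint_label("/api/products"))
```

With this normalization, every product detail request is counted under the single label value `/api/products/{id}` rather than creating one time series per product.
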
## Related Documentation

- [Module System](module-system.md) - Module health check integration
- [Background Tasks](background-tasks.md) - Celery monitoring with Flower
- [Deployment](../deployment/index.md) - Production deployment with monitoring