# Observability Framework

The Orion platform includes a comprehensive observability framework for monitoring application health, collecting metrics, and tracking errors. This is part of the Framework Layer - infrastructure that modules depend on.

## Overview

```
┌─────────────────────────────────────────────────────────────────────────┐
│                         OBSERVABILITY FRAMEWORK                         │
│                        app/core/observability.py                        │
│                                                                         │
│    ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐    │
│    │  Health Checks  │    │   Prometheus    │    │     Sentry      │    │
│    │    Registry     │    │     Metrics     │    │   Integration   │    │
│    └────────┬────────┘    └────────┬────────┘    └────────┬────────┘    │
│             │                      │                      │             │
│             ▼                      ▼                      ▼             │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                          API Endpoints                          │   │
│   │  /health   │   /health/live   │   /health/ready   │   /metrics  │   │
│   └─────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
                     ┌───────────────────────────────┐
                     │        External Tools         │
                     │ Flower │ Grafana │ Prometheus │
                     └───────────────────────────────┘
```

## Health Checks

### Health Check Registry

Components register health checks that are aggregated into a single endpoint.

```python
from app.core.observability import (
    health_registry,
    HealthCheckResult,
    HealthStatus,
)

# Register using decorator
# (`db` is assumed to be an existing database session or connection)
@health_registry.register("database")
def check_database() -> HealthCheckResult:
    try:
        db.execute("SELECT 1")
        return HealthCheckResult(
            name="database",
            status=HealthStatus.HEALTHY,
            message="Database connection OK"
        )
    except Exception as e:
        return HealthCheckResult(
            name="database",
            status=HealthStatus.UNHEALTHY,
            message=str(e)
        )

# Or register directly
# (`check_redis_connection` is assumed to be a check function defined elsewhere)
health_registry.register_check("redis", check_redis_connection)
```

### Health Status Levels

| Status | Description | HTTP Response |
|--------|-------------|---------------|
| `HEALTHY` | All systems operational | 200 |
| `DEGRADED` | Partial functionality available | 200 |
| `UNHEALTHY` | Critical failure | 200 (check response body) |

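Because every status level is served with HTTP 200, a monitor cannot rely on the status code alone and has to inspect the response body. The following is a minimal sketch of such a check, assuming the payload shape documented for `GET /health`; the helper name `failing_checks` and the sample payload are illustrative and not part of the framework.

```python
import json


def failing_checks(body: str) -> list[str]:
    """Return the names of all checks whose status is not 'healthy'."""
    payload = json.loads(body)
    return [
        check["name"]
        for check in payload.get("checks", [])
        if check.get("status") != "healthy"
    ]


# Hypothetical response body, shaped like the documented /health payload.
sample = json.dumps({
    "status": "unhealthy",
    "checks": [
        {"name": "database", "status": "unhealthy", "message": "timeout"},
        {"name": "module:billing", "status": "healthy", "message": ""},
    ],
})

print(failing_checks(sample))
```

An alerting script could page whenever this list is non-empty instead of waiting for a non-200 response.
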
### HealthCheckResult Fields

| Field | Type | Description |
|-------|------|-------------|
| `name` | str | Check identifier |
| `status` | HealthStatus | Health status level |
| `message` | str | Optional status message |
| `latency_ms` | float | Check execution time |
| `details` | dict | Additional diagnostic data |
| `checked_at` | datetime | Timestamp of check |

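To make the field list concrete, here is a rough sketch of how a result with these fields could be produced, including how `latency_ms` might be measured. The `Status` and `Result` classes are hypothetical stand-ins that only mirror the table above; the real `HealthStatus` and `HealthCheckResult` in `app/core/observability.py` may differ in defaults and behavior.

```python
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Callable


class Status(str, Enum):  # hypothetical stand-in for HealthStatus
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class Result:  # hypothetical stand-in for HealthCheckResult
    name: str
    status: Status
    message: str = ""
    latency_ms: float = 0.0
    details: dict = field(default_factory=dict)
    checked_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def timed(name: str, probe: Callable[[], None]) -> Result:
    """Run a probe, map exceptions to UNHEALTHY, and record elapsed time."""
    start = time.perf_counter()
    try:
        probe()
        status, message = Status.HEALTHY, "OK"
    except Exception as exc:
        status, message = Status.UNHEALTHY, str(exc)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return Result(name=name, status=status, message=message, latency_ms=elapsed_ms)


def broken_probe() -> None:
    raise ConnectionError("connection refused")


ok = timed("cache", lambda: None)
bad = timed("database", broken_probe)
print(ok.status.value, bad.status.value, bad.message)
```
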
## API Endpoints

### GET /health

Aggregated health check endpoint. Returns combined status from all registered checks.

**Response:**
```json
{
  "status": "healthy",
  "timestamp": "2026-01-27T10:30:00Z",
  "checks": [
    {
      "name": "database",
      "status": "healthy",
      "message": "Connection OK",
      "latency_ms": 2.5,
      "details": {}
    },
    {
      "name": "module:billing",
      "status": "healthy",
      "message": "",
      "latency_ms": 0.1,
      "details": {}
    }
  ]
}
```

**Overall Status Logic:**
- If any check is `UNHEALTHY` → overall is `UNHEALTHY`
- If any check is `DEGRADED` and none `UNHEALTHY` → overall is `DEGRADED`
- Otherwise → `HEALTHY`

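The three rules above amount to "worst status wins". A minimal sketch of that aggregation follows; the `Status` enum and `overall_status` function are illustrative names, not the framework's own API.

```python
from enum import Enum
from typing import Iterable


class Status(str, Enum):  # hypothetical stand-in for HealthStatus
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


def overall_status(statuses: Iterable[Status]) -> Status:
    """Combine individual check statuses using the documented precedence."""
    seen = set(statuses)
    if Status.UNHEALTHY in seen:
        return Status.UNHEALTHY
    if Status.DEGRADED in seen:
        return Status.DEGRADED
    # "Otherwise" branch: all checks healthy (an empty set also lands here).
    return Status.HEALTHY


print(overall_status([Status.HEALTHY, Status.DEGRADED]).value)
print(overall_status([Status.DEGRADED, Status.UNHEALTHY]).value)
```
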
### GET /health/live

Kubernetes liveness probe. Returns 200 if the application is running.

**Response:**
```json
{"status": "alive"}
```

### GET /health/ready

Kubernetes readiness probe. Returns ready status based on health checks.

**Response:**
```json
{
  "status": "ready",
  "health": "healthy"
}
```

Or if unhealthy:
```json
{
  "status": "not_ready",
  "health": "unhealthy"
}
```

### GET /metrics

Prometheus metrics endpoint. Returns metrics in Prometheus text format.

**Response:**
```
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/products",status="200"} 1234
...
```

### GET /health/tools

Returns URLs to external monitoring tools.

**Response:**
```json
{
  "flower": "http://flower.example.com:5555",
  "grafana": "http://grafana.example.com:3000",
  "prometheus": null
}
```

## Prometheus Metrics

### MetricsRegistry

The metrics registry wraps `prometheus_client` and falls back to no-op metrics when the library isn't installed.

```python
from app.core.observability import metrics_registry

# Counter - tracks cumulative values
request_counter = metrics_registry.counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"]
)
request_counter.labels(method="GET", endpoint="/api/products", status="200").inc()

# Histogram - tracks distributions (latency, sizes)
request_latency = metrics_registry.histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)
request_latency.labels(endpoint="/api/products").observe(0.045)

# Gauge - tracks current values
active_connections = metrics_registry.gauge(
    "active_connections",
    "Number of active connections",
    ["pool"]
)
active_connections.labels(pool="database").set(42)
```

### Enabling Metrics

Metrics are disabled by default. Enable them during initialization:

```python
from app.core.observability import init_observability

init_observability(
    enable_metrics=True,
    # ... other options
)
```

### Dummy Metrics

When `prometheus_client` isn't installed or metrics are disabled, the registry returns dummy metrics that silently ignore all operations. This allows code to use metrics without checking whether they're enabled.

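The dummy-metric behavior is an instance of the null object pattern. The sketch below shows one way such a fallback could look; `NullMetric` and `make_counter` are hypothetical names chosen for illustration, and the framework's actual dummy classes may be structured differently. The only real library call used is `prometheus_client.Counter`.

```python
class NullMetric:
    """No-op metric: accepts counter, gauge, and histogram calls and discards them."""

    def labels(self, *args, **kwargs):
        # Return self so chained calls like .labels(...).inc() keep working.
        return self

    def inc(self, amount=1):
        pass

    def dec(self, amount=1):
        pass

    def set(self, value):
        pass

    def observe(self, value):
        pass


def make_counter(name, description, labelnames, enabled):
    """Return a real Counter when possible, otherwise a NullMetric (illustrative)."""
    if not enabled:
        return NullMetric()
    try:
        from prometheus_client import Counter
    except ImportError:
        return NullMetric()
    return Counter(name, description, labelnames)


# With metrics disabled, instrumented code runs unchanged and records nothing.
demo_counter = make_counter("demo_events_total", "Demo events", ["kind"], enabled=False)
demo_counter.labels(kind="signup").inc()
```

Returning `self` from `labels()` is what lets call sites stay identical whether or not metrics are active.
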
## Sentry Integration

### Configuration

```python
from app.core.observability import sentry, init_observability

# Initialize via init_observability
init_observability(
    sentry_dsn="https://key@sentry.io/project",
    environment="production",
)

# Or initialize directly
sentry.init(
    dsn="https://key@sentry.io/project",
    environment="production",
    release="1.0.0"
)
```

### Capturing Errors

```python
import logging

from app.core.observability import sentry

logger = logging.getLogger(__name__)

# `risky_operation` stands in for any application code that may raise
try:
    risky_operation()
except Exception as e:
    event_id = sentry.capture_exception(e)
    logger.error(f"Operation failed, Sentry event: {event_id}")

# Capture messages
sentry.capture_message("User reached rate limit", level="warning")
```

### Without Sentry

If `sentry_sdk` isn't installed or no DSN is provided, all capture calls silently return `None`.

## Module Health Checks

Modules can provide health checks that are automatically registered.

### Defining Module Health Check

```python
# In module definition
import stripe

from app.modules.base import ModuleDefinition

def check_billing_health() -> dict:
    """Check billing service health."""
    try:
        # Check Stripe connection
        stripe.Account.retrieve()
        return {"status": "healthy", "message": "Stripe connected"}
    except Exception as e:
        return {"status": "unhealthy", "message": str(e)}

billing_module = ModuleDefinition(
    code="billing",
    name="Billing",
    health_check=check_billing_health,
    # ...
)
```

### Registering Module Health Checks

```python
from app.core.observability import register_module_health_checks

# Call after modules are loaded (e.g., in app lifespan)
register_module_health_checks()
```

This registers health checks as `module:{code}` (e.g., `module:billing`).

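As an illustration of the `module:{code}` naming, the sketch below wraps dict-returning module checks under namespaced names. It is not the framework's implementation: `namespaced_checks` is a hypothetical helper, and converting exceptions into an `unhealthy` result is an assumption about how a registry might guard against failing checks.

```python
from typing import Callable


def namespaced_checks(
    module_checks: dict[str, Callable[[], dict]],
) -> dict[str, Callable[[], dict]]:
    """Wrap each module check under the name 'module:{code}'."""
    registry: dict[str, Callable[[], dict]] = {}
    for code, check in module_checks.items():
        name = f"module:{code}"

        # Default arguments bind the current loop values to this closure.
        def guarded(check=check, name=name) -> dict:
            try:
                result = dict(check())
            except Exception as exc:
                # Assumption: a raising check is reported as unhealthy.
                result = {"status": "unhealthy", "message": str(exc)}
            result["name"] = name
            return result

        registry[name] = guarded
    return registry


def billing_check() -> dict:
    return {"status": "healthy", "message": "Stripe connected"}


def inventory_check() -> dict:
    raise RuntimeError("warehouse API down")


checks = namespaced_checks({"billing": billing_check, "inventory": inventory_check})
for check in checks.values():
    print(check())
```
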
## External Tools

### Flower (Celery Monitoring)

Configure the Flower URL to include it in `/health/tools`:

```python
init_observability(
    flower_url="http://flower:5555",
)
```

### Grafana

Configure the Grafana URL:

```python
init_observability(
    grafana_url="http://grafana:3000",
)
```

## Initialization

### Application Lifespan

```python
# main.py
from contextlib import asynccontextmanager
from fastapi import FastAPI
from app.core.observability import (
    init_observability,
    shutdown_observability,
    health_router,
    register_module_health_checks,
)

# `settings` is the application's configuration object (import omitted here)

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup
    init_observability(
        enable_metrics=True,
        sentry_dsn=settings.SENTRY_DSN,
        environment=settings.ENVIRONMENT,
        flower_url=settings.FLOWER_URL,
        grafana_url=settings.GRAFANA_URL,
    )
    register_module_health_checks()

    yield

    # Shutdown
    shutdown_observability()

app = FastAPI(lifespan=lifespan)
app.include_router(health_router)
```

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `SENTRY_DSN` | Sentry DSN for error tracking | None (disabled) |
| `ENVIRONMENT` | Environment name | "development" |
| `ENABLE_METRICS` | Enable Prometheus metrics | False |
| `FLOWER_URL` | Flower dashboard URL | None |
| `GRAFANA_URL` | Grafana dashboard URL | None |

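For reference, here is one way these variables and their defaults could be read from the environment. This is only a sketch: `ObservabilitySettings` and `load_settings` are hypothetical names, and the set of strings treated as "true" for `ENABLE_METRICS` is an assumption, since the table does not specify how the flag is parsed.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ObservabilitySettings:  # hypothetical container for the documented variables
    sentry_dsn: Optional[str]
    environment: str
    enable_metrics: bool
    flower_url: Optional[str]
    grafana_url: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> ObservabilitySettings:
    """Read the documented variables, applying the documented defaults."""
    truthy = {"1", "true", "yes", "on"}  # assumption: accepted spellings of "enabled"
    return ObservabilitySettings(
        sentry_dsn=env.get("SENTRY_DSN") or None,
        environment=env.get("ENVIRONMENT", "development"),
        enable_metrics=env.get("ENABLE_METRICS", "").strip().lower() in truthy,
        flower_url=env.get("FLOWER_URL") or None,
        grafana_url=env.get("GRAFANA_URL") or None,
    )


defaults = load_settings({})
custom = load_settings({"ENABLE_METRICS": "true", "ENVIRONMENT": "production"})
print(defaults)
print(custom)
```

Passing the mapping explicitly keeps the loader easy to test without touching the real process environment.
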
## Kubernetes Integration

### Deployment Configuration

```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: app
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10
```

### Prometheus ServiceMonitor

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
spec:
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
```

## Best Practices

### Do

- Register health checks for critical dependencies (database, cache, external APIs)
- Use appropriate metric types (counter for counts, histogram for latency)
- Include meaningful labels but avoid high cardinality
- Set up alerts based on health status changes

### Don't

- Create health checks that are slow or have side effects
- Add high-cardinality labels to metrics (e.g., user IDs)
- Ignore Sentry errors in production
- Skip readiness probes in Kubernetes deployments

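One way to keep a slow dependency from stalling a health endpoint is to bound each check with a timeout. The sketch below uses the standard library's `concurrent.futures`; `run_with_timeout` is a hypothetical helper and not something the framework is known to provide. Note that the worker thread is not cancelled on timeout, it is only no longer waited on.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable


def run_with_timeout(check: Callable[[], dict], timeout_s: float) -> dict:
    """Run a check in a worker thread and report unhealthy if it takes too long."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(check)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return {"status": "unhealthy", "message": f"check exceeded {timeout_s}s"}
    finally:
        # Do not block on the worker; a stuck check keeps running in the background.
        executor.shutdown(wait=False)


def fast_check() -> dict:
    return {"status": "healthy", "message": "OK"}


def slow_check() -> dict:
    time.sleep(0.3)  # simulates a dependency that responds too slowly
    return {"status": "healthy", "message": "OK"}


fast_result = run_with_timeout(fast_check, timeout_s=1.0)
slow_result = run_with_timeout(slow_check, timeout_s=0.05)
print(fast_result["status"], slow_result["status"])
```
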
## Related Documentation

- [Module System](module-system.md) - Module health check integration
- [Background Tasks](background-tasks.md) - Celery monitoring with Flower
- [Deployment](../deployment/index.md) - Production deployment with monitoring