docs: add observability, creating modules guide, and unified migration plan
- Add observability framework documentation (health checks, metrics, Sentry)
- Add developer guide for creating modules
- Add comprehensive module migration plan with Celery task integration
- Update architecture overview with module system and observability sections
- Update module-system.md with links to new docs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -417,5 +417,7 @@ register_module_health_checks()
## Related Documentation

- [Menu Management](menu-management.md) - Sidebar and menu configuration
- [Creating Modules](../development/creating-modules.md) - Developer guide for building modules
- [Observability](observability.md) - Health checks and module health integration
- [Multi-Tenant System](multi-tenant.md) - Platform isolation
- [Feature Gating](../implementation/feature-gating-system.md) - Tier-based access

429
docs/architecture/observability.md
Normal file
@@ -0,0 +1,429 @@
# Observability Framework

The Wizamart platform includes a comprehensive observability framework for monitoring application health, collecting metrics, and tracking errors. This is part of the Framework Layer - infrastructure that modules depend on.

## Overview

```
┌─────────────────────────────────────────────────────────────────────────┐
│                         OBSERVABILITY FRAMEWORK                         │
│                        app/core/observability.py                        │
│                                                                         │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐          │
│  │  Health Checks  │  │   Prometheus    │  │     Sentry      │          │
│  │    Registry     │  │    Metrics      │  │   Integration   │          │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘          │
│           │                    │                    │                   │
│           ▼                    ▼                    ▼                   │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │                          API Endpoints                          │    │
│  │  /health  │  /health/live  │  /health/ready  │  /metrics        │    │
│  └─────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
                     ┌───────────────────────────────┐
                     │        External Tools         │
                     │  Flower │ Grafana │ Prometheus│
                     └───────────────────────────────┘
```

## Health Checks

### Health Check Registry

Components register health checks that are aggregated into a single endpoint.

```python
from app.core.observability import (
    health_registry,
    HealthCheckResult,
    HealthStatus,
)

# Register using decorator
# (`db` is assumed to be a database session/connection defined elsewhere)
@health_registry.register("database")
def check_database() -> HealthCheckResult:
    try:
        db.execute("SELECT 1")
        return HealthCheckResult(
            name="database",
            status=HealthStatus.HEALTHY,
            message="Database connection OK"
        )
    except Exception as e:
        return HealthCheckResult(
            name="database",
            status=HealthStatus.UNHEALTHY,
            message=str(e)
        )

# Or register directly
# (`check_redis_connection` is assumed to be a check function defined elsewhere)
health_registry.register_check("redis", check_redis_connection)
```

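The internals of the registry are not shown here. As a rough illustration of the pattern, the following self-contained sketch stores checks by name, runs them, and records how long each one took. `SketchHealthRegistry`, `SketchResult`, `Status`, and `run_all` are hypothetical names chosen for this example and are not the API of `app/core/observability.py`; turning an exception into an unhealthy result is also an assumption.

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class Status(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class SketchResult:
    name: str
    status: Status
    message: str = ""
    latency_ms: float = 0.0
    details: dict = field(default_factory=dict)


class SketchHealthRegistry:
    """Hypothetical registry: maps a check name to a zero-argument callable."""

    def __init__(self) -> None:
        self._checks: Dict[str, Callable[[], SketchResult]] = {}

    def register(self, name: str):
        """Decorator form, e.g. @registry.register("database")."""
        def decorator(func: Callable[[], SketchResult]):
            self._checks[name] = func
            return func
        return decorator

    def register_check(self, name: str, func: Callable[[], SketchResult]) -> None:
        """Direct form, e.g. registry.register_check("redis", check_redis)."""
        self._checks[name] = func

    def run_all(self) -> List[SketchResult]:
        """Run every check, timing it; a raised exception counts as unhealthy."""
        results = []
        for name, func in self._checks.items():
            start = time.perf_counter()
            try:
                result = func()
            except Exception as exc:
                result = SketchResult(name=name, status=Status.UNHEALTHY, message=str(exc))
            result.latency_ms = (time.perf_counter() - start) * 1000
            results.append(result)
        return results


registry = SketchHealthRegistry()


@registry.register("always_ok")
def check_always_ok() -> SketchResult:
    return SketchResult(name="always_ok", status=Status.HEALTHY, message="OK")


def check_broken() -> SketchResult:
    raise RuntimeError("connection refused")


registry.register_check("broken", check_broken)
results = registry.run_all()
```

Catching exceptions inside `run_all` keeps one failing dependency from breaking the whole health endpoint, which is the usual reason for routing checks through a registry rather than calling them directly.
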
### Health Status Levels

| Status | Description | HTTP Response |
|--------|-------------|---------------|
| `HEALTHY` | All systems operational | 200 |
| `DEGRADED` | Partial functionality available | 200 |
| `UNHEALTHY` | Critical failure | 200 (check response body) |

### HealthCheckResult Fields

| Field | Type | Description |
|-------|------|-------------|
| `name` | str | Check identifier |
| `status` | HealthStatus | Health status level |
| `message` | str | Optional status message |
| `latency_ms` | float | Check execution time |
| `details` | dict | Additional diagnostic data |
| `checked_at` | datetime | Timestamp of check |

## API Endpoints

### GET /health

Aggregated health check endpoint. Returns combined status from all registered checks.

**Response:**
```json
{
  "status": "healthy",
  "timestamp": "2026-01-27T10:30:00Z",
  "checks": [
    {
      "name": "database",
      "status": "healthy",
      "message": "Connection OK",
      "latency_ms": 2.5,
      "details": {}
    },
    {
      "name": "module:billing",
      "status": "healthy",
      "message": "",
      "latency_ms": 0.1,
      "details": {}
    }
  ]
}
```

**Overall Status Logic:**
- If any check is `UNHEALTHY` → overall is `UNHEALTHY`
- If any check is `DEGRADED` and none `UNHEALTHY` → overall is `DEGRADED`
- Otherwise → `HEALTHY`

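The three rules above amount to "worst status wins". A minimal sketch of that aggregation follows; `SketchStatus` and `overall_status` are hypothetical names for illustration, and treating an empty list of checks as healthy is an assumption based on the "Otherwise" rule.

```python
from enum import Enum
from typing import Iterable


class SketchStatus(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


def overall_status(statuses: Iterable[SketchStatus]) -> SketchStatus:
    """Combine individual check statuses into one overall status."""
    statuses = list(statuses)
    if any(s is SketchStatus.UNHEALTHY for s in statuses):
        return SketchStatus.UNHEALTHY
    if any(s is SketchStatus.DEGRADED for s in statuses):
        return SketchStatus.DEGRADED
    # No unhealthy or degraded checks (including the empty case, by assumption)
    return SketchStatus.HEALTHY


mixed = overall_status([SketchStatus.HEALTHY, SketchStatus.DEGRADED])
failing = overall_status([SketchStatus.DEGRADED, SketchStatus.UNHEALTHY])
```
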
### GET /health/live

Kubernetes liveness probe. Returns 200 if the application is running.

**Response:**
```json
{"status": "alive"}
```

### GET /health/ready

Kubernetes readiness probe. Returns ready status based on health checks.

**Response:**
```json
{
  "status": "ready",
  "health": "healthy"
}
```

Or if unhealthy:
```json
{
  "status": "not_ready",
  "health": "unhealthy"
}
```

### GET /metrics

Prometheus metrics endpoint. Returns metrics in Prometheus text format.

**Response:**
```
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/products",status="200"} 1234
...
```

### GET /health/tools

Returns URLs to external monitoring tools.

**Response:**
```json
{
  "flower": "http://flower.example.com:5555",
  "grafana": "http://grafana.example.com:3000",
  "prometheus": null
}
```

## Prometheus Metrics

### MetricsRegistry

The metrics registry provides a wrapper around `prometheus_client`, with a fallback for when the library isn't installed.

```python
from app.core.observability import metrics_registry

# Counter - tracks cumulative values
request_counter = metrics_registry.counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"]
)
request_counter.labels(method="GET", endpoint="/api/products", status="200").inc()

# Histogram - tracks distributions (latency, sizes)
request_latency = metrics_registry.histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)
request_latency.labels(endpoint="/api/products").observe(0.045)

# Gauge - tracks current values
active_connections = metrics_registry.gauge(
    "active_connections",
    "Number of active connections",
    ["pool"]
)
active_connections.labels(pool="database").set(42)
```

### Enabling Metrics

Metrics are disabled by default. Enable during initialization:

```python
from app.core.observability import init_observability

init_observability(
    enable_metrics=True,
    # ... other options
)
```

### Dummy Metrics

When `prometheus_client` isn't installed or metrics are disabled, the registry returns dummy metrics that silently ignore all operations. This allows code to use metrics without checking if they're enabled.

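One common way to get this behaviour is a no-op object that mirrors the metric interface. The sketch below illustrates the idea only; `NoOpMetric` and `make_counter` are hypothetical names, not the framework's actual classes. The real-metric branch uses `prometheus_client.Counter`, imported lazily so the code still runs when the library is missing.

```python
class NoOpMetric:
    """Accepts the same calls as a real metric and does nothing."""

    def labels(self, *args, **kwargs) -> "NoOpMetric":
        # Returning self lets chained calls like .labels(...).inc() work
        return self

    def inc(self, amount: float = 1) -> None:
        pass

    def observe(self, value: float) -> None:
        pass

    def set(self, value: float) -> None:
        pass


def make_counter(name: str, description: str, label_names: list, enabled: bool):
    """Return a real Counter when possible, otherwise a NoOpMetric (hypothetical helper)."""
    if not enabled:
        return NoOpMetric()
    try:
        from prometheus_client import Counter
    except ImportError:
        return NoOpMetric()
    return Counter(name, description, label_names)


# With metrics disabled, calling code is unchanged and nothing is recorded
requests_total = make_counter(
    "sketch_requests_total", "Total requests", ["method"], enabled=False
)
requests_total.labels(method="GET").inc()
```

Because the no-op object accepts the same calls, instrumented code needs no `if metrics_enabled:` guards.
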
## Sentry Integration

### Configuration

```python
from app.core.observability import sentry, init_observability

# Initialize via init_observability
init_observability(
    sentry_dsn="https://key@sentry.io/project",
    environment="production",
)

# Or initialize directly
sentry.init(
    dsn="https://key@sentry.io/project",
    environment="production",
    release="1.0.0"
)
```

### Capturing Errors

```python
import logging

from app.core.observability import sentry

logger = logging.getLogger(__name__)

try:
    risky_operation()  # placeholder for application code that may raise
except Exception as e:
    event_id = sentry.capture_exception(e)
    logger.error(f"Operation failed, Sentry event: {event_id}")

# Capture messages
sentry.capture_message("User reached rate limit", level="warning")
```

### Without Sentry

If `sentry_sdk` isn't installed or no DSN is provided, all capture calls silently return `None`.

## Module Health Checks

Modules can provide health checks that are automatically registered.

### Defining Module Health Check

```python
# In module definition
import stripe

from app.modules.base import ModuleDefinition

def check_billing_health() -> dict:
    """Check billing service health."""
    try:
        # Check Stripe connection
        stripe.Account.retrieve()
        return {"status": "healthy", "message": "Stripe connected"}
    except Exception as e:
        return {"status": "unhealthy", "message": str(e)}

billing_module = ModuleDefinition(
    code="billing",
    name="Billing",
    health_check=check_billing_health,
    # ...
)
```

### Registering Module Health Checks

```python
from app.core.observability import register_module_health_checks

# Call after modules are loaded (e.g., in app lifespan)
register_module_health_checks()
```

This registers health checks as `module:{code}` (e.g., `module:billing`).

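How `register_module_health_checks()` adapts each module's dict-returning function is not shown here. The sketch below is one plausible shape, under the assumption that each module check is wrapped, prefixed with `module:`, and guarded against exceptions. `wrap_module_check` and the `modules` mapping are hypothetical names for illustration.

```python
from typing import Callable, Dict


def wrap_module_check(code: str, check: Callable[[], dict]) -> Callable[[], dict]:
    """Wrap a module's health function so its result carries the name module:{code}."""
    name = f"module:{code}"

    def wrapped() -> dict:
        try:
            raw = check()
        except Exception as exc:
            # Assumption: a crashing check is reported as unhealthy, not propagated
            return {"name": name, "status": "unhealthy", "message": str(exc)}
        return {
            "name": name,
            "status": raw.get("status", "healthy"),
            "message": raw.get("message", ""),
        }

    return wrapped


def failing_check() -> dict:
    raise ConnectionError("gateway timeout")


# Hypothetical stand-in for the loaded modules and their health_check callables
modules: Dict[str, Callable[[], dict]] = {
    "billing": lambda: {"status": "healthy", "message": "Stripe connected"},
    "payments": failing_check,
}

checks = {f"module:{code}": wrap_module_check(code, fn) for code, fn in modules.items()}
billing_result = checks["module:billing"]()
payments_result = checks["module:payments"]()
```
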
## External Tools

### Flower (Celery Monitoring)

Configure the Flower URL to include it in `/health/tools`:

```python
init_observability(
    flower_url="http://flower:5555",
)
```

### Grafana

Configure the Grafana URL:

```python
init_observability(
    grafana_url="http://grafana:3000",
)
```

## Initialization

### Application Lifespan

```python
# main.py
from contextlib import asynccontextmanager
from fastapi import FastAPI
from app.core.observability import (
    init_observability,
    shutdown_observability,
    health_router,
    register_module_health_checks,
)

# `settings` is assumed to be the application's settings object, imported elsewhere

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup
    init_observability(
        enable_metrics=True,
        sentry_dsn=settings.SENTRY_DSN,
        environment=settings.ENVIRONMENT,
        flower_url=settings.FLOWER_URL,
        grafana_url=settings.GRAFANA_URL,
    )
    register_module_health_checks()

    yield

    # Shutdown
    shutdown_observability()

app = FastAPI(lifespan=lifespan)
app.include_router(health_router)
```

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `SENTRY_DSN` | Sentry DSN for error tracking | None (disabled) |
| `ENVIRONMENT` | Environment name | "development" |
| `ENABLE_METRICS` | Enable Prometheus metrics | False |
| `FLOWER_URL` | Flower dashboard URL | None |
| `GRAFANA_URL` | Grafana dashboard URL | None |

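As an illustration of how the variables and defaults in this table could be read, here is a small standard-library sketch. `ObservabilitySettings` and `load_settings` are hypothetical names, and the set of strings accepted as "true" for `ENABLE_METRICS` is an assumption; the project's real settings class may parse these differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ObservabilitySettings:
    sentry_dsn: Optional[str]
    environment: str
    enable_metrics: bool
    flower_url: Optional[str]
    grafana_url: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> ObservabilitySettings:
    """Read the documented variables, applying the defaults from the table."""
    truthy = {"1", "true", "yes"}  # assumption: accepted spellings of "enabled"
    return ObservabilitySettings(
        sentry_dsn=env.get("SENTRY_DSN") or None,  # empty string counts as unset
        environment=env.get("ENVIRONMENT", "development"),
        enable_metrics=env.get("ENABLE_METRICS", "false").strip().lower() in truthy,
        flower_url=env.get("FLOWER_URL") or None,
        grafana_url=env.get("GRAFANA_URL") or None,
    )


defaults = load_settings({})
configured = load_settings({"ENABLE_METRICS": "True", "FLOWER_URL": "http://flower:5555"})
```

Passing the environment as a mapping keeps the function easy to test without touching the real process environment.
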
## Kubernetes Integration

### Deployment Configuration

```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: app
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10
```

### Prometheus ServiceMonitor

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
spec:
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
```

## Best Practices

### Do

- Register health checks for critical dependencies (database, cache, external APIs)
- Use appropriate metric types (counter for counts, histogram for latency)
- Include meaningful labels but avoid high cardinality
- Set up alerts based on health status changes

### Don't

- Create health checks that are slow or have side effects
- Add high-cardinality labels to metrics (e.g., user IDs)
- Ignore Sentry errors in production
- Skip readiness probes in Kubernetes deployments

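To make the cardinality advice concrete: if raw request paths are used as an `endpoint` label, every product or user ID creates a new time series. One way to keep the label set bounded is to collapse variable path segments before recording. The helper below is a sketch of that idea; `normalize_endpoint` is a hypothetical name, and in a framework with route templates the template string itself is usually the better label.

```python
import re

# A path segment that is a UUID, e.g. /123e4567-e89b-12d3-a456-426614174000
_UUID_SEGMENT = re.compile(
    r"/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}(?=/|$)"
)
# A path segment made only of digits, e.g. /42
_NUMERIC_SEGMENT = re.compile(r"/\d+(?=/|$)")


def normalize_endpoint(path: str) -> str:
    """Replace ID-like path segments with placeholders to bound label cardinality."""
    path = _UUID_SEGMENT.sub("/{uuid}", path)
    path = _NUMERIC_SEGMENT.sub("/{id}", path)
    return path


label = normalize_endpoint("/api/users/42/orders/7")
```

The normalized value can then be passed as the `endpoint` label, so `/api/products/1` and `/api/products/2` are counted under the same series.
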
## Related Documentation

- [Module System](module-system.md) - Module health check integration
- [Background Tasks](background-tasks.md) - Celery monitoring with Flower
- [Deployment](../deployment/index.md) - Production deployment with monitoring

@@ -76,7 +76,42 @@ Custom middleware handles:

**See:** [Authentication & RBAC](auth-rbac.md) for details

### 4. Request Flow
### 4. Module System

The platform uses a modular architecture with three-tier classification:

```
┌─────────────────────────────────────────────────────────────┐
│                       FRAMEWORK LAYER                       │
│      (Infrastructure - always available, not modules)       │
│   Config │ Database │ Auth │ Permissions │ Observability    │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                        MODULE LAYER                         │
│                                                             │
│  CORE (Always Enabled)          OPTIONAL (Per-Platform)     │
│  ├── core                       ├── payments                │
│  ├── tenancy                    ├── billing                 │
│  ├── cms                        ├── inventory               │
│  └── customers                  ├── orders                  │
│                                 ├── marketplace             │
│  INTERNAL (Admin-Only)          ├── analytics               │
│  ├── dev-tools                  └── messaging               │
│  └── monitoring                                             │
└─────────────────────────────────────────────────────────────┘
```

**Module Features:**
- Enable/disable modules per platform
- Module dependencies (billing requires payments)
- Health checks and lifecycle hooks
- Self-contained modules with own services, models, migrations

**See:** [Module System](module-system.md) for complete documentation

### 5. Request Flow

```mermaid
graph TB
@@ -252,7 +287,15 @@ project/
│   ├── api/                 # API routes
│   ├── routes/              # Page routes (HTML)
│   ├── services/            # Business logic
│   ├── core/                # Core functionality
│   ├── core/                # Core functionality (config, db, observability)
│   ├── modules/             # Module definitions and self-contained modules
│   │   ├── base.py          # ModuleDefinition class
│   │   ├── registry.py      # Module registry (CORE, OPTIONAL, INTERNAL)
│   │   ├── service.py       # Module enablement service
│   │   ├── events.py        # Module event bus
│   │   ├── cms/             # Self-contained CMS module
│   │   ├── payments/        # Self-contained payments module
│   │   └── ...              # Other modules
│   └── exceptions/          # Custom exceptions
│
├── middleware/              # Custom middleware
@@ -288,6 +331,27 @@ project/

## Monitoring & Observability

The platform includes a comprehensive observability framework:

### Health Checks

- Aggregated health endpoint (`/health`)
- Kubernetes probes (`/health/live`, `/health/ready`)
- Module health check integration
- External tool links (`/health/tools`)

### Metrics

- Prometheus metrics endpoint (`/metrics`)
- Request latency histograms
- Counter and gauge support

### Error Tracking

- Sentry integration for production
- Exception capture with context
- Environment-aware configuration

### Logging

- Structured logging with Python logging module
@@ -295,12 +359,7 @@ project/
- Error tracking with stack traces
- Audit logging for admin actions

### Performance Monitoring

- Request timing headers (`X-Process-Time`)
- Database query monitoring (SQLAlchemy echo)
- Slow query identification
- Memory usage tracking
**See:** [Observability](observability.md) for complete documentation

## Deployment Architecture

@@ -332,6 +391,9 @@ Internet ───────────│ Load Balancer│
## Related Documentation

- [Multi-Tenant System](multi-tenant.md) - Detailed multi-tenancy implementation
- [Module System](module-system.md) - Module architecture and classification
- [Menu Management](menu-management.md) - Sidebar and menu configuration
- [Observability](observability.md) - Health checks, metrics, error tracking
- [Middleware Stack](middleware.md) - Complete middleware documentation
- [Authentication & RBAC](auth-rbac.md) - Security and access control
- [Request Flow](request-flow.md) - Detailed request processing