docs: add observability, creating modules guide, and unified migration plan

- Add observability framework documentation (health checks, metrics, Sentry)
- Add developer guide for creating modules
- Add comprehensive module migration plan with Celery task integration
- Update architecture overview with module system and observability sections
- Update module-system.md with links to new docs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-27 22:41:19 +01:00
parent e3cab29c1a
commit 7dbdbd4c7e
6 changed files with 1583 additions and 8 deletions
--- a/docs/architecture/module-system.md
+++ b/docs/architecture/module-system.md
@@ -417,5 +417,7 @@ register_module_health_checks()
 ## Related Documentation

 - [Menu Management](menu-management.md) - Sidebar and menu configuration
+- [Creating Modules](../development/creating-modules.md) - Developer guide for building modules
+- [Observability](observability.md) - Health checks and module health integration
 - [Multi-Tenant System](multi-tenant.md) - Platform isolation
 - [Feature Gating](../implementation/feature-gating-system.md) - Tier-based access
--- a/docs/architecture/observability.md
+++ b/docs/architecture/observability.md
@@ -0,0 +1,429 @@
+# Observability Framework
+
+The Wizamart platform includes an observability framework for monitoring application health, collecting metrics, and tracking errors. It is part of the Framework Layer: the infrastructure that modules depend on.
+
+## Overview
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│                     OBSERVABILITY FRAMEWORK                              │
+│                    app/core/observability.py                            │
+│                                                                          │
+│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐        │
+│  │ Health Checks   │  │ Prometheus      │  │ Sentry          │        │
+│  │ Registry        │  │ Metrics         │  │ Integration     │        │
+│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘        │
+│           │                    │                    │                   │
+│           ▼                    ▼                    ▼                   │
+│  ┌─────────────────────────────────────────────────────────────────┐   │
+│  │                      API Endpoints                               │   │
+│  │  /health  │  /health/live  │  /health/ready  │  /metrics        │   │
+│  └─────────────────────────────────────────────────────────────────┘   │
+└─────────────────────────────────────────────────────────────────────────┘
+                                    │
+                                    ▼
+                    ┌───────────────────────────────┐
+                    │      External Tools           │
+                    │  Flower │ Grafana │ Prometheus│
+                    └───────────────────────────────┘
+```
+
+## Health Checks
+
+### Health Check Registry
+
+Components register health checks that are aggregated into a single endpoint.
+
+```python
+from app.core.observability import (
+    health_registry,
+    HealthCheckResult,
+    HealthStatus,
+)
+
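+# Register using decorator
+@health_registry.register("database")
+def check_database() -> HealthCheckResult:
+    try:
+        # `db` is assumed to be an open database session/connection defined elsewhere;
+        # with SQLAlchemy 2.x, wrap the statement: db.execute(text("SELECT 1"))
+        db.execute("SELECT 1")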
+        return HealthCheckResult(
+            name="database",
+            status=HealthStatus.HEALTHY,
+            message="Database connection OK"
+        )
+    except Exception as e:
+        return HealthCheckResult(
+            name="database",
+            status=HealthStatus.UNHEALTHY,
+            message=str(e)
+        )
+
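+# Or register directly (check_redis_connection is a callable defined elsewhere
+# that is assumed to return a HealthCheckResult)
+health_registry.register_check("redis", check_redis_connection)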
+```
+
+### Health Status Levels
+
+| Status | Description | HTTP Response |
+|--------|-------------|---------------|
+| `HEALTHY` | All systems operational | 200 |
+| `DEGRADED` | Partial functionality available | 200 |
+| `UNHEALTHY` | Critical failure | 200 (check response body) |
+
+### HealthCheckResult Fields
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `name` | str | Check identifier |
+| `status` | HealthStatus | Health status level |
+| `message` | str | Optional status message |
+| `latency_ms` | float | Check execution time |
+| `details` | dict | Additional diagnostic data |
+| `checked_at` | datetime | Timestamp of check |
+
+## API Endpoints
+
+### GET /health
+
+Aggregated health check endpoint. Returns combined status from all registered checks.
+
+**Response:**
+```json
+{
+    "status": "healthy",
+    "timestamp": "2026-01-27T10:30:00Z",
+    "checks": [
+        {
+            "name": "database",
+            "status": "healthy",
+            "message": "Connection OK",
+            "latency_ms": 2.5,
+            "details": {}
+        },
+        {
+            "name": "module:billing",
+            "status": "healthy",
+            "message": "",
+            "latency_ms": 0.1,
+            "details": {}
+        }
+    ]
+}
+```
+
+**Overall Status Logic:**
+- If any check is `UNHEALTHY` → overall is `UNHEALTHY`
+- If any check is `DEGRADED` and none `UNHEALTHY` → overall is `DEGRADED`
+- Otherwise → `HEALTHY`
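+
+The rules above can be sketched as a small function. This is a minimal illustration, not the code in `app/core/observability.py`; the local `HealthStatus` enum and the `aggregate_status` name are assumptions made so the example is self-contained.
+
+```python
+from enum import Enum
+from typing import Iterable
+
+
+class HealthStatus(str, Enum):
+    # Local stand-in for the framework's HealthStatus (assumed values)
+    HEALTHY = "healthy"
+    DEGRADED = "degraded"
+    UNHEALTHY = "unhealthy"
+
+
+def aggregate_status(statuses: Iterable[HealthStatus]) -> HealthStatus:
+    """Combine individual check statuses into one overall status."""
+    seen = set(statuses)
+    if HealthStatus.UNHEALTHY in seen:
+        return HealthStatus.UNHEALTHY
+    if HealthStatus.DEGRADED in seen:
+        return HealthStatus.DEGRADED
+    # No checks registered, or every check healthy
+    return HealthStatus.HEALTHY
+```
+
+Under this sketch an empty set of checks falls through to `HEALTHY`, which matches the "Otherwise" rule; whether the real registry treats "no checks" the same way is not stated here.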
+
+### GET /health/live
+
+Kubernetes liveness probe. Returns 200 if the application is running.
+
+**Response:**
+```json
+{"status": "alive"}
+```
+
+### GET /health/ready
+
+Kubernetes readiness probe. Reports readiness based on the aggregated health checks.
+
+**Response:**
+```json
+{
+    "status": "ready",
+    "health": "healthy"
+}
+```
+
+Or if unhealthy:
+```json
+{
+    "status": "not_ready",
+    "health": "unhealthy"
+}
+```
+
+### GET /metrics
+
+Prometheus metrics endpoint. Returns metrics in Prometheus text format.
+
+**Response:**
+```
+# HELP http_requests_total Total HTTP requests
+# TYPE http_requests_total counter
+http_requests_total{method="GET",endpoint="/api/products",status="200"} 1234
+...
+```
+
+### GET /health/tools
+
+Returns URLs to external monitoring tools.
+
+**Response:**
+```json
+{
+    "flower": "http://flower.example.com:5555",
+    "grafana": "http://grafana.example.com:3000",
+    "prometheus": null
+}
+```
+
+## Prometheus Metrics
+
+### MetricsRegistry
+
+The metrics registry wraps `prometheus_client` and falls back to no-op metrics when the library isn't installed.
+
+```python
+from app.core.observability import metrics_registry
+
+# Counter - tracks cumulative values
+request_counter = metrics_registry.counter(
+    "http_requests_total",
+    "Total HTTP requests",
+    ["method", "endpoint", "status"]
+)
+request_counter.labels(method="GET", endpoint="/api/products", status="200").inc()
+
+# Histogram - tracks distributions (latency, sizes)
+request_latency = metrics_registry.histogram(
+    "http_request_duration_seconds",
+    "HTTP request latency",
+    ["endpoint"],
+    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
+)
+request_latency.labels(endpoint="/api/products").observe(0.045)
+
+# Gauge - tracks current values
+active_connections = metrics_registry.gauge(
+    "active_connections",
+    "Number of active connections",
+    ["pool"]
+)
+active_connections.labels(pool="database").set(42)
+```
+
+### Enabling Metrics
+
+Metrics are disabled by default. Enable them during initialization:
+
+```python
+from app.core.observability import init_observability
+
+init_observability(
+    enable_metrics=True,
+    # ... other options
+)
+```
+
+### Dummy Metrics
+
+When `prometheus_client` isn't installed or metrics are disabled, the registry returns dummy metrics that silently ignore all operations. This allows code to use metrics without checking if they're enabled.
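+
+The dummy-metric idea is a null-object pattern. A minimal sketch follows, assuming a hypothetical `DummyMetric` class; the real fallback in `app/core/observability.py` may be structured differently.
+
+```python
+class DummyMetric:
+    """No-op stand-in that accepts the same calls as a real metric."""
+
+    def labels(self, *args, **kwargs) -> "DummyMetric":
+        # Real metrics return a labelled child; the dummy returns itself
+        return self
+
+    def inc(self, amount: float = 1) -> None:
+        pass
+
+    def dec(self, amount: float = 1) -> None:
+        pass
+
+    def set(self, value: float) -> None:
+        pass
+
+    def observe(self, value: float) -> None:
+        pass
+
+
+# Call sites look identical whether metrics are enabled or not
+request_counter = DummyMetric()
+request_counter.labels(method="GET", endpoint="/api/products", status="200").inc()
+```
+
+Because `labels()` returns the same object, chained calls such as `.labels(...).observe(...)` work without any conditional at the call site.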
+
+## Sentry Integration
+
+### Configuration
+
+```python
+from app.core.observability import sentry, init_observability
+
+# Initialize via init_observability
+init_observability(
+    sentry_dsn="https://key@sentry.io/project",
+    environment="production",
+)
+
+# Or initialize directly
+sentry.init(
+    dsn="https://key@sentry.io/project",
+    environment="production",
+    release="1.0.0"
+)
+```
+
+### Capturing Errors
+
+```python
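+import logging
+
+from app.core.observability import sentry
+
+logger = logging.getLogger(__name__)
+
+try:
+    risky_operation()  # placeholder for any call that may raise
+except Exception as e:
+    event_id = sentry.capture_exception(e)
+    logger.error("Operation failed, Sentry event: %s", event_id)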
+
+# Capture messages
+sentry.capture_message("User reached rate limit", level="warning")
+```
+
+### Without Sentry
+
+If `sentry_sdk` isn't installed or no DSN is provided, all capture calls silently return `None`.
+
+## Module Health Checks
+
+Modules can provide health checks that are automatically registered.
+
+### Defining a Module Health Check
+
+```python
+# In module definition
+import stripe
+
+from app.modules.base import ModuleDefinition
+
+def check_billing_health() -> dict:
+    """Check billing service health."""
+    try:
+        # Check Stripe connection
+        stripe.Account.retrieve()
+        return {"status": "healthy", "message": "Stripe connected"}
+    except Exception as e:
+        return {"status": "unhealthy", "message": str(e)}
+
+billing_module = ModuleDefinition(
+    code="billing",
+    name="Billing",
+    health_check=check_billing_health,
+    # ...
+)
+```
+
+### Registering Module Health Checks
+
+```python
+from app.core.observability import register_module_health_checks
+
+# Call after modules are loaded (e.g., in app lifespan)
+register_module_health_checks()
+```
+
+This registers health checks as `module:{code}` (e.g., `module:billing`).
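+
+One way to picture the registration step: each module's dict-returning check is wrapped and stored under a `module:`-prefixed name. This is a sketch only; the plain-dict registry and the `register_module_checks` and `_guard` helpers are hypothetical and do not reflect the real implementation.
+
+```python
+from typing import Callable, Dict
+
+CheckFn = Callable[[], dict]
+
+
+def register_module_checks(
+    module_checks: Dict[str, CheckFn],
+    registry: Dict[str, CheckFn],
+) -> None:
+    """Store each module check in the registry as 'module:{code}'."""
+    for code, check in module_checks.items():
+        registry[f"module:{code}"] = _guard(check)
+
+
+def _guard(check: CheckFn) -> CheckFn:
+    """Turn an exception raised by a check into an unhealthy result."""
+    def wrapped() -> dict:
+        try:
+            return check()
+        except Exception as exc:
+            return {"status": "unhealthy", "message": str(exc)}
+    return wrapped
+
+
+registry: Dict[str, CheckFn] = {}
+register_module_checks(
+    {"billing": lambda: {"status": "healthy", "message": "Stripe connected"}},
+    registry,
+)
+```
+
+Namespacing the key keeps module checks from colliding with infrastructure checks such as `database` or `redis`.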
+
+## External Tools
+
+### Flower (Celery Monitoring)
+
+Configure the Flower URL to include it in `/health/tools`:
+
+```python
+init_observability(
+    flower_url="http://flower:5555",
+)
+```
+
+### Grafana
+
+Configure the Grafana URL:
+
+```python
+init_observability(
+    grafana_url="http://grafana:3000",
+)
+```
+
+## Initialization
+
+### Application Lifespan
+
+```python
+# main.py
+from contextlib import asynccontextmanager
+from fastapi import FastAPI
+from app.core.observability import (
+    init_observability,
+    shutdown_observability,
+    health_router,
+    register_module_health_checks,
+)
+
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    # Startup (`settings` is the project's configuration object; its import is omitted here)
+    init_observability(
+        enable_metrics=True,
+        sentry_dsn=settings.SENTRY_DSN,
+        environment=settings.ENVIRONMENT,
+        flower_url=settings.FLOWER_URL,
+        grafana_url=settings.GRAFANA_URL,
+    )
+    register_module_health_checks()
+
+    yield
+
+    # Shutdown
+    shutdown_observability()
+
+app = FastAPI(lifespan=lifespan)
+app.include_router(health_router)
+```
+
+### Environment Variables
+
+| Variable | Description | Default |
+|----------|-------------|---------|
+| `SENTRY_DSN` | Sentry DSN for error tracking | None (disabled) |
+| `ENVIRONMENT` | Environment name | "development" |
+| `ENABLE_METRICS` | Enable Prometheus metrics | False |
+| `FLOWER_URL` | Flower dashboard URL | None |
+| `GRAFANA_URL` | Grafana dashboard URL | None |
+
+## Kubernetes Integration
+
+### Deployment Configuration
+
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+spec:
+  template:
+    spec:
+      containers:
+        - name: app
+          livenessProbe:
+            httpGet:
+              path: /health/live
+              port: 8000
+            initialDelaySeconds: 10
+            periodSeconds: 30
+          readinessProbe:
+            httpGet:
+              path: /health/ready
+              port: 8000
+            initialDelaySeconds: 5
+            periodSeconds: 10
+```
+
+### Prometheus ServiceMonitor
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: ServiceMonitor
+spec:
+  endpoints:
+    - port: http
+      path: /metrics
+      interval: 15s
+```
+
+## Best Practices
+
+### Do
+
+- Register health checks for critical dependencies (database, cache, external APIs)
+- Use appropriate metric types (counter for counts, histogram for latency)
+- Include meaningful labels but avoid high cardinality
+- Set up alerts based on health status changes
+
+### Don't
+
+- Create health checks that are slow or have side effects
+- Add high-cardinality labels to metrics (e.g., user IDs)
+- Ignore Sentry errors in production
+- Skip readiness probes in Kubernetes deployments
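+
+To keep a slow dependency from stalling the health endpoint, a check can be run with a time limit. The helper below is a hedged sketch using only the standard library; `run_with_timeout` and its dict-shaped result are assumptions, not part of the observability framework.
+
+```python
+from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
+from typing import Callable
+
+
+def run_with_timeout(check: Callable[[], dict], timeout_s: float = 1.0) -> dict:
+    """Run a check in a worker thread and give up after timeout_s seconds."""
+    executor = ThreadPoolExecutor(max_workers=1)
+    future = executor.submit(check)
+    try:
+        return future.result(timeout=timeout_s)
+    except FutureTimeout:
+        return {"status": "unhealthy", "message": f"timed out after {timeout_s}s"}
+    finally:
+        # Do not block on a check that is still running
+        executor.shutdown(wait=False)
+```
+
+Note that the worker thread is not cancelled on timeout; it finishes in the background, so the underlying check should still be cheap and free of side effects.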
+
+## Related Documentation
+
+- [Module System](module-system.md) - Module health check integration
+- [Background Tasks](background-tasks.md) - Celery monitoring with Flower
+- [Deployment](../deployment/index.md) - Production deployment with monitoring
--- a/docs/architecture/overview.md
+++ b/docs/architecture/overview.md
@@ -76,7 +76,42 @@ Custom middleware handles:

 **See:** [Authentication & RBAC](auth-rbac.md) for details

-### 4. Request Flow
+### 4. Module System
+
+The platform uses a modular architecture with a three-tier classification:
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│                    FRAMEWORK LAYER                          │
+│  (Infrastructure - always available, not modules)           │
+│  Config │ Database │ Auth │ Permissions │ Observability     │
+└─────────────────────────────────────────────────────────────┘
+                            │
+                            ▼
+┌─────────────────────────────────────────────────────────────┐
+│                     MODULE LAYER                            │
+│                                                             │
+│  CORE (Always Enabled)     OPTIONAL (Per-Platform)         │
+│  ├── core                  ├── payments                    │
+│  ├── tenancy               ├── billing                     │
+│  ├── cms                   ├── inventory                   │
+│  └── customers             ├── orders                      │
+│                            ├── marketplace                 │
+│  INTERNAL (Admin-Only)     ├── analytics                   │
+│  ├── dev-tools             └── messaging                   │
+│  └── monitoring                                            │
+└─────────────────────────────────────────────────────────────┘
+```
+
+**Module Features:**
+- Enable/disable modules per platform
+- Module dependencies (billing requires payments)
+- Health checks and lifecycle hooks
+- Self-contained modules with their own services, models, and migrations
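+
+As an illustration of the dependency rule (billing requires payments), one possible policy is that enabling a module also enables whatever it requires. The sketch below is hypothetical: the `DEPENDENCIES` map and `resolve_enabled` function are not part of the documented module API.
+
+```python
+from typing import Dict, List, Set
+
+# Assumed dependency map; only "billing requires payments" comes from the docs
+DEPENDENCIES: Dict[str, List[str]] = {
+    "billing": ["payments"],
+    "payments": [],
+}
+
+
+def resolve_enabled(requested: Set[str], deps: Dict[str, List[str]]) -> Set[str]:
+    """Return the requested modules plus everything they transitively require."""
+    enabled: Set[str] = set()
+    stack = list(requested)
+    while stack:
+        code = stack.pop()
+        if code in enabled:
+            continue  # already handled; also guards against cycles
+        enabled.add(code)
+        stack.extend(deps.get(code, []))
+    return enabled
+```
+
+With this map, requesting `{"billing"}` yields both `billing` and `payments`. A stricter policy could instead refuse to enable a module whose dependencies are disabled.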
+
+**See:** [Module System](module-system.md) for complete documentation
+
+### 5. Request Flow

 ```mermaid
 graph TB
@@ -252,7 +287,15 @@ project/
 │   ├── api/               # API routes
 │   ├── routes/            # Page routes (HTML)
 │   ├── services/          # Business logic
-│   ├── core/              # Core functionality
+│   ├── core/              # Core functionality (config, db, observability)
+│   ├── modules/           # Module definitions and self-contained modules
+│   │   ├── base.py       # ModuleDefinition class
+│   │   ├── registry.py   # Module registry (CORE, OPTIONAL, INTERNAL)
+│   │   ├── service.py    # Module enablement service
+│   │   ├── events.py     # Module event bus
+│   │   ├── cms/          # Self-contained CMS module
+│   │   ├── payments/     # Self-contained payments module
+│   │   └── ...           # Other modules
 │   └── exceptions/        # Custom exceptions
 │
 ├── middleware/            # Custom middleware
@@ -288,6 +331,27 @@ project/

 ## Monitoring & Observability

+The platform includes a comprehensive observability framework:
+
+### Health Checks
+
+- Aggregated health endpoint (`/health`)
+- Kubernetes probes (`/health/live`, `/health/ready`)
+- Module health check integration
+- External tool links (`/health/tools`)
+
+### Metrics
+
+- Prometheus metrics endpoint (`/metrics`)
+- Request latency histograms
+- Counter and gauge support
+
+### Error Tracking
+
+- Sentry integration for production
+- Exception capture with context
+- Environment-aware configuration
+
 ### Logging

 - Structured logging with Python logging module
@@ -295,12 +359,7 @@ project/
 - Error tracking with stack traces
 - Audit logging for admin actions

-### Performance Monitoring
-
-- Request timing headers (`X-Process-Time`)
-- Database query monitoring (SQLAlchemy echo)
-- Slow query identification
-- Memory usage tracking
+**See:** [Observability](observability.md) for complete documentation

 ## Deployment Architecture

@@ -332,6 +391,9 @@ Internet ───────────│ Load Balancer│
 ## Related Documentation

 - [Multi-Tenant System](multi-tenant.md) - Detailed multi-tenancy implementation
+- [Module System](module-system.md) - Module architecture and classification
+- [Menu Management](menu-management.md) - Sidebar and menu configuration
+- [Observability](observability.md) - Health checks, metrics, error tracking
 - [Middleware Stack](middleware.md) - Complete middleware documentation
 - [Authentication & RBAC](auth-rbac.md) - Security and access control
 - [Request Flow](request-flow.md) - Detailed request processing