
# Platform Health Monitoring

This guide covers the platform health monitoring features available in the admin dashboard.

## Overview

The Platform Health page (`/admin/platform-health`) provides real-time visibility into system performance, resource usage, and capacity thresholds.

## Accessing Platform Health

Navigate to **Admin > Platform Health** in the sidebar, or go directly to `/admin/platform-health`.

## Dashboard Sections

### 1. System Overview

Each indicator is shown as green, yellow, or red according to the ranges below:

| Indicator | Green | Yellow | Red |
|-----------|-------|--------|-----|
| API Response Time | < 100ms | 100-500ms | > 500ms |
| Error Rate | < 0.1% | 0.1-1% | > 1% |
| Database Health | Connected | Slow queries | Disconnected |
| Storage | < 70% | 70-85% | > 85% |
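The numeric rows of the table can be read as a simple banding rule. The sketch below is illustrative only: the `Status` enum, the `INDICATOR_BANDS` mapping, and the `classify` function are hypothetical names, not part of the platform code, and treating the upper edge of the yellow range as yellow is an assumption.

```python
from enum import Enum


class Status(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


# (yellow_from, red_above) pairs copied from the table above.
# Hypothetical keys; the dashboard's real identifiers may differ.
INDICATOR_BANDS = {
    "api_response_time_ms": (100, 500),
    "error_rate_percent": (0.1, 1.0),
    "storage_percent": (70, 85),
}


def classify(indicator: str, value: float) -> Status:
    """Map a numeric reading to a color band."""
    yellow_from, red_above = INDICATOR_BANDS[indicator]
    if value < yellow_from:
        return Status.GREEN
    if value <= red_above:  # assumption: the upper edge still counts as yellow
        return Status.YELLOW
    return Status.RED
```

For example, `classify("storage_percent", 72)` returns `Status.YELLOW`. Database Health is omitted because its states are qualitative rather than numeric.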

### 2. Resource Usage

Real-time metrics:

- **CPU Usage**: Current and 24h average
- **Memory Usage**: Used vs available
- **Disk Usage**: Storage consumption with trend
- **Network**: Inbound/outbound throughput

### 3. Capacity Metrics

Track growth toward scaling thresholds:

- **Total Products**: Count across all stores
- **Total Images**: Files stored in image system
- **Database Size**: Current size vs recommended max
- **Active Clients**: Monthly active store accounts
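As a rough illustration of "current size vs recommended max", the following sketch computes how much of a limit is used and, assuming linear growth, how many days remain before the limit is reached. The function names and the linear-growth model are assumptions made for this example, not the dashboard's actual calculation.

```python
from typing import Optional


def percent_of_max(current: float, recommended_max: float) -> float:
    """Share of the recommended maximum currently in use, in percent."""
    if recommended_max <= 0:
        raise ValueError("recommended_max must be positive")
    return 100.0 * current / recommended_max


def days_until_max(
    current: float, recommended_max: float, daily_growth: float
) -> Optional[float]:
    """Days until the limit is hit, assuming constant daily growth.

    Returns None when usage is flat or shrinking, and 0.0 when the
    limit has already been reached.
    """
    if daily_growth <= 0:
        return None
    remaining = recommended_max - current
    return max(0.0, remaining / daily_growth)
```

With a 40 GB database, a 50 GB recommended maximum, and 2 GB of growth per day, `percent_of_max(40, 50)` gives 80% and `days_until_max(40, 50, 2)` gives 5 days.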

### 4. Performance Trends

Historical charts (7-day, 30-day):

- API response times (p50, p95, p99)
- Request volume by endpoint
- Database query latency
- Error rate over time
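The p50, p95, and p99 values are percentiles of the observed response times. A minimal sketch of that calculation with the standard library follows; the `latency_percentiles` helper is hypothetical, and the dashboard may use a different interpolation method.

```python
import statistics
from typing import Sequence


def latency_percentiles(samples_ms: Sequence[float]) -> dict[str, float]:
    """Return p50, p95 and p99 for a list of response times in milliseconds.

    Requires at least two samples.
    """
    # quantiles(n=100) returns 99 cut points; index k - 1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

p99 is the most sensitive of the three to a handful of slow requests, which is why it is usually charted alongside the median rather than instead of it.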

## Alert Configuration

### Threshold Alerts

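Alert thresholds are defined in `HEALTH_THRESHOLDS`, with a `warning` and a `critical` level for each metric. The warning levels for response time (200 ms) and error rate (1.0%) are higher than the yellow ranges in the System Overview table (100 ms and 0.1%).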

```python
# In app/core/config.py
HEALTH_THRESHOLDS = {
    "cpu_percent": {"warning": 70, "critical": 85},
    "memory_percent": {"warning": 75, "critical": 90},
    "disk_percent": {"warning": 70, "critical": 85},
    "response_time_ms": {"warning": 200, "critical": 500},
    "error_rate_percent": {"warning": 1.0, "critical": 5.0},
}
```
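To show how such a mapping might be applied, here is a minimal sketch that compares a set of readings against the thresholds. The `evaluate` function is hypothetical, and the use of `>=` (a reading exactly at a threshold triggers that level) is an assumption; the real comparison logic lives in the application code.

```python
# Copied from the configuration above so the example is self-contained.
HEALTH_THRESHOLDS = {
    "cpu_percent": {"warning": 70, "critical": 85},
    "memory_percent": {"warning": 75, "critical": 90},
    "disk_percent": {"warning": 70, "critical": 85},
    "response_time_ms": {"warning": 200, "critical": 500},
    "error_rate_percent": {"warning": 1.0, "critical": 5.0},
}


def evaluate(metrics: dict[str, float]) -> dict[str, str]:
    """Return 'ok', 'warning' or 'critical' for each metric with a threshold."""
    levels = {}
    for name, value in metrics.items():
        limits = HEALTH_THRESHOLDS.get(name)
        if limits is None:
            continue  # no threshold configured for this metric
        if value >= limits["critical"]:  # assumption: thresholds are inclusive
            levels[name] = "critical"
        elif value >= limits["warning"]:
            levels[name] = "warning"
        else:
            levels[name] = "ok"
    return levels
```

Checking `critical` before `warning` matters: a reading of 95% memory satisfies both conditions, and only the more severe level should be reported.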

### Notification Channels

Alerts can be sent via:

- Email to admin users
- Slack webhook (if configured)
- Dashboard notifications
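For the Slack channel, an alert is typically delivered by POSTing a JSON body with a `text` field to an incoming webhook URL. The sketch below shows one way to do that with the standard library; the function names and the message format are assumptions, and the platform's own notifier may be structured differently.

```python
import json
import urllib.request


def build_slack_request(
    webhook_url: str, metric: str, level: str, value: float
) -> urllib.request.Request:
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    # Hypothetical message format, e.g. "[CRITICAL] cpu_percent is at 91.5".
    payload = {"text": f"[{level.upper()}] {metric} is at {value}"}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_alert(
    webhook_url: str, metric: str, level: str, value: float, timeout: float = 5.0
) -> int:
    """Send the alert and return the HTTP status code."""
    request = build_slack_request(webhook_url, metric, level, value)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping request construction separate from sending makes the payload easy to check without network access.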

## Related Pages

- [Capacity Monitoring](capacity-monitoring.md) - Detailed capacity metrics
- [Image Storage](image-storage.md) - Image system management
- [Capacity Planning](../architecture/capacity-planning.md) - Infrastructure sizing guide

## API Endpoints

The platform health page uses these admin API endpoints:

| Endpoint | Description |
|----------|-------------|
| `GET /api/v1/admin/platform/health` | Overall health status |
| `GET /api/v1/admin/platform/metrics` | Current metrics |
| `GET /api/v1/admin/platform/metrics/history` | Historical data |
| `GET /api/v1/admin/platform/capacity` | Capacity usage |
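These endpoints can also be called from a script. The sketch below is a minimal client using only the standard library. The base URL, the bearer-token `Authorization` header, and the assumption that responses are JSON are not confirmed by this guide; adjust them to match the deployment's actual authentication scheme.

```python
import json
import urllib.request

# Paths copied from the table above.
ENDPOINTS = {
    "health": "/api/v1/admin/platform/health",
    "metrics": "/api/v1/admin/platform/metrics",
    "history": "/api/v1/admin/platform/metrics/history",
    "capacity": "/api/v1/admin/platform/capacity",
}


def build_request(base_url: str, name: str, token: str) -> urllib.request.Request:
    """Build a GET request for one of the platform health endpoints."""
    url = base_url.rstrip("/") + ENDPOINTS[name]
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {token}",  # assumption: bearer-token auth
            "Accept": "application/json",
        },
    )


def fetch(base_url: str, name: str, token: str, timeout: float = 10.0):
    """Call the endpoint and decode the response body as JSON."""
    request = build_request(base_url, name, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A call such as `fetch("https://orion.example.com", "health", token)` would then return the decoded health status, where the host name is a placeholder.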