# Infrastructure Guide

This guide documents the complete infrastructure for the Orion platform, from development to high-end production.

**Philosophy:** We prioritize **debuggability and operational simplicity** over complexity. Every component should be directly accessible for troubleshooting.

---

## Table of Contents

- [Architecture Overview](#architecture-overview)
- [Current State](#current-state)
- [Development Environment](#development-environment)
- [Production Options](#production-options)
- [Future High-End Architecture](#future-high-end-architecture)
- [Component Deep Dives](#component-deep-dives)
- [Troubleshooting Guide](#troubleshooting-guide)
- [Decision Matrix](#decision-matrix)
- [Migration Path](#migration-path)
- [Enterprise Upgrade Checklist](#enterprise-upgrade-checklist)
- [Next Steps](#next-steps)

---

## Architecture Overview

### System Components

```
┌─────────────────────────────────────────────────────────────────────────┐
│                                 CLIENTS                                 │
│                 (Browsers, Mobile Apps, API Consumers)                  │
└─────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                          LOAD BALANCER / PROXY                          │
│                       (Nginx, Caddy, or Cloud LB)                       │
│                       - SSL termination                                 │
│                       - Static file serving                             │
│                       - Rate limiting                                   │
└─────────────────────────────────────────────────────────────────────────┘
                                     │
                     ┌───────────────┼───────────────┐
                     ▼               ▼               ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                           APPLICATION SERVERS                           │
│                           (FastAPI + Uvicorn)                           │
│                       - API endpoints                                   │
│                       - HTML rendering (Jinja2)                         │
│                       - WebSocket connections                           │
└─────────────────────────────────────────────────────────────────────────┘
             │                       │                       │
             ▼                       ▼                       ▼
   ┌───────────────────┐   ┌───────────────────┐   ┌───────────────────┐
   │    PostgreSQL     │   │       Redis       │   │   File Storage    │
   │   (Primary DB)    │   │   (Cache/Queue)   │   │    (S3/Local)     │
   └───────────────────┘   └───────────────────┘   └───────────────────┘
                                     │
                                     ▼
                           ┌───────────────────┐
                           │  Celery Workers   │
                           │ (Background Jobs) │
                           └───────────────────┘
```

### Data Flow

1. **Request** → Nginx → Uvicorn → FastAPI → Service Layer → Database
2. **Background Job** → API creates task → Redis Queue → Celery Worker → Database
3. **Static Files** → Nginx serves directly (or CDN in production)

---

## Current State

### What We Have Now

| Component | Technology | Dev Required | Prod Required | Status |
|-----------|------------|--------------|---------------|--------|
| Web Framework | FastAPI + Uvicorn | ✅ | ✅ | ✅ Production Ready |
| Database | PostgreSQL 15 | ✅ | ✅ | ✅ Production Ready |
| ORM | SQLAlchemy 2.0 | ✅ | ✅ | ✅ Production Ready |
| Migrations | Alembic | ✅ | ✅ | ✅ Production Ready |
| Templates | Jinja2 + Tailwind CSS | ✅ | ✅ | ✅ Production Ready |
| Authentication | JWT (PyJWT) | ✅ | ✅ | ✅ Production Ready |
| Email | SMTP/SendGrid/Mailgun/SES | ❌ | ✅ | ✅ Production Ready |
| Payments | Stripe | ❌ | ✅ | ✅ Production Ready |
| Task Queue | Celery 5.3 + Redis | ❌ | ✅ | ✅ Production Ready |
| Task Scheduler | Celery Beat | ❌ | ✅ | ✅ Production Ready |
| Task Monitoring | Flower | ❌ | ⚪ Optional | ✅ Production Ready |
| Caching | Redis 7 | ❌ | ✅ | ✅ Production Ready |
| File Storage | Local / Cloudflare R2 | Local | R2 | ✅ Production Ready |
| Error Tracking | Sentry | ❌ | ⚪ Recommended | ✅ Production Ready |
| CDN / WAF | Cloudflare | ❌ | ⚪ Recommended | ✅ Production Ready |

**Legend:** ✅ Required | ⚪ Optional/Recommended | ❌ Not needed

### Development vs Production

**Development** requires only:

- PostgreSQL (via Docker: `make docker-up`, which also starts Redis)
- Python 3.11+ with dependencies

**Production** adds:

- Redis (for Celery task queue)
- Celery workers (for background tasks)
- Reverse proxy (Nginx)
- SSL certificates

**Optional but recommended for Production:**

- Sentry (error tracking) - Set `SENTRY_DSN` to enable
- Cloudflare R2 (cloud storage) - Set `STORAGE_BACKEND=r2` to enable
- Cloudflare CDN (caching/DDoS) - Set `CLOUDFLARE_ENABLED=true` to enable

### What We Need for Enterprise (Future Growth)

| Component | Priority | When Needed | Estimated Users |
|-----------|----------|-------------|-----------------|
| Load Balancer | Medium | Horizontal scaling | 1,000+ concurrent |
| Database Replica | Medium | Read-heavy workloads | 1,000+ concurrent |
| Redis Sentinel | Low | Cache redundancy | 5,000+ concurrent |
| Prometheus/Grafana | Low | Advanced metrics | Any (nice to have) |
| Kubernetes | Low | Multi-region/HA | 10,000+ concurrent |

---

## Development Environment

### Local Setup (Recommended)

```bash
# 1. Start PostgreSQL and Redis
make docker-up

# 2. Run migrations
make migrate-up

# 3. Initialize data
make init-prod

# 4. Start development server
make dev

# 5. (Optional) Start Celery worker for background tasks
make celery-dev  # Worker + Beat together

# 6. (Optional) Run tests
make test
```

### Services Running Locally

| Service | Host | Port | Purpose |
|---------|------|------|---------|
| FastAPI | localhost | 8000 | Main application |
| PostgreSQL | localhost | 5432 | Development database |
| PostgreSQL (test) | localhost | 5433 | Test database |
| Redis | localhost | 6380 | Cache and task broker |
| Celery Worker | - | - | Background task processing |
| Celery Beat | - | - | Scheduled task scheduler |
| Flower | localhost | 5555 | Task monitoring dashboard |
| MkDocs | localhost | 9991 | Documentation |

### Docker Compose Services

```yaml
# docker-compose.yml
services:
  db:             # PostgreSQL 15 for development
  redis:          # Redis 7 for cache/queue
  api:            # FastAPI application (profile: full)
  celery-worker:  # Background task processor (profile: full)
  celery-beat:    # Scheduled task scheduler (profile: full)
  flower:         # Task monitoring UI (profile: full)
```

### Celery Commands

```bash
# Start worker only
make celery-worker

# Start scheduler only
make celery-beat

# Start worker + scheduler together (development)
make celery-dev

# Start Flower monitoring
make flower

# Check worker status
make celery-status

# Purge pending tasks
make celery-purge
```

---

## Production Options

### Option 1: Traditional VPS (Recommended for Troubleshooting)

**Best for:** Teams that want direct server access and are familiar with Linux administration.

```
┌─────────────────────────────────────────────────────────────┐
│                       VPS (4GB+ RAM)                        │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │    Nginx    │  │   Uvicorn   │  │ PostgreSQL  │          │
│  │  (reverse   │  │ (4 workers) │  │   (local)   │          │
│  │   proxy)    │  │             │  │             │          │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
│         │                │                │                 │
│         └────────────────┼────────────────┘                 │
│                          │                                  │
│  ┌─────────────┐  ┌─────────────┐                           │
│  │    Redis    │  │   Celery    │                           │
│  │   (local)   │  │  (workers)  │                           │
│  └─────────────┘  └─────────────┘                           │
└─────────────────────────────────────────────────────────────┘
```

**Setup:**

```bash
# On Ubuntu 22.04+ VPS

# 1. Install system packages
# Note: Ubuntu 22.04's default repositories ship PostgreSQL 14; add the
# PostgreSQL (PGDG) apt repository first if postgresql-15 is not found.
sudo apt update
sudo apt install -y nginx postgresql-15 redis-server python3.11 python3.11-venv

# 2. Create application user
sudo useradd -m -s /bin/bash orion
sudo su - orion

# 3. Clone and setup
git clone <repository-url> /home/orion/app
cd /home/orion/app
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# 4. Configure environment
cp .env.example .env
nano .env  # Edit with production values

# 5. Setup database
# (run the sudo commands from an account with sudo rights, not as orion)
sudo -u postgres createuser orion_user
sudo -u postgres createdb orion_db -O orion_user
alembic upgrade head
python scripts/seed/init_production.py

# 6. Create systemd service
sudo nano /etc/systemd/system/orion.service
```

**Systemd Service:**

```ini
# /etc/systemd/system/orion.service
[Unit]
Description=Orion API
After=network.target postgresql.service redis.service

[Service]
User=orion
Group=orion
WorkingDirectory=/home/orion/app
Environment="PATH=/home/orion/app/.venv/bin"
EnvironmentFile=/home/orion/app/.env
ExecStart=/home/orion/app/.venv/bin/uvicorn main:app --host 127.0.0.1 --port 8000 --workers 4
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target
```

**Celery Workers:**

```ini
# /etc/systemd/system/orion-celery.service
[Unit]
Description=Orion Celery Worker
After=network.target redis.service

[Service]
User=orion
Group=orion
WorkingDirectory=/home/orion/app
Environment="PATH=/home/orion/app/.venv/bin"
EnvironmentFile=/home/orion/app/.env
ExecStart=/home/orion/app/.venv/bin/celery -A app.celery worker --loglevel=info --concurrency=4
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target
```

**Nginx Configuration:**

```nginx
# /etc/nginx/sites-available/orion
server {
    listen 80;
    server_name yourdomain.com;
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl http2;
    server_name yourdomain.com;

    ssl_certificate /etc/letsencrypt/live/yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/yourdomain.com/privkey.pem;

    # Security headers
    add_header X-Frame-Options "SAMEORIGIN" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-XSS-Protection "1; mode=block" always;

    # Static files (served directly by Nginx)
    location /static {
        alias /home/orion/app/static;
        expires 30d;
        add_header Cache-Control "public, immutable";
    }

    # Uploaded files
    location /uploads {
        alias /home/orion/app/uploads;
        expires 7d;
    }

    # API and application
    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # WebSocket support (for future real-time features)
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```

**Troubleshooting Commands:**

```bash
# Check service status
sudo systemctl status orion
sudo systemctl status orion-celery
sudo systemctl status postgresql
sudo systemctl status redis

# View logs
sudo journalctl -u orion -f
sudo journalctl -u orion-celery -f

# Connect to database directly
sudo -u postgres psql orion_db

# Check Redis
redis-cli ping
redis-cli monitor  # Watch commands in real-time

# Restart services
sudo systemctl restart orion
sudo systemctl restart orion-celery
```

---

### Option 2: Docker Compose Production

**Best for:** Consistent environments, easy rollbacks, container familiarity.

```yaml
# docker-compose.prod.yml
services:
  api:
    build: .
    restart: always
    ports:
      - "127.0.0.1:8000:8000"
    environment:
      DATABASE_URL: postgresql://orion_user:${DB_PASSWORD}@db:5432/orion_db
      REDIS_URL: redis://redis:6379/0
      CELERY_BROKER_URL: redis://redis:6379/1
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_healthy
    volumes:
      - ./uploads:/app/uploads
      - ./logs:/app/logs
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  celery:
    build: .
    restart: always
    command: celery -A app.core.celery_config worker --loglevel=info --concurrency=2
    environment:
      DATABASE_URL: postgresql://orion_user:${DB_PASSWORD}@db:5432/orion_db
      REDIS_URL: redis://redis:6379/0
      CELERY_BROKER_URL: redis://redis:6379/1
    depends_on:
      - db
      - redis
    volumes:
      - ./logs:/app/logs

  celery-beat:
    build: .
    restart: always
    command: celery -A app.celery beat --loglevel=info
    environment:
      DATABASE_URL: postgresql://orion_user:${DB_PASSWORD}@db:5432/orion_db
      CELERY_BROKER_URL: redis://redis:6379/1
    depends_on:
      - redis

  db:
    image: postgres:15
    restart: always
    environment:
      POSTGRES_DB: orion_db
      POSTGRES_USER: orion_user
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U orion_user -d orion_db"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    restart: always
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5

  nginx:
    image: nginx:alpine
    restart: always
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./static:/app/static:ro
      - ./uploads:/app/uploads:ro
      - /etc/letsencrypt:/etc/letsencrypt:ro
    depends_on:
      - api

volumes:
  postgres_data:
  redis_data:
```

**Troubleshooting Commands:**

```bash
# View all containers
docker compose -f docker-compose.prod.yml ps

# View logs
docker compose -f docker-compose.prod.yml logs -f api
docker compose -f docker-compose.prod.yml logs -f celery

# Access container shell
docker compose -f docker-compose.prod.yml exec api bash
docker compose -f docker-compose.prod.yml exec db psql -U orion_user -d orion_db

# Restart specific service
docker compose -f docker-compose.prod.yml restart api

# View resource usage
docker stats
```

---

### Option 3: Managed Services (Minimal Ops)

**Best for:** Small teams that want to focus on product rather than infrastructure.


| Component | Service | Cost (approx) |
|-----------|---------|---------------|
| App Hosting | Railway / Render / Fly.io | $5-25/mo |
| Database | Neon / Supabase / PlanetScale | $0-25/mo |
| Redis | Upstash / Redis Cloud | $0-10/mo |
| File Storage | Cloudflare R2 / AWS S3 | $0-5/mo |
| Email | Resend / SendGrid | $0-20/mo |

**Example: Railway + Neon**

```bash
# Deploy to Railway
railway login
railway init
railway up

# Configure environment
railway variables set DATABASE_URL="postgresql://..."
railway variables set REDIS_URL="redis://..."
```

---

## Future High-End Architecture

### Target Production Architecture

```
                         ┌─────────────────┐
                         │   Cloudflare    │
                         │   (CDN + WAF)   │
                         └────────┬────────┘
                                  │
                         ┌────────▼────────┐
                         │  Load Balancer  │
                         │  (HAProxy/ALB)  │
                         └────────┬────────┘
                                  │
           ┌──────────────────────┼──────────────────────┐
           │                      │                      │
  ┌────────▼────────┐    ┌────────▼────────┐    ┌────────▼────────┐
  │  API Server 1   │    │  API Server 2   │    │  API Server N   │
  │    (Uvicorn)    │    │    (Uvicorn)    │    │    (Uvicorn)    │
  └────────┬────────┘    └────────┬────────┘    └────────┬────────┘
           │                      │                      │
           └──────────────────────┼──────────────────────┘
                                  │
           ┌──────────────────────┼──────────────────────┐
           │                      │                      │
  ┌────────▼────────┐    ┌────────▼────────┐    ┌────────▼────────┐
  │   PostgreSQL    │    │      Redis      │    │   S3 / MinIO    │
  │    (Primary)    │    │    (Cluster)    │    │     (Files)     │
  │       │         │    │                 │    │                 │
  │  ┌────▼────┐    │    │  ┌─────────┐    │    │                 │
  │  │ Replica │    │    │  │ Sentinel│    │    │                 │
  │  └─────────┘    │    │  └─────────┘    │    │                 │
  └─────────────────┘    └────────┬────────┘    └─────────────────┘
                                  │
           ┌──────────────────────┼──────────────────────┐
           │                      │                      │
  ┌────────▼────────┐    ┌────────▼────────┐    ┌────────▼────────┐
  │ Celery Worker 1 │    │ Celery Worker 2 │    │   Celery Beat   │
  │    (General)    │    │  (Import Jobs)  │    │   (Scheduler)   │
  └─────────────────┘    └─────────────────┘    └─────────────────┘

                    ┌─────────────────────────────┐
                    │      Monitoring Stack       │
                    │  ┌──────────┐ ┌───────────┐ │
                    │  │Prometheus│ │  Grafana  │ │
                    │  └──────────┘ └───────────┘ │
                    │  ┌──────────┐ ┌───────────┐ │
                    │  │  Sentry  │ │   Loki    │ │
                    │  └──────────┘ └───────────┘ │
                    └─────────────────────────────┘
```

### Celery Task Queues

```python
# app/celery.py (to be implemented)
from celery import Celery

# `settings` is the application's settings object; import it from wherever
# the project defines it.
celery_app = Celery(
    "orion",
    broker=settings.celery_broker_url,
    backend=settings.celery_result_backend,
)

celery_app.conf.task_queues = {
    "default": {"exchange": "default", "routing_key": "default"},
    "imports": {"exchange": "imports", "routing_key": "imports"},
    "emails": {"exchange": "emails", "routing_key": "emails"},
    "reports": {"exchange": "reports", "routing_key": "reports"},
}

celery_app.conf.task_routes = {
    "app.tasks.import_letzshop_products": {"queue": "imports"},
    "app.tasks.send_email": {"queue": "emails"},
    "app.tasks.generate_report": {"queue": "reports"},
}
```

### Background Tasks to Implement

| Task | Queue | Priority | Description |
|------|-------|----------|-------------|
| `import_letzshop_products` | imports | High | Marketplace product sync |
| `import_letzshop_orders` | imports | High | Order sync from Letzshop |
| `send_order_confirmation` | emails | High | Order emails |
| `send_password_reset` | emails | High | Auth emails |
| `send_invoice_email` | emails | Medium | Invoice delivery |
| `generate_sales_report` | reports | Low | Analytics reports |
| `cleanup_expired_sessions` | default | Low | Maintenance |
| `sync_stripe_subscriptions` | default | Medium | Billing sync |

---

## Component Deep Dives

### PostgreSQL Configuration

**Production Settings (`postgresql.conf`):**

```ini
# Memory (example values sized for about 1 GB of RAM; scale with server RAM)
shared_buffers = 256MB           # ~25% of RAM for dedicated DB server
effective_cache_size = 768MB     # ~75% of RAM
work_mem = 16MB
maintenance_work_mem = 128MB

# Connections
max_connections = 100

# Write-Ahead Log
wal_level = replica
max_wal_senders = 3

# Query Planning
random_page_cost = 1.1           # For SSD storage
effective_io_concurrency = 200   # For SSD storage

# Logging
log_min_duration_statement = 1000  # Log queries > 1 second
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d '
```

**Backup Strategy:**

```bash
#!/bin/bash
# Daily backup script
set -euo pipefail  # Abort if pg_dump or gzip fails

BACKUP_DIR=/backups/postgresql
DATE=$(date +%Y%m%d_%H%M%S)

pg_dump -U orion_user orion_db | gzip > "$BACKUP_DIR/orion_$DATE.sql.gz"

# Keep last 7 days
find "$BACKUP_DIR" -name "*.sql.gz" -mtime +7 -delete
```

### Redis Configuration

**Use Cases:**

| Use Case | Database | TTL | Description |
|----------|----------|-----|-------------|
| Session Cache | 0 | 24h | User sessions |
| API Rate Limiting | 0 | 1h | Request counters |
| Celery Broker | 1 | - | Task queue |
| Celery Results | 2 | 24h | Task results |
| Feature Flags | 3 | 5m | Feature gate cache |

**Configuration (`redis.conf`):**

```ini
maxmemory 100mb
# Note: allkeys-lru may evict any key, including queued Celery messages, once
# maxmemory is reached. Size the limit with that in mind while the broker
# shares this instance.
maxmemory-policy allkeys-lru
appendonly yes
appendfsync everysec
```

### Nginx Tuning

```nginx
# /etc/nginx/nginx.conf
worker_processes auto;
worker_rlimit_nofile 65535;

events {
    worker_connections 4096;
    use epoll;
    multi_accept on;
}

http {
    # Buffers
    client_body_buffer_size 10K;
    client_header_buffer_size 1k;
    client_max_body_size 50M;
    large_client_header_buffers 4 8k;  # Nginx default; 1k buffers reject large JWT/cookie headers

    # Timeouts
    client_body_timeout 12;
    client_header_timeout 12;
    keepalive_timeout 15;
    send_timeout 10;

    # Gzip
    gzip on;
    gzip_vary on;
    gzip_proxied any;
    gzip_comp_level 6;
    gzip_types text/plain text/css text/xml application/json application/javascript;
}
```

---

## Troubleshooting Guide

### Quick Diagnostics

```bash
# Check all services
systemctl status orion orion-celery postgresql redis nginx

# Check ports
ss -tlnp | grep -E ':(8000|5432|6379|80|443)\b'

# Check disk space
df -h

# Check memory
free -h

# Check CPU/processes
htop
```

### Database Issues

```sql
-- Open a psql session first (shell): sudo -u postgres psql orion_db

-- Check active connections
SELECT count(*) FROM pg_stat_activity;

-- Find slow queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;

-- Kill stuck query (replace <pid> with a pid from the query above)
SELECT pg_terminate_backend(<pid>);

-- Check table sizes
SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;

-- Analyze query performance
EXPLAIN ANALYZE SELECT ...;
```

### Redis Issues

```bash
# Check connectivity
redis-cli ping

# Monitor real-time commands
redis-cli monitor

# Check memory usage
redis-cli info memory

# List all keys (careful in production!)
redis-cli --scan

# Check queue lengths (the Celery broker uses database 1)
redis-cli -n 1 llen celery

# Flush specific database
redis-cli -n 1 flushdb  # Flush Celery broker
```

### Celery Issues

```bash
# Check worker status
celery -A app.celery inspect active
celery -A app.celery inspect reserved
celery -A app.celery inspect stats

# Purge all pending tasks
celery -A app.celery purge

# List registered tasks
celery -A app.celery inspect registered
```

### Application Issues

```bash
# Check API health
curl -s http://localhost:8000/health | jq

# View recent logs
journalctl -u orion --since "10 minutes ago"

# Check for Python errors
journalctl -u orion | grep -i error | tail -20

# Test database connection
python -c "from app.core.database import engine; print(engine.connect())"
```

### Common Problems & Solutions

| Problem | Diagnosis | Solution |
|---------|-----------|----------|
| 502 Bad Gateway | `systemctl status orion` | Restart app: `systemctl restart orion` |
| Database connection refused | `pg_isready` | Start PostgreSQL: `systemctl start postgresql` |
| High memory usage | `free -h`, `ps aux --sort=-%mem` | Restart app, check for memory leaks |
| Slow queries | PostgreSQL slow query log | Add indexes, optimize queries |
| Celery tasks stuck | `celery -A app.celery inspect active` | Restart workers, check Redis |
| Disk full | `df -h` | Clean logs, backups, temp files |

---

## Decision Matrix

### When to Use Each Option

| Scenario | Recommended | Reason |
|----------|-------------|--------|
| Solo developer, MVP | Managed (Railway) | Focus on product |
| Small team, budget conscious | Traditional VPS | Full control, low cost |
| Need direct DB access for debugging | Traditional VPS | Direct psql access |
| Familiar with Docker, want consistency | Docker Compose | Reproducible environments |
| High availability required | Docker + Orchestration | Easy scaling |
| Enterprise, compliance requirements | Kubernetes | Full orchestration |

### Cost Comparison (Monthly)

| Setup | Low Traffic | Medium | High |
|-------|-------------|--------|------|
| Managed (Railway + Neon) | $10 | $50 | $200+ |
| VPS (Hetzner/DigitalOcean) | $5 | $20 | $80 |
| Docker on VPS | $5 | $20 | $80 |
| AWS/GCP Full Stack | $50 | $200 | $1000+ |

---

## Migration Path

### Phase 1: Development ✅ COMPLETE

- ✅ PostgreSQL 15 (Docker)
- ✅ FastAPI + Uvicorn
- ✅ Local file storage

### Phase 2: Production MVP ✅ COMPLETE

- ✅ PostgreSQL (managed or VPS)
- ✅ FastAPI + Uvicorn (systemd or Docker)
- ✅ Redis 7 (cache + task broker)
- ✅ Celery 5.3 (background jobs)
- ✅ Celery Beat (scheduled tasks)
- ✅ Flower (task monitoring)
- ✅ Cloudflare R2 (cloud file storage)
- ✅ Sentry (error tracking)
- ✅ Cloudflare CDN (caching + DDoS protection)

### Phase 3: Scale (1,000+ Users)

- ⏳ Load balancer (Nginx/HAProxy/ALB)
- ⏳ Horizontal app scaling (2-4 Uvicorn instances)
- ⏳ PostgreSQL read replica
- ⏳ Dedicated Celery workers per queue

### Phase 4: Enterprise (5,000+ Users)

- ⏳ Redis Sentinel/cluster
- ⏳ Database connection pooling (PgBouncer)
- ⏳ Full monitoring stack (Prometheus/Grafana)
- ⏳ Log aggregation (Loki/ELK)

### Phase 5: High Availability (10,000+ Users)

- ⏳ Multi-region deployment
- ⏳ Database failover (streaming replication)
- ⏳ Container orchestration (Kubernetes)
- ⏳ Global CDN with edge caching

---

## Enterprise Upgrade Checklist

When you're ready to scale beyond 1,000 concurrent users:

### Infrastructure

- [ ] **Load Balancer** - Add Nginx/HAProxy in front of API servers
  - Enables horizontal scaling
  - Health checks and automatic failover
  - SSL termination at edge
- [ ] **Multiple API Servers** - Run 2-4 Uvicorn instances
  - Scale horizontally instead of vertically
  - Blue-green deployments possible
- [ ] **Database Read Replica** - Add PostgreSQL replica
  - Offload read queries from primary
  - Backup without impacting production
- [ ] **Connection Pooling** - Add PgBouncer
  - Reduce database connection overhead
  - Handle connection spikes

### Monitoring & Observability

- [ ] **Prometheus + Grafana** - Metrics dashboards
  - Request latency, error rates, saturation
  - Database connection pool metrics
  - Celery queue lengths
- [ ] **Log Aggregation** - Loki or ELK stack
  - Centralized logs from all services
  - Search and alerting
- [ ] **Alerting** - PagerDuty/OpsGenie integration
  - On-call rotation
  - Escalation policies

### Security

- [ ] **WAF Rules** - Cloudflare or AWS WAF
  - SQL injection protection
  - Rate limiting at edge
  - Bot protection
- [ ] **Secrets Management** - HashiCorp Vault
  - Rotate credentials automatically
  - Audit access to secrets

---

## Next Steps

**You're production-ready now!** Optional improvements:

1. **Enable Sentry** - Add `SENTRY_DSN` for error tracking (free tier)
2. **Enable R2** - Set `STORAGE_BACKEND=r2` for cloud storage (~$5/mo)
3. **Enable Cloudflare** - Proxy domain for CDN + DDoS protection (free tier)
4. **Add load balancer** - When you need horizontal scaling

See also:

- [Production Deployment Guide](production.md)
- [Cloudflare Setup Guide](cloudflare.md)
- [Docker Deployment](docker.md)
- [Environment Configuration](environment.md)
- [Background Tasks Architecture](../architecture/background-tasks.md)
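
The optional integrations listed under Next Steps are all switched on through environment variables (`SENTRY_DSN`, `STORAGE_BACKEND`, `CLOUDFLARE_ENABLED`). As a rough illustration of how startup code might resolve those toggles, here is a minimal sketch using only the standard library. The names `OptionalFeatures` and `load_optional_features` are hypothetical and not taken from the Orion codebase, and the `"local"` default for the storage backend is an assumption based on the "Local / Cloudflare R2" row in the component table.

```python
"""Sketch: resolve the optional production integrations from environment variables."""
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class OptionalFeatures:
    """Hypothetical container for the three toggles named in this guide."""

    sentry_dsn: Optional[str]
    storage_backend: str
    cloudflare_enabled: bool


def load_optional_features(env: Mapping[str, str] = os.environ) -> OptionalFeatures:
    """Read the toggles from `env`; unset or empty values mean "disabled"."""
    # An empty or missing DSN is treated as "Sentry off".
    dsn = env.get("SENTRY_DSN", "").strip() or None
    # Assumption: "local" is the default backend name; the guide only names "r2".
    backend = env.get("STORAGE_BACKEND", "").strip().lower() or "local"
    # Only the literal value "true" (case-insensitive) enables Cloudflare here.
    cloudflare = env.get("CLOUDFLARE_ENABLED", "").strip().lower() == "true"
    return OptionalFeatures(dsn, backend, cloudflare)


if __name__ == "__main__":
    features = load_optional_features()
    print("Sentry:", "enabled" if features.sentry_dsn else "disabled")
    print("Storage backend:", features.storage_backend)
    print("Cloudflare:", "enabled" if features.cloudflare_enabled else "disabled")
```

Passing the environment in as a mapping keeps the function easy to test with a plain dict; the real application may well read these values through its own settings object instead.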