orion/app/modules/loyalty/docs/monitoring.md
Samir Boulahtit 4a60d75a13
docs(loyalty): Phase 8 — runbooks, monitoring, OpenAPI tags, plan update
Final phase of the production launch plan:

- Runbook: wallet certificate management (Google + Apple rotation,
  expiry monitoring, rollback procedure)
- Runbook: point expiration task (manual execution, partial failure,
  per-merchant re-run, point restore via admin API)
- Runbook: wallet sync task (failed_card_ids interpretation, manual
  re-sync, retry behavior table)
- Monitoring: alert definitions (P0/P1/P2), key metrics, log events,
  dashboard suggestions
- OpenAPI: added tags=["Loyalty - Store"] and tags=["Loyalty - Admin"]
  to route groups for /docs discoverability
- Production launch plan: all phases 0-8 marked DONE

Coverage note: loyalty services at 70-85%, tasks at 16-29%.
Target 80% enforcement deferred — current 342 tests provide good
functional coverage. Task-level coverage requires Celery mocking
infrastructure (future sprint).

342 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 23:07:50 +02:00


Loyalty Module — Monitoring & Alerting

Alert Definitions

P0 — Page (immediate action required)

| Alert | Condition | Action |
|-------|-----------|--------|
| Expiration task stale | `loyalty.expire_points` last success > 26 hours ago | Check Celery worker health, inspect task logs |
| Google Wallet service down | Wallet sync failure rate > 50% for 2 consecutive runs | Check service account credentials, Google API status |
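
The two P0 conditions can be evaluated with a few lines of Python. This is a minimal sketch, not the module's actual alerting code: the function names are hypothetical, and it assumes the last-success timestamp and the per-run failure rates are already available from whatever stores task results.

```python
from datetime import datetime, timedelta, timezone

# Thresholds taken from the P0 table.
STALE_AFTER = timedelta(hours=26)
FAILURE_RATE_LIMIT = 0.5
CONSECUTIVE_RUNS = 2


def expiration_task_is_stale(last_success: datetime, now: datetime) -> bool:
    """True when the last successful loyalty.expire_points run is more than 26 hours old."""
    return now - last_success > STALE_AFTER


def wallet_service_is_down(failure_rates: list[float]) -> bool:
    """True when each of the two most recent sync runs exceeded a 50% failure rate.

    failure_rates is ordered oldest to newest, one value per sync run (assumed input).
    """
    recent = failure_rates[-CONSECUTIVE_RUNS:]
    return len(recent) == CONSECUTIVE_RUNS and all(
        rate > FAILURE_RATE_LIMIT for rate in recent
    )
```

Requiring two consecutive runs above the limit keeps a single bad run from paging anyone, which matches the wording of the condition.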

P1 — Warn (investigate within business hours)

| Alert | Condition | Action |
|-------|-----------|--------|
| Wallet sync failures | `failed_card_ids` count > 5% of total cards synced | Check `runbook-wallet-sync.md`, inspect failed card IDs |
| Email notification failures | `loyalty_*` template send failure rate > 1% in 24h | Check SMTP config, `EmailLog` for errors |
| Rate limit spikes | 429 responses > 100/min per store | Investigate if legitimate traffic or abuse |
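
The P1 conditions are all simple threshold comparisons. The sketch below shows one way to express them. The helper names are hypothetical and the inputs (failed card IDs, send counts, per-minute 429 counts) are assumed to be collected elsewhere.

```python
def exceeds_ratio(failed: int, total: int, limit: float) -> bool:
    """True when failed/total is strictly above limit; an empty sample never alerts."""
    if total <= 0:
        return False
    return failed / total > limit


def wallet_sync_needs_attention(failed_card_ids: list[int], total_synced: int) -> bool:
    # More than 5% of synced cards ended up in failed_card_ids.
    return exceeds_ratio(len(failed_card_ids), total_synced, 0.05)


def email_failures_need_attention(failed_sends: int, attempted_sends: int) -> bool:
    # More than 1% of loyalty_* template sends failed in the 24h window.
    return exceeds_ratio(failed_sends, attempted_sends, 0.01)


def rate_limit_spike(responses_429_last_minute: int) -> bool:
    # More than 100 HTTP 429 responses in one minute for a single store.
    return responses_429_last_minute > 100
```

Treating an empty sample as "no alert" avoids a division by zero on days with no sync or email activity. Whether silence should itself alert is a separate decision.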

P2 — Info (review in next sprint)

| Alert | Condition | Action |
|-------|-----------|--------|
| High churn | At-risk cards > 20% of active cards | Review re-engagement strategy (future marketing module) |
| Low enrollment | < 5 new cards in 7 days (per merchant with active program) | Check enrollment page accessibility, QR code placement |
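
The low-enrollment condition is evaluated per merchant, so it needs a grouping step before the threshold check. A minimal sketch follows, assuming enrollments are available as `(merchant_id, enrolled_at)` pairs. The function name and input shape are illustrative, not part of the loyalty module.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def low_enrollment_merchants(
    active_merchant_ids: set[int],
    enrollments: list[tuple[int, datetime]],
    now: datetime,
    window: timedelta = timedelta(days=7),
    minimum: int = 5,
) -> list[int]:
    """Return merchants with an active program and fewer than `minimum` new cards in `window`."""
    cutoff = now - window
    # Count only enrollments that fall inside the window.
    counts = Counter(
        merchant_id for merchant_id, enrolled_at in enrollments if enrolled_at >= cutoff
    )
    # Counter returns 0 for merchants with no enrollments, so they are flagged too.
    return sorted(m for m in active_merchant_ids if counts[m] < minimum)
```

Iterating over the active merchants rather than over the enrollment rows ensures that a merchant with zero new cards still appears in the result.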

Key Metrics to Track

Operational

  • Celery task success/failure counts for loyalty.expire_points and loyalty.sync_wallet_passes
  • EmailLog status distribution for loyalty_* template codes (sent/failed/bounced)
  • Rate limiter 429 response count per store per hour

Business

  • Daily new enrollments (total + per merchant)
  • Ratio of points redeemed to points issued (health indicator: redemption rate should be > 0.3)
  • Stamp completion rate (% of cards reaching stamps_target)
  • Cohort retention at month 3 (target: > 40%)
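
Two of these business metrics reduce to simple ratios. The sketch below assumes that "redemption rate" means points redeemed divided by points issued, and that completion means a card's stamp count has reached `stamps_target`. Both function names are hypothetical.

```python
def redemption_rate(points_issued: int, points_redeemed: int) -> float:
    """Share of issued points that were redeemed (assumed definition)."""
    if points_issued <= 0:
        return 0.0
    return points_redeemed / points_issued


def stamp_completion_rate(stamp_counts: list[int], stamps_target: int) -> float:
    """Fraction of cards whose stamp count has reached stamps_target."""
    if not stamp_counts:
        return 0.0
    completed = sum(1 for stamps in stamp_counts if stamps >= stamps_target)
    return completed / len(stamp_counts)
```

For example, 400 points redeemed out of 1000 issued gives a rate of 0.4, which is above the 0.3 health threshold.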

Observability Integration

The loyalty module logs to the standard Python logger (app.modules.loyalty.*). Key log events:

| Logger | Level | Event |
|--------|-------|-------|
| `card_service` | INFO | Enrollment, deactivation, GDPR anonymization |
| `stamp_service` | INFO | Stamp add/redeem/void with card and store context |
| `points_service` | INFO | Points earn/redeem/void/adjust |
| `notification_service` | INFO | Email queued (template_code + recipient) |
| `point_expiration` | INFO | Chunk processed (cards + points count) |
| `wallet_sync` | WARNING | Per-card sync failure with retry count |
| `wallet_sync` | ERROR | Card sync exhausted all retries |
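
Because the module uses standard Python logging, a handler attached to the `app.modules.loyalty` parent logger sees every event listed above. The following is a minimal sketch of counting WARNING and ERROR records per logger, for example to feed an alert. `CountingHandler` is a hypothetical helper, and the exact child logger name `app.modules.loyalty.wallet_sync` is an assumption about how the loggers are named.

```python
import logging


class CountingHandler(logging.Handler):
    """Counts records at WARNING or above, keyed by (logger name, level name)."""

    def __init__(self) -> None:
        super().__init__(level=logging.WARNING)
        self.counts: dict[tuple[str, str], int] = {}

    def emit(self, record: logging.LogRecord) -> None:
        key = (record.name, record.levelname)
        self.counts[key] = self.counts.get(key, 0) + 1


handler = CountingHandler()
# Child loggers propagate to this parent, so one handler covers the whole module.
logging.getLogger("app.modules.loyalty").addHandler(handler)

# Assumed child logger name, used here only to simulate wallet sync events.
log = logging.getLogger("app.modules.loyalty.wallet_sync")
log.warning("sync failed card_id=%s retry=%s", 42, 1)
log.error("sync exhausted retries card_id=%s", 42)
log.info("sync ok card_id=%s", 43)  # below WARNING, not counted
```

Attaching the handler to the parent logger rather than to each service keeps the setup in one place and picks up new loggers automatically if they are added under the same namespace.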

Dashboard Suggestions

If using Grafana or similar:

  1. Enrollment funnel: Page views → Form starts → Submissions → Success (track drop-off)
  2. Transaction volume: Stamps + Points per hour, grouped by store
  3. Wallet adoption: % of cards with Google/Apple Wallet passes
  4. Email delivery: Sent → Delivered → Opened → Clicked per template
  5. Task health: Celery task execution time + success rate over 24h
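
For the enrollment funnel, the drop-off between consecutive stages is the quantity a dashboard panel would plot. This is a small sketch of that calculation. The function name and the stage counts in the usage example are illustrative, not real data.

```python
def funnel_dropoff(stages: list[tuple[str, int]]) -> list[tuple[str, float]]:
    """Return the fraction of users lost between each pair of consecutive stages."""
    result = []
    for (prev_name, prev_count), (name, count) in zip(stages, stages[1:]):
        # Guard against an empty earlier stage to avoid dividing by zero.
        lost = (prev_count - count) / prev_count if prev_count else 0.0
        result.append((f"{prev_name} -> {name}", lost))
    return result


# Illustrative counts only.
example = funnel_dropoff(
    [("Page views", 1000), ("Form starts", 400), ("Submissions", 200), ("Success", 100)]
)
```

With these example counts, 60% of visitors are lost between page views and form starts, and 50% at each of the later two steps.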