monitoring.metrics_storage module
Lightweight DB-backed metrics storage with safe, best-effort batching.
Design goals:

- Fail-open: never break app flow if the DB is unavailable or pymongo is missing
- No import-time DB connections; initialize lazily on first flush
- Use environment variables only (avoid importing config to prevent cycles)
- Memory safety under misconfiguration: drop/cap the buffer when storage is unavailable
Environment variables:

- METRICS_DB_ENABLED: "true/1/yes" to enable DB writes (default: false)
- MONGODB_URL: Mongo connection string (required when enabled)
- DATABASE_NAME: database name (default: code_keeper_bot)
- METRICS_COLLECTION: collection name (default: service_metrics)
- METRICS_BATCH_SIZE: batch size threshold (default: 50)
- METRICS_FLUSH_INTERVAL_SEC: time-based flush threshold (default: 5 seconds)
- METRICS_MAX_BUFFER: max queued items in memory (default: 5000)
- METRICS_ROLLUP_SECONDS: rollup bucket size in seconds for DB writes (default: 60)
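A minimal sketch of how these variables could be read without importing the app's config module (the helper name `_load_metrics_config` is hypothetical; the defaults mirror the list above):

```python
import os

def _load_metrics_config(env=os.environ):
    """Read metrics settings from an environment mapping (hypothetical helper).

    Reading os.environ directly avoids importing the config module, which
    the module docstring cites as a way to prevent import cycles.
    """
    return {
        "enabled": env.get("METRICS_DB_ENABLED", "false").strip().lower()
                   in ("true", "1", "yes"),
        "mongodb_url": env.get("MONGODB_URL"),
        "database": env.get("DATABASE_NAME", "code_keeper_bot"),
        "collection": env.get("METRICS_COLLECTION", "service_metrics"),
        "batch_size": int(env.get("METRICS_BATCH_SIZE", "50")),
        "flush_interval_sec": float(env.get("METRICS_FLUSH_INTERVAL_SEC", "5")),
        "max_buffer": int(env.get("METRICS_MAX_BUFFER", "5000")),
        "rollup_seconds": int(env.get("METRICS_ROLLUP_SECONDS", "60")),
    }

# Passing an empty mapping yields the documented defaults:
cfg = _load_metrics_config({})
```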
- monitoring.metrics_storage.enqueue_request_metric(status_code, duration_seconds, *, request_id=None, extra=None)
Queue a single request metric for best-effort DB persistence.
This write path is opt-in via METRICS_DB_ENABLED=true. In production we do not write "one document per request"; instead, requests are rolled up into per-bucket documents to keep MongoDB load low.
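The rollup policy can be sketched as follows: each request's timestamp is floored to a METRICS_ROLLUP_SECONDS bucket, and per-bucket counters accumulate instead of per-request documents (the tuple layout and field names here are illustrative assumptions, not the module's actual schema):

```python
from collections import defaultdict

ROLLUP_SECONDS = 60  # METRICS_ROLLUP_SECONDS default

def bucket_start(ts: float, rollup: int = ROLLUP_SECONDS) -> int:
    """Floor a UNIX timestamp to the start of its rollup bucket."""
    return int(ts) - (int(ts) % rollup)

def roll_up(metrics):
    """Collapse (timestamp, status_code, duration) tuples into per-bucket docs.

    Illustrates the 'no one document per request' policy: each bucket keeps
    a request count, an error count, and a running duration sum.
    """
    buckets = defaultdict(lambda: {"count": 0, "errors": 0, "duration_sum": 0.0})
    for ts, status, duration in metrics:
        doc = buckets[bucket_start(ts)]
        doc["count"] += 1
        doc["duration_sum"] += duration
        if status >= 500:
            doc["errors"] += 1
    return dict(buckets)

# Two requests land in the 120s bucket, one in the 180s bucket:
docs = roll_up([(120.0, 200, 0.1), (130.0, 502, 0.4), (185.0, 200, 0.2)])
```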
- monitoring.metrics_storage.aggregate_request_timeseries(*, start_dt, end_dt, granularity_seconds)
Aggregate request metrics into fixed time buckets.
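One way to bucket documents server-side is a `$group` keyed on the timestamp floored to the granularity. A sketch of such a pipeline (the field names `ts`, `count`, and `errors` are assumptions about the stored schema, not confirmed by the source):

```python
from datetime import datetime, timezone

def timeseries_pipeline(start_dt, end_dt, granularity_seconds):
    """Build a MongoDB aggregation pipeline for fixed time buckets.

    $toLong converts a BSON date to milliseconds since the epoch; the
    $subtract/$mod pair floors it to the bucket boundary.
    """
    gran_ms = granularity_seconds * 1000
    return [
        {"$match": {"ts": {"$gte": start_dt, "$lt": end_dt}}},
        {"$group": {
            "_id": {"$subtract": [{"$toLong": "$ts"},
                                  {"$mod": [{"$toLong": "$ts"}, gran_ms]}]},
            "count": {"$sum": "$count"},
            "errors": {"$sum": "$errors"},
        }},
        {"$sort": {"_id": 1}},
    ]

pipeline = timeseries_pipeline(
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2024, 1, 2, tzinfo=timezone.utc),
    60,
)
```

The pipeline is plain data, so it can be inspected and unit-tested without a live database before being passed to `collection.aggregate(...)`.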
- monitoring.metrics_storage.aggregate_top_endpoints(*, start_dt, end_dt, limit=5)
Return the slowest HTTP endpoints within the given time window.
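Ranking by slowness reduces to sorting endpoints by mean duration. A client-side sketch over rolled-up documents (the `endpoint`, `count`, and `duration_sum` field names are assumptions about the rollup schema):

```python
def top_slowest_endpoints(docs, *, limit=5):
    """Rank endpoints by mean request duration, slowest first.

    Mean = duration_sum / count per endpoint document; buckets with a
    zero count are skipped to avoid division by zero.
    """
    means = []
    for doc in docs:
        if doc["count"]:
            means.append((doc["endpoint"], doc["duration_sum"] / doc["count"]))
    return sorted(means, key=lambda t: t[1], reverse=True)[:limit]

# /api/b averages 0.5s over 4 requests; /api/a averages 0.2s over 10:
rows = top_slowest_endpoints([
    {"endpoint": "/api/a", "count": 10, "duration_sum": 2.0},
    {"endpoint": "/api/b", "count": 4, "duration_sum": 2.0},
])
```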
- monitoring.metrics_storage.average_request_duration(*, start_dt, end_dt)
Return the average request duration for a given window.
- monitoring.metrics_storage.aggregate_error_ratio(*, start_dt, end_dt)
Return total/error counts for the window.
- monitoring.metrics_storage.find_by_request_id(request_id, *, limit=20)
Find metrics records by request_id.
Used by the triage service to provide a fallback when Sentry is unavailable. Returns an empty list on any failure (fail-open).
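The fail-open contract can be sketched as a lookup that swallows any storage error and degrades to an empty result (the function and stub names here are hypothetical; `collection` stands for any pymongo-like object):

```python
import logging

logger = logging.getLogger(__name__)

def find_by_request_id_failopen(collection, request_id, *, limit=20):
    """Fail-open lookup sketch: any DB error yields [] so callers never raise."""
    try:
        cursor = collection.find({"request_id": request_id}).limit(limit)
        return list(cursor)
    except Exception:
        # Log at debug level only; the triage path must keep working.
        logger.debug("metrics lookup failed; returning []", exc_info=True)
        return []

# A stub standing in for an unavailable database:
class _DownCollection:
    def find(self, query):
        raise ConnectionError("db unavailable")

result = find_by_request_id_failopen(_DownCollection(), "req-123")
```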
- monitoring.metrics_storage.aggregate_latency_percentiles(*, start_dt, end_dt, percentiles=(50, 95, 99), sample_limit=5000)
Return latency percentiles (seconds) for the given window.
Best-effort:

- Try Mongo's $percentile aggregation when available.
- Otherwise, sample up to sample_limit records and compute percentiles in Python.
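The pure-Python fallback path can be sketched with a nearest-rank percentile over the sampled durations (the function name and nearest-rank method are assumptions; the module may use a different interpolation):

```python
def percentiles_from_samples(durations, percentiles=(50, 95, 99)):
    """Nearest-rank percentiles over a sample of request durations (seconds).

    Returns {percentile: value}; empty input maps every percentile to None,
    matching the module's best-effort, fail-open style.
    """
    if not durations:
        return {p: None for p in percentiles}
    ordered = sorted(durations)
    n = len(ordered)
    out = {}
    for p in percentiles:
        # Nearest-rank: ceil(p/100 * n), 1-indexed; ceil via negated floor div.
        rank = max(1, -(-p * n // 100))
        out[p] = ordered[rank - 1]
    return out

stats = percentiles_from_samples([0.1, 0.2, 0.3, 0.4, 1.0])
```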