monitoring.metrics_storage module
Lightweight DB-backed metrics storage with safe, best-effort batching.
Design goals:

- Fail-open: never break app flow if the DB is unavailable or pymongo is missing
- No import-time DB connections; initialize lazily on first flush
- Use environment variables only (avoid importing config to prevent cycles)
- Memory safety under misconfiguration: drop/cap the buffer when storage is unavailable
Environment variables:

- METRICS_DB_ENABLED: "true/1/yes" to enable DB writes (default: false)
- MONGODB_URL: Mongo connection string (required when enabled)
- DATABASE_NAME: database name (default: code_keeper_bot)
- METRICS_COLLECTION: collection name (default: service_metrics)
- METRICS_BATCH_SIZE: batch size threshold (default: 50)
- METRICS_FLUSH_INTERVAL_SEC: time-based flush threshold (default: 5 seconds)
- METRICS_MAX_BUFFER: max queued items in memory (default: 5000)
- METRICS_ROLLUP_SECONDS: rollup bucket size in seconds for DB writes (default: 60)
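A minimal sketch of how these variables could be read without importing the app's config module (the helper name `_load_metrics_config` is hypothetical; the defaults mirror the list above):

```python
import os

def _load_metrics_config(env=os.environ):
    """Read metrics settings from an environment mapping (hypothetical helper).

    Reading os.environ directly avoids importing the config module, which
    the module docstring cites as a way to prevent import cycles.
    """
    return {
        "enabled": env.get("METRICS_DB_ENABLED", "false").strip().lower()
                   in ("true", "1", "yes"),
        "mongodb_url": env.get("MONGODB_URL"),
        "database": env.get("DATABASE_NAME", "code_keeper_bot"),
        "collection": env.get("METRICS_COLLECTION", "service_metrics"),
        "batch_size": int(env.get("METRICS_BATCH_SIZE", "50")),
        "flush_interval_sec": float(env.get("METRICS_FLUSH_INTERVAL_SEC", "5")),
        "max_buffer": int(env.get("METRICS_MAX_BUFFER", "5000")),
        "rollup_seconds": int(env.get("METRICS_ROLLUP_SECONDS", "60")),
    }

# Passing an empty mapping yields the documented defaults:
cfg = _load_metrics_config({})
```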
- monitoring.metrics_storage.enqueue_request_metric(status_code, duration_seconds, *, request_id=None, extra=None)
Queue a single request metric for best-effort DB persistence.
This write path is opt-in via METRICS_DB_ENABLED=true. In production we do not write "one document per request"; instead, requests are rolled up into per-bucket documents to keep MongoDB load low.
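The rollup policy can be sketched as follows: each request's timestamp is floored to a METRICS_ROLLUP_SECONDS bucket, and per-bucket counters accumulate instead of per-request documents (the tuple layout and field names here are illustrative assumptions, not the module's actual schema):

```python
from collections import defaultdict

ROLLUP_SECONDS = 60  # METRICS_ROLLUP_SECONDS default

def bucket_start(ts: float, rollup: int = ROLLUP_SECONDS) -> int:
    """Floor a UNIX timestamp to the start of its rollup bucket."""
    return int(ts) - (int(ts) % rollup)

def roll_up(metrics):
    """Collapse (timestamp, status_code, duration) tuples into per-bucket docs.

    Illustrates the 'no one document per request' policy: each bucket keeps
    a request count, an error count, and a running duration sum.
    """
    buckets = defaultdict(lambda: {"count": 0, "errors": 0, "duration_sum": 0.0})
    for ts, status, duration in metrics:
        doc = buckets[bucket_start(ts)]
        doc["count"] += 1
        doc["duration_sum"] += duration
        if status >= 500:
            doc["errors"] += 1
    return dict(buckets)

# Two requests land in the 120s bucket, one in the 180s bucket:
docs = roll_up([(120.0, 200, 0.1), (130.0, 502, 0.4), (185.0, 200, 0.2)])
```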
- monitoring.metrics_storage.aggregate_request_timeseries(*, start_dt, end_dt, granularity_seconds)
Aggregate request metrics into fixed time buckets.
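One way to bucket documents server-side is a `$group` keyed on the timestamp floored to the granularity. A sketch of such a pipeline (the field names `ts`, `count`, and `errors` are assumptions about the stored schema, not confirmed by the source):

```python
from datetime import datetime, timezone

def timeseries_pipeline(start_dt, end_dt, granularity_seconds):
    """Build a MongoDB aggregation pipeline for fixed time buckets.

    $toLong converts a BSON date to milliseconds since the epoch; the
    $subtract/$mod pair floors it to the bucket boundary.
    """
    gran_ms = granularity_seconds * 1000
    return [
        {"$match": {"ts": {"$gte": start_dt, "$lt": end_dt}}},
        {"$group": {
            "_id": {"$subtract": [{"$toLong": "$ts"},
                                  {"$mod": [{"$toLong": "$ts"}, gran_ms]}]},
            "count": {"$sum": "$count"},
            "errors": {"$sum": "$errors"},
        }},
        {"$sort": {"_id": 1}},
    ]

pipeline = timeseries_pipeline(
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2024, 1, 2, tzinfo=timezone.utc),
    60,
)
```

The pipeline is plain data, so it can be inspected and unit-tested without a live database before being passed to `collection.aggregate(...)`.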
- monitoring.metrics_storage.aggregate_top_endpoints(*, start_dt, end_dt, limit=5)
Return the slowest HTTP endpoints within the given time window.
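Ranking by slowness reduces to sorting endpoints by mean duration. A client-side sketch over rolled-up documents (the `endpoint`, `count`, and `duration_sum` field names are assumptions about the rollup schema):

```python
def top_slowest_endpoints(docs, *, limit=5):
    """Rank endpoints by mean request duration, slowest first.

    Mean = duration_sum / count per endpoint document; buckets with a
    zero count are skipped to avoid division by zero.
    """
    means = []
    for doc in docs:
        if doc["count"]:
            means.append((doc["endpoint"], doc["duration_sum"] / doc["count"]))
    return sorted(means, key=lambda t: t[1], reverse=True)[:limit]

# /api/b averages 0.5s over 4 requests; /api/a averages 0.2s over 10:
rows = top_slowest_endpoints([
    {"endpoint": "/api/a", "count": 10, "duration_sum": 2.0},
    {"endpoint": "/api/b", "count": 4, "duration_sum": 2.0},
])
```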
- monitoring.metrics_storage.average_request_duration(*, start_dt, end_dt)
Return the average request duration for a given window.
- monitoring.metrics_storage.aggregate_error_ratio(*, start_dt, end_dt)
Return total/error counts for the window.
- monitoring.metrics_storage.find_by_request_id(request_id, *, limit=20)
Find metrics records by request_id.
Used by the triage service to provide a fallback when Sentry is unavailable. Returns an empty list on any failure (fail-open).
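The fail-open contract can be sketched as a lookup that swallows any storage error and degrades to an empty result (the function and stub names here are hypothetical; `collection` stands for any pymongo-like object):

```python
import logging

logger = logging.getLogger(__name__)

def find_by_request_id_failopen(collection, request_id, *, limit=20):
    """Fail-open lookup sketch: any DB error yields [] so callers never raise."""
    try:
        cursor = collection.find({"request_id": request_id}).limit(limit)
        return list(cursor)
    except Exception:
        # Log at debug level only; the triage path must keep working.
        logger.debug("metrics lookup failed; returning []", exc_info=True)
        return []

# A stub standing in for an unavailable database:
class _DownCollection:
    def find(self, query):
        raise ConnectionError("db unavailable")

result = find_by_request_id_failopen(_DownCollection(), "req-123")
```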
- monitoring.metrics_storage.aggregate_latency_percentiles(*, start_dt, end_dt, percentiles=(50, 95, 99), sample_limit=5000)
Return latency percentiles (seconds) for the given window.
Best-effort:

- Try Mongo's $percentile aggregation when available.
- Otherwise, sample up to sample_limit records and compute percentiles in Python.
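The pure-Python fallback path can be sketched with a nearest-rank percentile over the sampled durations (the function name and nearest-rank method are assumptions; the module may use a different interpolation):

```python
def percentiles_from_samples(durations, percentiles=(50, 95, 99)):
    """Nearest-rank percentiles over a sample of request durations (seconds).

    Returns {percentile: value}; empty input maps every percentile to None,
    matching the module's best-effort, fail-open style.
    """
    if not durations:
        return {p: None for p in percentiles}
    ordered = sorted(durations)
    n = len(ordered)
    out = {}
    for p in percentiles:
        # Nearest-rank: ceil(p/100 * n), 1-indexed; ceil via negated floor div.
        rank = max(1, -(-p * n // 100))
        out[p] = ordered[rank - 1]
    return out

stats = percentiles_from_samples([0.1, 0.2, 0.3, 0.4, 1.0])
```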