alert_manager module
Adaptive alert manager for Smart Observability v4.
Maintains rolling 3h window of request samples (status, latency)
Recomputes adaptive thresholds every 5 minutes based on mean + 3*sigma
Emits critical alerts when current 5m stats breach adaptive thresholds
Sends critical alerts to Telegram and Grafana annotations and logs dispatches
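The "mean + 3*sigma" rule above can be sketched as follows. This is a minimal illustration, not the module's actual code: the helper name `adaptive_threshold`, the use of the population standard deviation, and the sample values are all assumptions.

```python
import statistics


def adaptive_threshold(values, sigmas=3.0):
    """Return mean + sigmas * population std-dev, or None when there is no data.

    Hypothetical helper; the real module may use a different estimator.
    """
    if not values:
        return None
    mean = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return mean + sigmas * sigma


# Hypothetical latency observations (seconds) taken from the rolling window.
latency_history = [0.10, 0.12, 0.11, 0.13, 0.09]
threshold = adaptive_threshold(latency_history)

# A current 5-minute average above the threshold would count as a breach.
current_avg = 0.25
breached = threshold is not None and current_avg > threshold
```

With a near-constant history the band becomes very narrow, which is presumably why the threshold scale and minimum sample count settings exist.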
Environment variables used (optional; fail-open if missing):
ALERT_TELEGRAM_BOT_TOKEN, ALERT_TELEGRAM_CHAT_ID
GRAFANA_URL (e.g. https://grafana.example.com)
GRAFANA_API_TOKEN (Bearer token)
ALERT_COOLDOWN_SEC (override global cooldown between alerts)
ALERT_THRESHOLD_SCALE / ALERT_ERROR_THRESHOLD_SCALE / ALERT_LATENCY_THRESHOLD_SCALE (multiply adaptive thresholds to widen the "normal" band)
ALERT_MIN_SAMPLE_COUNT / ALERT_ERROR_MIN_SAMPLE_COUNT / ALERT_LATENCY_MIN_SAMPLE_COUNT (require a minimum number of samples in the evaluation window before alerting)
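One way to read these optional variables fail-open is sketched below. The helpers `env_float` and `resolve_scale` are hypothetical. The precedence shown (kind-specific variable first, then the generic `ALERT_THRESHOLD_SCALE`, then a default of 1.0) is an assumption, not something the module documentation confirms.

```python
import os


def env_float(name, default):
    """Read a float from the environment; fall back to default if unset or invalid."""
    raw = os.environ.get(name)
    if raw is None or not raw.strip():
        return default
    try:
        return float(raw)
    except ValueError:
        # Fail open: a malformed value must not break alerting.
        return default


def resolve_scale(specific_name, default=1.0):
    """Assumed precedence: kind-specific scale, then ALERT_THRESHOLD_SCALE, then default."""
    generic = env_float("ALERT_THRESHOLD_SCALE", default)
    return env_float(specific_name, generic)


latency_scale = resolve_scale("ALERT_LATENCY_THRESHOLD_SCALE")
```

A scale of 1.0 leaves the adaptive threshold unchanged, while larger values widen the band before an alert fires.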
- alert_manager.reset_state_for_tests()
Clear in-memory buffers (for unit tests).
Not intended for production use.
- Return type: None
- alert_manager.note_request(status_code, duration_seconds, ts=None, *, context=None, source=None)
Record a single request sample into the 3h rolling window.
ts: optional epoch seconds (for tests). Defaults to current time.
- alert_manager.get_current_error_rate_percent(window_sec=300, *, source=None)
Return error rate percent for the last window (default 5 minutes).
- alert_manager.get_current_avg_latency_seconds(window_sec=300, *, source=None)
Return average latency in seconds for the last window (default 5 minutes).
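The two window statistics above could be computed along these lines. This is a sketch over plain `(ts, status_code, duration_seconds)` tuples with hypothetical helper names. Treating only status codes of 500 and above as errors, and returning 0.0 for an empty window, are assumptions that the documentation does not confirm.

```python
def _recent(samples, now_ts, window_sec):
    """Select samples whose timestamp falls in [now_ts - window_sec, now_ts]."""
    start = now_ts - window_sec
    return [s for s in samples if start <= s[0] <= now_ts]


def error_rate_percent(samples, now_ts, window_sec=300):
    """Percentage of recent samples with status >= 500 (assumed error definition)."""
    recent = _recent(samples, now_ts, window_sec)
    if not recent:
        return 0.0
    errors = sum(1 for _, status, _ in recent if status >= 500)
    return 100.0 * errors / len(recent)


def avg_latency_seconds(samples, now_ts, window_sec=300):
    """Mean duration of recent samples, or 0.0 when the window is empty."""
    recent = _recent(samples, now_ts, window_sec)
    if not recent:
        return 0.0
    return sum(duration for _, _, duration in recent) / len(recent)


# Hypothetical data: three samples inside the window, one outside it.
demo = [(100, 200, 0.1), (200, 500, 0.3), (250, 200, 0.2), (900, 200, 0.4)]
rate = error_rate_percent(demo, now_ts=300)
latency = avg_latency_seconds(demo, now_ts=300)
```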
- alert_manager.get_request_stats_between(start_dt, end_dt, *, source=None, percentiles=(50, 95, 99))
Best-effort request stats for an arbitrary time window, computed from the in-memory buffer.
Returns: total (int as float), errors (int as float), and p50/p95/p99 (seconds) if any samples exist.
Note: history is available only for roughly the last 3 hours (WINDOW_SEC).
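As a rough illustration of the return shape described above, the sketch below computes the same fields from a list of `(ts, status_code, duration_seconds)` tuples. The function names, the dict keys, the nearest-rank percentile method, and the "status >= 500 is an error" rule are all assumptions; the real function may differ on any of them.

```python
import math
from datetime import datetime, timezone


def percentile_nearest_rank(sorted_values, pct):
    """Nearest-rank percentile over an ascending, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def stats_between(samples, start_dt, end_dt, percentiles=(50, 95, 99)):
    """Summarize samples whose timestamp lies within [start_dt, end_dt]."""
    start_ts, end_ts = start_dt.timestamp(), end_dt.timestamp()
    selected = [s for s in samples if start_ts <= s[0] <= end_ts]
    out = {
        "total": float(len(selected)),
        "errors": float(sum(1 for s in selected if s[1] >= 500)),
    }
    durations = sorted(s[2] for s in selected)
    if durations:  # percentile keys appear only when samples exist
        for p in percentiles:
            out[f"p{p}"] = percentile_nearest_rank(durations, p)
    return out


# Hypothetical samples: four inside the queried window, one after it.
demo = [(1000, 200, 0.1), (1100, 500, 0.2), (1200, 200, 0.3),
        (1300, 200, 0.4), (5000, 200, 9.9)]
start = datetime.fromtimestamp(1000, tz=timezone.utc)
end = datetime.fromtimestamp(2000, tz=timezone.utc)
result = stats_between(demo, start, end)
```

Because the buffer only retains about three hours of samples, a query that reaches further back would simply see fewer samples rather than fail.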
- alert_manager.bump_threshold(kind, factor=1.2)
Multiply the current adaptive threshold by a factor and refresh gauges (best-effort).
- alert_manager.get_thresholds_snapshot()
Return the latest thresholds for external consumers/tests.
- alert_manager.check_and_emit_alerts(now_ts=None)
Evaluate current stats vs adaptive thresholds and emit critical alerts when breached.
Cooldowns ensure that each alert type is dispatched at most once per 5 minutes.