alert_manager module

Adaptive alert manager for Smart Observability v4.

  • Maintains a rolling 3h window of request samples (status, latency)

  • Recomputes adaptive thresholds every 5 minutes based on mean + 3*sigma

  • Emits critical alerts when current 5m stats breach adaptive thresholds

  • Sends critical alerts to Telegram and to Grafana (as annotations), and logs each dispatch
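
The threshold rule in the list above (mean + 3*sigma) can be sketched as follows. This is a minimal illustration and not the module's actual code: the helper names `adaptive_threshold` and `breaches` are hypothetical, and using the population standard deviation is an assumption.

```python
import statistics


def adaptive_threshold(samples, k=3.0):
    """Return mean + k * sigma for a list of numeric samples.

    Assumption: sigma is the population standard deviation; the real
    module may compute it differently.
    """
    if not samples:
        return 0.0  # no data yet, so no meaningful threshold
    mean = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    return mean + k * sigma


def breaches(current_value, threshold):
    """True when the current 5-minute value exceeds a positive threshold."""
    return threshold > 0 and current_value > threshold
```

With a steady history the band is tight, so a small deviation breaches it; a noisy history widens the band automatically.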

Environment variables used (optional; fail-open if missing):

  • ALERT_TELEGRAM_BOT_TOKEN, ALERT_TELEGRAM_CHAT_ID

  • GRAFANA_URL (e.g. https://grafana.example.com)

  • GRAFANA_API_TOKEN (Bearer token)

  • ALERT_COOLDOWN_SEC (override global cooldown between alerts)

  • ALERT_THRESHOLD_SCALE / ALERT_ERROR_THRESHOLD_SCALE / ALERT_LATENCY_THRESHOLD_SCALE (multiply adaptive thresholds to widen the "normal" band)

  • ALERT_MIN_SAMPLE_COUNT / ALERT_ERROR_MIN_SAMPLE_COUNT / ALERT_LATENCY_MIN_SAMPLE_COUNT (require a minimum number of samples in the evaluation window before alerting)
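
One way to read such optional settings without failing when they are missing or malformed is sketched here. The helpers `env_float` and `scale_for` are hypothetical, and the precedence shown (a kind-specific variable overrides the general one, with 1.0 as the default) is an assumption rather than documented behavior.

```python
import os


def env_float(name, default, env=None):
    """Read a float from the environment, falling back to `default`.

    Fail-open: a missing, empty, or unparsable value never raises.
    """
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        return float(raw)
    except ValueError:
        return default


def scale_for(kind, env=None):
    """Resolve the threshold scale for kind 'error' or 'latency'.

    Assumption: the kind-specific variable wins over ALERT_THRESHOLD_SCALE.
    """
    general = env_float("ALERT_THRESHOLD_SCALE", 1.0, env)
    specific_name = f"ALERT_{kind.upper()}_THRESHOLD_SCALE"
    return env_float(specific_name, general, env)
```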

alert_manager.emit_event(event, severity='info', **fields)

Parameters:
  • event

  • severity

  • fields

alert_manager.reset_state_for_tests()

Clear in-memory buffers (for unit tests).

Not intended for production use.

Return type:

None

alert_manager.note_request(status_code, duration_seconds, ts=None, *, context=None, source=None)

Record a single request sample into the 3h rolling window.

ts: optional epoch seconds (for tests). Defaults to current time.

Return type:

None

Parameters:
  • status_code

  • duration_seconds

  • ts

  • context

  • source

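
A rolling window of this kind is commonly built on a deque with eviction on insert. The sketch below is an assumption about the general technique, not the module's implementation; the class `RollingWindow` is hypothetical, and it assumes samples arrive in roughly increasing timestamp order.

```python
import time
from collections import deque

WINDOW_SEC = 3 * 60 * 60  # 3 hours, matching the documented window


class RollingWindow:
    """Keep (ts, status_code, duration_seconds) tuples for the last window_sec."""

    def __init__(self, window_sec=WINDOW_SEC):
        self.window_sec = window_sec
        self._samples = deque()

    def note(self, status_code, duration_seconds, ts=None):
        # ts defaults to the current time; tests can pass explicit epoch seconds.
        ts = time.time() if ts is None else float(ts)
        self._samples.append((ts, int(status_code), float(duration_seconds)))
        self._evict(ts)

    def _evict(self, now):
        # Drop samples older than the window from the left end.
        cutoff = now - self.window_sec
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()

    def __len__(self):
        return len(self._samples)
```

Passing `ts` explicitly makes eviction deterministic in unit tests, which is presumably why the documented function accepts it.
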
alert_manager.get_current_error_rate_percent(window_sec=300, *, source=None)

Return error rate percent for the last window (default 5 minutes).

Return type:

float

Parameters:
  • window_sec (int)

  • source (str | None)
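
A windowed error rate can be computed along these lines. This is a hedged sketch: the function `error_rate_percent` is hypothetical, and treating status codes of 500 and above as errors is an assumption, since the page does not define what counts as an error.

```python
def error_rate_percent(samples, now_ts, window_sec=300):
    """Percent of samples in [now_ts - window_sec, now_ts] that are errors.

    `samples` is an iterable of (ts, status_code) pairs.
    Assumption: status_code >= 500 counts as an error.
    """
    start_ts = now_ts - window_sec
    recent = [status for ts, status in samples if start_ts <= ts <= now_ts]
    if not recent:
        return 0.0  # no traffic in the window
    errors = sum(1 for status in recent if status >= 500)
    return 100.0 * errors / len(recent)
```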

alert_manager.get_current_avg_latency_seconds(window_sec=300, *, source=None)

Return average latency in seconds for the last window (default 5 minutes).

Return type:

float

Parameters:
  • window_sec (int)

  • source (str | None)

alert_manager.get_request_stats_between(start_dt, end_dt, *, source=None, percentiles=(50, 95, 99))

Best-effort request stats for an arbitrary time window, computed from the in-memory buffer.

Returns:
  • total (int as float)

  • errors (int as float)

  • p50/p95/p99 (seconds), present only when samples exist

Note: history is available only for about the last 3 hours (WINDOW_SEC).

Return type:

Dict[str, float]

Parameters:
  • start_dt

  • end_dt

  • source

  • percentiles

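
The shape of the returned dictionary can be illustrated with a small stand-alone sketch. The functions `percentile` and `request_stats_between` below are hypothetical; the nearest-rank percentile method, the error rule (status 500 and above), and the use of timezone-aware datetimes are all assumptions.

```python
import math
from datetime import datetime, timezone


def percentile(sorted_values, pct):
    """Nearest-rank percentile of a non-empty ascending list (pct in 0..100)."""
    if not sorted_values:
        raise ValueError("percentile() needs at least one value")
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


def request_stats_between(samples, start_dt, end_dt, percentiles=(50, 95, 99)):
    """Stats for (ts, status_code, duration_seconds) samples in [start_dt, end_dt]."""
    start_ts, end_ts = start_dt.timestamp(), end_dt.timestamp()
    window = [(status, dur) for ts, status, dur in samples if start_ts <= ts <= end_ts]
    stats = {
        "total": float(len(window)),
        "errors": float(sum(1 for status, _ in window if status >= 500)),
    }
    durations = sorted(dur for _, dur in window)
    if durations:  # percentile keys are added only when samples exist
        for pct in percentiles:
            stats[f"p{pct}"] = percentile(durations, pct)
    return stats


# Example: ten samples, two of them errors, durations 0.1s to 1.0s.
samples = [(float(i), 500 if i in (3, 7) else 200, i / 10) for i in range(1, 11)]
start = datetime.fromtimestamp(0, tz=timezone.utc)
end = datetime.fromtimestamp(100, tz=timezone.utc)
stats = request_stats_between(samples, start, end)
```

Because the buffer only holds about 3 hours of samples, a caller asking for an older range would get zero totals and no percentile keys in this sketch.
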
alert_manager.bump_threshold(kind, factor=1.2)

Multiply the current adaptive threshold by a factor and refresh gauges (best-effort).

Return type:

None

Parameters:
  • kind

  • factor

alert_manager.get_thresholds_snapshot()

Return the latest thresholds for external consumers/tests.

Return type:

Dict[str, Dict[str, float]]
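
How a threshold bump and a snapshot might interact is sketched below. Everything here is illustrative: the dictionary keys (`error_rate_percent`, `latency_seconds`, `threshold`), the starting values, and the helper names `bump` and `snapshot` are assumptions, not the module's real state layout.

```python
import copy
import threading

_lock = threading.Lock()

# Hypothetical state with the documented Dict[str, Dict[str, float]] shape.
_thresholds = {
    "error_rate_percent": {"threshold": 5.0},
    "latency_seconds": {"threshold": 2.0},
}


def bump(kind, factor=1.2):
    """Multiply the threshold for `kind` by `factor`; ignore unknown kinds."""
    with _lock:
        entry = _thresholds.get(kind)
        if entry is None:
            return  # best-effort: an unknown kind is silently skipped
        entry["threshold"] *= float(factor)


def snapshot():
    """Return a deep copy so callers cannot mutate the live thresholds."""
    with _lock:
        return copy.deepcopy(_thresholds)
```

Returning a copy is a deliberate choice in this sketch: tests and external consumers can inspect or modify the result without affecting alert evaluation.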

alert_manager.check_and_emit_alerts(now_ts=None)

Evaluate current stats vs adaptive thresholds and emit critical alerts when breached.

Cooldowns ensure that each alert type fires at most once per 5 minutes.

Return type:

None

Parameters:

now_ts (float | None)
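
A per-alert-type cooldown can be sketched like this. The class `Cooldown` and its method `allow` are hypothetical names; the 300-second default mirrors the 5 minutes stated above, and how the real module stores the last-sent times is not documented here.

```python
import time


class Cooldown:
    """Allow each alert type through at most once per cooldown_sec."""

    def __init__(self, cooldown_sec=300.0):
        self.cooldown_sec = cooldown_sec
        self._last_sent = {}  # alert type -> epoch seconds of last emission

    def allow(self, alert_type, now_ts=None):
        # now_ts can be injected for tests, like the documented now_ts argument.
        now = time.time() if now_ts is None else float(now_ts)
        last = self._last_sent.get(alert_type)
        if last is not None and now - last < self.cooldown_sec:
            return False  # still cooling down: suppress this alert
        self._last_sent[alert_type] = now
        return True
```

Tracking the timestamp per alert type means a latency alert is not suppressed just because an error-rate alert fired a minute earlier.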

alert_manager.forward_critical_alert(name, summary, **details)

Public API: forward a critical alert to sinks and log dispatches.

Return type:

None

Parameters:
  • name

  • summary

  • details

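
The two sinks can be pictured as plain HTTP requests. The sketch below only builds the requests and sends nothing. It assumes the Telegram Bot API `sendMessage` method and the Grafana `POST /api/annotations` endpoint; the helper names, the message format, and the `alert` tag are hypothetical, and the module's real payloads may differ.

```python
import json
import urllib.request


def build_telegram_request(bot_token, chat_id, text):
    """Build (not send) a Telegram Bot API sendMessage request."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    body = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def build_grafana_annotation_request(grafana_url, api_token, text, tags=("alert",)):
    """Build (not send) a Grafana annotation request with a Bearer token."""
    url = grafana_url.rstrip("/") + "/api/annotations"
    body = json.dumps({"text": text, "tags": list(tags)}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_token}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def build_sink_requests(name, summary, env):
    """Return requests only for sinks whose variables are set (fail-open)."""
    text = f"[CRITICAL] {name}: {summary}"  # hypothetical message format
    requests = []
    token, chat_id = env.get("ALERT_TELEGRAM_BOT_TOKEN"), env.get("ALERT_TELEGRAM_CHAT_ID")
    if token and chat_id:
        requests.append(build_telegram_request(token, chat_id, text))
    url, api_token = env.get("GRAFANA_URL"), env.get("GRAFANA_API_TOKEN")
    if url and api_token:
        requests.append(build_grafana_annotation_request(url, api_token, text))
    return requests
```

A caller would pass each request to `urllib.request.urlopen` inside a try/except so that a failing sink never breaks the request path.
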
alert_manager.get_dispatch_log(limit=50)

Return type:

List[Dict[str, Any]]

Parameters:

limit (int)
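
A bounded in-memory dispatch log with a `limit` argument might look like the following sketch. The names `record_dispatch` and `dispatch_log`, the capacity of 200 entries, and the oldest-first ordering of the result are assumptions.

```python
from collections import deque

# Hypothetical capacity; old entries fall off automatically when it is full.
_DISPATCH_LOG = deque(maxlen=200)


def record_dispatch(entry):
    """Append a copy of a dispatch record (a dict) to the log."""
    _DISPATCH_LOG.append(dict(entry))


def dispatch_log(limit=50):
    """Return up to `limit` of the most recent records, oldest first."""
    if limit <= 0:
        return []
    return list(_DISPATCH_LOG)[-limit:]
```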