alert_manager module

Adaptive alert manager for Smart Observability v4.

Maintains a rolling 3h window of request samples (status, latency).
Recomputes adaptive thresholds every 5 minutes as mean + 3*sigma.
Emits critical alerts when the current 5m stats breach the adaptive thresholds.
Sends critical alerts to Telegram and to Grafana annotations, and logs each dispatch.
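
The mean + 3*sigma rule above can be sketched as follows. This is a minimal illustration, not the module's code: the function name adaptive_threshold is hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption.

```python
from statistics import fmean, pstdev
from typing import Sequence


def adaptive_threshold(samples: Sequence[float], k: float = 3.0) -> float:
    """Return mean + k * sigma for the given samples (0.0 when empty).

    Hypothetical helper; assumes population standard deviation.
    """
    if not samples:
        return 0.0
    return fmean(samples) + k * pstdev(samples)


# Example: latencies (seconds) collected over the rolling window.
latencies = [0.10, 0.12, 0.11, 0.13, 0.09]
latency_threshold = adaptive_threshold(latencies)

# A current 5m average above the threshold would count as a breach.
is_breach = 0.50 > latency_threshold
```

With k = 3, a value has to sit well outside the recent spread before it counts as a breach, which is why steady traffic with small variance yields a tight threshold.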

Environment variables used (optional; fail-open if missing):

ALERT_TELEGRAM_BOT_TOKEN, ALERT_TELEGRAM_CHAT_ID
GRAFANA_URL (e.g. https://grafana.example.com)
GRAFANA_API_TOKEN (Bearer token)
ALERT_COOLDOWN_SEC (override global cooldown between alerts)
ALERT_THRESHOLD_SCALE / ALERT_ERROR_THRESHOLD_SCALE / ALERT_LATENCY_THRESHOLD_SCALE (multiply adaptive thresholds to widen the "normal" band)
ALERT_MIN_SAMPLE_COUNT / ALERT_ERROR_MIN_SAMPLE_COUNT / ALERT_LATENCY_MIN_SAMPLE_COUNT (require a minimum number of samples in the evaluation window before alerting)
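
One way to read such optional variables in a fail-open manner is sketched below. The helper names, the default of 1.0, and the precedence (a kind-specific scale overriding ALERT_THRESHOLD_SCALE) are assumptions made for illustration; the module's actual resolution order is not documented here.

```python
import os
from typing import Mapping, Optional


def read_float_env(name: str, default: float,
                   env: Optional[Mapping[str, str]] = None) -> float:
    """Read a float from the environment; fall back to default on any problem."""
    source = os.environ if env is None else env
    raw = source.get(name)
    if raw is None or not raw.strip():
        return default  # missing or blank: fail open
    try:
        return float(raw)
    except ValueError:
        return default  # malformed value: fail open


def resolve_scale(kind: str, env: Optional[Mapping[str, str]] = None) -> float:
    """Resolve the threshold scale for kind ('error' or 'latency').

    Assumption: the kind-specific variable wins over the general one.
    """
    general = read_float_env("ALERT_THRESHOLD_SCALE", 1.0, env)
    return read_float_env(f"ALERT_{kind.upper()}_THRESHOLD_SCALE", general, env)


# Example with an explicit mapping instead of the real process environment.
example_env = {"ALERT_THRESHOLD_SCALE": "1.5", "ALERT_LATENCY_THRESHOLD_SCALE": "2"}
error_scale = resolve_scale("error", example_env)
latency_scale = resolve_scale("latency", example_env)
```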

alert_manager.emit_event(event, severity='info', **fields)

Parameters:

event (str)
severity (str)

alert_manager.reset_state_for_tests()

Clear in-memory buffers (for unit tests).

Not intended for production use.

Return type: None

alert_manager.note_request(status_code, duration_seconds, ts=None, *, context=None, source=None)

Record a single request sample into the 3h rolling window.

ts: optional epoch seconds (for tests). Defaults to current time.

Return type:

None

Parameters:

status_code (int)
duration_seconds (float)
ts (float | None)
context (Dict[str, Any] | None)
source (str | None)

alert_manager.get_current_error_rate_percent(window_sec=300, *, source=None)

Return error rate percent for the last window (default 5 minutes).

Return type:

float

Parameters:

window_sec (int)
source (str | None)
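
The computation can be pictured as below. This is a sketch under stated assumptions: the function error_rate_percent is hypothetical, and treating status codes of 500 and above as errors is an assumption, since the documentation does not define what counts as an error.

```python
from typing import Iterable, Tuple


def error_rate_percent(samples: Iterable[Tuple[float, int]],
                       now_ts: float, window_sec: int = 300) -> float:
    """Percent of samples within the last window_sec seconds that are errors.

    samples holds (epoch_ts, status_code) pairs.
    Assumption: a status code >= 500 counts as an error.
    """
    cutoff = now_ts - window_sec
    recent = [status for ts, status in samples if ts >= cutoff]
    if not recent:
        return 0.0  # no traffic in the window: report 0 rather than fail
    errors = sum(1 for status in recent if status >= 500)
    return 100.0 * errors / len(recent)


samples = [(0.0, 500), (900.0, 200), (950.0, 500), (990.0, 200), (1000.0, 200)]
rate = error_rate_percent(samples, now_ts=1000.0)  # first sample is outside the window
```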

alert_manager.get_current_avg_latency_seconds(window_sec=300, *, source=None)

Return average latency in seconds for the last window (default 5 minutes).

Return type:

float

Parameters:

window_sec (int)
source (str | None)

alert_manager.get_request_stats_between(start_dt, end_dt, *, source=None, percentiles=(50, 95, 99))

Best-effort request stats for an arbitrary time window, computed from the in-memory buffer.

Returns:

total (int as float)
errors (int as float)
p50/p95/p99 (seconds), present only when samples exist

Note: history is available only for roughly the last 3 hours (WINDOW_SEC).

Return type:

Dict[str, float]

Parameters:

start_dt (datetime)
end_dt (datetime)
source (str | None)
percentiles (Tuple[int, ...])
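
The shape of the returned dictionary can be illustrated with the sketch below. It is not the module's implementation: stats_between and nearest_rank are hypothetical names, the nearest-rank percentile method is an assumption, and so is counting status codes of 500 and above as errors.

```python
import math
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, Sequence, Tuple

Sample = Tuple[float, int, float]  # (epoch_ts, status_code, duration_seconds)


def nearest_rank(sorted_values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a sorted, non-empty sequence."""
    rank = math.ceil(pct / 100.0 * len(sorted_values))
    rank = min(len(sorted_values), max(1, rank))  # clamp to a valid index
    return sorted_values[rank - 1]


def stats_between(samples: Iterable[Sample], start_dt: datetime, end_dt: datetime,
                  percentiles: Tuple[int, ...] = (50, 95, 99)) -> Dict[str, float]:
    """Totals, error count, and latency percentiles for [start_dt, end_dt]."""
    start_ts, end_ts = start_dt.timestamp(), end_dt.timestamp()
    window = [(status, dur) for ts, status, dur in samples if start_ts <= ts <= end_ts]
    stats: Dict[str, float] = {
        "total": float(len(window)),
        # Assumption: status >= 500 counts as an error.
        "errors": float(sum(1 for status, _ in window if status >= 500)),
    }
    durations = sorted(dur for _, dur in window)
    if durations:  # percentile keys appear only when samples exist
        for pct in percentiles:
            stats[f"p{pct}"] = nearest_rank(durations, pct)
    return stats


base = datetime(2024, 1, 1, tzinfo=timezone.utc)
demo = [(base.timestamp() + i, 500 if i == 9 else 200, float(i + 1)) for i in range(10)]
result = stats_between(demo, base, base + timedelta(seconds=9))
```

Using timezone-aware datetimes, as in the example, avoids ambiguity when converting to epoch seconds.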

alert_manager.bump_threshold(kind, factor=1.2)

Multiply the current adaptive threshold by a factor and refresh gauges (best-effort).

Return type:

None

Parameters:

kind (str)
factor (float)

alert_manager.get_thresholds_snapshot()

Return the latest thresholds for external consumers/tests.

Return type: Dict[str, Dict[str, float]]

alert_manager.check_and_emit_alerts(now_ts=None)

Evaluate current stats vs adaptive thresholds and emit critical alerts when breached.

Cooldowns ensure we do not spam more than once per 5 minutes per alert type.

Return type: None
Parameters: now_ts (float | None)
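
A per-type cooldown of this kind can be sketched as a timestamp lookup. The names should_emit and _last_sent are hypothetical, and the 300-second default simply mirrors the 5-minute figure above.

```python
from typing import Dict

COOLDOWN_SEC = 300.0  # 5 minutes, matching the description above

# Last dispatch time (epoch seconds) per alert type.
_last_sent: Dict[str, float] = {}


def should_emit(alert_type: str, now_ts: float,
                cooldown_sec: float = COOLDOWN_SEC) -> bool:
    """Return True and record the time if alert_type is outside its cooldown."""
    last = _last_sent.get(alert_type)
    if last is not None and now_ts - last < cooldown_sec:
        return False  # still cooling down: suppress the duplicate
    _last_sent[alert_type] = now_ts
    return True


first = should_emit("error_rate", 1000.0)    # nothing sent yet
repeat = should_emit("error_rate", 1100.0)   # only 100s later
other = should_emit("latency", 1100.0)       # different type, own cooldown
later = should_emit("error_rate", 1300.0)    # 300s after the first dispatch
```

Keeping one timestamp per alert type means a latency alert is never suppressed by a recent error-rate alert.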

alert_manager.forward_critical_alert(name, summary, **details)

Public API: forward a critical alert to sinks and log dispatches.

Return type:

None

Parameters:

name (str)
summary (str)
details (Any)
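
For orientation, the two sinks could be addressed with requests like the ones built below. This is a sketch only: the helper names, the message text, and the annotation tags are assumptions, and the payloads the module actually sends may differ. The endpoints are the Telegram Bot API sendMessage method and the Grafana HTTP API annotations endpoint. The requests are constructed but not sent.

```python
import json
import urllib.request


def build_telegram_request(token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a Telegram Bot API sendMessage request."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    body = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def build_grafana_request(base_url: str, api_token: str, text: str,
                          ts_ms: int) -> urllib.request.Request:
    """Build (but do not send) a Grafana annotation request."""
    url = base_url.rstrip("/") + "/api/annotations"
    # Tags are illustrative; Grafana expects the time in epoch milliseconds.
    body = json.dumps({"time": ts_ms, "tags": ["alert", "critical"], "text": text})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_token}"},
        method="POST",
    )


message = "[critical] high_error_rate: error rate above adaptive threshold"
tg = build_telegram_request("TOKEN", "42", message)
gf = build_grafana_request("https://grafana.example.com/", "secret", message, 1700000000000)
# Sending would be urllib.request.urlopen(tg, timeout=5), ideally wrapped so that
# a sink failure is logged rather than raised (fail-open).
```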

alert_manager.get_dispatch_log(limit=50)

Return type: List[Dict[str, Any]]
Parameters: limit (int)