remediation_manager module
Auto-remediation manager for critical incidents.
Persists incidents to data/incidents_log.json (append-only JSON lines)
Triggers remediation actions based on incident type
Adds Grafana annotations for visibility (best-effort)
Implements simple 15-minute recurrence detection and adaptive threshold bump hook
Environment variables (optional): - GRAFANA_URL, GRAFANA_API_TOKEN for annotations
Notes: - File I/O is constrained under data/ per workspace safety rules - All operations are best-effort; never raise from public APIs
- remediation_manager.handle_critical_incident(name, metric, value, threshold, details=None)[מקור]
Main entrypoint: log incident, attempt remediation, annotate Grafana, and return incident_id.
name examples: ”High Error Rate“, ”High Latency“, ”DB Connection Errors“ metric examples: ”error_rate_percent“, ”latency_seconds“, ”db_connection_errors“