🧠 Smart Quick Fix (Queue Delay + Load/DB): Guidelines for Developers and AI Agents
The goal of the smart Quick Fix is to give a short, safe, and useful recommendation on "what to do now", based on signals we already measure.
The emphasis here is on distinguishing between:
Queueing (infrastructure): the request waits before the application starts handling it (queue_delay).
Slow processing (application): the application is already handling the request, but it takes a long time (duration_ms).
Where it appears
WebApp: the Quick Fix column in the alert history on the Observability dashboard (/admin/observability).
Incident Replay: in the Runbook panel (only action buttons defined in the Runbook).
Telegram (critical alerts): a "Quick Fix" line inside the alert message (best-effort).
Source of truth: where the rules live
The rules are currently defined in config/observability_runbooks.yml under:
quick_fix_rules.latency_v1
The format allows:
Thresholds (thresholds)
A mapping of actions (actions) for each scenario
In addition, there are two more sources of actions:
Runbooks in the same YAML (actions defined inside steps)
Backward compatibility: config/alert_quick_fixes.json
The algorithm in practice:
If there are enough signals, produce a dynamic Quick Fix.
Add Runbook actions (if any exist).
If there is no Runbook or there are no actions, fall back to alert_quick_fixes.json.
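The three steps above can be sketched as a small merge function. This is only an illustration: `resolve_quick_fixes` and its parameters are hypothetical names, and the sketch assumes the legacy JSON is consulted whenever the Runbook contributes no actions (the text does not say whether a dynamic result suppresses the fallback).

```python
from typing import Any, Mapping, Sequence

Action = Mapping[str, Any]


def resolve_quick_fixes(
    dynamic_actions: Sequence[Action],
    runbook_actions: Sequence[Action],
    legacy_actions: Sequence[Action],
) -> list:
    """Combine the three action sources in the order described in the text."""
    # 1. Dynamic Quick Fix (empty when there were not enough signals).
    actions = list(dynamic_actions)
    # 2. Runbook actions, if any exist.
    actions.extend(runbook_actions)
    # 3. No Runbook / no Runbook actions: fall back to alert_quick_fixes.json
    #    (assumption: the fallback applies even when a dynamic fix exists).
    if not runbook_actions:
        actions.extend(legacy_actions)
    return actions
```

For example, calling it with a dynamic action, no Runbook actions, and one legacy action yields the dynamic action followed by the legacy one.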
Decision table (Latency / Slow Response)
Default thresholds (configurable in the YAML):
queue_delay_ms: 500 ms
duration_ms: 3000 ms
Overriding rule: check Queueing first, and move on to Processing only if Queueing is normal.
| Detection | Quick Fix | Why |
|---|---|---|
| High queueing + high DB utilization | Increase Connection Pool / Kill Slow Queries | Not enough free connections, or blocking queries |
| High queueing + high CPU/RAM | Scale Up / Add Workers | The server is at its resource ceiling |
| High queueing + low usage + low DB utilization | Restart Service (Stuck Workers) | Suspected deadlock or stuck worker |
| Low queueing + high duration | Optimize Query / Cache | A code or query problem, not infrastructure |
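The decision table can be read as a short chain of checks. The sketch below is one possible reading, not the project's implementation: `pick_scenario` and the scenario names are hypothetical, the 80% "high" cutoff for CPU/RAM/DB is an assumption (the text only gives the two latency thresholds), and rows are evaluated in table order.

```python
from typing import Optional

QUEUE_DELAY_MS = 500.0  # default threshold from the text
DURATION_MS = 3000.0    # default threshold from the text
HIGH_PCT = 80.0         # ASSUMPTION: the source gives no "high" cutoff


def pick_scenario(
    queue_delay_ms: Optional[float],
    duration_ms: Optional[float],
    cpu_pct: Optional[float] = None,
    mem_pct: Optional[float] = None,
    db_pct: Optional[float] = None,
) -> str:
    """Return a scenario name for one alert; 'fallback' when data is missing."""
    # Queueing is checked first, as the overriding rule requires.
    if queue_delay_ms is not None and queue_delay_ms >= QUEUE_DELAY_MS:
        if db_pct is not None and db_pct >= HIGH_PCT:
            return "increase_pool_or_kill_slow_queries"
        usage = [v for v in (cpu_pct, mem_pct) if v is not None]
        if usage and max(usage) >= HIGH_PCT:
            return "scale_up_or_add_workers"
        # Restart is suggested only when usage AND DB are known to be low.
        if usage and db_pct is not None:
            return "restart_service"
        return "fallback"
    # Queueing is normal (or unknown): look at processing time.
    if duration_ms is not None and duration_ms >= DURATION_MS:
        return "optimize_query_or_cache"
    return "fallback"
```

Returning `"fallback"` whenever a row cannot be confirmed keeps the sketch conservative: a restart is never suggested on the basis of missing metrics.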
Which fields we use today (data schema)
Main signals
Queue Delay
metadata.queue_delay (ms): taken from access_logs (headers X-Queue-Start / X-Request-Start).
In some cases derived values also exist: queue_delay_ms_p95 / queue_delay_ms_avg (best-effort).
Duration
metadata.duration_ms (ms): request handling time.
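To make the queue delay signal concrete, here is a minimal sketch of deriving it from the X-Queue-Start / X-Request-Start headers. It is an assumption about how such headers are commonly encoded (an optional `t=` prefix and an epoch timestamp in seconds, milliseconds, or microseconds), not a description of the project's access log code; `queue_delay_ms_from_headers` is a hypothetical helper.

```python
import time
from typing import Mapping, Optional


def queue_delay_ms_from_headers(
    headers: Mapping[str, str], now_s: Optional[float] = None
) -> Optional[float]:
    """Best-effort queue delay in ms; None when the header is missing or invalid."""
    # Note: a plain dict lookup is case-sensitive; real frameworks usually
    # expose case-insensitive header mappings.
    raw = headers.get("X-Queue-Start") or headers.get("X-Request-Start")
    if not raw:
        return None
    try:
        stamp = float(raw.strip().removeprefix("t="))
    except ValueError:
        return None
    # Normalize the epoch timestamp to seconds (heuristic on magnitude).
    if stamp > 1e14:      # microseconds
        stamp /= 1e6
    elif stamp > 1e11:    # milliseconds
        stamp /= 1e3
    now = time.time() if now_s is None else now_s
    # Clamp at zero so clock skew never produces a negative delay.
    return max(0.0, (now - stamp) * 1000.0)
```

The `now_s` parameter exists only so the function can be exercised with a fixed clock.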
Load and resources
metadata.cpu_percent (percent, best-effort)
metadata.memory_percent (percent, best-effort)
metadata.active_requests (in-flight requests, as an indication of concurrency within the process)
Note: some of the metrics are best-effort and can vary between processes/pods. They are therefore used mainly to choose a safe recommendation, not to draw firm conclusions.
"DB Pool utilization"
In practice we currently use metadata.db_pool_utilization_pct, which is computed from Mongo serverStatus.connections (current/available).
This is not exactly the utilization of the PyMongo client pool.
It is still a good indication of DB pressure when there is queueing.
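As an illustration of the calculation described above, the sketch below derives a percentage from the `connections` section of a `serverStatus` document. The exact formula used by the project is not given in the text; `current / (current + available)` is an assumption, and `db_pool_utilization_pct` here is a hypothetical standalone helper.

```python
from typing import Any, Mapping, Optional


def db_pool_utilization_pct(server_status: Mapping[str, Any]) -> Optional[float]:
    """Server-side connection utilization in percent, or None if unavailable."""
    conns = server_status.get("connections") or {}
    try:
        current = float(conns["current"])
        available = float(conns["available"])
    except (KeyError, TypeError, ValueError):
        return None  # best-effort: missing or malformed data yields no signal
    total = current + available
    if total <= 0:
        return None
    # ASSUMPTION: utilization = current / (current + available).
    return round(100.0 * current / total, 2)
```

With PyMongo, the input document could be obtained with `client.admin.command("serverStatus")`. Because these are server-wide connection counts, the result reflects pressure on the database as a whole rather than on one client's pool.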
Safety principles (important)
Do not suggest dangerous actions by default (deletions/kill/rollback) without "caution" and without clear data.
Prefer safe actions:
copy of ChatOps commands (/triage …, /errors, /status_worker)
links to internal screens (Replay / Dashboard)
If there is not enough data, use a clear fallback (for example "Scale Up" or "run /triage latency").
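One way to enforce these principles is a filter applied just before actions are shown. The sketch below is hypothetical throughout: the `kind` and `caution` fields, the `RISKY_KINDS` set, and the `has_clear_data` flag are illustrative names, not the project's action schema.

```python
from typing import Any, Iterable, Mapping

# ASSUMPTION: action kinds treated as dangerous, following the examples in the text.
RISKY_KINDS = {"delete", "kill", "rollback"}


def safe_actions(
    actions: Iterable[Mapping[str, Any]], has_clear_data: bool
) -> list:
    """Drop risky actions unless they are marked 'caution' AND data is clear."""
    kept = []
    for action in actions:
        risky = action.get("kind") in RISKY_KINDS
        if risky and not (action.get("caution") and has_clear_data):
            continue  # never offer a dangerous action by default
        kept.append(action)
    return kept
```

Safe actions such as copying a ChatOps command or opening an internal link pass through unchanged.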
How to add or change a rule (checklist for developers and AI)
Update the config
Add or change quick_fix_rules.latency_v1.thresholds and/or actions.
Check existing fields
Make sure the fields you want exist in the alert metadata.
If not, add them in the instrumentation layer (see metrics.py / alert_manager.py).
Implement the logic (if needed)
The dynamic logic lives in services/observability_dashboard.py (function _dynamic_quick_fix_actions).
If you added a new signal, extend its extract/normalize step in a fail-open way.
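A fail-open extract/normalize step might look like the sketch below: any missing or malformed value results in "no signal" instead of an exception, so the Quick Fix logic can still fall through to a safe default. `extract_number` is a hypothetical helper, and trying several keys in order (for example the raw value before a derived p95) is an assumption.

```python
import math
from typing import Any, Mapping, Optional


def extract_number(metadata: Mapping[str, Any], *keys: str) -> Optional[float]:
    """Return the first usable non-negative number among keys, else None."""
    for key in keys:
        try:
            value = float(metadata[key])
        except (KeyError, TypeError, ValueError):
            continue  # fail-open: skip anything missing or malformed
        if math.isfinite(value) and value >= 0:
            return value
    return None
```

For instance, `extract_number(meta, "queue_delay", "queue_delay_ms_p95")` would use the derived value only when the primary field is absent or unusable.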
Tests
Rule of thumb: one row in the decision table = one test.
Cover:
queueing + high pool
queueing + high cpu/ram
queueing + low usage
high duration (processing)
missing data (fallback)
UI/Bot
The UI renders alert.quick_fixes as-is.
In the bot, the recommendation is built best-effort from the same source.
Notes for AI agents (how to work correctly)
When real-time information is needed (for example, whether there really is DB pressure), do not guess.
Ask for the relevant ChatOps commands to be run:
/triage latency
/triage db
/triage system
/errors
The bot's output is the source of truth.