# 🧠 Smart Quick Fix (Queue Delay + Load/DB): Guidelines for Developers and AI Agents

The goal of **Quick Fix** is to give a short, safe, and useful recommendation for “what to do now”, based on signals we already measure.

The emphasis here is on distinguishing between:

- **Queueing (infrastructure):** the request *waits before the application starts handling it* (`queue_delay`).
- **Slow processing (application):** the application is already handling the request, but it takes a long time (`duration_ms`).

## Where it appears

- **WebApp**: the **Quick Fix** column in the alert history on the Observability dashboard (`/admin/observability`).
- **Incident Replay**: in the Runbook panel (only action buttons that are defined in the Runbook).
- **Telegram (critical alerts)**: a “Quick Fix” line inside the alert message (best-effort).

## Source of truth: where the rules live

The dynamic rules are defined in `config/observability_runbooks.yml` under:

- `quick_fix_rules.latency_v1`

The format allows:

- **Thresholds** (thresholds)
- **Action mapping** (actions) for each scenario

In addition, there are two more sources of actions:

- **Runbooks** in the same YAML (actions defined inside steps)
- **Backward compatibility**: `config/alert_quick_fixes.json`

The algorithm in practice:

1. If there are enough signals → produce a dynamic Quick Fix.
2. Add Runbook actions (if any exist).
3. If there is no Runbook / there are no actions → fall back to `alert_quick_fixes.json`.

## Decision table (Latency / Slow Response)

Default thresholds (configurable in YAML):

- `queue_delay_ms`: **500ms**
- `duration_ms`: **3000ms**

Overriding rule: first check for **Queueing**, and only if it is normal move on to **Processing**.
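
The ordering rule (Queueing first, then Processing) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `classify_latency` is hypothetical, the comparison is assumed to be "greater than or equal to", and the real thresholds are read from the YAML rather than hard-coded.

```python
from typing import Mapping, Optional

# Default thresholds quoted in this document. In the real system they come
# from quick_fix_rules.latency_v1.thresholds in the YAML config.
DEFAULT_THRESHOLDS = {"queue_delay_ms": 500.0, "duration_ms": 3000.0}


def classify_latency(
    metadata: Mapping[str, object],
    thresholds: Mapping[str, float] = DEFAULT_THRESHOLDS,
) -> Optional[str]:
    """Hypothetical helper: return 'queueing', 'processing', or None."""
    queue_delay = metadata.get("queue_delay")
    duration = metadata.get("duration_ms")
    # Queueing is checked first; Processing is considered only when
    # the queue delay is missing or below its threshold.
    if isinstance(queue_delay, (int, float)) and queue_delay >= thresholds["queue_delay_ms"]:
        return "queueing"
    if isinstance(duration, (int, float)) and duration >= thresholds["duration_ms"]:
        return "processing"
    return None  # not enough evidence: the caller should fall back
```

When both signals are above their thresholds, the sketch reports `queueing`, which matches the rule that infrastructure waiting is ruled out before application slowness is blamed.
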

| Detection | Quick Fix | Why |
|---|---|---|
| High Queueing + high DB utilization | 🔌 Increase Connection Pool / Kill Slow Queries | Not enough free connections / blocking queries |
| High Queueing + high CPU/RAM | 📈 Scale Up / Add Workers | The server is at its resource ceiling |
| High Queueing + low usage + low DB utilization | 🔄 Restart Service (Stuck Workers) | Suspected deadlock / stuck worker |
| Low Queueing + high Duration | 🔍 Optimize Query / Cache | A code/query problem, not infrastructure |

## Which fields we use (data schema)

### Primary signals

- **Queue Delay**
  - `metadata.queue_delay` (ms): comes from the `access_logs` logs (headers `X-Queue-Start`/`X-Request-Start`).
  - In some cases derived values also exist: `queue_delay_ms_p95` / `queue_delay_ms_avg` (best-effort).
- **Duration**
  - `metadata.duration_ms` (ms): handling/request time.

### Load and resources

- `metadata.cpu_percent` (percent, best-effort)
- `metadata.memory_percent` (percent, best-effort)
- `metadata.active_requests` (in-flight requests, a process-local indication)

> Note: some of the metrics are “best-effort” and can vary between processes/pods. They are therefore used mainly to choose a general recommendation, not to draw sharp conclusions.

### “DB Pool utilization” in practice

Today we use `metadata.db_pool_utilization_pct`, which is computed from Mongo `serverStatus.connections` (current/available).

- This is **not exactly** the utilization of PyMongo's client pool.
- It is still a good indication of pressure on the DB when there is Queueing.
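
To make the server-side nature of this metric concrete, here is a minimal sketch of how such a percentage could be derived from the `connections` sub-document of `serverStatus`. The exact formula used by the project is not stated here, so `current / (current + available)` is an assumption, and the helper name is hypothetical.

```python
from typing import Mapping, Optional


def db_pool_utilization_pct(connections: Mapping[str, object]) -> Optional[float]:
    """Hypothetical helper: percent of server connections in use, or None.

    `connections` is expected to look like the `connections` section of
    MongoDB's serverStatus output, e.g. {"current": 80, "available": 20}.
    """
    current = connections.get("current")
    available = connections.get("available")
    if not isinstance(current, (int, float)) or not isinstance(available, (int, float)):
        return None  # fail open: missing data means "unknown", not 0%
    total = current + available
    if total <= 0:
        return None
    # Assumed formula: share of in-use connections out of all server slots.
    return round(100.0 * current / total, 2)


# With PyMongo, the input could be fetched roughly like this (not run here):
#   connections = client.admin.command("serverStatus")["connections"]
```

Because the numbers describe connections on the Mongo server as a whole, a high value reflects pressure from every client, not only from this application's own pool.
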

## Safety principles (important)

- **Do not suggest dangerous actions by default** (deletions/kill/rollback) without a “caution” marker and without a clear path.
- Prefer safe actions:
  - `copy` of ChatOps commands (`/triage …`, `/errors`, `/status_worker`)
  - Links to internal screens (Replay / Dashboard)
- If there is not enough data → a **clear fallback** (for example “Scale Up” or “Run /triage latency”).

## How to add/change a rule (checklist for developers and AI)

1. **Update the config**
   - Add/change `quick_fix_rules.latency_v1.thresholds` and/or `actions`.
2. **Check existing fields**
   - Make sure the fields you want exist in the alert `metadata`.
   - If not: add them in the instrumentation layer (usually `metrics.py`/`alert_manager.py`).
3. **Implement logic (if needed)**
   - The dynamic logic lives in `services/observability_dashboard.py` (function `_dynamic_quick_fix_actions`).
   - If you added a new signal, extend the extract/normalize step there in a fail-open way.
4. **Tests**
   - Rule of thumb: **a row in the decision table = a test**.
   - Cover:
     - queueing + high pool
     - queueing + high cpu/ram
     - queueing + low usage
     - high duration (processing)
     - missing data (fallback)
5. **UI/Bot**
   - The UI renders `alert.quick_fixes` as is.
   - For the bot: the recommendation is built best-effort from the same source.

## Notes for AI agents (how to work correctly)

- When real-time information is needed (for example, whether there really is DB pressure), **do not guess**.
- Ask to run the relevant ChatOps commands:
  - `/triage latency`
  - `/triage db`
  - `/triage system`
  - `/errors`

The bot's output is the source of truth.
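
For step 3 of the checklist, "fail-open" extraction could look like the minimal sketch below. It does not reproduce the code in `_dynamic_quick_fix_actions`; the helper name `extract_signal` and the idea of trying several keys in order are assumptions made for illustration.

```python
import math
from typing import Mapping, Optional


def extract_signal(metadata: Mapping[str, object], *keys: str) -> Optional[float]:
    """Hypothetical helper: first usable non-negative number among `keys`, else None."""
    for key in keys:
        raw = metadata.get(key)
        if raw is None or isinstance(raw, bool):
            continue  # absent or non-numeric flag: try the next key
        try:
            value = float(raw)
        except (TypeError, ValueError):
            continue  # fail open: a malformed value is treated as missing
        if math.isfinite(value) and value >= 0:
            return value
    return None  # the caller should use the fallback recommendation
```

Returning `None` instead of raising keeps a bad or missing field from breaking the whole alert; the rule simply sees "no signal" and the flow moves to the fallback path.
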