🧠 Smart Quick Fix (Queue Delay + Load/DB): Guidelines for Developers and AI Agents

The goal of Quick Fix is to give a short, safe, and useful recommendation on "what to do now", based on signals we already measure.

The emphasis here is on distinguishing between:

  • Queueing (infrastructure): the request waits before the application starts handling it (queue_delay).

  • Slow processing (application): the application is already handling the request, but it takes a long time (duration_ms).

Where it appears

  • WebApp: the Quick Fix column in the alert history on the Observability dashboard (/admin/observability).

  • Incident Replay: in the Runbook panel (only action buttons that are defined in the Runbook).

  • Telegram (critical alerts): a "Quick Fix" line inside the alert message (best-effort).

Source of truth: where the rules live

The dynamic rules are defined in config/observability_runbooks.yml under:

  • quick_fix_rules.latency_v1

The format supports:

  • Thresholds (thresholds)

  • A mapping of actions (actions) for each scenario

In addition, there are two more sources of actions:

  • Runbooks in the same YAML (actions defined inside steps)

  • Backward compatibility: config/alert_quick_fixes.json

The algorithm in practice:

  1. If there are enough signals → produce a dynamic Quick Fix.

  2. Add Runbook actions (if any exist).

  3. If there is no Runbook / no actions → fall back to alert_quick_fixes.json.
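The three steps above can be sketched as a small resolver. This is a minimal illustration, not the project's implementation: the function name `resolve_quick_fixes`, its parameters, and the action shape (a plain dict) are assumptions, and step 3 is read here as "fall back only when nothing has been collected so far".

```python
from typing import Any, Callable, Optional

Action = dict[str, Any]


def resolve_quick_fixes(
    alert: dict[str, Any],
    dynamic_rule: Callable[[dict[str, Any]], Optional[list[Action]]],
    runbook_actions: list[Action],
    legacy_fixes: list[Action],
) -> list[Action]:
    """Combine the three action sources in the order described above.

    All names here are hypothetical; only the ordering comes from the text.
    """
    actions: list[Action] = []

    # Step 1: dynamic Quick Fix, only when the rule had enough signals.
    dynamic = dynamic_rule(alert.get("metadata") or {})
    if dynamic:
        actions.extend(dynamic)

    # Step 2: append Runbook actions, if any exist.
    actions.extend(runbook_actions)

    # Step 3: nothing collected so far, so use the legacy JSON mapping.
    if not actions:
        actions.extend(legacy_fixes)
    return actions
```

Passing the dynamic rule in as a callable keeps the ordering logic testable without loading the YAML or JSON files.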

Decision table (Latency / Slow Response)

Default thresholds (configurable in the YAML):

  • queue_delay_ms: 500ms

  • duration_ms: 3000ms

Overarching rule: check for Queueing first, and move on to Processing only if queueing is normal.

| Detection | Quick Fix | Why |
|---|---|---|
| High queueing + high DB utilization | 🔌 Increase Connection Pool / Kill Slow Queries | Not enough free connections / blocking queries |
| High queueing + high CPU/RAM | 📈 Scale Up / Add Workers | The server is at its resource ceiling |
| High queueing + low usage + low DB utilization | 🔄 Restart Service (Stuck Workers) | Suspected deadlock / stuck worker |
| Low queueing + high duration | 🔍 Optimize Query / Cache | A code/query problem, not infrastructure |
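The table can be read as a short decision function. The sketch below is illustrative only. The 500 ms and 3000 ms defaults come from the text. The "high" cut-offs for DB utilization and CPU/RAM, the returned action keys, and the tie-break (rows evaluated in table order) are assumptions. It returns `None` when no row matches or when the signals are missing, leaving the fallback to the caller.

```python
from typing import Optional

# Defaults from the text; the two "high" percentages are assumed values.
DEFAULT_THRESHOLDS = {
    "queue_delay_ms": 500.0,
    "duration_ms": 3000.0,
    "db_pool_utilization_pct": 80.0,  # assumption
    "resource_pct": 85.0,             # assumption (CPU or RAM)
}


def pick_quick_fix(metadata: dict, thresholds: Optional[dict] = None) -> Optional[str]:
    """Map alert metadata to a hypothetical action key, following the table."""
    t = thresholds or DEFAULT_THRESHOLDS
    queue_delay = metadata.get("queue_delay")
    if queue_delay is None:
        return None  # cannot tell queueing from processing

    db_util = metadata.get("db_pool_utilization_pct")
    usage = [v for v in (metadata.get("cpu_percent"),
                         metadata.get("memory_percent")) if v is not None]

    # Queueing is checked first, as the overarching rule requires.
    if queue_delay >= t["queue_delay_ms"]:
        if db_util is not None and db_util >= t["db_pool_utilization_pct"]:
            return "increase_connection_pool"
        if usage and max(usage) >= t["resource_pct"]:
            return "scale_up"
        if usage and db_util is not None:
            return "restart_service"  # everything is known and low
        return None  # queueing seen, but not enough load data

    # Queueing is normal: look at processing time.
    duration = metadata.get("duration_ms")
    if duration is not None and duration >= t["duration_ms"]:
        return "optimize_query"
    return None
```

For example, `pick_quick_fix({"queue_delay": 50, "duration_ms": 4500})` selects the processing branch, while the same duration with `queue_delay` of 800 and no load data yields `None`.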

Which fields we use (data schema)

Main signals

  • Queue Delay

    • metadata.queue_delay (ms): comes from the access_logs logs (headers X-Queue-Start / X-Request-Start).

    • In some cases derived values also exist: queue_delay_ms_p95 / queue_delay_ms_avg (best-effort).

  • Duration

    • metadata.duration_ms (ms): handling/request time.
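To illustrate how a queue delay can be derived from an X-Queue-Start / X-Request-Start header, here is a minimal sketch. The text does not state the header format, so this assumes the common proxy conventions: an epoch timestamp, optionally prefixed with `t=`, in seconds, milliseconds, or microseconds. The function name `queue_delay_ms` and the unit-guessing heuristic are assumptions, not the project's code.

```python
import time
from typing import Optional


def queue_delay_ms(header_value: Optional[str],
                   now_s: Optional[float] = None) -> Optional[float]:
    """Return the time spent queued, in ms, or None if the header is unusable."""
    if not header_value:
        return None
    raw = header_value.strip()
    if raw.startswith("t="):
        raw = raw[2:]
    try:
        stamp = float(raw)
    except ValueError:
        return None  # fail-open: a bad header must not break the request

    # Guess the unit from the magnitude (assumed heuristic).
    if stamp > 1e14:      # microseconds since epoch
        start_s = stamp / 1e6
    elif stamp > 1e11:    # milliseconds since epoch
        start_s = stamp / 1e3
    else:                 # seconds since epoch
        start_s = stamp

    now = time.time() if now_s is None else now_s
    # Clock skew between proxy and app can make the delay negative; clamp to 0.
    return max(0.0, (now - start_s) * 1000.0)
```

The optional `now_s` argument exists only to make the function deterministic in tests.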

Load and resources

  • metadata.cpu_percent (percent, best-effort)

  • metadata.memory_percent (percent, best-effort)

  • metadata.active_requests (in-flight requests, a process-local indication)

Note: some of the metrics are "best-effort" and can vary between processes/pods. They are therefore used mainly to choose a general recommendation, not to draw sharp conclusions.

"DB Pool utilization"

In practice we currently use metadata.db_pool_utilization_pct, which is computed from the Mongo serverStatus.connections data (current/available).

  • This is not exactly the utilization of the PyMongo client pool.

  • But it is still a good indication of pressure on the DB when there is Queueing.
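As an illustration of deriving such a percentage from `serverStatus.connections`, the sketch below assumes the value is `current / (current + available)`. The text only says "current/available", so the exact formula used in the project may differ. The function name is hypothetical.

```python
from typing import Any, Mapping, Optional


def db_pool_utilization_pct(connections: Mapping[str, Any]) -> Optional[float]:
    """Share of server connections in use, as a percentage (assumed formula)."""
    try:
        current = float(connections["current"])
        available = float(connections["available"])
    except (KeyError, TypeError, ValueError):
        return None  # best-effort: missing or odd data yields no signal
    total = current + available
    if total <= 0:
        return None
    return round(100.0 * current / total, 2)


# With PyMongo, the input would typically come from:
#   client.admin.command("serverStatus")["connections"]
```

Because this reflects server-side connection counts across all clients, a high value points to pressure on the database as a whole rather than to exhaustion of one process's client pool.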

Safety principles (important)

  • Do not suggest dangerous actions by default (deletions/kill/rollback) without a "caution" and without a clear path.

  • Prefer safe actions:

    • Copying ChatOps commands (/triage …, /errors, /status_worker)

    • Links to internal screens (Replay / Dashboard)

  • If there is not enough data → a clear fallback (for example "Scale Up" or "Run /triage latency").

How to add/change a rule (checklist for developers and AI)

  1. Update the config

    • Add/change quick_fix_rules.latency_v1.thresholds and/or actions.

  2. Check existing fields

    • Make sure the fields you want exist in the alert metadata.

    • If not: add them in the instrumentation layer (usually metrics.py / alert_manager.py).

  3. Implement logic (if needed)

    • The dynamic logic lives in services/observability_dashboard.py (function _dynamic_quick_fix_actions).

    • If you added a new signal, extend the extract/normalize code there in a fail-open manner.
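A fail-open extract/normalize step, as mentioned in step 3, might look like the sketch below. It does not reproduce the code in services/observability_dashboard.py. The helper name `extract_signal` and the idea of trying several keys in order are assumptions made for illustration.

```python
import math
from collections.abc import Mapping
from typing import Any, Optional


def extract_signal(metadata: Optional[Mapping[str, Any]], *keys: str) -> Optional[float]:
    """Return the first usable numeric value among keys, or None.

    Fail-open: bad or missing data never raises; the signal is simply absent.
    """
    if not isinstance(metadata, Mapping):
        return None
    for key in keys:
        value = metadata.get(key)
        if value is None or isinstance(value, bool):
            continue  # booleans would silently become 0.0 / 1.0
        try:
            number = float(value)
        except (TypeError, ValueError):
            continue  # unparsable value: try the next key
        if math.isfinite(number):
            return number
    return None
```

A caller could, for instance, try `extract_signal(md, "queue_delay", "queue_delay_ms_p95")` so that a derived value is used when the primary field is missing. Whether the real code falls back this way is not stated in the text.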

  4. Tests

    • Rule of thumb: one row in the decision table = one test.

    • Cover:

      • queueing + high pool utilization

      • queueing + high cpu/ram

      • queueing + low usage

      • high duration (processing)

      • missing data (fallback)

  5. UI/Bot

    • The UI renders alert.quick_fixes as-is.

    • For the bot: the recommendation is built best-effort from the same source.

Notes for AI agents (how to work correctly)

  • When real-time information is needed (for example, whether there really is DB pressure), do not guess.

  • Ask to run the relevant ChatOps commands:

    • /triage latency

    • /triage db

    • /triage system

    • /errors

The bot's output is the source of truth.