ACP Incident Resolution: 2026-01-01
A milestone in autonomous self-healing: Claude Code resolves GPU allocation conflicts using Gaius MCP tools.
Overview
On January 1, 2026, the Gaius HealthObserver daemon detected GPU memory exhaustion and escalated to Claude Code via the Agent Client Protocol (ACP). This document captures the complete investigation and resolution session, demonstrating the first successful end-to-end ACP escalation workflow.
Key Achievements
- Autonomous Root Cause Analysis: Claude Code identified overlapping GPU allocations between multiple endpoints
- MCP Tool Integration: Used `gpu_status`, `health_check`, and orchestrator commands via gRPC
- Self-Healing Validation: Monitored recovery and verified resolution through the same tooling
- Multiple Incident Cascade: Handled a follow-on incident caused by the same GPU contention pattern
Incident 1: GPU_001:gpu_1_health
▐▛███▜▌ Claude Code v2.0.54
▝▜█████▛▘ Opus 4.5 · Claude Max
▘▘ ▝▝ ~/local/src/zndx/gaius
Health Incident Context
Fingerprint: GPU_001:gpu_1_health
Endpoint: gpu_1_health
Failure Mode: GPU_001
RPN Score: 125 (S:5 × O:5 × D:5)
Escalation Tier: 2
Attempts: 3
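As a hedged sketch of how the context fields above compose, the following uses a hypothetical `HealthIncident` class and `should_escalate` helper (the real HealthObserver API is not shown in this report); only the fingerprint format `FAILURE_MODE:endpoint` and the RPN formula S × O × D come from the incident data itself:

```python
from dataclasses import dataclass

@dataclass
class HealthIncident:
    # Hypothetical container for the fields shown in the incident context.
    failure_mode: str  # e.g. "GPU_001"
    endpoint: str      # e.g. "gpu_1_health"
    severity: int      # S
    occurrence: int    # O
    detection: int     # D
    attempts: int = 0

    @property
    def fingerprint(self) -> str:
        # Fingerprints in this report take the form FAILURE_MODE:endpoint
        return f"{self.failure_mode}:{self.endpoint}"

    @property
    def rpn(self) -> int:
        # FMEA Risk Priority Number: Severity x Occurrence x Detection
        return self.severity * self.occurrence * self.detection

def should_escalate(inc: HealthIncident, rpn_threshold: int = 100,
                    max_attempts: int = 3) -> bool:
    """Escalate to ACP once local retries are exhausted and risk is high
    (threshold values here are illustrative, not Gaius's actual policy)."""
    return inc.rpn >= rpn_threshold and inc.attempts >= max_attempts

incident = HealthIncident("GPU_001", "gpu_1_health", 5, 5, 5, attempts=3)
print(incident.fingerprint, incident.rpn, should_escalate(incident))
# GPU_001:gpu_1_health 125 True
```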
Initial State
| GPU | Memory Used | Memory % | Healthy |
|---|---|---|---|
| 0 | 1.29 GB | 5.4% | ✓ |
| 1 | 23.6 GB | 98.4% | ✗ |
| 2 | 22.89 GB | 95.4% | ✓ |
| 3 | 22.89 GB | 95.4% | ✓ |
| 4 | 22.89 GB | 95.4% | ✓ |
| 5 | 0.56 GB | 2.3% | ✓ |
Endpoints at Incident Time
| Endpoint | Model | GPUs | Status | Issue |
|---|---|---|---|---|
| cap_reasoning | Qwen/QwQ-32B | [1,2,3,4] | healthy | Consuming GPU 1 |
| orchestrator | nvidia/Orchestrator-8B | [0,1] | starting | “Insufficient GPU memory” |
Root Cause Analysis
Claude Code identified the issue immediately:
Root Cause: GPU allocation conflict between two endpoints trying to use GPU 1:

- `cap_reasoning` (Qwen/QwQ-32B): uses GPUs [1, 2, 3, 4], status: healthy
- `orchestrator` (nvidia/Orchestrator-8B): uses GPUs [0, 1], status: starting with “Insufficient GPU memory”

The `cap_reasoning` endpoint loaded first and consumed GPU 1’s memory (~23.6 GB of 24 GB). When `orchestrator` tries to start on GPUs [0, 1], GPU 0 has memory available (1.29 GB used) but GPU 1 is already exhausted.
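The conflict above can be found mechanically. Here is a minimal sketch, assuming the scheduler exposes an endpoint → GPU-list mapping (`find_gpu_conflicts` is a hypothetical helper, not part of Gaius):

```python
from itertools import combinations

def find_gpu_conflicts(allocations: dict[str, list[int]]) -> list[tuple[str, str, set[int]]]:
    """Return endpoint pairs whose GPU sets overlap and would contend for memory."""
    conflicts = []
    for (a, gpus_a), (b, gpus_b) in combinations(allocations.items(), 2):
        shared = set(gpus_a) & set(gpus_b)
        if shared:
            conflicts.append((a, b, shared))
    return conflicts

# Allocations at incident time, as reported in the tables above.
allocations = {
    "cap_reasoning": [1, 2, 3, 4],
    "orchestrator": [0, 1],
}
print(find_gpu_conflicts(allocations))
# [('cap_reasoning', 'orchestrator', {1})]
```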
Investigation Commands
```bash
# Check endpoint status
uv run gaius-cli --cmd "/gpu status" --format json

# Stop the stuck orchestrator
uv run gaius-cli --cmd "/gpu stop orchestrator" --format json

# Monitor recovery
uv run gaius-cli --cmd "/health gpu" --format json
```
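From Python, these commands can be wrapped in a small helper. This is a convenience sketch only: the JSON schema returned by `gaius-cli` is not specified in this report, so callers should inspect the parsed result rather than rely on particular keys:

```python
import json
import subprocess

def gaius_cmd(cmd: str) -> dict:
    """Run a gaius-cli command with --format json and parse the output.

    Thin wrapper around the CLI invocations shown above; raises
    CalledProcessError if the command fails.
    """
    out = subprocess.run(
        ["uv", "run", "gaius-cli", "--cmd", cmd, "--format", "json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)

# Example (requires a running Gaius installation):
# status = gaius_cmd("/gpu status")
```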
Resolution
After stopping the orchestrator endpoint, the scheduler automatically rebalanced:
| GPU | Before | After |
|---|---|---|
| 0 | 5.4% | 0.01% |
| 1 | 98.4% | 0.01% ✓ |
| 2 | 95.4% | 92.2% |
| 3 | 95.4% | 94.3% |
| 4 | 95.4% | 94.3% |
| 5 | 2.3% | 0.01% |
Final State:
- orchestrator: HEALTHY (port 8094)
- coding: HEALTHY (port 8093)
- cap_reasoning: STOPPING
- fast: STARTING (port 8095)
Incident 2: VLLM_001:coding
Immediately after resolving the first incident, a second cascaded incident appeared.
Health Incident Context
Fingerprint: VLLM_001:coding
Endpoint: coding
Failure Mode: VLLM_001
RPN Score: 125 (S:5 × O:5 × D:5)
Escalation Tier: 2
Attempts: 3
Conflict Analysis
| Endpoint | GPUs | Status | Memory on GPU 1 |
|---|---|---|---|
| cap_reasoning | [1,2,3,4] | healthy | 22.89 GB (95.4%) |
| orchestrator | [0,1] | stopping | competing |
| coding | [1] | failed | can’t allocate |
| fast | [0] | healthy | - |
Resolution
The scheduler handled this automatically:
- Stopped `cap_reasoning` to free GPUs [1,2,3,4]
- Stopped `orchestrator` and `coding`
- Cleared all GPU memory (95%+ → 0%)
- Restarted endpoints with non-overlapping allocations
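The rebalancing steps above can be sketched as a greedy planner that hands out contiguous, non-overlapping GPU blocks, largest demand first. This is a simplification for illustration; the real scheduler in `makespan_scheduler.py` uses CP-SAT constraints, and `plan_allocations` is a hypothetical name:

```python
def plan_allocations(num_gpus: int, demands: dict[str, int]) -> dict[str, list[int]]:
    """Greedily assign contiguous, non-overlapping GPU blocks, largest models first."""
    plan: dict[str, list[int]] = {}
    next_free = 0
    # Sort by descending GPU demand so large models claim GPUs first.
    for endpoint, need in sorted(demands.items(), key=lambda kv: -kv[1]):
        if next_free + need > num_gpus:
            raise RuntimeError(f"not enough GPUs for {endpoint}")
        plan[endpoint] = list(range(next_free, next_free + need))
        next_free += need
    return plan

print(plan_allocations(6, {"cap_reasoning": 4, "orchestrator": 1, "coding": 1}))
# {'cap_reasoning': [0, 1, 2, 3], 'orchestrator': [4], 'coding': [5]}
```

By construction no GPU appears in two allocations, which is exactly the property the incident-time plan lacked.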
Final State:
| Endpoint | Status |
|---|---|
| orchestrator | HEALTHY ✓ |
| coding | HEALTHY ✓ |
| reasoning | STOPPING |
| fast | STARTING |
Observations
What Worked
- FMEA-Based Escalation: RPN scoring correctly identified severity (125 = S:5 × O:5 × D:5)
- MCP Tool Chain: All diagnostic commands worked through gRPC proxying
- Scheduler Self-Healing: Automatic GPU reallocation after conflicts cleared
- Cascading Incident Detection: Second incident properly tracked with separate fingerprint
Identified Gaps
- GPU Overlap Detection: Scheduler allowed conflicting GPU assignments (`cap_reasoning` and `orchestrator` both claimed GPU 1)
- Startup Ordering: No precedence constraints ensure that larger models claim GPUs first
- Runtime Validation: GPU allocations only validated at scheduling time, not continuously
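Closing the runtime-validation gap could look like a periodic audit pass over live scheduler state. A sketch under stated assumptions (`audit_allocations` is hypothetical, and `fetch_live` stands in for querying live per-endpoint allocations, which this report does not specify):

```python
from typing import Callable

def audit_allocations(fetch_live: Callable[[], dict[str, list[int]]]) -> dict[int, list[str]]:
    """One pass of a continuous validator: report GPUs claimed by more
    than one endpoint in the live allocation snapshot."""
    claims: dict[int, list[str]] = {}
    for endpoint, gpus in fetch_live().items():
        for gpu in gpus:
            claims.setdefault(gpu, []).append(endpoint)
    # Keep only GPUs with multiple claimants.
    return {gpu: eps for gpu, eps in claims.items() if len(eps) > 1}

# Replaying the incident-time allocations from this report:
snapshot = {"cap_reasoning": [1, 2, 3, 4], "orchestrator": [0, 1]}
print(audit_allocations(lambda: snapshot))
# {1: ['cap_reasoning', 'orchestrator']}
```

Run on a timer, such a check would have flagged the double claim on GPU 1 while `orchestrator` was still in `starting`, rather than waiting for memory exhaustion.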
Order 3+ RCA Observations
These connect to CP-SAT constraints in `makespan_scheduler.py`:

| Constraint | Gap Identified |
|---|---|
| GPU_MUTUAL_EXCLUSION | Enforced at planning time, not at runtime |
| CONTIGUITY_REQUIREMENT | TP endpoints need contiguous GPU blocks |
| PRECEDENCE | Large models should claim GPUs before small ones |
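The three constraints can be expressed as pure-Python plan checks. These are stand-ins for the actual CP-SAT formulation in `makespan_scheduler.py` (which this report does not show), and the precedence check reflects one plausible reading of that rule:

```python
def check_plan(plan: dict[str, list[int]], model_size: dict[str, int]) -> list[str]:
    """Validate a GPU plan against the three constraints tabulated above."""
    violations = []
    # GPU_MUTUAL_EXCLUSION: no GPU may belong to two endpoints.
    owner: dict[int, str] = {}
    for ep, gpus in plan.items():
        for g in gpus:
            if g in owner:
                violations.append(f"mutual exclusion: GPU {g} claimed by {owner[g]} and {ep}")
            owner.setdefault(g, ep)
    # CONTIGUITY_REQUIREMENT: tensor-parallel endpoints need a contiguous block.
    for ep, gpus in plan.items():
        if sorted(gpus) != list(range(min(gpus), max(gpus) + 1)):
            violations.append(f"contiguity: {ep} allocation {sorted(gpus)} has gaps")
    # PRECEDENCE (one reading): larger models claim lower-numbered GPUs first.
    by_size = sorted(plan, key=lambda ep: -model_size[ep])
    starts = [min(plan[ep]) for ep in by_size]
    if starts != sorted(starts):
        violations.append("precedence: a smaller model claimed GPUs before a larger one")
    return violations

# The incident-time plan violates mutual exclusion (GPU 1) and precedence:
bad = {"cap_reasoning": [1, 2, 3, 4], "orchestrator": [0, 1]}
print(check_plan(bad, {"cap_reasoning": 32, "orchestrator": 8}))
```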
Significance
This incident represents a milestone in Gaius’s self-healing capabilities:
- First Successful ACP Escalation: HealthObserver → Claude Code → MCP tools → Resolution
- Closed-Loop Verification: Claude Code verified resolution using same tools that detected the issue
- RCA Framework Validation: Order 3+ observations identified scheduler constraint gaps
- Multi-Incident Handling: Cascading incidents tracked and resolved in sequence
The GPU allocation conflict exposed architectural issues that led to development of the RCA (Root Cause Analysis) framework, which enables future incidents to be classified as OPERATIONAL (transient) or ARCHITECTURAL (requiring a code fix).
Captured from ACP session on 2026-01-01 04:11-04:45 UTC