Health Workflow

The health workflow covers diagnosing system issues, applying self-healing fixes, and monitoring recovery. Gaius implements a fail-fast policy with actionable error messages, so every failure tells you what to do next.

Step 1: Diagnose

Run the health check to see the current state of all services:

uv run gaius-cli --cmd "/health" --format json

This returns a structured report with checks organized by category. Each check has a status (ok, warn, fail) and a message explaining the current state.

To check a specific category:

uv run gaius-cli --cmd "/health engine" --format json
uv run gaius-cli --cmd "/health endpoints" --format json
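
Once you have the JSON report, jq can narrow it down to just the problem checks. The payload below is a hand-written sample that assumes the shape implied by the rest of this guide (checks under .data.checks, each with name, status, and message fields):

```shell
# Hand-written sample of a /health report (illustrative, not live output)
report='{"data":{"checks":[{"name":"engine","status":"ok","message":"running"},{"name":"nifi","status":"fail","message":"#NF.00000001.UNREACHABLE - NiFi not reachable"}]}}'

# Keep only checks whose status is not ok
echo "$report" | jq '.data.checks[] | select(.status != "ok") | {name, status, message}'
```

Piping the real command's output through the same filter gives you a quick list of what needs fixing.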

Step 2: Interpret Failures

Failed checks include Guru Meditation Codes – unique identifiers for each failure mode. For example:

  • #DS.00000001.SVCNOTINIT – DatasetService not initialized
  • #NF.00000001.UNREACHABLE – NiFi not reachable
  • #EP.00000001.GPUOOM – GPU out of memory

Each code maps to a documented heuristic with symptom, cause, observation method, and solution. The error message itself contains remediation hints.
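
Because the codes follow a fixed pattern (two-letter subsystem, eight digits, mnemonic), they are easy to pull out of any failure message with grep. The message below is a sample taken from the list above, not live output:

```shell
# Extract a Guru Meditation Code from a failure message
msg='#NF.00000001.UNREACHABLE - NiFi not reachable'
echo "$msg" | grep -oE '#[A-Z]{2}\.[0-9]{8}\.[A-Z]+'
# prints: #NF.00000001.UNREACHABLE
```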

Step 3: Self-Heal

Always try /health fix before manual intervention. This is a design principle, not a suggestion:

uv run gaius-cli --cmd "/health fix engine" --format json
uv run gaius-cli --cmd "/health fix endpoints" --format json
uv run gaius-cli --cmd "/health fix nifi" --format json

Available fix targets: engine, dataset, nifi, postgres, qdrant, minio, endpoints, evolution.

Each fix strategy is a multi-step remediation sequence with verification at each step. The system attempts increasingly aggressive fixes until the service recovers.
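
If several services are degraded at once, the fix targets can be walked in a loop. This is a sketch that prints each command rather than running it (swap echo for eval to execute; the order below simply follows the target list above and is not a required dependency order):

```shell
# Sketch: build a /health fix command for every available target
targets="engine dataset nifi postgres qdrant minio endpoints evolution"
for t in $targets; do
    cmd="uv run gaius-cli --cmd \"/health fix $t\" --format json"
    echo "$cmd"   # replace echo with: eval "$cmd"
done
```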

Step 4: Monitor Recovery

After applying a fix, monitor the health observer for recovery:

# Check observer status
uv run gaius-cli --cmd "/health observer status" --format json

# List active incidents
uv run gaius-cli --cmd "/health observer incidents" --format json

# Poll for recovery, stopping early once every check reports ok
for i in $(seq 1 10); do
    sleep 15
    failing=$(uv run gaius-cli --cmd "/health" --format json | \
        jq '[.data.checks[] | select(.status != "ok") | {name, status, message}]')
    if [ "$failing" = "[]" ]; then
        echo "all checks ok"
        break
    fi
    echo "$failing"
done

Step 5: Escalation

If /health fix does not resolve the issue, the Health Observer can escalate via ACP (Agent Client Protocol) to Claude Code for deeper analysis. This happens automatically when:

  1. An incident exceeds the configured FMEA RPN threshold
  2. Local remediation has failed
  3. The incident is not in cooldown

Manual escalation path – use just restart-clean as the last resort:

just restart-clean

This performs a full clean restart of all services: stops everything, cleans up state, and restarts from scratch.

FMEA Framework

The health system uses Failure Mode and Effects Analysis (FMEA) to prioritize issues. Each failure mode has a Risk Priority Number (RPN) computed from severity, occurrence frequency, and detection difficulty. Higher RPNs get attention first.

# View the FMEA catalog
uv run gaius-cli --cmd "/fmea catalog" --format json

# Calculate RPN for a specific failure mode
uv run gaius-cli --cmd "/fmea rpn <mode>" --format json
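
In standard FMEA, the RPN is the product of the three ratings, each scored on a 1-10 scale (whether Gaius uses exactly this scale is an assumption; the /fmea rpn command above is the authoritative calculation). As an illustration:

```shell
# Illustrative FMEA arithmetic: RPN = Severity x Occurrence x Detection
severity=8     # impact if the failure occurs
occurrence=5   # how often it occurs
detection=6    # how hard it is to detect before impact
rpn=$((severity * occurrence * detection))
echo "RPN = $rpn"
# prints: RPN = 240
```

A high-severity failure that is also hard to detect can outrank a more frequent but obvious one, which is the point of the weighting.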

Health Observer Daemon

The Health Observer runs as a background daemon, continuously monitoring service health and automatically triggering remediation when issues are detected:

# Start the observer
uv run gaius-cli --cmd "/health observer start" --format json

# Stop the observer
uv run gaius-cli --cmd "/health observer stop" --format json

When running, it checks services periodically and logs incidents. Resolved incidents are filtered out of the active list, but unknown or unexpected states remain visible (fail-open for observability).
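
The fail-open filtering described above can be sketched with jq. The incident shape below (an id plus a state field) is a hypothetical stand-in for the real schema: resolved incidents are dropped, but anything not explicitly resolved, including unknown states, stays visible:

```shell
# Hypothetical incident list (field names are assumptions, not the real schema)
incidents='[{"id":1,"state":"resolved"},{"id":2,"state":"active"},{"id":3,"state":"unknown"}]'

# Fail-open: only explicitly resolved incidents are filtered out
echo "$incidents" | jq '[.[] | select(.state != "resolved")]'
```

Note the filter tests for "resolved" rather than allow-listing known-good states, so a new or unexpected state is surfaced instead of silently hidden.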