Health Observer
The HealthObserver daemon continuously monitors system health. It manages incidents using FMEA (Failure Mode and Effects Analysis) risk scoring and escalates over ACP any issue that local remediation cannot resolve.
Operation
The observer runs as a background service within the engine, polling system health at a configurable interval (default 60 seconds).
from gaius.health.observe import HealthObserver

observer = HealthObserver()
await observer.start()  # Begins continuous monitoring; must be called from async code
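The polling behavior described above can be pictured as a plain asyncio loop. This is a minimal sketch, not the HealthObserver implementation: `poll_health`, `check`, and `max_polls` are hypothetical names introduced for illustration, and only the 60-second default comes from the text.

```python
import asyncio
from typing import Awaitable, Callable, Optional

DEFAULT_INTERVAL_SECONDS = 60.0  # default polling interval stated in the text


async def poll_health(
    check: Callable[[], Awaitable[bool]],
    interval: float = DEFAULT_INTERVAL_SECONDS,
    max_polls: Optional[int] = None,
) -> list[bool]:
    """Call `check` repeatedly, sleeping `interval` seconds between polls.

    `max_polls` is a hypothetical knob that lets the sketch terminate;
    a real daemon would loop until it is stopped.
    """
    results: list[bool] = []
    polls = 0
    while max_polls is None or polls < max_polls:
        results.append(await check())
        polls += 1
        # Sleep only if another poll will follow.
        if max_polls is None or polls < max_polls:
            await asyncio.sleep(interval)
    return results
```

A bounded run such as `asyncio.run(poll_health(check, interval=0.0, max_polls=3))` is convenient in tests; with `max_polls=None` the loop never returns and the result list grows without bound, so a long-running version would report each result instead of collecting them.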
Incident Lifecycle
Detection → Active → Healing → Recovered → Resolved
                            ↘ Escalated (ACP) → Resolved
- Detection: Health check identifies a failure
- Active: Incident created with FMEA risk scoring
- Healing: Self-healing attempts in progress
- Recovered / Escalated: The incident is either fixed locally (Recovered) or handed off to ACP (Escalated)
- Resolved: Terminal state
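The lifecycle above can be modeled as a small state machine. The sketch below is one reading of the diagram, not the project's actual data model: the `IncidentState` enum, the `TRANSITIONS` table, and `can_transition` are hypothetical names, and it assumes the escalation branch leaves from Healing (the source may also allow escalation directly from Active when the RPN threshold is exceeded).

```python
from enum import Enum


class IncidentState(Enum):
    DETECTION = "detection"
    ACTIVE = "active"
    HEALING = "healing"
    RECOVERED = "recovered"
    ESCALATED = "escalated"
    RESOLVED = "resolved"


# Allowed transitions, read off the lifecycle diagram (assumed branch point: Healing).
TRANSITIONS = {
    IncidentState.DETECTION: {IncidentState.ACTIVE},
    IncidentState.ACTIVE: {IncidentState.HEALING},
    IncidentState.HEALING: {IncidentState.RECOVERED, IncidentState.ESCALATED},
    IncidentState.RECOVERED: {IncidentState.RESOLVED},
    IncidentState.ESCALATED: {IncidentState.RESOLVED},
    IncidentState.RESOLVED: set(),  # terminal state: no outgoing transitions
}


def can_transition(current: IncidentState, target: IncidentState) -> bool:
    """Return True if the lifecycle permits moving from `current` to `target`."""
    return target in TRANSITIONS[current]
```

Keeping the transitions in a table makes illegal moves, such as reopening a resolved incident, easy to reject in one place.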
Fail Open
When filtering incidents for display, the observer uses fail-open semantics: it filters OUT known terminal states (resolved) rather than filtering IN known active states. An incident in an unknown or unexpected state is therefore always surfaced for investigation instead of being silently hidden.
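A minimal sketch of the fail-open filter follows. It assumes incidents are dictionaries with a `state` key; that shape, `TERMINAL_STATES`, and `incidents_to_display` are hypothetical and chosen only to illustrate the deny-list approach.

```python
TERMINAL_STATES = frozenset({"resolved"})  # the only state known to be safe to hide


def incidents_to_display(incidents):
    """Fail open: drop only known terminal states and keep everything else.

    Incidents with an unknown state, or with no state at all, are kept.
    """
    return [i for i in incidents if i.get("state") not in TERMINAL_STATES]
```

The fail-closed alternative, `i.get("state") in ACTIVE_STATES`, would hide any incident whose state was added later or mistyped, which is exactly the failure this design avoids.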
Makespan Integration
The observer integrates with the AgendaTracker to avoid false-positive incidents during scheduled operations. When an endpoint is part of a planned makespan transition:
if tracker.is_endpoint_in_scheduled_transition("reasoning"):
    # Skip incident creation: this state change is intentional
    log.info(f"Skipping: endpoint in scheduled transition to {expected_state}")
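To show how this check might gate incident creation end to end, here is a self-contained sketch. `StubAgendaTracker` and `should_create_incident` are hypothetical stand-ins written for this example; only the method name `is_endpoint_in_scheduled_transition` comes from the snippet above, and the real AgendaTracker may behave differently.

```python
class StubAgendaTracker:
    """Hypothetical stand-in for AgendaTracker, for illustration only."""

    def __init__(self, transitioning_endpoints):
        self._transitioning = set(transitioning_endpoints)

    def is_endpoint_in_scheduled_transition(self, endpoint: str) -> bool:
        return endpoint in self._transitioning


def should_create_incident(tracker, endpoint: str, healthy: bool) -> bool:
    """Create an incident only for unhealthy endpoints outside planned transitions."""
    if healthy:
        return False
    # An unhealthy endpoint in a scheduled transition is expected, not a failure.
    return not tracker.is_endpoint_in_scheduled_transition(endpoint)
```

For example, with `StubAgendaTracker({"reasoning"})`, a failing `"reasoning"` endpoint produces no incident, while any other failing endpoint does.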
ACP Escalation
When an incident exceeds the RPN threshold or local remediation fails after 3 attempts, the observer escalates to Claude Code via ACP:
- Claude Code analyzes the issue using MCP tools
- Identifies gaps in the /health fix framework
- Implements new fix strategies and heuristics
- Commits to acp-claude/health-fix branch for review
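The escalation rule can be sketched as follows, assuming the conventional FMEA definition of the Risk Priority Number (severity × occurrence × detection, each scored 1 to 10). The threshold value of 200 and the names `rpn` and `should_escalate` are hypothetical; the source states neither the threshold nor how scores are assigned. Only the limit of 3 attempts comes from the text.

```python
MAX_LOCAL_ATTEMPTS = 3  # from the text: escalate after 3 failed local attempts
RPN_THRESHOLD = 200     # hypothetical value; the source does not state it


def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Conventional FMEA Risk Priority Number: product of three 1-10 scores."""
    for name, score in (("severity", severity),
                        ("occurrence", occurrence),
                        ("detection", detection)):
        if not 1 <= score <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {score}")
    return severity * occurrence * detection


def should_escalate(rpn_value: int, failed_attempts: int) -> bool:
    """Escalate when the RPN exceeds the threshold or local remediation is exhausted."""
    return rpn_value > RPN_THRESHOLD or failed_attempts >= MAX_LOCAL_ATTEMPTS
```

Under these assumptions an incident scored 8, 5, 6 has an RPN of 240 and escalates immediately, while one with an RPN of 100 escalates only after its third failed remediation attempt.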
Cadence Limits
To prevent runaway automation:
- Max 3 GitHub issues per 24 hours
- Min 5 minutes between restart attempts
- Max 3 restarts per endpoint per hour
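Limits like these are commonly enforced with a sliding-window counter plus a minimum gap between events. The class below is a minimal sketch of that technique, not the observer's actual limiter: `CadenceLimiter` and its parameters are hypothetical, and timestamps are passed in explicitly (in seconds) so the logic is easy to test.

```python
from collections import deque


class CadenceLimiter:
    """Allow at most `max_events` per window, at least `min_gap_seconds` apart."""

    def __init__(self, max_events: int, window_seconds: float,
                 min_gap_seconds: float = 0.0):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self.min_gap_seconds = min_gap_seconds
        self._events = deque()  # timestamps of allowed events, oldest first

    def allow(self, now: float) -> bool:
        """Record and permit an event at time `now` if it respects both limits."""
        # Forget events that have aged out of the window.
        while self._events and now - self._events[0] >= self.window_seconds:
            self._events.popleft()
        if len(self._events) >= self.max_events:
            return False
        if self._events and now - self._events[-1] < self.min_gap_seconds:
            return False
        self._events.append(now)
        return True


# Limits from the list above; one restart limiter would be kept per endpoint.
restart_limiter = CadenceLimiter(max_events=3, window_seconds=3600, min_gap_seconds=300)
issue_limiter = CadenceLimiter(max_events=3, window_seconds=24 * 3600)
```

In practice `now` would come from a monotonic clock such as `time.monotonic()`, and restart limiters would be stored in a dictionary keyed by endpoint name, since the restart limit applies per endpoint.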
CLI Commands
# Observer status
uv run gaius-cli --cmd "/health observer" --format json
# Active incidents
uv run gaius-cli --cmd "/health incidents" --format json
# Incident detail
uv run gaius-cli --cmd "/health incident <id>" --format json
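For scripting, the same commands can be wrapped in Python. This sketch assumes that `--format json` writes a single JSON document to standard output, which the source implies but does not spell out; `build_health_command` and `run_health_command` are hypothetical helper names.

```python
import json
import subprocess


def build_health_command(slash_command: str) -> list[str]:
    """Build the argument list for a gaius-cli invocation with JSON output."""
    return ["uv", "run", "gaius-cli", "--cmd", slash_command, "--format", "json"]


def run_health_command(slash_command: str):
    """Run the command and parse stdout as JSON (assumed output shape)."""
    completed = subprocess.run(
        build_health_command(slash_command),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return json.loads(completed.stdout)
```

Passing the arguments as a list avoids shell quoting problems with slash commands that contain spaces, such as `/health incidents`.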