FMEA Framework

FMEA (Failure Mode and Effects Analysis) replaces simple severity classification with quantitative risk assessment. The technique originated in manufacturing engineering; Gaius adapts it for software systems.

Risk Priority Number

Each failure mode is scored on three dimensions:

RPN = S x O x D (range 1-1000)

| Dimension | Meaning | Scale |
| --- | --- | --- |
| S (Severity) | Impact on system availability | 1 (negligible) to 10 (total failure) |
| O (Occurrence) | Probability of recurrence | 1 (rare) to 10 (frequent) |
| D (Detection) | Ability to detect before impact | 1 (always caught) to 10 (invisible) |

Higher RPN means higher risk. The worst possible score (10 x 10 x 10 = 1000) indicates a severe, frequent, and invisible failure.
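
As a minimal sketch of the formula above, the snippet below multiplies the three scores and rejects values outside the 1-10 scale. The function name `rpn` and the validation behavior are illustrative assumptions, not part of the Gaius codebase.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Return the Risk Priority Number S x O x D (hypothetical helper)."""
    scores = {
        "severity": severity,
        "occurrence": occurrence,
        "detection": detection,
    }
    for name, score in scores.items():
        # Each dimension is scored on a 1-10 scale.
        if not 1 <= score <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {score}")
    return severity * occurrence * detection


print(rpn(8, 6, 4))     # S=8, O=6, D=4 -> 192
print(rpn(10, 10, 10))  # worst case -> 1000
```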

Action Thresholds

| RPN Range | Tier | Action |
| --- | --- | --- |
| 1-100 | Tier 0 | Automatic procedural remediation |
| 101-200 | Tier 1 | Agent-assisted remediation |
| 201-400 | Tier 2 | Requires user approval |
| 401-1000 | Manual | Human intervention required |
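
The thresholds can be read as a simple lookup from RPN to tier. The sketch below assumes the ranges are inclusive at both ends, as the table suggests; `tier_for_rpn` is a hypothetical name chosen for illustration.

```python
def tier_for_rpn(rpn: int) -> str:
    """Map an RPN to its action tier (hypothetical helper)."""
    if not 1 <= rpn <= 1000:
        raise ValueError(f"RPN must be between 1 and 1000, got {rpn}")
    if rpn <= 100:
        return "Tier 0"  # automatic procedural remediation
    if rpn <= 200:
        return "Tier 1"  # agent-assisted remediation
    if rpn <= 400:
        return "Tier 2"  # requires user approval
    return "Manual"      # human intervention required


print(tier_for_rpn(100))  # Tier 0 (upper boundary of the first range)
print(tier_for_rpn(192))  # Tier 1
print(tier_for_rpn(270))  # Tier 2
```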

Conservative Overrides

Certain conditions always escalate regardless of RPN:

  • Detection >= 8: Poor observability requires approval
  • Safety level DESTRUCTIVE: Data-modifying actions require approval
  • Multiple correlated failures: Escalate to next tier
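
One way to picture how the overrides interact with the RPN tiers is sketched below. It rests on two assumptions that the text does not spell out: "requires approval" is treated as a floor of Tier 2 (the tier whose action is user approval), and the correlated-failure escalation is applied last and capped at Manual. All names are hypothetical.

```python
TIERS = ("Tier 0", "Tier 1", "Tier 2", "Manual")
APPROVAL_LEVEL = 2  # assumption: "requires approval" maps to Tier 2


def base_level(rpn: int) -> int:
    """Index into TIERS derived from the RPN thresholds alone."""
    if not 1 <= rpn <= 1000:
        raise ValueError(f"RPN must be between 1 and 1000, got {rpn}")
    if rpn <= 100:
        return 0
    if rpn <= 200:
        return 1
    if rpn <= 400:
        return 2
    return 3


def effective_tier(rpn: int, detection: int,
                   destructive: bool = False,
                   correlated: bool = False) -> str:
    """Apply the conservative overrides on top of the RPN tier."""
    level = base_level(rpn)
    # Poor observability or a DESTRUCTIVE safety level forces approval.
    if detection >= 8 or destructive:
        level = max(level, APPROVAL_LEVEL)
    # Multiple correlated failures escalate one tier (assumed cap: Manual).
    if correlated:
        level = min(level + 1, len(TIERS) - 1)
    return TIERS[level]


print(effective_tier(192, detection=8))                  # Tier 2
print(effective_tier(54, detection=2, correlated=True))  # Tier 1
```

Overrides only ever raise the tier in this sketch; a failure mode already at Manual stays there.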

Failure Mode Catalog

34 failure modes across 7 categories:

GPU (6 modes)

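| ID | Failure Mode | S | O | D | RPN |
| --- | --- | --- | --- | --- | --- |
| GPU_001 | Memory Exhaustion | 8 | 6 | 4 | 192 |
| GPU_002 | Temperature Critical | 9 | 3 | 2 | 54 |
| GPU_003 | Hardware Error | 10 | 2 | 3 | 60 |
| GPU_004 | Driver Crash | 8 | 3 | 4 | 96 |
| GPU_005 | Memory Fragmentation | 7 | 5 | 4 | 140 |
| GPU_006 | Power Throttling | 5 | 4 | 3 | 60 |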

vLLM Endpoint (6 modes)

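| ID | Failure Mode | S | O | D | RPN |
| --- | --- | --- | --- | --- | --- |
| VLLM_001 | Stuck Starting | 6 | 5 | 5 | 150 |
| VLLM_002 | Stuck Stopping | 4 | 4 | 4 | 64 |
| VLLM_003 | Health Check Failure | 7 | 6 | 3 | 126 |
| VLLM_004 | Orphan Process | 5 | 5 | 4 | 100 |
| VLLM_005 | OOM Crash | 8 | 5 | 3 | 120 |
| VLLM_006 | KV-Cache Exhaustion | 5 | 6 | 5 | 150 |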

Model Quality (5 modes)

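| ID | Failure Mode | S | O | D | RPN |
| --- | --- | --- | --- | --- | --- |
| MQ_001 | Hallucination Increase | 7 | 4 | 6 | 168 |
| MQ_002 | Latency Degradation | 4 | 5 | 3 | 60 |
| MQ_003 | Output Quality Drift | 5 | 6 | 7 | 210 |
| MQ_004 | Semantic Drift | 6 | 4 | 8 | 192 |
| MQ_005 | Context Exhaustion | 6 | 5 | 4 | 120 |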

Emergent Behavior (4 modes)

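| ID | Failure Mode | S | O | D | RPN |
| --- | --- | --- | --- | --- | --- |
| EB_001 | Swarm Consensus Failure | 6 | 4 | 6 | 144 |
| EB_002 | Cognition Loop | 5 | 4 | 7 | 140 |
| EB_003 | Embedding Drift | 6 | 5 | 8 | 240 |
| EB_004 | Self-Observation Bias | 6 | 5 | 9 | 270 |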

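Note: Emergent behavior modes have high Detection scores (6 to 9, meaning poor observability), reflecting the inherent difficulty of detecting these failure modes automatically. EB_003 and EB_004 also meet the Detection >= 8 condition under Conservative Overrides, so they require approval regardless of RPN.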

Adaptive Learning

The system adjusts S/O/D scores based on remediation outcomes using exponential moving average (alpha = 0.2):

  • Successful fast fix: Occurrence decreases (problem is manageable)
  • Failed fix: Occurrence increases (problem is more persistent than estimated)
  • User-reported: Detection increases (automated checks missed it)
  • Early detection: Detection decreases (automated checks caught it)
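
The update rule can be sketched as a standard exponential moving average, new = old + alpha * (target - old). The text gives only alpha and the direction of each adjustment, so the target values below (pulling a score toward 1 or 10) and the clamping to the 1-10 scale are assumptions, and every name is hypothetical.

```python
ALPHA = 0.2  # smoothing factor stated in the text

# Assumption: each outcome pulls one dimension toward an end of the scale.
OUTCOME_TARGETS = {
    "successful_fast_fix": ("occurrence", 1.0),   # Occurrence decreases
    "failed_fix":          ("occurrence", 10.0),  # Occurrence increases
    "user_reported":       ("detection", 10.0),   # Detection increases
    "early_detection":     ("detection", 1.0),    # Detection decreases
}


def ema_update(current: float, target: float, alpha: float = ALPHA) -> float:
    """Move `current` a fraction `alpha` of the way toward `target`."""
    updated = current + alpha * (target - current)
    return min(10.0, max(1.0, updated))  # keep the score on the 1-10 scale


def apply_outcome(scores: dict, outcome: str) -> dict:
    """Return a copy of `scores` adjusted for one remediation outcome."""
    dimension, target = OUTCOME_TARGETS[outcome]
    adjusted = dict(scores)
    adjusted[dimension] = ema_update(scores[dimension], target)
    return adjusted


# Scores for GPU_001 as listed in the catalog: S=8, O=6, D=4.
gpu_001 = {"severity": 8.0, "occurrence": 6.0, "detection": 4.0}
after_failure = apply_outcome(gpu_001, "failed_fix")
print(round(after_failure["occurrence"], 2))  # 6.8
```

With alpha = 0.2, a single outcome shifts a score by only a fifth of the gap to its target, so one unusual incident cannot swing the RPN sharply.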

CLI Commands

# FMEA summary with current RPN scores
uv run gaius-cli --cmd "/fmea" --format json

# Failure mode catalog
uv run gaius-cli --cmd "/fmea catalog" --format json

# Detail for specific failure mode
uv run gaius-cli --cmd "/fmea detail GPU_001" --format json

# Recent incidents
uv run gaius-cli --cmd "/fmea history" --format json