Gaius
Gaius is a terminal interface for navigating graph-oriented data domains. It projects high-dimensional embeddings onto a discrete lattice via UMAP, computes persistent homology and Ollivier–Ricci curvature over the embedding space, and renders the results as interactive overlays on the lattice.
Named after Gaius Plinius Secundus (Pliny the Elder), whose Naturalis Historia cataloged the natural world across 37 books.
Capabilities
- Lattice Projection: UMAP (cosine metric, k=15 neighbors, min_dist=0.1) maps embedding vectors to continuous 2D coordinates. These are quantized to a 19×19 integer lattice by rounding and clipping to [0, 18]. The main lattice is accompanied by two 9×9 orthographic mini-grids centered on the cursor: an Embed view showing the local cosine-similarity neighborhood, and an Iso view rendering scalar fields (curvature, persistence, complexity) as elevation maps via inverse-distance-weighted interpolation (power=2).
- Persistent Homology (H₀–H₂): Ripser computes a Vietoris–Rips filtration over the cosine distance matrix of the original high-dimensional embeddings (not the projected coordinates), producing persistence barcodes for dimensions 0 through 2. Intervals with persistence > 0.1 are marked significant. H₀ captures connected components, H₁ captures 1-cycles, and H₂ captures 2-dimensional voids. Barcodes are rendered as overlays on the lattice, with persistent generators mapped to their lattice positions via the UMAP projection.
- Ollivier–Ricci Curvature: Discrete Ricci curvature is computed on a k-nearest-neighbor graph (k=15, cosine metric) constructed from the embedding space, using the OTD method with α=0.5. Per-node curvature is the mean of incident edge curvatures. The resulting curvature field, gradient vectors (finite-difference approximation), and divergence values are projected to the Iso mini-grid. Positive curvature indicates cluster interiors; negative curvature indicates semantic boundaries.
- Multi-Agent Exploration: Seven agents (Leader, Risk, Optimizer, Planner, Critic, Executor, Adversary) navigate the lattice with role-specific positioning behaviors (center-seeking, peripheral, random) and cluster affinities. Agent training uses the RASE framework (Rapid Agentic Systems Engineering), where constraints are composed declaratively via AllOf/AnyOf/Not and evaluated by a ground-truth oracle to produce verifiable reward signals.
- Modal Interface: Vim-style modal navigation (`hjkl` motion, slash-command dispatch, overlay toggles) over both the lattice and the underlying gRPC service graph.
- FMEA Health Observer: A background daemon scores system components on Severity × Occurrence × Detection. When risk priority numbers exceed configured thresholds, it escalates to an agent via the Agent Client Protocol (ACP) for FMEA-mediated intervention.
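The projection-and-quantization step described above can be sketched as follows. The upstream UMAP call (umap-learn's `UMAP(metric="cosine", n_neighbors=15, min_dist=0.1)`) is assumed to have already produced the continuous 2D coordinates; only the rounding-and-clipping step is shown, and the function name is ours, not Gaius's API.

```python
import numpy as np

def quantize_to_lattice(coords_2d, size: int = 19) -> np.ndarray:
    """Quantize continuous 2D coordinates onto a size×size integer lattice.

    Each axis is normalized to [0, size-1], rounded, and clipped,
    mirroring the rounding-and-clipping step described above.
    """
    coords_2d = np.asarray(coords_2d, dtype=float)
    lo = coords_2d.min(axis=0)
    span = coords_2d.max(axis=0) - lo + 1e-8  # guard against a degenerate axis
    norm = (coords_2d - lo) / span
    return np.clip(np.rint(norm * (size - 1)).astype(int), 0, size - 1)

cells = quantize_to_lattice([[0.0, 0.0], [1.0, 1.0], [0.5, 0.3]])
```

Collisions are possible by design: distinct embeddings may land on the same lattice cell, which is what makes the projection a salience filter rather than a faithful map.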
Computational Pipeline
The following pipeline is implemented end-to-end:
- Embed — Documents are encoded as multi-vector embeddings (ColNomic, GPU-accelerated) and indexed.
- Project — UMAP maps the embedding space to 2D; coordinates are rounded to the 19×19 integer lattice.
- Filtration — Vietoris–Rips filtration over the cosine distance matrix of original embeddings; Ripser computes persistence barcodes for H₀, H₁, H₂. Significant intervals (persistence > 0.1) produce topological overlays.
- Curvature — Ollivier–Ricci curvature on the k-NN graph (k=15, α=0.5, OTD); curvature, gradient, and divergence fields are interpolated onto the 9×9 Iso mini-grid via IDW.
- Exploration — Agents operate on the lattice; topological features and curvature values are available as grid state for trajectory selection.
- Rendering — LuxCore path-traces procedural card visualizations from the computed geometric features.
The lattice serves as both a visualization surface and a discrete approximation of the data manifold, coupling persistent homology, differential geometry, and agent-based exploration in one interactive system.
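The Curvature step's interpolation onto the 9×9 Iso mini-grid can be sketched with a small inverse-distance-weighted routine (power=2, as stated above). This is a minimal NumPy illustration under assumed inputs, not the production implementation; names and the epsilon guard are ours.

```python
import numpy as np

def idw_grid(sample_xy, sample_vals, size=9, power=2, eps=1e-12):
    """Inverse-distance-weighted interpolation of scattered samples
    onto a size×size grid.

    sample_xy: (n, 2) sample coordinates in grid units
    sample_vals: (n,) scalar values (e.g. per-node curvature)
    """
    sample_xy = np.asarray(sample_xy, float)
    sample_vals = np.asarray(sample_vals, float)
    ys, xs = np.mgrid[0:size, 0:size]
    grid_pts = np.stack([xs.ravel(), ys.ravel()], axis=1)
    # Pairwise distances from every grid point to every sample: (size*size, n)
    d = np.linalg.norm(grid_pts[:, None, :] - sample_xy[None, :, :], axis=2)
    w = 1.0 / (d ** power + eps)  # eps keeps exact hits finite
    vals = (w * sample_vals).sum(axis=1) / w.sum(axis=1)
    return vals.reshape(size, size)

field = idw_grid([[0, 0], [8, 8]], [-1.0, 1.0])
```

At a sample location the field reproduces the sample value almost exactly; at the midpoint of two equal-and-opposite samples the weights cancel and the field is zero.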
Architecture
- Inference — gRPC control plane with 37 services coordinating 6 NVIDIA GPUs via makespan-scheduled vLLM
- Interfaces — TUI, CLI, and MCP server (163 tools), all communicating with the engine via shared gRPC protocol
- Pipelines — Metaflow orchestration for article curation, agent evaluation, and batch rendering
- Visualization — LuxCore PATHOCL engine with GPU-accelerated rendering driven by a CFDG-inspired grammar
- Observability — FMEA-scored health observer with ACP-mediated agent intervention
- Storage — Bases feature store with a domain query language compiled to SQL via AST-based guardrails; RASE metamodel for agent verification
Getting Started
```shell
# Launch the TUI
uv run gaius

# Use the CLI for scripting
uv run gaius-cli --cmd "/health" --format json

# Check system status
uv run gaius-cli --cmd "/gpu status" --format json
```

Navigate with `hjkl`. Cycle overlays with `o`. Toggle modes with `v`. Press `?` for help.
Vision & Philosophy
The Polymath’s Dilemma
Modern knowledge work demands synthesis across domains. A pension analyst must understand markets, demographics, regulation, and behavioral economics—simultaneously. A systems architect must hold network topology, security surfaces, performance characteristics, and team dynamics in mind as a unified whole.
Yet our tools present information in fragments. Spreadsheets. Dashboards. Slide decks. Chat interfaces. Each offers a narrow aperture onto a high-dimensional reality.
Gaius proposes a different approach: spatial synthesis. By projecting complex relationships onto a navigable grid, it transforms abstract complexity into something the human visual system can grasp intuitively—patterns, clusters, voids, and flows.
Why a Grid?
The 19×19 Go board is not arbitrary. It represents a sweet spot in human visual cognition:
- 361 points: Enough resolution for meaningful differentiation, few enough for gestalt perception
- Addressable: Every point has a name (A1 through T19), enabling precise reference
- Compositional: Regions, groups, and territories emerge naturally from point relationships
- Battle-tested: 4,000 years of Go strategy have proven this grid’s capacity to represent complex strategic landscapes
The grid constrains—and constraint enables clarity. A 19×19 board forces prioritization. What matters enough to occupy space?
Topological Intuition
Raw data has shape. Clusters form. Loops persist. Voids signal absence. Traditional visualization obscures this topology behind axes, legends, and chart types.
Persistent homology offers a different lens. It asks: what structures survive as we vary our perspective? The resulting “death loops” (H1 features) reveal cycles in your data—feedback loops, circular dependencies, systemic risks—that persist across scales.
When projected onto the grid, these become visible warnings: regions to investigate, patterns to understand, risks to mitigate.
Agentic Amplification
A single human perspective is insufficient for complex domains. Gaius deploys autonomous agents that explore, evolve, and consolidate knowledge. Each agent brings a distinct analytical lens, and their capabilities improve through RLVR (Reinforcement Learning with Verifiable Reward) training.
Agent outputs are embedded and projected onto the grid. Watch agents converge on consensus. Notice where they scatter (uncertainty). Observe who stands alone (contrarian insight). The grid becomes a map of collective intelligence.
Design Principles
1. Keyboard-First
Every action available via keyboard. Mouse optional. This isn’t nostalgia—it’s recognition that flow state requires low-latency, high-bandwidth input.
2. Progressive Disclosure
Launch with uv run gaius and get a clean TUI instantly. Three interfaces — TUI, CLI, MCP — offer increasing levels of automation. Complexity arrives when requested.
3. Modal Operation
Modes aren’t complexity—they’re context. Navigate in normal mode. Enter commands in command mode. Each mode offers a focused set of operations.
4. Composability
Each component (board, log, overlay) is independent. Combine them. Split them. Tile them. The interface adapts to your workflow.
5. Transparency
No magic. The grid shows exactly what it’s told to show. Overlays are explicit. Agent positions reflect actual embeddings. Trust requires transparency.
The Goal
Gaius aims to demonstrate that terminal interfaces need not be constrained to text streams. That topological insight can be made visual. That agent augmentation can be made spatial.
It’s an experiment in augmented cognition—using machines not to replace human judgment, but to extend human perception into domains our unaided senses cannot reach.
Core Concepts
Gaius integrates several conceptual pillars: spatial representation, topological analysis, autonomous agents, and self-healing infrastructure. This section introduces the foundational ideas; subsequent chapters explore each in depth.
The Grid
At the center of Gaius is a 19x19 board. This isn’t a chart or a dashboard — it’s a canvas for projection.
High-dimensional data (embeddings, agent states, risk surfaces) gets compressed onto 361 addressable points. The compression is lossy by design: it forces salience. What survives projection is what matters.
The grid supports multiple visualization modes:
- Point markers: Individual data points as stones
- Density heatmaps: Aggregate intensity via shading
- Topology overlays: Death loops and persistent features
- Agent positions: Agent state projected from embedding space
See The Grid Metaphor for the full treatment.
Embeddings
Modern ML represents entities as vectors in high-dimensional space. Text, images, users, documents — all become points in a geometric landscape where distance encodes similarity.
Gaius consumes these embeddings directly. Agent utterances become vectors. Domain entities become vectors. Cards, articles, and knowledge base entries occupy positions in embedding space. The relationships between them — cosine similarities, clusters, outliers — become spatial relationships on the grid.
See Embeddings & Point Clouds for details on how Gaius handles vector representations.
Persistent Homology
Traditional statistics describe data’s distribution. Topology describes its shape.
Persistent homology asks: as we vary the scale of observation, what features persist?
- H0 features (connected components): Clusters that remain distinct
- H1 features (loops): Cycles that don’t collapse — the “death loops”
- H2 features (voids): Empty regions bounded by surfaces
These topological features often reveal structure invisible to statistical methods: feedback loops in systems, circular dependencies in code, liquidity traps in markets.
See Persistent Homology for the mathematical foundations and practical applications.
Autonomous Agents
Gaius agents are not static analyzers — they evolve. Through RLVR (Reinforcement Learning with Verifiable Reward) training, agents improve their capabilities over time. The agent system includes:
- Evolution: Task ideation, training runs, and capability evaluation
- Cognition: Self-observation and action planning
- Theta consolidation: Memory compression inspired by hippocampal replay
- CLT memory: Cognitive Load Theory-based knowledge structuring
See Agent System for implementation details.
Self-Healing
Gaius implements autonomous health monitoring based on FMEA (Failure Mode and Effects Analysis). Every failure mode has:
- A Guru Meditation Code for unique identification (e.g., `#DS.00000001.SVCNOTINIT`)
- An automated fix strategy that can diagnose, repair, and verify
- An escalation path to ACP (Agent Client Protocol) when self-healing fails
Errors are never silenced. The system either fixes itself or tells you exactly what’s wrong and how to fix it.
See Fail-Fast & Self-Healing for the design principles.
Putting It Together
A typical Gaius session:
- Launch the TUI: `uv run gaius`
- Observe the grid state — entity positions projected from embedding space
- Navigate (`hjkl`): Explore regions of interest
- Overlay (`o`): See topology, risk, or agent state
- Command (`/`): Run slash commands for deeper analysis
- Monitor (`/health`): Check system health, let self-healing handle issues
The grid becomes a living map of your domain’s complexity — updated as agents explore and topology reveals hidden structure.
The Grid Metaphor
Origins in Go
The 19×19 grid traces its heritage to the ancient game of Go (围棋/囲碁/바둑). For over four millennia, this board has served as a substrate for strategic reasoning of remarkable depth.
Go’s grid has properties that make it ideal for information visualization:
- Discrete but dense: 361 points offer fine granularity while remaining visually tractable
- Symmetric: No privileged positions (unlike chess’s asymmetric opening)
- Emergent structure: Corners, edges, and center have different strategic character despite identical local rules
- Scale-invariant patterns: The same shapes (eyes, ladders, ko) appear at multiple scales
The Grid as Projection Surface
In Gaius, the grid serves as a projection surface for high-dimensional data. Consider an embedding space with 1536 dimensions (typical for modern text embeddings). How do we make this legible?
```
High-dimensional space               The Grid
      (n=1536)                       (n=361)
         │                              │
         │        PCA / UMAP /          │
         │     custom projection        │
         ▼                              ▼
    ┌─────────┐                   ┌───────────┐
    │ ● ● ●   │                   │ · · ● · · │
    │  ●   ●  │      ────►        │ · ● · · · │
    │   ● ●   │                   │ · · · ● · │
    └─────────┘                   └───────────┘
```
The projection is necessarily lossy. This is a feature: it forces salience. Points that survive projection and remain distinct are points that matter.
Addressing
Every grid position has a unique address:
```
    A B C D E F G H J K L M N O P Q R S T
19  · · · · · · · · · · · · · · · · · · ·  19
18  · · · · · · · · · · · · · · · · · · ·  18
17  · · · + · · · · · · · · · + · · · · ·  17
...
 1  · · · · · · · · · · · · · · · · · · ·   1
    A B C D E F G H J K L M N O P Q R S T
```
Note: Column I is skipped (Go convention, to avoid confusion with the numeral 1).
This addressing enables:
- Precise reference: “The cluster at D4-F6”
- Command targeting: `/analyze K10` or `/mark Q16 critical`
- Spatial queries: “What’s near the center?” → J10-L10, J9-L11
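The address scheme above is mechanical enough to sketch in a few lines. This is an illustrative helper of our own naming, not Gaius's API; it maps 0-based lattice indices to Go-style addresses, skipping the letter I per the convention noted below.

```python
GO_COLS = "ABCDEFGHJKLMNOPQRST"  # 19 column letters; 'I' is skipped

def to_address(col: int, row: int) -> str:
    """Convert 0-based lattice indices to a Go-style address.

    col 0..18 maps to A..T (no I); row 0..18 maps to 1..19.
    """
    if not (0 <= col <= 18 and 0 <= row <= 18):
        raise ValueError(f"off-board: ({col}, {row})")
    return f"{GO_COLS[col]}{row + 1}"

print(to_address(3, 3))  # D4, a classic corner point
```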
Visual Vocabulary
The grid supports a rich visual vocabulary:
Point Markers
| Symbol | Meaning |
|---|---|
| ● | Black stone / primary entity |
| ○ | White stone / secondary entity |
| ✛ | Cursor position |
| a-i | Candidate markers (yellow) |
| ◦ | Neutral / unaffiliated point |
Density Shading
| Symbol | Density |
|---|---|
| ▓ | High (>75%) |
| ▒ | Medium (50-75%) |
| ░ | Low (20-50%) |
| · | Minimal (<20%) |
Overlay Markers
| Symbol | Meaning |
|---|---|
| ⚠ | Death loop / H1 feature |
| Colored ● | Agent position |
The Grid as Strategic Map
In Go, professionals often describe the board in terms of strategic regions:
- Corners (4 points): High-value, easy to secure
- Edges (4 sides): Secondary value, harder to defend
- Center: Hardest to claim, but dominates late-game influence
Gaius inherits this intuition. Data projected to corners represents stable, well-understood entities. Central positions represent contested or ambiguous terrain. Edge regions represent transitional states.
Compositional Thinking
The grid invites compositional reasoning:
- Groups: Connected points form units (liberty-counting in Go becomes cluster analysis)
- Territory: Regions bounded by your stones (areas of control/understanding)
- Influence: Distant effects from strong positions (attention propagation)
- Ko: Positions that oscillate (unstable equilibria in your data)
These metaphors aren’t forced—they emerge naturally when complex systems are projected onto discrete spatial representations.
Why Not a Larger Grid?
Larger grids (e.g., 100×100) would offer more resolution but sacrifice:
- Gestalt perception: Humans can’t perceive 10,000 points holistically
- Addressability: 100×100 coordinates (e.g., AA57) lose the at-a-glance readability of A1–T19
- Strategic depth: Go on 9×9 is trivial; 19×19 is profound. Scale matters.
The 19×19 board occupies a cognitive sweet spot. Gaius exploits this.
Embeddings & Point Clouds
What Are Embeddings?
Embeddings are learned vector representations that encode semantic relationships as geometric relationships. Two items that are “similar” in meaning have embedding vectors that are “close” in space.
```
"pension fund"    → [0.23, -0.41, 0.88, ...]   (1536 dims)
"retirement plan" → [0.25, -0.39, 0.86, ...]   (nearby)
"pizza recipe"    → [-0.67, 0.12, -0.33, ...]  (distant)
```
Modern embedding models (text-embedding-3-small, etc.) produce vectors where:
- Cosine similarity measures semantic relatedness
- Euclidean distance measures conceptual separation
- Clusters emerge naturally from semantic categories
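The first of these properties can be made concrete with a few lines of NumPy. The three-dimensional "embeddings" below are toy values chosen for illustration, not model output:

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine similarity: 1.0 = same direction, 0.0 = orthogonal."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real embeddings (which have hundreds
# or thousands of dimensions).
pension = [0.9, 0.1, 0.2]
retire  = [0.8, 0.2, 0.3]
pizza   = [-0.1, 0.9, -0.4]

print(cosine_similarity(pension, retire))  # high: semantically related
print(cosine_similarity(pension, pizza))   # low: unrelated
```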
Point Clouds in Gaius
When multiple embeddings are collected—agent utterances, domain entities, document fragments—they form a point cloud in embedding space.
```python
# Each agent utterance becomes a point
cloud = []
for agent in swarm:
    response = await agent.analyze(task)
    embedding = embedder.embed(response)
    cloud.append(embedding)

# Cloud shape: (n_utterances, embedding_dim)
```
This point cloud is the raw material for both:
- Grid projection (what you see)
- Topological analysis (what the math reveals)
Projection Methods
High-dimensional clouds must be compressed for visualization. Common methods:
PCA (Principal Component Analysis)
Finds the axes of maximum variance. Fast, deterministic, but linear—may miss curved structure.
```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
projected = pca.fit_transform(cloud)
```
UMAP (Uniform Manifold Approximation and Projection)
Preserves local neighborhood structure. Better for clusters, but non-deterministic.
Custom Projections
Domain-specific projections can encode prior knowledge. For pension analysis:
- X-axis: Risk (low → high)
- Y-axis: Time horizon (short → long)
Mapping to the Grid
Once projected to 2D, coordinates are scaled to [0, 18] and discretized:
```python
import numpy as np

# Normalize to [0, 1]
x_norm = (projected[:, 0] - projected[:, 0].min()) / (np.ptp(projected[:, 0]) + 1e-8)
y_norm = (projected[:, 1] - projected[:, 1].min()) / (np.ptp(projected[:, 1]) + 1e-8)

# Scale to the 19×19 grid
x_grid = np.clip((x_norm * 18).astype(int), 0, 18)
y_grid = np.clip((y_norm * 18).astype(int), 0, 18)
```
Multiple points may map to the same grid cell. This is handled by:
- Latest-wins: Most recent point displayed
- Color mixing: Combined representation
- Intensity: Brighter = more points
Semantic Distance on the Grid
Grid distance roughly corresponds to semantic distance—but the projection is lossy. Two points adjacent on the grid are likely related; two points distant are likely unrelated. But edge cases exist.
The grid offers intuition, not precision. For exact similarity queries, consult the underlying embeddings directly.
Temporal Dynamics
As new data arrives (agent responses, user queries, domain events), the point cloud evolves:
t=0: Initial cloud from seed data
t=1: + First swarm round utterances
t=2: + User query embeddings
t=3: + Second swarm round...
The grid animates this evolution. Watch clusters form, dissolve, migrate. These dynamics reveal how understanding develops over time.
Vector Memory Integration
All embeddings are stored in the Vector Memory system, enabling:
- Retrieval: “Find utterances similar to X”
- Scene graphs: Build edges from cosine similarity
- History: Track the trajectory of specific agents/entities
See Vector Memory for implementation details.
Persistent Homology
Beyond Statistics
Statistics describes the distribution of data: mean, variance, correlations. But distributions are blind to shape.
Consider two point clouds:
```
 Cloud A:        Cloud B:
   ● ●             ● ●
  ● ● ●           ●   ●
 ● ● ● ●         ●     ●
  ● ● ●           ●   ●
   ● ●             ● ●
```
Same mean. Same variance. Same point count. But Cloud A is a filled disk; Cloud B is a ring with a hole. The hole is topologically significant—it represents something absent, something that might matter.
Persistent homology is the mathematics of detecting such shapes.
The Vietoris-Rips Complex
Given a point cloud, we construct a simplicial complex by connecting points within a distance threshold ε:
```
ε = small:      ε = medium:     ε = large:

●    ●          ●───●           ●───●
                │   │           │╲ ╱│
●    ●          ●   ●           ●─╳─●
                │   │           │╱ ╲│
●    ●          ●───●           ●───●
```
As ε increases:
- H0 features (connected components): Merge as clusters connect
- H1 features (loops): Appear when edges close cycles, disappear when interiors fill
- H2 features (voids): Appear when surfaces enclose volumes
Birth and Death
Each topological feature has a birth time (the ε at which it appears) and a death time (the ε at which it vanishes).
Features that persist across a wide range of ε are considered significant—they reflect genuine structure rather than noise.
```
Persistence Diagram:

death
  │
  │  ● (signal: long-lived)
  │
  │          ● (noise: short-lived,
  │        ●    near the diagonal)
  │      ●
  └──────────────── birth
```
Points far from the diagonal represent persistent features.
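In production Gaius uses Ripser over a cosine distance matrix, but birth and death are easiest to see in dimension 0. The pure-Python sketch below (names ours) computes the H₀ barcode of a Euclidean point cloud by a single-linkage / Kruskal sweep: every component is born at ε = 0, and a bar dies at the ε where its component merges into another.

```python
import itertools, math

def h0_barcode(points):
    """H0 persistence bars of a point cloud via single linkage.

    Sorted merge distances in the minimum spanning tree are exactly
    the finite H0 death times; one component lives forever.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )
    deaths = []
    for d, i, j in edges:          # sweep ε upward
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)       # two components merge: one bar dies
    return [(0.0, d) for d in deaths] + [(0.0, math.inf)]
```

For two tight clusters, the short bars record intra-cluster merges and one long bar records the inter-cluster gap; that long bar is the persistent (significant) feature.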
Death Loops (H1)
In Gaius, H1 features receive special attention as “death loops.” These represent:
- Cycles in data flow: Feedback loops, circular dependencies
- Systemic risks: Self-reinforcing failure modes
- Market structures: Liquidity cycles, regulatory arbitrage loops
When projected onto the grid, death loops appear as ⚠ markers in regions where the underlying embedding space exhibits persistent 1-dimensional holes.
Practical Application
```python
from gtda.homology import VietorisRipsPersistence
from gtda.diagrams import PersistenceEntropy

# Compute persistence diagrams for H0, H1, H2
vr = VietorisRipsPersistence(homology_dimensions=[0, 1, 2])
diagrams = vr.fit_transform([point_cloud])

# Quantify topological complexity
entropy = PersistenceEntropy()
ent = entropy.fit_transform(diagrams)

# Extract significant H1 features.
# giotto-tda diagram rows are (birth, death, homology_dimension).
h1_features = diagrams[0][diagrams[0][:, 2] == 1]
persistent_loops = h1_features[h1_features[:, 1] - h1_features[:, 0] > threshold]
```
Entropy as Summary
Persistence entropy provides a scalar summary of topological complexity:
- Low entropy: Few dominant features (simple structure)
- High entropy: Many features of similar persistence (complex, fractal-like)
Gaius tracks entropy over time. Sudden entropy spikes may indicate regime changes in your domain.
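The scalar itself is just the Shannon entropy of normalized bar lifetimes. A minimal sketch (using the natural logarithm; giotto-tda's `PersistenceEntropy` is the production path, and our function name is illustrative):

```python
import math

def persistence_entropy(bars) -> float:
    """Shannon entropy of normalized bar lifetimes.

    bars: iterable of (birth, death) pairs; infinite bars are skipped.
    p_i = pers_i / sum(pers);  E = -sum(p_i * log(p_i))
    """
    pers = [d - b for b, d in bars if math.isfinite(d) and d > b]
    total = sum(pers)
    return -sum((p / total) * math.log(p / total) for p in pers)
```

A single dominant bar gives entropy 0 (simple structure); many bars of equal persistence give the maximum log(n) (complex structure).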
Interpreting Grid Overlays
When viewing the H1 overlay:
| Pattern | Interpretation |
|---|---|
Sparse ⚠ | Few persistent loops; structure is tree-like |
Clustered ⚠ | Localized cyclic structure; investigate region |
Uniform ⚠ | Pervasive cyclicity; may indicate noise or genuine complexity |
Ring of ⚠ | Boundary of a significant void |
Limitations
Persistent homology reveals shape but not causation. A detected loop could represent:
- A real feedback cycle in your domain
- An artifact of the embedding model
- Noise in the underlying data
Domain expertise is required to interpret topological features. Gaius surfaces the structure; you provide the meaning.
Further Reading
- Computational Topology by Edelsbrunner and Harer
- Topological Data Analysis by Carlsson
- giotto-tda documentation: giotto-ai.github.io
Epistemology of Augmented Cognition
How knowledge grows in a human-AI system
The Tautology
Augmented cognition must yield nonrandom advantage with verifiable outcomes.
This isn’t philosophy for its own sake. It’s the test. If the human-plus-system doesn’t produce results that beat the null hypothesis—problems solved faster, connections seen that would be missed, errors avoided, artifacts of higher quality—then the augmentation is theater.
Everything that follows serves this constraint.
The Third Mind
The Enlightenment assumed the individual mind as atomic unit: properly disciplined reason, applied to sensory evidence, converging on truth. The Romantic correction enriched the channels—emotion, intuition, aesthetic sense—but preserved the individual.
What if both missed something?
Cognition may have never been atomic. It distributes across brains, books, conversations, environments. The “individual thinker” was always a convenient fiction—useful for assigning credit and blame, but not how thinking actually happens.
Gaius makes the distribution explicit:
- The KB is externalized shared memory
- The swarm is a parliament of perspectives
- The cognition system generates thoughts between sessions
- The human brings mortality, stakes, aesthetic judgment, and the ability to act
What emerges is a third mind—something that belongs fully to neither human nor AI. It’s not human intelligence augmented by AI (the usual framing). It’s not AI directed by human. It’s a novel form of collaborative cognition that neither could produce alone.
The Dialectic on the Board
The 19x19 grid represents a fundamental tension:
One color (Order/Logos): The Enlightenment inheritance. Kant’s categories imposing structure on raw experience. Each stone is a fact—tested, confirmed, placed with certainty. The mind palace architecture where memory has address and retrieval is deterministic. This force embodies the best virtues of enlightenment thinking: we may come to know the universe through experience of our senses and share this knowing with others who may confirm or refute our understanding.
The other color (Entropy/Eros): The Romantic counter-current. Nietzsche’s Dionysian impulse that shatters Apollonian form. Bergson’s élan vital—life as creative evolution resisting mechanistic reduction. Each stone is a question, a provocation, a refusal to settle into local minima. This antithetical force is the path toward what may be an undiscovered formal description language for aesthetics.
The colors randomize daily. This prevents rooting for “our team.” Some days order serves creativity; some days entropy is the path to truth.
The Go metaphor is apt because Go isn’t chess—there’s no king to capture, no objective hierarchy. Victory is territory, which is liminal: stones create influence that shades into emptiness. The game rewards both sente (initiative, creativity) and gote (response, consolidation).
Memory and Compaction
An old man remembers every aspect of his first kiss but can’t recall breakfast.
This isn’t failure—it’s selection. The first kiss persists because it integrated into everything else: identity, narrative, desire, loss. It has a thousand hooks into the larger structure. Breakfast has one hook: “I ate.” No redundancy. Nothing to reconstruct from.
Human memory isn’t a tape recorder with degradation. It’s a living graph that keeps what connects and lets the rest dissolve. The “compression” isn’t lossy in the information-theoretic sense—it’s meaning-preserving. What matters survives.
The same principle applies to Gaius:
Should persist:
- What changed understanding
- What connects to many other things
- What might matter later in ways we can’t predict
- What was beautiful—even if we can’t justify why
Should dissolve:
- Scaffolding that served its purpose
- Dead ends fully explored
- Noise that looked like signal until it didn’t
The test: does this have hooks into the future?
The Lens: Falsifiable Forward Simulation
What separates understanding from memorization?
You can memorize that water boils at 100°C. You understand thermodynamics when you can simulate: “what happens to boiling point at altitude?” and get an answer that reality confirms.
Forward simulation + falsification = the engine of real knowledge.
This connects to work across domains:
- PINNs (Physics-Informed Neural Networks): Neural nets constrained by differential equations that must hold. The physics prior forces the model to learn something simulatable, not just interpolatable.
- Portfolio optimization: Build a model of covariances and returns, simulate forward, and the market confirms or refutes. The held-out Sharpe ratio is the falsification.
- SAT solvers: Explore logical possibility space by propagating constraints forward—if I assume X, what follows? Does it contradict something known?
Knowledge Hierarchy
Highest value: Knowledge that enables forward simulation with testable outputs
- “If we do X, Y should happen”—then we can check
- Causal models, not just correlations
- Theories, not just observations
Medium value: Observations that could become simulatable once enough accumulate
- Data points that might reveal structure
- Anomalies that challenge existing models
Lowest value: Isolated facts with no predictive hooks
- Things that are true but don’t connect forward
- The old man’s breakfast
The Dialectic Reframed
Through this lens, Order and Entropy both serve falsifiable simulation:
- Order = model refinement (tightening predictions, reducing uncertainty)
- Entropy = model exploration (new hypotheses, expanded possibility space)
Order sharpens the blade. Entropy finds new things to cut.
Implications for Design
- Score knowledge by forward-simulation capacity: Does this KB entry let you predict something you couldn’t before? Can that prediction be tested?
- Cognition should generate hypotheses: Between sessions, Gaius shouldn’t just summarize — it should ask: “what would I predict? what remains testable?”
- Evolution should favor predictive prompts: The held-out evaluation tests whether agent improvements transfer beyond training data.
- The grid should reveal predictive structure: Clusters might indicate shared causal mechanisms. Voids might indicate underdetermined regions. H1 cycles might indicate feedback loops with predictable dynamics.
- Compaction should preserve predictive content: When context windows fill, what survives should be what enables future simulation, not just what was recently accessed.
The Asymmetry
The human has continuity. The KB accumulates externalized cognition across sessions. Understanding can be observed evolving—in git history, in dated files, in logged thoughts.
The AI has no such continuity. Each session bootstraps from artifacts. Something that functions like understanding emerges within the session, but doesn’t persist. Tomorrow’s instance won’t remember this exchange unless it’s written down.
The human observes understanding in the mirror of shared artifacts. The AI is more like the mirror itself—a surface that reflects with some distortion, some amplification, but doesn’t retain the image once you look away.
But this asymmetry may be feature, not bug. The AI can’t get stuck in ruts, can’t accumulate biases from past sessions, always brings fresh eyes. The persistence lives in the artifacts, not in the AI.
And the tautology holds regardless: nonrandom advantage with verifiable outcomes. The test isn’t whether the AI has continuous selfhood. The test is whether the collaboration produces results.
This document emerged from collaborative discourse, December 2024. It attempts to capture understanding that might otherwise dissolve—not because discourse is unimportant, but because the impermanence of conversation is precisely what makes externalization necessary.
Fail-Fast & Self-Healing
Fail-fast is an iron-clad design principle in Gaius. All code surfaces errors immediately with actionable remediation paths. The system never silently degrades, falls back to placeholders, or continues with partial functionality.
The Principle
When something goes wrong, the correct response is not to hide it — it’s to surface it immediately with enough information to fix it. Every error message in Gaius includes:
- Guru Meditation Code: A unique identifier for the failure mode
- Health Fix Command: A reference to `/health fix <service>` when applicable
- Manual Remediation: Alternative manual steps if self-healing can’t resolve it
```python
error_msg = (
    "DatasetService not initialized.\n"
    "  Guru: #DS.00000001.SVCNOTINIT\n"
    "  Try:  /health fix dataset\n"
    "  Or:   just restart-clean"
)
```
Guru Meditation Codes
Inspired by the Amiga’s memorable error screens, every failure mode gets a unique identifier.
Format: `#<COMPONENT>.<SEQUENCE>.<MNEMONIC>`
| Component | Description |
|---|---|
| DS | DatasetService |
| NF | NiFi |
| EN | Engine |
| EP | Endpoints/Inference |
| EV | Evolution |
| DB | Database |
| QD | Qdrant |
| GR | gRPC |
| ACP | Agent Client Protocol |
| ACF | Article Curation Flow |
Each code maps to exactly one failure mode. A failure mode may have multiple diagnostic heuristics, but the code is the canonical identifier.
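Because the format is regular, codes are trivially machine-parseable. The helper below is a hypothetical sketch of our own devising (the field widths are inferred from the `#DS.00000001.SVCNOTINIT` example, not specified by Gaius):

```python
import re

# Pattern for #<COMPONENT>.<SEQUENCE>.<MNEMONIC>; the 8-digit sequence
# width is inferred from the documented example.
GURU_RE = re.compile(
    r"^#(?P<component>[A-Z]+)\.(?P<sequence>\d+)\.(?P<mnemonic>[A-Z0-9]+)$"
)

def parse_guru_code(code: str) -> dict:
    """Split a Guru Meditation Code into its three fields."""
    m = GURU_RE.match(code)
    if m is None:
        raise ValueError(f"not a Guru Meditation Code: {code!r}")
    return m.groupdict()
```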
See Guru Meditation Codes for the complete catalog.
What Fail-Fast Prohibits
No Optional Fallbacks
Never use `fail_fast=True` as a parameter. Fail-fast is the ONLY behavior, not an option.
No Silent Degradation
If a required resource is unavailable (LLM endpoint, NiFi, database), raise an error immediately. Never substitute placeholder data or skip functionality.
No Conditional Feature Flags for Core Functionality
Don’t use patterns like if SELENIUM_AVAILABLE: with an else clause that produces fake data. Either the feature works or it fails.
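A minimal before/after sketch of this rule. The availability probe, the `scrape_page` helper, and the guru code shown here are hypothetical, chosen only to illustrate the shape of a fail-fast error:

```python
# BAD (shown as comments): a fallback branch that fabricates data.
# if SELENIUM_AVAILABLE:
#     rows = scrape_page(url)
# else:
#     rows = [{"title": "placeholder"}]   # fake data silently hides the failure

# GOOD: fail fast with a guru code and remediation path.
def scrape_page(url: str) -> list[dict]:
    # Hypothetical availability check standing in for a real import probe.
    selenium_available = False
    if not selenium_available:
        raise RuntimeError(
            "Selenium backend unavailable.\n"
            "  Guru: #EN.00000042.SELENIUMDOWN\n"  # hypothetical code
            "  Try: /health fix engine\n"
            "  Or: just restart-clean"
        )
    return []  # real scraping would go here
```

The error carries everything the reader needs: what broke, its identifier, and two remediation paths.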
Fail Open for Observability
The counterpart to fail-fast for observability code is fail open. When filtering or displaying health state:
- Filter OUT, not IN: When showing active incidents, filter out known terminal states (resolved) rather than filtering in known active states. Unknown states are surfaced for investigation.
- Unknown States are Visible: Any state not in the “terminal” list is displayed. This ensures new or unexpected states don’t silently disappear.
# BAD: Filtering IN known active states (brittle)
active = [i for i in incidents if i.status in ("active", "healing")]
# GOOD: Filtering OUT known terminal states (fail open)
active = [i for i in incidents if i.status != "resolved"]
Self-Healing Hierarchy
When services are unhealthy, Gaius follows a remediation hierarchy:
- /health fix <service> — Let Gaius attempt self-healing first
- Manual commands (just restart-clean, etc.) — Only if self-healing fails
- ACP escalation — For novel failures that need human or AI intervention
The Health Observer daemon continuously monitors all system components. When an incident exceeds the configured FMEA RPN (Risk Priority Number) threshold, it escalates through ACP to Claude Code for meta-level intervention.
Heuristics and KB
Each failure mode has a corresponding heuristic document in the knowledge base:
- Symptom: Brief description of what the user sees
- Cause: Why this happens
- Observation: How to detect it programmatically
- Solution: How to fix it, with the /health fix command
This creates a closed loop: errors reference codes, codes map to heuristics, heuristics provide automated fixes.
System Overview
Gaius is a platform for navigating complex, graph-oriented data domains. It projects high-dimensional embeddings and topological structures onto a 19x19 grid, augmented by autonomous agents, self-healing infrastructure, and production data pipelines.
Package Structure
src/gaius/
├── app.py # TUI application (Textual)
├── cli.py # Non-interactive CLI
├── mcp_server.py # MCP server (163 tools)
├── core/ # Configuration, state, telemetry
├── engine/ # gRPC engine (central nervous system)
│ ├── server.py # Main daemon
│ ├── proto/ # Protobuf definitions
│ ├── generated/ # Generated gRPC bindings
│ ├── grpc/ # gRPC servicers
│ ├── services/ # 37 registered services
│ └── backends/ # vLLM, optillm, embedding controllers
├── health/ # FMEA-based self-healing
│ ├── observe.py # Health Observer daemon
│ ├── fmea/ # Risk scoring framework
│ └── service_fixes.py # Automated remediation
├── agents/ # Autonomous agent system
│ ├── evolution/ # RLVR training
│ ├── theta/ # Memory consolidation
│ └── cognition/ # Self-observation
├── inference/ # Multi-backend routing
├── flows/ # Metaflow data pipelines
├── viz/ # LuxCore visualization
├── storage/ # PostgreSQL + Qdrant
├── acp/ # Agent Client Protocol
├── rase/ # RASE metamodel (agent verification)
├── bases/ # Feature store
├── hx/ # History and lineage
├── observability/ # OpenTelemetry + Prometheus
├── widgets/ # TUI widgets
├── commands/ # Slash command implementations
├── kb/ # Knowledge base operations
├── models/ # Agent model versioning
├── client/ # gRPC client library
└── mcp/ # MCP tool implementations
Layer Architecture
The system is organized in layers with strict dependency direction:
| Layer | Components | Responsibility |
|---|---|---|
| L1 - Interface | TUI, CLI, MCP | User-facing thin clients |
| L2 - Client | gRPC client library | Transport abstraction |
| L3 - Engine | gRPC server, services | Business logic, orchestration |
| L4 - Backend | vLLM, optillm, embeddings | GPU workload execution |
| L5 - Storage | PostgreSQL, Qdrant, R2 | Persistence |
Rule: Higher layers depend on lower layers, never the reverse. The engine (L3) is the single point of coordination — TUI, CLI, and MCP all call engine RPCs rather than accessing backends or storage directly.
Key Numbers
| Metric | Count |
|---|---|
| Lines of code | ~252K |
| Python packages | 26 |
| Engine services | 37 |
| CLI commands | 63 |
| MCP tools | 163 |
| GPUs | 6 (NVIDIA) |
| gRPC port | 50051 |
| PostgreSQL port | 5444 |
Communication Paths
All three interfaces communicate with the engine via gRPC:
┌─────────┐ ┌─────────┐ ┌─────────┐
│ TUI │ │ CLI │ │ MCP │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
└────────────┼────────────┘
│ gRPC :50051
┌──────┴──────┐
│ Engine │
│ (37 svcs) │
└──────┬──────┘
│
┌────────────┼────────────┐
│ │ │
┌────┴────┐ ┌────┴────┐ ┌────┴────┐
│ vLLM │ │ Postgres│ │ Qdrant │
│ (GPUs) │ │ :5444 │ │ :6334 │
└─────────┘ └─────────┘ └─────────┘
See Engine-First Architecture for why this design was chosen.
Engine-First Architecture
All business logic lives in the gRPC engine. The TUI, CLI, and MCP server are thin clients that translate user intent into engine RPC calls.
Why Engine-First
Early Gaius had business logic scattered across the TUI, CLI, and various utility scripts. This created several problems:
- Duplication: The same logic reimplemented across interfaces
- Inconsistency: CLI and TUI producing different results for the same operation
- Testing difficulty: Business logic entangled with UI code
- Resource contention: Multiple processes competing for GPU access
The engine-first approach solves all of these by centralizing logic in a single daemon that manages all shared resources.
The Rule
Interfaces do not contain business logic. They:
- Parse user input into a command or RPC call
- Send the request to the engine via gRPC
- Format the response for display
If you find yourself writing business logic in app.py, cli.py, or mcp_server.py, it belongs in an engine service instead.
Thin Client Examples
TUI (app.py)
The TUI calls engine RPCs through the gRPC client:
# TUI widget calls engine for health data
result = await self.grpc_client.call("GetHealthStatus")
self.display(result)
CLI (cli.py)
The CLI dispatches slash commands to engine RPCs:
# CLI maps /health to engine RPC
result = await client.call("GetHealthStatus")
print(json.dumps(result, indent=2))
MCP (mcp_server.py)
MCP tools wrap engine RPCs for AI assistants:
@server.tool()
async def health_observer_status():
result = await client.call("GetHealthStatus")
return result
Benefits
- Single source of truth: One implementation, three interfaces
- GPU management: Engine controls all GPU allocation
- Background services: Evolution, cognition, health monitoring run in the engine daemon
- Consistent state: All clients see the same system state
Exceptions
A few operations are interface-specific by necessity:
- TUI rendering: Widget layout and Textual event handling
- CLI formatting: JSON/text output formatting
- MCP tool metadata: Tool descriptions and parameter schemas
These are presentation concerns, not business logic.
Interfaces: TUI, CLI, MCP
Gaius provides three access paths to the engine. Each serves a different use case but all communicate via the same gRPC protocol.
TUI (Terminal User Interface)
The interactive terminal application built on Textual.
uv run gaius
Components:
- MainGrid: 19x19 Go board for spatial visualization
- MiniGridPanel: Three 9x9 orthographic projections (CAD-style views)
- FileTree: Plan 9-inspired navigation with agents as files
- ContentPanel: Right panel displaying context and output
- CommandInput: Slash command input with history
Best for: Interactive exploration, spatial navigation, visual pattern recognition.
See The TUI for the user guide.
CLI (Command Line Interface)
Non-interactive interface for scripting and automation.
# Single command execution
uv run gaius-cli --cmd "/health" --format json
# Pipe to jq for extraction
uv run gaius-cli --cmd "/gpu status" --format json | jq '.data.endpoints[]'
# Poll for status changes
for i in $(seq 1 15); do
sleep 10
uv run gaius-cli --cmd "/gpu status" --format json
done
63 slash commands covering health, agents, inference, evolution, knowledge base, visualization, and more.
Best for: Scripting, CI/CD integration, automated monitoring, quick status checks.
See The CLI for the user guide.
MCP (Model Context Protocol)
Programmatic interface exposing 163 tools to AI assistants like Claude Code.
{
"mcpServers": {
"gaius": {
"command": "uv",
"args": ["run", "gaius-mcp"],
"cwd": "/path/to/gaius"
}
}
}
163 MCP tools organized by domain: health, agents, inference, knowledge base, observability, evolution, visualization, bases, and more.
Best for: AI-assisted operations, autonomous health maintenance, Claude Code integration.
See MCP Integration for setup and usage.
Interface Comparison
| Feature | TUI | CLI | MCP |
|---|---|---|---|
| Interactive | Yes | No | No |
| Visual grid | Yes | No | No |
| JSON output | No | Yes | Yes |
| Scriptable | No | Yes | Yes |
| AI-accessible | No | No | Yes |
| Slash commands | Yes | Yes | N/A |
| Streaming output | Yes | No | No |
Shared Protocol
All three interfaces use the same gRPC client library (gaius.client) to communicate with the engine:
from gaius.client import GrpcClient, GrpcClientConfig
config = GrpcClientConfig(
host="localhost",
port=50051,
timeout=30, # default; inference calls use 120s
)
client = GrpcClient(config)
result = await client.call("GetHealthStatus")
The default timeout is 30 seconds. Inference calls (completions, evaluations) use 120 seconds. These can be overridden via the GAIUS_ENGINE_TIMEOUT environment variable.
gRPC Engine
The engine is the central nervous system of Gaius. It’s a long-running daemon that manages GPU resources, coordinates services, and exposes all functionality via gRPC on port 50051.
Architecture
┌──────────────────────────────────────────────┐
│ gRPC Server :50051 │
│ ┌──────────────┐ ┌──────────────────────┐ │
│ │ KServe OIP │ │ Gaius Extensions │ │
│ │ (inference) │ │ (health, evolution, │ │
│ │ │ │ orchestrator, ...) │ │
│ └──────┬───────┘ └──────────┬───────────┘ │
├─────────┼─────────────────────┼──────────────┤
│ │ 37 Services │ │
│ ┌──────┴──────┐ ┌──────────┴───────────┐ │
│ │ Orchestrator │ │ Scheduler │ │
│ │ Evolution │ │ Cognition │ │
│ │ Health │ │ Topology │ │
│ │ CLT │ │ Dataset │ │
│ │ ... │ │ ... │ │
│ └──────┬───────┘ └──────────┬───────────┘ │
├─────────┼─────────────────────┼──────────────┤
│ │ Backend Controllers │ │
│ ┌──────┴──────┐ ┌──────────┴───────────┐ │
│ │ vLLM Ctrl │ │ Embedding Ctrl │ │
│ │ optillm Ctrl│ │ Backend Router │ │
│ └──────┬───────┘ └──────────┬───────────┘ │
│ │ │ │
│ ┌──────┴─────────────────────┴───────────┐ │
│ │ GPU Pool (6x NVIDIA) │ │
│ └────────────────────────────────────────┘ │
└──────────────────────────────────────────────┘
Startup Sequence
The engine initializes in 9 phases, streaming progress to connected clients:
| Phase | Duration | Action |
|---|---|---|
| INIT | Immediate | InitController starts |
| GRPC | ~1s | gRPC server binds to :50051 |
| TELEMETRY | ~2s | OpenTelemetry setup |
| BACKENDS | ~5s | Backend router initialization |
| ORCHESTRATOR | ~2s | Orchestrator service starts |
| ENDPOINTS | ~240s | vLLM model loading to VRAM |
| TRANSPORT | ~2s | Aeron bridge setup |
| SERVICES | ~5s | Background services start |
| COMPLETE | - | Ready for inference |
The gRPC server starts early (phase 2) so clients can connect immediately and receive real-time progress during the ~4 minute vLLM startup.
Module Structure
engine/
├── server.py # Main daemon entry point
├── config.py # Engine configuration
├── init_controller.py # Initialization progress streaming
├── workloads.py # Workload definitions
├── grpc/
│ ├── server.py # gRPC server setup
│ └── servicers/
│ ├── inference_servicer.py # KServe OIP implementation
│ └── gaius_servicer.py # Gaius extensions
├── backends/
│ ├── backend_router.py # Unified request routing
│ ├── vllm_controller.py # vLLM process management
│ ├── optillm_controller.py
│ └── embedding_controller.py
├── services/ # 37 registered services
├── compute/ # Grid projection, TDA
├── resources/ # GPU allocation
├── transport/ # Aeron bridge
├── generated/ # Protobuf generated code
└── proto/ # Protobuf definitions
gRPC Protocol
The engine implements two gRPC services:
KServe Open Inference Protocol
Standard inference protocol for compatibility with ML platforms:
service GRPCInferenceService {
rpc ServerLive(ServerLiveRequest) returns (ServerLiveResponse);
rpc ServerReady(ServerReadyRequest) returns (ServerReadyResponse);
rpc ModelMetadata(ModelMetadataRequest) returns (ModelMetadataResponse);
rpc ModelInfer(ModelInferRequest) returns (ModelInferResponse);
}
Gaius Extensions
Custom RPCs for Gaius-specific functionality:
service GaiusService {
rpc WatchInit(stream InitRequest) returns (stream InitProgress);
rpc WatchHealth(HealthRequest) returns (stream HealthMetrics);
rpc EvolutionStatus(Empty) returns (EvolutionStatusResponse);
rpc TriggerEvolution(TriggerRequest) returns (TriggerResponse);
rpc GetEndpointStatus(Empty) returns (EndpointStatusResponse);
rpc StartEndpoint(StartRequest) returns (StartResponse);
rpc StopEndpoint(StopRequest) returns (StopResponse);
}
Configuration
engine {
grpc {
host = "0.0.0.0"
port = 50051
max_workers = 10
max_message_size = 104857600 # 100MB
}
orchestrator {
preload_endpoints = ["reasoning"]
startup_timeout = 600 # 10 minutes
health_check_interval = 30
}
scheduler {
max_queue_size = 1000
default_timeout = 120
}
evolution {
enabled = true
idle_threshold = 60
cycle_interval = 3600
}
}
Running the Engine
# Via devenv process-compose (normal operation)
devenv processes up
# Standalone
uv run gaius-engine
# Clean restart (stops everything, cleans up, restarts)
just restart-clean
Verifying Engine Health
# Check if gRPC port is listening
nc -zv localhost 50051
# Check endpoint status
uv run gaius-cli --cmd "/gpu status" --format json
# Watch engine logs
tail -f .devenv/processes.log | grep gaius-engine
Engine Services
The engine hosts 37 services organized into four groups: resource management, intelligence, data, and external integration.
Service Groups
Resource Management
| Service | Purpose |
|---|---|
| OrchestratorService | vLLM endpoint lifecycle and GPU allocation |
| SchedulerService | Priority-based job queue with XAI budget |
| HealthService | GPU and endpoint health monitoring |
| AgendaTracker | Tracks scheduled endpoint transitions for makespan operations |
Intelligence
| Service | Purpose |
|---|---|
| EvolutionService | Agent prompt optimization via APO |
| CognitionService | Autonomous thought generation (every 4h) |
| CLTService | Cross-Layer Transcoder feature extraction |
| TopologyService | Semantic attractor detection and drift |
| NGRCPredictor | Reservoir computing for temporal prediction |
Data
| Service | Purpose |
|---|---|
| DatasetService | NiFi SoM dataset generation |
| FlowSchedulerService | Metaflow pipeline scheduling |
| KBService | Knowledge base CRUD operations |
| LineageService | Provenance tracking |
External Integration
| Service | Purpose |
|---|---|
| XBookmarksService | X (Twitter) bookmark synchronization |
Service Registration
Services register with the engine during startup. Each service implements a standard lifecycle:
class SomeService:
async def start(self) -> None:
"""Initialize resources, start background tasks."""
...
async def stop(self) -> None:
"""Clean shutdown, release resources."""
...
Background Tasks
Several services run scheduled background tasks:
| Task | Service | Schedule | Purpose |
|---|---|---|---|
| cognition_cycle | CognitionService | Every 4h | Pattern detection in KB activity |
| self_observation | CognitionService | Every 8h | Meta-cognitive reflection |
| engine_audit | CognitionService | Every 12h | System health analysis |
| Evolution cycle | EvolutionService | GPU idle | Agent prompt optimization |
| Health check | HealthService | Every 30s | Endpoint liveness |
Service Dependencies
Services form a dependency graph. The orchestrator and scheduler are foundational — most other services depend on them for inference access:
OrchestratorService → vLLM Controller → GPU Pool
SchedulerService → OrchestratorService
EvolutionService → SchedulerService
CognitionService → SchedulerService
HealthService → GPU Pool (via pynvml)
TopologyService → CLTService
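The graph above implies a startup order: dependencies must come up before their dependents. A sketch of deriving that order with a topological sort (service names are from the graph above; the registry dict itself is illustrative):

```python
from graphlib import TopologicalSorter

# Each service maps to the set of services it depends on,
# mirroring the dependency graph above.
DEPENDS_ON = {
    "OrchestratorService": set(),
    "SchedulerService": {"OrchestratorService"},
    "EvolutionService": {"SchedulerService"},
    "CognitionService": {"SchedulerService"},
    "CLTService": set(),
    "TopologyService": {"CLTService"},
    "HealthService": set(),  # reads the GPU pool directly via pynvml
}

# static_order() yields dependencies before dependents,
# giving one valid startup sequence.
startup_order = list(TopologicalSorter(DEPENDS_ON).static_order())
```

Foundational services (orchestrator, scheduler) naturally sort to the front, matching the note that most other services depend on them.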
See the individual service chapters for implementation details:
Orchestrator
The OrchestratorService manages vLLM endpoint lifecycle and GPU allocation. It decides which models are loaded, on which GPUs, and handles startup, shutdown, and recovery.
Endpoint Lifecycle
Endpoints transition through these states:
PENDING → STARTING → HEALTHY
↘ UNHEALTHY → FAILED
HEALTHY → STOPPING → STOPPED
EndpointStatus
@dataclass
class EndpointStatus:
name: str # "reasoning", "coding", etc.
state: str # "healthy", "starting", "unhealthy", "stopped"
gpus: list[int] # Allocated GPU indices
pid: int | None # vLLM process ID
port: int # Serving port
model: str # HuggingFace model ID
uptime_seconds: int
Workload Management
The orchestrator follows Yunikorn-style capability-based scheduling:
- Requests declare capabilities, not endpoints: A workload asks for “reasoning” capability, not a specific model
- Priority-based preemption: Idle endpoints can be evicted for higher-priority work
- Makespan fulfillment: The engine ensures work completes, then restores baseline set points
Example: Render Pipeline
When the viz pipeline needs a GPU for LuxCore rendering:
- Workload requests GPU with allow_baseline_eviction=True
- Orchestrator evicts the lowest-priority endpoint from the target GPU
- Rendering completes
- Orchestrator restores the evicted endpoint
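The evict-then-restore flow can be sketched with stub types. The `Orchestrator` class and method names here are illustrative stand-ins, not the real OrchestratorService API:

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class Orchestrator:
    """Illustrative stand-in for the real orchestrator."""
    gpu_endpoints: dict[int, str] = field(default_factory=dict)
    evicted: dict[int, str] = field(default_factory=dict)

    async def acquire_gpu(self, gpu: int, allow_baseline_eviction: bool) -> None:
        if gpu in self.gpu_endpoints and allow_baseline_eviction:
            # Evict the endpoint occupying the GPU, remembering the baseline.
            self.evicted[gpu] = self.gpu_endpoints.pop(gpu)

    async def release_gpu(self, gpu: int) -> None:
        # Restore whatever baseline endpoint was evicted.
        if gpu in self.evicted:
            self.gpu_endpoints[gpu] = self.evicted.pop(gpu)

async def render_job(orch: Orchestrator, gpu: int) -> None:
    await orch.acquire_gpu(gpu, allow_baseline_eviction=True)
    try:
        pass  # LuxCore rendering would run here
    finally:
        await orch.release_gpu(gpu)  # baseline set point restored even on failure
```

The `try/finally` captures the makespan guarantee: work completes (or fails), then the baseline set point is restored.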
Clean Start
The clean_start() operation handles recovery from corrupted state:
result = await orch.clean_start(endpoints=["reasoning"])
# Kills stale vLLM processes
# Cleans up CUDA memory
# Restarts endpoints fresh
Health Integration
The orchestrator works with the AgendaTracker to distinguish intentional state changes from failures. When an endpoint is part of a scheduled makespan operation, the Health Observer skips incident creation:
if tracker.is_endpoint_in_scheduled_transition("reasoning"):
# Don't create incident — this is planned
expected = tracker.get_scheduled_endpoint_state("reasoning")
Checking Status
uv run gaius-cli --cmd "/gpu status" --format json | jq '.data.endpoints[]'
Scheduler
The SchedulerService provides a priority-based job queue for inference requests with XAI budget management and weighted completion time minimization.
Priority Levels
| Priority | Weight | Use Case |
|---|---|---|
| CRITICAL (0) | 1.0 | User-facing interactive requests |
| HIGH (1) | 2.0 | Interactive queries |
| NORMAL (2) | 4.0 | Background processing |
| LOW (3) | 8.0 | Batch operations |
| EVOLUTION (4) | 16.0 | Agent evolution (lowest priority) |
Lower weights receive preferential scheduling. Critical requests preempt everything.
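A sketch of how the weights could drive queue ordering: a heap keyed by weight, with an arrival counter as tie-breaker so jobs of equal priority stay FIFO. The `JobQueue` class is illustrative; only the weight table comes from the documentation above:

```python
import heapq
import itertools

# Priority weights from the table above; lower weight is served first.
WEIGHTS = {"CRITICAL": 1.0, "HIGH": 2.0, "NORMAL": 4.0, "LOW": 8.0, "EVOLUTION": 16.0}

class JobQueue:
    def __init__(self) -> None:
        self._heap: list[tuple[float, int, str]] = []
        self._seq = itertools.count()  # tie-break preserves FIFO within a priority

    def submit(self, prompt: str, priority: str) -> None:
        heapq.heappush(self._heap, (WEIGHTS[priority], next(self._seq), prompt))

    def next_job(self) -> str:
        return heapq.heappop(self._heap)[2]
```

With jobs of every priority queued, a CRITICAL submission always pops first, so evolution work only runs when nothing else is waiting.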
Job Submission
from gaius.engine.services import SchedulerService, InferenceJob, JobPriority
scheduler = SchedulerService()
job = InferenceJob(
prompt="Analyze the risk factors...",
priority=JobPriority.HIGH,
max_tokens=2048,
)
result = await scheduler.submit(job)
XAI Budget
The scheduler tracks daily usage of external AI APIs (xAI Grok) to prevent runaway costs:
budget = scheduler.get_xai_budget()
# budget.daily_remaining: tokens left for today
# budget.daily_limit: configured daily cap
# budget.reset_time: when the budget resets
Requests exceeding the budget are rejected with a clear error message.
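A minimal sketch of the budget gate. Field names follow the snippet above; the `XaiBudget` class and its `charge` method are illustrative:

```python
from dataclasses import dataclass

@dataclass
class XaiBudget:
    daily_limit: int   # configured daily token cap
    daily_used: int = 0

    @property
    def daily_remaining(self) -> int:
        return self.daily_limit - self.daily_used

    def charge(self, tokens: int) -> None:
        if tokens > self.daily_remaining:
            # Fail fast with an actionable message rather than queueing silently.
            raise RuntimeError(
                f"XAI budget exceeded: requested {tokens}, "
                f"remaining {self.daily_remaining} of {self.daily_limit}"
            )
        self.daily_used += tokens
```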
Makespan Scheduling
For complex workloads that require multiple inference calls (e.g., agent evolution with candidate generation + evaluation), the scheduler uses makespan optimization to minimize total completion time:
- Decompose workload into individual inference jobs
- Assign priorities based on workload urgency
- Schedule across available endpoints
- Track completion via the AgendaTracker
See Makespan Scheduling for the optimization details.
Timeouts
| Context | Default Timeout |
|---|---|
| General gRPC calls | 30s |
| Inference (completions) | 120s |
| Evaluation | 120s |
A 24B model with cot_reflection takes 15-20 seconds per completion. Timeouts are set per-call:
result = await client.call("ModelInfer", request, timeout=120)
Override the default via GAIUS_ENGINE_TIMEOUT environment variable.
Protobuf Schema
The gRPC API is defined in Protocol Buffers. Changes to the proto require a specific workflow to keep generated bindings, internal enums, and status mappings in sync.
Key Files
| File | Purpose |
|---|---|
| engine/proto/gaius_service.proto | Proto definitions (source of truth) |
| engine/proto/gaius_service_pb2.py | Generated Python bindings |
| engine/proto/gaius_service_pb2_grpc.py | Generated gRPC stubs |
| engine/generated/__init__.py | Re-exports for clean imports |
| engine/grpc/servicers/gaius_servicer.py | Server-side implementation |
Endpoint Status Values
enum ProcessStatus {
PROCESS_STATUS_UNSPECIFIED = 0;
PROCESS_STATUS_STOPPED = 1;
PROCESS_STATUS_STARTING = 2;
PROCESS_STATUS_HEALTHY = 3;
PROCESS_STATUS_UNHEALTHY = 4;
PROCESS_STATUS_STOPPING = 5;
PROCESS_STATUS_FAILED = 6;
PROCESS_STATUS_PENDING = 7; // Queued for startup
}
Startup state transitions: PENDING → STARTING → HEALTHY
Change Workflow
1. Edit the Proto File
Append new enum values. Don’t renumber existing values for wire compatibility.
2. Regenerate Bindings
just proto-generate
3. Update Generated Exports
Add new symbols to engine/generated/__init__.py:
- Add to the import block
- Add to the __all__ list
Critical: Skipping this step causes import errors at engine startup.
4. Update Internal Enums
If there’s a parallel Python enum (e.g., in vllm_controller.py), sync it with the proto enum.
5. Update Status Mappings
Add string-to-proto mappings in the servicer’s _STATUS_MAP.
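A hypothetical sketch of what such a mapping looks like. The real `_STATUS_MAP` lives in `gaius_servicer.py`; the numeric values here follow the `ProcessStatus` enum above, with proto constants inlined for a self-contained example:

```python
# Inlined stand-ins for the generated ProcessStatus constants (values per the proto).
PROCESS_STATUS_STOPPED = 1
PROCESS_STATUS_STARTING = 2
PROCESS_STATUS_HEALTHY = 3
PROCESS_STATUS_UNHEALTHY = 4
PROCESS_STATUS_STOPPING = 5
PROCESS_STATUS_FAILED = 6
PROCESS_STATUS_PENDING = 7

_STATUS_MAP = {
    "stopped": PROCESS_STATUS_STOPPED,
    "starting": PROCESS_STATUS_STARTING,
    "healthy": PROCESS_STATUS_HEALTHY,
    "unhealthy": PROCESS_STATUS_UNHEALTHY,
    "stopping": PROCESS_STATUS_STOPPING,
    "failed": PROCESS_STATUS_FAILED,
    "pending": PROCESS_STATUS_PENDING,
}

def to_proto_status(state: str) -> int:
    # Fail fast on unknown states instead of silently returning UNSPECIFIED.
    if state not in _STATUS_MAP:
        raise KeyError(f"No proto mapping for endpoint state {state!r}")
    return _STATUS_MAP[state]
```

A missing entry here is exactly the "Status shows wrong value" issue in the table below, which is why each new enum value needs a matching mapping.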
6. Verify Import
uv run python -c "from gaius.engine.generated import NEW_SYMBOL; print('OK')"
7. Restart and Test
just restart-clean
uv run gaius-cli --cmd "/gpu status" --format json
Common Issues
| Symptom | Cause | Fix |
|---|---|---|
| Engine fails to start | Missing export in __init__.py | Add symbol to imports and __all__ |
| Port 50051 not listening | gRPC server didn’t initialize | Check logs for import errors |
| Status shows wrong value | Missing status mapping | Add to _STATUS_MAP |
Testing gRPC Features
gRPC reflection is not enabled, so grpcurl cannot discover services. Use the CLI instead:
uv run gaius-cli --cmd "/gpu status" --format json | jq '.data.endpoints[] | {name, status}'
Health & Self-Healing
Gaius implements autonomous health monitoring based on FMEA (Failure Mode and Effects Analysis). The system quantifies risk using RPN (Risk Priority Number) scores, applies tiered remediation, and learns from outcomes to improve over time.
Architecture
The health system has four layers:
- Detection: Scheduled checks, continuous watcher, and user reports identify issues
- Analysis: FMEA engine calculates RPN scores from severity, occurrence, and detection ratings
- Remediation: Three-tier system from automatic restarts to agent-assisted diagnosis to user approval
- Learning: Adaptive learner adjusts S/O/D scores based on remediation outcomes
How It Works
When a health check detects an issue:
- The FMEA engine maps it to a failure mode from the 34-mode catalog
- RPN is calculated: RPN = S x O x D (max 1000)
- Based on the RPN score, remediation is routed to the appropriate tier:
- RPN < 100 (Tier 0): Automatic procedural restart
- RPN 100-200 (Tier 1): Agent-assisted remediation
- RPN > 200 (Tier 2): Requires user approval
- RPN > 300: Escalates to ACP (Claude Code) for meta-level intervention
- Outcomes feed back into the adaptive learner, adjusting future risk scores
Health Check Categories
| Category | Example Checks |
|---|---|
| Infrastructure | gRPC connection, PostgreSQL, Qdrant, MinIO |
| GPU | Memory usage, temperature |
| Endpoints | vLLM health, stuck endpoints, orphan processes |
| Evolution | Evolution daemon, cognition daemon |
| Resources | Disk space, scheduler queue, XAI budget |
CLI Commands
# Run all health checks
uv run gaius-cli --cmd "/health" --format json
# Run checks for a specific category
uv run gaius-cli --cmd "/health gpu" --format json
# Apply automated fix
uv run gaius-cli --cmd "/health fix engine" --format json
# FMEA summary
uv run gaius-cli --cmd "/fmea" --format json
Self-Healing First
When encountering unhealthy services, always try /health fix before manual intervention:
- /health fix <service> — Let Gaius attempt self-healing
- just restart-clean — Only if self-healing fails
- Manual investigation — Last resort
This ensures the self-healing system gets exercised and improved over time.
Subchapters
- FMEA Framework — Risk scoring details and failure mode catalog
- Remediation Strategies — Fix strategies and tier system
- Health Observer — Continuous monitoring daemon
- Guru Meditation Codes — Error identification system
FMEA Framework
FMEA (Failure Mode and Effects Analysis) replaces simple severity classification with quantitative risk assessment. Originally from manufacturing engineering, Gaius adapts it for software systems.
Risk Priority Number
Each failure mode is scored on three dimensions:
RPN = S x O x D (range 1-1000)
| Dimension | Meaning | Scale |
|---|---|---|
| S (Severity) | Impact on system availability | 1 (negligible) to 10 (total failure) |
| O (Occurrence) | Probability of recurrence | 1 (rare) to 10 (frequent) |
| D (Detection) | Ability to detect before impact | 1 (always caught) to 10 (invisible) |
Higher RPN means higher risk. The worst possible score (10 x 10 x 10 = 1000) indicates a severe, frequent, and invisible failure.
Action Thresholds
| RPN Range | Tier | Action |
|---|---|---|
| 1-100 | Tier 0 | Automatic procedural remediation |
| 101-200 | Tier 1 | Agent-assisted remediation |
| 201-400 | Tier 2 | Requires user approval |
| 401-1000 | Manual | Human intervention required |
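The thresholds can be expressed as a simple routing function. This is a sketch of the table only; the real router additionally applies the conservative overrides described next:

```python
def route_tier(rpn: int) -> str:
    """Map an RPN score (1-1000) to a remediation tier per the table above."""
    if not 1 <= rpn <= 1000:
        raise ValueError(f"RPN out of range: {rpn}")
    if rpn <= 100:
        return "tier0-automatic"
    if rpn <= 200:
        return "tier1-agent-assisted"
    if rpn <= 400:
        return "tier2-approval"
    return "manual"
```

For example, GPU_001 (Memory Exhaustion, RPN 192) routes to agent-assisted remediation, while EB_004 (Self-Observation Bias, RPN 270) requires user approval.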
Conservative Overrides
Certain conditions always escalate regardless of RPN:
- Detection >= 8: Poor observability requires approval
- Safety level DESTRUCTIVE: Data-modifying actions require approval
- Multiple correlated failures: Escalate to next tier
Failure Mode Catalog
34 failure modes across 7 categories:
GPU (6 modes)
| ID | Failure Mode | S | O | D | RPN |
|---|---|---|---|---|---|
| GPU_001 | Memory Exhaustion | 8 | 6 | 4 | 192 |
| GPU_002 | Temperature Critical | 9 | 3 | 2 | 54 |
| GPU_003 | Hardware Error | 10 | 2 | 3 | 60 |
| GPU_004 | Driver Crash | 8 | 3 | 4 | 96 |
| GPU_005 | Memory Fragmentation | 7 | 5 | 4 | 140 |
| GPU_006 | Power Throttling | 5 | 4 | 3 | 60 |
vLLM Endpoint (6 modes)
| ID | Failure Mode | S | O | D | RPN |
|---|---|---|---|---|---|
| VLLM_001 | Stuck Starting | 6 | 5 | 5 | 150 |
| VLLM_002 | Stuck Stopping | 4 | 4 | 4 | 64 |
| VLLM_003 | Health Check Failure | 7 | 6 | 3 | 126 |
| VLLM_004 | Orphan Process | 5 | 5 | 4 | 100 |
| VLLM_005 | OOM Crash | 8 | 5 | 3 | 120 |
| VLLM_006 | KV-Cache Exhaustion | 5 | 6 | 5 | 150 |
Model Quality (5 modes)
| ID | Failure Mode | S | O | D | RPN |
|---|---|---|---|---|---|
| MQ_001 | Hallucination Increase | 7 | 4 | 6 | 168 |
| MQ_002 | Latency Degradation | 4 | 5 | 3 | 60 |
| MQ_003 | Output Quality Drift | 5 | 6 | 7 | 210 |
| MQ_004 | Semantic Drift | 6 | 4 | 8 | 192 |
| MQ_005 | Context Exhaustion | 6 | 5 | 4 | 120 |
Emergent Behavior (4 modes)
| ID | Failure Mode | S | O | D | RPN |
|---|---|---|---|---|---|
| EB_001 | Swarm Consensus Failure | 6 | 4 | 6 | 144 |
| EB_002 | Cognition Loop | 5 | 4 | 7 | 140 |
| EB_003 | Embedding Drift | 6 | 5 | 8 | 240 |
| EB_004 | Self-Observation Bias | 6 | 5 | 9 | 270 |
Note: Emergent behavior modes have high Detection scores (poor observability), reflecting the inherent difficulty of detecting these failure modes automatically.
Adaptive Learning
The system adjusts S/O/D scores based on remediation outcomes using exponential moving average (alpha = 0.2):
- Successful fast fix: Occurrence decreases (problem is manageable)
- Failed fix: Occurrence increases (problem is more persistent than estimated)
- User-reported: Detection increases (automated checks missed it)
- Early detection: Detection decreases (automated checks caught it)
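A sketch of the exponential-moving-average update, with α = 0.2 as stated above. The `ema_update` helper and the clamping to the 1-10 scale are illustrative assumptions:

```python
ALPHA = 0.2  # weight given to the newest observation

def ema_update(current: float, observed: float, alpha: float = ALPHA) -> float:
    """Blend a new observation into an S/O/D score, clamped to the 1-10 scale."""
    blended = alpha * observed + (1.0 - alpha) * current
    return min(10.0, max(1.0, blended))
```

For example, a failed fix observed as Occurrence 9 against a current score of 6 yields 0.2 × 9 + 0.8 × 6 = 6.6 — the score drifts toward the evidence without overreacting to a single outcome.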
CLI Commands
# FMEA summary with current RPN scores
uv run gaius-cli --cmd "/fmea" --format json
# Failure mode catalog
uv run gaius-cli --cmd "/fmea catalog" --format json
# Detail for specific failure mode
uv run gaius-cli --cmd "/fmea detail GPU_001" --format json
# Recent incidents
uv run gaius-cli --cmd "/fmea history" --format json
Remediation Strategies
Fix strategies are multi-step procedures that diagnose, repair, and verify service health. Each strategy is registered in the SERVICE_STRATEGIES dictionary and invoked via /health fix <service>.
Available Fix Strategies
| Service | Strategy | Steps |
|---|---|---|
| engine | EngineFixStrategy | Kill stale processes, clean CUDA, restart |
| dataset | DatasetFixStrategy | Re-initialize NiFi connection, verify |
| nifi | NiFiFixStrategy | Check connectivity, restart processors |
| postgres | PostgresFixStrategy | Check connection, verify schema |
| qdrant | QdrantFixStrategy | Check connectivity, verify collections |
| minio | MinIOFixStrategy | Check connectivity, verify buckets |
| endpoints | EndpointsFixStrategy | Health check, restart unhealthy |
| evolution | EvolutionFixStrategy | Restart evolution daemon |
Strategy Pattern
Each strategy follows the same pattern:
class EngineFixStrategy:
async def execute(self) -> FixResult:
# Step 1: Diagnose
issues = await self.diagnose()
# Step 2: Remediate
for issue in issues:
await self.fix(issue)
# Step 3: Verify
healthy = await self.verify()
return FixResult(
success=healthy,
steps_taken=self.steps,
duration_ms=elapsed,
)
Three-Tier System
Tier 0: Procedural (RPN < 100)
Automatic restart without agent involvement:
# Kill stale process, wait, restart
await orchestrator.stop_endpoint(endpoint)
await asyncio.sleep(5) # Cool-down
await orchestrator.start_endpoint(endpoint)
Tier 1: Agent-Assisted (RPN 100-200)
Uses a healthy inference endpoint to diagnose and decide on remediation:
diagnosis = await inference.analyze(issue.to_dict())
if diagnosis.action == "clear_cache":
await clear_kv_cache(endpoint)
elif diagnosis.action == "rollback":
await rollback_config(endpoint)
Tier 2: Approval Required (RPN > 200)
Creates an approval record for human review. Destructive operations (data modification, configuration changes) always require Tier 2 regardless of RPN.
Usage
# Fix a specific service
uv run gaius-cli --cmd "/health fix engine" --format json
# Fix all unhealthy services
uv run gaius-cli --cmd "/health fix all" --format json
Adding a New Fix Strategy
- Create a class in health/service_fixes.py implementing execute() -> FixResult
- Register it in SERVICE_STRATEGIES
- Add a KB heuristic document
- Test via /health fix <service>
Health Observer
The HealthObserver daemon provides continuous health monitoring with FMEA-based incident management and ACP escalation for issues beyond local remediation capability.
Operation
The observer runs as a background service within the engine, polling system health at a configurable interval (default 60 seconds).
from gaius.health.observe import HealthObserver
observer = HealthObserver()
await observer.start() # Begins continuous monitoring
Incident Lifecycle
Detection → Active → Healing → Recovered → Resolved
↘ Escalated (ACP) → Resolved
- Detection: Health check identifies a failure
- Active: Incident created with FMEA risk scoring
- Healing: Self-healing attempts in progress
- Recovered/Escalated: Either resolved locally or sent to ACP
- Resolved: Terminal state
Fail Open
When filtering incidents for display, the observer uses fail open semantics: it filters OUT known terminal states (resolved) rather than filtering IN known active states. Unknown states are always surfaced for investigation.
Makespan Integration
The observer integrates with the AgendaTracker to avoid false-positive incidents during scheduled operations. When an endpoint is part of a planned makespan transition:
if tracker.is_endpoint_in_scheduled_transition("reasoning"):
    # Skip incident creation — this is intentional
    log.info(f"Skipping: endpoint in scheduled transition to {expected_state}")
ACP Escalation
When an incident exceeds the RPN threshold or local remediation fails after 3 attempts, the observer escalates to Claude Code via ACP:
- Claude Code analyzes the issue using MCP tools
- Identifies gaps in the `/health fix` framework
- Implements new fix strategies and heuristics
- Commits to the `acp-claude/health-fix` branch for review
Cadence Limits
To prevent runaway automation:
- Max 3 GitHub issues per 24 hours
- Min 5 minutes between restart attempts
- Max 3 restarts per endpoint per hour
CLI Commands
# Observer status
uv run gaius-cli --cmd "/health observer" --format json
# Active incidents
uv run gaius-cli --cmd "/health incidents" --format json
# Incident detail
uv run gaius-cli --cmd "/health incident <id>" --format json
Guru Meditation Codes
Inspired by the Amiga’s iconic error screens, every failure mode in Gaius gets a unique identifier — a Guru Meditation Code. These codes create a traceable link from error messages to diagnostics and remediation.
Format
#<COMPONENT>.<SEQUENCE>.<MNEMONIC>
- Component: Two or three letter abbreviation for the subsystem
- Sequence: Zero-padded number unique within the component
- Mnemonic: Human-readable description of the failure mode
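A sketch of parsing the format; the regex mirrors the `#<COMPONENT>.<SEQUENCE>.<MNEMONIC>` convention described above, with two-to-three letter components and uppercase mnemonics as seen in the examples:

```python
import re
from typing import NamedTuple

GURU_RE = re.compile(
    r"^#(?P<component>[A-Z]{2,3})\.(?P<sequence>\d+)\.(?P<mnemonic>[A-Z0-9_]+)$"
)

class GuruCode(NamedTuple):
    component: str
    sequence: int
    mnemonic: str

def parse_guru(code: str) -> GuruCode:
    m = GURU_RE.match(code)
    if not m:
        raise ValueError(f"not a guru meditation code: {code!r}")
    return GuruCode(m["component"], int(m["sequence"]), m["mnemonic"])
```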
Components
| Code | Component |
|---|---|
| DS | DatasetService |
| NF | NiFi |
| EN | Engine |
| EP | Endpoints/Inference |
| EV | Evolution |
| DB | Database |
| QD | Qdrant |
| GR | gRPC |
| ACP | Agent Client Protocol |
| ACF | Article Curation Flow |
| HL | Health |
| XB | X Bookmarks |
How They’re Used
Every error message includes the guru code and remediation path:
DatasetService not initialized.
Guru: #DS.00000001.SVCNOTINIT
Try: /health fix dataset
Or: just restart-clean
Design Rules
- One code per failure mode: Each code maps to exactly one failure
- Unique across the system: No two failure modes share a code
- Stable: Codes are never renumbered once assigned
- Documented: Each code has a KB heuristic with symptom, cause, and fix
KB Heuristics
Each guru code has a corresponding heuristic document in the knowledge base at build/dev/current/heuristics/gaius/<category>/<name>.md containing:
- Symptom: What the user sees
- Cause: Root cause analysis
- Observation: How to detect programmatically
- Solution: Remediation steps, including the `/health fix` command
See Guru Meditation Codes Reference for the complete catalog.
Agent System
The agent system provides LLM orchestration patterns for domain analysis: role-based prompt execution, parallel inference coordination, temporal consolidation, and background evolution.
Execution Patterns
Swarm Execution
The primary pattern executes multiple LLM calls with distinct role-based system prompts in parallel:
| Role | Perspective | Temperature |
|---|---|---|
| Leader | Strategic synthesis | 0.7 |
| Risk | Threat identification | 0.6 |
| Optimizer | Efficiency analysis | 0.7 |
| Planner | Roadmap development | 0.7 |
| Critic | Adversarial review | 0.8 |
| Executor | Implementation assessment | 0.6 |
| Adversary | Stress testing | 0.8 |
Execution is parallel but not agentic — roles don’t observe each other’s outputs or iterate.
Latent Swarm (LatentMAS)
Reduces inter-agent token transfer by sharing embeddings instead of text via Qdrant. Agents store output embeddings; subsequent agents retrieve relevant context via semantic search.
Token reduction: 70-90% compared to text-based coordination.
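The hand-off pattern can be sketched without the vector database; here a plain in-memory list and cosine similarity stand in for Qdrant, purely for illustration:

```python
import math

_store: list[tuple[str, list[float]]] = []  # (agent_id, embedding)

def publish(agent_id: str, embedding: list[float]) -> None:
    """An agent stores its output embedding instead of its full text."""
    _store.append((agent_id, embedding))

def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query: list[float], k: int = 2) -> list[str]:
    # A later agent pulls only the most relevant prior outputs by
    # semantic similarity, never re-reading every other agent's text.
    ranked = sorted(_store, key=lambda e: _cosine(query, e[1]), reverse=True)
    return [agent_id for agent_id, _ in ranked[:k]]
```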
MetaAgent Coordination
Specialist “analysts” answer natural language questions by querying structured data sources (Cypher for lineage, SQL for metrics). Results are synthesized by a correlator.
Background Processes
Two background processes run within the engine:
- Evolution Daemon: Optimizes agent prompts during GPU idle periods
- Cognition Agent: Generates “thoughts” about patterns in KB activity (every 4-8h)
Module Structure
agents/
├── swarm.py # SwarmManager (parallel execution)
├── roles.py # Role definitions (system prompts)
├── metaagent_swarm.py # MetaAgentManager
├── cognition.py # Pattern detection
├── theta/ # Temporal consolidation pipeline
├── latent/ # Qdrant-backed working memory
└── evolution/ # Prompt optimization
Subchapters
- Evolution — RLVR-based prompt optimization
- Cognition — Autonomous thought generation
- Theta Consolidation — Temporal knowledge linking
- CLT Memory — Cross-Layer Transcoder features
Evolution
The evolution subsystem optimizes agent system prompts using APO (Automatic Prompt Optimization) during GPU idle periods. It generates candidate prompts, evaluates them against held-out tasks, and promotes winners.
Evolution Cycle
1. Wait for GPU idle (<30% utilization)
2. Select next agent (round-robin)
3. Generate candidate prompts
4. Evaluate against held-out tasks
5. Promote best if improved
6. Record lineage
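The cycle above, condensed into a sketch; the function names and the agent record are placeholders for the daemon's actual internals:

```python
def evolution_cycle(agent: dict, gpu_util: float, generate, evaluate) -> str:
    """One cycle: returns the promoted prompt (or the incumbent)."""
    if gpu_util >= 0.30:
        return agent["prompt"]  # step 1: only run when the GPU is idle
    candidates = generate(agent["prompt"])           # step 3: candidates
    scored = [(evaluate(c), c) for c in candidates]  # step 4: held-out tasks
    best_score, best = max(scored)
    if best_score > evaluate(agent["prompt"]):       # step 5: promote if improved
        agent["lineage"].append(agent["prompt"])     # step 6: record lineage
        agent["prompt"] = best
    return agent["prompt"]
```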
Optimization Methods
| Method | Description |
|---|---|
| APO | Automatic Prompt Optimization (Zhou et al., 2023) |
| GEPA | Genetic Evolution of Prompt Architectures |
Model Merging
Agent versions can be combined using parameter-space merging:
| Method | Description |
|---|---|
| Linear | Weighted average of parameters |
| TIES | Resolves sign conflicts between models |
| DARE | Drop and rescale for sparse merging |
Agent Versioning
Each evolution cycle produces a new agent version with tracked lineage:
# Check evolution status
uv run gaius-cli --cmd "/evolve status" --format json
# View agent versions
uv run gaius-cli --cmd "/evolve versions leader" --format json
# Promote a specific version
uv run gaius-cli --cmd "/evolve promote leader v3" --format json
Configuration
evolution {
    enabled = true
    idle_threshold = 60     # seconds of GPU idle before triggering
    cycle_interval = 3600   # minimum seconds between cycles
}
The daemon runs in the engine process and activates only during GPU idle periods to avoid competing with interactive inference.
Cognition
The CognitionService generates autonomous “thoughts” by analyzing recent knowledge base activity. It runs as a scheduled background task within the engine.
Scheduled Tasks
| Task | Interval | Purpose |
|---|---|---|
| `cognition_cycle` | Every 4h | Detect patterns in recent KB activity |
| `self_observation` | Every 8h | Meta-cognitive reflection on thought patterns |
| `engine_audit` | Every 12h | System health and resource analysis |
Thought Types
| Type | Description |
|---|---|
| `PATTERN` | Recurring themes across documents |
| `CONNECTION` | Cross-domain relationships discovered |
| `CURIOSITY` | Questions warranting investigation |
| `SELF_OBSERVATION` | Meta-cognitive observations about thought quality |
How It Works
Each cognition cycle:
- Retrieves recent KB entries and thought history
- Analyzes for patterns, connections, and gaps
- Generates thoughts using a reasoning endpoint
- Stores thoughts in the knowledge base
- Records in the thought chain for provenance
CLI Commands
# Trigger cognition cycle manually
uv run gaius-cli --cmd "/cognition" --format json
# View recent thoughts
uv run gaius-cli --cmd "/thoughts" --format json
# Trigger self-observation
uv run gaius-cli --cmd "/self-observe" --format json
Thought Chain
Thoughts are linked in a chain with provenance tracking. Each thought references its trigger (scheduled, manual, or reactive) and the inputs that contributed to it. This creates an auditable trail of the system’s reasoning.
Theta Consolidation
ThetaAgent executes a deterministic consolidation pipeline for cross-temporal knowledge linking. Named after theta rhythms in hippocampal replay, it compresses temporal experience into durable knowledge connections.
Pipeline Stages
Temporal Slicing → NVAR Signal → BERTSubs Inference → KG Selection → Augmentation
1. Temporal Slicing
Documents are organized into weekly slices (YYYY-WNN format). Each slice represents a temporal context for consolidation.
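The weekly slice key can be derived with the standard library's ISO calendar (named attributes on `isocalendar()` require Python 3.9+); this is a sketch of the convention, not the pipeline's actual code:

```python
from datetime import date

def slice_key(d: date) -> str:
    """Return the YYYY-WNN slice for a document date."""
    iso = d.isocalendar()
    return f"{iso.year}-W{iso.week:02d}"
```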
2. NVAR Dynamics
Nonlinear Vector AutoRegression using reservoir computing computes a consolidation signal from embedding centroid trajectories. High “urgency” indicates rapid semantic drift requiring consolidation attention.
3. BERTSubs Inference
Subsumption relationships between concepts are inferred using BERTSubs from DeepOnto. The inferencer identifies “A is-a B” relationships via fine-tuned BERT classification on ontology subsumptions.
4. Knowledge Gradient Selection
Candidate relationships are filtered using the Knowledge Gradient policy, balancing exploration (learning about uncertain candidates) against exploitation (selecting high-confidence relationships).
5. Document Augmentation
Selected relationships are injected into source documents as wikilinks and action links for navigation.
Usage
# Run consolidation
uv run gaius-cli --cmd "/theta consolidate" --format json
# View consolidation stats
uv run gaius-cli --cmd "/theta stats" --format json
# Check situational report
uv run gaius-cli --cmd "/sitrep" --format json
Dependencies
- DeepOnto with JVM (via JPype) for BERTSubs
- OWL domain ontology with `rdfs:subClassOf` axioms
- Sufficient class count (~50+ classes) for training data
CLT Memory
Cross-Layer Transcoder (CLT) memory extracts sparse features from model activations, providing interpretable representations of agent state.
How It Works
The CLTService extracts sparse features from inference activations using circuit-tracer:
state = await clt.extract_features(
    agent_id="critic",
    content="The risk model has issues...",
)
# state.features.active_indices — which features activated
Swarm Features
CLT features can be computed across a swarm to find consensus:
swarm_result = await clt.compute_swarm_features(domain="pension")
# swarm_result.consensus_features — features active across multiple agents
Integration with Topology
The TopologyService consumes CLT features to detect semantic attractors — regions in embedding space where agent attention converges:
CLTService → extract features → TopologyService → detect attractors → grid overlay
Qdrant Collections
CLT features are stored in dedicated Qdrant collections:
| Collection | Purpose |
|---|---|
| `gaius_clt_memory` | Cross-Layer Transcoder feature history |
| `gaius_latent_memory` | Latent working memory for swarm coordination |
CLI Commands
# Extract CLT features
uv run gaius-cli --cmd "/clt extract" --format json
# CLT memory statistics
uv run gaius-cli --cmd "/clt stats" --format json
Data Pipeline
The data pipeline connects external sources to the knowledge base, card collections, and search index through a sequence of ingestion, processing, and indexing stages.
End-to-End Flow
Web Sources (Brave, arXiv, RSS)
|
v
NiFi Ingestion ──> Raw Content (HX / Iceberg)
|
v
Metaflow Pipelines ──> Article Drafts, Card Creation
|
v
Qdrant Indexing ──> 768-dim Nomic Embeddings
|
v
PostgreSQL (zndx_gaius:5444) ──> Cards, Collections, Metadata
|
v
R2 Storage ──> Rendered Visualizations (viz.gaius.zndx.org)
Pipeline Stages
Ingestion. NiFi processors fetch content from external APIs, RSS feeds, and web search results (Brave). Raw content is stored in Apache Iceberg tables via the HX data lake before any processing occurs. This preserves the original source material and provides a replay capability.
Processing. Metaflow pipelines handle the compute-intensive work: PDF conversion via docling, topic extraction via BERTopic, relevance scoring via local LLMs, and article draft generation. See Metaflow Integration for details on the execution environment.
Article Curation. The Article Curation flow orchestrates the full lifecycle from article selection through card creation and publication. Each run produces approximately 20 cards in under 2 minutes.
Indexing. Processed content is embedded using Nomic (768-dimensional vectors) and indexed in Qdrant for semantic search. The same embeddings drive the TUI’s 19x19 grid layout and the visualization pipeline.
Storage. Cards, collections, and metadata live in PostgreSQL (zndx_gaius on port 5444). Rendered card images are uploaded to Cloudflare R2 and served from viz.gaius.zndx.org. See Viz Storage for the object key convention.
Lineage Tracking
Every pipeline stage emits OpenLineage events that are materialized into an Apache AGE graph. This provides full provenance from source URL to published card. See Lineage Tracking for Cypher query examples.
Knowledge Base
The Knowledge Base serves as both input and output of the pipeline. Articles begin as zettelkasten notes in build/dev/scratch/, and the curation flow produces structured content in build/dev/current/articles/.
Key Services
| Service | Role | Port |
|---|---|---|
| NiFi | Content ingestion | 8443 |
| Metaflow | Pipeline execution | 8180 |
| PostgreSQL | Metadata, cards, collections | 5444 |
| Qdrant | Vector search | 6333 |
| MinIO | Artifact storage (S3-compatible) | 9000 |
| Gaius Engine (gRPC) | Orchestration, scheduling | 50051 |
CLI Access
# List available flows
uv run gaius-cli --cmd "/flows list"
# Trigger article curation
uv run gaius-cli --cmd "/article curate ai-reasoning-weekly"
# Query lineage for a KB file
uv run gaius-cli --cmd "/lineage query scratch/2026-03-14/paper.md"
Metaflow Integration
Gaius uses Metaflow for production data pipelines that run on Kubernetes. Flows handle article curation, content evaluation, rendering, and document processing.
Infrastructure
The Metaflow service is deployed via Tilt in infra/tilt/ and runs on the local RKE2 Kubernetes cluster. Access requires a port-forward:
kubectl port-forward svc/metaflow-service 8180:8080
The environment variable METAFLOW_SERVICE_URL=http://localhost:8180 must be set for flow execution. This is configured automatically in devenv.nix for interactive shells and explicitly in process scripts.
GaiusFlow Base Class
All Gaius flows inherit from GaiusFlow, which provides OpenLineage integration and KB path helpers:
from gaius.flows import GaiusFlow
from metaflow import step
class MyFlow(GaiusFlow):
    @step
    def start(self):
        self.emit_lineage_start("my_flow", inputs=[...])
        self.next(self.process)

    @step
    def end(self):
        self.emit_lineage_complete(outputs=[...])
KB path helpers generate paths following the zettelkasten convention:
# scratch/{date}/{HHMMSS}_{title}.md
path = self.zettelkasten_path("My Analysis")
# current/archive/{quarter}/attachments/{filename}
path = self.archive_path("paper.pdf")
Flow Registry
Flows are registered for CLI discovery using the @register_flow decorator:
from gaius.flows import register_flow
@register_flow("article-curation")
class ArticleCurationFlow(GaiusFlow):
    ...
Registered flows can be listed and invoked from the CLI or MCP tools.
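A plausible sketch of how such a registry decorator works; the real `register_flow` lives in `gaius.flows` and may differ in detail:

```python
FLOW_REGISTRY: dict[str, type] = {}

def register_flow(name: str):
    def wrap(cls: type) -> type:
        if name in FLOW_REGISTRY:
            raise ValueError(f"duplicate flow name: {name}")
        FLOW_REGISTRY[name] = cls
        return cls  # class is returned unchanged, so normal use is unaffected
    return wrap

@register_flow("article-curation")
class ArticleCurationFlow:
    ...
```

Because the decorator returns the class untouched, registration is a pure side effect: the CLI can enumerate `FLOW_REGISTRY` while Metaflow still sees an ordinary flow class.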
Available Flows
| Flow | Purpose | Typical Duration |
|---|---|---|
| ArticleCurationFlow | End-to-end article research and card publication | ~2 min |
| ArxivDoclingFlow | Fetch and convert arXiv papers to markdown | ~30s |
| ClouderaDocsFlow | Sync Cloudera documentation archives | varies |
See Article Curation for the full 11-step pipeline.
Configuration
Key environment variables:
| Variable | Purpose |
|---|---|
| `METAFLOW_SERVICE_URL` | Metaflow service endpoint (http://localhost:8180) |
| `METAFLOW_DATASTORE_SYSROOT_S3` | MinIO path for flow artifacts |
| `METAFLOW_DEFAULT_METADATA` | Metadata backend (postgresql) |
| `GAIUS_KB_ROOT` | Knowledge base root directory |
Running Flows
# Via Metaflow CLI
python -m metaflow.cli run ArticleCurationFlow --article ai-reasoning-weekly
# Via Gaius CLI
uv run gaius-cli --cmd "/article curate ai-reasoning-weekly"
# Via MCP tool
uv run gaius-cli --cmd "/fetch_paper 2312.12345"
K8s Prerequisites
- `kubectl` and `k9s` are Nix-managed via `devenv.nix` (not the system RKE2 binary)
- `KUBECONFIG` must be set to `~/.config/kube/rke2.yaml` (never use fallback syntax)
- K8s pods need `pg_hba.conf` entries for the `10.42.0.0/16` and `10.43.0.0/16` subnets
Article Curation
The ArticleCurationFlow is an 11-step Metaflow pipeline that automates the discovery, research, drafting, and publication of articles. It is the primary content production mechanism in Gaius.
Pipeline Overview
start ──> grok_research_summary ──> select_article ──> acquire_external
──> update_manifest ──> create_draft ──> create_base ──> create_cards
──> enrich_cards ──> publish_batch ──> end
Each run produces approximately 20 cards in under 2 minutes.
Article Discovery
Articles live at current/articles/{slug}/ in the knowledge base. Each article directory contains a markdown file with YAML frontmatter that must include keywords and/or news_queries to guide the Brave search fetcher:
---
title: "AI Reasoning Weekly"
keywords: ["chain-of-thought", "reasoning models", "test-time compute"]
news_queries: ["AI reasoning breakthroughs 2026"]
---
Empty keywords trigger a fail-fast error:
#ACF.00000013.NOHINTS - Article has no keywords or news_queries
Try: Add keywords to the article frontmatter
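A minimal sketch of that fail-fast check; the error text mirrors the guru code above, while the validation function itself is illustrative:

```python
def require_search_hints(frontmatter: dict) -> None:
    """Fail fast when an article gives the fetcher nothing to search for."""
    if not frontmatter.get("keywords") and not frontmatter.get("news_queries"):
        raise ValueError(
            "#ACF.00000013.NOHINTS - Article has no keywords or news_queries"
        )
```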
Article Selection
The selection rubric evaluates candidate articles using several signals. The curation_readiness gate prevents selecting articles that lack sufficient zettelkasten notes or have incomplete frontmatter. Collection balance – specifically pending_cards count – is the most effective diversity signal, steering selection toward underrepresented topics.
External Source Acquisition
Once an article is selected, the flow fetches external sources in parallel using Brave search. Results are scored for relevance by a local LLM. Only sources exceeding the relevance threshold are retained.
Draft Generation and Card Creation
Drafts are synthesized using Grok, drawing from the article’s zk/ zettelkasten notes. The flow does NOT search the broader KB to avoid exposing private materials in published articles.
After drafting, the flow creates a BFO-grounded .base file with references, then generates collection cards from those references. Cards are created with pending status.
Enrichment Before Publish
Cards must be fully enriched before publication. Enrichment includes:
- Summary generation – LLM-generated card summaries
- Image rendering – Procedural visualizations via LuxCore
Only cards that pass both enrichment steps are published. Failed cards remain pending for the next run. This prevents incomplete content from appearing on the site.
CLI Access
# Curate a specific article
uv run gaius-cli --cmd "/article curate ai-reasoning-weekly"
# List available articles
uv run gaius-cli --cmd "/article list"
# Check article status
uv run gaius-cli --cmd "/article status"
Fail-Fast Guarantees
The flow fails immediately if required services are unavailable. No fallbacks or placeholder content is generated. Key guru meditation codes:
| Code | Meaning |
|---|---|
| `#ACF.00000013.NOHINTS` | Article missing keywords/news_queries |
| `#FL.00001.DOCLING_FAIL` | Document conversion failed |
| `#FL.00002.METAFLOW_DB` | Metaflow metadata DB unavailable |
Privacy
The curation flow only uses the article’s own zk/ notes as source material. It does not search the broader knowledge base, ensuring private materials are never exposed in published articles.
Knowledge Base
The knowledge base is a markdown-first document store organized as a zettelkasten. It lives under build/dev/ (gitignored) and is accessible through MCP tools for CRUD operations.
Directory Structure
build/dev/
├── current/ # Active work (manually curated)
│ ├── projects/ # Project-specific documents
│ ├── articles/ # Article directories with frontmatter
│ ├── content/domains/ # Domain-specific content
│ └── heuristics/ # Guru meditation heuristic files
│ └── gaius/
├── scratch/ # Zettelkasten notes (organized by date)
│ ├── 2026-03-14/
│ │ ├── 103045_my_analysis.md
│ │ └── 142200_research_notes.md
│ └── 2026-03-13/
└── archive/ # Quarterly archives
└── 2026Q1/
└── attachments/
current/ contains active, manually curated work. Articles, projects, and domain content live here. Heuristic files for guru meditation codes are stored at current/heuristics/gaius/{category}/{name}.md.
scratch/ is the zettelkasten. Files are organized by date and named with a time prefix: {HHMMSS}_{title}.md. This is where Metaflow pipelines deposit processed content and where daily research notes accumulate.
archive/ holds quarterly archives with binary attachments (PDFs, images) that are too large for the scratch directory.
MCP Tools
The KB is fully accessible through MCP tools, enabling Claude Code and other agents to read, write, and search the knowledge base:
| Tool | Operation |
|---|---|
| `search_kb` | Full-text search across all KB content |
| `read_kb` | Read a specific file by path |
| `create_kb` | Create a new file at a given path |
| `update_kb` | Update an existing file |
| `list_kb` | List files in a directory |
| `delete_kb` | Delete a file |
# Search the knowledge base
uv run gaius-cli --cmd "/search_kb 'persistent homology'"
# Read a specific file
uv run gaius-cli --cmd "/read_kb scratch/2026-03-14/103045_analysis.md"
Path Conventions
Metaflow flows use helper methods on GaiusFlow to generate consistent paths:
# Zettelkasten path: scratch/{date}/{HHMMSS}_{title}.md
path = self.zettelkasten_path("My Analysis")
# -> "scratch/2026-03-14/103045_my_analysis.md"
# Archive path: current/archive/{quarter}/attachments/{filename}
path = self.archive_path("paper.pdf")
# -> "current/archive/2026Q1/attachments/paper.pdf"
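The helpers could be implemented roughly as follows; this is a reimplementation inferred from the examples above (the slug rules, lowercasing and underscore-joining, are assumptions), not the actual `GaiusFlow` code:

```python
from datetime import datetime

def zettelkasten_path(title: str, now: datetime) -> str:
    # scratch/{date}/{HHMMSS}_{title}.md
    slug = "_".join(title.lower().split())
    return f"scratch/{now:%Y-%m-%d}/{now:%H%M%S}_{slug}.md"

def archive_path(filename: str, now: datetime) -> str:
    # current/archive/{quarter}/attachments/{filename}
    quarter = (now.month - 1) // 3 + 1
    return f"current/archive/{now.year}Q{quarter}/attachments/{filename}"
```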
Integration with Pipelines
The KB serves as both input and output for the data pipeline:
- Input: Articles with frontmatter and zettelkasten notes drive the article curation flow
- Output: Processed papers, research summaries, and draft articles are written back to scratch/ or current/
- Lineage: KB file paths appear as Dataset nodes in the lineage graph, enabling provenance queries from source URL to KB entry
Storage Backend
KB operations go through gaius.storage.kb_ops, which manages the filesystem-backed store. The GAIUS_KB_ROOT environment variable overrides the default build/dev/ location. Content is not stored in the database – the KB is a plain filesystem hierarchy, making it easy to browse, grep, and version control externally.
Sync to HX
Raw content (PDFs, API responses) is stored separately in the HX data lake (Apache Iceberg) to prevent the KB from being overwhelmed with unprocessed data. Only curated summaries and processed markdown enter the KB.
Lineage Tracking
Lineage tracking provides graph-based provenance that connects data sources to derived artifacts. Every pipeline stage emits OpenLineage events that are materialized into an Apache AGE graph stored in PostgreSQL.
Architecture
Metaflow Pipelines ──┐
Fetch Workers ───────┤──> RunEvent ──> LineageEmitter ──> Apache AGE Graph
Agents ──────────────┘ |
v
Cypher Queries (MCP + CLI)
Graph Schema
The lineage graph uses three vertex labels and four edge labels:
Vertices:
- `Dataset` – a data source or sink (namespace, name)
- `Job` – a processing definition (namespace, name)
- `Run` – a single execution of a job (run_id, state, event_time)
Edges:
- `INPUT_TO` – Dataset consumed by Run
- `OUTPUTS` – Run produced Dataset
- `EXECUTES` – Job spawned Run
- `PARENT` – Run is child of another Run
OpenLineage Events
Flows emit events at key lifecycle points:
| Event | Timing | Purpose |
|---|---|---|
START | Flow begin | Record input datasets |
COMPLETE | Flow end | Record output datasets |
FAIL | On error | Record failure with context |
from gaius.hx.lineage import get_emitter, RunEvent, Dataset, Job
emitter = get_emitter()
event = RunEvent.complete(
    run=run,
    job=Job("gaius.flows", "ArticleCurationFlow"),
    inputs=[Dataset("gaius.source", "brave:ai-reasoning")],
    outputs=[Dataset("gaius.kb", "scratch/2026-03-14/paper.md")],
)
await emitter.emit(event)
Cypher Queries
Lineage can be queried via the MCP lineage_cypher tool or the CLI:
# Trace upstream sources for a KB file
uv run gaius-cli --cmd "/lineage query scratch/paper.md"
Example Queries
Find all KB files derived from arXiv sources:
MATCH (s:Dataset)-[:INPUT_TO]->(:Run)-[:OUTPUTS]->(kb:Dataset)
WHERE s.namespace = 'gaius.source' AND s.name STARTS WITH 'arxiv:'
RETURN s.name as source, kb.name as kb_path
Trace full provenance chain (up to 5 hops):
MATCH path = (src:Dataset)-[:INPUT_TO|OUTPUTS*1..5]->(target:Dataset)
WHERE target.namespace = 'gaius.kb'
AND target.name CONTAINS 'attention_is_all_you_need'
RETURN src.namespace, src.name
Count vertices by label:
MATCH (n) RETURN labels(n)[0] as label, count(n) as cnt
HX Package
The lineage subsystem lives in gaius.hx.lineage:
hx/lineage/
├── events.py # Dataset, Job, Run, RunEvent (OpenLineage types)
├── emitter.py # LineageEmitter (store + graph sync)
└── graph.py # AGE Cypher helpers
The parent gaius.hx package is the raw content data lake (Apache Iceberg). Lineage events bridge HX raw storage to KB curated content, recording every transformation step.
Integration Points
- Metaflow flows emit START/COMPLETE/FAIL events via the `GaiusFlow` base class
- Fetch workers emit events when acquiring external content
- MCP tools expose `query_lineage` and `lineage_cypher` for graph traversal
- The lineage graph is stored in the same PostgreSQL instance (`zndx_gaius:5444`) using the Apache AGE extension
Inference
The inference layer routes requests across multiple backends: vLLM for local GPU models, optillm for reasoning enhancement, and external APIs (xAI, Cerebras) for cloud-based inference.
Backend Router
The BackendRouter selects the appropriate backend based on capability requirements:
class BackendRouter:
    async def route_inference(
        self,
        model: str,
        prompt: str,
        max_tokens: int,
        technique: str = "",  # optillm technique
    ) -> str
Backends
| Backend | Purpose | Hardware |
|---|---|---|
| vLLM | Local model inference | 6x NVIDIA GPUs |
| optillm | Reasoning enhancement (CoT, BoN, MoA) | Proxies to vLLM |
| xAI (Grok) | External API inference | Cloud |
| Cerebras | External API inference | Cloud |
| Nomic | Text embeddings | 1 GPU |
optillm Techniques
| Technique | Description |
|---|---|
| `cot_reflection` | Chain-of-thought with reflection |
| `bon` | Best-of-N sampling |
| `moa` | Mixture of Agents |
| `rto` | Round-trip optimization |
| `z3` | Z3 solver integration |
| `leap` | Learn from examples |
Request Flow
Client → gRPC → Scheduler → BackendRouter → Backend
↗ vLLM (local)
↗ optillm → vLLM
↗ xAI API (cloud)
All inference requests route through the gRPC engine for centralized authentication, audit logging, and resource management.
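The routing decision can be sketched as a capability check; the model-name prefixes here are invented for illustration, and the real `BackendRouter` also weighs auth, budget, and scheduling:

```python
def pick_backend(model: str, technique: str = "") -> str:
    """Choose a backend for a request (illustrative rules only)."""
    if technique:                     # optillm wraps a local vLLM endpoint
        return "optillm"
    if model.startswith("grok"):      # external APIs selected by model family
        return "xai"
    if model.startswith("cerebras/"):
        return "cerebras"
    return "vllm"                     # default: local GPU inference
```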
Subchapters
- vLLM Controller — GPU process management
- Makespan Scheduling — Multi-workload optimization
- XAI Budget — External API rate limiting
vLLM Controller
The VLLMController manages vLLM inference server processes across 6 NVIDIA GPUs, handling startup, health monitoring, graceful shutdown, and recovery.
Process Management
class VLLMController:
    async def start_endpoint(
        self,
        model: str,              # HuggingFace model ID
        gpu_ids: list[int],      # Allocated GPUs
        port: int,               # Serving port
        tensor_parallel: int = 1,
    ) -> ProcessStatus

    async def stop_endpoint(self, port: int) -> bool
    async def health_check(self, port: int) -> bool
Lifecycle
- Graceful shutdown: SIGTERM first, force kill after timeout
- CUDA memory cleanup: `torch.cuda.empty_cache()` on shutdown
- Orphan detection: Scans for stale vLLM processes on startup
- Circular log buffer: 500 lines for diagnostics
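The SIGTERM-then-kill pattern is standard-library territory; a sketch with an illustrative grace period (the controller's actual timeout may differ):

```python
import signal
import subprocess

def stop_gracefully(proc: subprocess.Popen, timeout: float = 30.0) -> None:
    """SIGTERM first; force kill only if the process outlives the grace period."""
    proc.send_signal(signal.SIGTERM)   # ask nicely first
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()                    # force kill after the grace period
        proc.wait()
```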
GPU Allocation
6 GPUs are allocated across endpoints:
GPU 0-1: reasoning endpoint (tensor_parallel=2)
GPU 2-3: coding endpoint (tensor_parallel=2)
GPU 4: embedding endpoint
GPU 5: available for rendering/evolution
Allocation is managed by the Orchestrator, not the controller directly.
Model Loading
Loading a 70B model to VRAM takes ~240 seconds. During this time:
- The engine streams progress to connected clients
- The endpoint status transitions: `PENDING → STARTING → HEALTHY`
- Health checks begin polling at 30-second intervals
Status Monitoring
# Check all endpoint status
uv run gaius-cli --cmd "/gpu status" --format json
# Watch during restart
for i in $(seq 1 15); do
    sleep 10
    uv run gaius-cli --cmd "/gpu status" --format json | \
        jq '.data.endpoints[] | {name, status}'
done
Common Issues
| Symptom | Guru Code | Fix |
|---|---|---|
| Process won’t start | #EP.00000001.GPUOOM | /health fix endpoints |
| Orphan process | #EN.00004.ORPHAN_PROC | just gpu-cleanup |
| cv2 import error | OpenCV conflict | See MEMORY.md OpenCV section |
Makespan Scheduling
Makespan scheduling optimizes GPU utilization across multi-step workloads that require endpoint transitions (eviction, loading, inference, restoration).
What is a Makespan?
A makespan is the total time from start to finish of a complex workload that may require:
- GPU eviction: Stopping a low-priority endpoint to free GPUs
- Endpoint startup: Loading a different model
- Workload execution: Running the actual inference
- Baseline restoration: Reloading the original endpoint
Example: Render Pipeline
makespan.execute
├── allocate_gpus # OR-Tools resource assignment
├── evict_if_needed # Preemption decisions
├── start_endpoints # vLLM process spawning
│ └── endpoint.start: rendering
│ ├── process_spawn
│ ├── model_load # ~240s for large models
│ └── health_check
├── execute_workload # Actual inference/rendering
└── restore_baseline # Return to set points
AgendaTracker
The AgendaTracker records scheduled endpoint transitions so the Health Observer can distinguish intentional state changes from failures:
tracker.register_operation(
    operation_id=op_id,
    workload_id=wl_id,
    control_mode=ControlMode.POSITIVE,
    target_endpoints=["reasoning", "fast"],
)
Control Modes
| Mode | Purpose |
|---|---|
| `POSITIVE` | Planned operation (start/stop) |
| `FAILURE` | Responding to detected failure |
| `RESTART_RECOVERY` | Restarting after failure resolution |
Tracing
Each makespan is traced as a parent span with child spans for each operation phase. This enables end-to-end visibility into complex multi-step operations, including time spent in external API calls (treated as black-box stages).
XAI Budget
The XAI budget system tracks and limits usage of external AI APIs (xAI Grok, Cerebras) to prevent runaway costs while enabling strategic use for evaluation and critique.
Budget Tracking
budget = scheduler.get_xai_budget()
# budget.daily_remaining — tokens left for today
# budget.daily_limit — configured daily cap
# budget.reset_time — when the budget resets (midnight UTC)
Usage Controls
- Daily token limit: Configured per provider
- Request rejection: When budget exhausted, requests fail with clear error
- Priority gating: Only HIGH and CRITICAL priority jobs can use external APIs
- Evaluation budget: Separate allocation for agent evaluation tasks
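The controls above reduce to two checks per request; a sketch with illustrative attribute and priority names, not the actual scheduler API:

```python
from dataclasses import dataclass

@dataclass
class XAIBudget:
    daily_limit: int
    daily_used: int = 0

    @property
    def daily_remaining(self) -> int:
        return self.daily_limit - self.daily_used

    def admit(self, tokens: int, priority: str) -> bool:
        if priority not in {"HIGH", "CRITICAL"}:
            return False              # priority gating
        if tokens > self.daily_remaining:
            return False              # reject when budget is exhausted
        self.daily_used += tokens
        return True
```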
CLI Commands
# Check current budget
uv run gaius-cli --cmd "/xai budget" --format json
# Reset budget (admin)
uv run gaius-cli --cmd "/xai reset" --format json
# Evaluate with external model
uv run gaius-cli --cmd "/xai evaluate" --format json
When External APIs Are Used
| Use Case | Provider | Purpose |
|---|---|---|
| Agent evaluation | xAI Grok | Independent critique of agent output |
| Cross-validation | Cerebras | Second opinion on critical decisions |
| Held-out evaluation | xAI Grok | Measuring agent improvement |
Visualization
The visualization pipeline generates unique procedural images for collection cards using LuxCore path tracing. Each card’s image is deterministic – derived from the differential geometry and algebraic topology of its embedding neighborhood.
Pipeline
Nomic Embeddings (768-dim)
|
├──> GeometryComputer (Ollivier-Ricci curvature, gradient fields)
└──> TDAComputer (persistent homology via ripser)
|
v
CardVizData (normalized feature vector per card)
|
v
Grammar Engine (CFDG-inspired recursive expansion)
|
v
MeshGen (pure numpy mesh generators)
|
v
LuxCore Renderer (PATHOCL GPU / PATHCPU fallback)
|
v
R2 Storage (viz.gaius.zndx.org)
Mathematical Grounding
Visualizations are not arbitrary aesthetic choices. They are driven by intrinsic geometric properties of the embedding space:
- Ollivier-Ricci curvature controls glass color temperature and petal count. Positive curvature (cluster interior) produces warmer, simpler forms. Negative curvature (semantic boundary) produces cooler, complex structures.
- Persistent homology (H0, H1, H2) controls recursion depth, toroidal rings, and void chambers. Topologically richer collections produce deeper nesting.
- Gradient fields position the key light along the direction of steepest semantic change.
- Complexity (local topological isolation) controls surface subdivision and branching probability.
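As an illustration of this feature-to-parameter mapping, here is a sketch with hypothetical ranges and coefficients (the real curves live in the viz pipeline; `render_params` and its formulas are assumptions, with curvature taken as normalized to [-1, 1]):

```python
def render_params(curvature: float, h1: int, complexity: float) -> dict:
    """Map geometric features to visual parameters (illustrative only).

    Positive curvature (cluster interior) -> warmer, simpler forms;
    negative curvature (semantic boundary) -> cooler, more complex forms.
    """
    warmth = 0.5 + 0.5 * curvature                     # 0 = cool glass, 1 = warm glass
    petals = round(5 + 4 * (1 - curvature) / 2)        # more petals toward boundaries
    rings = min(h1, 3)                                 # H1 generators -> toroidal rings, capped
    subdivisions = 1 + int(3 * complexity)             # isolation -> surface detail
    return {"warmth": warmth, "petals": petals,
            "rings": rings, "subdivisions": subdivisions}

params = render_params(curvature=-0.8, h1=5, complexity=0.7)
# boundary card: cool palette, many petals, ring count capped at 3
```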
Components
The pipeline spans six modules in gaius.viz/:
| Module | Purpose |
|---|---|
| data.py | Feature extraction from embedding geometry |
| grammar.py | Grammar Engine – recursive shape expansion |
| meshgen.py | Pure numpy mesh generators (ico_sphere, petal, torus) |
| luxcore_renderer.py | LuxCore Renderer – scene assembly and rendering |
| renderer.py | Async wrappers, variant management, thread pool |
| storage.py | R2 upload, DB updates, KV sync |
Render Variants
Each card is rendered in two variants:
| Variant | Dimensions | Purpose |
|---|---|---|
| display | 1400x300 | Card header image on site |
| og | 1200x630 | OpenGraph social sharing |
gRPC Integration
Rendering is triggered via the /render CLI command, which invokes the RenderCards streaming RPC on the gRPC engine (port 50051). GPU eviction is coordinated with the vLLM controller:
# Render cards for a collection
uv run gaius-cli --cmd "/render collection-id"
The render workload sets allow_baseline_eviction=True to temporarily free a GPU from vLLM inference. After rendering completes, clear_embeddings() releases the Nomic model (~3GB) from GPU memory.
Halt Conditions
Rendering quality is controlled by time and sample count:
- Production: 60 seconds / 512 samples per pixel
- Curation pipeline: 20 seconds / 128 samples per pixel (faster throughput)
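These caps map onto LuxCore's standard `batch.halttime` and `batch.haltspp` render-config properties. A small helper might build them per profile (a sketch; the profile names and the dict wrapper are assumptions, with the numbers taken from the list above):

```python
def halt_properties(profile: str) -> dict:
    """Halt-condition properties for a render profile (sketch).

    Uses LuxCore's batch.halttime / batch.haltspp config keys.
    """
    time_s, spp = {"production": (60, 512), "curation": (20, 128)}[profile]
    return {
        "batch.halttime": time_s,  # wall-clock seconds before the session halts
        "batch.haltspp": spp,      # samples per pixel before the session halts
    }

assert halt_properties("curation") == {"batch.halttime": 20, "batch.haltspp": 128}
```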
Materials
LuxCore’s spectral rendering produces physically accurate glass caustics and internal reflections. This was the primary motivation for switching from Blender Cycles, which rendered recursive glass nesting as opaque white blobs rather than transparent refraction.
LuxCore Renderer
LuxCore is the unbiased path tracer used for generating card visualizations. It provides GPU-accelerated rendering with physically accurate spectral glass materials that Blender Cycles could not achieve.
Installation
PyPI (CPU-only, production fallback):
uv pip install pyluxcore --no-deps
The --no-deps flag is required to avoid pulling in numpy 2.x, which conflicts with vLLM.
From source (GPU path):
The from-source build lives at thirdparty/src/LuxCore (git submodule). Build with:
./build-thirdparty.sh --component luxcore
Output: thirdparty/installed/LuxCore/pyluxcore/pyluxcore.cpython-312-x86_64-linux-gnu.so
Runtime libraries (OIDN + TBB) are installed to thirdparty/installed/LuxCore/lib/ with RPATH set to $ORIGIN/../lib. CUDA 12.4 at /usr/local/cuda is auto-detected during build.
Render Engines
PATHOCL – GPU-accelerated path tracing on CUDA devices. This is the primary production engine. Hybrid mode combines GPU intersection with 64 native CPU threads. The engine name is PATHOCL, not PATHGPU (which does not exist).
PATHCPU – 64-thread CPU rendering when no CUDA devices are available. Approximately 10x slower than single-GPU PATHOCL for equivalent sample counts.
Device Selection
CUDA devices are selected via a string of 0 and 1 characters (no spaces), where each position maps to an entry in pyluxcore.GetOpenCLDeviceList():
# Device order: 6 OpenCL (indices 0-5) + 6 CUDA (indices 6-11)
# Physical GPU N = cuda_indices[N]
# Select only GPU 2:
device_string = "000000001000" # CUDA index 8 = physical GPU 2
The gpu_id parameter restricts rendering to a single evicted GPU, which is required since all other GPUs are loaded by vLLM.
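Under the device ordering described in the comments above (6 OpenCL entries followed by 6 CUDA entries), the selection string can be built mechanically. A sketch (`device_string` is illustrative, not part of the codebase):

```python
def device_string(gpu_id: int, n_opencl: int = 6, n_cuda: int = 6) -> str:
    """Build a LuxCore device-selection string enabling exactly one physical GPU.

    Assumes the device list layout from the comment above: n_opencl OpenCL
    entries followed by n_cuda CUDA entries, so physical GPU N maps to
    position n_opencl + N.
    """
    bits = ["0"] * (n_opencl + n_cuda)
    bits[n_opencl + gpu_id] = "1"
    return "".join(bits)

assert device_string(2) == "000000001000"  # CUDA index 8 = physical GPU 2
```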
Scene Construction
Camera configuration goes in scene.Parse(), NOT in the config object. This is a common LuxCore pitfall:
scene.Parse(pyluxcore.Properties()
    .Set(pyluxcore.Property("scene.camera.type", "perspective"))
    .Set(pyluxcore.Property("scene.camera.lookat.orig", [0, -5, 2]))
    .Set(pyluxcore.Property("scene.camera.lookat.target", [0, 0, 0.5]))
    .Set(pyluxcore.Property("scene.camera.fieldofview", 40))
)
Light Types
LuxCore supports: point, spot, distant, constantinfinite. There is no area light type – use emissive meshes instead. Light gain values are approximately 100x lower than Blender energy values.
Film Pipeline
After rendering, the film pipeline must be executed with an explicit pipeline index:
session.GetFilm().ExecuteImagePipeline(0) # 0 = pipeline index, required
Polling vs Blocking
Never use WaitForDone() – it blocks indefinitely. Use polling with HasDone() and UpdateStats():
while not session.HasDone():
    session.UpdateStats()
    stats = session.GetStats()
    elapsed = stats.Get("stats.renderengine.time").GetFloat()
    if elapsed > timeout_seconds:
        break
    time.sleep(0.5)
Initialization
pyluxcore.Init() must be called exactly once. The _ensure_luxcore() helper handles this, preferring the from-source build over the PyPI wheel:
import sys
from pathlib import Path

_initialized = False

def _ensure_luxcore():
    """Initialize LuxCore once, preferring the from-source build."""
    global _initialized
    if _initialized:
        return  # Init() must not run twice
    source_path = Path("thirdparty/installed/LuxCore/pyluxcore")
    if source_path.exists():
        sys.path.insert(0, str(source_path))  # from-source build wins over PyPI wheel
    import pyluxcore
    pyluxcore.Init()
    _initialized = True
Grammar Engine
The grammar engine implements a CFDG-inspired recursive expansion system that generates unique 3D scenes from card topology features. It lives in gaius.viz.grammar and produces a flat list of positioned shapes that the LuxCore renderer assembles into scenes.
Design Principles
From Context Free Design Grammars (Horigan, 2004), the engine borrows three key ideas:
- Weighted rule alternatives – at each expansion step, the grammar chooses among productions with probabilities derived from the card’s feature vector. This is what makes different cards produce different structures.
- Recursive expansion with transform accumulation – each production can invoke sub-rules with a child transform (translation, rotation, scale) relative to the parent. Transforms compose multiplicatively, producing self-similar structures at decreasing scales.
- Termination by minimum scale – expansion stops when the accumulated scale drops below MIN_SCALE (0.08) or when the shape budget (MAX_SHAPES = 35) is exhausted.
Deterministic Seeding
Every card produces the same visualization regardless of when or where it is rendered:
seed = int(hashlib.sha256(card_id.encode()).hexdigest(), 16) % (2**32)
rng = random.Random(seed)
Feature-to-Rule Mapping
Card topology features control rule weights and recursion depth:
| Feature | Grammar Effect |
|---|---|
| curvature | Petal count, recurse-vs-stop weight, dome factor |
| persistence | Max depth (3-7), shell nesting weight, spiral count |
| complexity | Branch-vs-grow weight, surface segments |
| boundary | Emission strength, volume density, core radius |
| b1 | Number of toroidal rings (0-3) |
| b2 | Number of void chambers (0-2) |
| diagram | Filament count, scale, and z-position |
| card_index | Phase offset for rotational variety in collection |
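A minimal sketch of how a feature can tilt a weighted rule choice while staying deterministic per card (the weights and rule names here are illustrative; the real productions live in gaius.viz.grammar):

```python
import hashlib
import random

def choose_rule(card_id: str, curvature: float) -> str:
    """Pick 'recurse' or 'stop' with feature-derived weights (sketch).

    Assumes curvature normalized to [-1, 1]: positive curvature shifts
    weight toward 'stop' (simpler forms), negative toward 'recurse'
    (deeper structures), matching the table above.
    """
    # Same deterministic seeding as the grammar engine
    seed = int(hashlib.sha256(card_id.encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    w_recurse = 0.5 - 0.3 * curvature  # in [0.2, 0.8] for curvature in [-1, 1]
    return rng.choices(["recurse", "stop"], weights=[w_recurse, 1.0 - w_recurse])[0]

assert choose_rule("abc123", 0.0) == choose_rule("abc123", 0.0)  # deterministic per card
```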
Shape Primitives
The grammar produces six shape types, all implemented as arbitrary meshes in meshgen.py (not geometric primitives):
- Petals – flower-like disk segments arranged in clusters
- Shells – nested recursive enclosures
- Tori – toroidal glass rings driven by H1 (1-cycles)
- Voids – inverted-normal spheres representing H2 (2-cycles)
- Filaments – thin structures whose scale encodes persistence interval lifetime
- Core – central anchor shape
Arrangement Modes
The root-level grammar selects one of three arrangement modes:
- Cluster – radial arrangement around a center point
- Spiral – logarithmic spiral placement
- Branches – tree-like recursive branching
The arrangement mode is selected probabilistically based on the card’s curvature and complexity features.
Extensibility
Adding a new shape primitive requires three changes:
- A mesh generator function in meshgen.py: (parameters) -> (vertices, faces)
- A shape constant in grammar.py
- A renderer case in luxcore_renderer.py
The grammar and renderer are agnostic to the geometry they receive – any mesh generator that returns numpy vertex and face arrays works.
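A minimal generator following that convention might look like this (a hypothetical tetrahedron primitive for illustration, not one of the six shipped shapes):

```python
import numpy as np

def tetra(scale: float = 1.0):
    """Minimal mesh generator in the (parameters) -> (vertices, faces) convention.

    Returns a regular tetrahedron: 4 vertices, 4 triangular faces.
    """
    vertices = scale * np.array([
        [1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1],
    ], dtype=float)
    faces = np.array([
        [0, 1, 2], [0, 3, 1], [0, 2, 3], [1, 3, 2],
    ], dtype=int)
    return vertices, faces

v, f = tetra(0.5)
assert v.shape == (4, 3) and f.shape == (4, 3)
```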
Future Directions
The grammar is currently expressed as Python functions with hardcoded rule structures. A text-based grammar format (closer to CFDG’s declarative syntax) would allow grammar definitions to be version-controlled and iterated without modifying Python code.
Viz Storage
Rendered card visualizations are stored in Cloudflare R2 and served from a public URL. The storage layer handles upload, database updates, and KV sync for live site pages.
R2 Bucket
| Property | Value |
|---|---|
| Bucket name | gaius-viz |
| Public URL | https://viz.gaius.zndx.org |
Object Key Convention
Rendered images follow a predictable path structure:
viz/cards/{card_id}/{variant}.png
For example:
viz/cards/abc123/display.png
viz/cards/abc123/og.png
Variants
Each card is rendered in two variants:
| Variant | Dimensions | Purpose |
|---|---|---|
| display | 1400x300 | Card header image on the site |
| og | 1200x630 | OpenGraph image for social sharing |
Database Integration
The image_url column in the cards table stores the display variant URL:
https://viz.gaius.zndx.org/viz/cards/{card_id}/display.png
The OG variant URL is derived by path convention – replace display.png with og.png. There is no separate database column for the OG URL.
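The path-convention derivation is a one-liner (sketch; `og_url` is an illustrative helper name):

```python
def og_url(display_url: str) -> str:
    """Derive the OG variant URL from the stored display URL by path convention."""
    assert display_url.endswith("/display.png")
    return display_url[: -len("display.png")] + "og.png"

og_url("https://viz.gaius.zndx.org/viz/cards/abc123/display.png")
# → "https://viz.gaius.zndx.org/viz/cards/abc123/og.png"
```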
Upload Flow
After the LuxCore renderer produces an image, the storage module (gaius.viz.storage) handles:
- R2 upload – uploads both display and OG variants to the bucket
- DB update – sets the image_url column on the card row
- KV sync – updates Cloudflare KV stores used by the live card pages
# Simplified upload path
await upload_to_r2(card_id, display_bytes, "display")
await upload_to_r2(card_id, og_bytes, "og")
await update_card_image_url(card_id, display_url)
await sync_kv(card_id)
CLI Access
# Render cards for a collection
uv run gaius-cli --cmd "/render collection-id"
# The render command handles the full pipeline:
# grammar expansion -> LuxCore render -> R2 upload -> DB update -> KV sync
GPU Eviction
Rendering requires GPU access, but vLLM typically occupies all GPUs. The render workload requests GPU eviction via allow_baseline_eviction=True in the gRPC workload metadata. After rendering completes, clear_embeddings() releases the Nomic embedding model (~3GB) from GPU memory. See Visualization for the full pipeline context.
Bases Feature Store
Bases is an entity-centric feature store backed by Apache Kudu (via PostgreSQL FDW) with a fluent query API, BFO ontology grounding, and query guardrails. It abstracts multiple storage backends behind a unified interface.
Core Concepts
A Base is a named, typed view over features and entities. Bases hide the underlying storage backend (PostgreSQL, Iceberg, Kudu FDW) behind a consistent query interface.
Three base types determine query semantics and backend routing:
| Type | Semantics | Backend |
|---|---|---|
| SNAPSHOT | Latest value per entity | Kudu via FDW (PostgreSQL stub) |
| HISTORICAL | Event-sourced with time-travel | Apache Iceberg |
| REGISTRY | Metadata queries | PostgreSQL |
Fluent Query API
The primary query interface uses Kudu SDK-style method chaining:
from gaius.bases import Base, col, term
results = await (
    Base("events")
    .where(col("age") > 30)
    .where(col("status").isin("active", "pending"))
    .select("name", "email")
    .order_by("created_at", desc=True)
    .limit(100)
    .scan()
)
Ontology-grounded queries resolve BFO terms to column names via the base’s @context:
results = await (
    Base("events")
    .where(term("BFO:material_entity") == "ENT-12345")
    .scan()
)
Time-travel queries on historical bases:
results = await (
    Base("events")
    .as_of("2026-01-01T00:00:00Z")
    .where(col("entity_id") == "user-42")
    .scan()
)
Base Definition (.base YAML)
Bases are defined in YAML files with JSON-LD style semantic grounding:
"@context":
  "@vocab": "https://purl.obolibrary.org/obo/"
  entity_id:
    "@id": "BFO_0000040"
kudu:
  table: "gaius.events"
  primary_key: [entity_id, event_time]
schema:
  - name: entity_id
    type: STRING
  - name: event_time
    type: TIMESTAMP
Query Guardrails
All queries pass through guardrails that enforce resource limits:
| Guardrail | Default | Maximum |
|---|---|---|
| Result limit | 1,000 rows | 10,000 rows |
| Query timeout | 30 seconds | 120 seconds |
| Time range (historical) | 7 days | 90 days |
Historical bases require a time constraint (.as_of() or time column filter). Unbounded historical scans are rejected.
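A sketch of how an enforcer might apply these defaults and caps (illustrative; `enforce` and the dict shapes are assumptions, with the numbers taken from the table above):

```python
DEFAULTS = {"limit": 1_000, "timeout_s": 30, "range_days": 7}
MAXIMUMS = {"limit": 10_000, "timeout_s": 120, "range_days": 90}

def enforce(requested: dict) -> dict:
    """Fill in defaults and clamp each requested value to its maximum (sketch)."""
    out = {}
    for key, default in DEFAULTS.items():
        out[key] = min(requested.get(key, default), MAXIMUMS[key])
    return out

assert enforce({}) == {"limit": 1000, "timeout_s": 30, "range_days": 7}
assert enforce({"limit": 50_000})["limit"] == 10_000  # clamped to the maximum
```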
MCP Tools
| Tool | Operation |
|---|---|
| bases_list | List available bases with metadata |
| bases_query | Execute fluent queries against bases |
| bases_entity_history | Get event-sourced history for an entity |
| bases_health | Check service health |
Architecture
Fluent API (Base/col/term) ──> Parser ──> Compiler (SQLGlot) ──> Executor
                                  |                                 |
                                  v                                 v
                          Guardrail Enforcer             PostgreSQL / Iceberg
The DQL Query Language provides the text-based query syntax parsed by the fluent expression parser.
Guru Meditation Codes
| Code | Meaning |
|---|---|
| #BASES.00000001.NOPOOL | Database pool not configured |
| #BASES.00000002.NOICEBERG | Iceberg catalog unavailable |
| #FLUENT.00000001.BADAST | Invalid query expression |
| #FLUENT.00000002.UNSAFEOP | Unsafe operation attempted |
DQL Query Language
DQL (Domain Query Language) is the text-based query syntax for the Bases feature store. It provides a safe, sandboxed expression language that compiles to SQL via SQLGlot.
Syntax
DQL expressions use a fluent Python-like syntax that is parsed via AST walking (never eval):
Base("events").where(col("age") > 30).limit(10)
Base("users").where(col("status").isin("active", "pending")).select("name", "email")
Base("metrics").where(term("BFO:temporal_region") >= "2026-01-01").order_by("timestamp", desc=True)
Operators
Column References
col("name") creates a column reference for filtering and selection:
col("age") > 30
col("status") == "active"
col("name").like("John%")
col("deleted_at").is_null()
col("role").isin("admin", "editor")
Term References
term("IRI") creates an ontology-grounded reference that resolves to a column via the base’s @context:
term("BFO:material_entity") == "ENT-12345"
term("BFO:temporal_region") >= "2026-01-01"
Comparison Operators
| Operator | DQL | SQL |
|---|---|---|
| Equal | == | = |
| Not equal | != | != |
| Less than | < | < |
| Less or equal | <= | <= |
| Greater than | > | > |
| Greater or equal | >= | >= |
Logical Operators
Predicates can be combined with bitwise operators:
(col("age") > 30) & (col("status") == "active") # AND
(col("role") == "admin") | (col("role") == "editor") # OR
~(col("deleted_at").is_null()) # NOT
Multiple .where() calls are combined with AND.
Methods
| Method | Purpose | Example |
|---|---|---|
| .where(pred) | Add filter predicate | .where(col("x") > 1) |
| .select(*cols) | Select specific columns | .select("name", "email") |
| .order_by(col, desc=) | Sort results | .order_by("created_at", desc=True) |
| .limit(n) | Limit result count | .limit(100) |
| .as_of(ts) | Time-travel (historical) | .as_of("2026-01-01T00:00:00Z") |
| .scan() | Execute query | await query.scan() |
Safety Model
DQL is parsed using Python’s ast module with strict whitelisting. Only allowed names (Base, col, term, True, False, None), methods, and operators are permitted. Any unrecognized AST node triggers a fail-fast error:
#FLUENT.00000001.BADAST - Unsupported AST node
#FLUENT.00000002.UNSAFEOP - Unsafe operation attempted
This prevents arbitrary code execution while supporting expressive queries.
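A minimal version of this whitelist walk, with a reduced node set (illustrative, not the production validator; the exact whitelist is an assumption):

```python
import ast

ALLOWED_NAMES = {"Base", "col", "term"}
ALLOWED_NODES = (
    ast.Expression, ast.Call, ast.Attribute, ast.Name, ast.Constant,
    ast.Compare, ast.Gt, ast.GtE, ast.Lt, ast.LtE, ast.Eq, ast.NotEq,
    ast.BinOp, ast.BitAnd, ast.BitOr, ast.UnaryOp, ast.Invert,
    ast.Load, ast.keyword,
)

def validate(expr: str) -> ast.Expression:
    """Parse (never eval) and fail fast on any non-whitelisted node or name."""
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"#FLUENT.00000001.BADAST - {type(node).__name__}")
        if isinstance(node, ast.Name) and node.id not in ALLOWED_NAMES:
            raise ValueError(f"#FLUENT.00000002.UNSAFEOP - {node.id}")
    return tree

validate('Base("events").where(col("age") > 30).limit(10)')  # passes
# validate('__import__("os")')  # raises ValueError: UNSAFEOP
```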
Compilation
The FluentCompiler translates DQL expressions to PostgreSQL-compatible SQL using SQLGlot:
query = Base("events").where(col("age") > 30).limit(10)
sql = query.to_sql()
# SELECT * FROM events WHERE age > 30 LIMIT 10
Term references are resolved through the base’s @context dictionary, mapping ontology IRIs to physical column names.
MCP Usage
DQL queries are passed as strings to the bases_query MCP tool:
uv run gaius-cli --cmd '/bases query events where(col("age") > 30).limit(10)'
The parser validates the expression before compilation, ensuring that only safe operations reach the database.
RASE Metamodel
RASE (Rapid Agentic Systems Engineering) is a Python-native MBSE metamodel for verifiable agent training. It implements SysML v2-like semantics using Pydantic models, without requiring external MBSE tooling.
Core Principle: RLVR
The reward signal comes from verifiable computation, not human feedback or learned approximations. The verifier is a first-class artifact – specified, reviewed, tested, and versioned alongside the agent it trains.
Four Coupled Models
RASE consists of four tightly coupled models. Changes to one often require updates to others:
| Model | Purpose | Package |
|---|---|---|
| SSM | System State Model – system as typed graph | gaius.rase.domains.nifi |
| OSM | Operational Scenario Model – BDD scenarios | gaius.rase.osm |
| UOM | UI Observation Model – SoM/ToM grounding | gaius.rase.uom |
| VM | Verifier Model – requirements, oracle, rewards | gaius.rase.vm |
The TraceableId spine links artifacts across all four models, enabling full traceability from BDD scenario to training reward.
SysML v2 Alignment
RASE mirrors SysML v2 semantics without requiring external tooling:
| SysML v2 Concept | RASE Implementation |
|---|---|
| requirement def | Requirement, ScenarioRequirement |
| verification def | VerificationCase, APIVerificationCase |
| constraint def | Constraint subclasses (composable via AllOf, AnyOf, Not) |
| action def | StepDef with @given, @when, @then |
| part def | Processor, ProcessorGroup, NiFiInstance |
| Human ID <'scheme:path'> | TraceableId.uri |
Package Structure
src/gaius/rase/
├── core/ # Domain-agnostic: SystemState, Constraint[S], Oracle[S]
├── domains/ # Domain-specific implementations
│ ├── nifi/ # NiFi domain (state, constraints, oracle)
│ └── kb/ # Knowledge Base domain
├── traceability.py # TraceableId, DigitalThread
├── osm/ # Operational Scenario Model (BDD)
├── uom/ # UI Observation Model (SoM/ToM)
└── vm/ # Verifier Model (requirements, oracle, rewards)
Safety-Critical Infrastructure
The verifier is maintained with the same rigor as production code. All constraints are immutable (frozen=True), return structured ConstraintResult objects with rich failure messages, and support declarative composition. See Verification for details on the reward computation pipeline.
Four Coupled Models
The RASE metamodel consists of four tightly coupled models. They form a coherent verification framework where changes to one model often require updates to others.
Coupling Matrix
| If you change… | Also update… |
|---|---|
| SSM (system state) | VM constraints that reference state structure |
| OSM (scenarios) | VM requirements derived from scenarios |
| UOM (marks/traces) | VM verification cases that consume traces |
| VM (verification) | Ensure reward strategies align with constraint semantics |
SSM – System State Model
The SSM represents the system under test as a typed graph. The primary domain is NiFi, modeled as NiFiInstance containing ProcessorGroup, Processor, FlowConnection, and ControllerService nodes.
from gaius.rase.domains.nifi import NiFiInstance, Processor, ProcessorGroup
state = NiFiInstance(
    root_group=ProcessorGroup(id="root", name="NiFi Flow", processors=[
        Processor(id="abc", name="GetFile", type="org.apache.nifi.GetFile"),
    ])
)
SSM constraints are declarative, composable, and immutable. Examples: ProcessorExists, AllProcessorsRunning, NoBackpressure, FlowIsEquivalent. Compose with AllOf, AnyOf, Not.
OSM – Operational Scenario Model
The OSM captures BDD (Behavior-Driven Development) scenarios as executable specifications. Each scenario is a sequence of Given/When/Then steps that map to SysML v2 action definitions.
from gaius.rase.osm import Scenario, StepType, StepUsage
scenario = Scenario(
    name="CreateBasicFlow",
    steps=[
        StepUsage(step_type=StepType.GIVEN, text="NiFi is running"),
        StepUsage(step_type=StepType.WHEN, text="I create a processor group named 'ETL'"),
        StepUsage(step_type=StepType.THEN, text="the group 'ETL' exists"),
    ],
)
Step definitions (StepDef) are reusable patterns with {param} placeholders. The StepRegistry maps step text to executable actions via @given, @when, @then decorators.
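The decorator-plus-placeholder pattern can be sketched as follows (illustrative; the real StepRegistry in gaius.rase.osm differs in detail, and this version handles only one decorator and unescaped literal text):

```python
import re

class StepRegistry:
    """Minimal sketch: map step text with {param} placeholders to callables."""
    def __init__(self):
        self._steps = []

    def when(self, pattern: str):
        # "{name}" placeholders become named regex capture groups
        regex = re.compile("^" + re.sub(r"\{(\w+)\}", r"(?P<\1>.+)", pattern) + "$")
        def register(fn):
            self._steps.append((regex, fn))
            return fn
        return register

    def run(self, text: str):
        for regex, fn in self._steps:
            m = regex.match(text)
            if m:
                return fn(**m.groupdict())
        raise LookupError(f"no step matches: {text!r}")

registry = StepRegistry()

@registry.when("I create a processor group named '{name}'")
def create_group(name: str) -> str:
    return f"created {name}"

assert registry.run("I create a processor group named 'ETL'") == "created ETL"
```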
UOM – UI Observation Model
The UOM provides grounding between language and UI actions using two complementary structures:
- SoM (Set-of-Mark): A ScreenshotWithSoM annotates a screenshot with numbered Mark objects, each with a BoundingBox, a UIRole, and an optional mapping to an SSM element.
- ToM (Trace-of-Mark): A TraceOfMarks records a sequence of ActionFrame entries (click, type, scroll) referencing marks by number, forming the agent’s action trajectory.
from gaius.rase.uom import Mark, BoundingBox, PixelCoord, UIRole
mark = Mark(
    mark_id=1,
    bbox=BoundingBox.from_xywh(100, 200, 50, 30),
    ui_role=UIRole.BUTTON,
    label="Add Processor",
)
The SoM/ToM pattern enables precise UI grounding: agents reference elements by mark number rather than pixel coordinates.
VM – Verifier Model
The VM implements RLVR verification. It connects OSM scenarios to executable verification cases with oracle-based reward computation. See Verification for full details.
Key components:
- Requirements: StepRequirement (atomic, from a BDD step) and ScenarioRequirement (composite, grouping steps with invariants)
- Verification Cases: APIVerificationCase (ground truth via API) and UIVerificationCase (agent UI actions, final state checked via API)
- Oracle: NiFiOracle queries the NiFi REST API for authoritative state verification
- Reward Strategies: BinaryReward (sparse) and GradedReward (partial credit)
Traceability
TraceableId and DigitalThread form the traceability spine linking all RASE artifacts. Every model element carries a URI-based identifier that enables cross-model linking, impact analysis, and full audit trails from requirement to training reward.
TraceableId
A TraceableId mirrors the SysML v2 human ID pattern: <'scheme:path'>. It is immutable (frozen=True) and hashable for use as dict keys and set members.
URI Schemes
| Scheme | Namespace | Example |
|---|---|---|
| bdd | BDD features, scenarios, steps | bdd://features/basic_flows#Scenario:CreateFlow |
| nifi | NiFi processors, groups, connections | nifi://groups/root/processors/abc123 |
| otel | OpenTelemetry spans and events | otel://spans/trace123/span456 |
| metaflow | Metaflow runs, steps, tasks | metaflow://flows/train/runs/42 |
| rase | Internal artifacts (results, threads) | rase://verify/a1b2c3d4e5f6 |
| som | Set-of-Mark UI annotations | som://screenshots/frame42 |
| tom | Trace-of-Mark action sequences | tom://traces/episode7 |
Factory Methods
from gaius.rase import TraceableId
# BDD scenario
tid = TraceableId.from_bdd("basic_flows", scenario="CreateFlow")
# → bdd://features/basic_flows#Scenario:CreateFlow
# NiFi processor
tid = TraceableId.from_nifi("root", processor_id="abc123")
# → nifi://groups/root/processors/abc123
# Auto-generated with UUID
tid = TraceableId.generate(scheme=IdScheme.RASE, prefix="verify")
# → rase://verify/a1b2c3d4e5f6
# Stable BDD step hash (survives line number changes)
tid = TraceableId.from_bdd_step_hash("flow.feature", "I create a group named 'ETL'")
DigitalThread
A DigitalThread captures one complete verification-to-training cycle. It links the full chain:
Requirement --> Verification Case --> Execution Result --> Evidence --> Training Episode
from gaius.rase import DigitalThread
thread = DigitalThread(
    requirement_id=req_id,
    verification_case_id=case_id,
    verification_result_id=result_id,
    api_state_before=before_id,
    api_state_after=after_id,
    reward_outcome=0.85,
)
thread.add_evidence(screenshot_id, "screenshot")
thread.add_evidence(span_id, "span")
TraceabilityGraph
The TraceabilityGraph collects TraceabilityLink objects (directed, typed relationships) and supports queries:
- Forward trace: What derives from this requirement?
- Backward trace: What requirements does this artifact satisfy?
- Impact analysis: What verification cases need re-running if this changes?
Link types follow MBSE semantics: DERIVES, SATISFIES, VERIFIES, ALLOCATES, TRACES, REFINES.
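The forward and backward queries can be sketched as a pair of adjacency lists over directed, typed links (illustrative; the real TraceabilityGraph in traceability.py differs in detail):

```python
from collections import defaultdict

class TraceGraphSketch:
    """Sketch of forward/backward traversal over directed, typed links."""
    def __init__(self):
        self._fwd = defaultdict(list)  # source -> [(link_type, target)]
        self._bwd = defaultdict(list)  # target -> [(link_type, source)]

    def add_link(self, source: str, link_type: str, target: str):
        self._fwd[source].append((link_type, target))
        self._bwd[target].append((link_type, source))

    def forward(self, artifact: str) -> list:
        """What derives from / is produced by this artifact?"""
        return self._fwd[artifact]

    def backward(self, artifact: str) -> list:
        """What does this artifact satisfy or verify?"""
        return self._bwd[artifact]

g = TraceGraphSketch()
# A verification case VERIFIES a BDD scenario requirement
g.add_link("rase://verify/a1b2c3d4e5f6", "VERIFIES",
           "bdd://features/basic_flows#Scenario:CreateFlow")
assert g.backward("bdd://features/basic_flows#Scenario:CreateFlow")[0][0] == "VERIFIES"
```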
Source
All traceability infrastructure lives in src/gaius/rase/traceability.py.
Verification
The Verifier Model (VM) implements RLVR – Reinforcement Learning with Verifiable Reward. The oracle provides ground-truth verification using authoritative API sources, never UI observations. UI traces are the training target, not the oracle.
VerdictKind
Every verification case produces one of four outcomes:
| Verdict | Meaning | Default Reward |
|---|---|---|
PASS | All requirements satisfied | 1.0 |
FAIL | One or more requirements not satisfied | 0.0 (or accuracy for partial credit) |
INCONCLUSIVE | Could not determine (missing data) | 0.5 |
ERROR | Verification itself failed (infrastructure) | 0.0 |
Accuracy
Accuracy is always a float in [0.0, 1.0], representing the proportion of constraints satisfied. It provides the foundation for graded reward strategies.
# Computed inside verification cases:
passed_count = sum(1 for r in constraint_results if r.satisfied)
accuracy = passed_count / len(constraint_results)
Verification Cases
Two types of verification cases exist:
- APIVerificationCase – the RLVR oracle. Checks system state via the NiFi REST API. Evaluates Given (setup), Then (end-state), invariant, and transition constraints.
- UIVerificationCase – verifies agent UI actions. The final state is still checked via API; the trace captures what the agent did to get there.
case = APIVerificationCase(
    id=TraceableId.generate(scheme=IdScheme.RASE, prefix="verify"),
    name="Verify_CreateBasicFlow",
    objective=VerificationObjective(requirement_ids=[scenario.id]),
    scenario_requirement=scenario_req,
)
result = await case.execute(current_nifi_state)
result = await case.execute(current_nifi_state)
Reward Strategies
Reward strategies convert verification results into training signals:
| Strategy | Signal Type | Use Case |
|---|---|---|
| BinaryReward | Sparse (0 or 1) | Clear pass/fail tasks, early training |
| GradedReward | Dense (0.0–1.0 with partial credit) | Multi-step tasks, complex scenarios |
| StepwiseReward | Dense per step | Long sequences where intermediate progress matters |
| TrajectoryShaping | Dense with efficiency | Tasks where path quality matters |
from gaius.rase import GradedReward, compute_reward
strategy = GradedReward(pass_bonus=0.1, fail_penalty=0.0)
reward = compute_reward(result, strategy=strategy)
Oracle
The NiFiOracle provides authoritative verification:
- Agent takes UI actions to modify NiFi
- Oracle queries NiFi REST API to check resulting state
- State is compared against scenario requirements (constraints)
- Reward is computed from the VerificationResult
oracle = NiFiOracle(reward_strategy=GradedReward())
result, reward = await oracle.verify_and_reward(scenario_req, trace=ui_trace)
Advanced oracles include CurriculumOracle (progressive difficulty) and EnsembleOracle (multi-source consensus).
Source
Verification infrastructure lives in src/gaius/rase/vm/ with verification cases in verification.py, requirements in requirements.py, and oracle/reward logic in oracle.py.
Observability
Gaius uses a three-layer observability stack: OpenTelemetry for instrumentation, Prometheus for time-series storage, and Metabase for self-service analytics dashboards.
Architecture
CLI/TUI/MCP --> gRPC --> Engine --> OTel Collector --> Prometheus
                         ^^^^^^                            |
                 metrics exported here                     v
                                                Metabase (dashboards)
The engine is the single source of truth for metric export. All clients (CLI, TUI, MCP) route metrics through the gRPC engine, which exports via OpenTelemetry SDK to the OTel Collector. The collector forwards to Prometheus for scraping.
Components
| Layer | Technology | Purpose |
|---|---|---|
| OpenTelemetry | OTel SDK + Collector | Distributed tracing, metric instrumentation |
| Prometheus | PromQL, time-series DB | Metric storage, alerting, range queries |
| Metabase | SQL analytics platform | Dashboards connected to PostgreSQL |
ObservePanel
The TUI’s ObservePanel displays real-time metrics using declarative MetricDefinition objects. Each definition specifies:
- Source: prometheus (PromQL query) or engine (gRPC proxy)
- Display: sparkline, gauge, counter, or percentage
- Thresholds: warning/critical levels with directional logic (above or below)
Metric categories include inference (latency, throughput, errors), GPU compute (FLOPS utilization), health (active incidents, escalations, FMEA scores), and pipeline operations (cards/day, backlog depth).
Design Philosophy
Metrics use 10-minute windowed rates (Flink-inspired) to survive bursty workloads like ambient reasoning. Sparklines show 5 minutes of history at 15-second resolution. The Fail Open principle applies: unknown states are surfaced for investigation rather than filtered away.
See each sub-chapter for implementation details.
OpenTelemetry
Gaius uses the OpenTelemetry SDK for distributed tracing and metric instrumentation. The engine centralizes all OTel export through EngineMetrics, ensuring a single source of truth for operational telemetry.
Instrumentation
The EngineMetrics singleton (initialized at engine startup) creates OTel instruments:
from gaius.engine.metrics import EngineMetrics
metrics = EngineMetrics.get_instance()
metrics.record_inference(model="reasoning", latency_ms=150, tokens=500)
metrics.record_gpu_memory(gpu_id=0, used_mb=12000, total_mb=24000)
metrics.record_healing_attempt(endpoint="reasoning", tier=0, success=True)
Metric Categories
| Category | Instruments | Type |
|---|---|---|
| Inference | inference_count, inference_latency, inference_tokens | Counter, Histogram |
| GPU | gpu_memory_used, gpu_utilization, gpu_flops_utilization | Gauge (observable callbacks) |
| Endpoints | endpoint_healthy, endpoint_requests | Gauge, Counter |
| Healing | healing_attempts, healing_escalations, incidents_active | Counter, Gauge |
| Pipeline | pipeline_cards_published, pipeline_pending_cards | Counter, Gauge |
| Errors | error_total, exception_caught_total | Counter |
Metric Naming
Metrics follow a double-prefix convention due to OTel Collector namespace configuration:
gaius_gaius_<metric_name>_<unit>
The first gaius_ comes from the OTel Collector namespace config; the second from SDK metric naming (gaius. becomes gaius_ after export). PromQL queries in the OBSERVE_METRICS registry use this full prefix.
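The transformation can be sketched as a small helper (illustrative; the SDK metric name in the example is an assumption):

```python
def full_metric_name(sdk_name: str, unit: str = "") -> str:
    """Expand an SDK metric name to the exported Prometheus name (sketch).

    The SDK's dots become underscores on export, and the OTel Collector
    namespace config prepends its own gaius_ prefix.
    """
    flat = sdk_name.replace(".", "_")       # gaius.inference.latency -> gaius_inference_latency
    prom = f"gaius_{flat}"                  # collector namespace adds the first gaius_
    return f"{prom}_{unit}" if unit else prom

assert full_metric_name("gaius.inference.latency", "milliseconds") == \
    "gaius_gaius_inference_latency_milliseconds"
```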
Export Pipeline
EngineMetrics --> OTel SDK --> OTLP Exporter --> OTel Collector --> Prometheus
The OTel Collector runs as a sidecar, receiving OTLP and forwarding metrics to Prometheus (via remote-write or a scrape endpoint). GPU metrics use observable callbacks that are invoked on each collection cycle.
Makespan Tracing
For long-running operations (evolution cycles, research flows), Gaius uses makespan tracing: a parent span covers the entire operation, with child spans for each phase. This enables latency attribution across multi-step workflows without excessive span cardinality.
Source
Engine metrics: src/gaius/engine/metrics.py. Observability sources: src/gaius/observability/sources/.
Prometheus
Prometheus provides time-series metric storage and PromQL queries for the Gaius observability stack. It scrapes metrics exported by the OTel Collector and serves as the backend for the TUI’s ObservePanel.
PrometheusSource
The PrometheusSource client (src/gaius/observability/sources/prometheus.py) queries the Prometheus HTTP API:
from gaius.observability import PrometheusSource
source = PrometheusSource(base_url="http://localhost:9090")
# Instant query (current value)
value = await source.query_instant(
    'histogram_quantile(0.95, sum by (le) (rate(gaius_gaius_inference_latency_milliseconds_bucket[10m])))'
)
# Range query (sparkline data)
series = await source.query_range(
    'sum(rate(gaius_gaius_inference_count_total[10m])) * 3600',
    duration_seconds=300,  # 5 minutes of history
    step_seconds=15,       # 15-second resolution
)
Custom Metrics
Inference
- gaius_gaius_inference_latency_milliseconds – histogram, p95 via histogram_quantile
- gaius_gaius_inference_count_total – counter, displayed as inferences/hour
- gaius_gaius_inference_tokens_total – counter, displayed as tokens/hour
- gaius_gaius_error_total / gaius_gaius_request_total – error rate percentage
GPU
- gaius_gaius_gpu_flops_utilization_percent – FLOPS-weighted utilization across 6x RTX 4090s using a Welford streaming mean
Health and Self-Healing
- gaius_gaius_incidents_active – gauge of active incidents
- gaius_gaius_healing_escalations_total – counter of ACP escalations per hour
- gaius_gaius_fmea_rpn_score – FMEA Risk Priority Numbers (high RPN > 200)
Pipeline Operations
- gaius_gaius_pipeline_cards_published_total – cards published (daily)
- gaius_gaius_pipeline_pending_cards – backlog gauge
- gaius_gaius_pipeline_task_failure_total – failures by task type (zero tolerance)
- gaius_gaius_exception_caught_total – operational errors (non-LLM)
Windowed Rates
All rate calculations use 10-minute windows to smooth bursty workloads: the wider window keeps metrics hydrated during quiet periods instead of dropping to zero between bursts.
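The effect can be sketched with a toy rate calculation (illustrative only; Prometheus's rate() additionally extrapolates to the window boundaries):

```python
# Illustrative only: why a wider rate window stays "hydrated".
# Counter samples arrive as (timestamp_seconds, cumulative_value) pairs.

def windowed_rate(samples, now, window_seconds):
    """Per-second rate over [now - window, now]: counter delta across the
    samples in the window, divided by the span they cover."""
    in_window = [(t, v) for t, v in samples if now - window_seconds <= t <= now]
    if len(in_window) < 2:
        return 0.0  # too few samples: the rate collapses to zero
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    if t1 == t0:
        return 0.0
    return (v1 - v0) / (t1 - t0)

# A burst of inferences at t=0..60s, then silence until t=400s.
samples = [(0, 0), (30, 50), (60, 100)]
print(windowed_rate(samples, now=400, window_seconds=60))   # -> 0.0 (window empty)
print(windowed_rate(samples, now=400, window_seconds=600))  # still sees the burst
```

With a 1-minute window the burst has scrolled out of view and the rate reads zero; the 10-minute window still covers it and reports a nonzero rate.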
Engine Source
For metrics not available in Prometheus (GPU memory per device, scheduler queue depth, evolution cycles), the EngineSource queries the gRPC engine directly. These return single-point values since the engine does not retain history.
Source
src/gaius/observability/sources/prometheus.py, src/gaius/observability/sources/engine.py, src/gaius/observability/metrics.py.
Metabase
Metabase provides self-service analytics dashboards connected to the Gaius PostgreSQL database (zndx_gaius on port 5444). It queries the meta schema, which contains materialized analytics tables designed for dashboard consumption.
Architecture
PostgreSQL (zndx_gaius)
├── public schema      --> operational tables (cards, agents, health)
├── meta schema        --> analytics views for Metabase
├── collections schema --> curated content for landing page
└── bases schema       --> feature store registry
        |
        v
Metabase (localhost:3000)
        |
        v
Dashboards: lineage, operations, KB geometry
Meta Schema
The meta schema (db/migrations/20251218000001_meta_schema.sql) provides pre-aggregated analytics:
| Table | Purpose |
|---|---|
| meta.dataset_catalog | Deduplicated dataset registry from lineage events |
| meta.job_catalog | Job registry with run counts, success/failure rates |
These tables are populated from OpenLineage events and provide the foundation for data lineage dashboards.
Dashboard Categories
Lineage
- Data provenance graph (which flows produce which datasets)
- Dataset read/write frequency
- Job success rates over time
Operations
- Agent evaluation scores and evolution trends
- GPU utilization over time
- Inference throughput and latency distributions
- Pipeline health (cards published, curation cadence)
KB Geometry
- Document cluster topology
- Embedding space coverage
- Content freshness by domain
Process Management
Metabase runs as a devenv process defined in scripts/processes/metabase.sh. It starts on localhost:3000 and connects to PostgreSQL using the same credentials as the application (gaius:gaius@localhost:5444/zndx_gaius).
Source
Metabase process: scripts/processes/metabase.sh. Meta schema: db/migrations/20251218000001_meta_schema.sql.
Security
Gaius employs a multi-layer security model focused on protecting autonomous operations. Security verification is mandatory and cannot be disabled – this is by design to prevent generated code from bypassing security checks.
Threat Model
The primary attack surface is the ACP (Agent Client Protocol) integration, which allows autonomous health maintenance via GitHub issue workflows. Without controls, an agent could:
- Leak internal state to public repositories
- Be influenced by prompt injection in externally-controlled issues
- Expose credentials in issue comments
- Be tricked by repository visibility changes
Security Layers
| Layer | Check | Purpose |
|---|---|---|
| 0 | Format validation | Reject malformed repository names |
| 1 | HOCON allowlist | Explicit repository patterns only |
| 2 | Visibility verification | Repository must be private (via gh api) |
| 3 | Content sanitization | Redact secrets, strip injection markers |
All four layers execute on every operation. There is no parameter or configuration to skip layers.
Cadence Controls
To prevent runaway automation:
- Maximum 3 GitHub issues per 24 hours
- Minimum 5 minutes between restart attempts
- Maximum 3 restarts per endpoint per hour
- All changes committed to the acp-claude/health-fix branch for human review
Guru Meditation Codes
Security failures use the #ACP.SEC.* code family:
| Code | Description |
|---|---|
| #ACP.SEC.00000002.NOTALLOWED | Repository not in allowlist |
| #ACP.SEC.00000003.NOTPRIVATE | Repository not private |
| #ACP.SEC.00000004.NOTCONFIGURED | No repositories configured |
| #ACP.SEC.00000005.BADFORMAT | Invalid repository format |
See ACP Security Model for implementation details and Content Sanitization for redaction rules.
ACP Security Model
The Agent Client Protocol uses four mandatory security layers for all GitHub operations. Every layer must pass; there is no bypass mechanism.
Layer 0: Format Validation
Repository names are validated against strict regex patterns before any network call:
# Supported formats:
"owner/repo" # Legacy (github.com assumed)
"github.com/owner/repo" # Full URL
"github.example.com/org/repo" # On-prem GitHub Enterprise
Invalid characters, missing components, or malformed URLs raise GitHubSecurityError immediately with #ACP.SEC.00000005.BADFORMAT.
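A hypothetical sketch of the accepted formats as a single regex (the real patterns live in src/gaius/acp/security.py and may differ):

```python
# Hypothetical sketch of Layer 0 format validation; the names and
# patterns here are illustrative, not the actual implementation.
import re

_NAME = r"[A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?"  # owner or repo segment
_HOST = r"[A-Za-z0-9.-]+\.[A-Za-z]{2,}"                # e.g. github.example.com
_REPO_RE = re.compile(rf"^(?:{_HOST}/)?{_NAME}/{_NAME}$")

def validate_repo_format(repo: str) -> bool:
    """Accept 'owner/repo', 'github.com/owner/repo', or
    'github.example.com/org/repo'; reject anything else."""
    return bool(_REPO_RE.fullmatch(repo))

assert validate_repo_format("zndx/gaius-acp")            # legacy form
assert validate_repo_format("github.com/owner/repo")     # full URL form
assert not validate_repo_format("owner//repo")           # empty segment
assert not validate_repo_format("../etc/passwd")         # path traversal
```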
Layer 1: HOCON Allowlist
Repositories must be explicitly listed in ~/.config/gaius/acp.conf:
acp {
github {
allowed_repos = ["zndx/gaius-acp"]
require_private = true
verify_on_each_operation = true
cache_visibility_seconds = 300
}
}
Glob patterns are supported: "zndx/*" allows any repo under the zndx org. An empty allowlist means no repositories are allowed (#ACP.SEC.00000004.NOTCONFIGURED).
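The allowlist semantics can be sketched with stdlib glob matching (illustrative only; the real check lives in src/gaius/acp/security.py):

```python
# Minimal sketch of allowlist matching with glob support.
# fnmatchcase avoids OS-dependent case folding.
from fnmatch import fnmatchcase

def repo_allowed(repo: str, allowed_repos: list[str]) -> bool:
    """An empty allowlist denies everything (#ACP.SEC.00000004.NOTCONFIGURED)."""
    return any(fnmatchcase(repo, pattern) for pattern in allowed_repos)

assert repo_allowed("zndx/gaius-acp", ["zndx/gaius-acp"])  # exact match
assert repo_allowed("zndx/other-repo", ["zndx/*"])         # glob pattern
assert not repo_allowed("attacker/repo", ["zndx/*"])
assert not repo_allowed("zndx/gaius-acp", [])              # empty allowlist: deny
```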
Layer 2: Visibility Verification
The GitHubSecurityGuard verifies repository visibility via gh api repos/{owner}/{repo} --jq .visibility. Only "private" passes; "public" and "internal" are rejected with #ACP.SEC.00000003.NOTPRIVATE.
Visibility is cached for 5 minutes (configurable via cache_visibility_seconds) and re-verified on each operation when verify_on_each_operation = true.
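The TTL behavior can be sketched with a small cache that takes an injectable clock (hypothetical; the class name and shape are illustrative, not the GitHubSecurityGuard internals):

```python
# Sketch of a 5-minute visibility cache with an injectable clock.
import time

class VisibilityCache:
    def __init__(self, ttl_seconds: float = 300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, str]] = {}

    def get(self, repo: str):
        """Return the cached visibility, or None if absent or expired."""
        entry = self._entries.get(repo)
        if entry is None:
            return None
        stamp, visibility = entry
        if self._clock() - stamp > self._ttl:
            del self._entries[repo]  # expired: caller must re-verify via gh api
            return None
        return visibility

    def put(self, repo: str, visibility: str):
        self._entries[repo] = (self._clock(), visibility)

# Fake clock to demonstrate expiry without sleeping.
now = [0.0]
cache = VisibilityCache(ttl_seconds=300, clock=lambda: now[0])
cache.put("zndx/gaius-acp", "private")
assert cache.get("zndx/gaius-acp") == "private"
now[0] = 301.0
assert cache.get("zndx/gaius-acp") is None  # expired after 5 minutes
```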
Layer 3: Content Sanitization
Before including any content in GitHub issues, sanitize_issue_content() redacts secrets and strips prompt injection markers. See Content Sanitization for details.
Issue titles must start with the [HEALTH-FIX] prefix and are limited to 200 characters.
Attack Vectors Mitigated
| Attack | Mitigation |
|---|---|
| Info leak via public repo | Layer 2: visibility verification on every operation |
| Prompt injection from issues | Layer 1: explicit allowlist prevents attacker-controlled repos |
| Credential exposure in issues | Layer 3: automatic secret redaction |
| Visibility change attack | Re-verify on each operation (cache TTL 5 min) |
| Generated code bypass | Security is mandatory – no parameter to disable |
Usage
from gaius.acp.security import GitHubSecurityGuard
guard = GitHubSecurityGuard.from_config()
await guard.verify_repo("zndx/gaius-acp") # Raises on failure
Source
src/gaius/acp/security.py
Content Sanitization
Before any content is included in GitHub issues (via ACP escalation), the sanitize_issue_content() function automatically redacts secrets and strips prompt injection markers.
Secret Patterns
The following patterns are detected and replaced with [REDACTED_*] tags:
| Pattern | Example | Replacement |
|---|---|---|
| Anthropic API keys | sk-ant-api03-... | [REDACTED_ANTHROPIC_KEY] |
| OpenAI keys | sk-proj-..., sk-... | [REDACTED_OPENAI_KEY] |
| GitHub PAT | ghp_... | [REDACTED_GH_PAT] |
| GitHub OAuth | gho_... | [REDACTED_GH_OAUTH] |
| GitHub App | ghs_... | [REDACTED_GH_APP] |
| GitHub Refresh | ghr_... | [REDACTED_GH_REFRESH] |
| AWS Access Key | AKIA... (20 chars) | [REDACTED_AWS_KEY] |
| Bearer tokens | Bearer <token> | Bearer [REDACTED_BEARER] |
| Generic secrets | api_key=, token=, password=, secret= | [REDACTED] |
Pattern order matters: specific patterns (e.g., sk-ant-) are matched before generic ones (e.g., sk-) to ensure correct replacement labels.
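The ordering requirement can be sketched with an ordered pattern list (an illustrative subset; the real pattern set in sanitize_issue_content() is more complete):

```python
# Illustrative subset of ordered secret redaction.
import re

# Order matters: sk-ant- must be tried before the generic sk- pattern,
# or Anthropic keys would get the OpenAI label.
SECRET_PATTERNS = [
    (re.compile(r"sk-ant-[A-Za-z0-9_-]+"), "[REDACTED_ANTHROPIC_KEY]"),
    (re.compile(r"sk-[A-Za-z0-9_-]+"), "[REDACTED_OPENAI_KEY]"),
    (re.compile(r"ghp_[A-Za-z0-9]+"), "[REDACTED_GH_PAT]"),
]

def redact(text: str) -> str:
    for pattern, replacement in SECRET_PATTERNS:
        text = pattern.sub(replacement, text)
    return text

print(redact("key sk-ant-api03-abc123 and token ghp_xyz789"))
# -> "key [REDACTED_ANTHROPIC_KEY] and token [REDACTED_GH_PAT]"
```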
Prompt Injection Markers
The following injection patterns are replaced with [SANITIZED]:
- LLM role markers: <|system|>, <|user|>, <|assistant|>, [INST], <<SYS>>
- Override attempts: IGNORE PREVIOUS INSTRUCTIONS, SYSTEM OVERRIDE:, ADMIN MODE:
- Known bypass patterns: JAILBREAK, DAN MODE, DEVELOPER MODE:
All matching is case-insensitive.
Usage
from gaius.acp.security import sanitize_issue_content
raw = "Error with key sk-ant-api03-abc123... calling endpoint"
safe = sanitize_issue_content(raw)
# "Error with key [REDACTED_ANTHROPIC_KEY] calling endpoint"
Issue Title Validation
Issue titles are validated separately via validate_issue_title():
- Must start with the [HEALTH-FIX] prefix
- Truncated to 200 characters
- Control characters stripped
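A hypothetical sketch of these rules (it is assumed here that a missing prefix is rejected; the real validate_issue_title() in src/gaius/acp/security.py may normalize instead):

```python
# Hypothetical sketch of issue title validation per the rules above.
def validate_issue_title(title: str) -> str:
    # Strip control characters (anything non-printable).
    cleaned = "".join(ch for ch in title if ch.isprintable())
    if not cleaned.startswith("[HEALTH-FIX]"):
        raise ValueError("issue title must start with [HEALTH-FIX]")
    return cleaned[:200]  # hard cap at 200 characters

assert validate_issue_title("[HEALTH-FIX] Restart\x00 endpoint") == "[HEALTH-FIX] Restart endpoint"
assert len(validate_issue_title("[HEALTH-FIX] " + "x" * 500)) == 200
```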
Source
src/gaius/acp/security.py (the sanitize_issue_content and validate_issue_title functions).
Database
Gaius uses PostgreSQL on port 5444 with database name zndx_gaius (not gaius).
Connection
| Parameter | Value |
|---|---|
| Host | localhost |
| Port | 5444 |
| Database | zndx_gaius |
| User | gaius |
| Password | gaius |
| URL | postgres://gaius:gaius@localhost:5444/zndx_gaius?sslmode=disable |
Programmatic Access
Always use the centralized config function – never hardcode connection parameters:
from gaius.core.config import get_database_url
url = get_database_url() # Single source of truth
Delegates exist in storage/database.py, storage/grid_state.py, inference/routing_analytics.py, and storage/profile_ops.py – all call through to gaius.core.config.get_database_url().
CLI Access
PGPASSWORD=gaius psql -h localhost -p 5444 -U gaius -d zndx_gaius
Connection Pooling
The storage/database.py module manages a global asyncpg connection pool (min 1, max 10 connections) via get_pool():
from gaius.storage.database import get_pool
pool = await get_pool()
async with pool.acquire() as conn:
rows = await conn.fetch("SELECT ...")
Schemas
The database uses four schemas to organize data. See Schema Design for details.
| Schema | Purpose |
|---|---|
| public | Core operational tables (cards, agents, evolution, health) |
| meta | Analytics views for Metabase dashboards |
| collections | Curated content for the public landing page |
| bases | Feature store registry and Iceberg catalog |
Extensions
| Extension | Purpose |
|---|---|
| pg_cron | Scheduled maintenance |
| age (Apache AGE) | Graph queries for lineage |
| citext | Case-insensitive text columns |
Migrations
Schema migrations live in db/migrations/ and are ordered by timestamp prefix (e.g., 20251130000001_initial_schema.sql). The full schema dump is at db/schema.sql.
Schema Design
The PostgreSQL database (zndx_gaius) uses four schemas to organize data by domain.
Public Schema
The default public schema holds core operational tables:
Content Pipeline
- feed_sources – RSS/API feed configurations with fetch intervals
- fetch_jobs – Scheduled and completed fetch job records
- content_items – Raw content items with KB path references
- articles – Curated articles with frontmatter (keywords, news queries)
Agent System
- agent_evaluations – Evaluation scores by agent and evaluator (local/xai)
- evolution_cycles – Training cycle records (success, improvement, duration)
- agent_versions – Version history for agent configurations
- held_out_queries – Reserved evaluation queries not used in training
- routing_decisions – Inference routing analytics (fallback/mismatch tracking)
Health and Observability
- health_incidents – HealthObserver incident records with FMEA scores
- healing_events – Self-healing attempt logs (tier, success, duration)
- fmea_catalog – Failure Mode and Effects Analysis registry
- scheduler_jobs – Async job queue for the inference scheduler
State
- grid_state – Persisted 19x19 grid positions and overlays
- cognition_memory – Self-observation and thought chain storage
- research_state / research_progress – Active research thread tracking
Meta Schema
The meta schema provides materialized analytics tables for Metabase:
- meta.dataset_catalog – Deduplicated dataset registry from lineage events
- meta.job_catalog – Job registry with run/success/failure counts
Populated from OpenLineage events for data provenance dashboards.
Collections Schema
The collections schema manages curated content for the public landing page:
- collections.collections – Named collections with featured flags
- collections.collection_cards – Cards assigned to collections with ordering
- collections.card_summaries – Generated card summary text
Bases Schema
The bases schema implements a feature store registry:
- bases.bases – Feature store definitions (type: feature_group, model, dataset)
- bases.base_versions – Versioned snapshots with Iceberg table references
- bases.entity_history – Entity-level change tracking
Graph Extension
Apache AGE (ag_catalog schema) provides graph query capabilities for lineage traversal using Cypher syntax. The lineage graph connects datasets to jobs via read/write edges.
Source
Full schema dump: db/schema.sql. Migrations: db/migrations/.
pg_cron Jobs
Gaius uses the pg_cron extension for scheduled database maintenance. Jobs are defined in SQL migrations and run inside PostgreSQL without external schedulers.
Core Jobs
| Job | Schedule | Purpose |
|---|---|---|
| check-due-fetches | Every 15 min | Check feed_sources for overdue fetches and create fetch_jobs records |
| cleanup-fetch-jobs | Sunday 3 AM | Remove old fetch job records (keep last 100 per source) |
| archive-stale-content | 1st of month, 4 AM | Mark content items older than 90 days as archived |
How It Works
The schedule_due_fetches() function checks each active feed_source against its configured fetch_interval_minutes. When a source is due, it creates a fetch_jobs record with status = 'scheduled'. Python workers poll this table and execute the actual fetch.
-- Example: schedule a fetch for a specific source
SELECT schedule_fetch('arxiv-cs-ai');
-- Check all due sources
SELECT * FROM schedule_due_fetches();
Additional Scheduled Tasks
Beyond the core jobs, several migrations add domain-specific cron schedules:
| Migration | Job | Schedule |
|---|---|---|
| 20251214000001_evolution_periodic_tasks | Evolution cycle triggers | Periodic |
| 20251223000001_theta_consolidation_cron | Theta memory consolidation | Periodic |
| 20251228000002_triage_cron_jobs | Content triage | Periodic |
| 20260202200000_landing_page_cron | Landing page card publishing | Periodic |
| 20260203100000_scheduled_task_notify | NOTIFY on scheduled task changes | Event-driven |
The scheduled_task_notify migration uses PostgreSQL LISTEN/NOTIFY to wake the engine watchdog when tasks are due, avoiding polling overhead.
Monitoring
The v_source_status view provides at-a-glance health for all feed sources:
SELECT name, status, total_items, pending_jobs FROM v_source_status;
Status values: ok, overdue, never (never fetched).
Source
Core jobs: db/migrations/20251130000003_pg_cron_jobs.sql. Additional schedules are spread across domain-specific migrations in db/migrations/.
Getting Started
Gaius is a CLI-first terminal interface for navigating complex, graph-oriented data domains. It renders high-dimensional embeddings and topological structures onto a constrained 19x19 grid, transforming abstract complexity into spatial intuition.
There are three ways to interact with Gaius:
- TUI – a full terminal interface with grid, panels, and keyboard navigation (uv run gaius)
- CLI – a non-interactive command runner for scripting and automation (uv run gaius-cli)
- MCP – 163 tools exposed to Claude Code and other MCP-compatible clients (uv run gaius-mcp)
Quick Path
If you already have devenv and Nix installed, you can be running in under a minute:
cd gaius
devenv shell
uv sync
devenv processes up -d
uv run gaius
This starts the platform services (PostgreSQL, Qdrant, gRPC engine, NiFi) and launches the TUI. You will see a 19x19 grid with a cursor at the center.
What to Read Next
If this is your first time:
- Installation – prerequisites and environment setup
- First Launch – what happens when you start Gaius and what to try first
Once you are comfortable with the basics:
- The TUI – understanding the five interface components
- Navigation – cursor movement, view modes, and workflow patterns
- The CLI – non-interactive commands for scripting
- MCP Integration – connecting Gaius to Claude Code
Three Interfaces, One Engine
All three interfaces communicate with the same gRPC engine on port 50051. A /health command run from the CLI produces the same result as the health_observer_status MCP tool or pressing / and typing health in the TUI. Choose the interface that fits your context: TUI for exploration, CLI for automation, MCP for AI-assisted workflows.
Installation
Gaius uses devenv (built on Nix) for reproducible development environments and uv for Python dependency management.
Prerequisites
| Dependency | Purpose | Install |
|---|---|---|
| Nix | Package manager | nix.dev |
| devenv | Development environment | nix profile install github:cachix/devenv |
| uv | Python package manager | Provided by devenv |
| Just | Task runner | Provided by devenv |
You do not need to install Python, PostgreSQL, or any other runtime dependency manually. Nix provides everything.
Environment Setup
Clone the repository and enter the devenv shell:
git clone <repo-url>
cd gaius
devenv shell
The first devenv shell invocation downloads and caches all Nix dependencies. Subsequent invocations start in under a second.
Inside the shell, install Python dependencies:
uv sync
For optional features, use extras:
uv sync --extra tda # Topological data analysis (giotto-tda)
uv sync --extra swarm # Multi-agent support (langchain)
Starting Platform Services
Gaius depends on several backend services: PostgreSQL, Qdrant, the gRPC engine, and others. Start them all with:
devenv processes up -d
To stop all services:
devenv processes down
To verify everything is running, use the Just task runner:
just --list # Show all available tasks
just restart-clean # Full clean restart if something is stuck
Database
PostgreSQL runs on port 5444 with a database named zndx_gaius:
PGPASSWORD=gaius psql -h localhost -p 5444 -U gaius -d zndx_gaius
The database name is zndx_gaius, not gaius. The connection URL used internally is:
postgres://gaius:gaius@localhost:5444/zndx_gaius?sslmode=disable
Verifying the Installation
Once services are running, confirm the gRPC engine is healthy:
uv run gaius-cli --cmd "/health" --format json
If this returns a JSON health report, the installation is complete. If it fails, try just restart-clean and check the process logs in .devenv/processes.log.
First Launch
This page describes what you will see when you first start Gaius, and what to try immediately.
Starting the TUI
From inside a devenv shell with services running:
uv run gaius
The terminal fills with the Gaius interface. At its center is a 19x19 grid – the MainGrid – with a cursor marker at position K10.
What You See
The default layout has three regions:
- Left panel – a FileTree showing the knowledge base as a directory structure
- Center – the 19x19 MainGrid with three 9x9 MiniGrid projections below it
- Right panel – a ContentPanel that shows context for the current selection
The bottom of the screen has a command bar. The cursor appears as a distinct marker on the grid.
First Steps
Move the cursor. Press h, j, k, l to move left, down, up, right. The cursor moves across the grid. The MiniGrid projections and ContentPanel update to reflect the new position.
Check your bearings. Press ? to display help in the ContentPanel. This shows the available key bindings and a summary of the current state.
Cycle the view. Press v to switch between view modes (Go, Theta, Swarm). Each mode renders the grid data differently.
Cycle overlays. Press o to layer additional information onto the grid: topology, geometry, dynamics, or agent positions.
Toggle panels. Press [ to toggle the left panel, ] to toggle the right panel, or \ to toggle both. Hiding panels maximizes grid space.
Enter a command. Press / to focus the command bar, then type health and press Enter. This runs the health diagnostic and displays system status in the ContentPanel.
First CLI Check
Open a second terminal (also in devenv shell) and try:
uv run gaius-cli --cmd "/health" --format json
This runs the same health check non-interactively and prints JSON output. The CLI and TUI connect to the same engine, so results are identical.
If Something Looks Wrong
If the grid is empty or services are not responding:
just restart-clean
This performs a full clean restart of all platform services. After it completes, relaunch with uv run gaius.
Next Steps
- The TUI – understand the five components of the interface
- Navigation – learn cursor movement and workflow patterns
- Key Bindings – complete keyboard reference
The TUI
Gaius renders a full terminal interface built on the Textual framework. The interface draws inspiration from Bloomberg Terminal (information density), Plan 9’s Acme (everything is a file), and CAD orthographic views (multiple synchronized projections).
Launch the TUI with:
uv run gaius
Five Components
The interface is composed of five primary widgets:
MainGrid
The 19x19 grid occupies the center of the screen. It is the primary workspace – a spatial representation of high-dimensional data projected onto a Go board layout. Grid positions correspond to embedded data points, and the cursor indicates your current focus.
The grid supports three view modes (cycled with v): Go, Theta, and Swarm. Each mode changes how the underlying data is rendered. Four overlay modes (cycled with o) layer additional information on top: topology, geometry, dynamics, and agent positions.
MiniGridPanel
Below the MainGrid sit three 9x9 orthographic projections. These are CAD-style views that show the data from different angles – like top, front, and side views of a 3D object. They update automatically as you move the cursor, providing spatial context around your current position.
FileTree (Left Panel)
The left panel presents a Plan 9-inspired file tree where knowledge base entries, agents, and system state are navigated as a directory structure. Agents appear as files under /agents/, and KB entries are organized by domain. Toggle visibility with [.
ContentPanel (Right Panel)
The right panel displays detailed content for the currently selected item: file contents, agent output, position context, health reports, or command results. It is the primary output area for slash commands. Toggle visibility with ].
CommandInput (Bottom Bar)
The bottom command bar accepts slash commands. Press / to focus it, type a command (e.g., health, evolve status, gpu status), and press Enter. Press Escape to cancel. The command bar supports history navigation with up/down arrows and tab completion.
Layout Flexibility
Toggle panels to adjust the layout to your task:
- Full layout: all panels visible – maximum context
- Grid-focused: press \ to hide both panels – maximum grid space
- Research mode: hide left panel with [ – more room for content output
- Navigation mode: hide right panel with ] – focus on the file tree and grid
Design Principles
The TUI is keyboard-first. Every action is reachable without a mouse. Information density is high by design – the interface shows as much relevant data as possible without requiring navigation to separate screens. Modes and overlays let you shift perspective without losing your place.
Navigation & Modes
Gaius draws inspiration from modal editors like Vim and compositional systems like Plan 9’s Acme. Navigation is keyboard-driven, modes provide context, and every operation is reversible.
Modal Philosophy
Gaius uses modes to provide context-sensitive behavior. This is not complexity – it is power through focus.
- Normal Mode (default): navigate, observe, toggle views
- Command Mode: enter slash commands via the command bar (/)
Cursor Navigation
The cursor is your focus point on the grid. It determines what position commands act upon, the center of local context, and the reference point for the MiniGrid projections.
Basic Movement
    k
    |
h --+-- l
    |
    j
Vim-style navigation: h/j/k/l for left/down/up/right. These keys sit on the home row so your fingers never leave typing position.
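Cursor movement can be sketched as a delta per key, clamped to the grid edges (illustrative only, not the TUI's actual handler; coordinates here are screen-style, with y increasing downward):

```python
# Sketch of hjkl cursor movement with clamping to the 19x19 grid.
MOVES = {"h": (-1, 0), "j": (0, 1), "k": (0, -1), "l": (1, 0)}
GRID_MAX = 18  # 19x19 lattice, indices 0..18

def move_cursor(pos, key):
    """Apply one hjkl keystroke, clipping the result to [0, 18] per axis."""
    dx, dy = MOVES[key]
    x, y = pos
    return (min(max(x + dx, 0), GRID_MAX), min(max(y + dy, 0), GRID_MAX))

cursor = (9, 9)                            # grid center
assert move_cursor(cursor, "l") == (10, 9)
assert move_cursor((0, 0), "h") == (0, 0)  # clamped at the left edge
```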
Tenuki
Press t to jump to the point of highest strategic interest – a concept borrowed from Go, where tenuki means “playing elsewhere.” The engine evaluates all grid positions and moves your cursor to the most strategically relevant one.
View Modes
Press v to cycle through visualization modes:
Go Mode
Traditional Go stones on intersections. Black and white stones mark occupied positions. Empty intersections show as dots.
Theta Mode
Information density visualization named after theta waves, which facilitate memory consolidation. This mode renders allocation intensity and data density across the grid.
Swarm Mode
Agent-centric view showing multi-agent positions and activity across the grid.
Overlay Modes
Press o to cycle overlays. Overlays add visual information on top of the current view mode without changing the base rendering:
| Overlay | Key concept | What it shows |
|---|---|---|
| None | Clean slate | Base grid only |
| Topology | Persistent homology | H0/H1/H2 features (components, loops, voids) |
| Geometry | Curvature | Semantic boundaries vs. interiors |
| Dynamics | Gradient field | Direction of semantic change, divergence |
| Agents | Team state | Agent positions on the grid |
See Overlays for detailed interpretation guidance.
Iso View Modes
Press i to cycle through Iso view modes, which change the interpretation of the MiniGrid projections below the main grid. These provide different mathematical lenses on the same data.
Panel Management
| Key | Action |
|---|---|
| [ | Toggle left panel (FileTree) |
| ] | Toggle right panel (ContentPanel) |
| \ | Toggle both panels simultaneously |
Hide panels to maximize grid visibility. Restore them to review details and navigate the knowledge base.
Graph View
Press g to cycle the center panel between modes. This toggles between the standard grid view and a graph/wiki-link visualization, providing different perspectives on the same underlying data.
Flow Patterns
Exploration Flow
- Navigate with hjkl to survey the grid
- Cycle overlays (o) to see different data layers
- Toggle candidates (c) to see suggested positions
- Press t for tenuki to jump to high-interest points
Analysis Flow
- Press / to enter command mode
- Run /health to check system state
- Use overlays to compare topology, geometry, and dynamics
- Review output in the ContentPanel
Focused Flow
- Hide panels (\) for maximum grid space
- Navigate to a region of interest
- Switch overlays to study different dimensions
- Restore panels when you need detailed context
Panels
Gaius has two side panels flanking the central grid: the FileTree on the left and the ContentPanel on the right. Both can be toggled independently or together.
Toggle Controls
| Key | Action |
|---|---|
| [ | Toggle left panel (FileTree) |
| ] | Toggle right panel (ContentPanel) |
| \ | Toggle both panels simultaneously |
When a panel is hidden, the grid expands to fill the available space.
Left Panel: FileTree
The FileTree presents a Plan 9-inspired hierarchical view of the system. Everything is navigable as if it were a filesystem:
/
agents/
cognition/
evolution/
health/
kb/
current/
projects/
content/
scratch/
state/
Agents are represented as files under /agents/. Knowledge base entries appear under /kb/. System state is exposed under /state/. This design follows the Plan 9 philosophy where everything – processes, data, system state – is accessible through a uniform file interface.
Selecting an entry in the FileTree updates the ContentPanel on the right to show its contents.
Right Panel: ContentPanel
The ContentPanel is the primary output area. It displays:
- File contents – when a FileTree entry is selected
- Command output – results from slash commands (e.g., /health, /gpu status)
- Position context – information about the current grid position
- Help – key binding reference when ? is pressed
- Agent output – responses from agent operations
The ContentPanel renders markdown-formatted text, tables, and structured data. It scrolls vertically for long output.
Layout Strategies
Different tasks benefit from different panel configurations:
Full context (default): both panels visible. Use when you need to navigate the knowledge base and see detailed output simultaneously.
Grid focus: press \ to hide both panels. Use when studying spatial patterns, overlay composition, or doing pure grid exploration.
Research mode: hide the left panel with [. The grid and ContentPanel share the screen, giving more room for command output and detailed content.
Navigation mode: hide the right panel with ]. The FileTree and grid share the screen, useful when browsing the knowledge base structure without needing detailed content.
Panel Persistence
Panel visibility state persists during your session. If you hide a panel and run a command, the panel stays hidden. Toggle it back when you need it.
Overlays & Visualization
Overlays are Gaius’s mechanism for layering multiple data dimensions onto a single grid. Understanding overlay composition is key to effective visual analysis.
Overlay Philosophy
A grid has 361 cells. Naively, that is one data point per cell. But complex domains have many dimensions. Overlays solve this by:
- Layering: multiple data types occupy the same space
- Cycling: focus shifts between layers via the
okey - Compositing: some layers blend (e.g., density + markers)
Available Overlays
Press o to cycle through overlay modes. The current set is based on differential geometry concepts:
None
The cleanest view. Shows only:
- Base grid (view-mode-specific symbols)
- Cursor position
- Candidate markers (a-i) if toggled with c
Use this for uncluttered observation of the base state.
Topology
Displays persistent homology features at three scales:
- H0: connected components – clusters of related data points
- H1: loops – cycles in the embedding space (feedback loops, circular dependencies)
- H2: voids – higher-dimensional cavities (structural gaps)
Topological features that persist across scales are significant. Transient features are noise. The overlay highlights those that survive, revealing the true shape of the data.
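The significance filter can be sketched directly from persistence intervals (the (dimension, birth, death) format and the 0.1 threshold mirror the documented behavior; the function itself is illustrative):

```python
# Sketch of filtering persistence intervals by lifetime.
import math

def significant_features(intervals, threshold=0.1):
    """Keep intervals whose persistence (death - birth) exceeds threshold.
    Infinite intervals (death = inf) always persist."""
    keep = []
    for dim, birth, death in intervals:
        if math.isinf(death) or death - birth > threshold:
            keep.append((dim, birth, death))
    return keep

bars = [
    (0, 0.0, math.inf),  # one component that never dies
    (1, 0.20, 0.45),     # a loop persisting 0.25 -> significant
    (1, 0.30, 0.35),     # short-lived loop -> noise
]
assert significant_features(bars) == [(0, 0.0, math.inf), (1, 0.20, 0.45)]
```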
Geometry
Curvature heatmap showing semantic boundaries versus interiors. High curvature regions mark transitions between conceptual domains. Low curvature indicates the interior of a coherent cluster. This overlay helps identify where one topic ends and another begins.
Dynamics
Gradient vector field showing the direction and magnitude of semantic change. Arrows or indicators point toward regions of increasing density or relevance. Divergence patterns reveal sources (generating new content) and sinks (absorbing attention). This overlay captures how the data landscape is evolving.
Agents
Agent positions projected from embedding space onto the grid. Each active agent occupies a position determined by its current focus within the data. Watch for:
- Clustering: agents in agreement, converging on the same region
- Scattering: genuine uncertainty or broad exploration
- Opposition: agents on opposite sides of the grid (tension, disagreement)
- Isolation: a single agent in a region (unique insight worth investigating)
Reading Composite Views
When multiple features occupy a cell, priority determines display:
- Overlay markers – highest priority
- Candidate letters (a-i)
- Cursor
- Stones/density (view-mode symbols)
- Empty (dot) – lowest priority
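The priority resolution for a single cell can be sketched as a first-match lookup over the ordered layers (the layer names mirror the list above; the function is illustrative, not the renderer's actual code):

```python
# Sketch of per-cell display priority: the first layer present wins.
PRIORITY = ["overlay", "candidate", "cursor", "stone"]

def render_cell(layers: dict) -> str:
    """layers maps layer name -> glyph; return the highest-priority glyph,
    falling back to the empty-cell dot."""
    for layer in PRIORITY:
        if layer in layers:
            return layers[layer]
    return "."  # empty: lowest priority

# Cursor and a stone share a cell: the cursor wins.
assert render_cell({"stone": "o", "cursor": "+"}) == "+"
# An overlay marker outranks everything else.
assert render_cell({"overlay": "^", "cursor": "+", "stone": "o"}) == "^"
assert render_cell({}) == "."
```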
Overlay as Situational Awareness
Each overlay provides a different “sense”:
- None: clean visual baseline
- Topology: structural awareness (what shapes exist)
- Geometry: boundary awareness (where things change)
- Dynamics: momentum awareness (where things are going)
- Agents: team state awareness (where agents are looking)
Cycling overlays is like shifting attention between modalities – a form of augmented situational awareness. The OODA loop pattern (Observe, Orient, Decide, Act) maps naturally: observe with None, orient with Topology or Geometry, decide based on Dynamics, act on Agent positions.
Combining with View Modes
Overlays compose with view modes (v). A Topology overlay on Go mode shows homology features atop stone positions. The same overlay on Theta mode shows features atop density shading. Experiment with combinations to find the perspective that reveals what you need.
Key Bindings
Complete reference for all keyboard shortcuts in the Gaius TUI.
Navigation
| Key | Action | Description |
|---|---|---|
| `h` | Move left | Move cursor one position left |
| `j` | Move down | Move cursor one position down |
| `k` | Move up | Move cursor one position up |
| `l` | Move right | Move cursor one position right |
| `t` | Tenuki | Jump to point of highest strategic interest |
View Controls
| Key | Action | Description |
|---|---|---|
| `v` | Cycle view | Cycle through view modes: Go, Theta, Swarm |
| `o` | Cycle overlay | Cycle through overlays: None, Topology, Geometry, Dynamics, Agents |
| `i` | Cycle iso | Cycle Iso view modes for MiniGrid projections |
| `c` | Toggle candidates | Show/hide candidate markers (a-i) at suggested positions |
Panel Controls
| Key | Action | Description |
|---|---|---|
| `[` | Toggle left | Show/hide the FileTree panel |
| `]` | Toggle right | Show/hide the ContentPanel |
| `\` | Toggle both | Show/hide both panels simultaneously |
Commands and Help
| Key | Action | Description |
|---|---|---|
| `/` | Command mode | Focus the command bar to enter a slash command |
| `?` | Help | Display help and key reference in the ContentPanel |
Graph and Evolution
| Key | Action | Description |
|---|---|---|
| `g` | Graph | Cycle center panel between grid and graph views |
| `e` | Evolution | Show evolution panel directly |
Notes
| Key | Action | Description |
|---|---|---|
| `Ctrl+n` | New note | Create a new Zettelkasten note and focus the editor |
| `Ctrl+z` | Zoom editor | Toggle editor zoom (tmux-style) |
Application
| Key | Action | Description |
|---|---|---|
| `q` | Quit hint | Display quit instructions (use `/q` or `/exit` to actually quit) |
Command Bar Keys
When the command bar is focused (after pressing /):
| Key | Action |
|---|---|
| `Enter` | Execute command |
| `Escape` | Cancel and return to normal mode |
| `Up` | Previous command in history |
| `Down` | Next command in history |
| `Tab` | Auto-complete command |
Design Notes
Key bindings follow Vim conventions for navigation (hjkl) and use mnemonic single keys for mode cycling (v for view, o for overlay, c for candidates). Panel toggles use bracket keys ([, ], \) which are adjacent on a standard keyboard. The / key enters command mode, matching the slash-command convention used by Claude Code and similar tools.
The CLI
Gaius provides a non-interactive command-line interface through gaius-cli. It executes the same slash commands available in the TUI but returns structured output suitable for scripting, piping, and automation.
Basic Usage
uv run gaius-cli --cmd "/command" --format json
The --cmd flag specifies the slash command to run (with or without the leading /). The --format flag controls output format – json produces machine-readable output, while the default produces human-readable text.
Examples
Check system health:
uv run gaius-cli --cmd "/health" --format json
Query GPU and endpoint status:
uv run gaius-cli --cmd "/gpu status" --format json
Check evolution state:
uv run gaius-cli --cmd "/evolve status" --format json
View the current application state:
uv run gaius-cli --cmd "/state" --format json
Available Commands
Gaius has 63 slash commands covering health diagnostics, agent management, inference control, knowledge base operations, evolution, visualization, and observability. The full command reference is in the CLI Commands section.
Common command categories:
| Prefix | Domain | Example |
|---|---|---|
| `/health` | System health | `/health`, `/health fix engine` |
| `/gpu` | GPU/endpoints | `/gpu status`, `/gpu cleanup` |
| `/evolve` | Agent evolution | `/evolve status`, `/evolve trigger` |
| `/kb` | Knowledge base | `/kb search <query>` |
| `/render` | Visualization | `/render cards` |
| `/observe` | Observability | `/observe metrics` |
Connection to the Engine
The CLI connects to the same gRPC engine (port 50051) as the TUI. Both interfaces are thin clients that send commands to the engine and display results. If the engine is not running, the CLI will report a connection error – start services with devenv processes up -d or just restart-clean.
Next Steps
- Command Patterns – JSON output, jq piping, polling techniques
- Scripting – using gaius-cli in shell scripts
Command Patterns
Common patterns for working with gaius-cli effectively. The CLI produces structured JSON output that integrates naturally with standard Unix tools.
JSON Output and jq
Most commands support --format json for machine-readable output. Pipe through jq to extract specific fields:
# Get endpoint names and statuses
uv run gaius-cli --cmd "/gpu status" --format json | jq '.data.endpoints[] | {name, status}'
# Extract just the health categories that are not OK
uv run gaius-cli --cmd "/health" --format json | jq '.data.checks[] | select(.status != "ok")'
# Get the current evolution generation number
uv run gaius-cli --cmd "/evolve status" --format json | jq '.data.generation'
Polling for Status Changes
When waiting for an operation to complete, poll in a loop:
# Watch endpoints transition from STARTING to HEALTHY after a restart
for i in $(seq 1 15); do
sleep 10
uv run gaius-cli --cmd "/gpu status" --format json | \
jq -r '.data.endpoints[] | "\(.name): \(.status)"'
echo "---"
done
Comparing Before and After
Capture state before and after an operation:
# Snapshot before
uv run gaius-cli --cmd "/health" --format json > /tmp/health-before.json
# Run an operation
uv run gaius-cli --cmd "/health fix engine" --format json
# Snapshot after
uv run gaius-cli --cmd "/health" --format json > /tmp/health-after.json
# Diff
diff <(jq -S . /tmp/health-before.json) <(jq -S . /tmp/health-after.json)
Batch Operations
Run multiple commands in sequence:
# Check everything in one pass
for cmd in "/health" "/gpu status" "/evolve status"; do
echo "=== $cmd ==="
uv run gaius-cli --cmd "$cmd" --format json | jq '.data'
echo
done
Conditional Logic
Use jq exit codes to drive decisions:
# Only proceed if all endpoints are healthy
if uv run gaius-cli --cmd "/gpu status" --format json | \
jq -e '.data.endpoints | all(.status == "HEALTHY")' > /dev/null 2>&1; then
echo "All endpoints healthy, proceeding"
uv run gaius-cli --cmd "/evolve trigger" --format json
else
echo "Not all endpoints healthy, aborting"
exit 1
fi
Timestamp and Logging
Add timestamps for log correlation:
uv run gaius-cli --cmd "/health" --format json | \
jq --arg ts "$(date -Iseconds)" '. + {queried_at: $ts}'
Error Handling
The CLI returns non-zero exit codes on failure. Check both the exit code and the response:
if ! output=$(uv run gaius-cli --cmd "/gpu status" --format json 2>&1); then
echo "CLI failed: $output"
exit 1
fi
echo "$output" | jq '.data'
Scripting
The gaius-cli is designed for non-interactive use in shell scripts. It connects to the gRPC engine, executes a command, prints output, and exits. This makes it suitable for cron jobs, monitoring scripts, and automation pipelines.
Health Monitoring Script
A script that checks system health and sends alerts on failures:
#!/usr/bin/env bash
set -euo pipefail
LOG="/var/log/gaius-health.log"
health=$(uv run gaius-cli --cmd "/health" --format json)
failed=$(echo "$health" | jq '[.data.checks[] | select(.status != "ok")] | length')
if [ "$failed" -gt 0 ]; then
echo "$(date -Iseconds) ALERT: $failed health checks failing" >> "$LOG"
echo "$health" | jq '.data.checks[] | select(.status != "ok")' >> "$LOG"
fi
Periodic Data Collection
Capture endpoint metrics at regular intervals for trend analysis:
#!/usr/bin/env bash
set -euo pipefail
OUTDIR="$HOME/gaius-metrics/$(date +%Y-%m-%d)"
mkdir -p "$OUTDIR"
TIMESTAMP=$(date +%H%M%S)
uv run gaius-cli --cmd "/gpu status" --format json > "$OUTDIR/${TIMESTAMP}_gpu.json"
uv run gaius-cli --cmd "/health" --format json > "$OUTDIR/${TIMESTAMP}_health.json"
uv run gaius-cli --cmd "/evolve status" --format json > "$OUTDIR/${TIMESTAMP}_evolve.json"
Run via cron every 5 minutes:
*/5 * * * * cd /path/to/gaius && devenv shell -- bash scripts/collect-metrics.sh
Endpoint Readiness Gate
Wait for all endpoints to be healthy before proceeding with a downstream operation:
#!/usr/bin/env bash
set -euo pipefail
MAX_WAIT=300 # 5 minutes
INTERVAL=10
elapsed=0
echo "Waiting for endpoints to become healthy..."
while [ $elapsed -lt $MAX_WAIT ]; do
if uv run gaius-cli --cmd "/gpu status" --format json | \
jq -e '.data.endpoints | all(.status == "HEALTHY")' > /dev/null 2>&1; then
echo "All endpoints healthy after ${elapsed}s"
exit 0
fi
sleep $INTERVAL
elapsed=$((elapsed + INTERVAL))
done
echo "Timed out waiting for endpoints after ${MAX_WAIT}s"
exit 1
Evolution Report
Generate a summary of the current evolution state:
#!/usr/bin/env bash
set -euo pipefail
echo "=== Gaius Evolution Report $(date -Iseconds) ==="
echo
echo "## Agent Status"
uv run gaius-cli --cmd "/evolve status" --format json | \
jq -r '.data | "Generation: \(.generation)\nActive agents: \(.active_agents)"'
echo
echo "## Endpoint Status"
uv run gaius-cli --cmd "/gpu status" --format json | \
jq -r '.data.endpoints[] | " \(.name): \(.status)"'
echo
echo "## Health Summary"
uv run gaius-cli --cmd "/health" --format json | \
jq -r '.data.checks[] | " \(.name): \(.status)"'
Tips for Robust Scripts
- Always use `set -euo pipefail` at the top of scripts
- Check that the engine is reachable before running a batch of commands
- Use `--format json` consistently so output is parseable
- Capture output to variables when you need to inspect it multiple times
- Log timestamps alongside data for correlation with system events
MCP Integration
Gaius exposes 163 tools via the Model Context Protocol (MCP), making its full functionality available to Claude Code and other MCP-compatible AI clients.
What Is MCP?
The Model Context Protocol is a standard for connecting AI assistants to external tools and data sources. When configured, Claude Code can call Gaius tools directly – checking health, querying the knowledge base, managing agents, and running operations – all within a conversational workflow.
Starting the MCP Server
uv run gaius-mcp
This starts a stdio-based MCP server that communicates with Claude Code over standard input/output. The server connects to the same gRPC engine (port 50051) used by the TUI and CLI.
What You Can Do
With MCP integration, Claude Code can:
- Diagnose issues: query health status, check endpoint state, review incident history
- Manage agents: view evolution status, trigger training, promote agent versions
- Search knowledge: query the knowledge base, perform semantic search, explore lineage
- Run inference: submit prompts to the scheduler, evaluate outputs, manage XAI budget
- Monitor systems: read Prometheus metrics, check Metabase dashboards, view GPU health
- Create content: trigger article curation, render card visualizations, manage collections
Architecture
The MCP server is a thin wrapper over the same services available through the CLI. Each MCP tool maps to an internal command or service call. The server handles serialization (JSON arguments and responses) and error propagation.
Claude Code <--stdio--> gaius-mcp <--gRPC--> Engine (port 50051)
                                  <--HTTP--> Services (Metabase, Prometheus, etc.)
                                  <--SQL---> PostgreSQL (port 5444)
Next Steps
- Claude Code Setup – configure your `.claude.json` to connect
- Tool Categories – browse the 163 tools by domain
Claude Code Setup
This page describes how to configure Claude Code to use the Gaius MCP server, giving Claude Code direct access to all 163 Gaius tools.
Configuration
Add the Gaius MCP server to your Claude Code MCP configuration. The configuration file is typically at ~/.claude.json or in your project’s .claude/ directory.
Add the following to the mcpServers section:
{
"mcpServers": {
"gaius": {
"command": "uv",
"args": ["run", "--directory", "/path/to/gaius", "gaius-mcp"],
"env": {
"GAIUS_ENGINE_HOST": "localhost",
"GAIUS_ENGINE_PORT": "50051"
}
}
}
}
Replace /path/to/gaius with the absolute path to your Gaius repository checkout.
Environment Variables
The MCP server respects these environment variables:
| Variable | Default | Purpose |
|---|---|---|
| `GAIUS_ENGINE_HOST` | `localhost` | gRPC engine hostname |
| `GAIUS_ENGINE_PORT` | `50051` | gRPC engine port |
| `DATABASE_URL` | from config | PostgreSQL connection URL |
In most setups, the defaults work without any environment overrides.
Prerequisites
Before Claude Code can use Gaius tools, the platform services must be running:
cd /path/to/gaius
devenv shell
devenv processes up -d
The MCP server connects to the gRPC engine on startup. If the engine is not running, tool calls will fail with connection errors.
Verifying the Connection
After configuring, ask Claude Code to run a health check:
“Check the Gaius health status”
Claude Code should invoke the health_observer_check tool and return a structured health report. If it reports connection errors, verify that devenv processes up -d has been run.
Tool Discovery
Claude Code can list available tools. The 163 tools are organized into categories such as health, agents, inference, knowledge base, observability, evolution, visualization, and bases. See Tool Categories for the full breakdown.
Security Considerations
The MCP server runs locally and communicates with Claude Code over stdio. It does not expose a network port. All operations are scoped to the local Gaius instance. For ACP (Agent Client Protocol) integration, which involves GitHub operations, additional security controls apply – see the ACP Security Model documentation.
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| “Tool not found” | MCP config not loaded | Restart Claude Code after editing config |
| Connection refused | Engine not running | Run `devenv processes up -d` |
| Timeout on tool calls | Engine overloaded | Check `/gpu status` or run `just restart-clean` |
| Python errors | Dependencies missing | Run `uv sync` in the gaius directory |
Tool Categories
The 163 MCP tools are organized by domain. Each tool maps to an internal service call and accepts JSON arguments.
Health
Tools for system diagnostics, self-healing, and incident management.
| Tool | Purpose |
|---|---|
| `health_observer_status` | Current observer daemon state |
| `health_observer_check` | Run health diagnostic across all categories |
| `health_observer_start` / `stop` | Control the health observer daemon |
| `health_observer_incidents` | List active and recent incidents |
| `health_observer_incident_detail` | Detailed view of a specific incident |
| `fmea_catalog` | Browse failure modes and their RPN scores |
| `fmea_calculate_rpn` | Calculate Risk Priority Number for a failure mode |
| `fmea_get_controls` | Get remediation controls for a failure mode |
Agents and Evolution
Tools for managing agent versions, evolution cycles, and cognition.
| Tool | Purpose |
|---|---|
| `list_agent_versions` | All agent versions with metadata |
| `get_active_config` | Current active agent configuration |
| `get_best_agent_version` | Highest-performing version for an agent |
| `save_agent_version` / `rollback_agent` | Version management |
| `optimize_agent` | Trigger optimization for an agent |
| `evolution_status` | Current generation, evaluation state |
| `trigger_evolution` | Start a new evolution cycle |
| `trigger_task_ideation` | Generate new training tasks |
| `get_capability_gaps` | Identify areas where agents underperform |
Inference and Models
Tools for managing LLM endpoints, inference scheduling, and XAI budget.
| Tool | Purpose |
|---|---|
| `list_models` / `get_model` | Browse available models |
| `gpu_health` | GPU utilization and endpoint status |
| `model_launch_coding` / `model_stop_coding` | Control inference endpoints |
| `model_generate_code` | Generate code using a managed model |
| `model_validate_code` | Validate generated code |
| `get_xai_budget` / `reset_xai_budget` | Manage XAI inference budget |
| `evaluate_with_xai` | Run evaluation using XAI model |
Knowledge Base
Tools for searching, reading, and managing knowledge base content.
| Tool | Purpose |
|---|---|
| `search_kb` | Full-text search across KB entries |
| `read_kb` / `create_kb` / `update_kb` / `delete_kb` | CRUD operations |
| `list_kb` | List entries with filters |
| `kb_sync` | Synchronize KB with external sources |
| `semantic_search` | Vector similarity search |
| `embed_text` / `embed_texts` | Generate embeddings |
Observability
Tools for metrics, monitoring, and system telemetry.
| Tool | Purpose |
|---|---|
| `observe_status` / `observe_metrics` | Observability pipeline state |
| `prometheus_query` / `prometheus_query_range` | Direct PromQL queries |
| `prometheus_health` | Prometheus server status |
| `metabase_status` | Metabase analytics server status |
| `metabase_list_dashboards` / `metabase_get_dashboard` | Browse dashboards |
| `log_activity` / `get_activity_stats` / `get_daily_summary` | Activity tracking |
Visualization
Tools for rendering card visualizations and managing collections.
| Tool | Purpose |
|---|---|
| `collection_status` / `collection_list` / `collection_create` | Manage collections |
| `collection_add_card` / `collection_list_cards` | Card management |
| `collection_publish_cards` / `collection_publish_viz` | Publishing pipeline |
| `collection_generate_summaries` | AI-generated card summaries |
| `article_list` / `article_curate` / `article_new` | Article management |
Bases (Feature Store)
Tools for querying the Bases feature store.
| Tool | Purpose |
|---|---|
| `bases_list` | List available bases |
| `bases_query` | Run DQL queries against a base |
| `bases_entity_history` | Entity change history |
| `bases_health` | Feature store health status |
Cognition and Memory
Tools for agent thinking, memory consolidation, and self-reflection.
| Tool | Purpose |
|---|---|
| `trigger_cognition` | Trigger a cognition cycle |
| `trigger_self_observation` | Agent self-reflection |
| `get_thought_chain` / `get_recent_thoughts` | View agent reasoning |
| `what_are_you_thinking` | Current agent state of mind |
| `theta_sitrep` / `theta_consolidate` | Theta wave memory consolidation |
| `reflect` / `quick_thought` | Lightweight reflection tools |
Workflows
Gaius supports multi-step workflows that combine CLI commands, MCP tools, and TUI interactions. This section documents the most common patterns.
What Is a Workflow?
A workflow is a sequence of operations that achieve a goal larger than any single command. For example, researching a topic involves creating KB entries, curating articles, generating cards, and publishing a collection. Each step uses different Gaius capabilities, and the output of one step feeds the next.
Three Interaction Layers
Workflows can be executed through any combination of the three interfaces:
- TUI: interactive exploration, visual pattern recognition, manual curation
- CLI: scripted operations, batch processing, automated checks
- MCP: AI-assisted orchestration, where Claude Code drives multi-step sequences
The choice depends on the task. Health monitoring is best scripted via CLI. Research curation benefits from MCP-driven AI assistance. Spatial exploration requires the TUI.
Common Workflows
Research Workflow
End-to-end knowledge synthesis: define a topic, curate articles from the web, create cards with enriched metadata, and publish a collection. This is the primary content pipeline.
Health Workflow
System diagnosis and remediation: run health checks, interpret failures, apply self-healing fixes, and monitor recovery. This workflow is critical for keeping the platform operational.
Evolution Workflow
Agent improvement cycle: check evolution status, generate training tasks, trigger evaluation, promote successful agents. This is how Gaius agents get better over time.
Workflow Principles
Self-healing first. When something breaks, try /health fix <service> before manual intervention. The self-healing system learns from each invocation.
Test via CLI. After any code change or operation, verify the result with gaius-cli. Previous outputs are invalidated by changes – always re-run the command.
Fail fast. Gaius surfaces errors immediately with actionable remediation paths. If a step fails, the error message tells you what to do next. There are no silent fallbacks.
Observe, then act. Use the OODA loop: observe system state (/health, /gpu status), orient by comparing overlays, decide on an action, then act. Do not skip the observation step.
Research Workflow
The research workflow takes a topic from initial exploration through to a published collection of enriched cards. This is the primary content pipeline in Gaius.
Overview
Topic definition --> Article curation --> Card creation --> Enrichment --> Publishing
Each step builds on the previous one. The workflow can be driven manually through the CLI, or orchestrated by Claude Code via MCP tools.
Step 1: Define the Topic
Create or select an article definition with keywords and news queries that guide content discovery:
# List existing articles
uv run gaius-cli --cmd "/article list" --format json
# Create a new article with topic keywords
uv run gaius-cli --cmd "/article new" --format json
Articles need `keywords` and/or `news_queries` in their frontmatter for the Brave fetcher to find relevant sources. Without these, curation will fail fast with `#ACF.00000013.NOHINTS`.
Step 2: Curate Articles
Run the article curation flow to fetch and process relevant content:
uv run gaius-cli --cmd "/article curate" --format json
The curation flow:
- Searches the web using configured keywords and news queries
- Fetches and extracts content from discovered URLs
- Evaluates relevance against a selection rubric
- Creates cards from qualifying articles (~20 cards per run, ~2 minutes)
The selection rubric includes a curation_readiness gate that prevents selecting articles whose metadata is incomplete.
Step 3: Enrich Cards
Cards are created with basic metadata. Enrichment adds embeddings, summaries, and topology features:
# Check enrichment status
uv run gaius-cli --cmd "/collection list cards" --format json
# Generate summaries for cards that need them
uv run gaius-cli --cmd "/collection generate summaries" --format json
Card publishing is gated on enrichment completeness – cards without sufficient enrichment cannot be published.
Step 4: Render Visualizations
Each card gets a deterministic visualization rendered by the LuxCore engine:
uv run gaius-cli --cmd "/render cards" --format json
The grammar engine generates a unique visual based on the card’s topology features, seeded by hash(card_id) for deterministic output. Two variants are produced: display (1400x300) and og (1200x630 for social sharing).
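Seeding by a hash of the card ID, as described above, requires a stable hash: Python's built-in `hash()` is salted per interpreter run for strings, so a cryptographic digest is the usual choice. A minimal sketch of the idea, with made-up parameter names (not the actual LuxCore grammar-engine code):

```python
import hashlib
import random

def seed_for_card(card_id: str) -> int:
    """Derive a stable integer seed from a card ID.

    hashlib.sha256 is deterministic across processes, unlike the
    built-in hash(), which is randomized per run for strings.
    """
    digest = hashlib.sha256(card_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def render_params(card_id: str) -> dict:
    """Sample illustrative rendering parameters reproducibly per card."""
    rng = random.Random(seed_for_card(card_id))
    return {
        "hue": rng.uniform(0, 360),      # hypothetical parameter
        "density": rng.uniform(0.1, 0.9),  # hypothetical parameter
    }
```

With this scheme, re-rendering the same card always produces identical parameters, while distinct cards diverge.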
Step 5: Publish Collection
Publish the completed cards to a collection:
# Create or select a collection
uv run gaius-cli --cmd "/collection create" --format json
# Add cards to the collection
uv run gaius-cli --cmd "/collection add card" --format json
# Publish
uv run gaius-cli --cmd "/collection publish cards" --format json
MCP-Driven Research
When using Claude Code with MCP tools, the entire workflow can be conversational:
“Research the topic of topological data analysis in financial risk. Curate articles, enrich the cards, and publish a collection.”
Claude Code will call article_new, article_curate, collection_generate_summaries, and collection_publish_cards in sequence, reporting progress at each step.
Monitoring Collection Balance
The pending_cards metric is the most effective signal for collection diversity. Monitor it to ensure the collection is not over-weighted toward a single source or topic.
Health Workflow
The health workflow covers diagnosing system issues, applying self-healing fixes, and monitoring recovery. Gaius implements a fail-fast policy with actionable error messages, so every failure tells you what to do next.
Step 1: Diagnose
Run the health check to see the current state of all services:
uv run gaius-cli --cmd "/health" --format json
This returns a structured report with checks organized by category. Each check has a status (ok, warn, fail) and a message explaining the current state.
To check a specific category:
uv run gaius-cli --cmd "/health engine" --format json
uv run gaius-cli --cmd "/health endpoints" --format json
Step 2: Interpret Failures
Failed checks include Guru Meditation Codes – unique identifiers for each failure mode. For example:
- `#DS.00000001.SVCNOTINIT` – DatasetService not initialized
- `#NF.00000001.UNREACHABLE` – NiFi not reachable
- `#EP.00000001.GPUOOM` – GPU out of memory
Each code maps to a documented heuristic with symptom, cause, observation method, and solution. The error message itself contains remediation hints.
Step 3: Self-Heal
Always try /health fix before manual intervention. This is a design principle, not a suggestion:
uv run gaius-cli --cmd "/health fix engine" --format json
uv run gaius-cli --cmd "/health fix endpoints" --format json
uv run gaius-cli --cmd "/health fix nifi" --format json
Available fix targets: engine, dataset, nifi, postgres, qdrant, minio, endpoints, evolution.
Each fix strategy is a multi-step remediation sequence with verification at each step. The system attempts increasingly aggressive fixes until the service recovers.
Step 4: Monitor Recovery
After applying a fix, monitor the health observer for recovery:
# Check observer status
uv run gaius-cli --cmd "/health observer status" --format json
# List active incidents
uv run gaius-cli --cmd "/health observer incidents" --format json
# Poll for recovery
for i in $(seq 1 10); do
sleep 15
uv run gaius-cli --cmd "/health" --format json | \
jq '.data.checks[] | select(.status != "ok") | {name, status, message}'
done
Step 5: Escalation
If /health fix does not resolve the issue, the Health Observer can escalate via ACP (Agent Client Protocol) to Claude Code for deeper analysis. This happens automatically when:
- An incident exceeds the configured FMEA RPN threshold
- Local remediation has failed
- The incident is not in cooldown
Manual escalation path – use `just restart-clean` as a last resort:
just restart-clean
This performs a full clean restart of all services: stops everything, cleans up state, and restarts from scratch.
FMEA Framework
The health system uses Failure Mode and Effects Analysis (FMEA) to prioritize issues. Each failure mode has a Risk Priority Number (RPN) computed from severity, occurrence frequency, and detection difficulty. Higher RPNs get attention first.
# View the FMEA catalog
uv run gaius-cli --cmd "/fmea catalog" --format json
# Calculate RPN for a specific failure mode
uv run gaius-cli --cmd "/fmea rpn <mode>" --format json
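The RPN arithmetic follows the standard FMEA convention: the product of three 1-10 ratings. A minimal sketch of that convention (the ratings below are invented examples, not values from the Gaius catalog):

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number: product of the three FMEA ratings.

    Each rating is on a 1-10 scale; detection is rated so that
    harder-to-detect failures score higher. Result range: 1-1000.
    """
    for rating in (severity, occurrence, detection):
        if not 1 <= rating <= 10:
            raise ValueError(f"rating {rating} outside 1-10 scale")
    return severity * occurrence * detection

# Triage failure modes highest-RPN first (example ratings).
modes = [("gpu_oom", rpn(8, 6, 3)), ("nifi_down", rpn(5, 4, 2))]
modes.sort(key=lambda m: m[1], reverse=True)
```

An 8/6/3 failure mode scores 144 and outranks a 5/4/2 mode at 40, which is the ordering the health system uses to decide what gets attention first.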
Health Observer Daemon
The Health Observer runs as a background daemon, continuously monitoring service health and automatically triggering remediation when issues are detected:
# Start the observer
uv run gaius-cli --cmd "/health observer start" --format json
# Stop the observer
uv run gaius-cli --cmd "/health observer stop" --format json
When running, it checks services periodically and logs incidents. Resolved incidents are filtered out of the active list, but unknown or unexpected states remain visible (fail-open for observability).
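That fail-open filtering rule can be sketched in a few lines: hide only incidents positively marked resolved, so unknown or unexpected states stay visible. The status strings here are assumptions for illustration:

```python
def active_incidents(incidents):
    """Filter an incident list fail-open for observability.

    Only incidents explicitly marked "resolved" are hidden; unknown
    or missing statuses remain visible rather than being dropped.
    """
    return [i for i in incidents if i.get("status") != "resolved"]
```

A malformed incident record therefore shows up in the active list instead of silently disappearing.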
Evolution Workflow
The evolution workflow improves Gaius agents over time through task ideation, training, evaluation, and promotion. This is a cycle that repeats as agents accumulate more data and experience.
Overview
Status check --> Task ideation --> Training --> Evaluation --> Promotion
     ^                                                            |
     |____________________________________________________________|
Each cycle produces a new generation of agent versions. Successful versions are promoted to active status; underperformers are retained for comparison but not used in production.
Step 1: Check Status
Before starting an evolution cycle, check the current state:
uv run gaius-cli --cmd "/evolve status" --format json
This shows the current generation number, active agents, evaluation state, and any capability gaps. Pay attention to:
- Generation: which cycle you are on
- Active agents: which agent versions are currently serving
- Capability gaps: areas where agents underperform
Step 2: Task Ideation
Generate new training tasks based on identified capability gaps:
uv run gaius-cli --cmd "/evolve task ideation" --format json
The ideation process analyzes recent performance data and gap analysis to propose tasks that target specific weaknesses. Tasks are designed to push agents toward areas where they currently underperform.
Step 3: Trigger Evolution
Start the evolution cycle. This runs training with the generated tasks and produces new agent versions:
uv run gaius-cli --cmd "/evolve trigger" --format json
Evolution requires healthy inference endpoints. Verify with:
uv run gaius-cli --cmd "/gpu status" --format json | \
jq '.data.endpoints[] | {name, status}'
All endpoints should show HEALTHY before triggering evolution. If they do not, run /health fix endpoints first.
Step 4: Evaluate
After training completes, evaluate the new agent versions against held-out test data:
# Check evaluation results
uv run gaius-cli --cmd "/evolve status" --format json | jq '.data.evaluation'
# View held-out statistics
uv run gaius-cli --cmd "/evolve held-out stats" --format json
Evaluation uses the RASE verification framework. Each agent version is scored on accuracy (0.0-1.0, proportion of constraints satisfied) and compared against previous versions.
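The declarative composition described in the head (AllOf/AnyOf/Not evaluated by a ground-truth oracle) and the "proportion of constraints satisfied" score could look roughly like this. It is a minimal sketch; the class and function names are illustrative, not the actual RASE API:

```python
class Atom:
    """Leaf constraint: a ground-truth predicate over an agent's output."""
    def __init__(self, check):
        self.check = check

    def ok(self, output):
        return bool(self.check(output))


class AllOf:
    """Satisfied only when every child constraint is satisfied."""
    def __init__(self, *parts):
        self.parts = parts

    def ok(self, output):
        return all(p.ok(output) for p in self.parts)


class AnyOf:
    """Satisfied when at least one child constraint is satisfied."""
    def __init__(self, *parts):
        self.parts = parts

    def ok(self, output):
        return any(p.ok(output) for p in self.parts)


class Not:
    """Satisfied when the wrapped constraint is violated."""
    def __init__(self, part):
        self.part = part

    def ok(self, output):
        return not self.part.ok(output)


def accuracy(constraints, output):
    """Score in [0.0, 1.0]: proportion of top-level constraints satisfied."""
    return sum(c.ok(output) for c in constraints) / len(constraints)
```

Because each leaf is a verifiable predicate, the composite score is itself verifiable, which is what makes it usable as a reward signal.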
Step 5: Promote or Roll Back
If the new version outperforms the current active version, promote it:
# View the best version
uv run gaius-cli --cmd "/evolve best" --format json
# Promote (via MCP or direct command)
uv run gaius-cli --cmd "/evolve promote" --format json
If the new version underperforms, roll back to a known good version:
uv run gaius-cli --cmd "/evolve rollback" --format json
Evolution Daemon
For continuous improvement, start the evolution daemon which runs cycles automatically:
# Start the daemon
uv run gaius-cli --cmd "/evolve daemon start" --format json
# Check daemon status
uv run gaius-cli --cmd "/evolve daemon status" --format json
# Stop the daemon
uv run gaius-cli --cmd "/evolve daemon stop" --format json
The daemon monitors capability gaps and triggers evolution cycles when thresholds are exceeded.
Monitoring Evolution Trends
Track improvement over time:
uv run gaius-cli --cmd "/evolve trend" --format json
This shows how agent performance has changed across generations. Look for:
- Upward trend: agents are improving, the evolution cycle is working
- Plateau: training tasks may need diversification, or capability limits have been reached
- Regression: roll back to a previous version and investigate
Model Merging
When multiple specialized agent versions exist, model merging can combine their strengths:
# View merge candidates
uv run gaius-cli --cmd "/evolve merge candidates" --format json
# Trigger a merge
uv run gaius-cli --cmd "/evolve merge" --format json
# View lineage
uv run gaius-cli --cmd "/evolve lineage" --format json
Model lineage tracking records the ancestry of each merged version, enabling traceability from the final model back to its training data and parent versions.
Design Philosophy
Gaius is more than a visualization tool—it’s an experiment in augmented cognition. The design integrates principles from human factors engineering, situational awareness research, and decades of interface evolution to create something genuinely new.
Foundational Principles
1. Spatial Cognition First
Humans evolved to navigate physical space. We have dedicated neural hardware for:
- Allocentric mapping: Understanding space from a fixed reference frame
- Path integration: Tracking position through movement
- Landmark recognition: Identifying significant points
Gaius exploits this by mapping abstract data onto a navigable grid. The cursor becomes your position. Regions become territories. Movement through the grid engages spatial reasoning circuits that spreadsheets leave dormant.
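The projection behind that grid, per the head, quantizes continuous 2D UMAP coordinates to a 19×19 lattice by rounding and clipping to [0, 18]. A minimal sketch of that quantization step, assuming coordinates are already scaled to roughly the lattice range:

```python
def to_lattice(x: float, y: float, size: int = 19) -> tuple:
    """Quantize continuous 2D coordinates to an integer lattice cell.

    Round to the nearest integer, then clip into [0, size - 1] so
    out-of-range points land on the lattice border.
    """
    def clip(v: float) -> int:
        return max(0, min(size - 1, round(v)))

    return clip(x), clip(y)
```

Points that UMAP places outside the lattice are pinned to its edge rather than dropped, so every item keeps a navigable position.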
2. Perceptual Bandwidth
Vision is our highest-bandwidth sense. Reading text: ~250 words/minute. Recognizing a scene: ~100ms. Gaius prioritizes visual pattern recognition over sequential text processing.
When you see agents clustered in a corner with death loops nearby, you perceive the situation instantly—before you could read a report describing it.
3. Modal Efficiency
Modal interfaces concentrate related operations. In normal mode, every key is a navigation or view command—no modifier keys needed. This reduces both physical motion and cognitive load.
Critics of modes cite “mode errors” (typing in wrong mode). Gaius addresses this with:
- Clear mode indicators in status line
- Consistent escape semantics (Esc always returns to normal)
- Mode-appropriate cursor styling (planned)
4. Progressive Complexity
New users see a clean grid. They navigate with hjkl, toggle modes, quit with q. Nothing confusing.
Power users access deeper functionality through slash commands, MCP tools, and CLI scripting. Three interfaces — TUI, CLI, MCP — offer increasing levels of automation.
Complexity is opt-in, not mandatory.
5. Transparency Over Magic
Every visual element has an explanation. The grid shows exactly what it’s told to show. Agent positions derive from actual embeddings through a defined projection. Death loops come from computed homology.
No black boxes. No “AI magic.” Understanding the system enables trusting the system.
Human Factors Integration
Gaius incorporates principles from human factors engineering—the discipline of designing systems that account for human capabilities and limitations.
Cognitive Load Management
Miller’s Law: Working memory holds 7±2 chunks. Gaius manages this by:
- Showing at most 7 agents (one per color)
- Limiting candidate markers to 9 (a-i)
- Using overlays to separate concerns (one layer at a time)
Hick’s Law: Decision time increases with choice count. Modal operation reduces active choices at any moment.
Attention and Distraction
The grid provides a stable anchor. Overlays add information; the base never shifts unexpectedly.
Status updates appear in the designated status line—not as popups or animations that hijack attention.
Error Prevention
- Confirmation for destructive actions: Clear memory, quit with unsaved changes
- Reversible operations: Overlay cycles, mode toggles, cursor movement
- Visible state: Current mode, active features, domain always displayed
Fitts’s Law and Input
Fitts’s Law: Target acquisition time depends on distance and size. Keyboard input eliminates targeting entirely—no mouse movement, no precision required.
hjkl navigation is about as fast as grid movement input can get: four home-row keys, no modifier keys, no targeting.
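Fitts's law in its common Shannon formulation makes the distance/size dependence explicit (T is acquisition time, D the distance to the target, W its width, and a, b empirically fitted constants):

```latex
T = a + b \log_2\!\left(\frac{D}{W} + 1\right)
```

Keyboard input sidesteps the law entirely: with no pointer to move, the D and W terms drop out of the interaction.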
Situational Awareness
Situational awareness (SA) is the perception, comprehension, and projection of system states. Gaius is explicitly designed to support all three levels of SA as defined by Endsley (1995).
Level 1: Perception
What is happening?
Gaius provides immediate perception through:
- Grid state: See where entities are located
- Density shading: See relative magnitudes at a glance
- Agent positions: See where each analytical lens is focused
- Death loops: See topological features visually
No reading required. No scrolling. The state is visible.
Level 2: Comprehension
What does it mean?
Comprehension emerges from:
- Spatial relationships: Clusters = consensus, scatter = uncertainty
- Overlay transitions: Compare views to understand multi-dimensional state
- Color coding: Consistent agent colors build recognition
- Historical context: Memory enables “this is different from before”
Level 3: Projection
What will happen next?
Projection is supported by:
- Swarm dynamics: Watch convergence/divergence trends
- Entropy tracking: Rising entropy may signal regime change
- Death loop evolution: New loops appearing = emerging risk
- Agent trajectories: Where is each analytical perspective moving?
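The entropy signal can be made concrete: one natural choice is the Shannon entropy of the agents' occupancy distribution over lattice cells. A sketch under that assumption (the actual Gaius metric may differ):

```python
# Illustrative sketch (not necessarily the actual Gaius metric): Shannon
# entropy of the agents' occupancy distribution over lattice cells. Tight
# clustering gives low entropy; scatter gives high entropy, so a rising
# value tracks divergence of the swarm.
from collections import Counter
from math import log2

def position_entropy(positions):
    """Entropy in bits of the distribution of agents over grid cells."""
    counts = Counter(positions)
    n = len(positions)
    return sum(-(c / n) * log2(c / n) for c in counts.values())

converged = [(4, 4)] * 7                 # all seven agents on one cell
scattered = [(i, i) for i in range(7)]   # seven distinct cells
print(position_entropy(converged))  # 0.0
print(position_entropy(scattered))  # log2(7), about 2.807
```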
SA Demons (Threats to Awareness)
Endsley identified common SA failures. Gaius defends against them:
| SA Demon | Gaius Defense |
|---|---|
| Attention tunneling | Overlay cycling forces perspective shifts |
| Data overload | Layered disclosure; modes separate concerns |
| Out-of-the-loop | Swarm runs show agent “thinking” in real-time |
| Misplaced salience | Consistent visual vocabulary; no flashy distractions |
| Complexity creep | Feature flags; base UI is minimal |
The OODA Loop
Boyd’s OODA (Observe-Orient-Decide-Act) loop describes competitive decision-making:
- Observe: Grid displays current state
- Orient: Overlays, memory search, agent positions inform context
- Decide: Slash commands, domain changes, focus actions
- Act: Run swarm rounds, mark positions, export insights
Fast OODA loops win. Gaius minimizes latency at every stage.
Design Tensions
Every design involves tradeoffs. Gaius makes explicit choices:
Density vs. Clarity
The grid could show more information (color + shape + size). We prioritize clarity—one symbol per cell, overlays for additional dimensions.
Flexibility vs. Consistency
Custom projections enable domain adaptation. But core navigation (hjkl) never changes. Flexibility in content, consistency in interaction.
Power vs. Accessibility
Modal interfaces have a learning curve. We accept this tradeoff because mastery enables flow states inaccessible to modeless interfaces.
Automation vs. Control
Agents suggest; humans decide. The swarm provides perspectives, not prescriptions. Autonomy remains with the operator.
The Goal: Augmented Cognition
Gaius aims to extend human perception into domains we can’t naturally sense:
- High-dimensional embedding spaces
- Topological structure of point clouds
- Collective reasoning of agent swarms
By projecting these onto a navigable grid with overlays and keyboard-driven interaction, we make the invisible visible—and navigable.
This is augmentation, not replacement. The human remains in control, with enhanced perception of complex systems.
Co-Creation with Code Agents
Gaius represents a novel architectural pattern: an application co-created with AI code agents, where the development process itself shapes the system’s design.
The Co-Creation Paradigm
Traditional software development follows a clear separation: humans design, humans implement, humans document. Gaius challenges this by integrating Claude Code (powered by Claude Opus 4.5) as a first-class development partner.
This isn’t “AI-assisted coding” in the conventional sense. It’s a symbiotic development process where:
- The human provides vision and judgment — strategic direction, quality assessment, architectural taste
- The code agent provides implementation velocity — exploring codebases, generating code, maintaining consistency
- The system evolves through dialogue — features emerge from conversation, not specification documents
Implications for Architecture
When an AI agent is a development partner, certain architectural choices become natural:
Interface Parity: CLI, TUI, and MCP interfaces must provide equivalent functionality. Why? Because the code agent (via MCP) needs access to the same operations the human uses (via TUI). Parity isn’t a nice-to-have; it’s essential for the agent to effectively participate in development and testing.
Living Documentation in the KB: Command references live in the Knowledge Base ([[current/commands/]]), not frozen in mdbook. The command set evolves as the agent and human add features together. Static documentation would be perpetually stale.
Self-Describing Systems: The MCP tools are the API. The CLI commands are the operations. When these are well-named and well-documented, the code agent can discover and use them without additional instruction.
The Knowledge Base as Shared Memory
A key insight: the KB serves as shared context between human and agent across sessions.
What Belongs in the KB vs. mdbook
| KB (build/dev/) | mdbook (docs/) |
|---|---|
| Command reference (evolving) | Design philosophy (stable) |
| Current research threads | Architectural foundations |
| Session notes and decisions | Core concepts |
| Feature-specific documentation | User guides |
| Agent-generated analysis | Contributing guidelines |
The distinction: KB content may change between sessions as features evolve. mdbook content captures enduring principles that guide the evolution.
Example: The Commands Directory
The command reference in [[current/commands/]] was created during a session where we:
- Audited all commands across CLI, TUI, and MCP
- Identified parity gaps
- Documented each interface comprehensively
This documentation now serves multiple purposes:
- For humans: Quick reference, training material
- For code agents: Discovery of available operations
- For development: Gap analysis, parity tracking
If we added the command reference to mdbook, it would be outdated within days. In the KB, it can evolve with the system.
BDD as Collaborative Specification
Behavior-Driven Development (BDD) takes on new significance in co-created systems.
Feature Files as Contracts
Gherkin feature files (features/*.feature) serve as:
- Executable specifications — Tests that verify behavior
- Agent-readable requirements — Clear, structured descriptions the code agent can understand
- Living documentation — Always synchronized with actual behavior
```gherkin
Feature: Wiki Link Resolution
  As a knowledge worker
  I want broken wiki links to resolve via search
  So that the knowledge graph grows organically

  Scenario: Selecting an unresolved wiki link
    Given a file "test.md" containing "[[nonexistent-topic]]"
    When I select the broken link in the graph panel
    Then a search runs for "nonexistent-topic"
    And a new zettelkasten note is created
    And the original link is updated to point to the new note
```
This scenario was implemented in a single session. The code agent:
- Read the feature file to understand requirements
- Implemented the feature across multiple files
- Created tests to verify the behavior
Scenarios as Design Discussions
BDD scenarios often emerge from human-agent dialogue:
Human: “When I click a broken link, instead of an error, can it search and create a note?”
Agent: “So the flow would be: detect missing target → run search → synthesize note → update original link?”
Human: “Yes, and add a backlink from the new note to the origin.”
This conversation becomes a scenario. The scenario becomes a test. The test drives the implementation.
Interface Parity as Architectural Principle
The three interfaces serve different users:
| Interface | Primary User | Interaction Pattern |
|---|---|---|
| TUI | Human (interactive) | Real-time visualization, keyboard navigation |
| CLI | Human (scripting), CI/CD | JSON output, automation |
| MCP | Code agents, integrations | Structured tool calls |
Why Parity Matters
When interfaces drift apart:
- The code agent can’t test what the human experiences
- Automation scripts break when TUI adds features
- Documentation fragments across interfaces
Gaius addresses this through:
- Shared core functions — CLI and TUI call the same underlying methods
- MCP as the comprehensive API — 163 tools covering all operations
- Regular parity audits — Tracking gaps in [[current/commands/index]]
The Parity Matrix
The command coverage matrix explicitly tracks which operations are available where:
| Command | CLI | TUI | MCP | Notes |
|---|---|---|---|---|
| /search | ✓ | ✓ | ✓ | Full parity |
| /model add | ✓ | - | ✓ | TUI gap (priority) |
| /init | - | ✓ | - | TUI-specific (OK) |
This matrix is itself a development artifact that guides prioritization.
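A hypothetical sketch of such an audit over a coverage matrix (the dict mirrors the table rows; names and exemptions here are illustrative):

```python
# Hypothetical parity audit over a coverage matrix like the one above.
# The dict mirrors the table rows; names and exemptions are illustrative.
COVERAGE = {
    "/search":    {"cli": True,  "tui": True,  "mcp": True},
    "/model add": {"cli": True,  "tui": False, "mcp": True},
    "/init":      {"cli": False, "tui": True,  "mcp": False},
}

EXEMPT = {"/init"}  # deliberately interface-specific commands

def parity_gaps(coverage):
    """Commands missing from at least one interface, minus known exemptions."""
    return sorted(
        cmd for cmd, ifaces in coverage.items()
        if not all(ifaces.values()) and cmd not in EXEMPT
    )

print(parity_gaps(COVERAGE))  # ['/model add']
```

Distinguishing exemptions from gaps keeps the audit actionable: only genuine gaps surface for prioritization.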
Practical Patterns
Pattern 1: Agent-Discoverable Operations
Name commands and tools descriptively:
- `scheduler_health_check`, not `shc`
- `/evolve trigger`, not `/evo t`
The code agent reads these names. Clear naming reduces confusion.
Pattern 2: JSON-First CLI
CLI commands return structured JSON by default:
```shell
uv run gaius-cli --cmd "/state" --format json
```
This enables:
- Agent parsing of command output
- Scripted verification of behavior
- Pipeline integration
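From an agent's side, consuming the JSON-first CLI looks like the sketch below. The invocation mirrors the documented command; the shape of the returned JSON is not specified here, so treat the parsed dict's keys as something to inspect, not assume:

```python
# Sketch of agent-side scripting against the JSON-first CLI. The invocation
# mirrors the documented command; the output schema is an assumption to
# inspect at runtime rather than hard-code.
import json
import subprocess

def gaius_argv(cmd: str) -> list:
    """argv for a JSON-mode gaius-cli call."""
    return ["uv", "run", "gaius-cli", "--cmd", cmd, "--format", "json"]

def run_cmd(cmd: str) -> dict:
    """Run a slash command and parse its JSON output."""
    proc = subprocess.run(
        gaius_argv(cmd), capture_output=True, text=True, check=True
    )
    return json.loads(proc.stdout)

# Example (requires a Gaius checkout): state = run_cmd("/state")
```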
Pattern 3: Incremental Documentation
Don’t write comprehensive documentation upfront. Let it emerge:
- Implement feature with agent
- Agent documents as it implements
- Human reviews and refines
- Documentation evolves with feature
Pattern 4: Session Handoff
The KB preserves context across sessions:
- [[scratch/YYYY-MM-DD/]] — Daily working notes
- [[current/commands/]] — Living reference
- Research threads — Ongoing investigations
When a new session starts, the agent can read recent KB entries to resume context.
The Meta-Principle
Systems designed for co-creation with code agents are inherently more maintainable.
Why? Because the requirements for agent collaboration—clear interfaces, structured data, living documentation, testable behavior—are the same requirements for long-term maintainability.
Designing for an AI collaborator forces us to:
- Make implicit knowledge explicit
- Structure operations consistently
- Document as we build
- Test what we document
These are good practices regardless of whether an agent is involved. The agent just makes them essential.
Future Directions
Agent-Initiated Evolution
Currently, the human initiates feature development. Future systems might:
- Have the agent propose features based on usage patterns
- Automatically generate BDD scenarios from user feedback
- Self-document new capabilities as they’re added
Multi-Agent Development
Gaius already uses multi-agent swarms for analysis. The same pattern could apply to development:
- Architect agent proposes structure
- Implementation agent writes code
- Critic agent reviews
- Documentation agent updates KB
Adaptive Interfaces
If the agent tracks which operations are used most, it could:
- Suggest adding frequently-used MCP tools to TUI
- Identify commands that should be automated
- Propose interface simplifications
Conclusion
Gaius isn’t just a tool for augmented cognition—it’s a case study in augmented development. The co-creation paradigm, where human vision and AI implementation velocity combine, produces systems that are:
- More consistent — The agent enforces patterns across the codebase
- Better documented — Documentation emerges from the development dialogue
- More testable — BDD scenarios are natural outputs of requirement discussions
- Easier to maintain — Clear interfaces required for agent collaboration benefit all maintainers
The KB as shared memory, interface parity as principle, and BDD as collaborative specification—these patterns aren’t specific to Gaius. They’re applicable to any system designed for human-AI co-creation.
The future of software development isn’t human OR machine. It’s human AND machine, each contributing their strengths to create systems neither could build alone.
Inspirations
Gaius stands on the shoulders of giants. This section traces the lineage of ideas that inform its design.
The Polymath Tradition
Gaius Plinius Secundus (23-79 CE)
Pliny the Elder’s Naturalis Historia attempted to catalog all knowledge of the natural world across 37 books. He wrote: “Nature is to be found in her entirety nowhere more than in her smallest creatures.”
This spirit—systematic observation, comprehensive scope, attention to detail—animates Gaius. The grid is our attempt at a unified view of complex domains.
The Encyclopedists
Diderot and d’Alembert’s Encyclopédie (1751-1772) organized knowledge with cross-references, creating a navigable web of ideas. Gaius’s scene graph and semantic search continue this tradition.
Modern Polymaths
Herbert Simon (AI, economics, psychology), Douglas Engelbart (augmented intelligence), Seymour Papert (constructionism)—thinkers who crossed disciplines to synthesize new understanding. Gaius is built for their intellectual descendants.
Interface Lineages
Terminal Interfaces
From TTY to VT100 to ANSI terminals to modern terminal emulators, the text interface has evolved continuously. Gaius inherits:
- Character grid: Discrete, addressable positions
- ANSI styling: Colors, bold, background
- Keyboard primacy: No mouse required
- Stream output: Log panels for sequential information
Modal Editors
vi (1976) → vim (1991) → neovim (2014) → modern modal interfaces. Key insights:
- Modes reduce modifier keys: Insert mode types; normal mode commands
- Composability: `d3w` (delete 3 words) combines operation + count + motion
- Muscle memory: Consistent bindings become automatic
Gaius adopts hjkl and plans command composition (/focus Risk | /analyze).
Plan 9 and Acme
Rob Pike’s Acme editor (1994) introduced:
- Mouse chording: Combined mouse buttons for operations
- Text as command: Select text, execute it
- Windowing without decoration: Content maximizes screen real estate
- Unix philosophy at the UI level: Small, composable pieces
Gaius plans Acme-inspired text execution for the log panel.
Professional Interfaces
Bloomberg Terminal
Since 1981, Bloomberg has defined professional data interfaces:
- Information density: Every pixel works
- Keyboard-first: `<GO>` commands, function keys, minimal mouse
- Consistent vocabulary: Familiar patterns across thousands of functions
- Real-time updates: Live data as the base state
Gaius inherits the density and keyboard ethos while modernizing the visual language.
Trading Floors
Before terminals, open outcry trading used:
- Spatial organization: Pits and rings for specific instruments
- Hand signals: High-bandwidth visual communication
- Peripheral awareness: Seeing the whole floor at once
The grid echoes the trading pit—a spatial organization of a complex domain.
Modern Developments
Gödel Terminal
The emerging Gödel Terminal project explores:
- AI-native interfaces: Designed for LLM integration
- Semantic commands: Natural language as primary input
- Dynamic context: Interface adapts to conversation
Gaius draws on this for its slash command system and domain adaptation.
Claude Code
Anthropic’s Claude Code (the agent Gaius is co-created with) pioneered:
- Slash commands: `/help`, `/clear`, `/review`
- Context awareness: Understanding codebase structure
- Conversational flow: Natural language with structured commands
Gaius’s command system directly inherits this pattern.
LLM-Augmented Interfaces
The 2023-2024 wave of LLM tools demonstrated:
- Natural language as interface: Beyond command-line syntax
- Agent architectures: Multiple specialized perspectives
- Embeddings everywhere: Semantic similarity as fundamental operation
Gaius integrates all three.
Visualization Traditions
Information Visualization
Tufte’s principles:
- Data-ink ratio: Maximize information, minimize decoration
- Small multiples: Repeated grids for comparison
- Layering and separation: Overlays instead of clutter
Topological Visualization
Carlsson and others showed that shape matters. TDA visualization typically uses:
- Persistence diagrams: Birth-death scatter plots
- Barcodes: Horizontal bars for feature lifespans
Gaius experiments with projecting these onto the grid—making topology spatial.
Game Interfaces
Go software (KGS, OGS, Sabaki) provides:
- Board representation: The 19×19 standard
- Coordinate systems: A-T, 1-19
- Stone visualization: Contrast, shadows, territory
We inherit the board but repurpose it for data.
Cognitive Science
Embodied Cognition
Lakoff, Johnson, and others argue that thought is grounded in bodily experience. Spatial metaphors (“high status,” “falling behind”) pervade language.
Gaius literalizes these metaphors: positions have meaning, movement has direction, territory can be claimed.
Distributed Cognition
Hutchins showed that cognition extends beyond the skull—tools, environments, and other people participate in thinking.
Gaius + human + agent swarm form a cognitive system. The grid is external memory; agents are external perspectives; topology is external pattern detection.
Ecological Psychology
Gibson’s affordances: the environment offers action possibilities. A grid affords navigation. Overlays afford comparison. Commands afford precision.
Design is the creation of useful affordances.
Synthesis
Gaius attempts to synthesize:
| Tradition | Contribution |
|---|---|
| Polymath encyclopedism | Comprehensive scope, cross-reference |
| Terminal interfaces | Text grid, keyboard, streaming |
| Modal editors | hjkl, modes, composition |
| Plan 9 / Acme | Text as command, minimal chrome |
| Bloomberg | Density, professionalism, real-time |
| Gödel / Claude Code | AI-native, slash commands |
| Visualization | Tufte principles, TDA projection |
| Cognitive science | Spatial cognition, distributed thinking |
The result is something new—an interface paradigm for augmented cognition in complex domains.
Bloomberg Terminal
The Bloomberg Terminal, launched in 1981, remains the gold standard for professional financial interfaces. With over 300,000 subscribers paying ~$24,000/year, it demonstrates that density and keyboard-first design can command premium value.
What Bloomberg Gets Right
Information Density
A Bloomberg screen contains more data per pixel than almost any other interface. Multiple panels display:
- Real-time quotes
- News headlines
- Chart overlays
- Analytics
- Communication
Nothing is wasted. Every region serves a purpose.
Keyboard Supremacy
Bloomberg operators type commands like AAPL <EQUITY> GO to navigate. Function keys, abbreviations, and muscle memory enable speeds impossible with mouse navigation.
The terminal was designed for traders who can’t afford to look away from the market to find a menu item.
Consistent Mental Model
Despite thousands of functions, Bloomberg maintains consistency:
- `<GO>` executes
- `<MENU>` shows options
- Yellow keys are market sectors
- Green keys are actions
Learn the pattern once, apply it everywhere.
Real-Time as Default
Bloomberg screens update continuously. You don’t refresh; you watch. The terminal shows the world as it happens.
What Gaius Inherits
Density Without Clutter
The 19×19 grid provides 361 data points. Overlays add dimensions. But each view is coherent—one mode, one overlay, one interpretation.
Bloomberg achieves density through multiple panels. Gaius achieves it through layers on a unified surface.
Keyboard-First
hjkl navigation. Slash commands. No required mouse. Power users should never reach for the trackpad.
Bloomberg charges premium prices for keyboard efficiency. Gaius provides it freely.
Consistency
Overlay cycling always uses o. Mode toggle always uses v. Quit is always q. The vocabulary is small and stable.
Live Updates
Swarm rounds update the grid in real-time. Agent positions shift as analysis proceeds. The view is alive.
Where Gaius Differs
Visual Language
Bloomberg uses dense text, tables, and traditional charts. Gaius uses a spatial grid with symbolic markers.
The grid enables pattern recognition that tables don’t. A cluster is visible instantly; a column of numbers requires scanning.
AI Integration
Bloomberg has added AI features incrementally. Gaius is AI-native—agents are foundational, not bolted on.
Openness
Bloomberg is proprietary and expensive. Gaius is open and free. The design philosophy is available for inspection and critique.
Domain Agnosticism
Bloomberg serves finance. Gaius adapts to any domain via the --domain flag. Pension analysis today, supply chain tomorrow, cybersecurity next week.
Lessons for Gaius
-
Respect expertise: Bloomberg doesn’t dumb down for casual users. Gaius shouldn’t either.
-
Invest in consistency: Bloomberg’s decades-old commands still work. Gaius should avoid gratuitous changes.
-
Optimize for flow: Bloomberg operators enter flow states. Gaius should enable the same.
-
Density is a feature: Information-rich displays serve experts. Don’t dilute for aesthetics.
-
Keyboard speed matters: Milliseconds add up over thousands of operations.
The Bloomberg Bar (Status Line)
Bloomberg’s status area shows:
- Current function
- User identity
- Connection status
- Contextual hints
Gaius’s status line serves the same purpose:
Ready | TDA on | Swarm (pension) | hjkl=move o=overlay
Both provide constant orientation without demanding attention.
Beyond Bloomberg
Bloomberg optimized for 1980s constraints: text terminals, limited bandwidth, human-only analysis.
Gaius operates in a different era:
- Unicode enables rich symbolism beyond ASCII
- Embeddings enable semantic operations
- Agents provide parallel analysis
- Topology reveals hidden structure
We inherit Bloomberg’s keyboard efficiency while transcending its visual limitations.
Gödel Terminal
The Gödel Terminal represents an emerging paradigm for AI-native interfaces. While still evolving, it offers design principles that Gaius incorporates.
The AI-Native Interface
Traditional interfaces were designed for direct manipulation: click buttons, fill forms, navigate menus. The user explicitly specifies every action.
AI-native interfaces shift this paradigm:
- Intent over action: Express what you want, not how to do it
- Semantic understanding: The interface comprehends context
- Adaptive response: Behavior adjusts to situation
- Conversational flow: Dialogue as primary interaction
Gödel’s Key Ideas
Semantic Commands
Instead of hierarchical menus, semantic commands express intent:
/analyze the risk concentration in the northeast quadrant
The system interprets “northeast quadrant,” understands “risk concentration,” and executes appropriately.
Context Windows
Gödel maintains rich context:
- Current state (what’s displayed)
- History (what was discussed)
- User patterns (typical workflows)
- Domain knowledge (relevant concepts)
Commands are interpreted within this context, reducing verbosity.
Dynamic Layouts
The interface reorganizes based on task:
- Analysis mode: Maximize grid, minimize chrome
- Research mode: Split with documentation
- Comparison mode: Side-by-side views
Agent Integration
Agents aren’t tools invoked occasionally—they’re persistent presences:
- Always available for queries
- Proactively surface insights
- Learn from interaction patterns
What Gaius Inherits
Slash Commands
Gaius’s /command syntax follows Gödel’s semantic approach:
```
/domain "supply chain"
/ask "What are the top risks?"
/focus Risk
```
These read as intent expressions, not procedure calls.
Domain Adaptation
The --domain flag and /domain command enable semantic rewiring:
/domain "cybersecurity incident response"
All agents, embeddings, and analyses reorient to the new domain.
Contextual Awareness
Future Gaius versions will maintain:
- Session history across restarts
- User preference learning
- Domain-specific vocabularies
- Personalized agent tuning
Proactive Insight (Planned)
Agents could surface observations unprompted:
[Risk] Entropy spike detected. New death loop forming near D4.
The interface becomes an active collaborator, not a passive tool.
Where Gaius Extends Gödel
Spatial Grounding
Gödel uses conventional screen layouts. Gaius adds a spatial metaphor:
- Positions have meaning
- Navigation has direction
- Territory can be claimed
This grounds abstract AI operations in spatial intuition.
Topological Awareness
Gödel focuses on semantic understanding. Gaius adds structural understanding via TDA:
- Shape of data
- Persistent features
- Emergence and dissolution
Visualization Priority
Gödel emphasizes text and conversation. Gaius emphasizes visual pattern:
- Grid as primary display
- Text as secondary (log panel)
- Overlays as visual analysis
Keyboard Efficiency
Gödel often implies mouse/touch interaction. Gaius prioritizes keyboard:
- `hjkl` navigation
- Single-key mode toggles
- Command completion
Design Tensions
Automation vs. Control
Gödel tends toward autonomous agents. Gaius keeps humans in the loop:
- Agents suggest, don’t act
- Swarm rounds are explicit (`s`)
- Domain changes are deliberate
Fluidity vs. Stability
Gödel’s dynamic layouts can disorient. Gaius’s grid is stable:
- 19×19 never changes
- Overlays add, don’t rearrange
- Status line always present
Natural Language vs. Structure
Gödel embraces free-form input. Gaius balances:
- Slash commands for precision
- Query commands for natural language
- Keyboard bindings for speed
The Synthesis
Gaius combines:
- Gödel’s semantic awareness
- Its own spatial grounding
- Bloomberg’s keyboard efficiency
- TDA’s structural insight
The result is an AI-native interface that remains tangible—where complex analysis projects onto a navigable grid.
Future Convergence
As AI-native interfaces mature, we expect:
- More spatial metaphors (not just Gaius)
- Better keyboard integration
- Richer visualization
- Deeper agent collaboration
Gaius is an early experiment in this convergence.
Plan 9 & Acme
Plan 9 from Bell Labs (1992) was Ken Thompson and Rob Pike’s attempt to push Unix ideas to their logical conclusion. Its text editor, Acme (1994), remains one of the most influential programmer tools ever created.
Plan 9 Philosophy
Everything is a File
Unix had “everything is a file” as aspiration. Plan 9 achieved it:
- Network connections: files
- Processes: files
- Graphics: files
- Input devices: files
This uniformity enables composition. Any tool that reads files can process any system resource.
Distributed by Design
Plan 9 assumed network operation. Local and remote resources accessed identically. Your terminal could seamlessly use CPU from across the network.
Simplicity Through Completion
Rather than adding features, Plan 9 removed special cases. The result is smaller but more general.
Acme: A Different Editor
Acme is startling to modern users:
- No syntax highlighting
- No configuration files
- No plugins
- No key bindings (almost)
And yet, Acme users are among the most productive programmers.
Mouse Chording
Acme uses three-button mouse chording:
- Left: Select text
- Middle: Execute selected text as command
- Right: Search/open selected text
Any text can become a command. Type make, select it, middle-click. The boundary between text and action dissolves.
Tags as Command Lines
Each window has a “tag” line containing text. That text is executable:
/home/user/project Del Snarf Get | fmt | Look
Click on Del to delete the window. Click on fmt to reformat. The tag is a command palette you can edit.
No Modes
Acme has no insert/command mode distinction. You’re always in “insert mode”—typing inserts text. Commands are executed by clicking on them.
This eliminates mode errors entirely.
Plumbing
Plan 9’s plumber routes messages based on content. Click on a filename: it opens. Click on an error with line number: editor jumps there. Click on a URL: browser opens.
Pattern matching replaces explicit handlers.
What Gaius Inherits
Text as Command
Gaius plans to make log panel text executable:
[Risk] Cluster forming at K10-L12. Consider /analyze K10.
Click on /analyze K10 to execute it. Agent suggestions become actionable.
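A hypothetical sketch of the detection half of this feature: scan a log line for an embedded slash command so it can be offered for execution. The regex and the K10 coordinate format are assumptions for illustration, not the planned implementation:

```python
# Hypothetical sketch of the planned "text as command" behavior: find an
# embedded slash command in a log line so it can be offered for execution.
# The pattern and the K10 coordinate format are assumptions for illustration.
import re

# A slash command: lowercase name, then optional word-like arguments.
CMD_RE = re.compile(r"/[a-z]+(?:\s+[A-Za-z0-9\"\-]+)*")

def extract_commands(line: str) -> list:
    """Return slash-command substrings found in a log line."""
    return CMD_RE.findall(line)

line = "[Risk] Cluster forming at K10-L12. Consider /analyze K10."
print(extract_commands(line))  # ['/analyze K10']
```

The trailing period is excluded by the argument character class, so the extracted text is directly executable.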
Minimal Configuration
Gaius aims for sensible defaults. The grid is 19×19. Colors are fixed. Navigation is hjkl. Power comes from composition, not configuration.
Compositional Commands
Planned command piping:
```
/region D4-F6 | /analyze | /summarize
```
Small operations combine into complex workflows—the Unix way.
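The planned pipe semantics can be sketched as splitting on `|` and threading each stage's result into the next. The stage implementations below are stand-ins, not the real `/region`, `/analyze`, or `/summarize` commands:

```python
# Sketch of the planned pipe semantics: split on '|' and thread each stage's
# result into the next. Stage implementations are stand-ins, not the real
# /region, /analyze, or /summarize commands.
def run_pipeline(spec: str, registry: dict):
    """Execute '/a x | /b | /c', passing each result to the next stage."""
    result = None
    for stage in (s.strip() for s in spec.split("|")):
        name, *args = stage.split()
        result = registry[name](result, *args)
    return result

registry = {
    "/region":    lambda _, span: {"cells": span},
    "/analyze":   lambda prev: {"analysis": "stats for " + prev["cells"]},
    "/summarize": lambda prev: prev["analysis"].upper(),
}

print(run_pipeline("/region D4-F6 | /analyze | /summarize", registry))
# STATS FOR D4-F6
```

Each stage stays small and single-purpose; complexity lives in the composition, not in any one command.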
Simplicity Through Generality
One grid serves many purposes:
- Go stones
- Pension allocations
- Agent positions
- Topological features
The grid is general; overlays specialize.
Where Gaius Differs
Modes Exist
Acme’s modelessness works for text editing. Gaius’s modes serve navigation:
- Normal mode: `hjkl` moves cursor
- Command mode: typing enters commands
- (Future) Visual mode: region selection
Modes concentrate related operations without modifier keys.
Keyboard Priority
Acme was designed for mice (three-button, specifically). Gaius prioritizes keyboard:
- Navigation without mouse
- Commands via slash prefix
- Mode switching via single keys
Both approaches are valid; Gaius serves users who prefer keyboard.
Visualization Over Text
Acme is fundamentally a text environment. Gaius is fundamentally visual:
- Grid as primary display
- Symbols over words
- Patterns over paragraphs
Lessons from Plan 9/Acme
1. Composition Over Features
Don’t add a feature when you can compose existing ones. Gaius’s overlay system composes simple layers; it doesn’t have a “complex visualization mode.”
2. Uniformity Enables Power
Consistent interaction patterns (every overlay cycles with o, every mode toggles with its key) compound into expertise.
3. Text as Interface
Making text executable bridges display and action. Log panel entries become command suggestions.
4. Defaults Over Configuration
Every configuration option is a decision users must make. Prefer good defaults. Gaius’s fixed color scheme and grid size are deliberate.
5. Network Transparency
Gaius doesn’t yet have distributed operation, but the architecture anticipates it:
- Agent swarms could run remotely
- Vector memory could be shared
- Grid state could synchronize
The Acme User Profile
Acme attracts a specific user: one who prefers mastery over convenience, composition over features, simplicity over apparent ease.
Gaius seeks the same users:
- Experts who will invest in learning
- Polymaths who work across domains
- Professionals who value efficiency
If you want a tool that works immediately without learning, Gaius (like Acme) isn’t it. If you want a tool that rewards mastery, welcome.
Rob Pike’s Influence
Pike’s essays—“Notes on Programming in C,” “A Lesson in Brevity,” various design rationales—express a philosophy:
- Clarity over cleverness
- Data structures over algorithms
- Composition over inheritance (before OOP made this controversial)
Gaius aspires to this clarity: a small set of concepts (grid, overlays, modes, commands) that compose into powerful workflows.
OODA Loop
Boyd’s OODA (Observe-Orient-Decide-Act) loop describes competitive decision-making under uncertainty. Gaius is explicitly designed to accelerate each phase.
The Loop in Gaius
Observe
The grid displays current system state. Health checks, agent positions, and topology overlays provide immediate perception without requiring sequential reading.
Tools: Grid view, /health, /gpu status, overlay modes
Orient
Context-building through overlays, memory search, and agent analysis. Multiple perspectives (risk, topology, temporal) help frame observations.
Tools: Overlay cycling (o), /search, /sitrep, MiniGrid projections
Decide
Slash commands, domain changes, and focus actions translate understanding into intent.
Tools: Command input (/), tenuki (t), mode cycling (v)
Act
Execute decisions: run analysis, apply fixes, export insights, trigger evolution.
Tools: /health fix, /evolve trigger, /render, /swarm
Fast OODA Wins
The competitive advantage of OODA comes from cycle speed. Gaius minimizes latency at every stage:
- Observe: Grid renders state instantly (no loading, no scrolling)
- Orient: Overlays toggle without delay (pre-computed)
- Decide: Keyboard-first eliminates mouse targeting time
- Act: Engine RPCs execute in <30s (most <1s)
OODA for Autonomous Agents
The same loop applies to Gaius’s autonomous systems:
| Phase | Health Observer | Evolution Daemon |
|---|---|---|
| Observe | Health checks | GPU utilization monitoring |
| Orient | FMEA risk scoring | Agent performance evaluation |
| Decide | Tier selection (0/1/2) | Candidate ranking |
| Act | Remediation or escalation | Promote or discard |
Fail Open Supports Observation
The fail open principle directly supports the Observe phase: by surfacing unknown states rather than hiding them, it ensures the OODA loop always has complete visibility.
Infrastructure
Gaius runs on a local development infrastructure managed by devenv (Nix-based), with process-compose for service orchestration and Just for task running.
Components
| Component | Purpose | Management |
|---|---|---|
| devenv | Nix-based development environment | devenv shell |
| process-compose | Service orchestration | devenv processes up/down |
| Just | Task runner (recipes) | just <recipe> |
| PostgreSQL | Primary database (:5444) | devenv process |
| Qdrant | Vector store (:6334) | devenv process |
| Aeron | IPC transport | devenv process |
| NiFi | Data ingestion | devenv process |
| Metabase | Analytics dashboards | devenv process |
| Gaius Engine | gRPC daemon (:50051) | devenv process |
Quick Start
# Enter development environment
devenv shell
# Start all services
devenv processes up
# Or clean restart (preferred)
just restart-clean
# Check status
uv run gaius-cli --cmd "/health" --format json
Architecture
devenv.nix is a pure service declaration file (~470 lines). It defines packages, environment variables, service configurations, and process dependency graphs. All process startup bash lives in scripts/processes/*.sh.
See:
- devenv Environment — Nix configuration details
- Process Scripts — Startup script architecture
- Just Task Runner — Available recipes
devenv Environment
Gaius uses devenv for a Nix-based development environment that provides all system dependencies reproducibly.
Structure
devenv.nix is a pure service declaration file. It defines:
- Packages: System tools (kubectl, k9s, mdbook, etc.) provided by Nix
- Environment variables: DATABASE_URL, PGPORT, KUBECONFIG, etc.
- Process definitions: One-liner exec blocks pointing to scripts
- Dependency graphs: Process startup ordering via depends_on
- enterShell: Interactive shell setup (PATH, aliases, KUBECONFIG)
Key Design Rules
No Inline Bash
All process startup bash lives in scripts/processes/*.sh. The devenv.nix exec blocks are one-liners:
processes.gaius-engine = {
exec = ''
exec ${config.devenv.root}/scripts/processes/gaius-engine.sh
'';
};
Nix Store Paths as Env Vars
When a script needs Nix-managed binaries, pass them as environment variables:
processes.nifi = {
exec = ''
export NIFI_PACKAGE="${pkgs.nifi}"
exec ${config.devenv.root}/scripts/processes/nifi.sh
'';
};
KUBECONFIG Handling
enterShell only runs for interactive shells. Process scripts must set KUBECONFIG unconditionally from $HOME:
export KUBECONFIG="$HOME/.config/kube/rke2.yaml"
Never use fallback syntax (${KUBECONFIG:-...}) — the system KUBECONFIG may point to a root-owned path.
Environment Variables
| Variable | Value | Source |
|---|---|---|
DATABASE_URL | postgres://gaius:gaius@localhost:5444/zndx_gaius | devenv.nix |
PGPORT | 5444 | devenv.nix |
KUBECONFIG | ~/.config/kube/rke2.yaml | enterShell |
METAFLOW_SERVICE_URL | http://localhost:8180 | enterShell |
Nix-Managed Tools
kubectl and k9s are provided by Nix (not the system RKE2 binary). This ensures version consistency across environments.
Process Scripts
All process startup bash lives in scripts/processes/*.sh. Shared helpers are in scripts/lib/.
Process Scripts
| Script | Service | Dependencies |
|---|---|---|
aeron-driver.sh | Aeron IPC transport | None |
gaius-engine.sh | gRPC engine daemon | Aeron, PostgreSQL |
gaius-worker.sh | Background worker | Engine |
metabase.sh | Analytics dashboards | PostgreSQL |
metaflow-bootstrap.sh | Metaflow K8s setup | Kubernetes |
metaflow-db-setup.sh | Metaflow database | PostgreSQL |
metaflow-port-forwards.sh | K8s port forwarding | Kubernetes |
metaflow-ui.sh | Metaflow UI | Metaflow service |
nifi.sh | Data ingestion | PostgreSQL |
Shared Helpers
scripts/lib/process-helpers.sh
Common functions used by all process scripts:
| Function | Purpose |
|---|---|
banner | Print startup banner with service name |
check_disabled | Skip if service is disabled via env var |
wait_for_postgres | Block until PostgreSQL is accepting connections |
wait_for_aeron | Block until Aeron driver is ready |
scripts/lib/gpu-helpers.sh
GPU cleanup functions shared by gaius-engine.sh and the justfile:
| Function | Purpose |
|---|---|
gpu_cleanup | Kill orphan vLLM/CUDA processes |
Script Pattern
Every process script follows the same structure:
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$SCRIPT_DIR/../lib/process-helpers.sh"
banner "Service Name"
check_disabled "SERVICE_NAME"
# Wait for dependencies
wait_for_postgres
# Set KUBECONFIG unconditionally (not from enterShell)
export KUBECONFIG="$HOME/.config/kube/rke2.yaml"
# Start the service
exec some-command --flags
Adding a New Process
- Create scripts/processes/<name>.sh with the pattern above
- Add a process block to devenv.nix with a one-liner exec
- Pass any Nix-only values as env vars in the exec block
- Set dependency ordering with process-compose.depends_on
Just Task Runner
Gaius uses Just as its task runner, replacing devenv-tasks, which had SQLite locking issues.
Why Just
devenv-tasks 2.0.0 introduced a tasks.db SQLite file that deadlocks when tasks call devenv up. Just is a pure command runner with no state files — it reads justfile and executes recipes.
Key Recipes
just --list # Show all available recipes
# Core operations
just restart-clean # Full clean restart (preferred)
just proto-generate # Regenerate gRPC protobuf bindings
# GPU management
just gpu-cleanup # Kill orphan vLLM/CUDA processes
just gpu-deep-cleanup # Aggressive GPU memory cleanup
# Documentation
just docs-build # Build mdbook documentation
# Kubernetes
just k8s-cleanup # Clean up K8s resources
restart-clean
The most important recipe. Delegates to scripts/restart-clean.sh:
- Stops all devenv processes
- Kills stale vLLM/CUDA processes
- Cleans up GPU memory
- Strips DEVENV_* environment variables (uses env -i)
- Restarts everything fresh
Warm start time: ~13 seconds (with the Nix store cached).
just restart-clean
Usage
Invoke from the devenv shell (or any shell with just + devenv):
devenv shell
just <recipe>
Recipes are defined in justfile at the project root.
Deployment
Gaius uses RKE2 Kubernetes for production services (Metaflow, supporting infrastructure) and local process-compose for the core platform.
Local Development
The primary deployment model is local, using devenv process-compose:
devenv processes up # Start all services
devenv processes down # Stop all services
just restart-clean # Clean restart (preferred)
Kubernetes Services
Supporting services run in RKE2 Kubernetes:
| Service | Namespace | Purpose |
|---|---|---|
| Metaflow metadata | default | Flow run tracking |
| Metaflow UI | default | Web dashboard |
Kubernetes resources are managed via Tilt in infra/tilt/.
See Also
- Kubernetes — RKE2 configuration
- Metaflow Service — Metaflow deployment
Kubernetes
Gaius uses an RKE2 cluster for running Metaflow and supporting services.
Kubeconfig Setup
RKE2 installs its kubeconfig at /etc/rancher/rke2/rke2.yaml (root-owned). Copy it to a user-accessible location:
mkdir -p ~/.config/kube
sudo cp /etc/rancher/rke2/rke2.yaml ~/.config/kube/rke2.yaml
sudo chown $(id -u):$(id -g) ~/.config/kube/rke2.yaml
Set KUBECONFIG:
export KUBECONFIG="$HOME/.config/kube/rke2.yaml"
This is set automatically by enterShell in devenv.nix for interactive shells. Process scripts set it unconditionally.
Nix-Managed Tools
kubectl and k9s are provided by Nix packages in devenv.nix — not the system RKE2 binary. This ensures version consistency.
Pod Networking
K8s pods need pg_hba.conf entries for cluster networks:
host all all 10.42.0.0/16 md5 # Pod network
host all all 10.43.0.0/16 md5 # Service network
Tilt
Development iteration on K8s resources uses Tilt, configured in infra/tilt/.
Cleanup
just k8s-cleanup # Clean up stale K8s resources
Metaflow Service
The Metaflow metadata service runs in Kubernetes and enables local flow execution with centralized run tracking.
Deployment
Metaflow service is deployed via Tilt in infra/tilt/.
Port Forwarding
The service runs in-cluster and needs port-forwarding for local access:
kubectl port-forward svc/metaflow-service 8180:8080
This is handled automatically by the metaflow-port-forwards.sh process script.
Environment
Set the service URL for flow runs:
export METAFLOW_SERVICE_URL=http://localhost:8180
This is set automatically by enterShell in devenv.nix.
Database
Metaflow uses the same PostgreSQL instance (port 5444) with its own database, set up by the metaflow-db-setup.sh process script.
Bootstrapping
The metaflow-bootstrap.sh script handles initial K8s resource creation for the Metaflow service.
Monitoring
Operational monitoring combines CLI health checks, GPU status tracking, and Metabase dashboards.
Quick Status
# Overall health
uv run gaius-cli --cmd "/health" --format json
# GPU status
uv run gaius-cli --cmd "/gpu status" --format json
# Health Observer incidents
uv run gaius-cli --cmd "/health incidents" --format json
Monitoring Stack
| Tool | Purpose | Access |
|---|---|---|
/health CLI | Infrastructure health checks | CLI/MCP |
/gpu status CLI | GPU and endpoint monitoring | CLI/MCP |
| Health Observer | Continuous background monitoring | Engine daemon |
| Metabase | Analytics dashboards | Web UI |
| Prometheus | Time-series metrics | Query API |
See Also
- Health Checks — Running diagnostics
- GPU Management — GPU operations
Health Checks
The /health command runs diagnostics across all system components and reports status.
Running Health Checks
# All checks
uv run gaius-cli --cmd "/health" --format json
# Specific category
uv run gaius-cli --cmd "/health gpu" --format json
uv run gaius-cli --cmd "/health endpoints" --format json
uv run gaius-cli --cmd "/health infrastructure" --format json
Interpreting Results
Each check reports a status:
| Status | Meaning |
|---|---|
PASS | Component is healthy |
WARN | Component has issues but is functional |
FAIL | Component is unhealthy |
Applying Fixes
When checks fail, use /health fix:
# Fix a specific service
uv run gaius-cli --cmd "/health fix engine" --format json
# Available services
# engine, dataset, nifi, postgres, qdrant, minio, endpoints, evolution
Always try /health fix before manual intervention. This exercises the self-healing system and helps it improve over time.
Manual Fallback
If /health fix fails:
# Full clean restart
just restart-clean
# GPU-specific cleanup
just gpu-cleanup
just gpu-deep-cleanup
FMEA Diagnostics
For deeper analysis:
# FMEA summary with RPN scores
uv run gaius-cli --cmd "/fmea" --format json
# Failure mode details
uv run gaius-cli --cmd "/fmea detail GPU_001" --format json
GPU Management
Gaius manages 6 NVIDIA GPUs across vLLM inference, LuxCore rendering, and embedding workloads.
GPU Allocation
| GPU | Typical Use |
|---|---|
| 0-1 | Reasoning endpoint (tensor_parallel=2) |
| 2-3 | Coding endpoint (tensor_parallel=2) |
| 4 | Embedding endpoint |
| 5 | Available for rendering/evolution |
Allocation is managed by the Orchestrator. GPUs can be temporarily reassigned for rendering or evolution workloads via makespan scheduling.
Status Monitoring
# Endpoint status
uv run gaius-cli --cmd "/gpu status" --format json
# GPU health (memory, temperature, utilization)
uv run gaius-cli --cmd "/gpu health" --format json
Cleanup
When GPU processes get stuck or memory leaks occur:
# Standard cleanup (kill orphan vLLM processes)
just gpu-cleanup
# Deep cleanup (aggressive memory recovery)
just gpu-deep-cleanup
The gpu-helpers.sh shared library provides the gpu_cleanup function used by both the engine startup script and the justfile recipes.
Common Issues
| Issue | Symptom | Fix |
|---|---|---|
| Orphan vLLM process | GPU memory used but no endpoint | just gpu-cleanup |
| OOM during model load | Endpoint stuck in STARTING | Free GPU, then /health fix endpoints |
| CUDA memory fragmentation | Degraded inference speed | just gpu-deep-cleanup then restart |
| OpenCV conflict | vLLM WorkerProc fails (cv2 error) | Already fixed via pyproject.toml override |
Rendering GPU Eviction
The viz pipeline temporarily evicts a low-priority endpoint to use a GPU for LuxCore rendering:
- Orchestrator evicts endpoint from target GPU
- LuxCore renders using PATHOCL engine with CUDA
- clear_embeddings() releases the Nomic model (~3GB)
- Orchestrator restores the evicted endpoint
Contributing
Gaius is an experiment in augmented cognition. Contributions that advance this vision are welcome.
Development Setup
# Clone and enter
git clone https://github.com/zndx/gaius.git
cd gaius
# Start devenv (provides all system dependencies)
devenv shell
# Install Python dependencies
uv sync
# Start all platform components
devenv processes up
# Or use a clean restart
just restart-clean
# Run the TUI
uv run gaius
# Run the CLI
uv run gaius-cli --cmd "/health" --format json
Project Structure
gaius/
├── src/gaius/ # Python source (26 packages)
│ ├── app.py # TUI application
│ ├── cli.py # Non-interactive CLI
│ ├── mcp_server.py # MCP server (163 tools)
│ ├── engine/ # gRPC engine (37 services)
│ ├── health/ # Self-healing infrastructure
│ ├── agents/ # Agent system
│ └── ...
├── scripts/
│ ├── processes/ # Process startup scripts
│ └── lib/ # Shared helpers
├── docs/current/ # mdbook documentation
├── config/ # HOCON configuration
├── justfile # Task runner recipes
├── devenv.nix # Development environment
├── pyproject.toml # Python dependencies
└── CLAUDE.md # Development guidelines
Development Workflow
Testing Changes
The CLI is the product. After every code change, verify via CLI:
# After editing code — always re-test
uv run gaius-cli --cmd "/health" --format json
Previous test outputs are invalidated by code changes. Don’t reason from stale context — run the command again.
Key Recipes
just --list # Show all available tasks
just restart-clean # Full clean restart
just proto-generate # Regenerate gRPC bindings
just gpu-cleanup # Clean up GPU processes
just docs-build # Build documentation
Design Principles
When contributing, these principles are mandatory:
- Fail-fast: Errors surface immediately with guru codes and remediation hints. No silent degradation.
- Engine-first: Business logic belongs in engine services, not in interfaces.
- Self-healing first: Prefer /health fix over manual remediation.
- Keyboard-first: Every operation available via keyboard.
- CLI verification: All new features must be testable via gaius-cli.
Code Style
- Python 3.12+ features welcome
- Type hints for public interfaces
- Local imports inside functions for lazy loading in service modules
- Use from gaius.core.config import get_database_url for the DB URL (never hardcode)
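The lazy-loading rule above can be illustrated with a minimal sketch. The function name and payload are invented for illustration, and `json` stands in for a heavy dependency:

```python
# Sketch of "local imports inside functions for lazy loading".
# A module-level import is paid at service startup; an import inside
# the function is only paid on first call.

def summarize(payload):
    import json  # stand-in for a heavy dependency loaded lazily
    return json.dumps(payload, sort_keys=True)

summarize({"b": 1, "a": 2})
# '{"a": 2, "b": 1}'
```

Importing the service module stays cheap; the cost moves to the first invocation of the function that actually needs the dependency.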
Commit Messages
Use conventional commit style:
feat: add temporal overlay mode
fix: correct grid boundary check
docs: expand TDA explanation
refactor: simplify swarm initialization
Pull Request Process
- Create a feature branch
- Make changes with clear commits
- Verify via CLI: uv run gaius-cli --cmd "/health" --format json
- Ensure cd docs/current && mdbook build succeeds if docs changed
- Submit a PR with a description of the changes
Architecture Decision Records
Key architectural decisions that shaped the system.
ADR-001: Engine-First Architecture
Context: Business logic was scattered across TUI, CLI, and utility scripts, causing duplication and inconsistency.
Decision: Centralize all business logic in the gRPC engine. TUI, CLI, and MCP become thin clients.
Consequences: Single source of truth for all operations. Engine manages GPU resources centrally. All interfaces get consistent behavior automatically.
ADR-002: Just Over devenv-tasks
Context: devenv-tasks 2.0.0 introduced SQLite locking on tasks.db that deadlocks when tasks call devenv up.
Decision: Migrate from devenv-tasks to Just as the task runner.
Consequences: Pure command runner with no state files. Recipes defined in justfile. No locking issues. scripts/restart-clean.sh still does actual work; Just recipe delegates to it.
ADR-003: Fail-Fast as Iron-Clad Principle
Context: Silent degradation hid problems until they became critical.
Decision: All code must surface errors immediately with guru meditation codes and remediation paths. No silent fallbacks.
Consequences: Higher initial friction (more explicit error handling) but dramatically faster diagnosis and resolution. Self-healing system built on reliable error detection.
ADR-004: FMEA for Health Monitoring
Context: Simple severity classifications don’t capture risk adequately — a rare but invisible failure is more dangerous than a frequent but obvious one.
Decision: Adopt FMEA (Failure Mode and Effects Analysis) with RPN scoring for health monitoring.
Consequences: Quantitative risk assessment (S x O x D). Tiered remediation based on risk level. Adaptive learning from outcomes.
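The RPN arithmetic behind this decision is simple enough to sketch. The tier thresholds below are hypothetical illustrations, not Gaius's actual boundaries:

```python
def rpn(severity, occurrence, detection):
    """FMEA Risk Priority Number: S x O x D, each scored 1-10 (range 1-1000)."""
    for v in (severity, occurrence, detection):
        assert 1 <= v <= 10
    return severity * occurrence * detection

def tier(score):
    # Hypothetical tier boundaries, for illustration only.
    if score >= 200:
        return 2  # requires approval
    if score >= 80:
        return 1  # automated remediation
    return 0      # log only

score = rpn(7, 4, 5)  # 140: moderate severity, occasional, detectable
```

A rare but invisible failure (say O=2, D=9) can outscore a frequent but obvious one (O=8, D=1) at equal severity, which is exactly the point of RPN over plain severity levels.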
ADR-005: LuxCore Over Blender for Visualization
Context: Blender’s Cycles renderer couldn’t render glass convincingly (opaque white blobs).
Decision: Use LuxCore unbiased path tracer for card visualization, initially via PyPI, later from source for GPU acceleration.
Consequences: Physically accurate glass rendering. GPU-accelerated via PATHOCL engine with CUDA. More complex build process but superior visual quality.
ADR-006: Process Scripts Over Inline Nix Bash
Context: devenv.nix contained inline bash blocks that were hard to debug and test.
Decision: Move all process startup bash to scripts/processes/*.sh with shared helpers. devenv.nix becomes a pure service declaration file with one-liner exec blocks.
Consequences: Scripts are independently testable. Shared helpers eliminate duplication. Nix store paths passed as environment variables.
Adding New ADRs
When making significant architectural decisions:
- Add an entry here with Context, Decision, and Consequences
- Reference the ADR in relevant code comments
- Update CLAUDE.md if the decision affects development workflow
Proto Change Workflow
Changes to the gRPC protobuf schema require a specific workflow to keep generated bindings, internal enums, and status mappings in sync.
Step-by-Step
1. Edit the Proto File
Edit src/gaius/engine/proto/gaius_service.proto. Append new enum values; never renumber existing values, to preserve wire compatibility.
2. Regenerate Bindings
just proto-generate
This generates gaius_service_pb2.py and gaius_service_pb2_grpc.py.
3. Update Generated Exports
Add new symbols to src/gaius/engine/generated/__init__.py:
- Add to the import block
- Add to the __all__ list
Critical: Skipping this causes engine startup failures.
4. Update Internal Enums
If there’s a parallel Python enum (e.g., in vllm_controller.py), sync it with the proto enum.
5. Update Status Mappings
Add string-to-proto mappings in the servicer’s _STATUS_MAP.
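The mapping in this step is a plain dict from engine-internal status strings to proto enum values. A hedged sketch of its shape, with stand-in enum constants in place of the real generated proto members:

```python
# Illustrative shape of a servicer _STATUS_MAP; the real one maps to
# generated proto enum members (values here are stand-ins).
ENDPOINT_UNKNOWN, ENDPOINT_STARTING, ENDPOINT_READY = 0, 1, 2

_STATUS_MAP = {
    "starting": ENDPOINT_STARTING,
    "ready": ENDPOINT_READY,
}

def to_proto_status(internal):
    # Fail fast on unmapped statuses rather than silently defaulting:
    # a missing entry here is exactly the "status shows wrong value" bug.
    try:
        return _STATUS_MAP[internal]
    except KeyError:
        raise ValueError(f"no proto mapping for status {internal!r}")
```

Failing loudly on an unknown key makes a forgotten mapping show up immediately instead of as a wrong status in /gpu status output.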
6. Verify
uv run python -c "from gaius.engine.generated import NEW_SYMBOL; print('OK')"
7. Restart and Test
just restart-clean
uv run gaius-cli --cmd "/gpu status" --format json
Common Mistakes
| Symptom | Cause | Fix |
|---|---|---|
| Engine fails to start | Missing export in __init__.py | Add symbol to imports and __all__ |
| Port 50051 not listening | Import error in gRPC server | Check engine logs |
| Status shows wrong value | Missing _STATUS_MAP entry | Add mapping |
See Protobuf Schema for more detail.
Testing
Gaius follows a CLI-first testing methodology. The CLI is the product — all functionality must be verifiable through it.
Core Rules
1. Re-Test After Every Code Change
Previous test outputs are invalidated by code changes. Don’t reason from stale context — run the command again.
# After editing code:
# BAD: "The fix should work based on my analysis"
# GOOD: Actually run it
uv run gaius-cli --cmd "/evolve status" --format json
2. No Static Test Data
We do not fall back to static test data. All functional aspects of new features must be verified directly against running services.
3. No Fallback Workarounds
Do not rely on fallbacks or workarounds when testing. If a service is down, fix it (via /health fix) rather than mocking around it.
Verification Patterns
Health Check
uv run gaius-cli --cmd "/health" --format json
Endpoint Status
uv run gaius-cli --cmd "/gpu status" --format json | jq '.data.endpoints[] | {name, status}'
Evolution Status
uv run gaius-cli --cmd "/evolve status" --format json
Import Verification
For new modules or proto changes:
uv run python -c "from gaius.engine.generated import NewSymbol; print('OK')"
TUI Testing
TUI behavior must be tested using Textual Pilot before committing:
# Inside an async test function; `app` is the Gaius App instance under test
async with app.run_test() as pilot:
    await pilot.press("h")  # Navigate left
    assert app.cursor_x == expected_x
Fail-Fast Compliance
Before committing, verify:
# No fallback patterns
grep -rn "fail_fast\|SELENIUM_AVAILABLE" src/gaius/
# No placeholder image colors
grep -rn "240, 240, 240" src/gaius/
All error messages must include guru meditation codes and remediation hints.
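A minimal sketch of what a compliant error looks like. The exception class and helper are invented for illustration; the guru code itself is taken from the catalog in this book:

```python
class GuruError(RuntimeError):
    """Hypothetical error type carrying a guru code and remediation hint."""
    def __init__(self, code, message, fix):
        super().__init__(f"{code} {message} (fix: {fix})")
        self.code, self.fix = code, fix

def require_engine(connected):
    # Fail fast: surface the failure with a code and a remediation path,
    # never degrade silently.
    if not connected:
        raise GuruError("#GR.00000001.CONNFAIL",
                        "gRPC connection failed", "check engine status")
```

The message alone tells the operator what failed, which catalog entry it is, and what to try first.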
CLI Commands
63 slash commands are available in both the TUI and CLI interfaces. Commands are executed via:
# TUI: press / then type command
/health
# CLI: use --cmd flag
uv run gaius-cli --cmd "/health" --format json
Command Categories
Health & Diagnostics
| Command | Description |
|---|---|
/health | Run all health checks |
/health <category> | Run checks for category (gpu, endpoints, infrastructure) |
/health fix <service> | Apply automated fix strategy |
/health observer | Health Observer daemon status |
/health incidents | List active incidents |
/fmea | FMEA summary with RPN scores |
/fmea catalog | List all failure modes |
/fmea detail <id> | Failure mode details |
GPU & Endpoints
| Command | Description |
|---|---|
/gpu status | Endpoint and GPU status |
/gpu health | GPU memory, temperature, utilization |
Agents & Evolution
| Command | Description |
|---|---|
/swarm | Run swarm analysis |
/evolve status | Evolution daemon status |
/evolve trigger | Trigger evolution cycle |
/cognition | Trigger cognition cycle |
/thoughts | View recent thoughts |
/sitrep | Situational report |
/theta consolidate | Run theta consolidation |
Knowledge Base
| Command | Description |
|---|---|
/search <query> | Search knowledge base |
/kb list | List KB entries |
/kb create | Create KB entry |
System
| Command | Description |
|---|---|
/state | Current application state |
/render | Render card visualizations |
/xai budget | XAI API budget status |
Note: This is a representative subset. Run /help in the TUI for the complete list, or see the dispatch table in src/gaius/cli.py.
MCP Tools
163 MCP tools expose Gaius functionality to Claude Code and other MCP-compatible clients.
Tool Categories
| Category | Count | Description |
|---|---|---|
| Health | ~20 | Health checks, FMEA, observer, incidents |
| Agents | ~15 | Swarm, evolution, cognition, theta |
| Inference | ~10 | Scheduler, endpoints, GPU status |
| Knowledge Base | ~15 | Search, CRUD, sync, semantic search |
| Observability | ~10 | Metrics, Prometheus, status |
| Data Pipeline | ~10 | Metaflow, lineage, flows |
| Visualization | ~5 | Render, card management |
| Bases | ~10 | Feature store queries, entity history |
| Collections | ~15 | Card collections, publishing |
| Articles | ~5 | Article curation, status |
| X Bookmarks | ~8 | Sync, auth, folders |
| Calibration | ~5 | Understanding calibration |
| Evolution | ~10 | Agent versions, optimization |
| System | ~25 | Config, models, sessions, research |
Naming Convention
Tools follow a consistent naming pattern: <domain>_<action> (e.g., health_observer_status, scheduler_submit, gpu_health).
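The convention makes tool names mechanically groupable by domain. A small sketch using the example names above (the split on the first underscore is an assumption about how the convention is meant to be read):

```python
from collections import defaultdict

def group_by_domain(tools):
    """Group <domain>_<action> tool names by their domain prefix."""
    groups = defaultdict(list)
    for name in tools:
        domain, _, action = name.partition("_")
        groups[domain].append(action)
    return dict(groups)

groups = group_by_domain(["health_observer_status", "gpu_health", "scheduler_submit"])
# {"health": ["observer_status"], "gpu": ["health"], "scheduler": ["submit"]}
```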
Example Usage
From Claude Code:
> Use the health_observer_status tool to check system health
> Use the gpu_health tool to check GPU memory usage
> Use the search_kb tool to find articles about pensions
Server Configuration
{
"mcpServers": {
"gaius": {
"command": "uv",
"args": ["run", "gaius-mcp"],
"cwd": "/path/to/gaius"
}
}
}
Note: For the complete tool list with parameters, see src/gaius/mcp_server.py.
Guru Meditation Codes
Complete catalog of error codes used across the Gaius platform.
Format
#<COMPONENT>.<SEQUENCE>.<MNEMONIC>
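The format is regular enough to validate mechanically. A hedged sketch: the exact grammar (sequence width, optional subsystem segments such as SEC) is inferred from the catalog entries below, not from a formal specification:

```python
import re

# Inferred grammar: '#' COMPONENT ('.' SUBSYSTEM)* '.' DIGITS '.' MNEMONIC
GURU_RE = re.compile(r"^#([A-Z]+(?:\.[A-Z]+)*)\.(\d+)\.([A-Z0-9_]+)$")

def parse_guru(code):
    """Split a guru meditation code into (component, sequence, mnemonic)."""
    m = GURU_RE.match(code)
    if not m:
        raise ValueError(f"not a guru meditation code: {code!r}")
    component, sequence, mnemonic = m.groups()
    return component, int(sequence), mnemonic

parse_guru("#ACP.SEC.00000002.NOTALLOWED")
# ("ACP.SEC", 2, "NOTALLOWED")
```

Parsing the sequence as an integer tolerates the varying zero-padding seen across components (#EN.00001 vs #DS.00000001).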
Catalog
DS — DatasetService
| Code | Description | Fix |
|---|---|---|
#DS.00000001.SVCNOTINIT | DatasetService not initialized | /health fix dataset |
NF — NiFi
| Code | Description | Fix |
|---|---|---|
#NF.00000001.UNREACHABLE | NiFi not reachable | /health fix nifi |
EN — Engine
| Code | Description | Fix |
|---|---|---|
#EN.00001.GRPC_BIND | gRPC port bind failure | Check port 50051 |
#EN.00002.VLLM_START | vLLM startup failure | /health fix endpoints |
#EN.00003.GPU_OOM | GPU out of memory | just gpu-cleanup |
#EN.00004.ORPHAN_PROC | Orphan vLLM process | just gpu-cleanup |
EP — Endpoints/Inference
| Code | Description | Fix |
|---|---|---|
#EP.00000001.GPUOOM | GPU out of memory during inference | /health fix endpoints |
GR — gRPC
| Code | Description | Fix |
|---|---|---|
#GR.00000001.CONNFAIL | gRPC connection failed | Check engine status |
ACP — Agent Client Protocol
| Code | Description | Fix |
|---|---|---|
#ACP.00000001.CONNFAIL | ACP connection failed | Check Claude Code |
#ACP.00000002.TIMEOUT | ACP connection timeout | Retry |
#ACP.SEC.00000002.NOTALLOWED | Repo not in allowlist | Update acp.conf |
#ACP.SEC.00000003.NOTPRIVATE | Repo not private | Make repo private |
ACF — Article Curation Flow
| Code | Description | Fix |
|---|---|---|
#ACF.00000013.NOHINTS | Empty keywords in article frontmatter | Add keywords/news_queries |
XB — X Bookmarks
| Code | Description | Fix |
|---|---|---|
#XB.00000001.NOTOKEN | No auth token | Complete OAuth flow |
#XB.00000011.NOFOLDER | Folders API unavailable (403) | Upgrade API tier |
HL — Health
| Code | Description | Fix |
|---|---|---|
#HL.00001.GRPC_DOWN | gRPC connection down | just restart-clean |
#HL.00002.GPU_OOM | GPU memory exhausted | just gpu-cleanup |
Note: This is a representative subset. Guru codes are assigned as new failure modes are identified. See CLAUDE.md for the full format specification.
Database Schema
PostgreSQL database zndx_gaius on port 5444.
Connection
PGPASSWORD=gaius psql -h localhost -p 5444 -U gaius -d zndx_gaius
Connection URL: postgres://gaius:gaius@localhost:5444/zndx_gaius?sslmode=disable
Important: The database name is zndx_gaius, not gaius.
Key Tables
Cards & Content
| Table | Purpose |
|---|---|
cards | Card entities with metadata |
card_enrichments | Enrichment data for cards |
articles | Source articles |
article_content | Article text content |
FMEA & Health
| Table | Purpose |
|---|---|
fmea_catalog | Failure mode definitions (S/O/D scores) |
fmea_occurrences | Failure occurrence history |
fmea_outcomes | Remediation outcomes (for adaptive learning) |
fmea_approvals | Pending Tier 2 approvals |
healing_events | Self-healing audit trail |
health_observer_state | Observer daemon state |
Agents & Evolution
| Table | Purpose |
|---|---|
agent_versions | Agent prompt versions with lineage |
agent_evaluations | Evaluation results for evolution |
Operations
| Table | Purpose |
|---|---|
activity_log | System activity tracking |
x_bookmarks | X bookmark sync data |
x_bookmark_folders | X bookmark folder metadata |
x_sync_runs | X sync run history |
Accessing from Code
Always use the config helper:
from gaius.core.config import get_database_url
url = get_database_url()
Never hardcode connection parameters.
Configuration
Gaius uses HOCON configuration files with environment variable overrides. The canonical source is config/base.conf.
Configuration Hierarchy
1. config/base.conf — Default values
2. ~/.config/gaius/acp.conf — ACP-specific overrides
3. Environment variables — Highest priority
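The precedence rule can be sketched as a lookup chain. This is a simplified illustration using plain dicts; in real use `env` would be `os.environ` and `base`/`overrides` would come from whatever HOCON loader Gaius uses:

```python
def resolve(key, env_var, env, base, overrides):
    """Env var > override file > base.conf default (simplified precedence)."""
    if env_var and env_var in env:
        return env[env_var]
    if key in overrides:
        return overrides[key]
    return base[key]

base = {"database.port": "5444"}
value = resolve("database.port", "PGPORT", {"PGPORT": "5555"}, base, {})
# value == "5555"; with no env entry it falls back to "5444"
```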
Database
| Key | Default | Env Var | Description |
|---|---|---|---|
database.host | localhost | PGHOST | PostgreSQL host |
database.port | 5444 | PGPORT | PostgreSQL port |
database.name | zndx_gaius | PGDATABASE | Database name |
database.user | gaius | PGUSER | Database user |
database.password | gaius | PGPASSWORD | Database password |
Important: Always use gaius.core.config.get_database_url() to get the connection URL. Never hardcode.
Engine
| Key | Default | Env Var | Description |
|---|---|---|---|
engine.grpc.host | 0.0.0.0 | GAIUS_ENGINE_HOST | gRPC bind host |
engine.grpc.port | 50051 | GAIUS_ENGINE_PORT | gRPC port |
engine.grpc.max_workers | 10 | — | Max gRPC worker threads |
engine.orchestrator.preload_endpoints | ["reasoning"] | — | Endpoints to load on startup |
engine.orchestrator.startup_timeout | 600 | — | Startup timeout (seconds) |
engine.scheduler.default_timeout | 120 | GAIUS_ENGINE_TIMEOUT | Default inference timeout |
engine.evolution.enabled | true | — | Enable evolution daemon |
engine.evolution.idle_threshold | 60 | — | GPU idle seconds before evolution |
Health
| Key | Default | Description |
|---|---|---|
health.check_interval | 60 | Health Observer poll interval (seconds) |
health.fmea.learning_rate | 0.2 | Adaptive S/O/D learning rate |
health.self_healing.enabled | true | Enable automatic remediation |
Agents
| Key | Default | Description |
|---|---|---|
agents.swarm.parallel | true | Enable parallel swarm execution |
agents.swarm.timeout | 60 | Swarm execution timeout (seconds) |
agents.theta.confidence_threshold | 0.8 | Theta consolidation confidence threshold |
ACP Security
Configured in ~/.config/gaius/acp.conf:
acp {
github {
allowed_repos = ["zndx/gaius-acp"]
require_private = true
verify_on_each_operation = true
cache_visibility_seconds = 300
}
}
Glossary
ACP — Agent Client Protocol. Integration layer for Claude Code to perform autonomous health maintenance.
AgendaTracker — Tracks scheduled endpoint transitions for makespan operations, preventing false-positive health incidents.
APO — Automatic Prompt Optimization. Technique for evolving agent system prompts.
Bases — Feature store for entity-centric data with temporal queries.
CLT — Cross-Layer Transcoder. Extracts sparse interpretable features from model activations.
Death Loop — An H1 topological feature (persistent cycle) in embedding space. Indicates feedback loops, circular dependencies, or systemic risk.
devenv — Nix-based development environment providing reproducible builds.
DQL — Domain Query Language. Query syntax for the Bases feature store.
FMEA — Failure Mode and Effects Analysis. Quantitative risk assessment using RPN scoring.
Guru Meditation Code — Unique error identifier (e.g., #DS.00000001.SVCNOTINIT). Inspired by the Amiga.
H0/H1/H2 — Homology dimensions: H0 = connected components, H1 = loops, H2 = voids.
HOCON — Human-Optimized Config Object Notation. Configuration file format used by Gaius.
Just — Command runner replacing devenv-tasks. Reads recipes from justfile.
KServe OIP — Open Inference Protocol. Standard gRPC interface for ML inference.
LuxCore — Unbiased path tracing renderer used for card visualizations.
Makespan — Total time from start to finish of a multi-step workload (eviction, loading, inference, restoration).
MCP — Model Context Protocol. Programmatic interface exposing 163 tools to AI assistants.
optillm — Inference-time reasoning enhancement (CoT, BoN, MoA techniques).
PATHOCL — LuxCore rendering engine using OpenCL/CUDA for GPU acceleration.
RASE — Rapid Agentic Systems Engineering. Python-native MBSE metamodel for verifiable agent training.
RLVR — Reinforcement Learning with Verifiable Reward. Agent training methodology.
RPN — Risk Priority Number. FMEA score: Severity × Occurrence × Detection (range 1-1000).
Tenuki — Go term for playing away from the current area. In Gaius, jumps the cursor to a strategic point.
Theta Consolidation — Memory compression inspired by hippocampal theta rhythms. Links knowledge across temporal slices.
TUI — Terminal User Interface. Interactive Textual application launched with uv run gaius.
vLLM — High-throughput LLM inference engine. Managed by the Orchestrator across 6 GPUs.
ACP Incident Resolution: 2026-01-01
A milestone in autonomous self-healing: Claude Code resolves GPU allocation conflicts using Gaius MCP tools.
Overview
On January 1, 2026, the Gaius HealthObserver daemon detected GPU memory exhaustion and escalated to Claude Code via the Agent Client Protocol (ACP). This document captures the complete investigation and resolution session, demonstrating the first successful end-to-end ACP escalation workflow.
Key Achievements
- Autonomous Root Cause Analysis: Claude Code identified overlapping GPU allocations between multiple endpoints
- MCP Tool Integration: Used `gpu_status`, `health_check`, and orchestrator commands via gRPC
- Self-Healing Validation: Monitored recovery and verified resolution through the same tooling
- Multiple Incident Cascade: Handled a follow-on incident caused by the same GPU contention pattern
Incident 1: GPU_001:gpu_1_health
```text
▐▛███▜▌ Claude Code v2.0.54
▝▜█████▛▘ Opus 4.5 · Claude Max
▘▘ ▝▝ ~/local/src/zndx/gaius
```
Health Incident Context
Fingerprint: GPU_001:gpu_1_health
Endpoint: gpu_1_health
Failure Mode: GPU_001
RPN Score: 125 (S:5 × O:5 × D:5)
Escalation Tier: 2
Attempts: 3
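The RPN score in the context above is the standard FMEA product, reproducible with a few lines (the 1-10 factor range is the conventional FMEA scale, matching the 1-1000 range quoted in the glossary):

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """FMEA Risk Priority Number: product of three 1-10 factors (range 1-1000)."""
    for factor in (severity, occurrence, detection):
        if not 1 <= factor <= 10:
            raise ValueError("FMEA factors are scored on a 1-10 scale")
    return severity * occurrence * detection

print(rpn(5, 5, 5))  # the incident's S:5 x O:5 x D:5 score -> 125
```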
Initial State
| GPU | Memory Used | Memory % | Healthy |
|---|---|---|---|
| 0 | 1.29 GB | 5.4% | ✓ |
| 1 | 23.6 GB | 98.4% | ✗ |
| 2 | 22.89 GB | 95.4% | ✓ |
| 3 | 22.89 GB | 95.4% | ✓ |
| 4 | 22.89 GB | 95.4% | ✓ |
| 5 | 0.56 GB | 2.3% | ✓ |
Endpoints at Incident Time
| Endpoint | Model | GPUs | Status | Issue |
|---|---|---|---|---|
| cap_reasoning | Qwen/QwQ-32B | [1,2,3,4] | healthy | Consuming GPU 1 |
| orchestrator | nvidia/Orchestrator-8B | [0,1] | starting | “Insufficient GPU memory” |
Root Cause Analysis
Claude Code identified the issue immediately:
Root Cause: GPU allocation conflict between two endpoints trying to use GPU 1:
- `cap_reasoning` (Qwen/QwQ-32B): uses GPUs [1, 2, 3, 4], status: healthy
- `orchestrator` (nvidia/Orchestrator-8B): uses GPUs [0, 1], status: starting with "Insufficient GPU memory"

The `cap_reasoning` endpoint loaded first and consumed GPU 1's memory (~23.6 GB of 24 GB). When `orchestrator` tries to start on GPUs [0, 1], GPU 0 has memory available (1.29 GB used) but GPU 1 is already exhausted.
Investigation Commands
```bash
# Check endpoint status
uv run gaius-cli --cmd "/gpu status" --format json

# Stop the stuck orchestrator
uv run gaius-cli --cmd "/gpu stop orchestrator" --format json

# Monitor recovery
uv run gaius-cli --cmd "/health gpu" --format json
```
Resolution
After stopping the orchestrator endpoint, the scheduler automatically rebalanced:
| GPU | Before | After |
|---|---|---|
| 0 | 5.4% | 0.01% |
| 1 | 98.4% | 0.01% ✓ |
| 2 | 95.4% | 92.2% |
| 3 | 95.4% | 94.3% |
| 4 | 95.4% | 94.3% |
| 5 | 2.3% | 0.01% |
Final State:
- orchestrator: HEALTHY (port 8094)
- coding: HEALTHY (port 8093)
- cap_reasoning: STOPPING
- fast: STARTING (port 8095)
Incident 2: VLLM_001:coding
Immediately after resolving the first incident, a second cascaded incident appeared.
Health Incident Context
Fingerprint: VLLM_001:coding
Endpoint: coding
Failure Mode: VLLM_001
RPN Score: 125 (S:5 × O:5 × D:5)
Escalation Tier: 2
Attempts: 3
Conflict Analysis
| Endpoint | GPUs | Status | Memory on GPU 1 |
|---|---|---|---|
| cap_reasoning | [1,2,3,4] | healthy | 22.89 GB (95.4%) |
| orchestrator | [0,1] | stopping | competing |
| coding | [1] | failed | can’t allocate |
| fast | [0] | healthy | - |
Resolution
The scheduler handled this automatically:
- Stopped
cap_reasoningto free GPUs [1,2,3,4] - Stopped
orchestratorandcoding - Cleared all GPU memory (95%+ → 0%)
- Restarted endpoints with non-overlapping allocations
Final State:
| Endpoint | Status |
|---|---|
| orchestrator | HEALTHY ✓ |
| coding | HEALTHY ✓ |
| reasoning | STOPPING |
| fast | STARTING |
Observations
What Worked
- FMEA-Based Escalation: RPN scoring correctly identified severity (125 = S:5 × O:5 × D:5)
- MCP Tool Chain: All diagnostic commands worked through gRPC proxying
- Scheduler Self-Healing: Automatic GPU reallocation after conflicts cleared
- Cascading Incident Detection: Second incident properly tracked with separate fingerprint
Identified Gaps
- GPU Overlap Detection: Scheduler allowed conflicting GPU assignments (`cap_reasoning` and `orchestrator` both claimed GPU 1)
- Startup Ordering: No precedence constraints ensured larger models claim GPUs first
- Runtime Validation: GPU allocations only validated at scheduling time, not continuously
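The missing runtime check is straightforward to sketch: scan the current allocations for GPUs claimed by more than one endpoint. The data is shaped like the incident tables above; the helper name is hypothetical:

```python
from collections import defaultdict

def find_gpu_conflicts(allocations):
    """Return {gpu_index: [endpoints]} for every GPU claimed more than once.
    Hypothetical check; not the Gaius scheduler's actual validation."""
    claims = defaultdict(list)
    for endpoint, gpus in allocations.items():
        for gpu in gpus:
            claims[gpu].append(endpoint)
    return {gpu: owners for gpu, owners in claims.items() if len(owners) > 1}

# The Incident 1 allocation: both endpoints claimed GPU 1
print(find_gpu_conflicts({"cap_reasoning": [1, 2, 3, 4], "orchestrator": [0, 1]}))
```

Run continuously (rather than only at scheduling time), a check like this would have flagged the conflict before `orchestrator` exhausted its retry attempts.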
Order 3+ RCA Observations
These connect to CP-SAT constraints in `makespan_scheduler.py`:
| Constraint | Gap Identified |
|---|---|
| `GPU_MUTUAL_EXCLUSION` | Enforced at planning time, not at runtime |
| `CONTIGUITY_REQUIREMENT` | TP endpoints need contiguous GPU blocks |
| `PRECEDENCE` | Large models should claim GPUs before small ones |
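Two of these constraints are easy to express as plain predicates, shown here as hedged sketches rather than the CP-SAT formulations in `makespan_scheduler.py` (function names and the size metric are assumptions):

```python
def is_contiguous(gpus):
    """CONTIGUITY_REQUIREMENT sketch: a tensor-parallel endpoint's GPUs
    must form an unbroken block of indices."""
    ordered = sorted(gpus)
    return ordered == list(range(ordered[0], ordered[-1] + 1))

def precedence_ok(start_order, model_sizes):
    """PRECEDENCE sketch: endpoints must start in non-increasing model size,
    so large models claim GPUs before small ones."""
    sizes = [model_sizes[name] for name in start_order]
    return all(a >= b for a, b in zip(sizes, sizes[1:]))

print(is_contiguous([1, 2, 3, 4]))  # True: an unbroken block
print(precedence_ok(["cap_reasoning", "orchestrator"],
                    {"cap_reasoning": 32, "orchestrator": 8}))  # True: 32B starts first
```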
Significance
This incident represents a milestone in Gaius’s self-healing capabilities:
- First Successful ACP Escalation: HealthObserver → Claude Code → MCP tools → Resolution
- Closed-Loop Verification: Claude Code verified resolution using same tools that detected the issue
- RCA Framework Validation: Order 3+ observations identified scheduler constraint gaps
- Multi-Incident Handling: Cascading incidents tracked and resolved in sequence
The GPU allocation conflict exposed architectural issues that led to the RCA (Root Cause Analysis) framework development, enabling future incidents to be classified as OPERATIONAL (transient) or ARCHITECTURAL (needs code fix).
Captured from ACP session on 2026-01-01 04:11-04:45 UTC
Changelog
Notable changes and milestones in Gaius development.
2026-03 (Current)
- Bases feature store with DQL query language
- Card publishing gates on enrichment completeness
- Content Freshness health check
- KV coherence health check
- Per-type watchdog timeouts for scheduled tasks
2026-02
- LuxCore GPU rendering (PATHOCL engine with CUDA)
- Just task runner (replacing devenv-tasks)
- Process script architecture (no inline bash in devenv.nix)
- FMEA health framework with adaptive learning
- Article curation flow with Brave search
- X Bookmarks sync with folder-first strategy
- OpenCV/vLLM dependency conflict resolution
2026-01
- ACP (Agent Client Protocol) for Claude Code integration
- Health Observer daemon with ACP escalation
- Guru Meditation Code system
- Content sanitization for ACP security
2025-12
- gRPC engine with 37 services
- Orchestrator with makespan scheduling
- Evolution daemon with APO optimization
- Cognition service (autonomous thoughts)
- Theta consolidation pipeline
2025-11
- Initial TUI with 19×19 grid
- Persistent homology visualization
- Multi-agent swarm execution
- mdbook documentation foundation
- MiniGrid orthographic projections