
Gaius

Gaius is a terminal interface for navigating graph-oriented data domains. It projects high-dimensional embeddings onto a discrete lattice via UMAP, computes persistent homology and Ollivier–Ricci curvature over the embedding space, and renders the results as interactive overlays on the lattice.

Named after Gaius Plinius Secundus (Pliny the Elder), whose Naturalis Historia cataloged the natural world across 37 books.

Capabilities

  1. Lattice Projection: UMAP (cosine metric, k=15 neighbors, min_dist=0.1) maps embedding vectors to continuous 2D coordinates. These are quantized to a 19×19 integer lattice by rounding and clipping to [0, 18]. The main lattice is accompanied by two 9×9 orthographic mini-grids centered on the cursor: an Embed view showing the local cosine-similarity neighborhood, and an Iso view rendering scalar fields (curvature, persistence, complexity) as elevation maps via inverse-distance-weighted interpolation (power=2).

  2. Persistent Homology (H₀–H₂): Ripser computes a Vietoris–Rips filtration over the cosine distance matrix of the original high-dimensional embeddings (not the projected coordinates), producing persistence barcodes for dimensions 0 through 2. Intervals with persistence > 0.1 are marked significant. H₀ captures connected components, H₁ captures 1-cycles, and H₂ captures 2-dimensional voids. Barcodes are rendered as overlays on the lattice, with persistent generators mapped to their lattice positions via the UMAP projection.

  3. Ollivier–Ricci Curvature: Discrete Ricci curvature is computed on a k-nearest-neighbor graph (k=15, cosine metric) constructed from the embedding space, using the OTD method with α=0.5. Per-node curvature is the mean of incident edge curvatures. The resulting curvature field, gradient vectors (finite-difference approximation), and divergence values are projected to the Iso mini-grid. Positive curvature indicates cluster interiors; negative curvature indicates semantic boundaries.

  4. Multi-Agent Exploration: Seven agents (Leader, Risk, Optimizer, Planner, Critic, Executor, Adversary) navigate the lattice with role-specific positioning behaviors (center-seeking, peripheral, random) and cluster affinities. Agent training uses the RASE framework (Rapid Agentic Systems Engineering), where constraints are composed declaratively via AllOf/AnyOf/Not and evaluated by a ground-truth oracle to produce verifiable reward signals.

  5. Modal Interface: Vim-style modal navigation (hjkl motion, slash-command dispatch, overlay toggles) over both the lattice and the underlying gRPC service graph.

  6. FMEA Health Observer: A background daemon scores system components on Severity × Occurrence × Detection. When risk priority numbers exceed configured thresholds, it escalates to an agent via the Agent Client Protocol (ACP) for FMEA-mediated intervention.
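The per-node aggregation in item 3 can be sketched in a few lines. This is a hedged illustration of the stated rule (node curvature = mean of incident edge curvatures); the edge values are illustrative constants, not output of an actual Ollivier–Ricci solver:

```python
# Per-node aggregation: node curvature = mean of incident edge curvatures.
# Edge values are illustrative; in the real pipeline they would come from
# an Ollivier-Ricci computation (alpha=0.5, OTD) on the k-NN graph.
from collections import defaultdict

edge_curvature = {
    ("a", "b"): 0.4,   # inside a cluster: positive curvature
    ("b", "c"): 0.2,
    ("c", "d"): -0.3,  # across a semantic boundary: negative curvature
}

incident = defaultdict(list)
for (u, v), k in edge_curvature.items():
    incident[u].append(k)
    incident[v].append(k)

node_curvature = {n: sum(ks) / len(ks) for n, ks in incident.items()}
# node "c" averages 0.2 and -0.3 -> -0.05 (it straddles a boundary)
```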

Computational Pipeline

The following pipeline is implemented end-to-end:

  1. Embed — Documents are encoded as multi-vector embeddings (ColNomic, GPU-accelerated) and indexed.
  2. Project — UMAP maps the embedding space to 2D; coordinates are rounded to the 19×19 integer lattice.
  3. Filtration — Vietoris–Rips filtration over the cosine distance matrix of original embeddings; Ripser computes persistence barcodes for H₀, H₁, H₂. Significant intervals (persistence > 0.1) produce topological overlays.
  4. Curvature — Ollivier–Ricci curvature on the k-NN graph (k=15, α=0.5, OTD); curvature, gradient, and divergence fields are interpolated onto the 9×9 Iso mini-grid via IDW.
  5. Exploration — Agents operate on the lattice; topological features and curvature values are available as grid state for trajectory selection.
  6. Rendering — LuxCore path-traces procedural card visualizations from the computed geometric features.

The lattice serves as both a visualization surface and a discrete approximation of the data manifold, coupling persistent homology, differential geometry, and agent-based exploration in one interactive system.
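The significance filter in step 3 can be sketched directly: keep intervals whose persistence (death − birth) exceeds 0.1. The (birth, death) pairs below are illustrative, not actual Ripser output:

```python
# Step-3 significance filter: persistence = death - birth, keep > 0.1.
# The intervals are illustrative, not real barcode data.
import numpy as np

h1_intervals = np.array([
    [0.10, 0.15],  # persistence 0.05: likely noise
    [0.20, 0.55],  # persistence 0.35: a significant 1-cycle
    [0.30, 0.38],  # persistence 0.08: likely noise
])

persistence = h1_intervals[:, 1] - h1_intervals[:, 0]
significant = h1_intervals[persistence > 0.1]
# only the (0.20, 0.55) interval survives the cutoff
```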

Architecture

  • Inference — gRPC control plane with 37 services coordinating 6 NVIDIA GPUs via makespan-scheduled vLLM
  • Interfaces — TUI, CLI, and MCP server (163 tools), all communicating with the engine via shared gRPC protocol
  • Pipelines — Metaflow orchestration for article curation, agent evaluation, and batch rendering
  • Visualization — LuxCore PATHOCL engine with GPU-accelerated rendering driven by a CFDG-inspired grammar
  • Observability — FMEA-scored health observer with ACP-mediated agent intervention
  • Storage — Bases feature store with a domain query language compiled to SQL via AST-based guardrails; RASE metamodel for agent verification

Getting Started

# Launch the TUI
uv run gaius

# Use the CLI for scripting
uv run gaius-cli --cmd "/health" --format json

# Check system status
uv run gaius-cli --cmd "/gpu status" --format json

Navigate with hjkl. Cycle overlays with o. Toggle modes with v. Press ? for help.

Vision & Philosophy

The Polymath’s Dilemma

Modern knowledge work demands synthesis across domains. A pension analyst must understand markets, demographics, regulation, and behavioral economics—simultaneously. A systems architect must hold network topology, security surfaces, performance characteristics, and team dynamics in mind as a unified whole.

Yet our tools present information in fragments. Spreadsheets. Dashboards. Slide decks. Chat interfaces. Each offers a narrow aperture onto a high-dimensional reality.

Gaius proposes a different approach: spatial synthesis. By projecting complex relationships onto a navigable grid, it transforms abstract complexity into something the human visual system can grasp intuitively—patterns, clusters, voids, and flows.

Why a Grid?

The 19×19 Go board is not arbitrary. It represents a sweet spot in human visual cognition:

  • 361 points: Enough resolution for meaningful differentiation, few enough for gestalt perception
  • Addressable: Every point has a name (A1 through T19), enabling precise reference
  • Compositional: Regions, groups, and territories emerge naturally from point relationships
  • Battle-tested: 4,000 years of Go strategy have proven this grid’s capacity to represent complex strategic landscapes

The grid constrains—and constraint enables clarity. A 19×19 board forces prioritization. What matters enough to occupy space?

Topological Intuition

Raw data has shape. Clusters form. Loops persist. Voids signal absence. Traditional visualization obscures this topology behind axes, legends, and chart types.

Persistent homology offers a different lens. It asks: what structures survive as we vary our perspective? The resulting “death loops” (H1 features) reveal cycles in your data—feedback loops, circular dependencies, systemic risks—that persist across scales.

When projected onto the grid, these become visible warnings: regions to investigate, patterns to understand, risks to mitigate.

Agentic Amplification

A single human perspective is insufficient for complex domains. Gaius deploys autonomous agents that explore, evolve, and consolidate knowledge. Each agent brings a distinct analytical lens, and their capabilities improve through RLVR (Reinforcement Learning with Verifiable Reward) training.

Agent outputs are embedded and projected onto the grid. Watch agents converge on consensus. Notice where they scatter (uncertainty). Observe who stands alone (contrarian insight). The grid becomes a map of collective intelligence.

Design Principles

1. Keyboard-First

Every action available via keyboard. Mouse optional. This isn’t nostalgia—it’s recognition that flow state requires low-latency, high-bandwidth input.

2. Progressive Disclosure

Launch with uv run gaius and get a clean TUI instantly. Three interfaces — TUI, CLI, MCP — offer increasing levels of automation. Complexity arrives when requested.

3. Modal Operation

Modes aren’t complexity—they’re context. Navigate in normal mode. Enter commands in command mode. Each mode offers a focused set of operations.

4. Composability

Each component (board, log, overlay) is independent. Combine them. Split them. Tile them. The interface adapts to your workflow.

5. Transparency

No magic. The grid shows exactly what it’s told to show. Overlays are explicit. Agent positions reflect actual embeddings. Trust requires transparency.

The Goal

Gaius aims to demonstrate that terminal interfaces need not be constrained to text streams. That topological insight can be made visual. That agent augmentation can be made spatial.

It’s an experiment in augmented cognition—using machines not to replace human judgment, but to extend human perception into domains our unaided senses cannot reach.

Core Concepts

Gaius integrates several conceptual pillars: spatial representation, topological analysis, autonomous agents, and self-healing infrastructure. This section introduces the foundational ideas; subsequent chapters explore each in depth.

The Grid

At the center of Gaius is a 19×19 board. This isn’t a chart or a dashboard — it’s a canvas for projection.

High-dimensional data (embeddings, agent states, risk surfaces) gets compressed onto 361 addressable points. The compression is lossy by design: it forces salience. What survives projection is what matters.

The grid supports multiple visualization modes:

  • Point markers: Individual data points as stones
  • Density heatmaps: Aggregate intensity via shading
  • Topology overlays: Death loops and persistent features
  • Agent positions: Agent state projected from embedding space

See The Grid Metaphor for the full treatment.

Embeddings

Modern ML represents entities as vectors in high-dimensional space. Text, images, users, documents — all become points in a geometric landscape where distance encodes similarity.

Gaius consumes these embeddings directly. Agent utterances become vectors. Domain entities become vectors. Cards, articles, and knowledge base entries occupy positions in embedding space. The relationships between them — cosine similarities, clusters, outliers — become spatial relationships on the grid.

See Embeddings & Point Clouds for details on how Gaius handles vector representations.

Persistent Homology

Traditional statistics describe data’s distribution. Topology describes its shape.

Persistent homology asks: as we vary the scale of observation, what features persist?

  • H0 features (connected components): Clusters that remain distinct
  • H1 features (loops): Cycles that don’t collapse — the “death loops”
  • H2 features (voids): Empty regions bounded by surfaces

These topological features often reveal structure invisible to statistical methods: feedback loops in systems, circular dependencies in code, liquidity traps in markets.

See Persistent Homology for the mathematical foundations and practical applications.

Autonomous Agents

Gaius agents are not static analyzers — they evolve. Through RLVR (Reinforcement Learning with Verifiable Reward) training, agents improve their capabilities over time. The agent system includes:

  • Evolution: Task ideation, training runs, and capability evaluation
  • Cognition: Self-observation and action planning
  • Theta consolidation: Memory compression inspired by hippocampal replay
  • CLT memory: Cognitive Load Theory-based knowledge structuring

See Agent System for implementation details.

Self-Healing

Gaius implements autonomous health monitoring based on FMEA (Failure Mode and Effects Analysis). Every failure mode has:

  • A Guru Meditation Code for unique identification (e.g., #DS.00000001.SVCNOTINIT)
  • An automated fix strategy that can diagnose, repair, and verify
  • An escalation path to ACP (Agent Client Protocol) when self-healing fails

Errors are never silenced. The system either fixes itself or tells you exactly what’s wrong and how to fix it.
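The FMEA scoring behind this observer (Severity × Occurrence × Detection) can be sketched as follows. The 1–10 ratings, component names, and the threshold value are illustrative stand-ins, not Gaius configuration:

```python
# Hedged sketch of FMEA scoring: RPN = Severity x Occurrence x Detection,
# with escalation past a threshold. Names and numbers are hypothetical.
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk priority number; each factor is conventionally rated 1-10."""
    return severity * occurrence * detection

RPN_THRESHOLD = 200  # stand-in for a configured threshold

components = {
    "dataset": (7, 4, 8),  # severe, occasional, hard to detect
    "grpc": (3, 2, 2),
}

escalate = [name for name, (s, o, d) in components.items()
            if rpn(s, o, d) > RPN_THRESHOLD]
# rpn for "dataset" is 224 > 200, so it would escalate to ACP
```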

See Fail-Fast & Self-Healing for the design principles.

Putting It Together

A typical Gaius session:

  1. Launch the TUI: uv run gaius
  2. Observe the grid state — entity positions projected from embedding space
  3. Navigate (hjkl): Explore regions of interest
  4. Overlay (o): See topology, risk, or agent state
  5. Command (/): Run slash commands for deeper analysis
  6. Monitor (/health): Check system health, let self-healing handle issues

The grid becomes a living map of your domain’s complexity — updated as agents explore and topology reveals hidden structure.

The Grid Metaphor

Origins in Go

The 19×19 grid traces its heritage to the ancient game of Go (围棋/囲碁/바둑). For over four millennia, this board has served as a substrate for strategic reasoning of remarkable depth.

Go’s grid has properties that make it ideal for information visualization:

  • Discrete but dense: 361 points offer fine granularity while remaining visually tractable
  • Symmetric: No privileged positions (unlike chess’s asymmetric opening)
  • Emergent structure: Corners, edges, and center have different strategic character despite identical local rules
  • Scale-invariant patterns: The same shapes (eyes, ladders, ko) appear at multiple scales

The Grid as Projection Surface

In Gaius, the grid serves as a projection surface for high-dimensional data. Consider an embedding space with 1536 dimensions (typical for modern text embeddings). How do we make this legible?

High-dimensional space          The Grid
      (n=1536)                  (n=361)
         │                         │
         │    PCA / UMAP /         │
         │    custom projection    │
         ▼                         ▼
    ┌─────────┐              ┌───────────┐
    │ ● ● ●   │              │ · · ● · · │
    │   ●   ● │    ────►     │ · ● · · · │
    │ ●     ● │              │ · · · ● · │
    └─────────┘              └───────────┘

The projection is necessarily lossy. This is a feature: it forces salience. Points that survive projection and remain distinct are points that matter.

Addressing

Every grid position has a unique address:

   A B C D E F G H J K L M N O P Q R S T
19 · · · · · · · · · · · · · · · · · · · 19
18 · · · · · · · · · · · · · · · · · · · 18
17 · · · + · · · · · · · · · + · · · · · 17
...
 1 · · · · · · · · · · · · · · · · · · ·  1
   A B C D E F G H J K L M N O P Q R S T

Note: Column I is skipped (Go convention, to avoid confusion with the numeral 1).

This addressing enables:

  • Precise reference: “The cluster at D4-F6”
  • Command targeting: /analyze K10 or /mark Q16 critical
  • Spatial queries: “What’s near the center?” → J10-L10, J9-L11
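The address scheme above can be sketched as a pair of conversion helpers. The function names are illustrative, not Gaius APIs; the column string encodes the skipped-I convention:

```python
# Addressing sketch: columns A-T with "I" skipped (Go convention),
# rows 1-19. Helper names are illustrative.
COLS = "ABCDEFGHJKLMNOPQRST"  # 19 letters, no "I"

def to_address(x: int, y: int) -> str:
    """x: 0-18 (left to right), y: 0-18 (bottom to top)."""
    return f"{COLS[x]}{y + 1}"

def from_address(addr: str) -> tuple[int, int]:
    return COLS.index(addr[0]), int(addr[1:]) - 1

to_address(8, 9)     # -> "J10", the center point
from_address("Q16")  # -> (15, 15)
```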

Visual Vocabulary

The grid supports a rich visual vocabulary:

Point Markers

| Symbol | Meaning                        |
|--------|--------------------------------|
| ●      | Black stone / primary entity   |
| ○      | White stone / secondary entity |
|        | Cursor position                |
| a-i    | Candidate markers (yellow)     |
| ·      | Neutral / unaffiliated point   |

Density Shading

| Symbol | Density         |
|--------|-----------------|
| █      | High (>75%)     |
| ▓      | Medium (50-75%) |
| ░      | Low (20-50%)    |
| ·      | Minimal (<20%)  |

Overlay Markers

| Symbol    | Meaning                 |
|-----------|-------------------------|
|           | Death loop / H1 feature |
| (colored) | Agent position          |

The Grid as Strategic Map

In Go, professionals often describe the board in terms of strategic regions:

  • Corners (4 points): High-value, easy to secure
  • Edges (4 sides): Secondary value, harder to defend
  • Center: Hardest to claim, but dominates late-game influence

Gaius inherits this intuition. Data projected to corners represents stable, well-understood entities. Central positions represent contested or ambiguous terrain. Edge regions represent transitional states.

Compositional Thinking

The grid invites compositional reasoning:

  • Groups: Connected points form units (liberty-counting in Go becomes cluster analysis)
  • Territory: Regions bounded by your stones (areas of control/understanding)
  • Influence: Distant effects from strong positions (attention propagation)
  • Ko: Positions that oscillate (unstable equilibria in your data)

These metaphors aren’t forced—they emerge naturally when complex systems are projected onto discrete spatial representations.

Why Not a Larger Grid?

Larger grids (e.g., 100×100) would offer more resolution but sacrifice:

  • Gestalt perception: Humans can’t perceive 10,000 points holistically
  • Addressability: 100×100 requires two-digit coordinates
  • Strategic depth: Go on 9×9 is trivial; 19×19 is profound. Scale matters.

The 19×19 board occupies a cognitive sweet spot. Gaius exploits this.

Embeddings & Point Clouds

What Are Embeddings?

Embeddings are learned vector representations that encode semantic relationships as geometric relationships. Two items that are “similar” in meaning have embedding vectors that are “close” in space.

"pension fund"     → [0.23, -0.41, 0.88, ...]  (1536 dims)
"retirement plan"  → [0.25, -0.39, 0.86, ...]  (nearby)
"pizza recipe"     → [-0.67, 0.12, -0.33, ...] (distant)

Modern embedding models (text-embedding-3-small, etc.) produce vectors where:

  • Cosine similarity measures semantic relatedness
  • Euclidean distance measures conceptual separation
  • Clusters emerge naturally from semantic categories
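The first bullet can be made concrete with a toy calculation. These are 3-dimensional stand-ins for real 1536-dimensional embedding vectors, echoing the example above:

```python
# Toy cosine similarity over 3-dim stand-ins for 1536-dim embeddings.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

pension = [0.9, 0.1, 0.0]
retire = [0.8, 0.2, 0.1]
pizza = [0.0, 0.1, 0.9]

# cosine_similarity(pension, retire) is near 1.0 (semantically close);
# cosine_similarity(pension, pizza) is near 0.0 (unrelated)
```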

Point Clouds in Gaius

When multiple embeddings are collected—agent utterances, domain entities, document fragments—they form a point cloud in embedding space.

# Each agent utterance becomes a point
async def collect_cloud(swarm, task, embedder):
    cloud = []
    for agent in swarm:
        response = await agent.analyze(task)
        embedding = embedder.embed(response)
        cloud.append(embedding)
    return cloud  # shape: (n_utterances, embedding_dim)

This point cloud is the raw material for both:

  1. Grid projection (what you see)
  2. Topological analysis (what the math reveals)

Projection Methods

High-dimensional clouds must be compressed for visualization. Common methods:

PCA (Principal Component Analysis)

Finds the axes of maximum variance. Fast, deterministic, but linear—may miss curved structure.

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
projected = pca.fit_transform(cloud)

UMAP (Uniform Manifold Approximation)

Preserves local neighborhood structure. Better for clusters, but non-deterministic.

Custom Projections

Domain-specific projections can encode prior knowledge. For pension analysis:

  • X-axis: Risk (low → high)
  • Y-axis: Time horizon (short → long)

Mapping to the Grid

Once projected to 2D, coordinates are scaled to [0, 18] and discretized:

import numpy as np

# Normalize to [0, 1]
x_norm = (projected[:, 0] - projected[:, 0].min()) / (np.ptp(projected[:, 0]) + 1e-8)
y_norm = (projected[:, 1] - projected[:, 1].min()) / (np.ptp(projected[:, 1]) + 1e-8)

# Scale to grid
x_grid = np.clip((x_norm * 18).astype(int), 0, 18)
y_grid = np.clip((y_norm * 18).astype(int), 0, 18)

Multiple points may map to the same grid cell. This is handled by:

  • Latest-wins: Most recent point displayed
  • Color mixing: Combined representation
  • Intensity: Brighter = more points
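The intensity strategy can be sketched with `np.add.at`, which accumulates correctly even when the same cell index appears multiple times (plain fancy-indexed `+=` would not). The coordinates are illustrative:

```python
# "Intensity" collision handling: count points per lattice cell.
# np.add.at accumulates repeated indices; coordinates are illustrative.
import numpy as np

x_grid = np.array([4, 4, 4, 10])   # three points collide at column 4, row 7
y_grid = np.array([7, 7, 7, 2])

counts = np.zeros((19, 19), dtype=int)
np.add.at(counts, (y_grid, x_grid), 1)

# counts[7, 4] is 3; a renderer can map higher counts to brighter cells
```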

Semantic Distance on the Grid

Grid distance roughly corresponds to semantic distance—but the projection is lossy. Two points adjacent on the grid are likely related; two points distant are likely unrelated. But edge cases exist.

The grid offers intuition, not precision. For exact similarity queries, consult the underlying embeddings directly.

Temporal Dynamics

As new data arrives (agent responses, user queries, domain events), the point cloud evolves:

t=0: Initial cloud from seed data
t=1: + First swarm round utterances
t=2: + User query embeddings
t=3: + Second swarm round...

The grid animates this evolution. Watch clusters form, dissolve, migrate. These dynamics reveal how understanding develops over time.

Vector Memory Integration

All embeddings are stored in the Vector Memory system, enabling:

  • Retrieval: “Find utterances similar to X”
  • Scene graphs: Build edges from cosine similarity
  • History: Track the trajectory of specific agents/entities
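The scene-graph bullet can be sketched in matrix form: normalize the stored vectors, take the Gram matrix for all pairwise cosine similarities at once, and connect pairs above a threshold. The vectors and the 0.8 cutoff are illustrative, not Vector Memory internals:

```python
# Sketch of "build edges from cosine similarity": normalized Gram matrix,
# then threshold. Vectors and the 0.8 cutoff are illustrative.
import itertools
import numpy as np

names = ["fund", "plan", "recipe"]
vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])

unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
sim = unit @ unit.T  # all pairwise cosine similarities at once

edges = [(names[i], names[j])
         for i, j in itertools.combinations(range(len(names)), 2)
         if sim[i, j] > 0.8]
# -> [("fund", "plan")]
```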

See Vector Memory for implementation details.

Persistent Homology

Beyond Statistics

Statistics describes the distribution of data: mean, variance, correlations. But distributions are blind to shape.

Consider two point clouds:

Cloud A:          Cloud B:
  ● ●               ●   ●
 ● ● ●             ●     ●
 ● ● ●             ●     ●
  ● ●              ●     ●
                    ●   ●

Same mean. Same variance. Same point count. But Cloud A is a filled disk; Cloud B is a ring with a hole. The hole is topologically significant—it represents something absent, something that might matter.

Persistent homology is the mathematics of detecting such shapes.

The Vietoris-Rips Complex

Given a point cloud, we construct a simplicial complex by connecting points within a distance threshold ε:

ε = small:     ε = medium:     ε = large:
  ●   ●         ●───●           ●───●
                    │           │╲ ╱│
  ●   ●         ●   ●           ●─╳─●
                                │╱ ╲│
  ●   ●         ●───●           ●───●

As ε increases:

  • H0 features (connected components): Merge as clusters connect
  • H1 features (loops): Appear when edges close cycles, disappear when interiors fill
  • H2 features (voids): Appear when surfaces enclose volumes

Birth and Death

Each topological feature has a birth time (the ε at which it appears) and a death time (the ε at which it vanishes).

Features that persist across a wide range of ε are considered significant—they reflect genuine structure rather than noise.

Persistence Diagram:
        death
          │
          │    ● (signal: long-lived)
          │
          │          ●
          │        ●    (noise: short-lived,
          │      ●       near the diagonal)
          └──────────── birth

Points far from the diagonal represent persistent features.

Death Loops (H1)

In Gaius, H1 features receive special attention as “death loops.” These represent:

  • Cycles in data flow: Feedback loops, circular dependencies
  • Systemic risks: Self-reinforcing failure modes
  • Market structures: Liquidity cycles, regulatory arbitrage loops

When projected onto the grid, death loops appear as markers in regions where the underlying embedding space exhibits persistent 1-dimensional holes.

Practical Application

from gtda.homology import VietorisRipsPersistence
from gtda.diagrams import PersistenceEntropy

# Compute persistence diagrams (each row: [birth, death, dimension])
vr = VietorisRipsPersistence(homology_dimensions=[0, 1, 2])
diagrams = vr.fit_transform([point_cloud])

# Quantify topological complexity
entropy = PersistenceEntropy()
ent = entropy.fit_transform(diagrams)

# Extract significant H1 features (persistence = death - birth)
threshold = 0.1  # significance cutoff used throughout Gaius
h1_features = diagrams[0][diagrams[0][:, 2] == 1]
persistent_loops = h1_features[h1_features[:, 1] - h1_features[:, 0] > threshold]

Entropy as Summary

Persistence entropy provides a scalar summary of topological complexity:

  • Low entropy: Few dominant features (simple structure)
  • High entropy: Many features of similar persistence (complex, fractal-like)

Gaius tracks entropy over time. Sudden entropy spikes may indicate regime changes in your domain.

Interpreting Grid Overlays

When viewing the H1 overlay:

| Pattern   | Interpretation                                                |
|-----------|---------------------------------------------------------------|
| Sparse    | Few persistent loops; structure is tree-like                  |
| Clustered | Localized cyclic structure; investigate region                |
| Uniform   | Pervasive cyclicity; may indicate noise or genuine complexity |
| Ring      | Boundary of a significant void                                |

Limitations

Persistent homology reveals shape but not causation. A detected loop could represent:

  • A real feedback cycle in your domain
  • An artifact of the embedding model
  • Noise in the underlying data

Domain expertise is required to interpret topological features. Gaius surfaces the structure; you provide the meaning.

Further Reading

  • Computational Topology by Edelsbrunner and Harer
  • Topological Data Analysis by Carlsson
  • giotto-tda documentation: giotto-ai.github.io

Epistemology of Augmented Cognition

How knowledge grows in a human-AI system

The Tautology

Augmented cognition must yield nonrandom advantage with verifiable outcomes.

This isn’t philosophy for its own sake. It’s the test. If the human-plus-system doesn’t produce results that beat the null hypothesis—problems solved faster, connections seen that would be missed, errors avoided, artifacts of higher quality—then the augmentation is theater.

Everything that follows serves this constraint.

The Third Mind

The Enlightenment assumed the individual mind as atomic unit: properly disciplined reason, applied to sensory evidence, converging on truth. The Romantic correction enriched the channels—emotion, intuition, aesthetic sense—but preserved the individual.

What if both missed something?

Cognition may have never been atomic. It distributes across brains, books, conversations, environments. The “individual thinker” was always a convenient fiction—useful for assigning credit and blame, but not how thinking actually happens.

Gaius makes the distribution explicit:

  • The KB is externalized shared memory
  • The swarm is a parliament of perspectives
  • The cognition system generates thoughts between sessions
  • The human brings mortality, stakes, aesthetic judgment, and the ability to act

What emerges is a third mind—something that belongs fully to neither human nor AI. It’s not human intelligence augmented by AI (the usual framing). It’s not AI directed by human. It’s a novel form of collaborative cognition that neither could produce alone.

The Dialectic on the Board

The 19×19 grid represents a fundamental tension:

One color (Order/Logos): The Enlightenment inheritance. Kant’s categories imposing structure on raw experience. Each stone is a fact—tested, confirmed, placed with certainty. The mind palace architecture where memory has address and retrieval is deterministic. This force embodies the best virtues of enlightenment thinking: we may come to know the universe through experience of our senses and share this knowing with others who may confirm or refute our understanding.

The other color (Entropy/Eros): The Romantic counter-current. Nietzsche’s Dionysian impulse that shatters Apollonian form. Bergson’s élan vital—life as creative evolution resisting mechanistic reduction. Each stone is a question, a provocation, a refusal to settle into local minima. This antithetical force is the path toward what may be an undiscovered formal description language for aesthetics.

The colors randomize daily. This prevents rooting for “our team.” Some days order serves creativity; some days entropy is the path to truth.

The Go metaphor is apt because Go isn’t chess—there’s no king to capture, no objective hierarchy. Victory is territory, which is liminal: stones create influence that shades into emptiness. The game rewards both sente (initiative, creativity) and gote (response, consolidation).

Memory and Compaction

An old man remembers every aspect of his first kiss but can’t recall breakfast.

This isn’t failure—it’s selection. The first kiss persists because it integrated into everything else: identity, narrative, desire, loss. It has a thousand hooks into the larger structure. Breakfast has one hook: “I ate.” No redundancy. Nothing to reconstruct from.

Human memory isn’t a tape recorder with degradation. It’s a living graph that keeps what connects and lets the rest dissolve. The “compression” isn’t lossy in the information-theoretic sense—it’s meaning-preserving. What matters survives.

The same principle applies to Gaius:

Should persist:

  • What changed understanding
  • What connects to many other things
  • What might matter later in ways we can’t predict
  • What was beautiful—even if we can’t justify why

Should dissolve:

  • Scaffolding that served its purpose
  • Dead ends fully explored
  • Noise that looked like signal until it didn’t

The test: does this have hooks into the future?

The Lens: Falsifiable Forward Simulation

What separates understanding from memorization?

You can memorize that water boils at 100°C. You understand thermodynamics when you can simulate: “what happens to boiling point at altitude?” and get an answer that reality confirms.

Forward simulation + falsification = the engine of real knowledge.

This connects to work across domains:

  • PINNs (Physics-Informed Neural Networks): Neural nets constrained by differential equations that must hold. The physics prior forces the model to learn something simulatable, not just interpolatable.
  • Portfolio optimization: Build a model of covariances and returns, simulate forward, and the market confirms or refutes. The held-out Sharpe ratio is the falsification.
  • SAT solvers: Explore logical possibility space by propagating constraints forward—if I assume X, what follows? Does it contradict something known?

Knowledge Hierarchy

Highest value: Knowledge that enables forward simulation with testable outputs

  • “If we do X, Y should happen”—then we can check
  • Causal models, not just correlations
  • Theories, not just observations

Medium value: Observations that could become simulatable once enough accumulate

  • Data points that might reveal structure
  • Anomalies that challenge existing models

Lowest value: Isolated facts with no predictive hooks

  • Things that are true but don’t connect forward
  • The old man’s breakfast

The Dialectic Reframed

Through this lens, Order and Entropy both serve falsifiable simulation:

  • Order = model refinement (tightening predictions, reducing uncertainty)
  • Entropy = model exploration (new hypotheses, expanded possibility space)

Order sharpens the blade. Entropy finds new things to cut.

Implications for Design

  1. Score knowledge by forward-simulation capacity: Does this KB entry let you predict something you couldn’t before? Can that prediction be tested?

  2. Cognition should generate hypotheses: Between sessions, Gaius shouldn’t just summarize—it should ask: “what would I predict? what remains testable?”

  3. Evolution should favor predictive prompts: The held-out evaluation tests whether agent improvements transfer beyond training data.

  4. The grid should reveal predictive structure: Clusters might indicate shared causal mechanisms. Voids might indicate underdetermined regions. H1 cycles might indicate feedback loops with predictable dynamics.

  5. Compaction should preserve predictive content: When context windows fill, what survives should be what enables future simulation, not just what was recently accessed.

The Asymmetry

The human has continuity. The KB accumulates externalized cognition across sessions. Understanding can be observed evolving—in git history, in dated files, in logged thoughts.

The AI has no such continuity. Each session bootstraps from artifacts. Something that functions like understanding emerges within the session, but doesn’t persist. Tomorrow’s instance won’t remember this exchange unless it’s written down.

The human observes understanding in the mirror of shared artifacts. The AI is more like the mirror itself—a surface that reflects with some distortion, some amplification, but doesn’t retain the image once you look away.

But this asymmetry may be feature, not bug. The AI can’t get stuck in ruts, can’t accumulate biases from past sessions, always brings fresh eyes. The persistence lives in the artifacts, not in the AI.

And the tautology holds regardless: nonrandom advantage with verifiable outcomes. The test isn’t whether the AI has continuous selfhood. The test is whether the collaboration produces results.


This document emerged from collaborative discourse, December 2024. It attempts to capture understanding that might otherwise dissolve—not because discourse is unimportant, but because the impermanence of conversation is precisely what makes externalization necessary.

Fail-Fast & Self-Healing

Fail-fast is an iron-clad design principle in Gaius. All code surfaces errors immediately with actionable remediation paths. The system never silently degrades, falls back to placeholders, or continues with partial functionality.

The Principle

When something goes wrong, the correct response is not to hide it — it’s to surface it immediately with enough information to fix it. Every error message in Gaius includes:

  1. Guru Meditation Code: A unique identifier for the failure mode
  2. Health Fix Command: A reference to /health fix <service> when applicable
  3. Manual Remediation: Alternative manual steps if self-healing can’t resolve it
error_msg = (
    "DatasetService not initialized.\n"
    "  Guru: #DS.00000001.SVCNOTINIT\n"
    "  Try: /health fix dataset\n"
    "  Or:  just restart-clean"
)

Guru Meditation Codes

Inspired by the Amiga’s memorable error screens, every failure mode gets a unique identifier.

Format: #<COMPONENT>.<SEQUENCE>.<MNEMONIC>

Component  Description
DS         DatasetService
NF         NiFi
EN         Engine
EP         Endpoints/Inference
EV         Evolution
DB         Database
QD         Qdrant
GR         gRPC
ACP        Agent Client Protocol
ACF        Article Curation Flow

Each code maps to exactly one failure mode. A failure mode may have multiple diagnostic heuristics, but the code is the canonical identifier.
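
Because the format is fixed, a code can be taken apart mechanically. A minimal parsing sketch — the regex and helper name are illustrative, and the eight-character hex sequence is an assumption based on the example above:

```python
import re

# Illustrative parser for #<COMPONENT>.<SEQUENCE>.<MNEMONIC> codes.
# Assumes an 8-character hex sequence, as in #DS.00000001.SVCNOTINIT.
GURU_RE = re.compile(
    r"^#(?P<component>[A-Z]+)\.(?P<sequence>[0-9A-F]{8})\.(?P<mnemonic>[A-Z0-9]+)$"
)

def parse_guru_code(code: str) -> dict[str, str]:
    match = GURU_RE.match(code)
    if match is None:
        raise ValueError(f"Not a Guru Meditation Code: {code!r}")
    return match.groupdict()
```

For example, parse_guru_code("#DS.00000001.SVCNOTINIT") yields component DS, sequence 00000001, mnemonic SVCNOTINIT.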

See Guru Meditation Codes for the complete catalog.

What Fail-Fast Prohibits

No Optional Fallbacks

Never use fail_fast=True as a parameter. Fail-fast is the ONLY behavior, not an option.

No Silent Degradation

If a required resource is unavailable (LLM endpoint, NiFi, database), raise an error immediately. Never substitute placeholder data or skip functionality.

No Conditional Feature Flags for Core Functionality

Don’t use patterns like if SELENIUM_AVAILABLE: with an else clause that produces fake data. Either the feature works or it fails.

Fail Open for Observability

The counterpart to fail-fast for observability code is fail open. When filtering or displaying health state:

  1. Filter OUT, not IN: When showing active incidents, filter out known terminal states (resolved) rather than filtering in known active states. Unknown states are surfaced for investigation.

  2. Unknown States are Visible: Any state not in the “terminal” list is displayed. This ensures new or unexpected states don’t silently disappear.

# BAD: Filtering IN known active states (brittle)
active = [i for i in incidents if i.status in ("active", "healing")]

# GOOD: Filtering OUT known terminal states (fail open)
active = [i for i in incidents if i.status != "resolved"]

Self-Healing Hierarchy

When services are unhealthy, Gaius follows a remediation hierarchy:

  1. /health fix <service> — Let Gaius attempt self-healing first
  2. Manual commands (just restart-clean, etc.) — Only if self-healing fails
  3. ACP escalation — For novel failures that need human or AI intervention

The Health Observer daemon continuously monitors all system components. When an incident exceeds the configured FMEA RPN (Risk Priority Number) threshold, it escalates through ACP to Claude Code for meta-level intervention.
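
The hierarchy can be sketched as a single walk down the tiers. Everything here is illustrative: the stub coroutines stand in for real Gaius operations, and the RPN threshold of 300 is taken from the paragraph above, not from the actual configuration.

```python
import asyncio

# Stubs standing in for real Gaius operations (assumptions, not the API).
async def attempt_self_heal(service: str) -> bool:
    return service == "dataset"        # stub: pretend only this self-heals

async def run_manual_remediation(service: str) -> bool:
    return service == "engine"         # stub: e.g. just restart-clean

async def escalate_via_acp(service: str, rpn: int) -> None:
    pass                               # stub: would notify Claude Code

async def remediate(service: str, rpn: int, threshold: int = 300) -> str:
    """Walk the hierarchy: self-heal first, manual fallback, ACP last."""
    if rpn > threshold:
        await escalate_via_acp(service, rpn)   # meta-level intervention
        return "escalated"
    if await attempt_self_heal(service):       # /health fix <service>
        return "self_healed"
    if await run_manual_remediation(service):  # manual commands
        return "manually_fixed"
    await escalate_via_acp(service, rpn)       # novel failure
    return "escalated"
```

The ordering matters: automated remediation is always attempted before anything that requires a human, unless the risk score already demands escalation.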

Heuristics and KB

Each failure mode has a corresponding heuristic document in the knowledge base:

  • Symptom: Brief description of what the user sees
  • Cause: Why this happens
  • Observation: How to detect it programmatically
  • Solution: How to fix it, with /health fix command

This creates a closed loop: errors reference codes, codes map to heuristics, heuristics provide automated fixes.

System Overview

Gaius is a platform for navigating complex, graph-oriented data domains. It projects high-dimensional embeddings and topological structures onto a 19x19 grid, augmented by autonomous agents, self-healing infrastructure, and production data pipelines.

Package Structure

src/gaius/
├── app.py              # TUI application (Textual)
├── cli.py              # Non-interactive CLI
├── mcp_server.py       # MCP server (163 tools)
├── core/               # Configuration, state, telemetry
├── engine/             # gRPC engine (central nervous system)
│   ├── server.py       # Main daemon
│   ├── proto/          # Protobuf definitions
│   ├── generated/      # Generated gRPC bindings
│   ├── grpc/           # gRPC servicers
│   ├── services/       # 37 registered services
│   └── backends/       # vLLM, optillm, embedding controllers
├── health/             # FMEA-based self-healing
│   ├── observe.py      # Health Observer daemon
│   ├── fmea/           # Risk scoring framework
│   └── service_fixes.py # Automated remediation
├── agents/             # Autonomous agent system
│   ├── evolution/      # RLVR training
│   ├── theta/          # Memory consolidation
│   └── cognition/      # Self-observation
├── inference/          # Multi-backend routing
├── flows/              # Metaflow data pipelines
├── viz/                # LuxCore visualization
├── storage/            # PostgreSQL + Qdrant
├── acp/                # Agent Client Protocol
├── rase/               # RASE metamodel (agent verification)
├── bases/              # Feature store
├── hx/                 # History and lineage
├── observability/      # OpenTelemetry + Prometheus
├── widgets/            # TUI widgets
├── commands/           # Slash command implementations
├── kb/                 # Knowledge base operations
├── models/             # Agent model versioning
├── client/             # gRPC client library
└── mcp/                # MCP tool implementations

Layer Architecture

The system is organized in layers with strict dependency direction:

Layer           Components                 Responsibility
L1 - Interface  TUI, CLI, MCP              User-facing thin clients
L2 - Client     gRPC client library        Transport abstraction
L3 - Engine     gRPC server, services      Business logic, orchestration
L4 - Backend    vLLM, optillm, embeddings  GPU workload execution
L5 - Storage    PostgreSQL, Qdrant, R2     Persistence

Rule: Higher layers depend on lower layers, never the reverse. The engine (L3) is the single point of coordination — TUI, CLI, and MCP all call engine RPCs rather than accessing backends or storage directly.

Key Numbers

Metric           Count
Lines of code    ~252K
Python packages  26
Engine services  37
CLI commands     63
MCP tools        163
GPUs             6 (NVIDIA)
gRPC port        50051
PostgreSQL port  5444

Communication Paths

All three interfaces communicate with the engine via gRPC:

┌─────────┐  ┌─────────┐  ┌─────────┐
│   TUI   │  │   CLI   │  │   MCP   │
└────┬────┘  └────┬────┘  └────┬────┘
     │            │            │
     └────────────┼────────────┘
                  │ gRPC :50051
           ┌──────┴──────┐
           │   Engine    │
           │  (37 svcs)  │
           └──────┬──────┘
                  │
     ┌────────────┼────────────┐
     │            │            │
┌────┴────┐ ┌────┴────┐ ┌────┴────┐
│  vLLM   │ │ Postgres│ │ Qdrant  │
│ (GPUs)  │ │  :5444  │ │  :6334  │
└─────────┘ └─────────┘ └─────────┘

See Engine-First Architecture for why this design was chosen.

Engine-First Architecture

All business logic lives in the gRPC engine. The TUI, CLI, and MCP server are thin clients that translate user intent into engine RPC calls.

Why Engine-First

Early Gaius had business logic scattered across the TUI, CLI, and various utility scripts. This created several problems:

  • Duplication: The same logic reimplemented across interfaces
  • Inconsistency: CLI and TUI producing different results for the same operation
  • Testing difficulty: Business logic entangled with UI code
  • Resource contention: Multiple processes competing for GPU access

The engine-first approach solves all of these by centralizing logic in a single daemon that manages all shared resources.

The Rule

Interfaces do not contain business logic. They:

  1. Parse user input into a command or RPC call
  2. Send the request to the engine via gRPC
  3. Format the response for display

If you find yourself writing business logic in app.py, cli.py, or mcp_server.py, it belongs in an engine service instead.

Thin Client Examples

TUI (app.py)

The TUI calls engine RPCs through the gRPC client:

# TUI widget calls engine for health data
result = await self.grpc_client.call("GetHealthStatus")
self.display(result)

CLI (cli.py)

The CLI dispatches slash commands to engine RPCs:

# CLI maps /health to engine RPC
result = await client.call("GetHealthStatus")
print(json.dumps(result, indent=2))

MCP (mcp_server.py)

MCP tools wrap engine RPCs for AI assistants:

@server.tool()
async def health_observer_status():
    result = await client.call("GetHealthStatus")
    return result

Benefits

  • Single source of truth: One implementation, three interfaces
  • GPU management: Engine controls all GPU allocation
  • Background services: Evolution, cognition, health monitoring run in the engine daemon
  • Consistent state: All clients see the same system state

Exceptions

A few operations are interface-specific by necessity:

  • TUI rendering: Widget layout and Textual event handling
  • CLI formatting: JSON/text output formatting
  • MCP tool metadata: Tool descriptions and parameter schemas

These are presentation concerns, not business logic.

Interfaces: TUI, CLI, MCP

Gaius provides three access paths to the engine. Each serves a different use case but all communicate via the same gRPC protocol.

TUI (Terminal User Interface)

The interactive terminal application built on Textual.

uv run gaius

Components:

  • MainGrid: 19x19 Go board for spatial visualization
  • MiniGridPanel: Three 9x9 orthographic projections (CAD-style views)
  • FileTree: Plan 9-inspired navigation with agents as files
  • ContentPanel: Right panel displaying context and output
  • CommandInput: Slash command input with history

Best for: Interactive exploration, spatial navigation, visual pattern recognition.

See The TUI for the user guide.

CLI (Command Line Interface)

Non-interactive interface for scripting and automation.

# Single command execution
uv run gaius-cli --cmd "/health" --format json

# Pipe to jq for extraction
uv run gaius-cli --cmd "/gpu status" --format json | jq '.data.endpoints[]'

# Poll for status changes
for i in $(seq 1 15); do
    sleep 10
    uv run gaius-cli --cmd "/gpu status" --format json
done

63 slash commands covering health, agents, inference, evolution, knowledge base, visualization, and more.

Best for: Scripting, CI/CD integration, automated monitoring, quick status checks.

See The CLI for the user guide.

MCP (Model Context Protocol)

Programmatic interface exposing 163 tools to AI assistants like Claude Code.

{
  "mcpServers": {
    "gaius": {
      "command": "uv",
      "args": ["run", "gaius-mcp"],
      "cwd": "/path/to/gaius"
    }
  }
}

163 MCP tools organized by domain: health, agents, inference, knowledge base, observability, evolution, visualization, bases, and more.

Best for: AI-assisted operations, autonomous health maintenance, Claude Code integration.

See MCP Integration for setup and usage.

Interface Comparison

Feature           TUI  CLI  MCP
Interactive       Yes  No   No
Visual grid       Yes  No   No
JSON output       No   Yes  Yes
Scriptable        No   Yes  Yes
AI-accessible     No   No   Yes
Slash commands    Yes  Yes  N/A
Streaming output  Yes  No   No

Shared Protocol

All three interfaces use the same gRPC client library (gaius.client) to communicate with the engine:

from gaius.client import GrpcClient, GrpcClientConfig

config = GrpcClientConfig(
    host="localhost",
    port=50051,
    timeout=30,  # default; inference calls use 120s
)
client = GrpcClient(config)
result = await client.call("GetHealthStatus")

The default timeout is 30 seconds. Inference calls (completions, evaluations) use 120 seconds. These can be overridden via the GAIUS_ENGINE_TIMEOUT environment variable.
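
A sketch of how a client might resolve the effective timeout — the resolution order (environment variable over built-in default) is an assumption for illustration:

```python
import os

def resolve_timeout(default: float = 30.0) -> float:
    """Return the call timeout in seconds, honoring GAIUS_ENGINE_TIMEOUT."""
    raw = os.environ.get("GAIUS_ENGINE_TIMEOUT")
    return float(raw) if raw else default
```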

gRPC Engine

The engine is the central nervous system of Gaius. It’s a long-running daemon that manages GPU resources, coordinates services, and exposes all functionality via gRPC on port 50051.

Architecture

┌──────────────────────────────────────────────┐
│                gRPC Server :50051             │
│  ┌──────────────┐  ┌──────────────────────┐  │
│  │ KServe OIP   │  │ Gaius Extensions     │  │
│  │ (inference)  │  │ (health, evolution,  │  │
│  │              │  │  orchestrator, ...)  │  │
│  └──────┬───────┘  └──────────┬───────────┘  │
├─────────┼─────────────────────┼──────────────┤
│         │    37 Services      │              │
│  ┌──────┴──────┐  ┌──────────┴───────────┐  │
│  │ Orchestrator │  │ Scheduler            │  │
│  │ Evolution    │  │ Cognition            │  │
│  │ Health       │  │ Topology             │  │
│  │ CLT          │  │ Dataset              │  │
│  │ ...          │  │ ...                  │  │
│  └──────┬───────┘  └──────────┬───────────┘  │
├─────────┼─────────────────────┼──────────────┤
│         │ Backend Controllers │              │
│  ┌──────┴──────┐  ┌──────────┴───────────┐  │
│  │ vLLM Ctrl   │  │ Embedding Ctrl       │  │
│  │ optillm Ctrl│  │ Backend Router       │  │
│  └──────┬───────┘  └──────────┬───────────┘  │
│         │                     │              │
│  ┌──────┴─────────────────────┴───────────┐  │
│  │           GPU Pool (6x NVIDIA)         │  │
│  └────────────────────────────────────────┘  │
└──────────────────────────────────────────────┘

Startup Sequence

The engine initializes in 9 phases, streaming progress to connected clients:

Phase         Duration   Action
INIT          Immediate  InitController starts
GRPC          ~1s        gRPC server binds to :50051
TELEMETRY     ~2s        OpenTelemetry setup
BACKENDS     ~5s        Backend router initialization
ORCHESTRATOR  ~2s        Orchestrator service starts
ENDPOINTS     ~240s      vLLM model loading to VRAM
TRANSPORT     ~2s        Aeron bridge setup
SERVICES      ~5s        Background services start
COMPLETE      -          Ready for inference

The gRPC server starts early (phase 2) so clients can connect immediately and receive real-time progress during the ~4-minute vLLM startup.

Module Structure

engine/
├── server.py              # Main daemon entry point
├── config.py              # Engine configuration
├── init_controller.py     # Initialization progress streaming
├── workloads.py           # Workload definitions
├── grpc/
│   ├── server.py          # gRPC server setup
│   └── servicers/
│       ├── inference_servicer.py  # KServe OIP implementation
│       └── gaius_servicer.py      # Gaius extensions
├── backends/
│   ├── backend_router.py  # Unified request routing
│   ├── vllm_controller.py # vLLM process management
│   ├── optillm_controller.py
│   └── embedding_controller.py
├── services/              # 37 registered services
├── compute/               # Grid projection, TDA
├── resources/             # GPU allocation
├── transport/             # Aeron bridge
├── generated/             # Protobuf generated code
└── proto/                 # Protobuf definitions

gRPC Protocol

The engine implements two gRPC services:

KServe Open Inference Protocol

Standard inference protocol for compatibility with ML platforms:

service GRPCInferenceService {
    rpc ServerLive(ServerLiveRequest) returns (ServerLiveResponse);
    rpc ServerReady(ServerReadyRequest) returns (ServerReadyResponse);
    rpc ModelMetadata(ModelMetadataRequest) returns (ModelMetadataResponse);
    rpc ModelInfer(ModelInferRequest) returns (ModelInferResponse);
}

Gaius Extensions

Custom RPCs for Gaius-specific functionality:

service GaiusService {
    rpc WatchInit(stream InitRequest) returns (stream InitProgress);
    rpc WatchHealth(HealthRequest) returns (stream HealthMetrics);
    rpc EvolutionStatus(Empty) returns (EvolutionStatusResponse);
    rpc TriggerEvolution(TriggerRequest) returns (TriggerResponse);
    rpc GetEndpointStatus(Empty) returns (EndpointStatusResponse);
    rpc StartEndpoint(StartRequest) returns (StartResponse);
    rpc StopEndpoint(StopRequest) returns (StopResponse);
}

Configuration

engine {
    grpc {
        host = "0.0.0.0"
        port = 50051
        max_workers = 10
        max_message_size = 104857600  # 100MB
    }
    orchestrator {
        preload_endpoints = ["reasoning"]
        startup_timeout = 600  # 10 minutes
        health_check_interval = 30
    }
    scheduler {
        max_queue_size = 1000
        default_timeout = 120
    }
    evolution {
        enabled = true
        idle_threshold = 60
        cycle_interval = 3600
    }
}

Running the Engine

# Via devenv process-compose (normal operation)
devenv processes up

# Standalone
uv run gaius-engine

# Clean restart (stops everything, cleans up, restarts)
just restart-clean

Verifying Engine Health

# Check if gRPC port is listening
nc -zv localhost 50051

# Check endpoint status
uv run gaius-cli --cmd "/gpu status" --format json

# Watch engine logs
tail -f .devenv/processes.log | grep gaius-engine

Engine Services

The engine hosts 37 services organized into four groups: resource management, intelligence, data, and external integration.

Service Groups

Resource Management

Service              Purpose
OrchestratorService  vLLM endpoint lifecycle and GPU allocation
SchedulerService     Priority-based job queue with XAI budget
HealthService        GPU and endpoint health monitoring
AgendaTracker        Tracks scheduled endpoint transitions for makespan operations

Intelligence

Service           Purpose
EvolutionService  Agent prompt optimization via APO
CognitionService  Autonomous thought generation (every 4h)
CLTService        Cross-Layer Transcoder feature extraction
TopologyService   Semantic attractor detection and drift
NGRCPredictor     Reservoir computing for temporal prediction

Data

Service               Purpose
DatasetService        NiFi SoM dataset generation
FlowSchedulerService  Metaflow pipeline scheduling
KBService             Knowledge base CRUD operations
LineageService        Provenance tracking

External Integration

Service            Purpose
XBookmarksService  X (Twitter) bookmark synchronization

Service Registration

Services register with the engine during startup. Each service implements a standard lifecycle:

class SomeService:
    async def start(self) -> None:
        """Initialize resources, start background tasks."""
        ...

    async def stop(self) -> None:
        """Clean shutdown, release resources."""
        ...

Background Tasks

Several services run scheduled background tasks:

Task              Service           Schedule   Purpose
cognition_cycle   CognitionService  Every 4h   Pattern detection in KB activity
self_observation  CognitionService  Every 8h   Meta-cognitive reflection
engine_audit      CognitionService  Every 12h  System health analysis
Evolution cycle   EvolutionService  GPU idle   Agent prompt optimization
Health check      HealthService     Every 30s  Endpoint liveness

Service Dependencies

Services form a dependency graph. The orchestrator and scheduler are foundational — most other services depend on them for inference access:

OrchestratorService → vLLM Controller → GPU Pool
SchedulerService → OrchestratorService
EvolutionService → SchedulerService
CognitionService → SchedulerService
HealthService → GPU Pool (via pynvml)
TopologyService → CLTService
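
A valid startup order is any topological sort of this graph. A sketch using the standard library — the dependency map transcribes the edges above, though how the engine actually orders registration may differ:

```python
from graphlib import TopologicalSorter

# Service -> set of prerequisites, transcribed from the edges above.
DEPS: dict[str, set[str]] = {
    "OrchestratorService": set(),
    "SchedulerService": {"OrchestratorService"},
    "EvolutionService": {"SchedulerService"},
    "CognitionService": {"SchedulerService"},
    "HealthService": set(),
    "CLTService": set(),
    "TopologyService": {"CLTService"},
}

# static_order() yields prerequisites before their dependents.
startup_order = list(TopologicalSorter(DEPS).static_order())
```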

See the individual service chapters for implementation details.

Orchestrator

The OrchestratorService manages vLLM endpoint lifecycle and GPU allocation. It decides which models are loaded, on which GPUs, and handles startup, shutdown, and recovery.

Endpoint Lifecycle

Endpoints transition through these states:

PENDING → STARTING → HEALTHY
                  ↘ UNHEALTHY → FAILED
HEALTHY → STOPPING → STOPPED

EndpointStatus

@dataclass
class EndpointStatus:
    name: str           # "reasoning", "coding", etc.
    state: str          # "healthy", "starting", "unhealthy", "stopped"
    gpus: list[int]     # Allocated GPU indices
    pid: int | None     # vLLM process ID
    port: int           # Serving port
    model: str          # HuggingFace model ID
    uptime_seconds: int

Workload Management

The orchestrator follows Yunikorn-style capability-based scheduling:

  1. Requests declare capabilities, not endpoints: A workload asks for “reasoning” capability, not a specific model
  2. Priority-based preemption: Idle endpoints can be evicted for higher-priority work
  3. Makespan fulfillment: The engine ensures work completes, then restores baseline set points

Example: Render Pipeline

When the viz pipeline needs a GPU for LuxCore rendering:

  1. Workload requests GPU with allow_baseline_eviction=True
  2. Orchestrator evicts lowest-priority endpoint from target GPU
  3. Rendering completes
  4. Orchestrator restores the evicted endpoint

Clean Start

The clean_start() operation handles recovery from corrupted state:

result = await orch.clean_start(endpoints=["reasoning"])
# Kills stale vLLM processes
# Cleans up CUDA memory
# Restarts endpoints fresh

Health Integration

The orchestrator works with the AgendaTracker to distinguish intentional state changes from failures. When an endpoint is part of a scheduled makespan operation, the Health Observer skips incident creation:

if tracker.is_endpoint_in_scheduled_transition("reasoning"):
    # Don't create incident — this is planned
    expected = tracker.get_scheduled_endpoint_state("reasoning")

Checking Status

uv run gaius-cli --cmd "/gpu status" --format json | jq '.data.endpoints[]'

Scheduler

The SchedulerService provides a priority-based job queue for inference requests with XAI budget management and weighted completion time minimization.

Priority Levels

Priority       Weight  Use Case
CRITICAL (0)   1.0     User-facing interactive requests
HIGH (1)       2.0     Interactive queries
NORMAL (2)     4.0     Background processing
LOW (3)        8.0     Batch operations
EVOLUTION (4)  16.0    Agent evolution (lowest priority)

Lower weights receive preferential scheduling. Critical requests preempt everything.
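
The weight ordering can be sketched with a heap keyed on (weight, submission order), so lower weights dispatch first and ties preserve FIFO. The real SchedulerService also handles preemption and budget checks, which this illustration omits:

```python
import heapq
import itertools

# Weights from the table above; lower dispatches first.
WEIGHTS = {"CRITICAL": 1.0, "HIGH": 2.0, "NORMAL": 4.0,
           "LOW": 8.0, "EVOLUTION": 16.0}

class WeightedQueue:
    """Minimal weight-ordered dispatch sketch (no preemption, no budget)."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, int, str]] = []
        self._seq = itertools.count()  # tie-breaker: submission order

    def submit(self, job: str, priority: str) -> None:
        heapq.heappush(self._heap, (WEIGHTS[priority], next(self._seq), job))

    def next_job(self) -> str:
        return heapq.heappop(self._heap)[2]
```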

Job Submission

from gaius.engine.services import SchedulerService, InferenceJob, JobPriority

scheduler = SchedulerService()

job = InferenceJob(
    prompt="Analyze the risk factors...",
    priority=JobPriority.HIGH,
    max_tokens=2048,
)
result = await scheduler.submit(job)

XAI Budget

The scheduler tracks daily usage of external AI APIs (xAI Grok) to prevent runaway costs:

budget = scheduler.get_xai_budget()
# budget.daily_remaining: tokens left for today
# budget.daily_limit: configured daily cap
# budget.reset_time: when the budget resets

Requests exceeding the budget are rejected with a clear error message.
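
A minimal sketch of that gating, assuming token-denominated limits; the class and its fields mirror the attribute names above but are not the actual implementation:

```python
from dataclasses import dataclass

@dataclass
class XaiBudget:
    """Illustrative daily token budget with hard rejection on overrun."""
    daily_limit: int
    daily_used: int = 0

    @property
    def daily_remaining(self) -> int:
        return self.daily_limit - self.daily_used

    def charge(self, tokens: int) -> None:
        if tokens > self.daily_remaining:
            raise RuntimeError(
                f"XAI budget exceeded: requested {tokens}, "
                f"remaining {self.daily_remaining}"
            )
        self.daily_used += tokens
```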

Makespan Scheduling

For complex workloads that require multiple inference calls (e.g., agent evolution with candidate generation + evaluation), the scheduler uses makespan optimization to minimize total completion time:

  1. Decompose workload into individual inference jobs
  2. Assign priorities based on workload urgency
  3. Schedule across available endpoints
  4. Track completion via the AgendaTracker

See Makespan Scheduling for the optimization details.

Timeouts

Context                  Default Timeout
General gRPC calls       30s
Inference (completions)  120s
Evaluation               120s

A 24B model with cot_reflection takes 15-20 seconds per completion. Timeouts are set per-call:

result = await client.call("ModelInfer", request, timeout=120)

Override the default via GAIUS_ENGINE_TIMEOUT environment variable.

Protobuf Schema

The gRPC API is defined in Protocol Buffers. Changes to the proto require a specific workflow to keep generated bindings, internal enums, and status mappings in sync.

Key Files

File                                     Purpose
engine/proto/gaius_service.proto         Proto definitions (source of truth)
engine/proto/gaius_service_pb2.py        Generated Python bindings
engine/proto/gaius_service_pb2_grpc.py   Generated gRPC stubs
engine/generated/__init__.py             Re-exports for clean imports
engine/grpc/servicers/gaius_servicer.py  Server-side implementation

Endpoint Status Values

enum ProcessStatus {
    PROCESS_STATUS_UNSPECIFIED = 0;
    PROCESS_STATUS_STOPPED = 1;
    PROCESS_STATUS_STARTING = 2;
    PROCESS_STATUS_HEALTHY = 3;
    PROCESS_STATUS_UNHEALTHY = 4;
    PROCESS_STATUS_STOPPING = 5;
    PROCESS_STATUS_FAILED = 6;
    PROCESS_STATUS_PENDING = 7;   // Queued for startup
}

Startup state transitions: PENDING -> STARTING -> HEALTHY

Change Workflow

1. Edit the Proto File

Append new enum values; never renumber existing ones, since renumbering breaks wire compatibility.

2. Regenerate Bindings

just proto-generate

3. Update Generated Exports

Add new symbols to engine/generated/__init__.py:

  • Add to the import block
  • Add to the __all__ list

Critical: Skipping this step causes import errors at engine startup.

4. Update Internal Enums

If there’s a parallel Python enum (e.g., in vllm_controller.py), sync it with the proto enum.

5. Update Status Mappings

Add string-to-proto mappings in the servicer’s _STATUS_MAP.
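
The mapping’s shape might look like this — the numeric values transcribe the ProcessStatus enum in this chapter, the dict name mirrors _STATUS_MAP, and the fallback to UNSPECIFIED is an assumption:

```python
# Illustrative string-to-proto status mapping; values transcribed from
# the ProcessStatus enum definition above.
_STATUS_MAP = {
    "stopped": 1,    # PROCESS_STATUS_STOPPED
    "starting": 2,   # PROCESS_STATUS_STARTING
    "healthy": 3,    # PROCESS_STATUS_HEALTHY
    "unhealthy": 4,  # PROCESS_STATUS_UNHEALTHY
    "stopping": 5,   # PROCESS_STATUS_STOPPING
    "failed": 6,     # PROCESS_STATUS_FAILED
    "pending": 7,    # PROCESS_STATUS_PENDING
}

def to_proto_status(state: str) -> int:
    # Fallback to UNSPECIFIED (0) for unmapped strings (assumption).
    return _STATUS_MAP.get(state, 0)
```

A status string missing from this map is exactly the “Status shows wrong value” symptom listed under Common Issues.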

6. Verify Import

uv run python -c "from gaius.engine.generated import NEW_SYMBOL; print('OK')"

7. Restart and Test

just restart-clean
uv run gaius-cli --cmd "/gpu status" --format json

Common Issues

Symptom                   Cause                          Fix
Engine fails to start     Missing export in __init__.py  Add symbol to imports and __all__
Port 50051 not listening  gRPC server didn’t initialize  Check logs for import errors
Status shows wrong value  Missing status mapping         Add to _STATUS_MAP

Testing gRPC Features

gRPC reflection is not enabled, so grpcurl cannot discover services. Use the CLI instead:

uv run gaius-cli --cmd "/gpu status" --format json | jq '.data.endpoints[] | {name, status}'

Health & Self-Healing

Gaius implements autonomous health monitoring based on FMEA (Failure Mode and Effects Analysis). The system quantifies risk using RPN (Risk Priority Number) scores, applies tiered remediation, and learns from outcomes to improve over time.

Architecture

The health system has four layers:

  1. Detection: Scheduled checks, continuous watcher, and user reports identify issues
  2. Analysis: FMEA engine calculates RPN scores from severity, occurrence, and detection ratings
  3. Remediation: Three-tier system from automatic restarts to agent-assisted diagnosis to user approval
  4. Learning: Adaptive learner adjusts S/O/D scores based on remediation outcomes

How It Works

When a health check detects an issue:

  1. The FMEA engine maps it to a failure mode from the 34-mode catalog
  2. RPN is calculated: RPN = S x O x D (max 1000)
  3. Based on the RPN score, remediation is routed to the appropriate tier:
    • RPN 1-100 (Tier 0): Automatic procedural restart
    • RPN 101-200 (Tier 1): Agent-assisted remediation
    • RPN 201-400 (Tier 2): Requires user approval
    • RPN > 300: Escalates to ACP (Claude Code) for meta-level intervention
  4. Outcomes feed back into the adaptive learner, adjusting future risk scores

Health Check Categories

Category        Example Checks
Infrastructure  gRPC connection, PostgreSQL, Qdrant, MinIO
GPU             Memory usage, temperature
Endpoints       vLLM health, stuck endpoints, orphan processes
Evolution       Evolution daemon, cognition daemon
Resources       Disk space, scheduler queue, XAI budget

CLI Commands

# Run all health checks
uv run gaius-cli --cmd "/health" --format json

# Run checks for a specific category
uv run gaius-cli --cmd "/health gpu" --format json

# Apply automated fix
uv run gaius-cli --cmd "/health fix engine" --format json

# FMEA summary
uv run gaius-cli --cmd "/fmea" --format json

Self-Healing First

When encountering unhealthy services, always try /health fix before manual intervention:

  1. /health fix <service> — Let Gaius attempt self-healing
  2. just restart-clean — Only if self-healing fails
  3. Manual investigation — Last resort

This ensures the self-healing system gets exercised and improved over time.

FMEA Framework

FMEA (Failure Mode and Effects Analysis) replaces simple severity classification with quantitative risk assessment. Originally from manufacturing engineering, Gaius adapts it for software systems.

Risk Priority Number

Each failure mode is scored on three dimensions:

RPN = S x O x D (range 1-1000)

Dimension       Meaning                          Scale
S (Severity)    Impact on system availability    1 (negligible) to 10 (total failure)
O (Occurrence)  Probability of recurrence        1 (rare) to 10 (frequent)
D (Detection)   Ability to detect before impact  1 (always caught) to 10 (invisible)

Higher RPN means higher risk. The worst possible score (10 x 10 x 10 = 1000) indicates a severe, frequent, and invisible failure.

Action Thresholds

RPN Range  Tier    Action
1-100      Tier 0  Automatic procedural remediation
101-200    Tier 1  Agent-assisted remediation
201-400    Tier 2  Requires user approval
401-1000   Manual  Human intervention required
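
The thresholds transcribe directly into a routing function (the function name is illustrative; the conservative overrides described in this chapter would sit on top of it):

```python
def remediation_tier(rpn: int) -> str:
    """Map an RPN score to its remediation tier, per the threshold table."""
    if not 1 <= rpn <= 1000:
        raise ValueError(f"RPN out of range: {rpn}")
    if rpn <= 100:
        return "Tier 0"   # automatic procedural remediation
    if rpn <= 200:
        return "Tier 1"   # agent-assisted remediation
    if rpn <= 400:
        return "Tier 2"   # requires user approval
    return "Manual"       # human intervention required
```

For instance, GPU_001 (Memory Exhaustion, RPN 192) routes to Tier 1, while EB_004 (Self-Observation Bias, RPN 270) requires approval.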

Conservative Overrides

Certain conditions always escalate regardless of RPN:

  • Detection >= 8: Poor observability requires approval
  • Safety level DESTRUCTIVE: Data-modifying actions require approval
  • Multiple correlated failures: Escalate to next tier

Failure Mode Catalog

34 failure modes across 7 categories:

GPU (6 modes)

ID       Failure Mode          S   O  D  RPN
GPU_001  Memory Exhaustion     8   6  4  192
GPU_002  Temperature Critical  9   3  2  54
GPU_003  Hardware Error        10  2  3  60
GPU_004  Driver Crash          8   3  4  96
GPU_005  Memory Fragmentation  7   5  4  140
GPU_006  Power Throttling      5   4  3  60

vLLM Endpoint (6 modes)

ID        Failure Mode          S  O  D  RPN
VLLM_001  Stuck Starting        6  5  5  150
VLLM_002  Stuck Stopping        4  4  4  64
VLLM_003  Health Check Failure  7  6  3  126
VLLM_004  Orphan Process        5  5  4  100
VLLM_005  OOM Crash             8  5  3  120
VLLM_006  KV-Cache Exhaustion   5  6  5  150

Model Quality (5 modes)

ID      Failure Mode            S  O  D  RPN
MQ_001  Hallucination Increase  7  4  6  168
MQ_002  Latency Degradation     4  5  3  60
MQ_003  Output Quality Drift    5  6  7  210
MQ_004  Semantic Drift          6  4  8  192
MQ_005  Context Exhaustion      6  5  4  120

Emergent Behavior (4 modes)

ID      Failure Mode             S  O  D  RPN
EB_001  Swarm Consensus Failure  6  4  6  144
EB_002  Cognition Loop           5  4  7  140
EB_003  Embedding Drift          6  5  8  240
EB_004  Self-Observation Bias    6  5  9  270

Note: Emergent behavior modes have high Detection scores (poor observability), reflecting the inherent difficulty of detecting these failure modes automatically.

Adaptive Learning

The system adjusts S/O/D scores based on remediation outcomes using exponential moving average (alpha = 0.2):

  • Successful fast fix: Occurrence decreases (problem is manageable)
  • Failed fix: Occurrence increases (problem is more persistent than estimated)
  • User-reported: Detection increases (automated checks missed it)
  • Early detection: Detection decreases (automated checks caught it)
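
The update itself is a standard EMA step. A sketch, where clamping to the 1-10 FMEA scale is an assumption:

```python
ALPHA = 0.2  # smoothing factor from the adaptive learner

def ema_update(current: float, observed: float, alpha: float = ALPHA) -> float:
    """Nudge a stored S/O/D score toward new evidence, clamped to 1-10."""
    updated = (1 - alpha) * current + alpha * observed
    return min(10.0, max(1.0, updated))
```

For example, a failed fix observed as occurrence 10 against a stored score of 5 moves the score to 0.8 x 5 + 0.2 x 10 = 6.0.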

CLI Commands

# FMEA summary with current RPN scores
uv run gaius-cli --cmd "/fmea" --format json

# Failure mode catalog
uv run gaius-cli --cmd "/fmea catalog" --format json

# Detail for specific failure mode
uv run gaius-cli --cmd "/fmea detail GPU_001" --format json

# Recent incidents
uv run gaius-cli --cmd "/fmea history" --format json

Remediation Strategies

Fix strategies are multi-step procedures that diagnose, repair, and verify service health. Each strategy is registered in the SERVICE_STRATEGIES dictionary and invoked via /health fix <service>.

Available Fix Strategies

Service    Strategy              Steps
engine     EngineFixStrategy     Kill stale processes, clean CUDA, restart
dataset    DatasetFixStrategy    Re-initialize NiFi connection, verify
nifi       NiFiFixStrategy       Check connectivity, restart processors
postgres   PostgresFixStrategy   Check connection, verify schema
qdrant     QdrantFixStrategy     Check connectivity, verify collections
minio      MinIOFixStrategy      Check connectivity, verify buckets
endpoints  EndpointsFixStrategy  Health check, restart unhealthy
evolution  EvolutionFixStrategy  Restart evolution daemon

Strategy Pattern

Each strategy follows the same pattern:

class EngineFixStrategy:
    async def execute(self) -> FixResult:
        # Step 1: Diagnose
        issues = await self.diagnose()

        # Step 2: Remediate
        for issue in issues:
            await self.fix(issue)

        # Step 3: Verify
        healthy = await self.verify()

        return FixResult(
            success=healthy,
            steps_taken=self.steps,
            duration_ms=elapsed,
        )

Three-Tier System

Tier 0: Procedural (RPN 1-100)

Automatic restart without agent involvement:

# Kill stale process, wait, restart
await orchestrator.stop_endpoint(endpoint)
await asyncio.sleep(5)  # Cool-down
await orchestrator.start_endpoint(endpoint)

Tier 1: Agent-Assisted (RPN 101-200)

Uses a healthy inference endpoint to diagnose and decide on remediation:

diagnosis = await inference.analyze(issue.to_dict())
if diagnosis.action == "clear_cache":
    await clear_kv_cache(endpoint)
elif diagnosis.action == "rollback":
    await rollback_config(endpoint)

Tier 2: Approval Required (RPN 201-400)

Creates an approval record for human review. Destructive operations (data modification, configuration changes) always require Tier 2 regardless of RPN.

Usage

# Fix a specific service
uv run gaius-cli --cmd "/health fix engine" --format json

# Fix all unhealthy services
uv run gaius-cli --cmd "/health fix all" --format json

Adding a New Fix Strategy

  1. Create a class in health/service_fixes.py implementing execute() -> FixResult
  2. Register it in SERVICE_STRATEGIES
  3. Add a KB heuristic document
  4. Test via /health fix <service>

Health Observer

The HealthObserver daemon provides continuous health monitoring with FMEA-based incident management and ACP escalation for issues beyond local remediation capability.

Operation

The observer runs as a background service within the engine, polling system health at a configurable interval (default 60 seconds).

from gaius.health.observe import HealthObserver

observer = HealthObserver()
await observer.start()  # Begins continuous monitoring

Incident Lifecycle

Detection → Active → Healing → Recovered → Resolved
                  ↘ Escalated (ACP) → Resolved
  1. Detection: Health check identifies a failure
  2. Active: Incident created with FMEA risk scoring
  3. Healing: Self-healing attempts in progress
  4. Recovered/Escalated: Either resolved locally or sent to ACP
  5. Resolved: Terminal state
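The states above can be sketched as an explicit transition map (class and variable names here are illustrative, not the engine's actual types):

```python
from enum import Enum

class IncidentState(Enum):
    ACTIVE = "active"
    HEALING = "healing"
    RECOVERED = "recovered"
    ESCALATED = "escalated"
    RESOLVED = "resolved"

# Legal moves in the lifecycle diagram above
TRANSITIONS = {
    IncidentState.ACTIVE: {IncidentState.HEALING},
    IncidentState.HEALING: {IncidentState.RECOVERED, IncidentState.ESCALATED},
    IncidentState.RECOVERED: {IncidentState.RESOLVED},
    IncidentState.ESCALATED: {IncidentState.RESOLVED},
    IncidentState.RESOLVED: set(),  # terminal
}

def advance(current: IncidentState, nxt: IncidentState) -> IncidentState:
    """Move an incident forward, rejecting transitions not in the lifecycle."""
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {nxt}")
    return nxt
```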

Fail Open

When filtering incidents for display, the observer uses fail-open semantics: it filters OUT known terminal states (resolved) rather than filtering IN known active states. Unknown states are always surfaced for investigation.
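A minimal sketch of this filter-out semantics (field names assumed for illustration):

```python
TERMINAL_STATES = {"resolved"}

def incidents_to_display(incidents: list[dict]) -> list[dict]:
    # Fail open: exclude known-terminal states rather than include
    # known-active ones, so any unrecognized state still surfaces.
    return [i for i in incidents if i["state"] not in TERMINAL_STATES]
```

A newly introduced state the filter has never seen is shown by default, rather than silently hidden.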

Makespan Integration

The observer integrates with the AgendaTracker to avoid false-positive incidents during scheduled operations. When an endpoint is part of a planned makespan transition:

if tracker.is_endpoint_in_scheduled_transition("reasoning"):
    # Skip incident creation — this is intentional
    log.info(f"Skipping: endpoint in scheduled transition to {expected_state}")

ACP Escalation

When an incident exceeds the RPN threshold or local remediation fails after 3 attempts, the observer escalates to Claude Code via ACP:

  • Claude Code analyzes the issue using MCP tools
  • Identifies gaps in the /health fix framework
  • Implements new fix strategies and heuristics
  • Commits to acp-claude/health-fix branch for review

Cadence Limits

To prevent runaway automation:

  • Max 3 GitHub issues per 24 hours
  • Min 5 minutes between restart attempts
  • Max 3 restarts per endpoint per hour

CLI Commands

# Observer status
uv run gaius-cli --cmd "/health observer" --format json

# Active incidents
uv run gaius-cli --cmd "/health incidents" --format json

# Incident detail
uv run gaius-cli --cmd "/health incident <id>" --format json

Guru Meditation Codes

Inspired by the Amiga’s iconic error screens, every failure mode in Gaius gets a unique identifier — a Guru Meditation Code. These codes create a traceable link from error messages to diagnostics and remediation.

Format

#<COMPONENT>.<SEQUENCE>.<MNEMONIC>

  • Component: Two or three letter abbreviation for the subsystem
  • Sequence: Zero-padded number unique within the component
  • Mnemonic: Human-readable description of the failure mode
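The format can be checked with a regular expression inferred from the codes shown in this chapter (helper name hypothetical):

```python
import re

# #<COMPONENT>.<SEQUENCE>.<MNEMONIC>, e.g. #DS.00000001.SVCNOTINIT
GURU_RE = re.compile(
    r"^#(?P<component>[A-Z]{2,3})\.(?P<sequence>\d+)\.(?P<mnemonic>[A-Z0-9_]+)$"
)

def parse_guru(code: str) -> dict[str, str]:
    """Split a guru meditation code into its three fields."""
    m = GURU_RE.match(code)
    if m is None:
        raise ValueError(f"not a guru meditation code: {code}")
    return m.groupdict()
```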

Components

| Code | Component |
|---|---|
| DS | DatasetService |
| NF | NiFi |
| EN | Engine |
| EP | Endpoints/Inference |
| EV | Evolution |
| DB | Database |
| QD | Qdrant |
| GR | gRPC |
| ACP | Agent Client Protocol |
| ACF | Article Curation Flow |
| HL | Health |
| XB | X Bookmarks |

How They’re Used

Every error message includes the guru code and remediation path:

DatasetService not initialized.
  Guru: #DS.00000001.SVCNOTINIT
  Try: /health fix dataset
  Or:  just restart-clean

Design Rules

  1. One code per failure mode: Each code maps to exactly one failure
  2. Unique across the system: No two failure modes share a code
  3. Stable: Codes are never renumbered once assigned
  4. Documented: Each code has a KB heuristic with symptom, cause, and fix

KB Heuristics

Each guru code has a corresponding heuristic document in the knowledge base at build/dev/current/heuristics/gaius/<category>/<name>.md containing:

  • Symptom: What the user sees
  • Cause: Root cause analysis
  • Observation: How to detect programmatically
  • Solution: Remediation steps, including /health fix command

See Guru Meditation Codes Reference for the complete catalog.

Agent System

The agent system provides LLM orchestration patterns for domain analysis: role-based prompt execution, parallel inference coordination, temporal consolidation, and background evolution.

Execution Patterns

Swarm Execution

The primary pattern executes multiple LLM calls with distinct role-based system prompts in parallel:

| Role | Perspective | Temperature |
|---|---|---|
| Leader | Strategic synthesis | 0.7 |
| Risk | Threat identification | 0.6 |
| Optimizer | Efficiency analysis | 0.7 |
| Planner | Roadmap development | 0.7 |
| Critic | Adversarial review | 0.8 |
| Executor | Implementation assessment | 0.6 |
| Adversary | Stress testing | 0.8 |

Execution is parallel but not agentic — roles don’t observe each other’s outputs or iterate.

Latent Swarm (LatentMAS)

Reduces inter-agent token transfer by sharing embeddings instead of text via Qdrant. Agents store output embeddings; subsequent agents retrieve relevant context via semantic search.

Token reduction: 70-90% compared to text-based coordination.
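An in-memory stand-in illustrates the handoff pattern; the real system stores and searches embeddings in Qdrant, and these class and method names are purely illustrative:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class LatentMemory:
    """In-memory stand-in for the Qdrant-backed working memory."""

    def __init__(self):
        self.entries: list[tuple[str, list[float]]] = []

    def store(self, agent_id: str, embedding: list[float]) -> None:
        # An agent publishes its output as an embedding, not as text
        self.entries.append((agent_id, embedding))

    def retrieve(self, query: list[float], top_k: int = 3):
        # A later agent pulls only the most relevant prior context
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[1]), reverse=True)
        return ranked[:top_k]
```

Because only embeddings cross the agent boundary, the tokens that would have carried intermediate text never enter subsequent prompts.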

MetaAgent Coordination

Specialist “analysts” answer natural language questions by querying structured data sources (Cypher for lineage, SQL for metrics). Results are synthesized by a correlator.

Background Processes

Two background processes run within the engine:

  • Evolution Daemon: Optimizes agent prompts during GPU idle periods
  • Cognition Agent: Generates “thoughts” about patterns in KB activity (every 4-8h)

Module Structure

agents/
├── swarm.py              # SwarmManager (parallel execution)
├── roles.py              # Role definitions (system prompts)
├── metaagent_swarm.py    # MetaAgentManager
├── cognition.py          # Pattern detection
├── theta/                # Temporal consolidation pipeline
├── latent/               # Qdrant-backed working memory
└── evolution/            # Prompt optimization

Subchapters

Evolution

The evolution subsystem optimizes agent system prompts using APO (Automatic Prompt Optimization) during GPU idle periods. It generates candidate prompts, evaluates them against held-out tasks, and promotes winners.

Evolution Cycle

1. Wait for GPU idle (<30% utilization)
2. Select next agent (round-robin)
3. Generate candidate prompts
4. Evaluate against held-out tasks
5. Promote best if improved
6. Record lineage
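The cycle above can be sketched as a single pass with the generation and evaluation steps injected as stubs (all names hypothetical; lineage recording elided):

```python
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    prompt: str
    score: float  # current held-out score

def run_cycle(queue, gpu_util, candidates, evaluate, idle_threshold=0.30):
    """One evolution pass; `candidates` and `evaluate` are injected stubs."""
    # 1. Only run while the GPU is idle
    if gpu_util >= idle_threshold:
        return None
    # 2. Round-robin: take the next agent, requeue it at the back
    agent = queue.pop(0)
    queue.append(agent)
    # 3-4. Generate candidate prompts, score each on held-out tasks
    best_prompt, best_score = agent.prompt, agent.score
    for cand in candidates(agent):
        s = evaluate(agent, cand)
        if s > best_score:
            best_prompt, best_score = cand, s
    # 5. Promote only on strict improvement
    promoted = best_prompt != agent.prompt
    agent.prompt, agent.score = best_prompt, best_score
    return agent.name, promoted
```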

Optimization Methods

| Method | Description |
|---|---|
| APO | Automatic Prompt Optimization (Zhou et al., 2023) |
| GEPA | Genetic Evolution of Prompt Architectures |

Model Merging

Agent versions can be combined using parameter-space merging:

| Method | Description |
|---|---|
| Linear | Weighted average of parameters |
| TIES | Resolves sign conflicts between models |
| DARE | Drop and rescale for sparse merging |
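Sketches of the Linear and DARE strategies on plain parameter lists (function names illustrative; real merging operates on model tensors):

```python
import random

def linear_merge(params_a, params_b, weight=0.5):
    """Linear: element-wise weighted average of two parameter vectors."""
    return [weight * a + (1 - weight) * b for a, b in zip(params_a, params_b)]

def dare_merge(base, tuned, drop_p=0.9, rng=None):
    """DARE-style sketch: drop a fraction of each delta, rescale the rest
    by 1/(1 - drop_p) so the expected merged value is preserved."""
    rng = rng or random.Random(0)
    merged = []
    for b, t in zip(base, tuned):
        delta = t - b
        if rng.random() < drop_p:
            merged.append(b)                        # delta dropped
        else:
            merged.append(b + delta / (1 - drop_p)) # delta rescaled
    return merged
```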

Agent Versioning

Each evolution cycle produces a new agent version with tracked lineage:

# Check evolution status
uv run gaius-cli --cmd "/evolve status" --format json

# View agent versions
uv run gaius-cli --cmd "/evolve versions leader" --format json

# Promote a specific version
uv run gaius-cli --cmd "/evolve promote leader v3" --format json

Configuration

evolution {
    enabled = true
    idle_threshold = 60    # seconds of GPU idle before triggering
    cycle_interval = 3600  # minimum seconds between cycles
}

The daemon runs in the engine process and activates only during GPU idle periods to avoid competing with interactive inference.

Cognition

The CognitionService generates autonomous “thoughts” by analyzing recent knowledge base activity. It runs as a scheduled background task within the engine.

Scheduled Tasks

| Task | Interval | Purpose |
|---|---|---|
| cognition_cycle | Every 4h | Detect patterns in recent KB activity |
| self_observation | Every 8h | Meta-cognitive reflection on thought patterns |
| engine_audit | Every 12h | System health and resource analysis |

Thought Types

| Type | Description |
|---|---|
| PATTERN | Recurring themes across documents |
| CONNECTION | Cross-domain relationships discovered |
| CURIOSITY | Questions warranting investigation |
| SELF_OBSERVATION | Meta-cognitive observations about thought quality |

How It Works

Each cognition cycle:

  1. Retrieves recent KB entries and thought history
  2. Analyzes for patterns, connections, and gaps
  3. Generates thoughts using a reasoning endpoint
  4. Stores thoughts in the knowledge base
  5. Records in the thought chain for provenance

CLI Commands

# Trigger cognition cycle manually
uv run gaius-cli --cmd "/cognition" --format json

# View recent thoughts
uv run gaius-cli --cmd "/thoughts" --format json

# Trigger self-observation
uv run gaius-cli --cmd "/self-observe" --format json

Thought Chain

Thoughts are linked in a chain with provenance tracking. Each thought references its trigger (scheduled, manual, or reactive) and the inputs that contributed to it. This creates an auditable trail of the system’s reasoning.

Theta Consolidation

ThetaAgent executes a deterministic consolidation pipeline for cross-temporal knowledge linking. Named after theta rhythms in hippocampal replay, it compresses temporal experience into durable knowledge connections.

Pipeline Stages

Temporal Slicing → NVAR Signal → BERTSubs Inference → KG Selection → Augmentation

1. Temporal Slicing

Documents are organized into weekly slices (YYYY-WNN format). Each slice represents a temporal context for consolidation.
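Assuming ISO week numbering, the YYYY-WNN slice key can be derived directly from the calendar date:

```python
from datetime import date

def week_slice(d: date) -> str:
    """Weekly slice key in YYYY-WNN form (ISO week numbering assumed)."""
    iso = d.isocalendar()
    # iso.year can differ from d.year near year boundaries
    return f"{iso.year}-W{iso.week:02d}"
```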

2. NVAR Dynamics

Nonlinear Vector AutoRegression (NVAR), implemented with reservoir computing, derives a consolidation signal from embedding-centroid trajectories. High “urgency” indicates rapid semantic drift that warrants consolidation attention.

3. BERTSubs Inference

Subsumption relationships between concepts are inferred using BERTSubs from DeepOnto. The inferencer identifies “A is-a B” relationships via fine-tuned BERT classification on ontology subsumptions.

4. Knowledge Gradient Selection

Candidate relationships are filtered using the Knowledge Gradient policy, balancing exploration (learning about uncertain candidates) against exploitation (selecting high-confidence relationships).

5. Document Augmentation

Selected relationships are injected into source documents as wikilinks and action links for navigation.

Usage

# Run consolidation
uv run gaius-cli --cmd "/theta consolidate" --format json

# View consolidation stats
uv run gaius-cli --cmd "/theta stats" --format json

# Check situational report
uv run gaius-cli --cmd "/sitrep" --format json

Dependencies

  • DeepOnto with JVM (via JPype) for BERTSubs
  • OWL domain ontology with rdfs:subClassOf axioms
  • Sufficient class count (~50+ classes) for training data

CLT Memory

Cross-Layer Transcoder (CLT) memory extracts sparse features from model activations, providing interpretable representations of agent state.

How It Works

The CLTService extracts sparse features from inference activations using circuit-tracer:

state = await clt.extract_features(
    agent_id="critic",
    content="The risk model has issues...",
)
# state.features.active_indices — which features activated

Swarm Features

CLT features can be computed across a swarm to find consensus:

swarm_result = await clt.compute_swarm_features(domain="pension")
# swarm_result.consensus_features — features active across multiple agents

Integration with Topology

The TopologyService consumes CLT features to detect semantic attractors — regions in embedding space where agent attention converges:

CLTService → extract features → TopologyService → detect attractors → grid overlay

Qdrant Collections

CLT features are stored in dedicated Qdrant collections:

| Collection | Purpose |
|---|---|
| gaius_clt_memory | Cross-Layer Transcoder feature history |
| gaius_latent_memory | Latent working memory for swarm coordination |

CLI Commands

# Extract CLT features
uv run gaius-cli --cmd "/clt extract" --format json

# CLT memory statistics
uv run gaius-cli --cmd "/clt stats" --format json

Data Pipeline

The data pipeline connects external sources to the knowledge base, card collections, and search index through a sequence of ingestion, processing, and indexing stages.

End-to-End Flow

Web Sources (Brave, arXiv, RSS)
    |
    v
NiFi Ingestion ──> Raw Content (HX / Iceberg)
    |
    v
Metaflow Pipelines ──> Article Drafts, Card Creation
    |
    v
Qdrant Indexing ──> 768-dim Nomic Embeddings
    |
    v
PostgreSQL (zndx_gaius:5444) ──> Cards, Collections, Metadata
    |
    v
R2 Storage ──> Rendered Visualizations (viz.gaius.zndx.org)

Pipeline Stages

Ingestion. NiFi processors fetch content from external APIs, RSS feeds, and web search results (Brave). Raw content is stored in Apache Iceberg tables via the HX data lake before any processing occurs. This preserves the original source material and provides a replay capability.

Processing. Metaflow pipelines handle the compute-intensive work: PDF conversion via docling, topic extraction via BERTopic, relevance scoring via local LLMs, and article draft generation. See Metaflow Integration for details on the execution environment.

Article Curation. The Article Curation flow orchestrates the full lifecycle from article selection through card creation and publication. Each run produces approximately 20 cards in under 2 minutes.

Indexing. Processed content is embedded using Nomic (768-dimensional vectors) and indexed in Qdrant for semantic search. The same embeddings drive the TUI’s 19×19 grid layout and the visualization pipeline.

Storage. Cards, collections, and metadata live in PostgreSQL (zndx_gaius on port 5444). Rendered card images are uploaded to Cloudflare R2 and served from viz.gaius.zndx.org. See Viz Storage for the object key convention.

Lineage Tracking

Every pipeline stage emits OpenLineage events that are materialized into an Apache AGE graph. This provides full provenance from source URL to published card. See Lineage Tracking for Cypher query examples.

Knowledge Base

The Knowledge Base serves as both input and output of the pipeline. Articles begin as zettelkasten notes in build/dev/scratch/, and the curation flow produces structured content in build/dev/current/articles/.

Key Services

| Service | Role | Port |
|---|---|---|
| NiFi | Content ingestion | 8443 |
| Metaflow | Pipeline execution | 8180 |
| PostgreSQL | Metadata, cards, collections | 5444 |
| Qdrant | Vector search | 6333 |
| MinIO | Artifact storage (S3-compatible) | 9000 |
| Gaius Engine (gRPC) | Orchestration, scheduling | 50051 |

CLI Access

# List available flows
uv run gaius-cli --cmd "/flows list"

# Trigger article curation
uv run gaius-cli --cmd "/article curate ai-reasoning-weekly"

# Query lineage for a KB file
uv run gaius-cli --cmd "/lineage query scratch/2026-03-14/paper.md"

Metaflow Integration

Gaius uses Metaflow for production data pipelines that run on Kubernetes. Flows handle article curation, content evaluation, rendering, and document processing.

Infrastructure

The Metaflow service is deployed via Tilt in infra/tilt/ and runs on the local RKE2 Kubernetes cluster. Access requires a port-forward:

kubectl port-forward svc/metaflow-service 8180:8080

The environment variable METAFLOW_SERVICE_URL=http://localhost:8180 must be set for flow execution. This is configured automatically in devenv.nix for interactive shells and explicitly in process scripts.

GaiusFlow Base Class

All Gaius flows inherit from GaiusFlow, which provides OpenLineage integration and KB path helpers:

from gaius.flows import GaiusFlow
from metaflow import step

class MyFlow(GaiusFlow):
    @step
    def start(self):
        self.emit_lineage_start("my_flow", inputs=[...])
        self.next(self.process)

    @step
    def end(self):
        self.emit_lineage_complete(outputs=[...])

KB path helpers generate paths following the zettelkasten convention:

# scratch/{date}/{HHMMSS}_{title}.md
path = self.zettelkasten_path("My Analysis")

# current/archive/{quarter}/attachments/{filename}
path = self.archive_path("paper.pdf")

Flow Registry

Flows are registered for CLI discovery using the @register_flow decorator:

from gaius.flows import register_flow

@register_flow("article-curation")
class ArticleCurationFlow(GaiusFlow):
    ...

Registered flows can be listed and invoked from the CLI or MCP tools.

Available Flows

| Flow | Purpose | Typical Duration |
|---|---|---|
| ArticleCurationFlow | End-to-end article research and card publication | ~2 min |
| ArxivDoclingFlow | Fetch and convert arXiv papers to markdown | ~30s |
| ClouderaDocsFlow | Sync Cloudera documentation archives | varies |

See Article Curation for the full 11-step pipeline.

Configuration

Key environment variables:

| Variable | Purpose |
|---|---|
| METAFLOW_SERVICE_URL | Metaflow service endpoint (http://localhost:8180) |
| METAFLOW_DATASTORE_SYSROOT_S3 | MinIO path for flow artifacts |
| METAFLOW_DEFAULT_METADATA | Metadata backend (postgresql) |
| GAIUS_KB_ROOT | Knowledge base root directory |

Running Flows

# Via Metaflow CLI
python -m metaflow.cli run ArticleCurationFlow --article ai-reasoning-weekly

# Via Gaius CLI
uv run gaius-cli --cmd "/article curate ai-reasoning-weekly"

# Via MCP tool
uv run gaius-cli --cmd "/fetch_paper 2312.12345"

K8s Prerequisites

  • kubectl and k9s are Nix-managed via devenv.nix (not the system RKE2 binary)
  • KUBECONFIG must be set to ~/.config/kube/rke2.yaml (never use fallback syntax)
  • K8s pods need pg_hba.conf entries for 10.42.0.0/16 and 10.43.0.0/16 subnets

Article Curation

The ArticleCurationFlow is an 11-step Metaflow pipeline that automates the discovery, research, drafting, and publication of articles. It is the primary content production mechanism in Gaius.

Pipeline Overview

start ──> grok_research_summary ──> select_article ──> acquire_external
   ──> update_manifest ──> create_draft ──> create_base ──> create_cards
   ──> enrich_cards ──> publish_batch ──> end

Each run produces approximately 20 cards in under 2 minutes.

Article Discovery

Articles live at current/articles/{slug}/ in the knowledge base. Each article directory contains a markdown file with YAML frontmatter that must include keywords and/or news_queries to guide the Brave search fetcher:

---
title: "AI Reasoning Weekly"
keywords: ["chain-of-thought", "reasoning models", "test-time compute"]
news_queries: ["AI reasoning breakthroughs 2026"]
---

Empty keywords trigger a fail-fast error:

#ACF.00000013.NOHINTS - Article has no keywords or news_queries
  Try: Add keywords to the article frontmatter

Article Selection

The selection rubric evaluates candidate articles using several signals. The curation_readiness gate prevents selecting articles that lack sufficient zettelkasten notes or have incomplete frontmatter. Collection balance – specifically pending_cards count – is the most effective diversity signal, steering selection toward underrepresented topics.

External Source Acquisition

Once an article is selected, the flow fetches external sources in parallel using Brave search. Results are scored for relevance by a local LLM. Only sources exceeding the relevance threshold are retained.

Draft Generation and Card Creation

Drafts are synthesized using Grok, drawing from the article’s zk/ zettelkasten notes. The flow does NOT search the broader KB to avoid exposing private materials in published articles.

After drafting, the flow creates a BFO-grounded .base file with references, then generates collection cards from those references. Cards are created with pending status.

Enrichment Before Publish

Cards must be fully enriched before publication. Enrichment includes:

  1. Summary generation – LLM-generated card summaries
  2. Image rendering – Procedural visualizations via LuxCore

Only cards that pass both enrichment steps are published. Failed cards remain pending for the next run. This prevents incomplete content from appearing on the site.

CLI Access

# Curate a specific article
uv run gaius-cli --cmd "/article curate ai-reasoning-weekly"

# List available articles
uv run gaius-cli --cmd "/article list"

# Check article status
uv run gaius-cli --cmd "/article status"

Fail-Fast Guarantees

The flow fails immediately if required services are unavailable. No fallbacks or placeholder content is generated. Key guru meditation codes:

| Code | Meaning |
|---|---|
| #ACF.00000013.NOHINTS | Article missing keywords/news_queries |
| #FL.00001.DOCLING_FAIL | Document conversion failed |
| #FL.00002.METAFLOW_DB | Metaflow metadata DB unavailable |

Privacy

The curation flow only uses the article’s own zk/ notes as source material. It does not search the broader knowledge base, ensuring private materials are never exposed in published articles.

Knowledge Base

The knowledge base is a markdown-first document store organized as a zettelkasten. It lives under build/dev/ (gitignored) and is accessible through MCP tools for CRUD operations.

Directory Structure

build/dev/
├── current/            # Active work (manually curated)
│   ├── projects/       # Project-specific documents
│   ├── articles/       # Article directories with frontmatter
│   ├── content/domains/ # Domain-specific content
│   └── heuristics/     # Guru meditation heuristic files
│       └── gaius/
├── scratch/            # Zettelkasten notes (organized by date)
│   ├── 2026-03-14/
│   │   ├── 103045_my_analysis.md
│   │   └── 142200_research_notes.md
│   └── 2026-03-13/
└── archive/            # Quarterly archives
    └── 2026Q1/
        └── attachments/

current/ contains active, manually curated work. Articles, projects, and domain content live here. Heuristic files for guru meditation codes are stored at current/heuristics/gaius/{category}/{name}.md.

scratch/ is the zettelkasten. Files are organized by date and named with a time prefix: {HHMMSS}_{title}.md. This is where Metaflow pipelines deposit processed content and where daily research notes accumulate.

archive/ holds quarterly archives with binary attachments (PDFs, images) that are too large for the scratch directory.

MCP Tools

The KB is fully accessible through MCP tools, enabling Claude Code and other agents to read, write, and search the knowledge base:

| Tool | Operation |
|---|---|
| search_kb | Full-text search across all KB content |
| read_kb | Read a specific file by path |
| create_kb | Create a new file at a given path |
| update_kb | Update an existing file |
| list_kb | List files in a directory |
| delete_kb | Delete a file |

# Search the knowledge base
uv run gaius-cli --cmd "/search_kb 'persistent homology'"

# Read a specific file
uv run gaius-cli --cmd "/read_kb scratch/2026-03-14/103045_analysis.md"

Path Conventions

Metaflow flows use helper methods on GaiusFlow to generate consistent paths:

# Zettelkasten path: scratch/{date}/{HHMMSS}_{title}.md
path = self.zettelkasten_path("My Analysis")
# -> "scratch/2026-03-14/103045_my_analysis.md"

# Archive path: current/archive/{quarter}/attachments/{filename}
path = self.archive_path("paper.pdf")
# -> "current/archive/2026Q1/attachments/paper.pdf"

Integration with Pipelines

The KB serves as both input and output for the data pipeline:

  • Input: Articles with frontmatter and zettelkasten notes drive the article curation flow
  • Output: Processed papers, research summaries, and draft articles are written back to scratch/ or current/
  • Lineage: KB file paths appear as Dataset nodes in the lineage graph, enabling provenance queries from source URL to KB entry

Storage Backend

KB operations go through gaius.storage.kb_ops, which manages the filesystem-backed store. The GAIUS_KB_ROOT environment variable overrides the default build/dev/ location. Content is not stored in the database – the KB is a plain filesystem hierarchy, making it easy to browse, grep, and version control externally.

Sync to HX

Raw content (PDFs, API responses) is stored separately in the HX data lake (Apache Iceberg) to prevent the KB from being overwhelmed with unprocessed data. Only curated summaries and processed markdown enter the KB.

Lineage Tracking

Lineage tracking provides graph-based provenance that connects data sources to derived artifacts. Every pipeline stage emits OpenLineage events that are materialized into an Apache AGE graph stored in PostgreSQL.

Architecture

Metaflow Pipelines ──┐
Fetch Workers ───────┤──> RunEvent ──> LineageEmitter ──> Apache AGE Graph
Agents ──────────────┘                                         |
                                                               v
                                                    Cypher Queries (MCP + CLI)

Graph Schema

The lineage graph uses four vertex labels and four edge labels:

Vertices:

  • Dataset – a data source or sink (namespace, name)
  • Job – a processing definition (namespace, name)
  • Run – a single execution of a job (run_id, state, event_time)

Edges:

  • INPUT_TO – Dataset consumed by Run
  • OUTPUTS – Run produced Dataset
  • EXECUTES – Job spawned Run
  • PARENT – Run is child of another Run

OpenLineage Events

Flows emit events at key lifecycle points:

| Event | Timing | Purpose |
|---|---|---|
| START | Flow begin | Record input datasets |
| COMPLETE | Flow end | Record output datasets |
| FAIL | On error | Record failure with context |

from gaius.hx.lineage import get_emitter, RunEvent, Dataset, Job

emitter = get_emitter()

event = RunEvent.complete(
    run=run,
    job=Job("gaius.flows", "ArticleCurationFlow"),
    inputs=[Dataset("gaius.source", "brave:ai-reasoning")],
    outputs=[Dataset("gaius.kb", "scratch/2026-03-14/paper.md")],
)
await emitter.emit(event)

Cypher Queries

Lineage can be queried via the MCP lineage_cypher tool or the CLI:

# Trace upstream sources for a KB file
uv run gaius-cli --cmd "/lineage query scratch/paper.md"

Example Queries

Find all KB files derived from arXiv sources:

MATCH (s:Dataset)-[:INPUT_TO]->(:Run)-[:OUTPUTS]->(kb:Dataset)
WHERE s.namespace = 'gaius.source' AND s.name STARTS WITH 'arxiv:'
RETURN s.name as source, kb.name as kb_path

Trace full provenance chain (up to 5 hops):

MATCH path = (src:Dataset)-[:INPUT_TO|OUTPUTS*1..5]->(target:Dataset)
WHERE target.namespace = 'gaius.kb'
  AND target.name CONTAINS 'attention_is_all_you_need'
RETURN src.namespace, src.name

Count vertices by label:

MATCH (n) RETURN labels(n)[0] as label, count(n) as cnt

HX Package

The lineage subsystem lives in gaius.hx.lineage:

hx/lineage/
├── events.py    # Dataset, Job, Run, RunEvent (OpenLineage types)
├── emitter.py   # LineageEmitter (store + graph sync)
└── graph.py     # AGE Cypher helpers

The parent gaius.hx package is the raw content data lake (Apache Iceberg). Lineage events bridge HX raw storage to KB curated content, recording every transformation step.

Integration Points

  • Metaflow flows emit START/COMPLETE/FAIL events via the GaiusFlow base class
  • Fetch workers emit events when acquiring external content
  • MCP tools expose query_lineage and lineage_cypher for graph traversal
  • The lineage graph is stored in the same PostgreSQL instance (zndx_gaius:5444) using the Apache AGE extension

Inference

The inference layer routes requests across multiple backends: vLLM for local GPU models, optillm for reasoning enhancement, and external APIs (xAI, Cerebras) for cloud-based inference.

Backend Router

The BackendRouter selects the appropriate backend based on capability requirements:

class BackendRouter:
    async def route_inference(
        self,
        model: str,
        prompt: str,
        max_tokens: int,
        technique: str = "",  # optillm technique
    ) -> str

Backends

| Backend | Purpose | Hardware |
|---|---|---|
| vLLM | Local model inference | 6x NVIDIA GPUs |
| optillm | Reasoning enhancement (CoT, BoN, MoA) | Proxies to vLLM |
| xAI (Grok) | External API inference | Cloud |
| Cerebras | External API inference | Cloud |
| Nomic | Text embeddings | 1 GPU |

optillm Techniques

| Technique | Description |
|---|---|
| cot_reflection | Chain-of-thought with reflection |
| bon | Best-of-N sampling |
| moa | Mixture of Agents |
| rto | Round-trip optimization |
| z3 | Z3 solver integration |
| leap | Learn from examples |

Request Flow

Client → gRPC → Scheduler → BackendRouter → Backend
                                           ↗ vLLM (local)
                                          ↗ optillm → vLLM
                                         ↗ xAI API (cloud)

All inference requests route through the gRPC engine for centralized authentication, audit logging, and resource management.

Subchapters

vLLM Controller

The VLLMController manages vLLM inference server processes across 6 NVIDIA GPUs, handling startup, health monitoring, graceful shutdown, and recovery.

Process Management

class VLLMController:
    async def start_endpoint(
        self,
        model: str,          # HuggingFace model ID
        gpu_ids: list[int],  # Allocated GPUs
        port: int,           # Serving port
        tensor_parallel: int = 1,
    ) -> ProcessStatus

    async def stop_endpoint(self, port: int) -> bool
    async def health_check(self, port: int) -> bool

Lifecycle

  • Graceful shutdown: SIGTERM first, force kill after timeout
  • CUDA memory cleanup: torch.cuda.empty_cache() on shutdown
  • Orphan detection: Scans for stale vLLM processes on startup
  • Circular log buffer: 500 lines for diagnostics

GPU Allocation

6 GPUs are allocated across endpoints:

GPU 0-1: reasoning endpoint (tensor_parallel=2)
GPU 2-3: coding endpoint (tensor_parallel=2)
GPU 4:   embedding endpoint
GPU 5:   available for rendering/evolution

Allocation is managed by the Orchestrator, not the controller directly.

Model Loading

Loading a 70B model to VRAM takes ~240 seconds. During this time:

  1. The engine streams progress to connected clients
  2. The endpoint status transitions: PENDING → STARTING → HEALTHY
  3. Health checks begin polling at 30-second intervals

Status Monitoring

# Check all endpoint status
uv run gaius-cli --cmd "/gpu status" --format json

# Watch during restart
for i in $(seq 1 15); do
    sleep 10
    uv run gaius-cli --cmd "/gpu status" --format json | \
        jq '.data.endpoints[] | {name, status}'
done

Common Issues

| Symptom | Guru Code | Fix |
|---|---|---|
| Process won’t start | #EP.00000001.GPUOOM | /health fix endpoints |
| Orphan process | #EN.00004.ORPHAN_PROC | just gpu-cleanup |
| cv2 import error | OpenCV conflict | See MEMORY.md OpenCV section |

Makespan Scheduling

Makespan scheduling optimizes GPU utilization across multi-step workloads that require endpoint transitions (eviction, loading, inference, restoration).

What is a Makespan?

A makespan is the total time from start to finish of a complex workload that may require:

  1. GPU eviction: Stopping a low-priority endpoint to free GPUs
  2. Endpoint startup: Loading a different model
  3. Workload execution: Running the actual inference
  4. Baseline restoration: Reloading the original endpoint
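Assuming the four phases run strictly sequentially, the makespan is simply their summed duration (phase names and figures illustrative):

```python
def makespan_ms(phases: list[tuple[str, float]]) -> float:
    """Total makespan in ms, assuming strictly sequential phases."""
    return sum(duration for _, duration in phases)

# Illustrative render workload requiring an endpoint swap
render = [
    ("evict", 5_000.0),           # stop a low-priority endpoint
    ("model_load", 240_000.0),    # ~240s for a large model
    ("workload", 90_000.0),       # the actual inference/rendering
    ("restore_baseline", 240_000.0),  # reload the original endpoint
]
```

Note that model load and baseline restoration dominate; this is why the scheduler batches work behind a single transition rather than swapping endpoints per request.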

Example: Render Pipeline

makespan.execute
├── allocate_gpus              # OR-Tools resource assignment
├── evict_if_needed            # Preemption decisions
├── start_endpoints            # vLLM process spawning
│   └── endpoint.start: rendering
│       ├── process_spawn
│       ├── model_load         # ~240s for large models
│       └── health_check
├── execute_workload           # Actual inference/rendering
└── restore_baseline           # Return to set points

AgendaTracker

The AgendaTracker records scheduled endpoint transitions so the Health Observer can distinguish intentional state changes from failures:

tracker.register_operation(
    operation_id=op_id,
    workload_id=wl_id,
    control_mode=ControlMode.POSITIVE,
    target_endpoints=["reasoning", "fast"],
)

Control Modes

| Mode | Purpose |
|---|---|
| POSITIVE | Planned operation (start/stop) |
| FAILURE | Responding to detected failure |
| RESTART_RECOVERY | Restarting after failure resolution |

Tracing

Each makespan is traced as a parent span with child spans for each operation phase. This enables end-to-end visibility into complex multi-step operations, including time spent in external API calls (treated as black-box stages).

XAI Budget

The XAI budget system tracks and limits usage of external AI APIs (xAI Grok, Cerebras) to prevent runaway costs while enabling strategic use for evaluation and critique.

Budget Tracking

budget = scheduler.get_xai_budget()
# budget.daily_remaining — tokens left for today
# budget.daily_limit — configured daily cap
# budget.reset_time — when the budget resets (midnight UTC)

Usage Controls

  • Daily token limit: Configured per provider
  • Request rejection: When budget exhausted, requests fail with clear error
  • Priority gating: Only HIGH and CRITICAL priority jobs can use external APIs
  • Evaluation budget: Separate allocation for agent evaluation tasks
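A sketch of the admission check combining priority gating with budget exhaustion (names and thresholds illustrative):

```python
EXTERNAL_OK = {"HIGH", "CRITICAL"}

def admit(priority: str, tokens_needed: int, daily_remaining: int) -> bool:
    """Gate an external-API request on job priority and remaining budget."""
    if priority not in EXTERNAL_OK:
        return False  # priority gating: lower tiers never go external
    return tokens_needed <= daily_remaining  # reject when budget exhausted
```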

CLI Commands

# Check current budget
uv run gaius-cli --cmd "/xai budget" --format json

# Reset budget (admin)
uv run gaius-cli --cmd "/xai reset" --format json

# Evaluate with external model
uv run gaius-cli --cmd "/xai evaluate" --format json

When External APIs Are Used

| Use Case | Provider | Purpose |
|----------|----------|---------|
| Agent evaluation | xAI Grok | Independent critique of agent output |
| Cross-validation | Cerebras | Second opinion on critical decisions |
| Held-out evaluation | xAI Grok | Measuring agent improvement |

Visualization

The visualization pipeline generates unique procedural images for collection cards using LuxCore path tracing. Each card’s image is deterministic – derived from the differential geometry and algebraic topology of its embedding neighborhood.

Pipeline

Nomic Embeddings (768-dim)
    |
    ├──> GeometryComputer (Ollivier-Ricci curvature, gradient fields)
    └──> TDAComputer (persistent homology via ripser)
            |
            v
        CardVizData (normalized feature vector per card)
            |
            v
        Grammar Engine (CFDG-inspired recursive expansion)
            |
            v
        MeshGen (pure numpy mesh generators)
            |
            v
        LuxCore Renderer (PATHOCL GPU / PATHCPU fallback)
            |
            v
        R2 Storage (viz.gaius.zndx.org)

Mathematical Grounding

Visualizations are not arbitrary aesthetic choices. They are driven by intrinsic geometric properties of the embedding space:

  • Ollivier-Ricci curvature controls glass color temperature and petal count. Positive curvature (cluster interior) produces warmer, simpler forms. Negative curvature (semantic boundary) produces cooler, complex structures.
  • Persistent homology (H0, H1, H2) controls recursion depth, toroidal rings, and void chambers. Topologically richer collections produce deeper nesting.
  • Gradient fields position the key light along the direction of steepest semantic change.
  • Complexity (local topological isolation) controls surface subdivision and branching probability.
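
For illustration, the feature-to-parameter mapping above can be sketched as a pure function. The exact constants are invented for clarity; only the directions follow the text (warmer and simpler for positive curvature, 0-3 rings from H1, 0-2 chambers from H2, recursion depth 3-7 from persistence):

```python
def viz_params(curvature: float, h1: int, h2: int, persistence: float) -> dict:
    """Hypothetical mapping from topology features to render parameters.

    curvature in [-1, 1]; h1/h2 are Betti counts; persistence in [0, 1].
    """
    # Positive curvature (cluster interior): warmer glass, fewer petals.
    color_temp_k = 6500 - 2000 * curvature          # warm (4500K) .. cool (8500K)
    petal_count = max(3, round(8 - 4 * curvature))  # simpler forms when positive
    return {
        "color_temp_k": color_temp_k,
        "petal_count": petal_count,
        "torus_rings": min(h1, 3),                # H1 cycles -> toroidal rings (0-3)
        "void_chambers": min(h2, 2),              # H2 voids -> chambers (0-2)
        "max_depth": 3 + round(4 * persistence),  # recursion depth 3-7
    }
```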

Components

The pipeline spans six modules in gaius.viz/:

| Module | Purpose |
|--------|---------|
| data.py | Feature extraction from embedding geometry |
| grammar.py | Grammar Engine – recursive shape expansion |
| meshgen.py | Pure numpy mesh generators (ico_sphere, petal, torus) |
| luxcore_renderer.py | LuxCore Renderer – scene assembly and rendering |
| renderer.py | Async wrappers, variant management, thread pool |
| storage.py | R2 upload, DB updates, KV sync |

Render Variants

Each card is rendered in two variants:

| Variant | Dimensions | Purpose |
|---------|------------|---------|
| display | 1400x300 | Card header image on site |
| og | 1200x630 | OpenGraph social sharing |

gRPC Integration

Rendering is triggered via the /render CLI command, which invokes the RenderCards streaming RPC on the gRPC engine (port 50051). GPU eviction is coordinated with the vLLM controller:

# Render cards for a collection
uv run gaius-cli --cmd "/render collection-id"

The render workload sets allow_baseline_eviction=True to temporarily free a GPU from vLLM inference. After rendering completes, clear_embeddings() releases the Nomic model (~3GB) from GPU memory.

Halt Conditions

Rendering quality is controlled by time and sample count:

  • Production: 60 seconds / 512 samples per pixel
  • Curation pipeline: 20 seconds / 128 samples per pixel (faster throughput)
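
These halt conditions amount to a simple predicate, sketched here with hypothetical names:

```python
def should_halt(elapsed_s: float, samples: int, *, production: bool = True) -> bool:
    """Stop at a time budget OR a samples-per-pixel budget, whichever comes
    first: 60s / 512 spp in production, 20s / 128 spp in the curation pipeline.
    (Sketch; the real renderer checks these via LuxCore session stats.)"""
    time_budget, sample_budget = (60.0, 512) if production else (20.0, 128)
    return elapsed_s >= time_budget or samples >= sample_budget
```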

Materials

LuxCore’s spectral rendering produces physically accurate glass caustics and internal reflections. This was the primary motivation for switching from Blender Cycles, which rendered recursive glass nesting as opaque white blobs rather than transparent refraction.

LuxCore Renderer

LuxCore is the unbiased path tracer used for generating card visualizations. It provides GPU-accelerated rendering with physically accurate spectral glass materials that Blender Cycles could not achieve.

Installation

PyPI (CPU-only, production fallback):

uv pip install pyluxcore --no-deps

The --no-deps flag is required to avoid pulling in numpy 2.x, which conflicts with vLLM.

From source (GPU path):

The from-source build lives at thirdparty/src/LuxCore (git submodule). Build with:

./build-thirdparty.sh --component luxcore

Output: thirdparty/installed/LuxCore/pyluxcore/pyluxcore.cpython-312-x86_64-linux-gnu.so

Runtime libraries (OIDN + TBB) are installed to thirdparty/installed/LuxCore/lib/ with RPATH set to $ORIGIN/../lib. CUDA 12.4 at /usr/local/cuda is auto-detected during build.

Render Engines

PATHOCL – GPU-accelerated path tracing on CUDA devices. This is the primary production engine. Hybrid mode combines GPU ray intersection with 64 native CPU threads. The engine name is PATHOCL, not PATHGPU (which does not exist).

PATHCPU – 64-thread CPU rendering when no CUDA devices are available. Approximately 10x slower than single-GPU PATHOCL for equivalent sample counts.

Device Selection

CUDA devices are selected via a string of 0 and 1 characters (no spaces), where each position maps to an entry in pyluxcore.GetOpenCLDeviceList():

# Device order: 6 OpenCL (indices 0-5) + 6 CUDA (indices 6-11)
# Physical GPU N = cuda_indices[N]
# Select only GPU 2:
device_string = "000000001000"  # CUDA index 8 = physical GPU 2

The gpu_id parameter restricts rendering to a single evicted GPU, which is required since all other GPUs are loaded by vLLM.
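
A helper for building that selection string might look like the following (hypothetical; it assumes the 6 OpenCL + 6 CUDA device ordering shown above):

```python
def device_select_string(gpu_id: int, n_opencl: int = 6, n_cuda: int = 6) -> str:
    """Build the LuxCore device-selection bitmask for a single physical GPU.

    Assumes the device list is n_opencl OpenCL entries followed by n_cuda
    CUDA entries, as reported by pyluxcore.GetOpenCLDeviceList().
    """
    total = n_opencl + n_cuda
    if not 0 <= gpu_id < n_cuda:
        raise ValueError(f"gpu_id {gpu_id} out of range")
    bits = ["0"] * total
    bits[n_opencl + gpu_id] = "1"  # CUDA index = n_opencl + physical GPU id
    return "".join(bits)
```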

Scene Construction

Camera configuration goes in scene.Parse(), NOT in the config object. This is a common LuxCore pitfall:

scene.Parse(pyluxcore.Properties()
    .Set(pyluxcore.Property("scene.camera.type", "perspective"))
    .Set(pyluxcore.Property("scene.camera.lookat.orig", [0, -5, 2]))
    .Set(pyluxcore.Property("scene.camera.lookat.target", [0, 0, 0.5]))
    .Set(pyluxcore.Property("scene.camera.fieldofview", 40))
)

Light Types

LuxCore supports: point, spot, distant, constantinfinite. There is no area light type – use emissive meshes instead. Light gain values are approximately 100x lower than Blender energy values.

Film Pipeline

After rendering, the film pipeline must be executed with an explicit pipeline index:

session.GetFilm().ExecuteImagePipeline(0)  # 0 = pipeline index, required

Polling vs Blocking

Never use WaitForDone() – it blocks indefinitely. Use polling with HasDone() and UpdateStats():

while not session.HasDone():
    session.UpdateStats()
    stats = session.GetStats()
    elapsed = stats.Get("stats.renderengine.time").GetFloat()
    if elapsed > timeout_seconds:
        break
    time.sleep(0.5)

Initialization

pyluxcore.Init() must be called exactly once. The _ensure_luxcore() helper handles this, preferring the from-source build over the PyPI wheel:

_initialized = False  # module-level guard so Init() runs exactly once

def _ensure_luxcore():
    """Initialize LuxCore exactly once, preferring the from-source build."""
    global _initialized
    if _initialized:
        return
    source_path = Path("thirdparty/installed/LuxCore/pyluxcore")
    if source_path.exists():
        sys.path.insert(0, str(source_path))
    import pyluxcore
    pyluxcore.Init()
    _initialized = True

Grammar Engine

The grammar engine implements a CFDG-inspired recursive expansion system that generates unique 3D scenes from card topology features. It lives in gaius.viz.grammar and produces a flat list of positioned shapes that the LuxCore renderer assembles into scenes.

Design Principles

From Context Free Design Grammars (Horigan, 2004), the engine borrows three key ideas:

  1. Weighted rule alternatives – at each expansion step, the grammar chooses among productions with probabilities derived from the card’s feature vector. This is what makes different cards produce different structures.

  2. Recursive expansion with transform accumulation – each production can invoke sub-rules with a child transform (translation, rotation, scale) relative to the parent. Transforms compose multiplicatively, producing self-similar structures at decreasing scales.

  3. Termination by minimum scale – expansion stops when accumulated scale drops below MIN_SCALE (0.08) or when the shape budget (MAX_SHAPES = 35) is exhausted.

Deterministic Seeding

Every card produces the same visualization regardless of when or where it is rendered:

seed = int(hashlib.sha256(card_id.encode()).hexdigest(), 16) % (2**32)
rng = random.Random(seed)

Feature-to-Rule Mapping

Card topology features control rule weights and recursion depth:

| Feature | Grammar Effect |
|---------|----------------|
| curvature | Petal count, recurse-vs-stop weight, dome factor |
| persistence | Max depth (3-7), shell nesting weight, spiral count |
| complexity | Branch-vs-grow weight, surface segments |
| boundary | Emission strength, volume density, core radius |
| b1 | Number of toroidal rings (0-3) |
| b2 | Number of void chambers (0-2) |
| diagram | Filament count, scale, and z-position |
| card_index | Phase offset for rotational variety in collection |

Shape Primitives

The grammar produces six shape types, all implemented as arbitrary meshes in meshgen.py (not geometric primitives):

  • Petals – flower-like disk segments arranged in clusters
  • Shells – nested recursive enclosures
  • Tori – toroidal glass rings driven by H1 (1-cycles)
  • Voids – inverted-normal spheres representing H2 (2-cycles)
  • Filaments – thin structures whose scale encodes persistence interval lifetime
  • Core – central anchor shape

Arrangement Modes

The root-level grammar selects one of three arrangement modes:

  • Cluster – radial arrangement around a center point
  • Spiral – logarithmic spiral placement
  • Branches – tree-like recursive branching

The arrangement mode is selected probabilistically based on the card’s curvature and complexity features.

Extensibility

Adding a new shape primitive requires three changes:

  1. A mesh generator function in meshgen.py: (parameters) -> (vertices, faces)
  2. A shape constant in grammar.py
  3. A renderer case in luxcore_renderer.py

The grammar and renderer are agnostic to the geometry they receive – any mesh generator that returns numpy vertex and face arrays works.
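
For illustration, a generator satisfying that contract can be as small as this (a flat quad panel, not one of the real meshgen generators):

```python
import numpy as np


def quad_panel(width: float, height: float) -> tuple[np.ndarray, np.ndarray]:
    """Illustrative mesh generator matching the (parameters) -> (vertices,
    faces) contract: a flat panel built from two triangles."""
    w, h = width / 2.0, height / 2.0
    vertices = np.array(
        [[-w, -h, 0.0], [w, -h, 0.0], [w, h, 0.0], [-w, h, 0.0]],
        dtype=np.float64,
    )
    faces = np.array([[0, 1, 2], [0, 2, 3]], dtype=np.int64)  # two triangles
    return vertices, faces
```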

Future Directions

The grammar is currently expressed as Python functions with hardcoded rule structures. A text-based grammar format (closer to CFDG’s declarative syntax) would allow grammar definitions to be version-controlled and iterated without modifying Python code.

Viz Storage

Rendered card visualizations are stored in Cloudflare R2 and served from a public URL. The storage layer handles upload, database updates, and KV sync for live site pages.

R2 Bucket

| Property | Value |
|----------|-------|
| Bucket name | gaius-viz |
| Public URL | https://viz.gaius.zndx.org |

Object Key Convention

Rendered images follow a predictable path structure:

viz/cards/{card_id}/{variant}.png

For example:

viz/cards/abc123/display.png
viz/cards/abc123/og.png
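
The key and URL conventions are simple enough to sketch as helpers (hypothetical names; the real storage module may validate differently):

```python
BASE_URL = "https://viz.gaius.zndx.org"


def viz_key(card_id: str, variant: str) -> str:
    """Object key for a rendered card image, per the path convention above."""
    if variant not in {"display", "og"}:
        raise ValueError(f"unknown variant: {variant}")
    return f"viz/cards/{card_id}/{variant}.png"


def viz_url(card_id: str, variant: str = "display") -> str:
    """Public URL; the OG URL is derived purely by path convention."""
    return f"{BASE_URL}/{viz_key(card_id, variant)}"
```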

Variants

Each card is rendered in two variants:

| Variant | Dimensions | Purpose |
|---------|------------|---------|
| display | 1400x300 | Card header image on the site |
| og | 1200x630 | OpenGraph image for social sharing |

Database Integration

The image_url column in the cards table stores the display variant URL:

https://viz.gaius.zndx.org/viz/cards/{card_id}/display.png

The OG variant URL is derived by path convention – replace display.png with og.png. There is no separate database column for the OG URL.

Upload Flow

After the LuxCore renderer produces an image, the storage module (gaius.viz.storage) handles:

  1. R2 upload – uploads both display and OG variants to the bucket
  2. DB update – sets the image_url column on the card row
  3. KV sync – updates Cloudflare KV stores used by the live card pages

# Simplified upload path
await upload_to_r2(card_id, display_bytes, "display")
await upload_to_r2(card_id, og_bytes, "og")
await update_card_image_url(card_id, display_url)
await sync_kv(card_id)

CLI Access

# Render cards for a collection
uv run gaius-cli --cmd "/render collection-id"

# The render command handles the full pipeline:
# grammar expansion -> LuxCore render -> R2 upload -> DB update -> KV sync

GPU Eviction

Rendering requires GPU access, but vLLM typically occupies all GPUs. The render workload requests GPU eviction via allow_baseline_eviction=True in the gRPC workload metadata. After rendering completes, clear_embeddings() releases the Nomic embedding model (~3GB) from GPU memory. See Visualization for the full pipeline context.

Bases Feature Store

Bases is an entity-centric feature store backed by Apache Kudu (via PostgreSQL FDW) with a fluent query API, BFO ontology grounding, and query guardrails. It abstracts multiple storage backends behind a unified interface.

Core Concepts

A Base is a named, typed view over features and entities. Bases hide the underlying storage backend (PostgreSQL, Iceberg, Kudu FDW) behind a consistent query interface.

Three base types determine query semantics and backend routing:

| Type | Semantics | Backend |
|------|-----------|---------|
| SNAPSHOT | Latest value per entity | Kudu via FDW (PostgreSQL stub) |
| HISTORICAL | Event-sourced with time-travel | Apache Iceberg |
| REGISTRY | Metadata queries | PostgreSQL |

Fluent Query API

The primary query interface uses Kudu SDK-style method chaining:

from gaius.bases import Base, col, term

results = await (
    Base("events")
    .where(col("age") > 30)
    .where(col("status").isin("active", "pending"))
    .select("name", "email")
    .order_by("created_at", desc=True)
    .limit(100)
    .scan()
)

Ontology-grounded queries resolve BFO terms to column names via the base’s @context:

results = await (
    Base("events")
    .where(term("BFO:material_entity") == "ENT-12345")
    .scan()
)

Time-travel queries on historical bases:

results = await (
    Base("events")
    .as_of("2026-01-01T00:00:00Z")
    .where(col("entity_id") == "user-42")
    .scan()
)

Base Definition (.base YAML)

Bases are defined in YAML files with JSON-LD style semantic grounding:

"@context":
  "@vocab": "https://purl.obolibrary.org/obo/"
  entity_id:
    "@id": "BFO_0000040"

kudu:
  table: "gaius.events"
  primary_key: [entity_id, event_time]

schema:
  - name: entity_id
    type: STRING
  - name: event_time
    type: TIMESTAMP

Query Guardrails

All queries pass through guardrails that enforce resource limits:

| Guardrail | Default | Maximum |
|-----------|---------|---------|
| Result limit | 1,000 rows | 10,000 rows |
| Query timeout | 30 seconds | 120 seconds |
| Time range (historical) | 7 days | 90 days |

Historical bases require a time constraint (.as_of() or time column filter). Unbounded historical scans are rejected.

MCP Tools

| Tool | Operation |
|------|-----------|
| bases_list | List available bases with metadata |
| bases_query | Execute fluent queries against bases |
| bases_entity_history | Get event-sourced history for an entity |
| bases_health | Check service health |

Architecture

Fluent API (Base/col/term) ──> Parser ──> Compiler (SQLGlot) ──> Executor
                                              |                      |
                                              v                      v
                                    Guardrail Enforcer         PostgreSQL / Iceberg

The DQL Query Language provides the text-based query syntax parsed by the fluent expression parser.

Guru Meditation Codes

| Code | Meaning |
|------|---------|
| #BASES.00000001.NOPOOL | Database pool not configured |
| #BASES.00000002.NOICEBERG | Iceberg catalog unavailable |
| #FLUENT.00000001.BADAST | Invalid query expression |
| #FLUENT.00000002.UNSAFEOP | Unsafe operation attempted |

DQL Query Language

DQL (Domain Query Language) is the text-based query syntax for the Bases feature store. It provides a safe, sandboxed expression language that compiles to SQL via SQLGlot.

Syntax

DQL expressions use a fluent Python-like syntax that is parsed via AST walking (never eval):

Base("events").where(col("age") > 30).limit(10)
Base("users").where(col("status").isin("active", "pending")).select("name", "email")
Base("metrics").where(term("BFO:temporal_region") >= "2026-01-01").order_by("timestamp", desc=True)

Operators

Column References

col("name") creates a column reference for filtering and selection:

col("age") > 30
col("status") == "active"
col("name").like("John%")
col("deleted_at").is_null()
col("role").isin("admin", "editor")

Term References

term("IRI") creates an ontology-grounded reference that resolves to a column via the base’s @context:

term("BFO:material_entity") == "ENT-12345"
term("BFO:temporal_region") >= "2026-01-01"

Comparison Operators

| Operator | DQL | SQL |
|----------|-----|-----|
| Equal | == | = |
| Not equal | != | != |
| Less than | < | < |
| Less or equal | <= | <= |
| Greater than | > | > |
| Greater or equal | >= | >= |

Logical Operators

Predicates can be combined with bitwise operators (the parentheses are required, since & and | bind more tightly than comparisons in Python):

(col("age") > 30) & (col("status") == "active")   # AND
(col("role") == "admin") | (col("role") == "editor")  # OR
~(col("deleted_at").is_null())                       # NOT

Multiple .where() calls are combined with AND.

Methods

| Method | Purpose | Example |
|--------|---------|---------|
| .where(pred) | Add filter predicate | .where(col("x") > 1) |
| .select(*cols) | Select specific columns | .select("name", "email") |
| .order_by(col, desc=) | Sort results | .order_by("created_at", desc=True) |
| .limit(n) | Limit result count | .limit(100) |
| .as_of(ts) | Time-travel (historical) | .as_of("2026-01-01T00:00:00Z") |
| .scan() | Execute query | await query.scan() |

Safety Model

DQL is parsed using Python’s ast module with strict whitelisting. Only allowed names (Base, col, term, True, False, None), methods, and operators are permitted. Any unrecognized AST node triggers a fail-fast error:

#FLUENT.00000001.BADAST - Unsupported AST node
#FLUENT.00000002.UNSAFEOP - Unsafe operation attempted

This prevents arbitrary code execution while supporting expressive queries.
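
The whitelisting approach can be sketched with the standard ast module; the allowed node set here is illustrative and narrower than the real parser's:

```python
import ast

ALLOWED_NAMES = {"Base", "col", "term", "True", "False", "None"}
ALLOWED_NODES = (
    ast.Expression, ast.Call, ast.Attribute, ast.Name, ast.Constant,
    ast.Compare, ast.BinOp, ast.UnaryOp, ast.BitAnd, ast.BitOr, ast.Invert,
    ast.Gt, ast.GtE, ast.Lt, ast.LtE, ast.Eq, ast.NotEq,
    ast.Load, ast.keyword,
)


def validate_dql(expr: str) -> bool:
    """Walk the AST and fail fast on any node or name outside the
    whitelist; the expression is never passed to eval."""
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"#FLUENT.00000001.BADAST - {type(node).__name__}")
        if isinstance(node, ast.Name) and node.id not in ALLOWED_NAMES:
            raise ValueError(f"#FLUENT.00000002.UNSAFEOP - name {node.id!r}")
    return True
```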

Compilation

The FluentCompiler translates DQL expressions to PostgreSQL-compatible SQL using SQLGlot:

query = Base("events").where(col("age") > 30).limit(10)
sql = query.to_sql()
# SELECT * FROM events WHERE age > 30 LIMIT 10

Term references are resolved through the base’s @context dictionary, mapping ontology IRIs to physical column names.

MCP Usage

DQL queries are passed as strings to the bases_query MCP tool:

uv run gaius-cli --cmd '/bases query events where(col("age") > 30).limit(10)'

The parser validates the expression before compilation, ensuring that only safe operations reach the database.

RASE Metamodel

RASE (Rapid Agentic Systems Engineering) is a Python-native MBSE metamodel for verifiable agent training. It implements SysML v2-like semantics using Pydantic models, without requiring external MBSE tooling.

Core Principle: RLVR

The reward signal comes from verifiable computation, not human feedback or learned approximations. The verifier is a first-class artifact – specified, reviewed, tested, and versioned alongside the agent it trains.

Four Coupled Models

RASE consists of four tightly coupled models. Changes to one often require updates to others:

| Model | Purpose | Package |
|-------|---------|---------|
| SSM | System State Model – system as typed graph | gaius.rase.domains.nifi |
| OSM | Operational Scenario Model – BDD scenarios | gaius.rase.osm |
| UOM | UI Observation Model – SoM/ToM grounding | gaius.rase.uom |
| VM | Verifier Model – requirements, oracle, rewards | gaius.rase.vm |

The TraceableId spine links artifacts across all four models, enabling full traceability from BDD scenario to training reward.

SysML v2 Alignment

RASE mirrors SysML v2 semantics without requiring external tooling:

| SysML v2 Concept | RASE Implementation |
|------------------|---------------------|
| requirement def | Requirement, ScenarioRequirement |
| verification def | VerificationCase, APIVerificationCase |
| constraint def | Constraint subclasses (composable via AllOf, AnyOf, Not) |
| action def | StepDef with @given, @when, @then |
| part def | Processor, ProcessorGroup, NiFiInstance |
| Human ID <'scheme:path'> | TraceableId.uri |

Package Structure

src/gaius/rase/
├── core/                 # Domain-agnostic: SystemState, Constraint[S], Oracle[S]
├── domains/              # Domain-specific implementations
│   ├── nifi/             # NiFi domain (state, constraints, oracle)
│   └── kb/               # Knowledge Base domain
├── traceability.py       # TraceableId, DigitalThread
├── osm/                  # Operational Scenario Model (BDD)
├── uom/                  # UI Observation Model (SoM/ToM)
└── vm/                   # Verifier Model (requirements, oracle, rewards)

Safety-Critical Infrastructure

The verifier is maintained with the same rigor as production code. All constraints are immutable (frozen=True), return structured ConstraintResult objects with rich failure messages, and support declarative composition. See Verification for details on the reward computation pipeline.

Four Coupled Models

The RASE metamodel consists of four tightly coupled models. They form a coherent verification framework where changes to one model often require updates to others.

Coupling Matrix

| If you change… | Also update… |
|----------------|--------------|
| SSM (system state) | VM constraints that reference state structure |
| OSM (scenarios) | VM requirements derived from scenarios |
| UOM (marks/traces) | VM verification cases that consume traces |
| VM (verification) | Reward strategies, which must align with constraint semantics |

SSM – System State Model

The SSM represents the system under test as a typed graph. The primary domain is NiFi, modeled as NiFiInstance containing ProcessorGroup, Processor, FlowConnection, and ControllerService nodes.

from gaius.rase.domains.nifi import NiFiInstance, Processor, ProcessorGroup

state = NiFiInstance(
    root_group=ProcessorGroup(id="root", name="NiFi Flow", processors=[
        Processor(id="abc", name="GetFile", type="org.apache.nifi.GetFile"),
    ])
)

SSM constraints are declarative, composable, and immutable. Examples: ProcessorExists, AllProcessorsRunning, NoBackpressure, FlowIsEquivalent. Compose with AllOf, AnyOf, Not.
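
A condensed sketch of that constraint pattern, using frozen dataclasses in place of the real frozen Pydantic models and a plain dict in place of NiFiInstance:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConstraintResult:
    satisfied: bool
    message: str = ""


@dataclass(frozen=True)
class ProcessorExists:
    """Declarative, immutable constraint: one check, rich failure message."""
    processor_id: str

    def evaluate(self, state: dict) -> ConstraintResult:
        ok = self.processor_id in state.get("processors", {})
        return ConstraintResult(ok, "" if ok else f"missing {self.processor_id}")


@dataclass(frozen=True)
class AllOf:
    """Composite constraint: satisfied only if every child is satisfied."""
    constraints: tuple

    def evaluate(self, state: dict) -> ConstraintResult:
        results = [c.evaluate(state) for c in self.constraints]
        failures = [r.message for r in results if not r.satisfied]
        return ConstraintResult(not failures, "; ".join(failures))
```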

OSM – Operational Scenario Model

The OSM captures BDD (Behavior-Driven Development) scenarios as executable specifications. Each scenario is a sequence of Given/When/Then steps that map to SysML v2 action definitions.

from gaius.rase.osm import Scenario, StepType, StepUsage

scenario = Scenario(
    name="CreateBasicFlow",
    steps=[
        StepUsage(step_type=StepType.GIVEN, text="NiFi is running"),
        StepUsage(step_type=StepType.WHEN,  text="I create a processor group named 'ETL'"),
        StepUsage(step_type=StepType.THEN,  text="the group 'ETL' exists"),
    ],
)

Step definitions (StepDef) are reusable patterns with {param} placeholders. The StepRegistry maps step text to executable actions via @given, @when, @then decorators.

UOM – UI Observation Model

The UOM provides grounding between language and UI actions using two complementary structures:

  • SoM (Set-of-Mark): A ScreenshotWithSoM annotates a screenshot with numbered Mark objects, each with a BoundingBox, UIRole, and optional mapping to an SSM element.
  • ToM (Trace-of-Mark): A TraceOfMarks records a sequence of ActionFrame entries (click, type, scroll) referencing marks by number, forming the agent’s action trajectory.

from gaius.rase.uom import Mark, BoundingBox, PixelCoord, UIRole

mark = Mark(
    mark_id=1,
    bbox=BoundingBox.from_xywh(100, 200, 50, 30),
    ui_role=UIRole.BUTTON,
    label="Add Processor",
)

The SoM/ToM pattern enables precise UI grounding: agents reference elements by mark number rather than pixel coordinates.

VM – Verifier Model

The VM implements RLVR verification. It connects OSM scenarios to executable verification cases with oracle-based reward computation. See Verification for full details.

Key components:

  • Requirements: StepRequirement (atomic, from a BDD step) and ScenarioRequirement (composite, grouping steps with invariants)
  • Verification Cases: APIVerificationCase (ground truth via API) and UIVerificationCase (agent UI actions, final state checked via API)
  • Oracle: NiFiOracle queries the NiFi REST API for authoritative state verification
  • Reward Strategies: BinaryReward (sparse) and GradedReward (partial credit)

Traceability

TraceableId and DigitalThread form the traceability spine linking all RASE artifacts. Every model element carries a URI-based identifier that enables cross-model linking, impact analysis, and full audit trails from requirement to training reward.

TraceableId

A TraceableId mirrors the SysML v2 human ID pattern: <'scheme:path'>. It is immutable (frozen=True) and hashable for use as dict keys and set members.

URI Schemes

| Scheme | Namespace | Example |
|--------|-----------|---------|
| bdd | BDD features, scenarios, steps | bdd://features/basic_flows#Scenario:CreateFlow |
| nifi | NiFi processors, groups, connections | nifi://groups/root/processors/abc123 |
| otel | OpenTelemetry spans and events | otel://spans/trace123/span456 |
| metaflow | Metaflow runs, steps, tasks | metaflow://flows/train/runs/42 |
| rase | Internal artifacts (results, threads) | rase://verify/a1b2c3d4e5f6 |
| som | Set-of-Mark UI annotations | som://screenshots/frame42 |
| tom | Trace-of-Mark action sequences | tom://traces/episode7 |

Factory Methods

from gaius.rase import TraceableId

# BDD scenario
tid = TraceableId.from_bdd("basic_flows", scenario="CreateFlow")
# → bdd://features/basic_flows#Scenario:CreateFlow

# NiFi processor
tid = TraceableId.from_nifi("root", processor_id="abc123")
# → nifi://groups/root/processors/abc123

# Auto-generated with UUID
tid = TraceableId.generate(scheme=IdScheme.RASE, prefix="verify")
# → rase://verify/a1b2c3d4e5f6

# Stable BDD step hash (survives line number changes)
tid = TraceableId.from_bdd_step_hash("flow.feature", "I create a group named 'ETL'")
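
The stable step hash can be sketched as follows; the whitespace normalization and digest length here are assumptions, not necessarily what from_bdd_step_hash does:

```python
import hashlib


def bdd_step_uri(feature: str, step_text: str) -> str:
    """Derive a stable step identifier by hashing the normalized step text,
    so the URI survives line-number and whitespace changes in the .feature
    file (illustrative sketch of the from_bdd_step_hash idea)."""
    normalized = " ".join(step_text.split())  # collapse runs of whitespace
    digest = hashlib.sha256(normalized.encode()).hexdigest()[:12]
    return f"bdd://features/{feature}#Step:{digest}"
```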

DigitalThread

A DigitalThread captures one complete verification-to-training cycle. It links the full chain:

Requirement -> Verification Case -> Execution Result -> Evidence -> Training Episode

from gaius.rase import DigitalThread

thread = DigitalThread(
    requirement_id=req_id,
    verification_case_id=case_id,
    verification_result_id=result_id,
    api_state_before=before_id,
    api_state_after=after_id,
    reward_outcome=0.85,
)
thread.add_evidence(screenshot_id, "screenshot")
thread.add_evidence(span_id, "span")

TraceabilityGraph

The TraceabilityGraph collects TraceabilityLink objects (directed, typed relationships) and supports queries:

  • Forward trace: What derives from this requirement?
  • Backward trace: What requirements does this artifact satisfy?
  • Impact analysis: What verification cases need re-running if this changes?

Link types follow MBSE semantics: DERIVES, SATISFIES, VERIFIES, ALLOCATES, TRACES, REFINES.

Source

All traceability infrastructure lives in src/gaius/rase/traceability.py.

Verification

The Verifier Model (VM) implements RLVR – Reinforcement Learning with Verifiable Reward. The oracle provides ground-truth verification using authoritative API sources, never UI observations. UI traces are the training target, not the oracle.

VerdictKind

Every verification case produces one of four outcomes:

| Verdict | Meaning | Default Reward |
|---------|---------|----------------|
| PASS | All requirements satisfied | 1.0 |
| FAIL | One or more requirements not satisfied | 0.0 (or accuracy for partial credit) |
| INCONCLUSIVE | Could not determine (missing data) | 0.5 |
| ERROR | Verification itself failed (infrastructure) | 0.0 |

Accuracy

Accuracy is always a float in [0.0, 1.0], representing the proportion of constraints satisfied. It provides the foundation for graded reward strategies.

# Computed inside verification cases:
passed_count = sum(1 for r in constraint_results if r.satisfied)
accuracy = passed_count / len(constraint_results)

Verification Cases

Two types of verification cases exist:

  • APIVerificationCase – the RLVR oracle. Checks system state via the NiFi REST API. Evaluates Given (setup), Then (end-state), invariant, and transition constraints.
  • UIVerificationCase – verifies agent UI actions. The final state is still checked via API; the trace captures what the agent did to get there.

case = APIVerificationCase(
    id=TraceableId.generate(scheme=IdScheme.RASE, prefix="verify"),
    name="Verify_CreateBasicFlow",
    objective=VerificationObjective(requirement_ids=[scenario.id]),
    scenario_requirement=scenario_req,
)
result = await case.execute(current_nifi_state)

Reward Strategies

Reward strategies convert verification results into training signals:

| Strategy | Signal Type | Use Case |
|----------|-------------|----------|
| BinaryReward | Sparse (0 or 1) | Clear pass/fail tasks, early training |
| GradedReward | Dense (0.0–1.0 with partial credit) | Multi-step tasks, complex scenarios |
| StepwiseReward | Dense per step | Long sequences where intermediate progress matters |
| TrajectoryShaping | Dense with efficiency | Tasks where path quality matters |

from gaius.rase import GradedReward, compute_reward

strategy = GradedReward(pass_bonus=0.1, fail_penalty=0.0)
reward = compute_reward(result, strategy=strategy)

Oracle

The NiFiOracle provides authoritative verification:

  1. Agent takes UI actions to modify NiFi
  2. Oracle queries NiFi REST API to check resulting state
  3. State is compared against scenario requirements (constraints)
  4. Reward is computed from the VerificationResult

oracle = NiFiOracle(reward_strategy=GradedReward())
result, reward = await oracle.verify_and_reward(scenario_req, trace=ui_trace)

Advanced oracles include CurriculumOracle (progressive difficulty) and EnsembleOracle (multi-source consensus).

Source

Verification infrastructure lives in src/gaius/rase/vm/ with verification cases in verification.py, requirements in requirements.py, and oracle/reward logic in oracle.py.

Observability

Gaius uses a three-layer observability stack: OpenTelemetry for instrumentation, Prometheus for time-series storage, and Metabase for self-service analytics dashboards.

Architecture

CLI/TUI/MCP --> gRPC --> Engine --> OTel Collector --> Prometheus
                         ^^^^^^                          |
                    metrics exported here          Metabase (dashboards)

The engine is the single source of truth for metric export. All clients (CLI, TUI, MCP) route metrics through the gRPC engine, which exports via OpenTelemetry SDK to the OTel Collector. The collector forwards to Prometheus for scraping.

Components

| Layer | Technology | Purpose |
|-------|------------|---------|
| OpenTelemetry | OTel SDK + Collector | Distributed tracing, metric instrumentation |
| Prometheus | PromQL, time-series DB | Metric storage, alerting, range queries |
| Metabase | SQL analytics platform | Dashboards connected to PostgreSQL |

ObservePanel

The TUI’s ObservePanel displays real-time metrics using declarative MetricDefinition objects. Each definition specifies:

  • Source: prometheus (PromQL query) or engine (gRPC proxy)
  • Display: sparkline, gauge, counter, or percentage
  • Thresholds: warning/critical levels with directional logic (above or below)

Metric categories include inference (latency, throughput, errors), GPU compute (FLOPS utilization), health (active incidents, escalations, FMEA scores), and pipeline operations (cards/day, backlog depth).

Design Philosophy

Metrics use 10-minute windowed rates (Flink-inspired) to survive bursty workloads like ambient reasoning. Sparklines show 5 minutes of history at 15-second resolution. The Fail Open principle applies: unknown states are surfaced for investigation rather than filtered away.

See each sub-chapter for implementation details.

OpenTelemetry

Gaius uses the OpenTelemetry SDK for distributed tracing and metric instrumentation. The engine centralizes all OTel export through EngineMetrics, ensuring a single source of truth for operational telemetry.

Instrumentation

The EngineMetrics singleton (initialized at engine startup) creates OTel instruments:

from gaius.engine.metrics import EngineMetrics

metrics = EngineMetrics.get_instance()
metrics.record_inference(model="reasoning", latency_ms=150, tokens=500)
metrics.record_gpu_memory(gpu_id=0, used_mb=12000, total_mb=24000)
metrics.record_healing_attempt(endpoint="reasoning", tier=0, success=True)

Metric Categories

| Category | Instruments | Type |
|----------|-------------|------|
| Inference | inference_count, inference_latency, inference_tokens | Counter, Histogram |
| GPU | gpu_memory_used, gpu_utilization, gpu_flops_utilization | Gauge (observable callbacks) |
| Endpoints | endpoint_healthy, endpoint_requests | Gauge, Counter |
| Healing | healing_attempts, healing_escalations, incidents_active | Counter, Gauge |
| Pipeline | pipeline_cards_published, pipeline_pending_cards | Counter, Gauge |
| Errors | error_total, exception_caught_total | Counter |

Metric Naming

Metrics follow a double-prefix convention due to OTel Collector namespace configuration:

gaius_gaius_<metric_name>_<unit>

The first gaius_ comes from the OTel Collector namespace config; the second from SDK metric naming (gaius. becomes gaius_ after export). PromQL queries in the OBSERVE_METRICS registry use this full prefix.

Export Pipeline

EngineMetrics --> OTel SDK --> OTLP Exporter --> OTel Collector --> Prometheus

The OTel Collector runs as a sidecar, receiving OTLP and forwarding metrics to Prometheus via remote-write or a scrape endpoint. GPU metrics use observable callbacks that are invoked on each collection cycle.

Makespan Tracing

For long-running operations (evolution cycles, research flows), Gaius uses makespan tracing: a parent span covers the entire operation, with child spans for each phase. This enables latency attribution across multi-step workflows without excessive span cardinality.
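The span structure can be illustrated with a stdlib-only sketch. The real implementation uses OTel spans; the names and timing mechanism below are purely illustrative of the parent/child shape:

```python
# Conceptual sketch of makespan tracing using stdlib timers (the real
# implementation uses OTel spans; names here are illustrative).
import time
from contextlib import contextmanager

spans = []  # collected (name, parent, duration_s) records

@contextmanager
def span(name, parent=None):
    start = time.perf_counter()
    try:
        yield name
    finally:
        spans.append((name, parent, time.perf_counter() - start))

# One parent span covers the whole operation; one child span per phase.
with span("evolution_cycle") as parent:
    with span("evaluate", parent):
        time.sleep(0.01)
    with span("mutate", parent):
        time.sleep(0.01)

for name, par, dur in spans:
    print(f"{name} (parent={par}): {dur*1000:.1f} ms")
```

Because each phase is a child of one long-lived parent, latency can be attributed per phase while the total span count stays proportional to the number of phases, not the number of internal calls.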

Source

Engine metrics: src/gaius/engine/metrics.py. Observability sources: src/gaius/observability/sources/.

Prometheus

Prometheus provides time-series metric storage and PromQL queries for the Gaius observability stack. It scrapes metrics exported by the OTel Collector and serves as the backend for the TUI’s ObservePanel.

PrometheusSource

The PrometheusSource client (src/gaius/observability/sources/prometheus.py) queries the Prometheus HTTP API:

from gaius.observability import PrometheusSource

source = PrometheusSource(base_url="http://localhost:9090")

# Instant query (current value)
value = await source.query_instant(
    'histogram_quantile(0.95, sum by (le) (rate(gaius_gaius_inference_latency_milliseconds_bucket[10m])))'
)

# Range query (sparkline data)
series = await source.query_range(
    'sum(rate(gaius_gaius_inference_count_total[10m])) * 3600',
    duration_seconds=300,  # 5 minutes of history
    step_seconds=15,       # 15-second resolution
)

Custom Metrics

Inference

  • gaius_gaius_inference_latency_milliseconds – histogram with p95 via histogram_quantile
  • gaius_gaius_inference_count_total – counter, displayed as inferences/hour
  • gaius_gaius_inference_tokens_total – counter, displayed as tokens/hour
  • gaius_gaius_error_total / gaius_gaius_request_total – error rate percentage

GPU

  • gaius_gaius_gpu_flops_utilization_percent – FLOPS-weighted utilization across six RTX 4090 GPUs, aggregated with a Welford streaming mean

Health and Self-Healing

  • gaius_gaius_incidents_active – gauge of active incidents
  • gaius_gaius_healing_escalations_total – counter of ACP escalations per hour
  • gaius_gaius_fmea_rpn_score – FMEA Risk Priority Numbers (high RPN > 200)

Pipeline Operations

  • gaius_gaius_pipeline_cards_published_total – cards published (daily)
  • gaius_gaius_pipeline_pending_cards – backlog gauge
  • gaius_gaius_pipeline_task_failure_total – failures by task type (zero tolerance)
  • gaius_gaius_exception_caught_total – operational errors (non-LLM)

Windowed Rates

All rate calculations use 10-minute windows to survive bursty workloads. This keeps metrics hydrated during quiet periods rather than dropping to zero between bursts.
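The effect can be shown with a small sketch. This is an illustrative client-side calculation; Prometheus computes the equivalent server-side with `rate(...[10m])`:

```python
# Sketch of a windowed rate: events in the last `window` seconds divided by
# the window length (illustrative; Prometheus does this with rate(...[10m])).
def windowed_rate(event_times: list[float], now: float, window: float = 600.0) -> float:
    recent = [t for t in event_times if now - window <= t <= now]
    return len(recent) / window  # events per second over the window

# A burst of 60 events two minutes ago still yields a non-zero rate now,
# where an instantaneous measurement would read zero between bursts.
burst = [880.0 + i for i in range(60)]
print(windowed_rate(burst, now=1000.0))  # 0.1 events/sec
```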

Engine Source

For metrics not available in Prometheus (GPU memory per device, scheduler queue depth, evolution cycles), the EngineSource queries the gRPC engine directly. These return single-point values since the engine does not retain history.

Source

src/gaius/observability/sources/prometheus.py, src/gaius/observability/sources/engine.py, src/gaius/observability/metrics.py.

Metabase

Metabase provides self-service analytics dashboards connected to the Gaius PostgreSQL database (zndx_gaius on port 5444). It queries the meta schema, which contains materialized analytics tables designed for dashboard consumption.

Architecture

PostgreSQL (zndx_gaius)
  ├── public schema      --> operational tables (cards, agents, health)
  ├── meta schema        --> analytics views for Metabase
  ├── collections schema --> curated content for landing page
  └── bases schema       --> feature store registry
         |
    Metabase (localhost:3000)
         |
    Dashboards: lineage, operations, KB geometry

Meta Schema

The meta schema (db/migrations/20251218000001_meta_schema.sql) provides pre-aggregated analytics:

| Table | Purpose |
|---|---|
| meta.dataset_catalog | Deduplicated dataset registry from lineage events |
| meta.job_catalog | Job registry with run counts, success/failure rates |

These tables are populated from OpenLineage events and provide the foundation for data lineage dashboards.

Dashboard Categories

Lineage

  • Data provenance graph (which flows produce which datasets)
  • Dataset read/write frequency
  • Job success rates over time

Operations

  • Agent evaluation scores and evolution trends
  • GPU utilization over time
  • Inference throughput and latency distributions
  • Pipeline health (cards published, curation cadence)

KB Geometry

  • Document cluster topology
  • Embedding space coverage
  • Content freshness by domain

Process Management

Metabase runs as a devenv process defined in scripts/processes/metabase.sh. It starts on localhost:3000 and connects to PostgreSQL using the same credentials as the application (gaius:gaius@localhost:5444/zndx_gaius).

Source

Metabase process: scripts/processes/metabase.sh. Meta schema: db/migrations/20251218000001_meta_schema.sql.

Security

Gaius employs a multi-layer security model focused on protecting autonomous operations. Security verification is mandatory and cannot be disabled – this is by design to prevent generated code from bypassing security checks.

Threat Model

The primary attack surface is the ACP (Agent Client Protocol) integration, which allows autonomous health maintenance via GitHub issue workflows. Without controls, an agent could:

  • Leak internal state to public repositories
  • Be influenced by prompt injection in externally-controlled issues
  • Expose credentials in issue comments
  • Be tricked by repository visibility changes

Security Layers

| Layer | Check | Purpose |
|---|---|---|
| 0 | Format validation | Reject malformed repository names |
| 1 | HOCON allowlist | Explicit repository patterns only |
| 2 | Visibility verification | Repository must be private (via gh api) |
| 3 | Content sanitization | Redact secrets, strip injection markers |

All four layers execute on every operation. There is no parameter or configuration to skip layers.

Cadence Controls

To prevent runaway automation:

  • Maximum 3 GitHub issues per 24 hours
  • Minimum 5 minutes between restart attempts
  • Maximum 3 restarts per endpoint per hour
  • All changes committed to acp-claude/health-fix branch for human review

Guru Meditation Codes

Security failures use the #ACP.SEC.* code family:

| Code | Description |
|---|---|
| #ACP.SEC.00000002.NOTALLOWED | Repository not in allowlist |
| #ACP.SEC.00000003.NOTPRIVATE | Repository not private |
| #ACP.SEC.00000004.NOTCONFIGURED | No repositories configured |
| #ACP.SEC.00000005.BADFORMAT | Invalid repository format |

See ACP Security Model for implementation details and Content Sanitization for redaction rules.

ACP Security Model

The Agent Client Protocol uses four mandatory security layers for all GitHub operations. Every layer must pass; there is no bypass mechanism.

Layer 0: Format Validation

Repository names are validated against strict regex patterns before any network call:

# Supported formats:
"owner/repo"                  # Legacy (github.com assumed)
"github.com/owner/repo"      # Full URL
"github.example.com/org/repo" # On-prem GitHub Enterprise

Invalid characters, missing components, or malformed URLs raise GitHubSecurityError immediately with #ACP.SEC.00000005.BADFORMAT.
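A validation sketch for the three formats above might look like the following. The regexes are illustrative; the actual patterns live in src/gaius/acp/security.py:

```python
# Hedged sketch of Layer 0 format validation (illustrative regexes, not the
# actual patterns from src/gaius/acp/security.py).
import re

NAME = r"[A-Za-z0-9][A-Za-z0-9._-]*"
PATTERNS = [
    re.compile(rf"^{NAME}/{NAME}$"),                 # owner/repo (github.com assumed)
    re.compile(rf"^[A-Za-z0-9.-]+/{NAME}/{NAME}$"),  # host/owner/repo (full URL / on-prem)
]

def is_valid_repo(spec: str) -> bool:
    """Return True if the repository spec matches a supported format."""
    return any(p.fullmatch(spec) for p in PATTERNS)

assert is_valid_repo("zndx/gaius-acp")
assert is_valid_repo("github.com/owner/repo")
assert is_valid_repo("github.example.com/org/repo")
assert not is_valid_repo("owner//repo")
assert not is_valid_repo("owner/repo; rm -rf /")
```

Validating the shape before any network call means malformed or injected strings never reach the gh CLI.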

Layer 1: HOCON Allowlist

Repositories must be explicitly listed in ~/.config/gaius/acp.conf:

acp {
  github {
    allowed_repos = ["zndx/gaius-acp"]
    require_private = true
    verify_on_each_operation = true
    cache_visibility_seconds = 300
  }
}

Glob patterns are supported: "zndx/*" allows any repo under the zndx org. An empty allowlist means no repositories are allowed (#ACP.SEC.00000004.NOTCONFIGURED).
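The allowlist check can be sketched with stdlib glob matching. This is illustrative only; the real check reads ~/.config/gaius/acp.conf, and note that fnmatch's `*` also matches `/`, so real patterns may need tighter handling:

```python
# Sketch of Layer 1 allowlist matching with glob support (illustrative;
# the real check reads ~/.config/gaius/acp.conf).
from fnmatch import fnmatch

def repo_allowed(repo: str, allowed: list[str]) -> bool:
    if not allowed:
        # Empty allowlist: nothing is allowed (#ACP.SEC.00000004.NOTCONFIGURED)
        return False
    return any(fnmatch(repo, pattern) for pattern in allowed)

assert repo_allowed("zndx/gaius-acp", ["zndx/gaius-acp"])
assert repo_allowed("zndx/other", ["zndx/*"])
assert not repo_allowed("evil/repo", ["zndx/*"])
assert not repo_allowed("zndx/gaius-acp", [])
```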

Layer 2: Visibility Verification

The GitHubSecurityGuard verifies repository visibility via gh api repos/{owner}/{repo} --jq .visibility. Only "private" passes; "public" and "internal" are rejected with #ACP.SEC.00000003.NOTPRIVATE.

Visibility is cached for 5 minutes (configurable via cache_visibility_seconds) and re-verified on each operation when verify_on_each_operation = true.
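The cache behavior can be sketched as a simple TTL map. This is an illustrative reimplementation, not the actual GitHubSecurityGuard logic:

```python
# Sketch of the visibility cache (illustrative; actual logic lives in
# GitHubSecurityGuard). Results are kept for cache_visibility_seconds and
# re-fetched once an entry expires.
import time

class VisibilityCache:
    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._entries: dict[str, tuple[str, float]] = {}

    def get(self, repo: str):
        entry = self._entries.get(repo)
        if entry is None:
            return None
        visibility, stored_at = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._entries[repo]  # expired: force re-verification
            return None
        return visibility

    def put(self, repo: str, visibility: str):
        self._entries[repo] = (visibility, time.monotonic())

cache = VisibilityCache(ttl_seconds=300)
cache.put("zndx/gaius-acp", "private")
assert cache.get("zndx/gaius-acp") == "private"
```

The short TTL bounds the window in which a repository flipped to public could still be treated as private.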

Layer 3: Content Sanitization

Before including any content in GitHub issues, sanitize_issue_content() redacts secrets and strips prompt injection markers. See Content Sanitization for details.

Issue titles must start with the [HEALTH-FIX] prefix and are limited to 200 characters.

Attack Vectors Mitigated

| Attack | Mitigation |
|---|---|
| Info leak via public repo | Layer 2: visibility verification on every operation |
| Prompt injection from issues | Layer 1: explicit allowlist prevents attacker-controlled repos |
| Credential exposure in issues | Layer 3: automatic secret redaction |
| Visibility change attack | Re-verify on each operation (cache TTL 5 min) |
| Generated code bypass | Security is mandatory – no parameter to disable |

Usage

from gaius.acp.security import GitHubSecurityGuard

guard = GitHubSecurityGuard.from_config()
await guard.verify_repo("zndx/gaius-acp")  # Raises on failure

Source

src/gaius/acp/security.py

Content Sanitization

Before any content is included in GitHub issues (via ACP escalation), the sanitize_issue_content() function automatically redacts secrets and strips prompt injection markers.

Secret Patterns

The following patterns are detected and replaced with [REDACTED_*] tags:

| Pattern | Example | Replacement |
|---|---|---|
| Anthropic API keys | sk-ant-api03-... | [REDACTED_ANTHROPIC_KEY] |
| OpenAI keys | sk-proj-..., sk-... | [REDACTED_OPENAI_KEY] |
| GitHub PAT | ghp_... | [REDACTED_GH_PAT] |
| GitHub OAuth | gho_... | [REDACTED_GH_OAUTH] |
| GitHub App | ghs_... | [REDACTED_GH_APP] |
| GitHub Refresh | ghr_... | [REDACTED_GH_REFRESH] |
| AWS Access Key | AKIA... (20 chars) | [REDACTED_AWS_KEY] |
| Bearer tokens | Bearer <token> | Bearer [REDACTED_BEARER] |
| Generic secrets | api_key=, token=, password=, secret= | [REDACTED] |

Pattern order matters: specific patterns (e.g., sk-ant-) are matched before generic ones (e.g., sk-) to ensure correct replacement labels.
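The ordering requirement can be demonstrated with two illustrative patterns (the full set lives in sanitize_issue_content); applying the specific pattern first prevents the generic one from mislabeling an Anthropic key:

```python
# Sketch showing why pattern order matters (illustrative patterns only).
import re

ORDERED_PATTERNS = [
    (re.compile(r"sk-ant-[A-Za-z0-9_\-]+"), "[REDACTED_ANTHROPIC_KEY]"),  # specific first
    (re.compile(r"sk-[A-Za-z0-9_\-]+"), "[REDACTED_OPENAI_KEY]"),         # generic last
]

def redact(text: str) -> str:
    for pattern, label in ORDERED_PATTERNS:
        text = pattern.sub(label, text)
    return text

print(redact("key sk-ant-api03-abc and key sk-proj-xyz"))
# key [REDACTED_ANTHROPIC_KEY] and key [REDACTED_OPENAI_KEY]
```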

Prompt Injection Markers

The following injection patterns are replaced with [SANITIZED]:

  • LLM role markers: <|system|>, <|user|>, <|assistant|>, [INST], <<SYS>>
  • Override attempts: IGNORE PREVIOUS INSTRUCTIONS, SYSTEM OVERRIDE:, ADMIN MODE:
  • Known bypass patterns: JAILBREAK, DAN MODE, DEVELOPER MODE:

All matching is case-insensitive.

Usage

from gaius.acp.security import sanitize_issue_content

raw = "Error with key sk-ant-api03-abc123... calling endpoint"
safe = sanitize_issue_content(raw)
# "Error with key [REDACTED_ANTHROPIC_KEY] calling endpoint"

Issue Title Validation

Issue titles are validated separately via validate_issue_title():

  • Must start with [HEALTH-FIX] prefix
  • Truncated to 200 characters
  • Control characters stripped
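The three rules can be sketched as follows. This is an illustrative reimplementation of validate_issue_title, not the actual function:

```python
# Sketch of validate_issue_title behavior (illustrative reimplementation).
def validate_title(title: str) -> str:
    cleaned = "".join(ch for ch in title if ch.isprintable())  # strip control chars
    if not cleaned.startswith("[HEALTH-FIX]"):
        raise ValueError("title must start with [HEALTH-FIX]")
    return cleaned[:200]  # truncate to 200 characters

print(validate_title("[HEALTH-FIX] reasoning endpoint\x07 unhealthy"))
# [HEALTH-FIX] reasoning endpoint unhealthy
```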

Source

src/gaius/acp/security.py (the sanitize_issue_content and validate_issue_title functions).

Database

Gaius uses PostgreSQL on port 5444 with database name zndx_gaius (not gaius).

Connection

| Parameter | Value |
|---|---|
| Host | localhost |
| Port | 5444 |
| Database | zndx_gaius |
| User | gaius |
| Password | gaius |
| URL | postgres://gaius:gaius@localhost:5444/zndx_gaius?sslmode=disable |

Programmatic Access

Always use the centralized config function – never hardcode connection parameters:

from gaius.core.config import get_database_url

url = get_database_url()  # Single source of truth

Delegates exist in storage/database.py, storage/grid_state.py, inference/routing_analytics.py, and storage/profile_ops.py – all call through to gaius.core.config.get_database_url().

CLI Access

PGPASSWORD=gaius psql -h localhost -p 5444 -U gaius -d zndx_gaius

Connection Pooling

The storage/database.py module manages a global asyncpg connection pool (min 1, max 10 connections) via get_pool():

from gaius.storage.database import get_pool

pool = await get_pool()
async with pool.acquire() as conn:
    rows = await conn.fetch("SELECT ...")

Schemas

The database uses four schemas to organize data. See Schema Design for details.

| Schema | Purpose |
|---|---|
| public | Core operational tables (cards, agents, evolution, health) |
| meta | Analytics views for Metabase dashboards |
| collections | Curated content for the public landing page |
| bases | Feature store registry and Iceberg catalog |

Extensions

| Extension | Purpose |
|---|---|
| pg_cron | Scheduled maintenance |
| age (Apache AGE) | Graph queries for lineage |
| citext | Case-insensitive text columns |

Migrations

Schema migrations live in db/migrations/ and are ordered by timestamp prefix (e.g., 20251130000001_initial_schema.sql). The full schema dump is at db/schema.sql.

Schema Design

The PostgreSQL database (zndx_gaius) uses four schemas to organize data by domain.

Public Schema

The default public schema holds core operational tables:

Content Pipeline

  • feed_sources – RSS/API feed configurations with fetch intervals
  • fetch_jobs – Scheduled and completed fetch job records
  • content_items – Raw content items with KB path references
  • articles – Curated articles with frontmatter (keywords, news queries)

Agent System

  • agent_evaluations – Evaluation scores by agent and evaluator (local/xai)
  • evolution_cycles – Training cycle records (success, improvement, duration)
  • agent_versions – Version history for agent configurations
  • held_out_queries – Reserved evaluation queries not used in training
  • routing_decisions – Inference routing analytics (fallback/mismatch tracking)

Health and Observability

  • health_incidents – HealthObserver incident records with FMEA scores
  • healing_events – Self-healing attempt logs (tier, success, duration)
  • fmea_catalog – Failure Mode and Effects Analysis registry
  • scheduler_jobs – Async job queue for the inference scheduler

State

  • grid_state – Persisted 19x19 grid positions and overlays
  • cognition_memory – Self-observation and thought chain storage
  • research_state / research_progress – Active research thread tracking

Meta Schema

The meta schema provides materialized analytics tables for Metabase:

  • meta.dataset_catalog – Deduplicated dataset registry from lineage events
  • meta.job_catalog – Job registry with run/success/failure counts

Populated from OpenLineage events for data provenance dashboards.

Collections Schema

The collections schema manages curated content for the public landing page:

  • collections.collections – Named collections with featured flags
  • collections.collection_cards – Cards assigned to collections with ordering
  • collections.card_summaries – Generated card summary text

Bases Schema

The bases schema implements a feature store registry:

  • bases.bases – Feature store definitions (type: feature_group, model, dataset)
  • bases.base_versions – Versioned snapshots with Iceberg table references
  • bases.entity_history – Entity-level change tracking

Graph Extension

Apache AGE (ag_catalog schema) provides graph query capabilities for lineage traversal using Cypher syntax. The lineage graph connects datasets to jobs via read/write edges.

Source

Full schema dump: db/schema.sql. Migrations: db/migrations/.

pg_cron Jobs

Gaius uses the pg_cron extension for scheduled database maintenance. Jobs are defined in SQL migrations and run inside PostgreSQL without external schedulers.

Core Jobs

| Job | Schedule | Purpose |
|---|---|---|
| check-due-fetches | Every 15 min | Check feed_sources for overdue fetches and create fetch_jobs records |
| cleanup-fetch-jobs | Sunday 3 AM | Remove old fetch job records (keep last 100 per source) |
| archive-stale-content | 1st of month, 4 AM | Mark content items older than 90 days as archived |

How It Works

The schedule_due_fetches() function checks each active feed_source against its configured fetch_interval_minutes. When a source is due, it creates a fetch_jobs record with status = 'scheduled'. Python workers poll this table and execute the actual fetch.

-- Example: schedule a fetch for a specific source
SELECT schedule_fetch('arxiv-cs-ai');

-- Check all due sources
SELECT * FROM schedule_due_fetches();

Additional Scheduled Tasks

Beyond the core jobs, several migrations add domain-specific cron schedules:

| Migration | Job | Schedule |
|---|---|---|
| 20251214000001_evolution_periodic_tasks | Evolution cycle triggers | Periodic |
| 20251223000001_theta_consolidation_cron | Theta memory consolidation | Periodic |
| 20251228000002_triage_cron_jobs | Content triage | Periodic |
| 20260202200000_landing_page_cron | Landing page card publishing | Periodic |
| 20260203100000_scheduled_task_notify | NOTIFY on scheduled task changes | Event-driven |

The scheduled_task_notify migration uses PostgreSQL LISTEN/NOTIFY to wake the engine watchdog when tasks are due, avoiding polling overhead.
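The wake-on-notify pattern can be illustrated with asyncio primitives standing in for the database channel. This is a conceptual sketch, not the actual watchdog code, which listens on a PostgreSQL channel via the database driver:

```python
# Conceptual sketch of event-driven wakeup vs. polling, with asyncio.Event
# standing in for a PostgreSQL LISTEN/NOTIFY channel.
import asyncio

async def watchdog(wake: asyncio.Event, results: list):
    # Sleep until notified instead of waking on a polling timer.
    await wake.wait()
    results.append("task executed")

async def main():
    wake = asyncio.Event()
    results = []
    task = asyncio.create_task(watchdog(wake, results))
    await asyncio.sleep(0.01)   # watchdog idles, consuming nothing
    wake.set()                  # NOTIFY: a scheduled task is due
    await task
    return results

print(asyncio.run(main()))  # ['task executed']
```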

Monitoring

The v_source_status view provides at-a-glance health for all feed sources:

SELECT name, status, total_items, pending_jobs FROM v_source_status;

Status values: ok, overdue, never (never fetched).

Source

Core jobs: db/migrations/20251130000003_pg_cron_jobs.sql. Additional schedules are spread across domain-specific migrations in db/migrations/.

Getting Started

Gaius is a CLI-first terminal interface for navigating complex, graph-oriented data domains. It renders high-dimensional embeddings and topological structures onto a constrained 19x19 grid, transforming abstract complexity into spatial intuition.

There are three ways to interact with Gaius:

  • TUI – a full terminal interface with grid, panels, and keyboard navigation (uv run gaius)
  • CLI – a non-interactive command runner for scripting and automation (uv run gaius-cli)
  • MCP – 163 tools exposed to Claude Code and other MCP-compatible clients (uv run gaius-mcp)

Quick Path

If you already have devenv and Nix installed, you can be running in under a minute:

cd gaius
devenv shell
uv sync
devenv processes up -d
uv run gaius

This starts the platform services (PostgreSQL, Qdrant, gRPC engine, NiFi) and launches the TUI. You will see a 19x19 grid with a cursor at the center.

If this is your first time:

  1. Installation – prerequisites and environment setup
  2. First Launch – what happens when you start Gaius and what to try first

Once you are comfortable with the basics:

  • The TUI – understanding the five interface components
  • Navigation – cursor movement, view modes, and workflow patterns
  • The CLI – non-interactive commands for scripting
  • MCP Integration – connecting Gaius to Claude Code

Three Interfaces, One Engine

All three interfaces communicate with the same gRPC engine on port 50051. A /health command run from the CLI produces the same result as the health_observer_status MCP tool or pressing / and typing health in the TUI. Choose the interface that fits your context: TUI for exploration, CLI for automation, MCP for AI-assisted workflows.

Installation

Gaius uses devenv (built on Nix) for reproducible development environments and uv for Python dependency management.

Prerequisites

| Dependency | Purpose | Install |
|---|---|---|
| Nix | Package manager | nix.dev |
| devenv | Development environment | nix profile install github:cachix/devenv |
| uv | Python package manager | Provided by devenv |
| Just | Task runner | Provided by devenv |

You do not need to install Python, PostgreSQL, or any other runtime dependency manually. Nix provides everything.

Environment Setup

Clone the repository and enter the devenv shell:

git clone <repo-url>
cd gaius
devenv shell

The first devenv shell invocation downloads and caches all Nix dependencies. Subsequent invocations start in under a second.

Inside the shell, install Python dependencies:

uv sync

For optional features, use extras:

uv sync --extra tda      # Topological data analysis (giotto-tda)
uv sync --extra swarm    # Multi-agent support (langchain)

Starting Platform Services

Gaius depends on several backend services: PostgreSQL, Qdrant, the gRPC engine, and others. Start them all with:

devenv processes up -d

To stop all services:

devenv processes down

To verify everything is running, use the Just task runner:

just --list              # Show all available tasks
just restart-clean       # Full clean restart if something is stuck

Database

PostgreSQL runs on port 5444 with a database named zndx_gaius:

PGPASSWORD=gaius psql -h localhost -p 5444 -U gaius -d zndx_gaius

The database name is zndx_gaius, not gaius. The connection URL used internally is:

postgres://gaius:gaius@localhost:5444/zndx_gaius?sslmode=disable

Verifying the Installation

Once services are running, confirm the gRPC engine is healthy:

uv run gaius-cli --cmd "/health" --format json

If this returns a JSON health report, the installation is complete. If it fails, try just restart-clean and check the process logs in .devenv/processes.log.

First Launch

This page describes what you will see when you first start Gaius, and what to try immediately.

Starting the TUI

From inside a devenv shell with services running:

uv run gaius

The terminal fills with the Gaius interface. At its center is a 19x19 grid – the MainGrid – with a cursor marker at position K10.

What You See

The default layout has three regions:

  • Left panel – a FileTree showing the knowledge base as a directory structure
  • Center – the 19x19 MainGrid with three 9x9 MiniGrid projections below it
  • Right panel – a ContentPanel that shows context for the current selection

The bottom of the screen has a command bar. The cursor appears as a distinct marker on the grid.

First Steps

Move the cursor. Press h, j, k, l to move left, down, up, right. The cursor moves across the grid. The MiniGrid projections and ContentPanel update to reflect the new position.

Check your bearings. Press ? to display help in the ContentPanel. This shows the available key bindings and a summary of the current state.

Cycle the view. Press v to switch between view modes (Go, Theta, Swarm). Each mode renders the grid data differently.

Cycle overlays. Press o to layer additional information onto the grid: topology, geometry, dynamics, or agent positions.

Toggle panels. Press [ to toggle the left panel, ] to toggle the right panel, or \ to toggle both. Hiding panels maximizes grid space.

Enter a command. Press / to focus the command bar, then type health and press Enter. This runs the health diagnostic and displays system status in the ContentPanel.

First CLI Check

Open a second terminal (also in devenv shell) and try:

uv run gaius-cli --cmd "/health" --format json

This runs the same health check non-interactively and prints JSON output. The CLI and TUI connect to the same engine, so results are identical.

If Something Looks Wrong

If the grid is empty or services are not responding:

just restart-clean

This performs a full clean restart of all platform services. After it completes, relaunch with uv run gaius.

Next Steps

  • The TUI – understand the five components of the interface
  • Navigation – learn cursor movement and workflow patterns
  • Key Bindings – complete keyboard reference

The TUI

Gaius renders a full terminal interface built on the Textual framework. The interface draws inspiration from Bloomberg Terminal (information density), Plan 9’s Acme (everything is a file), and CAD orthographic views (multiple synchronized projections).

Launch the TUI with:

uv run gaius

Five Components

The interface is composed of five primary widgets:

MainGrid

The 19x19 grid occupies the center of the screen. It is the primary workspace – a spatial representation of high-dimensional data projected onto a Go board layout. Grid positions correspond to embedded data points, and the cursor indicates your current focus.

The grid supports three view modes (cycled with v): Go, Theta, and Swarm. Each mode changes how the underlying data is rendered. Four overlay modes (cycled with o) layer additional information on top: topology, geometry, dynamics, and agent positions.

MiniGridPanel

Below the MainGrid sit three 9x9 orthographic projections. These are CAD-style views that show the data from different angles – like top, front, and side views of a 3D object. They update automatically as you move the cursor, providing spatial context around your current position.

FileTree (Left Panel)

The left panel presents a Plan 9-inspired file tree where knowledge base entries, agents, and system state are navigated as a directory structure. Agents appear as files under /agents/, and KB entries are organized by domain. Toggle visibility with [.

ContentPanel (Right Panel)

The right panel displays detailed content for the currently selected item: file contents, agent output, position context, health reports, or command results. It is the primary output area for slash commands. Toggle visibility with ].

CommandInput (Bottom Bar)

The bottom command bar accepts slash commands. Press / to focus it, type a command (e.g., health, evolve status, gpu status), and press Enter. Press Escape to cancel. The command bar supports history navigation with up/down arrows and tab completion.

Layout Flexibility

Toggle panels to adjust the layout to your task:

  • Full layout: all panels visible – maximum context
  • Grid-focused: press \ to hide both panels – maximum grid space
  • Research mode: hide left panel with [ – more room for content output
  • Navigation mode: hide right panel with ] – focus on the file tree and grid

Design Principles

The TUI is keyboard-first. Every action is reachable without a mouse. Information density is high by design – the interface shows as much relevant data as possible without requiring navigation to separate screens. Modes and overlays let you shift perspective without losing your place.

Navigation & Modes

Gaius draws inspiration from modal editors like Vim and compositional systems like Plan 9’s Acme. Navigation is keyboard-driven, modes provide context, and every operation is reversible.

Gaius uses modes to provide context-sensitive behavior. This is not complexity – it is power through focus.

  • Normal Mode (default): navigate, observe, toggle views
  • Command Mode: enter slash commands via the command bar (/)

Cursor Navigation

The cursor is your focus point on the grid. It determines what position commands act upon, the center of local context, and the reference point for the MiniGrid projections.

Basic Movement

       k
       |
   h --+-- l
       |
       j

Vim-style navigation: h/j/k/l for left/down/up/right. These keys sit on the home row so your fingers never leave typing position.

Tenuki

Press t to jump to the point of highest strategic interest – a concept borrowed from Go, where tenuki means “playing elsewhere.” The engine evaluates all grid positions and moves your cursor to the most strategically relevant one.

View Modes

Press v to cycle through visualization modes:

Go Mode

Traditional Go stones on intersections. Black and white stones mark occupied positions. Empty intersections show as dots.

Theta Mode

Information density visualization named after theta waves, which facilitate memory consolidation. This mode renders allocation intensity and data density across the grid.

Swarm Mode

Agent-centric view showing multi-agent positions and activity across the grid.

Overlay Modes

Press o to cycle overlays. Overlays add visual information on top of the current view mode without changing the base rendering:

| Overlay | Key concept | What it shows |
|---|---|---|
| None | Clean slate | Base grid only |
| Topology | Persistent homology | H0/H1/H2 features (components, loops, voids) |
| Geometry | Curvature | Semantic boundaries vs. interiors |
| Dynamics | Gradient field | Direction of semantic change, divergence |
| Agents | Team state | Agent positions on the grid |

See Overlays for detailed interpretation guidance.

Iso View Modes

Press i to cycle through Iso view modes, which change the interpretation of the MiniGrid projections below the main grid. These provide different mathematical lenses on the same data.

Panel Management

| Key | Action |
|---|---|
| [ | Toggle left panel (FileTree) |
| ] | Toggle right panel (ContentPanel) |
| \ | Toggle both panels simultaneously |

Hide panels to maximize grid visibility. Restore them to review details and navigate the knowledge base.

Graph View

Press g to cycle the center panel between modes. This toggles between the standard grid view and a graph/wiki-link visualization, providing different perspectives on the same underlying data.

Flow Patterns

Exploration Flow

  1. Navigate with hjkl to survey the grid
  2. Cycle overlays (o) to see different data layers
  3. Toggle candidates (c) to see suggested positions
  4. Press t for tenuki to jump to high-interest points

Analysis Flow

  1. Press / to enter command mode
  2. Run /health to check system state
  3. Use overlays to compare topology, geometry, and dynamics
  4. Review output in the ContentPanel

Focused Flow

  1. Hide panels (\) for maximum grid space
  2. Navigate to a region of interest
  3. Switch overlays to study different dimensions
  4. Restore panels when you need detailed context

Panels

Gaius has two side panels flanking the central grid: the FileTree on the left and the ContentPanel on the right. Both can be toggled independently or together.

Toggle Controls

| Key | Action |
|---|---|
| [ | Toggle left panel (FileTree) |
| ] | Toggle right panel (ContentPanel) |
| \ | Toggle both panels simultaneously |

When a panel is hidden, the grid expands to fill the available space.

Left Panel: FileTree

The FileTree presents a Plan 9-inspired hierarchical view of the system. Everything is navigable as if it were a filesystem:

/
  agents/
    cognition/
    evolution/
    health/
  kb/
    current/
      projects/
      content/
    scratch/
  state/

Agents are represented as files under /agents/. Knowledge base entries appear under /kb/. System state is exposed under /state/. This design follows the Plan 9 philosophy where everything – processes, data, system state – is accessible through a uniform file interface.

Selecting an entry in the FileTree updates the ContentPanel on the right to show its contents.

Right Panel: ContentPanel

The ContentPanel is the primary output area. It displays:

  • File contents – when a FileTree entry is selected
  • Command output – results from slash commands (e.g., /health, /gpu status)
  • Position context – information about the current grid position
  • Help – key binding reference when ? is pressed
  • Agent output – responses from agent operations

The ContentPanel renders markdown-formatted text, tables, and structured data. It scrolls vertically for long output.

Layout Strategies

Different tasks benefit from different panel configurations:

Full context (default): both panels visible. Use when you need to navigate the knowledge base and see detailed output simultaneously.

Grid focus: press \ to hide both panels. Use when studying spatial patterns, overlay composition, or doing pure grid exploration.

Research mode: hide the left panel with [. The grid and ContentPanel share the screen, giving more room for command output and detailed content.

Navigation mode: hide the right panel with ]. The FileTree and grid share the screen, useful when browsing the knowledge base structure without needing detailed content.

Panel Persistence

Panel visibility state persists during your session. If you hide a panel and run a command, the panel stays hidden. Toggle it back when you need it.

Overlays & Visualization

Overlays are Gaius’s mechanism for layering multiple data dimensions onto a single grid. Understanding overlay composition is key to effective visual analysis.

Overlay Philosophy

A grid has 361 cells. Naively, that is one data point per cell. But complex domains have many dimensions. Overlays solve this by:

  1. Layering: multiple data types occupy the same space
  2. Cycling: focus shifts between layers via the o key
  3. Compositing: some layers blend (e.g., density + markers)

Available Overlays

Press o to cycle through overlay modes. The current set is based on differential geometry concepts:

None

The cleanest view. Shows only:

  • Base grid (view-mode-specific symbols)
  • Cursor position
  • Candidate markers (a-i) if toggled with c

Use this for uncluttered observation of the base state.

Topology

Displays persistent homology features at three scales:

  • H0: connected components – clusters of related data points
  • H1: loops – cycles in the embedding space (feedback loops, circular dependencies)
  • H2: voids – higher-dimensional cavities (structural gaps)

Topological features that persist across scales are significant. Transient features are noise. The overlay highlights those that survive, revealing the true shape of the data.

Geometry

Curvature heatmap showing semantic boundaries versus interiors. High curvature regions mark transitions between conceptual domains. Low curvature indicates the interior of a coherent cluster. This overlay helps identify where one topic ends and another begins.

Dynamics

Gradient vector field showing the direction and magnitude of semantic change. Arrows or indicators point toward regions of increasing density or relevance. Divergence patterns reveal sources (generating new content) and sinks (absorbing attention). This overlay captures how the data landscape is evolving.

Agents

Agent positions projected from embedding space onto the grid. Each active agent occupies a position determined by its current focus within the data. Watch for:

  • Clustering: agents in agreement, converging on the same region
  • Scattering: genuine uncertainty or broad exploration
  • Opposition: agents on opposite sides of the grid (tension, disagreement)
  • Isolation: a single agent in a region (unique insight worth investigating)

Reading Composite Views

When multiple features occupy a cell, priority determines display:

  1. Overlay markers – highest priority
  2. Candidate letters (a-i)
  3. Cursor
  4. Stones/density (view-mode symbols)
  5. Empty (dot) – lowest priority
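This resolution is a first-match scan down the priority list. A hedged sketch of how it might look; the layer names and dict shape are illustrative, not Gaius's actual data model:

```python
# Priority order from the list above, highest first ("empty" is the fallback).
PRIORITY = ["overlay", "candidate", "cursor", "stone"]

def resolve_cell(layers: dict) -> str:
    """Return the glyph to draw: the highest-priority layer with content wins."""
    for layer in PRIORITY:
        glyph = layers.get(layer)
        if glyph is not None:
            return glyph
    return "."  # empty cell

# A cell with no overlay marker but a candidate letter, the cursor, and a stone:
print(resolve_cell({"candidate": "a", "cursor": "+", "stone": "o"}))  # prints "a"
```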

Overlay as Situational Awareness

Each overlay provides a different “sense”:

  • None: clean visual baseline
  • Topology: structural awareness (what shapes exist)
  • Geometry: boundary awareness (where things change)
  • Dynamics: momentum awareness (where things are going)
  • Agents: team state awareness (where agents are looking)

Cycling overlays is like shifting attention between modalities – a form of augmented situational awareness. The OODA loop pattern (Observe, Orient, Decide, Act) maps naturally: observe with None, orient with Topology or Geometry, decide based on Dynamics, act on Agent positions.

Combining with View Modes

Overlays compose with view modes (v). A Topology overlay on Go mode shows homology features atop stone positions. The same overlay on Theta mode shows features atop density shading. Experiment with combinations to find the perspective that reveals what you need.

Key Bindings

Complete reference for all keyboard shortcuts in the Gaius TUI.

Navigation

| Key | Action | Description |
|-----|--------|-------------|
| h | Move left | Move cursor one position left |
| j | Move down | Move cursor one position down |
| k | Move up | Move cursor one position up |
| l | Move right | Move cursor one position right |
| t | Tenuki | Jump to point of highest strategic interest |

View Controls

| Key | Action | Description |
|-----|--------|-------------|
| v | Cycle view | Cycle through view modes: Go, Theta, Swarm |
| o | Cycle overlay | Cycle through overlays: None, Topology, Geometry, Dynamics, Agents |
| i | Cycle iso | Cycle Iso view modes for MiniGrid projections |
| c | Toggle candidates | Show/hide candidate markers (a-i) at suggested positions |

Panel Controls

| Key | Action | Description |
|-----|--------|-------------|
| [ | Toggle left | Show/hide the FileTree panel |
| ] | Toggle right | Show/hide the ContentPanel |
| \ | Toggle both | Show/hide both panels simultaneously |

Commands and Help

| Key | Action | Description |
|-----|--------|-------------|
| / | Command mode | Focus the command bar to enter a slash command |
| ? | Help | Display help and key reference in the ContentPanel |

Graph and Evolution

| Key | Action | Description |
|-----|--------|-------------|
| g | Graph | Cycle center panel between grid and graph views |
| e | Evolution | Show evolution panel directly |

Notes

| Key | Action | Description |
|-----|--------|-------------|
| Ctrl+n | New note | Create a new Zettelkasten note and focus the editor |
| Ctrl+z | Zoom editor | Toggle editor zoom (tmux-style) |

Application

| Key | Action | Description |
|-----|--------|-------------|
| q | Quit hint | Display quit instructions (use /q or /exit to actually quit) |

Command Bar Keys

When the command bar is focused (after pressing /):

| Key | Action |
|-----|--------|
| Enter | Execute command |
| Escape | Cancel and return to normal mode |
| Up | Previous command in history |
| Down | Next command in history |
| Tab | Auto-complete command |

Design Notes

Key bindings follow Vim conventions for navigation (hjkl) and use mnemonic single keys for mode cycling (v for view, o for overlay, c for candidates). Panel toggles use bracket keys ([, ], \) which are adjacent on a standard keyboard. The / key enters command mode, matching the slash-command convention used by Claude Code and similar tools.

The CLI

Gaius provides a non-interactive command-line interface through gaius-cli. It executes the same slash commands available in the TUI but returns structured output suitable for scripting, piping, and automation.

Basic Usage

uv run gaius-cli --cmd "/command" --format json

The --cmd flag specifies the slash command to run (with or without the leading /). The --format flag controls output format – json produces machine-readable output, while the default produces human-readable text.

Examples

Check system health:

uv run gaius-cli --cmd "/health" --format json

Query GPU and endpoint status:

uv run gaius-cli --cmd "/gpu status" --format json

Check evolution state:

uv run gaius-cli --cmd "/evolve status" --format json

View the current application state:

uv run gaius-cli --cmd "/state" --format json

Available Commands

Gaius has 63 slash commands covering health diagnostics, agent management, inference control, knowledge base operations, evolution, visualization, and observability. The full command reference is in the CLI Commands section.

Common command categories:

| Prefix | Domain | Example |
|--------|--------|---------|
| /health | System health | /health, /health fix engine |
| /gpu | GPU/endpoints | /gpu status, /gpu cleanup |
| /evolve | Agent evolution | /evolve status, /evolve trigger |
| /kb | Knowledge base | /kb search <query> |
| /render | Visualization | /render cards |
| /observe | Observability | /observe metrics |

Connection to the Engine

The CLI connects to the same gRPC engine (port 50051) as the TUI. Both interfaces are thin clients that send commands to the engine and display results. If the engine is not running, the CLI will report a connection error – start services with devenv processes up -d or just restart-clean.

Command Patterns

Common patterns for working with gaius-cli effectively. The CLI produces structured JSON output that integrates naturally with standard Unix tools.

JSON Output and jq

Most commands support --format json for machine-readable output. Pipe through jq to extract specific fields:

# Get endpoint names and statuses
uv run gaius-cli --cmd "/gpu status" --format json | jq '.data.endpoints[] | {name, status}'

# Extract just the health categories that are not OK
uv run gaius-cli --cmd "/health" --format json | jq '.data.checks[] | select(.status != "ok")'

# Get the current evolution generation number
uv run gaius-cli --cmd "/evolve status" --format json | jq '.data.generation'

Polling for Status Changes

When waiting for an operation to complete, poll in a loop:

# Watch endpoints transition from STARTING to HEALTHY after a restart
for i in $(seq 1 15); do
    sleep 10
    uv run gaius-cli --cmd "/gpu status" --format json | \
        jq -r '.data.endpoints[] | "\(.name): \(.status)"'
    echo "---"
done

Comparing Before and After

Capture state before and after an operation:

# Snapshot before
uv run gaius-cli --cmd "/health" --format json > /tmp/health-before.json

# Run an operation
uv run gaius-cli --cmd "/health fix engine" --format json

# Snapshot after
uv run gaius-cli --cmd "/health" --format json > /tmp/health-after.json

# Diff
diff <(jq -S . /tmp/health-before.json) <(jq -S . /tmp/health-after.json)

Batch Operations

Run multiple commands in sequence:

# Check everything in one pass
for cmd in "/health" "/gpu status" "/evolve status"; do
    echo "=== $cmd ==="
    uv run gaius-cli --cmd "$cmd" --format json | jq '.data'
    echo
done

Conditional Logic

Use jq exit codes to drive decisions:

# Only proceed if all endpoints are healthy
if uv run gaius-cli --cmd "/gpu status" --format json | \
    jq -e '.data.endpoints | all(.status == "HEALTHY")' > /dev/null 2>&1; then
    echo "All endpoints healthy, proceeding"
    uv run gaius-cli --cmd "/evolve trigger" --format json
else
    echo "Not all endpoints healthy, aborting"
    exit 1
fi

Timestamp and Logging

Add timestamps for log correlation:

uv run gaius-cli --cmd "/health" --format json | \
    jq --arg ts "$(date -Iseconds)" '. + {queried_at: $ts}'

Error Handling

The CLI returns non-zero exit codes on failure. Check both the exit code and the response:

if ! output=$(uv run gaius-cli --cmd "/gpu status" --format json 2>&1); then
    echo "CLI failed: $output"
    exit 1
fi
echo "$output" | jq '.data'

Scripting

The gaius-cli is designed for non-interactive use in shell scripts. It connects to the gRPC engine, executes a command, prints output, and exits. This makes it suitable for cron jobs, monitoring scripts, and automation pipelines.

Health Monitoring Script

A script that checks system health and sends alerts on failures:

#!/usr/bin/env bash
set -euo pipefail

LOG="/var/log/gaius-health.log"

health=$(uv run gaius-cli --cmd "/health" --format json)
failed=$(echo "$health" | jq '[.data.checks[] | select(.status != "ok")] | length')

if [ "$failed" -gt 0 ]; then
    echo "$(date -Iseconds) ALERT: $failed health checks failing" >> "$LOG"
    echo "$health" | jq '.data.checks[] | select(.status != "ok")' >> "$LOG"
fi

Periodic Data Collection

Capture endpoint metrics at regular intervals for trend analysis:

#!/usr/bin/env bash
set -euo pipefail

OUTDIR="$HOME/gaius-metrics/$(date +%Y-%m-%d)"
mkdir -p "$OUTDIR"

TIMESTAMP=$(date +%H%M%S)

uv run gaius-cli --cmd "/gpu status" --format json > "$OUTDIR/${TIMESTAMP}_gpu.json"
uv run gaius-cli --cmd "/health" --format json > "$OUTDIR/${TIMESTAMP}_health.json"
uv run gaius-cli --cmd "/evolve status" --format json > "$OUTDIR/${TIMESTAMP}_evolve.json"

Run via cron every 5 minutes:

*/5 * * * * cd /path/to/gaius && devenv shell -- bash scripts/collect-metrics.sh

Endpoint Readiness Gate

Wait for all endpoints to be healthy before proceeding with a downstream operation:

#!/usr/bin/env bash
set -euo pipefail

MAX_WAIT=300  # 5 minutes
INTERVAL=10
elapsed=0

echo "Waiting for endpoints to become healthy..."
while [ $elapsed -lt $MAX_WAIT ]; do
    if uv run gaius-cli --cmd "/gpu status" --format json | \
        jq -e '.data.endpoints | all(.status == "HEALTHY")' > /dev/null 2>&1; then
        echo "All endpoints healthy after ${elapsed}s"
        exit 0
    fi
    sleep $INTERVAL
    elapsed=$((elapsed + INTERVAL))
done

echo "Timed out waiting for endpoints after ${MAX_WAIT}s"
exit 1

Evolution Report

Generate a summary of the current evolution state:

#!/usr/bin/env bash
set -euo pipefail

echo "=== Gaius Evolution Report $(date -Iseconds) ==="
echo

echo "## Agent Status"
uv run gaius-cli --cmd "/evolve status" --format json | \
    jq -r '.data | "Generation: \(.generation)\nActive agents: \(.active_agents)"'

echo
echo "## Endpoint Status"
uv run gaius-cli --cmd "/gpu status" --format json | \
    jq -r '.data.endpoints[] | "  \(.name): \(.status)"'

echo
echo "## Health Summary"
uv run gaius-cli --cmd "/health" --format json | \
    jq -r '.data.checks[] | "  \(.name): \(.status)"'

Tips for Robust Scripts

  • Always use set -euo pipefail at the top of scripts
  • Check that the engine is reachable before running a batch of commands
  • Use --format json consistently so output is parseable
  • Capture output to variables when you need to inspect it multiple times
  • Log timestamps alongside data for correlation with system events
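For the reachability check, a cheap TCP probe of the engine's gRPC port is usually enough before launching a batch. A sketch under that assumption; the probe is a scripting convenience, not a gaius-cli feature (port 50051 per the engine docs):

```python
import socket

def engine_reachable(host: str = "localhost", port: int = 50051,
                     timeout: float = 1.0) -> bool:
    """Return True if something is accepting TCP connections on the engine port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Typical guard at the top of a batch script:
# if not engine_reachable():
#     raise SystemExit("engine unreachable; run `devenv processes up -d` first")
```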

MCP Integration

Gaius exposes 163 tools via the Model Context Protocol (MCP), making its full functionality available to Claude Code and other MCP-compatible AI clients.

What Is MCP?

The Model Context Protocol is a standard for connecting AI assistants to external tools and data sources. When configured, Claude Code can call Gaius tools directly – checking health, querying the knowledge base, managing agents, and running operations – all within a conversational workflow.

Starting the MCP Server

uv run gaius-mcp

This starts a stdio-based MCP server that communicates with Claude Code over standard input/output. The server connects to the same gRPC engine (port 50051) used by the TUI and CLI.

What You Can Do

With MCP integration, Claude Code can:

  • Diagnose issues: query health status, check endpoint state, review incident history
  • Manage agents: view evolution status, trigger training, promote agent versions
  • Search knowledge: query the knowledge base, perform semantic search, explore lineage
  • Run inference: submit prompts to the scheduler, evaluate outputs, manage XAI budget
  • Monitor systems: read Prometheus metrics, check Metabase dashboards, view GPU health
  • Create content: trigger article curation, render card visualizations, manage collections

Architecture

The MCP server is a thin wrapper over the same services available through the CLI. Each MCP tool maps to an internal command or service call. The server handles serialization (JSON arguments and responses) and error propagation.

Claude Code  <--stdio-->  gaius-mcp  <--gRPC-->  Engine (port 50051)
                                     <--HTTP-->  Services (Metabase, Prometheus, etc.)
                                     <--SQL-->   PostgreSQL (port 5444)

Claude Code Setup

This page describes how to configure Claude Code to use the Gaius MCP server, giving Claude Code direct access to all 163 Gaius tools.

Configuration

Add the Gaius MCP server to your Claude Code MCP configuration. The configuration file is typically at ~/.claude.json or in your project’s .claude/ directory.

Add the following to the mcpServers section:

{
  "mcpServers": {
    "gaius": {
      "command": "uv",
      "args": ["run", "--directory", "/path/to/gaius", "gaius-mcp"],
      "env": {
        "GAIUS_ENGINE_HOST": "localhost",
        "GAIUS_ENGINE_PORT": "50051"
      }
    }
  }
}

Replace /path/to/gaius with the absolute path to your Gaius repository checkout.

Environment Variables

The MCP server respects these environment variables:

| Variable | Default | Purpose |
|----------|---------|---------|
| GAIUS_ENGINE_HOST | localhost | gRPC engine hostname |
| GAIUS_ENGINE_PORT | 50051 | gRPC engine port |
| DATABASE_URL | from config | PostgreSQL connection URL |

In most setups, the defaults work without any environment overrides.
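The resolution order the table implies is simple to mirror in a client script. An illustrative sketch; the server's actual config-loading code is not shown here:

```python
import os

def engine_address() -> tuple[str, int]:
    """Engine host/port with the documented defaults applied."""
    host = os.environ.get("GAIUS_ENGINE_HOST", "localhost")
    port = int(os.environ.get("GAIUS_ENGINE_PORT", "50051"))
    return host, port

print(engine_address())  # ('localhost', 50051) unless overridden in the environment
```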

Prerequisites

Before Claude Code can use Gaius tools, the platform services must be running:

cd /path/to/gaius
devenv shell
devenv processes up -d

The MCP server connects to the gRPC engine on startup. If the engine is not running, tool calls will fail with connection errors.

Verifying the Connection

After configuring, ask Claude Code to run a health check:

“Check the Gaius health status”

Claude Code should invoke the health_observer_check tool and return a structured health report. If it reports connection errors, verify that devenv processes up -d has been run.

Tool Discovery

Claude Code can list available tools. The 163 tools are organized into categories such as health, agents, inference, knowledge base, observability, evolution, visualization, and bases. See Tool Categories for the full breakdown.

Security Considerations

The MCP server runs locally and communicates with Claude Code over stdio. It does not expose a network port. All operations are scoped to the local Gaius instance. For ACP (Agent Client Protocol) integration, which involves GitHub operations, additional security controls apply – see the ACP Security Model documentation.

Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| “Tool not found” | MCP config not loaded | Restart Claude Code after editing config |
| Connection refused | Engine not running | Run devenv processes up -d |
| Timeout on tool calls | Engine overloaded | Check /gpu status or run just restart-clean |
| Python errors | Dependencies missing | Run uv sync in the gaius directory |

Tool Categories

The 163 MCP tools are organized by domain. Each tool maps to an internal service call and accepts JSON arguments.

Health

Tools for system diagnostics, self-healing, and incident management.

| Tool | Purpose |
|------|---------|
| health_observer_status | Current observer daemon state |
| health_observer_check | Run health diagnostic across all categories |
| health_observer_start / stop | Control the health observer daemon |
| health_observer_incidents | List active and recent incidents |
| health_observer_incident_detail | Detailed view of a specific incident |
| fmea_catalog | Browse failure modes and their RPN scores |
| fmea_calculate_rpn | Calculate Risk Priority Number for a failure mode |
| fmea_get_controls | Get remediation controls for a failure mode |

Agents and Evolution

Tools for managing agent versions, evolution cycles, and cognition.

| Tool | Purpose |
|------|---------|
| list_agent_versions | All agent versions with metadata |
| get_active_config | Current active agent configuration |
| get_best_agent_version | Highest-performing version for an agent |
| save_agent_version / rollback_agent | Version management |
| optimize_agent | Trigger optimization for an agent |
| evolution_status | Current generation, evaluation state |
| trigger_evolution | Start a new evolution cycle |
| trigger_task_ideation | Generate new training tasks |
| get_capability_gaps | Identify areas where agents underperform |

Inference and Models

Tools for managing LLM endpoints, inference scheduling, and XAI budget.

| Tool | Purpose |
|------|---------|
| list_models / get_model | Browse available models |
| gpu_health | GPU utilization and endpoint status |
| model_launch_coding / model_stop_coding | Control inference endpoints |
| model_generate_code | Generate code using a managed model |
| model_validate_code | Validate generated code |
| get_xai_budget / reset_xai_budget | Manage XAI inference budget |
| evaluate_with_xai | Run evaluation using the XAI model |

Knowledge Base

Tools for searching, reading, and managing knowledge base content.

| Tool | Purpose |
|------|---------|
| search_kb | Full-text search across KB entries |
| read_kb / create_kb / update_kb / delete_kb | CRUD operations |
| list_kb | List entries with filters |
| kb_sync | Synchronize KB with external sources |
| semantic_search | Vector similarity search |
| embed_text / embed_texts | Generate embeddings |

Observability

Tools for metrics, monitoring, and system telemetry.

| Tool | Purpose |
|------|---------|
| observe_status / observe_metrics | Observability pipeline state |
| prometheus_query / prometheus_query_range | Direct PromQL queries |
| prometheus_health | Prometheus server status |
| metabase_status | Metabase analytics server status |
| metabase_list_dashboards / metabase_get_dashboard | Browse dashboards |
| log_activity / get_activity_stats / get_daily_summary | Activity tracking |

Visualization

Tools for rendering card visualizations and managing collections.

| Tool | Purpose |
|------|---------|
| collection_status / collection_list / collection_create | Manage collections |
| collection_add_card / collection_list_cards | Card management |
| collection_publish_cards / collection_publish_viz | Publishing pipeline |
| collection_generate_summaries | AI-generated card summaries |
| article_list / article_curate / article_new | Article management |

Bases (Feature Store)

Tools for querying the Bases feature store.

| Tool | Purpose |
|------|---------|
| bases_list | List available bases |
| bases_query | Run DQL queries against a base |
| bases_entity_history | Entity change history |
| bases_health | Feature store health status |

Cognition and Memory

Tools for agent thinking, memory consolidation, and self-reflection.

| Tool | Purpose |
|------|---------|
| trigger_cognition | Trigger a cognition cycle |
| trigger_self_observation | Agent self-reflection |
| get_thought_chain / get_recent_thoughts | View agent reasoning |
| what_are_you_thinking | Current agent state of mind |
| theta_sitrep / theta_consolidate | Theta wave memory consolidation |
| reflect / quick_thought | Lightweight reflection tools |

Workflows

Gaius supports multi-step workflows that combine CLI commands, MCP tools, and TUI interactions. This section documents the most common patterns.

What Is a Workflow?

A workflow is a sequence of operations that achieve a goal larger than any single command. For example, researching a topic involves creating KB entries, curating articles, generating cards, and publishing a collection. Each step uses different Gaius capabilities, and the output of one step feeds the next.

Three Interaction Layers

Workflows can be executed through any combination of the three interfaces:

  • TUI: interactive exploration, visual pattern recognition, manual curation
  • CLI: scripted operations, batch processing, automated checks
  • MCP: AI-assisted orchestration, where Claude Code drives multi-step sequences

The choice depends on the task. Health monitoring is best scripted via CLI. Research curation benefits from MCP-driven AI assistance. Spatial exploration requires the TUI.

Common Workflows

Research Workflow

End-to-end knowledge synthesis: define a topic, curate articles from the web, create cards with enriched metadata, and publish a collection. This is the primary content pipeline.

Health Workflow

System diagnosis and remediation: run health checks, interpret failures, apply self-healing fixes, and monitor recovery. This workflow is critical for keeping the platform operational.

Evolution Workflow

Agent improvement cycle: check evolution status, generate training tasks, trigger evaluation, promote successful agents. This is how Gaius agents get better over time.

Workflow Principles

Self-healing first. When something breaks, try /health fix <service> before manual intervention. The self-healing system learns from each invocation.

Test via CLI. After any code change or operation, verify the result with gaius-cli. Previous outputs are invalidated by changes – always re-run the command.

Fail fast. Gaius surfaces errors immediately with actionable remediation paths. If a step fails, the error message tells you what to do next. There are no silent fallbacks.

Observe, then act. Use the OODA loop: observe system state (/health, /gpu status), orient by comparing overlays, decide on an action, then act. Do not skip the observation step.

Research Workflow

The research workflow takes a topic from initial exploration through to a published collection of enriched cards. This is the primary content pipeline in Gaius.

Overview

Topic definition --> Article curation --> Card creation --> Enrichment --> Publishing

Each step builds on the previous one. The workflow can be driven manually through the CLI, or orchestrated by Claude Code via MCP tools.

Step 1: Define the Topic

Create or select an article definition with keywords and news queries that guide content discovery:

# List existing articles
uv run gaius-cli --cmd "/article list" --format json

# Create a new article with topic keywords
uv run gaius-cli --cmd "/article new" --format json

Articles need keywords and/or news_queries in their frontmatter for the Brave fetcher to find relevant sources. Without these, curation will fail fast with #ACF.00000013.NOHINTS.

Step 2: Curate Articles

Run the article curation flow to fetch and process relevant content:

uv run gaius-cli --cmd "/article curate" --format json

The curation flow:

  1. Searches the web using configured keywords and news queries
  2. Fetches and extracts content from discovered URLs
  3. Evaluates relevance against a selection rubric
  4. Creates cards from qualifying articles (~20 cards per run, ~2 minutes)

The selection rubric includes a curation_readiness gate that prevents selecting articles whose metadata is incomplete.

Step 3: Enrich Cards

Cards are created with basic metadata. Enrichment adds embeddings, summaries, and topology features:

# Check enrichment status
uv run gaius-cli --cmd "/collection list cards" --format json

# Generate summaries for cards that need them
uv run gaius-cli --cmd "/collection generate summaries" --format json

Card publishing is gated on enrichment completeness – cards without sufficient enrichment cannot be published.

Step 4: Render Visualizations

Each card gets a deterministic visualization rendered by the LuxCore engine:

uv run gaius-cli --cmd "/render cards" --format json

The grammar engine generates a unique visual based on the card’s topology features, seeded by hash(card_id) for deterministic output. Two variants are produced: display (1400x300) and og (1200x630 for social sharing).
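A sketch of the deterministic-seeding idea; a stable digest stands in for hash(card_id) here, since Python's builtin hash() is salted per process and would not reproduce across runs:

```python
import hashlib
import random

def card_seed(card_id: str) -> int:
    """Derive a stable 64-bit seed from a card id."""
    digest = hashlib.sha256(card_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

# Same card id always yields the same visual parameters:
rng = random.Random(card_seed("card-42"))
params = [rng.randint(0, 18) for _ in range(3)]
print(params)  # identical on every run and every machine
```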

Step 5: Publish Collection

Publish the completed cards to a collection:

# Create or select a collection
uv run gaius-cli --cmd "/collection create" --format json

# Add cards to the collection
uv run gaius-cli --cmd "/collection add card" --format json

# Publish
uv run gaius-cli --cmd "/collection publish cards" --format json

MCP-Driven Research

When using Claude Code with MCP tools, the entire workflow can be conversational:

“Research the topic of topological data analysis in financial risk. Curate articles, enrich the cards, and publish a collection.”

Claude Code will call article_new, article_curate, collection_generate_summaries, and collection_publish_cards in sequence, reporting progress at each step.

Monitoring Collection Balance

The pending_cards metric is the most effective signal for collection diversity. Monitor it to ensure the collection is not over-weighted toward a single source or topic.

Health Workflow

The health workflow covers diagnosing system issues, applying self-healing fixes, and monitoring recovery. Gaius implements a fail-fast policy with actionable error messages, so every failure tells you what to do next.

Step 1: Diagnose

Run the health check to see the current state of all services:

uv run gaius-cli --cmd "/health" --format json

This returns a structured report with checks organized by category. Each check has a status (ok, warn, fail) and a message explaining the current state.

To check a specific category:

uv run gaius-cli --cmd "/health engine" --format json
uv run gaius-cli --cmd "/health endpoints" --format json

Step 2: Interpret Failures

Failed checks include Guru Meditation Codes – unique identifiers for each failure mode. For example:

  • #DS.00000001.SVCNOTINIT – DatasetService not initialized
  • #NF.00000001.UNREACHABLE – NiFi not reachable
  • #EP.00000001.GPUOOM – GPU out of memory

Each code maps to a documented heuristic with symptom, cause, observation method, and solution. The error message itself contains remediation hints.

Step 3: Self-Heal

Always try /health fix before manual intervention. This is a design principle, not a suggestion:

uv run gaius-cli --cmd "/health fix engine" --format json
uv run gaius-cli --cmd "/health fix endpoints" --format json
uv run gaius-cli --cmd "/health fix nifi" --format json

Available fix targets: engine, dataset, nifi, postgres, qdrant, minio, endpoints, evolution.

Each fix strategy is a multi-step remediation sequence with verification at each step. The system attempts increasingly aggressive fixes until the service recovers.

Step 4: Monitor Recovery

After applying a fix, monitor the health observer for recovery:

# Check observer status
uv run gaius-cli --cmd "/health observer status" --format json

# List active incidents
uv run gaius-cli --cmd "/health observer incidents" --format json

# Poll for recovery
for i in $(seq 1 10); do
    sleep 15
    uv run gaius-cli --cmd "/health" --format json | \
        jq '.data.checks[] | select(.status != "ok") | {name, status, message}'
done

Step 5: Escalation

If /health fix does not resolve the issue, the Health Observer can escalate via ACP (Agent Client Protocol) to Claude Code for deeper analysis. This happens automatically when:

  1. An incident exceeds the configured FMEA RPN threshold
  2. Local remediation has failed
  3. The incident is not in cooldown

Manual escalation path – use just restart-clean as a last resort:

just restart-clean

This performs a full clean restart of all services: stops everything, cleans up state, and restarts from scratch.

FMEA Framework

The health system uses Failure Mode and Effects Analysis (FMEA) to prioritize issues. Each failure mode has a Risk Priority Number (RPN) computed from severity, occurrence frequency, and detection difficulty. Higher RPNs get attention first.
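The RPN arithmetic is the standard FMEA product of the three scores. A minimal sketch; the 1-10 scales are the usual FMEA convention and an assumption here, since Gaius's exact scales are not documented on this page:

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number: severity x occurrence x detection."""
    for score in (severity, occurrence, detection):
        if not 1 <= score <= 10:
            raise ValueError("FMEA scores are conventionally in 1-10")
    return severity * occurrence * detection

# A severe, frequent, hard-to-detect failure dominates the queue:
failures = {"gpu_oom": rpn(8, 6, 4), "nifi_down": rpn(5, 3, 2)}
print(max(failures, key=failures.get))  # prints "gpu_oom" (192 vs 30)
```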

# View the FMEA catalog
uv run gaius-cli --cmd "/fmea catalog" --format json

# Calculate RPN for a specific failure mode
uv run gaius-cli --cmd "/fmea rpn <mode>" --format json

Health Observer Daemon

The Health Observer runs as a background daemon, continuously monitoring service health and automatically triggering remediation when issues are detected:

# Start the observer
uv run gaius-cli --cmd "/health observer start" --format json

# Stop the observer
uv run gaius-cli --cmd "/health observer stop" --format json

When running, it checks services periodically and logs incidents. Resolved incidents are filtered out of the active list, but unknown or unexpected states remain visible (fail-open for observability).

Evolution Workflow

The evolution workflow improves Gaius agents over time through task ideation, training, evaluation, and promotion. This is a cycle that repeats as agents accumulate more data and experience.

Overview

Status check --> Task ideation --> Training --> Evaluation --> Promotion
     ^                                                           |
     |___________________________________________________________|

Each cycle produces a new generation of agent versions. Successful versions are promoted to active status; underperformers are retained for comparison but not used in production.

Step 1: Check Status

Before starting an evolution cycle, check the current state:

uv run gaius-cli --cmd "/evolve status" --format json

This shows the current generation number, active agents, evaluation state, and any capability gaps. Pay attention to:

  • Generation: which cycle you are on
  • Active agents: which agent versions are currently serving
  • Capability gaps: areas where agents underperform

Step 2: Task Ideation

Generate new training tasks based on identified capability gaps:

uv run gaius-cli --cmd "/evolve task ideation" --format json

The ideation process analyzes recent performance data and gap analysis to propose tasks that target specific weaknesses. Tasks are designed to push agents toward areas where they currently underperform.

Step 3: Trigger Evolution

Start the evolution cycle. This runs training with the generated tasks and produces new agent versions:

uv run gaius-cli --cmd "/evolve trigger" --format json

Evolution requires healthy inference endpoints. Verify with:

uv run gaius-cli --cmd "/gpu status" --format json | \
    jq '.data.endpoints[] | {name, status}'

All endpoints should show HEALTHY before triggering evolution. If they do not, run /health fix endpoints first.

Step 4: Evaluate

After training completes, evaluate the new agent versions against held-out test data:

# Check evaluation results
uv run gaius-cli --cmd "/evolve status" --format json | jq '.data.evaluation'

# View held-out statistics
uv run gaius-cli --cmd "/evolve held-out stats" --format json

Evaluation uses the RASE verification framework. Each agent version is scored on accuracy (0.0-1.0, proportion of constraints satisfied) and compared against previous versions.
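The accuracy score reduces to a proportion of satisfied constraints. A toy sketch; the real RASE constraint model is richer than a list of booleans:

```python
def accuracy(constraints: list[bool]) -> float:
    """Proportion of constraints satisfied, in [0.0, 1.0]."""
    if not constraints:
        return 0.0
    return sum(constraints) / len(constraints)

print(accuracy([True, True, False, True]))  # prints 0.75
```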

Step 5: Promote or Roll Back

If the new version outperforms the current active version, promote it:

# View the best version
uv run gaius-cli --cmd "/evolve best" --format json

# Promote (via MCP or direct command)
uv run gaius-cli --cmd "/evolve promote" --format json

If the new version underperforms, roll back to a known good version:

uv run gaius-cli --cmd "/evolve rollback" --format json

Evolution Daemon

For continuous improvement, start the evolution daemon which runs cycles automatically:

# Start the daemon
uv run gaius-cli --cmd "/evolve daemon start" --format json

# Check daemon status
uv run gaius-cli --cmd "/evolve daemon status" --format json

# Stop the daemon
uv run gaius-cli --cmd "/evolve daemon stop" --format json

The daemon monitors capability gaps and triggers evolution cycles when thresholds are exceeded.

Track improvement over time:

uv run gaius-cli --cmd "/evolve trend" --format json

This shows how agent performance has changed across generations. Look for:

  • Upward trend: agents are improving, the evolution cycle is working
  • Plateau: training tasks may need diversification, or capability limits have been reached
  • Regression: roll back to a previous version and investigate

Model Merging

When multiple specialized agent versions exist, model merging can combine their strengths:

# View merge candidates
uv run gaius-cli --cmd "/evolve merge candidates" --format json

# Trigger a merge
uv run gaius-cli --cmd "/evolve merge" --format json

# View lineage
uv run gaius-cli --cmd "/evolve lineage" --format json

Model lineage tracking records the ancestry of each merged version, enabling traceability from the final model back to its training data and parent versions.

Design Philosophy

Gaius is more than a visualization tool—it’s an experiment in augmented cognition. The design integrates principles from human factors engineering, situational awareness research, and decades of interface evolution to create something genuinely new.

Foundational Principles

1. Spatial Cognition First

Humans evolved to navigate physical space. We have dedicated neural hardware for:

  • Allocentric mapping: Understanding space from a fixed reference frame
  • Path integration: Tracking position through movement
  • Landmark recognition: Identifying significant points

Gaius exploits this by mapping abstract data onto a navigable grid. The cursor becomes your position. Regions become territories. Movement through the grid engages spatial reasoning circuits that spreadsheets leave dormant.

2. Perceptual Bandwidth

Vision is our highest-bandwidth sense. Reading text: ~250 words/minute. Recognizing a scene: ~100ms. Gaius prioritizes visual pattern recognition over sequential text processing.

When you see agents clustered in a corner with death loops nearby, you perceive the situation instantly—before you could read a report describing it.

3. Modal Efficiency

Modal interfaces concentrate related operations. In normal mode, every key is a navigation or view command—no modifier keys needed. This reduces both physical motion and cognitive load.

Critics of modes cite “mode errors” (typing in wrong mode). Gaius addresses this with:

  • Clear mode indicators in status line
  • Consistent escape semantics (Esc always returns to normal)
  • Mode-appropriate cursor styling (planned)

4. Progressive Complexity

New users see a clean grid. They navigate with hjkl, toggle modes, quit with q. Nothing confusing.

Power users access deeper functionality through slash commands, MCP tools, and CLI scripting. Three interfaces — TUI, CLI, MCP — offer increasing levels of automation.

Complexity is opt-in, not mandatory.

5. Transparency Over Magic

Every visual element has an explanation. The grid shows exactly what it’s told to show. Agent positions derive from actual embeddings through a defined projection. Death loops come from computed homology.

No black boxes. No “AI magic.” Understanding the system enables trusting the system.

Human Factors Integration

Gaius incorporates principles from human factors engineering—the discipline of designing systems that account for human capabilities and limitations.

Cognitive Load Management

Miller’s Law: Working memory holds 7±2 chunks. Gaius manages this by:

  • Showing at most 7 agents (one per color)
  • Limiting candidate markers to 9 (a-i)
  • Using overlays to separate concerns (one layer at a time)

Hick’s Law: Decision time increases with choice count. Modal operation reduces active choices at any moment.

Attention and Distraction

The grid provides a stable anchor. Overlays add information; the base never shifts unexpectedly.

Status updates appear in the designated status line—not as popups or animations that hijack attention.

Error Prevention

  • Confirmation for destructive actions: clear memory, quit with unsaved changes
  • Reversible operations: overlay cycles, mode toggles, cursor movement
  • Visible state: current mode, active features, domain always displayed

Fitts’s Law and Input

Fitts’s Law: Target acquisition time depends on distance and size. Keyboard input eliminates targeting entirely—no mouse movement, no precision required.

hjkl navigation keeps grid movement on the home row: no reaching for arrow keys, no mouse retargeting.

Situational Awareness

Situational awareness (SA) is the perception, comprehension, and projection of system states. Gaius is explicitly designed to support all three levels of SA as defined by Endsley (1995).

Level 1: Perception

What is happening?

Gaius provides immediate perception through:

  • Grid state: See where entities are located
  • Density shading: See relative magnitudes at a glance
  • Agent positions: See where each analytical lens is focused
  • Death loops: See topological features visually

No reading required. No scrolling. The state is visible.

Level 2: Comprehension

What does it mean?

Comprehension emerges from:

  • Spatial relationships: Clusters = consensus, scatter = uncertainty
  • Overlay transitions: Compare views to understand multi-dimensional state
  • Color coding: Consistent agent colors build recognition
  • Historical context: Memory enables “this is different from before”

Level 3: Projection

What will happen next?

Projection is supported by:

  • Swarm dynamics: Watch convergence/divergence trends
  • Entropy tracking: Rising entropy may signal regime change
  • Death loop evolution: New loops appearing = emerging risk
  • Agent trajectories: Where is each analytical perspective moving?
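
The entropy signal above can be made concrete as Shannon entropy over the distribution of agent positions (a sketch; the metric Gaius actually computes may differ):

```python
from collections import Counter
from math import log2

def position_entropy(positions):
    """Shannon entropy (bits) of agent positions on the lattice.
    Low entropy = convergence; a rising value may signal regime change."""
    counts = Counter(positions).values()
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts)

# Seven agents stacked on two cells vs. scattered across seven cells.
print(position_entropy([(4, 4)] * 5 + [(5, 4)] * 2))  # low
print(position_entropy([(i, i) for i in range(7)]))   # high, log2(7) bits
```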

SA Demons (Threats to Awareness)

Endsley identified common SA failures. Gaius defends against them:

| SA Demon | Gaius Defense |
|---|---|
| Attention tunneling | Overlay cycling forces perspective shifts |
| Data overload | Layered disclosure; modes separate concerns |
| Out-of-the-loop | Swarm runs show agent “thinking” in real-time |
| Misplaced salience | Consistent visual vocabulary; no flashy distractions |
| Complexity creep | Feature flags; base UI is minimal |

The OODA Loop

Boyd’s OODA (Observe-Orient-Decide-Act) loop describes competitive decision-making:

  1. Observe: Grid displays current state
  2. Orient: Overlays, memory search, agent positions inform context
  3. Decide: Slash commands, domain changes, focus actions
  4. Act: Run swarm rounds, mark positions, export insights

Fast OODA loops win. Gaius minimizes latency at every stage.

Design Tensions

Every design involves tradeoffs. Gaius makes explicit choices:

Density vs. Clarity

The grid could show more information (color + shape + size). We prioritize clarity—one symbol per cell, overlays for additional dimensions.

Flexibility vs. Consistency

Custom projections enable domain adaptation. But core navigation (hjkl) never changes. Flexibility in content, consistency in interaction.

Power vs. Accessibility

Modal interfaces have a learning curve. We accept this tradeoff because mastery enables flow states inaccessible to modeless interfaces.

Automation vs. Control

Agents suggest; humans decide. The swarm provides perspectives, not prescriptions. Autonomy remains with the operator.

The Goal: Augmented Cognition

Gaius aims to extend human perception into domains we can’t naturally sense:

  • High-dimensional embedding spaces
  • Topological structure of point clouds
  • Collective reasoning of agent swarms

By projecting these onto a navigable grid with overlays and keyboard-driven interaction, we make the invisible visible—and navigable.

This is augmentation, not replacement. The human remains in control, with enhanced perception of complex systems.

Co-Creation with Code Agents

Gaius represents a novel architectural pattern: an application co-created with AI code agents, where the development process itself shapes the system’s design.

The Co-Creation Paradigm

Traditional software development follows a clear separation: humans design, humans implement, humans document. Gaius challenges this by integrating Claude Code (powered by Claude Opus 4.5) as a first-class development partner.

This isn’t “AI-assisted coding” in the conventional sense. It’s a symbiotic development process where:

  1. The human provides vision and judgment — strategic direction, quality assessment, architectural taste
  2. The code agent provides implementation velocity — exploring codebases, generating code, maintaining consistency
  3. The system evolves through dialogue — features emerge from conversation, not specification documents

Implications for Architecture

When an AI agent is a development partner, certain architectural choices become natural:

Interface Parity: CLI, TUI, and MCP interfaces must provide equivalent functionality. Why? Because the code agent (via MCP) needs access to the same operations the human uses (via TUI). Parity isn’t a nice-to-have; it’s essential for the agent to effectively participate in development and testing.

Living Documentation in the KB: Command references live in the Knowledge Base ([[current/commands/]]), not frozen in mdbook. The command set evolves as the agent and human add features together. Static documentation would be perpetually stale.

Self-Describing Systems: The MCP tools are the API. The CLI commands are the operations. When these are well-named and well-documented, the code agent can discover and use them without additional instruction.

The Knowledge Base as Shared Memory

A key insight: the KB serves as shared context between human and agent across sessions.

What Belongs in the KB vs. mdbook

| KB (build/dev/) | mdbook (docs/) |
|---|---|
| Command reference (evolving) | Design philosophy (stable) |
| Current research threads | Architectural foundations |
| Session notes and decisions | Core concepts |
| Feature-specific documentation | User guides |
| Agent-generated analysis | Contributing guidelines |

The distinction: KB content may change between sessions as features evolve. mdbook content captures enduring principles that guide the evolution.

Example: The Commands Directory

The command reference in [[current/commands/]] was created during a session where we:

  1. Audited all commands across CLI, TUI, and MCP
  2. Identified parity gaps
  3. Documented each interface comprehensively

This documentation now serves multiple purposes:

  • For humans: Quick reference, training material
  • For code agents: Discovery of available operations
  • For development: Gap analysis, parity tracking

If we added the command reference to mdbook, it would be outdated within days. In the KB, it can evolve with the system.

BDD as Collaborative Specification

Behavior-Driven Development (BDD) takes on new significance in co-created systems.

Feature Files as Contracts

Gherkin feature files (features/*.feature) serve as:

  1. Executable specifications — Tests that verify behavior
  2. Agent-readable requirements — Clear, structured descriptions the code agent can understand
  3. Living documentation — Always synchronized with actual behavior
Feature: Wiki Link Resolution
  As a knowledge worker
  I want broken wiki links to resolve via search
  So that the knowledge graph grows organically

  Scenario: Selecting an unresolved wiki link
    Given a file "test.md" containing "[[nonexistent-topic]]"
    When I select the broken link in the graph panel
    Then a search runs for "nonexistent-topic"
    And a new zettelkasten note is created
    And the original link is updated to point to the new note
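
The first step of that flow, detecting unresolved targets, is straightforward to sketch (the note-store lookup here is a stand-in for the real KB index):

```python
import re

# Matches [[target]] and [[target|alias]], capturing only the target.
WIKI_LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]+)?\]\]")

def broken_links(text: str, known_notes: set) -> list:
    """Return wiki-link targets that do not resolve to an existing note."""
    return [t for t in WIKI_LINK.findall(text) if t not in known_notes]

notes = {"current/commands/index", "gaius-overview"}
md = "See [[gaius-overview]] and [[nonexistent-topic]]."
print(broken_links(md, notes))  # ['nonexistent-topic']
```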

This scenario was implemented in a single session. The code agent:

  • Read the feature file to understand requirements
  • Implemented the feature across multiple files
  • Created tests to verify the behavior

Scenarios as Design Discussions

BDD scenarios often emerge from human-agent dialogue:

Human: “When I click a broken link, instead of an error, can it search and create a note?”

Agent: “So the flow would be: detect missing target → run search → synthesize note → update original link?”

Human: “Yes, and add a backlink from the new note to the origin.”

This conversation becomes a scenario. The scenario becomes a test. The test drives the implementation.

Interface Parity as Architectural Principle

The three interfaces serve different users:

| Interface | Primary User | Interaction Pattern |
|---|---|---|
| TUI | Human (interactive) | Real-time visualization, keyboard navigation |
| CLI | Human (scripting), CI/CD | JSON output, automation |
| MCP | Code agents, integrations | Structured tool calls |

Why Parity Matters

When interfaces drift apart:

  • The code agent can’t test what the human experiences
  • Automation scripts break when TUI adds features
  • Documentation fragments across interfaces

Gaius addresses this through:

  1. Shared core functions — CLI and TUI call the same underlying methods
  2. MCP as the comprehensive API — 163 tools covering all operations
  3. Regular parity audits — Tracking gaps in [[current/commands/index]]

The Parity Matrix

The command coverage matrix explicitly tracks which operations are available where:

| Command      | CLI | TUI | MCP |
|--------------|-----|-----|-----|
| /search      |  ✓  |  ✓  |  ✓  |  ← Full parity
| /model add   |  ✓  |  -  |  ✓  |  ← TUI gap (priority)
| /init        |  -  |  ✓  |  -  |  ← TUI-specific (OK)

This matrix is itself a development artifact that guides prioritization.
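
A matrix like this can be audited mechanically. A sketch using the rows above as illustrative data:

```python
def parity_gaps(matrix: dict) -> dict:
    """Return, per command, the interfaces where it is missing."""
    interfaces = ("CLI", "TUI", "MCP")
    return {
        cmd: [i for i in interfaces if not avail.get(i, False)]
        for cmd, avail in matrix.items()
        if not all(avail.get(i, False) for i in interfaces)
    }

coverage = {
    "/search":    {"CLI": True,  "TUI": True,  "MCP": True},
    "/model add": {"CLI": True,  "TUI": False, "MCP": True},
    "/init":      {"CLI": False, "TUI": True,  "MCP": False},
}
print(parity_gaps(coverage))
# {'/model add': ['TUI'], '/init': ['CLI', 'MCP']}
```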

Practical Patterns

Pattern 1: Agent-Discoverable Operations

Name commands and tools descriptively:

  • scheduler_health_check not shc
  • /evolve trigger not /evo t

The code agent reads these names. Clear naming reduces confusion.

Pattern 2: JSON-First CLI

CLI commands return structured JSON by default:

uv run gaius-cli --cmd "/state" --format json

This enables:

  • Agent parsing of command output
  • Scripted verification of behavior
  • Pipeline integration
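
A sketch of the agent-side consumption pattern (the `/state` payload fields shown are illustrative, not the real schema):

```python
import json
import subprocess

def gaius_state(raw=None) -> dict:
    """Parse `/state` output. If `raw` is None, shell out to the CLI
    (requires gaius on PATH); otherwise parse the provided string."""
    if raw is None:
        raw = subprocess.run(
            ["uv", "run", "gaius-cli", "--cmd", "/state", "--format", "json"],
            capture_output=True, text=True, check=True,
        ).stdout
    return json.loads(raw)

# Parsing a sample payload (field names are illustrative):
sample = '{"mode": "normal", "overlay": "tda", "cursor": [10, 4]}'
print(gaius_state(sample)["overlay"])  # tda
```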

Pattern 3: Incremental Documentation

Don’t write comprehensive documentation upfront. Let it emerge:

  1. Implement feature with agent
  2. Agent documents as it implements
  3. Human reviews and refines
  4. Documentation evolves with feature

Pattern 4: Session Handoff

The KB preserves context across sessions:

  • [[scratch/YYYY-MM-DD/]] — Daily working notes
  • [[current/commands/]] — Living reference
  • Research threads — Ongoing investigations

When a new session starts, the agent can read recent KB entries to resume context.

The Meta-Principle

Systems designed for co-creation with code agents are inherently more maintainable.

Why? Because the requirements for agent collaboration—clear interfaces, structured data, living documentation, testable behavior—are the same requirements for long-term maintainability.

Designing for an AI collaborator forces us to:

  • Make implicit knowledge explicit
  • Structure operations consistently
  • Document as we build
  • Test what we document

These are good practices regardless of whether an agent is involved. The agent just makes them essential.

Future Directions

Agent-Initiated Evolution

Currently, the human initiates feature development. Future systems might:

  • Have the agent propose features based on usage patterns
  • Automatically generate BDD scenarios from user feedback
  • Self-document new capabilities as they’re added

Multi-Agent Development

Gaius already uses multi-agent swarms for analysis. The same pattern could apply to development:

  • Architect agent proposes structure
  • Implementation agent writes code
  • Critic agent reviews
  • Documentation agent updates KB

Adaptive Interfaces

If the agent tracks which operations are used most, it could:

  • Suggest adding frequently-used MCP tools to TUI
  • Identify commands that should be automated
  • Propose interface simplifications

Conclusion

Gaius isn’t just a tool for augmented cognition—it’s a case study in augmented development. The co-creation paradigm, where human vision and AI implementation velocity combine, produces systems that are:

  • More consistent — The agent enforces patterns across the codebase
  • Better documented — Documentation emerges from the development dialogue
  • More testable — BDD scenarios are natural outputs of requirement discussions
  • Easier to maintain — Clear interfaces required for agent collaboration benefit all maintainers

The KB as shared memory, interface parity as principle, and BDD as collaborative specification—these patterns aren’t specific to Gaius. They’re applicable to any system designed for human-AI co-creation.

The future of software development isn’t human OR machine. It’s human AND machine, each contributing their strengths to create systems neither could build alone.

Inspirations

Gaius stands on the shoulders of giants. This section traces the lineage of ideas that inform its design.

The Polymath Tradition

Gaius Plinius Secundus (23-79 CE)

Pliny the Elder’s Naturalis Historia attempted to catalog all knowledge of the natural world across 37 books. He wrote: “Nature is to be found in her entirety nowhere more than in her smallest creatures.”

This spirit—systematic observation, comprehensive scope, attention to detail—animates Gaius. The grid is our attempt at a unified view of complex domains.

The Encyclopedists

Diderot and d’Alembert’s Encyclopédie (1751-1772) organized knowledge with cross-references, creating a navigable web of ideas. Gaius’s scene graph and semantic search continue this tradition.

Modern Polymaths

Herbert Simon (AI, economics, psychology), Douglas Engelbart (augmented intelligence), Seymour Papert (constructionism)—thinkers who crossed disciplines to synthesize new understanding. Gaius is built for their intellectual descendants.

Interface Lineages

Terminal Interfaces

From TTY to VT100 to ANSI terminals to modern terminal emulators, the text interface has evolved continuously. Gaius inherits:

  • Character grid: Discrete, addressable positions
  • ANSI styling: Colors, bold, background
  • Keyboard primacy: No mouse required
  • Stream output: Log panels for sequential information

Modal Editors

vi (1976) → vim (1991) → neovim (2014) → modern modal interfaces. Key insights:

  • Modes reduce modifier keys: Insert mode types; normal mode commands
  • Composability: d3w (delete 3 words) combines operation + count + motion
  • Muscle memory: Consistent bindings become automatic

Gaius adopts hjkl and plans command composition (/focus Risk | /analyze).

Plan 9 and Acme

Rob Pike’s Acme editor (1994) introduced:

  • Mouse chording: Combined mouse buttons for operations
  • Text as command: Select text, execute it
  • Windowing without decoration: Content maximizes screen real estate
  • Unix philosophy at the UI level: Small, composable pieces

Gaius plans Acme-inspired text execution for the log panel.

Professional Interfaces

Bloomberg Terminal

Since 1981, Bloomberg has defined professional data interfaces:

  • Information density: Every pixel works
  • Keyboard-first: <GO> commands, function keys, minimal mouse
  • Consistent vocabulary: Familiar patterns across thousands of functions
  • Real-time updates: Live data as the base state

Gaius inherits the density and keyboard ethos while modernizing the visual language.

Trading Floors

Before terminals, open outcry trading used:

  • Spatial organization: Pits and rings for specific instruments
  • Hand signals: High-bandwidth visual communication
  • Peripheral awareness: Seeing the whole floor at once

The grid echoes the trading pit—a spatial organization of a complex domain.

Modern Developments

Gödel Terminal

The emerging Gödel Terminal project explores:

  • AI-native interfaces: Designed for LLM integration
  • Semantic commands: Natural language as primary input
  • Dynamic context: Interface adapts to conversation

Gaius draws on this for its slash command system and domain adaptation.

Claude Code

Anthropic’s Claude Code (the code agent that co-created Gaius) pioneered:

  • Slash commands: /help, /clear, /review
  • Context awareness: Understanding codebase structure
  • Conversational flow: Natural language with structured commands

Gaius’s command system directly inherits this pattern.

LLM-Augmented Interfaces

The 2023-2024 wave of LLM tools demonstrated:

  • Natural language as interface: Beyond command-line syntax
  • Agent architectures: Multiple specialized perspectives
  • Embeddings everywhere: Semantic similarity as fundamental operation

Gaius integrates all three.

Visualization Traditions

Information Visualization

Tufte’s principles:

  • Data-ink ratio: Maximize information, minimize decoration
  • Small multiples: Repeated grids for comparison
  • Layering and separation: Overlays instead of clutter

Topological Visualization

Carlsson and others showed that shape matters. TDA visualization typically uses:

  • Persistence diagrams: Birth-death scatter plots
  • Barcodes: Horizontal bars for feature lifespans

Gaius experiments with projecting these onto the grid—making topology spatial.

Game Interfaces

Go software (KGS, OGS, Sabaki) provides:

  • Board representation: The 19×19 standard
  • Coordinate systems: A-T, 1-19
  • Stone visualization: Contrast, shadows, territory

We inherit the board but repurpose it for data.
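
Go coordinate labels skip the letter I, so the 19 columns run A-T. A conversion sketch from 0-indexed lattice positions (the indexing convention is an assumption):

```python
COLS = "ABCDEFGHJKLMNOPQRST"  # standard Go labels: A-T with 'I' skipped

def to_go_coord(col: int, row: int) -> str:
    """Convert a 0-indexed lattice (col, row) to a Go-style coordinate."""
    if not (0 <= col < 19 and 0 <= row < 19):
        raise ValueError("off the 19x19 board")
    return f"{COLS[col]}{row + 1}"

print(to_go_coord(0, 0))    # A1
print(to_go_coord(9, 9))    # K10 (tengen, the center point)
print(to_go_coord(18, 18))  # T19
```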

Cognitive Science

Embodied Cognition

Lakoff, Johnson, and others argue that thought is grounded in bodily experience. Spatial metaphors (“high status,” “falling behind”) pervade language.

Gaius literalizes these metaphors: positions have meaning, movement has direction, territory can be claimed.

Distributed Cognition

Hutchins showed that cognition extends beyond the skull—tools, environments, and other people participate in thinking.

Gaius + human + agent swarm form a cognitive system. The grid is external memory; agents are external perspectives; topology is external pattern detection.

Ecological Psychology

Gibson’s affordances: the environment offers action possibilities. A grid affords navigation. Overlays afford comparison. Commands afford precision.

Design is the creation of useful affordances.

Synthesis

Gaius attempts to synthesize:

| Tradition | Contribution |
|---|---|
| Polymath encyclopedism | Comprehensive scope, cross-reference |
| Terminal interfaces | Text grid, keyboard, streaming |
| Modal editors | hjkl, modes, composition |
| Plan 9 / Acme | Text as command, minimal chrome |
| Bloomberg | Density, professionalism, real-time |
| Gödel / Claude Code | AI-native, slash commands |
| Visualization | Tufte principles, TDA projection |
| Cognitive science | Spatial cognition, distributed thinking |

The result is something new—an interface paradigm for augmented cognition in complex domains.

Bloomberg Terminal

The Bloomberg Terminal, launched in 1981, remains the gold standard for professional financial interfaces. With over 300,000 subscribers paying ~$24,000/year, it demonstrates that density and keyboard-first design can command premium value.

What Bloomberg Gets Right

Information Density

A Bloomberg screen contains more data per pixel than almost any other interface. Multiple panels display:

  • Real-time quotes
  • News headlines
  • Chart overlays
  • Analytics
  • Communication

Nothing is wasted. Every region serves a purpose.

Keyboard Supremacy

Bloomberg operators type commands like AAPL <EQUITY> <GO> to navigate. Function keys, abbreviations, and muscle memory enable speeds impossible with mouse navigation.

The terminal was designed for traders who can’t afford to look away from the market to find a menu item.

Consistent Mental Model

Despite thousands of functions, Bloomberg maintains consistency:

  • <GO> executes
  • <MENU> shows options
  • Yellow keys are market sectors
  • Green keys are actions

Learn the pattern once, apply it everywhere.

Real-Time as Default

Bloomberg screens update continuously. You don’t refresh; you watch. The terminal shows the world as it happens.

What Gaius Inherits

Density Without Clutter

The 19×19 grid provides 361 data points. Overlays add dimensions. But each view is coherent—one mode, one overlay, one interpretation.

Bloomberg achieves density through multiple panels. Gaius achieves it through layers on a unified surface.
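
Those 361 cells come from quantizing the continuous UMAP projection: coordinates are rounded to the nearest integer and clipped to [0, 18]. A sketch (assumes coordinates are already scaled to the grid’s range):

```python
import numpy as np

def quantize(coords: np.ndarray) -> np.ndarray:
    """Snap continuous 2-D projection coordinates to the 19x19 lattice
    by rounding, then clipping to [0, 18]."""
    return np.clip(np.rint(coords), 0, 18).astype(int)

pts = np.array([[3.4, 17.9], [-0.7, 22.1]])
print(quantize(pts))  # [[ 3 18] [ 0 18]]
```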

Keyboard-First

hjkl navigation. Slash commands. No required mouse. Power users should never reach for the trackpad.

Bloomberg charges premium prices for keyboard efficiency. Gaius provides it freely.

Consistency

Overlay cycling always uses o. Mode toggle always uses v. Quit is always q. The vocabulary is small and stable.

Live Updates

Swarm rounds update the grid in real-time. Agent positions shift as analysis proceeds. The view is alive.

Where Gaius Differs

Visual Language

Bloomberg uses dense text, tables, and traditional charts. Gaius uses a spatial grid with symbolic markers.

The grid enables pattern recognition that tables don’t. A cluster is visible instantly; a column of numbers requires scanning.

AI Integration

Bloomberg has added AI features incrementally. Gaius is AI-native—agents are foundational, not bolted on.

Openness

Bloomberg is proprietary and expensive. Gaius is open and free. The design philosophy is available for inspection and critique.

Domain Agnosticism

Bloomberg serves finance. Gaius adapts to any domain via the --domain flag. Pension analysis today, supply chain tomorrow, cybersecurity next week.

Lessons for Gaius

  1. Respect expertise: Bloomberg doesn’t dumb down for casual users. Gaius shouldn’t either.

  2. Invest in consistency: Bloomberg’s decades-old commands still work. Gaius should avoid gratuitous changes.

  3. Optimize for flow: Bloomberg operators enter flow states. Gaius should enable the same.

  4. Density is a feature: Information-rich displays serve experts. Don’t dilute for aesthetics.

  5. Keyboard speed matters: Milliseconds add up over thousands of operations.

The Bloomberg Bar (Status Line)

Bloomberg’s status area shows:

  • Current function
  • User identity
  • Connection status
  • Contextual hints

Gaius’s status line serves the same purpose:

Ready | TDA on | Swarm (pension) | hjkl=move o=overlay

Both provide constant orientation without demanding attention.

Beyond Bloomberg

Bloomberg optimized for 1980s constraints: text terminals, limited bandwidth, human-only analysis.

Gaius operates in a different era:

  • Unicode enables rich symbolism beyond ASCII
  • Embeddings enable semantic operations
  • Agents provide parallel analysis
  • Topology reveals hidden structure

We inherit Bloomberg’s keyboard efficiency while transcending its visual limitations.

Gödel Terminal

The Gödel Terminal represents an emerging paradigm for AI-native interfaces. While still evolving, it offers design principles that Gaius incorporates.

The AI-Native Interface

Traditional interfaces were designed for direct manipulation: click buttons, fill forms, navigate menus. The user explicitly specifies every action.

AI-native interfaces shift this paradigm:

  • Intent over action: Express what you want, not how to do it
  • Semantic understanding: The interface comprehends context
  • Adaptive response: Behavior adjusts to situation
  • Conversational flow: Dialogue as primary interaction

Gödel’s Key Ideas

Semantic Commands

Instead of hierarchical menus, semantic commands express intent:

/analyze the risk concentration in the northeast quadrant

The system interprets “northeast quadrant,” understands “risk concentration,” and executes appropriately.

Context Windows

Gödel maintains rich context:

  • Current state (what’s displayed)
  • History (what was discussed)
  • User patterns (typical workflows)
  • Domain knowledge (relevant concepts)

Commands are interpreted within this context, reducing verbosity.

Dynamic Layouts

The interface reorganizes based on task:

  • Analysis mode: Maximize grid, minimize chrome
  • Research mode: Split with documentation
  • Comparison mode: Side-by-side views

Agent Integration

Agents aren’t tools invoked occasionally—they’re persistent presences:

  • Always available for queries
  • Proactively surface insights
  • Learn from interaction patterns

What Gaius Inherits

Slash Commands

Gaius’s /command syntax follows Gödel’s semantic approach:

/domain "supply chain"
/ask "What are the top risks?"
/focus Risk

These read as intent expressions, not procedure calls.

Domain Adaptation

The --domain flag and /domain command enable semantic rewiring:

/domain "cybersecurity incident response"

All agents, embeddings, and analyses reorient to the new domain.

Contextual Awareness

Future Gaius versions will maintain:

  • Session history across restarts
  • User preference learning
  • Domain-specific vocabularies
  • Personalized agent tuning

Proactive Insight (Planned)

Agents could surface observations unprompted:

[Risk] Entropy spike detected. New death loop forming near D4.

The interface becomes an active collaborator, not a passive tool.

Where Gaius Extends Gödel

Spatial Grounding

Gödel uses conventional screen layouts. Gaius adds a spatial metaphor:

  • Positions have meaning
  • Navigation has direction
  • Territory can be claimed

This grounds abstract AI operations in spatial intuition.

Topological Awareness

Gödel focuses on semantic understanding. Gaius adds structural understanding via TDA:

  • Shape of data
  • Persistent features
  • Emergence and dissolution

Visualization Priority

Gödel emphasizes text and conversation. Gaius emphasizes visual pattern:

  • Grid as primary display
  • Text as secondary (log panel)
  • Overlays as visual analysis

Keyboard Efficiency

Gödel often implies mouse/touch interaction. Gaius prioritizes keyboard:

  • hjkl navigation
  • Single-key mode toggles
  • Command completion

Design Tensions

Automation vs. Control

Gödel tends toward autonomous agents. Gaius keeps humans in the loop:

  • Agents suggest, don’t act
  • Swarm rounds are explicit (s)
  • Domain changes are deliberate

Fluidity vs. Stability

Gödel’s dynamic layouts can disorient. Gaius’s grid is stable:

  • 19×19 never changes
  • Overlays add, don’t rearrange
  • Status line always present

Natural Language vs. Structure

Gödel embraces free-form input. Gaius balances:

  • Slash commands for precision
  • Query commands for natural language
  • Keyboard bindings for speed

The Synthesis

Gaius combines:

  • Gödel’s semantic awareness
  • Gaius’s spatial grounding
  • Bloomberg’s keyboard efficiency
  • TDA’s structural insight

The result is an AI-native interface that remains tangible—where complex analysis projects onto a navigable grid.

Future Convergence

As AI-native interfaces mature, we expect:

  • More spatial metaphors (not just Gaius)
  • Better keyboard integration
  • Richer visualization
  • Deeper agent collaboration

Gaius is an early experiment in this convergence.

Plan 9 & Acme

Plan 9 from Bell Labs (1992) was Ken Thompson and Rob Pike’s attempt to push Unix ideas to their logical conclusion. Its text editor, Acme (1994), remains one of the most influential programmer tools ever created.

Plan 9 Philosophy

Everything is a File

Unix had “everything is a file” as aspiration. Plan 9 achieved it:

  • Network connections: files
  • Processes: files
  • Graphics: files
  • Input devices: files

This uniformity enables composition. Any tool that reads files can process any system resource.

Distributed by Design

Plan 9 assumed network operation. Local and remote resources were accessed identically. Your terminal could seamlessly use a CPU server across the network.

Simplicity Through Completion

Rather than adding features, Plan 9 removed special cases. The result is smaller but more general.

Acme: A Different Editor

Acme is startling to modern users:

  • No syntax highlighting
  • No configuration files
  • No plugins
  • No key bindings (almost)

Yet Acme has a devoted following among highly productive programmers.

Mouse Chording

Acme uses three-button mouse chording:

  • Left: Select text
  • Middle: Execute selected text as command
  • Right: Search/open selected text

Any text can become a command. Type make, select it, middle-click. The boundary between text and action dissolves.

Tags as Command Lines

Each window has a “tag” line containing text. That text is executable:

/home/user/project Del Snarf Get | fmt | Look

Click on Del to delete the window. Click on fmt to reformat. The tag is a command palette you can edit.

No Modes

Acme has no insert/command mode distinction. You’re always in “insert mode”—typing inserts text. Commands are executed by clicking on them.

This eliminates mode errors entirely.

Plumbing

Plan 9’s plumber routes messages based on content. Click on a filename: it opens. Click on an error with line number: editor jumps there. Click on a URL: browser opens.

Pattern matching replaces explicit handlers.

What Gaius Inherits

Text as Command

Gaius plans to make log panel text executable:

[Risk] Cluster forming at K10-L12. Consider /analyze K10.

Click on /analyze K10 to execute it. Agent suggestions become actionable.
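
A minimal version of that text-as-command bridge scans log lines for slash-command spans (the command grammar here is a deliberate simplification):

```python
import re

# A slash verb followed by optional word-like arguments.
SLASH_CMD = re.compile(r"(/[a-z]+(?:\s+[A-Za-z0-9-]+)*)")

def executable_spans(line: str) -> list:
    """Find slash-command text in a log line, Acme-style: any command
    the agent mentions becomes a clickable, executable span."""
    return SLASH_CMD.findall(line)

log = "[Risk] Cluster forming at K10-L12. Consider /analyze K10."
print(executable_spans(log))  # ['/analyze K10']
```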

Minimal Configuration

Gaius aims for sensible defaults. The grid is 19×19. Colors are fixed. Navigation is hjkl. Power comes from composition, not configuration.

Compositional Commands

Planned command piping:

/region D4-F6 | /analyze | /summarize

Small operations combine into complex workflows—the Unix way.
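
The planned pipe syntax could be dispatched with a small fold over the stages. Everything below (handler names, signatures, intermediate data shapes) is hypothetical, a sketch of the idea rather than the real command router:

```python
# Hypothetical stage handlers: each receives the previous stage's result.
def region(arg, _input=None):
    return {"cells": f"selection {arg}"}

def analyze(arg, _input=None):
    return {"analysis": _input}

def summarize(arg, _input=None):
    return f"summary of {_input}"

HANDLERS = {"region": region, "analyze": analyze, "summarize": summarize}

def run_pipeline(line: str):
    """Split '/a x | /b | /c' on pipes and thread each result forward."""
    result = None
    for stage in line.split("|"):
        name, _, arg = stage.strip().lstrip("/").partition(" ")
        result = HANDLERS[name](arg, _input=result)
    return result

run_pipeline("/region D4-F6 | /analyze | /summarize")
```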

Simplicity Through Generality

One grid serves many purposes:

  • Go stones
  • Pension allocations
  • Agent positions
  • Topological features

The grid is general; overlays specialize.

Where Gaius Differs

Modes Exist

Acme’s modelessness works for text editing. Gaius’s modes serve navigation:

  • Normal mode: hjkl moves cursor
  • Command mode: typing enters commands
  • (Future) Visual mode: region selection

Modes concentrate related operations without modifier keys.

Keyboard Priority

Acme was designed for mice (three-button, specifically). Gaius prioritizes keyboard:

  • Navigation without mouse
  • Commands via slash prefix
  • Mode switching via single keys

Both approaches are valid; Gaius serves users who prefer keyboard.

Visualization Over Text

Acme is fundamentally a text environment. Gaius is fundamentally visual:

  • Grid as primary display
  • Symbols over words
  • Patterns over paragraphs

Lessons from Plan 9/Acme

1. Composition Over Features

Don’t add a feature when you can compose existing ones. Gaius’s overlay system composes simple layers; it doesn’t have a “complex visualization mode.”

2. Uniformity Enables Power

Consistent interaction patterns (every overlay cycles with o, every mode toggles with its key) compound into expertise.

3. Text as Interface

Making text executable bridges display and action. Log panel entries become command suggestions.

4. Defaults Over Configuration

Every configuration option is a decision users must make. Prefer good defaults. Gaius’s fixed color scheme and grid size are deliberate.

5. Network Transparency

Gaius doesn’t yet have distributed operation, but the architecture anticipates it:

  • Agent swarms could run remotely
  • Vector memory could be shared
  • Grid state could synchronize

The Acme User Profile

Acme attracts a specific user: one who prefers mastery over convenience, composition over features, simplicity over apparent ease.

Gaius seeks the same users:

  • Experts who will invest in learning
  • Polymaths who work across domains
  • Professionals who value efficiency

If you want a tool that works immediately without learning, Gaius (like Acme) isn’t it. If you want a tool that rewards mastery, welcome.

Rob Pike’s Influence

Pike’s essays—“Notes on Programming in C,” “A Lesson in Brevity,” various design rationales—express a philosophy:

  • Clarity over cleverness
  • Data structures over algorithms
  • Composition over inheritance (long before OOP design circles adopted it as a mantra)

Gaius aspires to this clarity: a small set of concepts (grid, overlays, modes, commands) that compose into powerful workflows.

OODA Loop

John Boyd’s OODA loop (Observe, Orient, Decide, Act) describes competitive decision-making under uncertainty. Gaius is explicitly designed to accelerate each phase.

The Loop in Gaius

Observe

The grid displays current system state. Health checks, agent positions, and topology overlays provide immediate perception without requiring sequential reading.

Tools: Grid view, /health, /gpu status, overlay modes

Orient

Context-building through overlays, memory search, and agent analysis. Multiple perspectives (risk, topology, temporal) help frame observations.

Tools: Overlay cycling (o), /search, /sitrep, MiniGrid projections

Decide

Slash commands, domain changes, and focus actions translate understanding into intent.

Tools: Command input (/), tenuki (t), mode cycling (v)

Act

Execute decisions: run analysis, apply fixes, export insights, trigger evolution.

Tools: /health fix, /evolve trigger, /render, /swarm

Fast OODA Wins

The competitive advantage of OODA comes from cycle speed. Gaius minimizes latency at every stage:

  • Observe: Grid renders state instantly (no loading, no scrolling)
  • Orient: Overlays toggle without delay (pre-computed)
  • Decide: Keyboard-first eliminates mouse targeting time
  • Act: Engine RPCs execute in <30s (most <1s)

OODA for Autonomous Agents

The same loop applies to Gaius’s autonomous systems:

| Phase | Health Observer | Evolution Daemon |
|---|---|---|
| Observe | Health checks | GPU utilization monitoring |
| Orient | FMEA risk scoring | Agent performance evaluation |
| Decide | Tier selection (0/1/2) | Candidate ranking |
| Act | Remediation or escalation | Promote or discard |

Fail Open Supports Observation

The fail open principle directly supports the Observe phase: by surfacing unknown states rather than hiding them, it ensures the OODA loop always has complete visibility.

Infrastructure

Gaius runs on a local development infrastructure managed by devenv (Nix-based), with process-compose for service orchestration and Just for task running.

Components

| Component | Purpose | Management |
|---|---|---|
| devenv | Nix-based development environment | devenv shell |
| process-compose | Service orchestration | devenv processes up/down |
| Just | Task runner (recipes) | just <recipe> |
| PostgreSQL | Primary database (:5444) | devenv process |
| Qdrant | Vector store (:6334) | devenv process |
| Aeron | IPC transport | devenv process |
| NiFi | Data ingestion | devenv process |
| Metabase | Analytics dashboards | devenv process |
| Gaius Engine | gRPC daemon (:50051) | devenv process |

Quick Start

# Enter development environment
devenv shell

# Start all services
devenv processes up

# Or clean restart (preferred)
just restart-clean

# Check status
uv run gaius-cli --cmd "/health" --format json

Architecture

devenv.nix is a pure service declaration file (~470 lines). It defines packages, environment variables, service configurations, and process dependency graphs. All process startup bash lives in scripts/processes/*.sh.

devenv Environment

Gaius uses devenv for a Nix-based development environment that provides all system dependencies reproducibly.

Structure

devenv.nix is a pure service declaration file. It defines:

  • Packages: System tools (kubectl, k9s, mdbook, etc.) provided by Nix
  • Environment variables: DATABASE_URL, PGPORT, KUBECONFIG, etc.
  • Process definitions: One-liner exec blocks pointing to scripts
  • Dependency graphs: Process startup ordering via depends_on
  • enterShell: Interactive shell setup (PATH, aliases, KUBECONFIG)

Key Design Rules

No Inline Bash

All process startup bash lives in scripts/processes/*.sh. The devenv.nix exec blocks are one-liners:

processes.gaius-engine = {
  exec = ''
    exec ${config.devenv.root}/scripts/processes/gaius-engine.sh
  '';
};

Nix Store Paths as Env Vars

When a script needs Nix-managed binaries, pass them as environment variables:

processes.nifi = {
  exec = ''
    export NIFI_PACKAGE="${pkgs.nifi}"
    exec ${config.devenv.root}/scripts/processes/nifi.sh
  '';
};

KUBECONFIG Handling

enterShell only runs for interactive shells. Process scripts must set KUBECONFIG unconditionally from $HOME:

export KUBECONFIG="$HOME/.config/kube/rke2.yaml"

Never use fallback syntax (${KUBECONFIG:-...}) — the system KUBECONFIG may point to a root-owned path.

Environment Variables

| Variable | Value | Source |
|---|---|---|
| DATABASE_URL | postgres://gaius:gaius@localhost:5444/zndx_gaius | devenv.nix |
| PGPORT | 5444 | devenv.nix |
| KUBECONFIG | ~/.config/kube/rke2.yaml | enterShell |
| METAFLOW_SERVICE_URL | http://localhost:8180 | enterShell |

Nix-Managed Tools

kubectl and k9s are provided by Nix (not the system RKE2 binary). This ensures version consistency across environments.

Process Scripts

All process startup bash lives in scripts/processes/*.sh. Shared helpers are in scripts/lib/.

Process Scripts

| Script | Service | Dependencies |
|---|---|---|
| aeron-driver.sh | Aeron IPC transport | None |
| gaius-engine.sh | gRPC engine daemon | Aeron, PostgreSQL |
| gaius-worker.sh | Background worker | Engine |
| metabase.sh | Analytics dashboards | PostgreSQL |
| metaflow-bootstrap.sh | Metaflow K8s setup | Kubernetes |
| metaflow-db-setup.sh | Metaflow database | PostgreSQL |
| metaflow-port-forwards.sh | K8s port forwarding | Kubernetes |
| metaflow-ui.sh | Metaflow UI | Metaflow service |
| nifi.sh | Data ingestion | PostgreSQL |

Shared Helpers

scripts/lib/process-helpers.sh

Common functions used by all process scripts:

| Function | Purpose |
|---|---|
| banner | Print startup banner with service name |
| check_disabled | Skip if service is disabled via env var |
| wait_for_postgres | Block until PostgreSQL is accepting connections |
| wait_for_aeron | Block until Aeron driver is ready |
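
As an illustration, a helper like wait_for_postgres could be implemented as a simple poll loop. This is a hedged sketch, assuming pg_isready is on PATH; the real implementation lives in scripts/lib/process-helpers.sh:

```shell
# Sketch only: poll until PostgreSQL accepts connections on the devenv port.
wait_for_postgres() {
  until pg_isready -h localhost -p "${PGPORT:-5444}" >/dev/null 2>&1; do
    echo "waiting for postgres on port ${PGPORT:-5444}..."
    sleep 1
  done
}
```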

scripts/lib/gpu-helpers.sh

GPU cleanup functions shared by gaius-engine.sh and the justfile:

| Function | Purpose |
|---|---|
| gpu_cleanup | Kill orphan vLLM/CUDA processes |

Script Pattern

Every process script follows the same structure:

#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$SCRIPT_DIR/../lib/process-helpers.sh"

banner "Service Name"
check_disabled "SERVICE_NAME"

# Wait for dependencies
wait_for_postgres

# Set KUBECONFIG unconditionally (not from enterShell)
export KUBECONFIG="$HOME/.config/kube/rke2.yaml"

# Start the service
exec some-command --flags

Adding a New Process

  1. Create scripts/processes/<name>.sh with the pattern above
  2. Add process block to devenv.nix with one-liner exec
  3. Pass any Nix-only values as env vars in the exec block
  4. Set dependency ordering with process-compose.depends_on

Just Task Runner

Gaius uses Just as its task runner, replacing devenv-tasks which had SQLite locking issues.

Why Just

devenv-tasks 2.0.0 introduced a tasks.db SQLite file that deadlocks when tasks call devenv up. Just is a pure command runner with no state files — it reads justfile and executes recipes.

Key Recipes

just --list              # Show all available recipes

# Core operations
just restart-clean       # Full clean restart (preferred)
just proto-generate      # Regenerate gRPC protobuf bindings

# GPU management
just gpu-cleanup         # Kill orphan vLLM/CUDA processes
just gpu-deep-cleanup    # Aggressive GPU memory cleanup

# Documentation
just docs-build          # Build mdbook documentation

# Kubernetes
just k8s-cleanup         # Clean up K8s resources

restart-clean

The most important recipe. Delegates to scripts/restart-clean.sh:

  1. Stops all devenv processes
  2. Kills stale vLLM/CUDA processes
  3. Cleans up GPU memory
  4. Strips DEVENV_* environment variables (uses env -i)
  5. Restarts everything fresh

Warm start time: ~13 seconds (with a warm Nix cache).

just restart-clean

Usage

Invoke from the devenv shell (or any shell with just + devenv):

devenv shell
just <recipe>

Recipes are defined in justfile at the project root.
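
For context, a recipe that follows the delegation pattern described above might look like this in justfile syntax (the bodies shown are illustrative; the actual recipes live in the project’s justfile):

```just
# Illustrative fragment, not the real justfile
restart-clean:
    ./scripts/restart-clean.sh

docs-build:
    cd docs/current && mdbook build
```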

Deployment

Gaius uses RKE2 Kubernetes for production services (Metaflow, supporting infrastructure) and local process-compose for the core platform.

Local Development

The primary deployment model is local, using devenv process-compose:

devenv processes up    # Start all services
devenv processes down  # Stop all services
just restart-clean     # Clean restart (preferred)

Kubernetes Services

Supporting services run in RKE2 Kubernetes:

| Service | Namespace | Purpose |
|---|---|---|
| Metaflow metadata | default | Flow run tracking |
| Metaflow UI | default | Web dashboard |

Kubernetes resources are managed via Tilt in infra/tilt/.

Kubernetes

Gaius uses an RKE2 cluster for running Metaflow and supporting services.

Kubeconfig Setup

RKE2 installs its kubeconfig at /etc/rancher/rke2/rke2.yaml (root-owned). Copy it to a user-accessible location:

mkdir -p ~/.config/kube
sudo cp /etc/rancher/rke2/rke2.yaml ~/.config/kube/rke2.yaml
sudo chown $(id -u):$(id -g) ~/.config/kube/rke2.yaml

Set KUBECONFIG:

export KUBECONFIG="$HOME/.config/kube/rke2.yaml"

This is set automatically by enterShell in devenv.nix for interactive shells. Process scripts set it unconditionally.

Nix-Managed Tools

kubectl and k9s are provided by Nix packages in devenv.nix — not the system RKE2 binary. This ensures version consistency.

Pod Networking

K8s pods need pg_hba.conf entries for cluster networks:

host all all 10.42.0.0/16 md5   # Pod network
host all all 10.43.0.0/16 md5   # Service network

Tilt

Development iteration on K8s resources uses Tilt, configured in infra/tilt/.

Cleanup

just k8s-cleanup    # Clean up stale K8s resources

Metaflow Service

The Metaflow metadata service runs in Kubernetes and enables local flow execution with centralized run tracking.

Deployment

Metaflow service is deployed via Tilt in infra/tilt/.

Port Forwarding

The service runs in-cluster and needs port-forwarding for local access:

kubectl port-forward svc/metaflow-service 8180:8080

This is handled automatically by the metaflow-port-forwards.sh process script.

Environment

Set the service URL for flow runs:

export METAFLOW_SERVICE_URL=http://localhost:8180

This is set automatically by enterShell in devenv.nix.

Database

Metaflow uses the same PostgreSQL instance (port 5444) with its own database, set up by the metaflow-db-setup.sh process script.

Bootstrapping

The metaflow-bootstrap.sh script handles initial K8s resource creation for the Metaflow service.

Monitoring

Operational monitoring combines CLI health checks, GPU status tracking, and Metabase dashboards.

Quick Status

# Overall health
uv run gaius-cli --cmd "/health" --format json

# GPU status
uv run gaius-cli --cmd "/gpu status" --format json

# Health Observer incidents
uv run gaius-cli --cmd "/health incidents" --format json

Monitoring Stack

| Tool | Purpose | Access |
|---|---|---|
| /health CLI | Infrastructure health checks | CLI/MCP |
| /gpu status CLI | GPU and endpoint monitoring | CLI/MCP |
| Health Observer | Continuous background monitoring | Engine daemon |
| Metabase | Analytics dashboards | Web UI |
| Prometheus | Time-series metrics | Query API |

Health Checks

The /health command runs diagnostics across all system components and reports status.

Running Health Checks

# All checks
uv run gaius-cli --cmd "/health" --format json

# Specific category
uv run gaius-cli --cmd "/health gpu" --format json
uv run gaius-cli --cmd "/health endpoints" --format json
uv run gaius-cli --cmd "/health infrastructure" --format json

Interpreting Results

Each check reports a status:

| Status | Meaning |
|---|---|
| PASS | Component is healthy |
| WARN | Component has issues but is functional |
| FAIL | Component is unhealthy |

Applying Fixes

When checks fail, use /health fix:

# Fix a specific service
uv run gaius-cli --cmd "/health fix engine" --format json

# Available services
# engine, dataset, nifi, postgres, qdrant, minio, endpoints, evolution

Always try /health fix before manual intervention. This exercises the self-healing system and helps it improve over time.

Manual Fallback

If /health fix fails:

# Full clean restart
just restart-clean

# GPU-specific cleanup
just gpu-cleanup
just gpu-deep-cleanup

FMEA Diagnostics

For deeper analysis:

# FMEA summary with RPN scores
uv run gaius-cli --cmd "/fmea" --format json

# Failure mode details
uv run gaius-cli --cmd "/fmea detail GPU_001" --format json

GPU Management

Gaius manages 6 NVIDIA GPUs across vLLM inference, LuxCore rendering, and embedding workloads.

GPU Allocation

| GPU | Typical Use |
|---|---|
| 0-1 | Reasoning endpoint (tensor_parallel=2) |
| 2-3 | Coding endpoint (tensor_parallel=2) |
| 4 | Embedding endpoint |
| 5 | Available for rendering/evolution |

Allocation is managed by the Orchestrator. GPUs can be temporarily reassigned for rendering or evolution workloads via makespan scheduling.

Status Monitoring

# Endpoint status
uv run gaius-cli --cmd "/gpu status" --format json

# GPU health (memory, temperature, utilization)
uv run gaius-cli --cmd "/gpu health" --format json

Cleanup

When GPU processes get stuck or memory leaks:

# Standard cleanup (kill orphan vLLM processes)
just gpu-cleanup

# Deep cleanup (aggressive memory recovery)
just gpu-deep-cleanup

The gpu-helpers.sh shared library provides the gpu_cleanup function used by both the engine startup script and the justfile recipes.

Common Issues

| Issue | Symptom | Fix |
|---|---|---|
| Orphan vLLM process | GPU memory used but no endpoint | just gpu-cleanup |
| OOM during model load | Endpoint stuck in STARTING | Free GPU, then /health fix endpoints |
| CUDA memory fragmentation | Degraded inference speed | just gpu-deep-cleanup then restart |
| OpenCV conflict | vLLM WorkerProc fails (cv2 error) | Already fixed via pyproject.toml override |

Rendering GPU Eviction

The viz pipeline temporarily evicts a low-priority endpoint to use a GPU for LuxCore rendering:

  1. Orchestrator evicts endpoint from target GPU
  2. LuxCore renders using PATHOCL engine with CUDA
  3. clear_embeddings() releases Nomic model (~3GB)
  4. Orchestrator restores evicted endpoint

Contributing

Gaius is an experiment in augmented cognition. Contributions that advance this vision are welcome.

Development Setup

# Clone and enter
git clone https://github.com/zndx/gaius.git
cd gaius

# Start devenv (provides all system dependencies)
devenv shell

# Install Python dependencies
uv sync

# Start all platform components
devenv processes up

# Or use a clean restart
just restart-clean

# Run the TUI
uv run gaius

# Run the CLI
uv run gaius-cli --cmd "/health" --format json

Project Structure

gaius/
├── src/gaius/          # Python source (26 packages)
│   ├── app.py          # TUI application
│   ├── cli.py          # Non-interactive CLI
│   ├── mcp_server.py   # MCP server (163 tools)
│   ├── engine/         # gRPC engine (37 services)
│   ├── health/         # Self-healing infrastructure
│   ├── agents/         # Agent system
│   └── ...
├── scripts/
│   ├── processes/      # Process startup scripts
│   └── lib/            # Shared helpers
├── docs/current/       # mdbook documentation
├── config/             # HOCON configuration
├── justfile            # Task runner recipes
├── devenv.nix          # Development environment
├── pyproject.toml      # Python dependencies
└── CLAUDE.md           # Development guidelines

Development Workflow

Testing Changes

The CLI is the product. After every code change, verify via CLI:

# After editing code — always re-test
uv run gaius-cli --cmd "/health" --format json

Previous test outputs are invalidated by code changes. Don’t reason from stale context — run the command again.

Key Recipes

just --list              # Show all available tasks
just restart-clean       # Full clean restart
just proto-generate      # Regenerate gRPC bindings
just gpu-cleanup         # Clean up GPU processes
just docs-build          # Build documentation

Design Principles

When contributing, these principles are mandatory:

  1. Fail-fast: Errors surface immediately with guru codes and remediation hints. No silent degradation.
  2. Engine-first: Business logic belongs in engine services, not in interfaces.
  3. Self-healing first: Prefer /health fix over manual remediation.
  4. Keyboard-first: Every operation available via keyboard.
  5. CLI verification: All new features must be testable via gaius-cli.

Code Style

  • Python 3.12+ features welcome
  • Type hints for public interfaces
  • Local imports inside functions for lazy loading in service modules
  • Use from gaius.core.config import get_database_url for DB URL (never hardcode)

Commit Messages

Use conventional commit style:

feat: add temporal overlay mode
fix: correct grid boundary check
docs: expand TDA explanation
refactor: simplify swarm initialization

Pull Request Process

  1. Create a feature branch
  2. Make changes with clear commits
  3. Verify via CLI: uv run gaius-cli --cmd "/health" --format json
  4. Ensure cd docs/current && mdbook build succeeds if docs changed
  5. Submit PR with description of changes

Architecture Decision Records

Key architectural decisions that shaped the system.

ADR-001: Engine-First Architecture

Context: Business logic was scattered across TUI, CLI, and utility scripts, causing duplication and inconsistency.

Decision: Centralize all business logic in the gRPC engine. TUI, CLI, and MCP become thin clients.

Consequences: Single source of truth for all operations. Engine manages GPU resources centrally. All interfaces get consistent behavior automatically.

ADR-002: Just Over devenv-tasks

Context: devenv-tasks 2.0.0 introduced SQLite locking on tasks.db that deadlocks when tasks call devenv up.

Decision: Migrate from devenv-tasks to Just as the task runner.

Consequences: Pure command runner with no state files. Recipes defined in justfile. No locking issues. scripts/restart-clean.sh still does actual work; Just recipe delegates to it.

ADR-003: Fail-Fast as Iron-Clad Principle

Context: Silent degradation hid problems until they became critical.

Decision: All code must surface errors immediately with guru meditation codes and remediation paths. No silent fallbacks.

Consequences: Higher initial friction (more explicit error handling) but dramatically faster diagnosis and resolution. Self-healing system built on reliable error detection.

ADR-004: FMEA for Health Monitoring

Context: Simple severity classifications don’t capture risk adequately — a rare but invisible failure is more dangerous than a frequent but obvious one.

Decision: Adopt FMEA (Failure Mode and Effects Analysis) with RPN scoring for health monitoring.

Consequences: Quantitative risk assessment (RPN = Severity × Occurrence × Detection). Tiered remediation based on risk level. Adaptive learning from outcomes.
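
The RPN arithmetic behind this decision is simple enough to sketch (the 1-10 scale per factor is a common FMEA convention, assumed here rather than taken from Gaius’s config):

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number: S * O * D, each typically scored 1-10.
    Higher detection scores mean the failure is HARDER to detect."""
    return severity * occurrence * detection

# A rare but hard-to-detect failure outranks a frequent, obvious one:
rpn(9, 2, 9)  # -> 162
rpn(5, 8, 2)  # -> 80
```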

ADR-005: LuxCore Over Blender for Visualization

Context: Blender’s Cycles renderer couldn’t render glass convincingly (opaque white blobs).

Decision: Use LuxCore unbiased path tracer for card visualization, initially via PyPI, later from source for GPU acceleration.

Consequences: Physically accurate glass rendering. GPU-accelerated via PATHOCL engine with CUDA. More complex build process but superior visual quality.

ADR-006: Process Scripts Over Inline Nix Bash

Context: devenv.nix contained inline bash blocks that were hard to debug and test.

Decision: Move all process startup bash to scripts/processes/*.sh with shared helpers. devenv.nix becomes a pure service declaration file with one-liner exec blocks.

Consequences: Scripts are independently testable. Shared helpers eliminate duplication. Nix store paths passed as environment variables.

Adding New ADRs

When making significant architectural decisions:

  1. Add an entry here with Context, Decision, and Consequences
  2. Reference the ADR in relevant code comments
  3. Update CLAUDE.md if the decision affects development workflow

Proto Change Workflow

Changes to the gRPC protobuf schema require a specific workflow to keep generated bindings, internal enums, and status mappings in sync.

Step-by-Step

1. Edit the Proto File

Edit src/gaius/engine/proto/gaius_service.proto. Append new enum values — don’t renumber existing values for wire compatibility.

2. Regenerate Bindings

just proto-generate

This generates gaius_service_pb2.py and gaius_service_pb2_grpc.py.

3. Update Generated Exports

Add new symbols to src/gaius/engine/generated/__init__.py:

  • Add to the import block
  • Add to the __all__ list

Critical: Skipping this causes engine startup failures.

4. Update Internal Enums

If there’s a parallel Python enum (e.g., in vllm_controller.py), sync it with the proto enum.

5. Update Status Mappings

Add string-to-proto mappings in the servicer’s _STATUS_MAP.
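
The shape of such a mapping can be sketched as follows; the enum values and status strings here are illustrative stand-ins, while the real map lives in the servicer next to the generated proto enum:

```python
from enum import IntEnum

class EndpointStatus(IntEnum):
    # Stand-in for the generated proto enum (values are illustrative)
    UNKNOWN = 0
    STARTING = 1
    READY = 2

_STATUS_MAP = {
    "starting": EndpointStatus.STARTING,
    "ready": EndpointStatus.READY,
}

def to_proto_status(s: str) -> EndpointStatus:
    # Fail-fast: an unmapped status raises KeyError instead of
    # silently defaulting to UNKNOWN.
    return _STATUS_MAP[s.lower()]
```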

6. Verify

uv run python -c "from gaius.engine.generated import NEW_SYMBOL; print('OK')"

7. Restart and Test

just restart-clean
uv run gaius-cli --cmd "/gpu status" --format json

Common Mistakes

| Symptom | Cause | Fix |
|---|---|---|
| Engine fails to start | Missing export in __init__.py | Add symbol to imports and __all__ |
| Port 50051 not listening | Import error in gRPC server | Check engine logs |
| Status shows wrong value | Missing _STATUS_MAP entry | Add mapping |

See Protobuf Schema for more detail.

Testing

Gaius follows a CLI-first testing methodology. The CLI is the product — all functionality must be verifiable through it.

Core Rules

1. Re-Test After Every Code Change

Previous test outputs are invalidated by code changes. Don’t reason from stale context — run the command again.

# After editing code:
# BAD: "The fix should work based on my analysis"
# GOOD: Actually run it
uv run gaius-cli --cmd "/evolve status" --format json

2. No Static Test Data

We do not fall back to static test data. All functional aspects of new features must be verified directly against running services.

3. No Fallback Workarounds

Do not rely on fallbacks or workarounds when testing. If a service is down, fix it (via /health fix) rather than mocking around it.

Verification Patterns

Health Check

uv run gaius-cli --cmd "/health" --format json

Endpoint Status

uv run gaius-cli --cmd "/gpu status" --format json | jq '.data.endpoints[] | {name, status}'

Evolution Status

uv run gaius-cli --cmd "/evolve status" --format json

Import Verification

For new modules or proto changes:

uv run python -c "from gaius.engine.generated import NewSymbol; print('OK')"

TUI Testing

TUI behavior must be tested using Textual Pilot before committing:

# `app` is the Textual App instance under test
async with app.run_test() as pilot:
    await pilot.press("h")  # Navigate left
    assert app.cursor_x == expected_x

Fail-Fast Compliance

Before committing, verify:

# No fallback patterns
grep -rn "fail_fast\|SELENIUM_AVAILABLE" src/gaius/
# No placeholder image colors
grep -rn "240, 240, 240" src/gaius/

All error messages must include guru meditation codes and remediation hints.
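
In code, a fail-fast error carrying a guru code and a remediation hint might be structured like this (the exception class is a sketch, not taken from the Gaius source; the code string is from the Guru Meditation Codes catalog):

```python
class GuruError(RuntimeError):
    """Fail-fast error carrying a guru meditation code and a fix hint."""
    def __init__(self, code: str, message: str, fix: str):
        super().__init__(f"{code}: {message} (fix: {fix})")
        self.code = code
        self.fix = fix

# Surface a dead gRPC connection instead of silently retrying:
err = GuruError("#GR.00000001.CONNFAIL",
                "gRPC connection failed",
                "check engine status")
```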

CLI Commands

63 slash commands are available in both the TUI and CLI interfaces. Commands are executed via:

# TUI: press / then type command
/health

# CLI: use --cmd flag
uv run gaius-cli --cmd "/health" --format json

Command Categories

Health & Diagnostics

| Command | Description |
|---|---|
| /health | Run all health checks |
| /health <category> | Run checks for category (gpu, endpoints, infrastructure) |
| /health fix <service> | Apply automated fix strategy |
| /health observer | Health Observer daemon status |
| /health incidents | List active incidents |
| /fmea | FMEA summary with RPN scores |
| /fmea catalog | List all failure modes |
| /fmea detail <id> | Failure mode details |

GPU & Endpoints

| Command | Description |
|---|---|
| /gpu status | Endpoint and GPU status |
| /gpu health | GPU memory, temperature, utilization |

Agents & Evolution

| Command | Description |
|---|---|
| /swarm | Run swarm analysis |
| /evolve status | Evolution daemon status |
| /evolve trigger | Trigger evolution cycle |
| /cognition | Trigger cognition cycle |
| /thoughts | View recent thoughts |
| /sitrep | Situational report |
| /theta consolidate | Run theta consolidation |

Knowledge Base

| Command | Description |
|---|---|
| /search <query> | Search knowledge base |
| /kb list | List KB entries |
| /kb create | Create KB entry |

System

| Command | Description |
|---|---|
| /state | Current application state |
| /render | Render card visualizations |
| /xai budget | XAI API budget status |

Note: This is a representative subset. Run /help in the TUI for the complete list, or see the dispatch table in src/gaius/cli.py.

MCP Tools

163 MCP tools expose Gaius functionality to Claude Code and other MCP-compatible clients.

Tool Categories

| Category | Count | Description |
|---|---|---|
| Health | ~20 | Health checks, FMEA, observer, incidents |
| Agents | ~15 | Swarm, evolution, cognition, theta |
| Inference | ~10 | Scheduler, endpoints, GPU status |
| Knowledge Base | ~15 | Search, CRUD, sync, semantic search |
| Observability | ~10 | Metrics, Prometheus, status |
| Data Pipeline | ~10 | Metaflow, lineage, flows |
| Visualization | ~5 | Render, card management |
| Bases | ~10 | Feature store queries, entity history |
| Collections | ~15 | Card collections, publishing |
| Articles | ~5 | Article curation, status |
| X Bookmarks | ~8 | Sync, auth, folders |
| Calibration | ~5 | Understanding calibration |
| Evolution | ~10 | Agent versions, optimization |
| System | ~25 | Config, models, sessions, research |

Naming Convention

Tools follow a consistent naming pattern: <domain>_<action> (e.g., health_observer_status, scheduler_submit, gpu_health).

Example Usage

From Claude Code:

> Use the health_observer_status tool to check system health
> Use the gpu_health tool to check GPU memory usage
> Use the search_kb tool to find articles about pensions

Server Configuration

{
  "mcpServers": {
    "gaius": {
      "command": "uv",
      "args": ["run", "gaius-mcp"],
      "cwd": "/path/to/gaius"
    }
  }
}

Note: For the complete tool list with parameters, see src/gaius/mcp_server.py.

Guru Meditation Codes

Complete catalog of error codes used across the Gaius platform.

Format

#<COMPONENT>.<SEQUENCE>.<MNEMONIC>
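
Because some components carry a subcomponent (e.g. ACP.SEC), the code is best split from the right. A sketch of a parser for this format (the helper name is illustrative):

```python
def parse_guru_code(code: str) -> tuple[str, str, str]:
    """Split '#<COMPONENT>.<SEQUENCE>.<MNEMONIC>' into its three fields.
    Components may contain dots (e.g. ACP.SEC), so split from the right."""
    component, sequence, mnemonic = code.lstrip("#").rsplit(".", 2)
    return component, sequence, mnemonic

parse_guru_code("#ACP.SEC.00000002.NOTALLOWED")
# -> ("ACP.SEC", "00000002", "NOTALLOWED")
```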

Catalog

DS — DatasetService

| Code | Description | Fix |
|---|---|---|
| #DS.00000001.SVCNOTINIT | DatasetService not initialized | /health fix dataset |

NF — NiFi

| Code | Description | Fix |
|---|---|---|
| #NF.00000001.UNREACHABLE | NiFi not reachable | /health fix nifi |

EN — Engine

| Code | Description | Fix |
|---|---|---|
| #EN.00001.GRPC_BIND | gRPC port bind failure | Check port 50051 |
| #EN.00002.VLLM_START | vLLM startup failure | /health fix endpoints |
| #EN.00003.GPU_OOM | GPU out of memory | just gpu-cleanup |
| #EN.00004.ORPHAN_PROC | Orphan vLLM process | just gpu-cleanup |

EP — Endpoints/Inference

| Code | Description | Fix |
|---|---|---|
| #EP.00000001.GPUOOM | GPU out of memory during inference | /health fix endpoints |

GR — gRPC

| Code | Description | Fix |
|---|---|---|
| #GR.00000001.CONNFAIL | gRPC connection failed | Check engine status |

ACP — Agent Client Protocol

| Code | Description | Fix |
|---|---|---|
| #ACP.00000001.CONNFAIL | ACP connection failed | Check Claude Code |
| #ACP.00000002.TIMEOUT | ACP connection timeout | Retry |
| #ACP.SEC.00000002.NOTALLOWED | Repo not in allowlist | Update acp.conf |
| #ACP.SEC.00000003.NOTPRIVATE | Repo not private | Make repo private |

ACF — Article Curation Flow

| Code | Description | Fix |
|---|---|---|
| #ACF.00000013.NOHINTS | Empty keywords in article frontmatter | Add keywords/news_queries |

XB — X Bookmarks

| Code | Description | Fix |
|---|---|---|
| #XB.00000001.NOTOKEN | No auth token | Complete OAuth flow |
| #XB.00000011.NOFOLDER | Folders API unavailable (403) | Upgrade API tier |

HL — Health

| Code | Description | Fix |
|---|---|---|
| #HL.00001.GRPC_DOWN | gRPC connection down | just restart-clean |
| #HL.00002.GPU_OOM | GPU memory exhausted | just gpu-cleanup |

Note: This is a representative subset. Guru codes are assigned as new failure modes are identified. See CLAUDE.md for the full format specification.

Database Schema

PostgreSQL database zndx_gaius on port 5444.

Connection

PGPASSWORD=gaius psql -h localhost -p 5444 -U gaius -d zndx_gaius

Connection URL: postgres://gaius:gaius@localhost:5444/zndx_gaius?sslmode=disable

Important: The database name is zndx_gaius, not gaius.

Key Tables

Cards & Content

| Table | Purpose |
|---|---|
| cards | Card entities with metadata |
| card_enrichments | Enrichment data for cards |
| articles | Source articles |
| article_content | Article text content |

FMEA & Health

| Table | Purpose |
|---|---|
| fmea_catalog | Failure mode definitions (S/O/D scores) |
| fmea_occurrences | Failure occurrence history |
| fmea_outcomes | Remediation outcomes (for adaptive learning) |
| fmea_approvals | Pending Tier 2 approvals |
| healing_events | Self-healing audit trail |
| health_observer_state | Observer daemon state |

Agents & Evolution

| Table | Purpose |
|---|---|
| agent_versions | Agent prompt versions with lineage |
| agent_evaluations | Evaluation results for evolution |

Operations

| Table | Purpose |
|---|---|
| activity_log | System activity tracking |
| x_bookmarks | X bookmark sync data |
| x_bookmark_folders | X bookmark folder metadata |
| x_sync_runs | X sync run history |

Accessing from Code

Always use the config helper:

from gaius.core.config import get_database_url

url = get_database_url()

Never hardcode connection parameters.

Configuration

Gaius uses HOCON configuration files with environment variable overrides. The canonical source is config/base.conf.

Configuration Hierarchy

  1. config/base.conf — Default values
  2. ~/.config/gaius/acp.conf — ACP-specific overrides
  3. Environment variables — Highest priority
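
The precedence rule can be sketched as a lookup that prefers the environment (the helper and tables below are illustrative; the real loader is gaius.core.config):

```python
import os

# Illustrative subset of base.conf defaults and their env-var overrides
BASE_CONF = {"database.port": "5444", "database.host": "localhost"}
ENV_MAP = {"database.port": "PGPORT", "database.host": "PGHOST"}

def get_setting(key: str) -> str:
    """Environment variable wins; otherwise fall back to base.conf."""
    env_var = ENV_MAP.get(key)
    if env_var and env_var in os.environ:
        return os.environ[env_var]
    return BASE_CONF[key]
```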

Database

| Key | Default | Env Var | Description |
|---|---|---|---|
| database.host | localhost | PGHOST | PostgreSQL host |
| database.port | 5444 | PGPORT | PostgreSQL port |
| database.name | zndx_gaius | PGDATABASE | Database name |
| database.user | gaius | PGUSER | Database user |
| database.password | gaius | PGPASSWORD | Database password |

Important: Always use gaius.core.config.get_database_url() to get the connection URL. Never hardcode.

Engine

| Key | Default | Env Var | Description |
|---|---|---|---|
| engine.grpc.host | 0.0.0.0 | GAIUS_ENGINE_HOST | gRPC bind host |
| engine.grpc.port | 50051 | GAIUS_ENGINE_PORT | gRPC port |
| engine.grpc.max_workers | 10 | | Max gRPC worker threads |
| engine.orchestrator.preload_endpoints | ["reasoning"] | | Endpoints to load on startup |
| engine.orchestrator.startup_timeout | 600 | | Startup timeout (seconds) |
| engine.scheduler.default_timeout | 120 | GAIUS_ENGINE_TIMEOUT | Default inference timeout |
| engine.evolution.enabled | true | | Enable evolution daemon |
| engine.evolution.idle_threshold | 60 | | GPU idle seconds before evolution |

Health

| Key | Default | Description |
| --- | --- | --- |
| health.check_interval | 60 | Health Observer poll interval (seconds) |
| health.fmea.learning_rate | 0.2 | Adaptive S/O/D learning rate |
| health.self_healing.enabled | true | Enable automatic remediation |

Agents

| Key | Default | Description |
| --- | --- | --- |
| agents.swarm.parallel | true | Enable parallel swarm execution |
| agents.swarm.timeout | 60 | Swarm execution timeout (seconds) |
| agents.theta.confidence_threshold | 0.8 | Theta consolidation confidence threshold |

ACP Security

Configured in ~/.config/gaius/acp.conf:

acp {
  github {
    allowed_repos = ["zndx/gaius-acp"]
    require_private = true
    verify_on_each_operation = true
    cache_visibility_seconds = 300
  }
}
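A sketch of how these settings could be enforced at operation time. The policy fields mirror the HOCON block above, but `check_repo_access` and its signature are hypothetical, not the real ACP security API.

```python
from dataclasses import dataclass, field

@dataclass
class AcpGithubPolicy:
    """Mirrors the acp.github block: allowlist plus privacy requirement."""
    allowed_repos: list = field(default_factory=list)
    require_private: bool = True

def check_repo_access(policy: AcpGithubPolicy, repo: str, is_private: bool) -> bool:
    """Allow an operation only for allowlisted repos that meet the privacy rule."""
    if repo not in policy.allowed_repos:
        return False
    if policy.require_private and not is_private:
        return False
    return True

policy = AcpGithubPolicy(allowed_repos=["zndx/gaius-acp"], require_private=True)
print(check_repo_access(policy, "zndx/gaius-acp", is_private=True))   # True
print(check_repo_access(policy, "zndx/gaius-acp", is_private=False))  # False
```

With `verify_on_each_operation = true`, a check like this would run per operation, with `cache_visibility_seconds` bounding how long a visibility lookup may be reused.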

Glossary

ACP — Agent Client Protocol. Integration layer for Claude Code to perform autonomous health maintenance.

AgendaTracker — Tracks scheduled endpoint transitions for makespan operations, preventing false-positive health incidents.

APO — Automatic Prompt Optimization. Technique for evolving agent system prompts.

Bases — Feature store for entity-centric data with temporal queries.

CLT — Cross-Layer Transcoder. Extracts sparse interpretable features from model activations.

Death Loop — An H1 topological feature (persistent cycle) in embedding space. Indicates feedback loops, circular dependencies, or systemic risk.

devenv — Nix-based development environment providing reproducible builds.

DQL — Domain Query Language. Query syntax for the Bases feature store.

FMEA — Failure Mode and Effects Analysis. Quantitative risk assessment using RPN scoring.

Guru Meditation Code — Unique error identifier (e.g., #DS.00000001.SVCNOTINIT). Inspired by the Amiga.

H0/H1/H2 — Homology dimensions: H0 = connected components, H1 = loops, H2 = voids.

HOCON — Human-Optimized Config Object Notation. Configuration file format used by Gaius.

Just — Command runner replacing devenv-tasks. Reads recipes from justfile.

KServe OIP — Open Inference Protocol. Standard gRPC interface for ML inference.

LuxCore — Unbiased path tracing renderer used for card visualizations.

Makespan — Total time from start to finish of a multi-step workload (eviction, loading, inference, restoration).

MCP — Model Context Protocol. Programmatic interface exposing 163 tools to AI assistants.

optillm — Inference-time reasoning enhancement (CoT, BoN, MoA techniques).

PATHOCL — LuxCore rendering engine using OpenCL/CUDA for GPU acceleration.

RASE — Rapid Agentic Systems Engineering. Python-native MBSE metamodel for verifiable agent training.

RLVR — Reinforcement Learning with Verifiable Reward. Agent training methodology.

RPN — Risk Priority Number. FMEA score: Severity × Occurrence × Detection (range 1–1000).

Tenuki — Go term for playing away from the current area. In Gaius, jumps the cursor to a strategic point.

Theta Consolidation — Memory compression inspired by hippocampal theta rhythms. Links knowledge across temporal slices.

TUI — Terminal User Interface. Interactive Textual application launched with uv run gaius.

vLLM — High-throughput LLM inference engine. Managed by the Orchestrator across 6 GPUs.

ACP Incident Resolution: 2026-01-01

A milestone in autonomous self-healing: Claude Code resolves GPU allocation conflicts using Gaius MCP tools.

Overview

On January 1, 2026, the Gaius HealthObserver daemon detected GPU memory exhaustion and escalated to Claude Code via the Agent Client Protocol (ACP). This document captures the complete investigation and resolution session, demonstrating the first successful end-to-end ACP escalation workflow.

Key Achievements

  1. Autonomous Root Cause Analysis: Claude Code identified overlapping GPU allocations between multiple endpoints
  2. MCP Tool Integration: Used gpu_status, health_check, and orchestrator commands via gRPC
  3. Self-Healing Validation: Monitored recovery and verified resolution through the same tooling
  4. Multiple Incident Cascade: Handled a follow-on incident caused by the same GPU contention pattern

Incident 1: GPU_001:gpu_1_health

 ▐▛███▜▌   Claude Code v2.0.54
▝▜█████▛▘  Opus 4.5 · Claude Max
  ▘▘ ▝▝    ~/local/src/zndx/gaius

Health Incident Context

  • Fingerprint: GPU_001:gpu_1_health
  • Endpoint: gpu_1_health
  • Failure Mode: GPU_001
  • RPN Score: 125 (S:5 × O:5 × D:5)
  • Escalation Tier: 2
  • Attempts: 3

Initial State

| GPU | Memory Used | Memory % | Healthy |
| --- | --- | --- | --- |
| 0 | 1.29 GB | 5.4% | |
| 1 | 23.6 GB | 98.4% | |
| 2 | 22.89 GB | 95.4% | |
| 3 | 22.89 GB | 95.4% | |
| 4 | 22.89 GB | 95.4% | |
| 5 | 0.56 GB | 2.3% | |

Endpoints at Incident Time

| Endpoint | Model | GPUs | Status | Issue |
| --- | --- | --- | --- | --- |
| cap_reasoning | Qwen/QwQ-32B | [1,2,3,4] | healthy | Consuming GPU 1 |
| orchestrator | nvidia/Orchestrator-8B | [0,1] | starting | “Insufficient GPU memory” |

Root Cause Analysis

Claude Code identified the issue immediately:

Root Cause: GPU allocation conflict between two endpoints trying to use GPU 1:

  1. cap_reasoning (Qwen/QwQ-32B) - Uses GPUs [1, 2, 3, 4], status: healthy
  2. orchestrator (nvidia/Orchestrator-8B) - Uses GPUs [0, 1], status: starting with “Insufficient GPU memory”

The cap_reasoning endpoint loaded first and consumed GPU 1’s memory (~23.6 GB of 24 GB). When orchestrator tried to start on GPUs [0, 1], GPU 0 had memory available (1.29 GB used) but GPU 1 was already exhausted.
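The conflict described above is a pairwise intersection of GPU claims, which is cheap to detect. A sketch of the missing check (the function name and allocation format are illustrative, not the scheduler's actual API):

```python
from itertools import combinations

def find_gpu_conflicts(allocations: dict) -> list:
    """Return (endpoint_a, endpoint_b, shared_gpus) for every overlapping claim."""
    conflicts = []
    for (a, gpus_a), (b, gpus_b) in combinations(allocations.items(), 2):
        shared = set(gpus_a) & set(gpus_b)
        if shared:
            conflicts.append((a, b, sorted(shared)))
    return conflicts

# The two endpoints from the incident: both claim GPU 1.
allocations = {
    "cap_reasoning": [1, 2, 3, 4],
    "orchestrator": [0, 1],
}
print(find_gpu_conflicts(allocations))
# [('cap_reasoning', 'orchestrator', [1])]
```

Run before (or during) endpoint startup, a check like this would have flagged the contention on GPU 1 instead of letting orchestrator fail with an allocation error.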

Investigation Commands

# Check endpoint status
uv run gaius-cli --cmd "/gpu status" --format json

# Stop the stuck orchestrator
uv run gaius-cli --cmd "/gpu stop orchestrator" --format json

# Monitor recovery
uv run gaius-cli --cmd "/health gpu" --format json

Resolution

After stopping the orchestrator endpoint, the scheduler automatically rebalanced:

| GPU | Before | After |
| --- | --- | --- |
| 0 | 5.4% | 0.01% |
| 1 | 98.4% | 0.01% ✓ |
| 2 | 95.4% | 92.2% |
| 3 | 95.4% | 94.3% |
| 4 | 95.4% | 94.3% |
| 5 | 2.3% | 0.01% |

Final State:

  • orchestrator: HEALTHY (port 8094)
  • coding: HEALTHY (port 8093)
  • cap_reasoning: STOPPING
  • fast: STARTING (port 8095)

Incident 2: VLLM_001:coding

Immediately after the first incident was resolved, a second incident appeared, cascading from the same GPU contention.

Health Incident Context

  • Fingerprint: VLLM_001:coding
  • Endpoint: coding
  • Failure Mode: VLLM_001
  • RPN Score: 125 (S:5 × O:5 × D:5)
  • Escalation Tier: 2
  • Attempts: 3

Conflict Analysis

| Endpoint | GPUs | Status | Memory on GPU 1 |
| --- | --- | --- | --- |
| cap_reasoning | [1,2,3,4] | healthy | 22.89 GB (95.4%) |
| orchestrator | [0,1] | stopping | competing |
| coding | [1] | failed | can’t allocate |
| fast | [0] | healthy | - |

Resolution

The scheduler handled this automatically:

  1. Stopped cap_reasoning to free GPUs [1,2,3,4]
  2. Stopped orchestrator and coding
  3. Cleared all GPU memory (95%+ → 0%)
  4. Restarted endpoints with non-overlapping allocations

Final State:

| Endpoint | Status |
| --- | --- |
| orchestrator | HEALTHY ✓ |
| coding | HEALTHY ✓ |
| reasoning | STOPPING |
| fast | STARTING |

Observations

What Worked

  1. FMEA-Based Escalation: RPN scoring correctly identified severity (125 = S:5 × O:5 × D:5)
  2. MCP Tool Chain: All diagnostic commands worked through gRPC proxying
  3. Scheduler Self-Healing: Automatic GPU reallocation after conflicts cleared
  4. Cascading Incident Detection: Second incident properly tracked with separate fingerprint
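The RPN arithmetic that drove the escalation (125 = S:5 × O:5 × D:5) is a straight product of three 1–10 scores, consistent with the 1–1000 range in the glossary; this tiny `rpn` helper is illustrative, not the FMEA module's actual function.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number: Severity x Occurrence x Detection, each scored 1-10."""
    for score in (severity, occurrence, detection):
        if not 1 <= score <= 10:
            raise ValueError("FMEA scores must be in 1-10")
    return severity * occurrence * detection

print(rpn(5, 5, 5))  # 125 -- the score that triggered Tier 2 escalation
```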

Identified Gaps

  1. GPU Overlap Detection: Scheduler allowed conflicting GPU assignments (cap_reasoning and orchestrator both claimed GPU 1)
  2. Startup Ordering: No precedence constraints ensured larger models claim GPUs first
  3. Runtime Validation: GPU allocations only validated at scheduling time, not continuously

Order 3+ RCA Observations

These connect to CP-SAT constraints in makespan_scheduler.py:

| Constraint | Gap Identified |
| --- | --- |
| GPU_MUTUAL_EXCLUSION | Enforced at planning time, not at runtime |
| CONTIGUITY_REQUIREMENT | TP endpoints need contiguous GPU blocks |
| PRECEDENCE | Large models should claim GPUs before small ones |
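The PRECEDENCE gap can be illustrated with a greedy ordering: start endpoints with the largest GPU footprint first so small models never strand a GPU a large model needs. The real scheduler encodes this as a CP-SAT constraint in makespan_scheduler.py; this sketch (and `precedence_order`) only illustrates the ordering rule.

```python
def precedence_order(requests: dict) -> list:
    """Order endpoints by descending GPU count (ties broken alphabetically)."""
    return sorted(requests, key=lambda name: (-len(requests[name]), name))

# GPU requests from the incident: the 4-GPU model should claim first.
requests = {
    "orchestrator": [0, 1],
    "cap_reasoning": [1, 2, 3, 4],
    "coding": [1],
}
print(precedence_order(requests))
# ['cap_reasoning', 'orchestrator', 'coding']
```

Had cap_reasoning, orchestrator, and coding started in this order against non-overlapping allocations, the GPU 1 contention seen in both incidents could not have arisen.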

Significance

This incident represents a milestone in Gaius’s self-healing capabilities:

  1. First Successful ACP Escalation: HealthObserver → Claude Code → MCP tools → Resolution
  2. Closed-Loop Verification: Claude Code verified resolution using same tools that detected the issue
  3. RCA Framework Validation: Order 3+ observations identified scheduler constraint gaps
  4. Multi-Incident Handling: Cascading incidents tracked and resolved in sequence

The GPU allocation conflict exposed architectural issues that led to the RCA (Root Cause Analysis) framework development, enabling future incidents to be classified as OPERATIONAL (transient) or ARCHITECTURAL (needs code fix).


Captured from ACP session on 2026-01-01 04:11-04:45 UTC

Changelog

Notable changes and milestones in Gaius development.

2026-03 (Current)

  • Bases feature store with DQL query language
  • Card publishing gates on enrichment completeness
  • Content Freshness health check
  • KV coherence health check
  • Per-type watchdog timeouts for scheduled tasks

2026-02

  • LuxCore GPU rendering (PATHOCL engine with CUDA)
  • Just task runner (replacing devenv-tasks)
  • Process script architecture (no inline bash in devenv.nix)
  • FMEA health framework with adaptive learning
  • Article curation flow with Brave search
  • X Bookmarks sync with folder-first strategy
  • OpenCV/vLLM dependency conflict resolution

2026-01

  • ACP (Agent Client Protocol) for Claude Code integration
  • Health Observer daemon with ACP escalation
  • Guru Meditation Code system
  • Content sanitization for ACP security

2025-12

  • gRPC engine with 37 services
  • Orchestrator with makespan scheduling
  • Evolution daemon with APO optimization
  • Cognition service (autonomous thoughts)
  • Theta consolidation pipeline

2025-11

  • Initial TUI with 19x19 grid
  • Persistent homology visualization
  • Multi-agent swarm execution
  • mdbook documentation foundation
  • MiniGrid orthographic projections