Data Pipeline

The data pipeline connects external sources to the knowledge base, card collections, and search index through a sequence of ingestion, processing, and indexing stages.

End-to-End Flow

```
Web Sources (Brave, arXiv, RSS)
    |
    v
NiFi Ingestion ──> Raw Content (HX / Iceberg)
    |
    v
Metaflow Pipelines ──> Article Drafts, Card Creation
    |
    v
Qdrant Indexing ──> 768-dim Nomic Embeddings
    |
    v
PostgreSQL (zndx_gaius:5444) ──> Cards, Collections, Metadata
    |
    v
R2 Storage ──> Rendered Visualizations (viz.gaius.zndx.org)
```

Pipeline Stages

Ingestion. NiFi processors fetch content from external APIs, RSS feeds, and web search results (Brave). Raw content is stored in Apache Iceberg tables via the HX data lake before any processing occurs. This preserves the original source material and provides a replay capability.
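The store-raw-before-processing contract can be sketched as a small record builder. The field names here (`source_url`, `fetched_at`, `content_hash`, `body`) are illustrative assumptions, not the actual Iceberg table schema:

```python
import hashlib
from datetime import datetime, timezone

def raw_content_record(source_url: str, body: bytes) -> dict:
    """Build an immutable raw-content row for the landing table.

    Hashing the body gives the row a stable identity, so a later
    replay can detect whether the source changed since the fetch.
    NOTE: illustrative fields only, not the real Iceberg schema.
    """
    return {
        "source_url": source_url,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "content_hash": hashlib.sha256(body).hexdigest(),
        "body": body,
    }

row = raw_content_record("https://arxiv.org/abs/2403.00001", b"<html>...</html>")
```

Keeping the hash alongside the body is what makes replay cheap: reprocessing never has to re-fetch, and a changed source is detectable by comparing digests.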

Processing. Metaflow pipelines handle the compute-intensive work: PDF conversion via docling, topic extraction via BERTopic, relevance scoring via local LLMs, and article draft generation. See Metaflow Integration for details on the execution environment.
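The stage ordering can be sketched in plain Python with stubs standing in for the heavy components (docling, BERTopic, the local-LLM scorer). This is not the actual Metaflow flow, only the data dependencies between stages:

```python
def convert_pdf(path: str) -> str:
    """Stub for docling PDF-to-text conversion."""
    return f"extracted text of {path}"

def extract_topics(text: str) -> list[str]:
    """Stub for BERTopic topic extraction."""
    return ["reasoning", "llm-evaluation"]

def score_relevance(text: str, topics: list[str]) -> float:
    """Stub for the local-LLM relevance scorer."""
    return 0.87

def process(path: str) -> dict:
    # Each stage consumes the previous stage's output, which is why
    # the real pipeline runs them as ordered Metaflow steps.
    text = convert_pdf(path)
    topics = extract_topics(text)
    return {"text": text, "topics": topics, "score": score_relevance(text, topics)}

result = process("paper.pdf")
```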

Article Curation. The Article Curation flow orchestrates the full lifecycle from article selection through card creation and publication. Each run produces approximately 20 cards in under 2 minutes.

Indexing. Processed content is embedded using Nomic (768-dimensional vectors) and indexed in Qdrant for semantic search. The same embeddings drive the TUI’s 19x19 grid layout and the visualization pipeline.
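A card becomes a Qdrant point: an id, a 768-dimensional vector, and a payload. The sketch below uses a deterministic stand-in for the Nomic embedder (the real pipeline calls the model); the payload fields are illustrative:

```python
import hashlib
import uuid

DIM = 768  # Nomic embedding dimensionality

def embed(text: str) -> list[float]:
    """Stand-in for the Nomic embedder: a deterministic 768-dim
    vector derived from a hash. Replace with a real model call."""
    h = hashlib.sha256(text.encode()).digest()
    return [h[i % len(h)] / 255.0 for i in range(DIM)]

def to_point(card_id: str, text: str) -> dict:
    """Shape a card as a Qdrant point (id, vector, payload), the
    structure passed to an upsert against the collection."""
    return {
        "id": str(uuid.uuid5(uuid.NAMESPACE_URL, card_id)),
        "vector": embed(text),
        "payload": {"card_id": card_id, "text": text},
    }

point = to_point("card-0042", "Chain-of-thought prompting survey")
```

Deriving the point id from the card id (UUIDv5) makes re-indexing idempotent: upserting the same card twice overwrites one point instead of creating duplicates.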

Storage. Cards, collections, and metadata live in PostgreSQL (zndx_gaius on port 5444). Rendered card images are uploaded to Cloudflare R2 and served from viz.gaius.zndx.org. See Viz Storage for the object key convention.
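The public URL of a rendered card is just the R2 object key appended to the viz host. The key layout below is a hypothetical example for illustration; the authoritative convention is defined on the Viz Storage page:

```python
def viz_object_key(collection: str, card_id: str, fmt: str = "png") -> str:
    """Build an object key for a rendered card image.

    NOTE: hypothetical layout for illustration only; see the
    Viz Storage page for the real key convention.
    """
    return f"viz/{collection}/{card_id}.{fmt}"

key = viz_object_key("ai-reasoning-weekly", "card-0042")
url = f"https://viz.gaius.zndx.org/{key}"
```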

Lineage Tracking

Every pipeline stage emits OpenLineage events that are materialized into an Apache AGE graph. This provides full provenance from source URL to published card. See Lineage Tracking for Cypher query examples.
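An emitted event has the shape of an OpenLineage RunEvent: an event type, a run id, a job, and input/output datasets. The sketch below builds one as a plain dict; the namespace values and producer string are assumptions, not the pipeline's actual configuration:

```python
import uuid
from datetime import datetime, timezone

def lineage_event(job_name: str, inputs: list[str], outputs: list[str]) -> dict:
    """Minimal OpenLineage-style RunEvent as a plain dict. In the
    pipeline this would be sent to the lineage backend and later
    materialized into the AGE graph. Namespaces here are illustrative.
    """
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "gaius-pipeline",  # assumed producer identifier
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "gaius", "name": job_name},
        "inputs": [{"namespace": "web", "name": u} for u in inputs],
        "outputs": [{"namespace": "kb", "name": p} for p in outputs],
    }

event = lineage_event("article-curation",
                      ["https://arxiv.org/abs/2403.00001"],
                      ["current/articles/reasoning.md"])
```

Because every stage names its inputs and outputs, the materialized graph can walk output-to-input edges backwards from a published card all the way to the source URL.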

Knowledge Base

The Knowledge Base serves as both input and output of the pipeline. Articles begin as zettelkasten notes in build/dev/scratch/, and the curation flow produces structured content in build/dev/current/articles/.

Key Services

| Service | Role | Port |
|---|---|---|
| NiFi | Content ingestion | 8443 |
| Metaflow | Pipeline execution | 8180 |
| PostgreSQL | Metadata, cards, collections | 5444 |
| Qdrant | Vector search | 6333 |
| MinIO | Artifact storage (S3-compatible) | 9000 |
| Gaius Engine (gRPC) | Orchestration, scheduling | 50051 |

CLI Access

```bash
# List available flows
uv run gaius-cli --cmd "/flows list"

# Trigger article curation
uv run gaius-cli --cmd "/article curate ai-reasoning-weekly"

# Query lineage for a KB file
uv run gaius-cli --cmd "/lineage query scratch/2026-03-14/paper.md"
```