Data Pipeline

The data pipeline connects external sources to the knowledge base, card collections, and search index through a sequence of ingestion, processing, and indexing stages.

End-to-End Flow

```
Web Sources (Brave, arXiv, RSS)
    |
    v
NiFi Ingestion ──> Raw Content (HX / Iceberg)
    |
    v
Metaflow Pipelines ──> Article Drafts, Card Creation
    |
    v
Qdrant Indexing ──> 768-dim Nomic Embeddings
    |
    v
PostgreSQL (zndx_gaius:5444) ──> Cards, Collections, Metadata
    |
    v
R2 Storage ──> Rendered Visualizations (viz.gaius.zndx.org)
```

Pipeline Stages

Ingestion. NiFi processors fetch content from external APIs, RSS feeds, and web search results (Brave). Raw content is stored in Apache Iceberg tables via the HX data lake before any processing occurs. This preserves the original source material and provides a replay capability.
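The store-raw-before-processing contract can be sketched as a small record builder. The field names here (`source_url`, `fetched_at`, `content_hash`, `body`) are illustrative assumptions, not the actual Iceberg table schema:

```python
import hashlib
from datetime import datetime, timezone

def raw_content_record(source_url: str, body: bytes) -> dict:
    """Build an immutable raw-content row for the landing table.

    Hashing the body gives the row a stable identity, so a later
    replay can detect whether the source changed since the fetch.
    NOTE: illustrative fields only, not the real Iceberg schema.
    """
    return {
        "source_url": source_url,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "content_hash": hashlib.sha256(body).hexdigest(),
        "body": body,
    }

row = raw_content_record("https://arxiv.org/abs/2403.00001", b"<html>...</html>")
```

Keeping the hash alongside the body is what makes replay cheap: reprocessing never has to re-fetch, and a changed source is detectable by comparing digests.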

Processing. Metaflow pipelines handle the compute-intensive work: PDF conversion via docling, topic extraction via BERTopic, relevance scoring via local LLMs, and article draft generation. See Metaflow Integration for details on the execution environment.
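The stage ordering can be sketched in plain Python with stubs standing in for the heavy components (docling, BERTopic, the local-LLM scorer). This is not the actual Metaflow flow, only the data dependencies between stages:

```python
def convert_pdf(path: str) -> str:
    """Stub for docling PDF-to-text conversion."""
    return f"extracted text of {path}"

def extract_topics(text: str) -> list[str]:
    """Stub for BERTopic topic extraction."""
    return ["reasoning", "llm-evaluation"]

def score_relevance(text: str, topics: list[str]) -> float:
    """Stub for the local-LLM relevance scorer."""
    return 0.87

def process(path: str) -> dict:
    # Each stage consumes the previous stage's output, which is why
    # the real pipeline runs them as ordered Metaflow steps.
    text = convert_pdf(path)
    topics = extract_topics(text)
    return {"text": text, "topics": topics, "score": score_relevance(text, topics)}

result = process("paper.pdf")
```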

Article Curation. The Article Curation flow orchestrates the full lifecycle from article selection through card creation and publication. Each run produces approximately 20 cards in under 2 minutes.

Indexing. Processed content is embedded using Nomic (768-dimensional vectors) and indexed in Qdrant for semantic search. The same embeddings drive the TUI’s 19x19 grid layout and the visualization pipeline.
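A card becomes a Qdrant point: an id, a 768-dimensional vector, and a payload. The sketch below uses a deterministic stand-in for the Nomic embedder (the real pipeline calls the model); the payload fields are illustrative:

```python
import hashlib
import uuid

DIM = 768  # Nomic embedding dimensionality

def embed(text: str) -> list[float]:
    """Stand-in for the Nomic embedder: a deterministic 768-dim
    vector derived from a hash. Replace with a real model call."""
    h = hashlib.sha256(text.encode()).digest()
    return [h[i % len(h)] / 255.0 for i in range(DIM)]

def to_point(card_id: str, text: str) -> dict:
    """Shape a card as a Qdrant point (id, vector, payload), the
    structure passed to an upsert against the collection."""
    return {
        "id": str(uuid.uuid5(uuid.NAMESPACE_URL, card_id)),
        "vector": embed(text),
        "payload": {"card_id": card_id, "text": text},
    }

point = to_point("card-0042", "Chain-of-thought prompting survey")
```

Deriving the point id from the card id (UUIDv5) makes re-indexing idempotent: upserting the same card twice overwrites one point instead of creating duplicates.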

Storage. Cards, collections, and metadata live in PostgreSQL (zndx_gaius on port 5444). Rendered card images are uploaded to Cloudflare R2 and served from viz.gaius.zndx.org. See Viz Storage for the object key convention.
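The public URL of a rendered card is just the R2 object key appended to the viz host. The key layout below is a hypothetical example for illustration; the authoritative convention is defined on the Viz Storage page:

```python
def viz_object_key(collection: str, card_id: str, fmt: str = "png") -> str:
    """Build an object key for a rendered card image.

    NOTE: hypothetical layout for illustration only; see the
    Viz Storage page for the real key convention.
    """
    return f"viz/{collection}/{card_id}.{fmt}"

key = viz_object_key("ai-reasoning-weekly", "card-0042")
url = f"https://viz.gaius.zndx.org/{key}"
```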

Lineage Tracking

Every pipeline stage emits OpenLineage events that are materialized into an Apache AGE graph. This provides full provenance from source URL to published card. See Lineage Tracking for Cypher query examples.
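An emitted event has the shape of an OpenLineage RunEvent: an event type, a run id, a job, and input/output datasets. The sketch below builds one as a plain dict; the namespace values and producer string are assumptions, not the pipeline's actual configuration:

```python
import uuid
from datetime import datetime, timezone

def lineage_event(job_name: str, inputs: list[str], outputs: list[str]) -> dict:
    """Minimal OpenLineage-style RunEvent as a plain dict. In the
    pipeline this would be sent to the lineage backend and later
    materialized into the AGE graph. Namespaces here are illustrative.
    """
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "gaius-pipeline",  # assumed producer identifier
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "gaius", "name": job_name},
        "inputs": [{"namespace": "web", "name": u} for u in inputs],
        "outputs": [{"namespace": "kb", "name": p} for p in outputs],
    }

event = lineage_event("article-curation",
                      ["https://arxiv.org/abs/2403.00001"],
                      ["current/articles/reasoning.md"])
```

Because every stage names its inputs and outputs, the materialized graph can walk output-to-input edges backwards from a published card all the way to the source URL.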

Knowledge Base

The Knowledge Base serves as both input and output of the pipeline. Articles begin as zettelkasten notes in build/dev/scratch/, and the curation flow produces structured content in build/dev/current/articles/.

Key Services

| Service | Role | Port |
|---|---|---|
| NiFi | Content ingestion | 8443 |
| Metaflow | Pipeline execution | 8180 |
| PostgreSQL | Metadata, cards, collections | 5444 |
| Qdrant | Vector search | 6333 |
| MinIO | Artifact storage (S3-compatible) | 9000 |
| Gaius Engine (gRPC) | Orchestration, scheduling | 50051 |

CLI Access

```bash
# List available flows
uv run gaius-cli --cmd "/flows list"

# Trigger article curation
uv run gaius-cli --cmd "/article curate ai-reasoning-weekly"

# Query lineage for a KB file
uv run gaius-cli --cmd "/lineage query scratch/2026-03-14/paper.md"
```