Metaflow Integration
Gaius uses Metaflow for production data pipelines that run on Kubernetes. Flows handle article curation, content evaluation, rendering, and document processing.
Infrastructure
The Metaflow service is deployed via Tilt in infra/tilt/ and runs on the local RKE2 Kubernetes cluster. Access requires a port-forward:
kubectl port-forward svc/metaflow-service 8180:8080
The environment variable METAFLOW_SERVICE_URL=http://localhost:8180 must be set before a flow is run. devenv.nix sets it automatically for interactive shells, and process scripts set it explicitly.
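A flow will fail early if the port-forward is not running, so a quick pre-flight check can save a confusing traceback. The sketch below is illustrative only: `service_reachable` is a hypothetical helper, not part of Gaius or Metaflow, and it only tests that a TCP connection to the URL's host and port succeeds.

```python
import os
import socket
from urllib.parse import urlparse


def service_reachable(url: str, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(url)
    host = parsed.hostname or "localhost"
    # Fall back to the scheme's default port when the URL omits one.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    url = os.environ.get("METAFLOW_SERVICE_URL", "")
    if not url:
        print("METAFLOW_SERVICE_URL is not set")
    elif service_reachable(url):
        print(f"Metaflow service is reachable at {url}")
    else:
        print(f"Cannot reach {url}; is the port-forward running?")
```

A successful connection shows only that something is listening on the forwarded port; it does not confirm that the listener is the Metaflow service.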
GaiusFlow Base Class
All Gaius flows inherit from GaiusFlow, which provides OpenLineage integration and KB path helpers:
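from gaius.flows import GaiusFlow
from metaflow import step


class MyFlow(GaiusFlow):
    @step
    def start(self):
        # Announce the run and its inputs to OpenLineage
        self.emit_lineage_start("my_flow", inputs=[...])
        self.next(self.process)

    @step
    def process(self):
        # Flow-specific work goes here
        self.next(self.end)

    @step
    def end(self):
        # Report the outputs once the run has finished
        self.emit_lineage_complete(outputs=[...])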
KB path helpers generate paths following the zettelkasten convention:
# scratch/{date}/{HHMMSS}_{title}.md
path = self.zettelkasten_path("My Analysis")
# current/archive/{quarter}/attachments/{filename}
path = self.archive_path("paper.pdf")
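To make the two path conventions concrete, here is a minimal standalone sketch of how such helpers might build their paths. It is not the GaiusFlow implementation. The ISO date directory, the lowercase underscore-joined title, and the `2024Q1`-style quarter label are all assumptions, as are the explicit `kb_root` and `now` parameters, which are there only to keep the functions easy to test.

```python
from datetime import datetime
from pathlib import Path


def zettelkasten_path(kb_root: Path, title: str, now: datetime) -> Path:
    """Build scratch/{date}/{HHMMSS}_{title}.md under the KB root."""
    # Assumption: ISO date directory and a lowercase, underscore-joined title.
    slug = "_".join(title.lower().split())
    return kb_root / "scratch" / now.strftime("%Y-%m-%d") / f"{now:%H%M%S}_{slug}.md"


def archive_path(kb_root: Path, filename: str, now: datetime) -> Path:
    """Build current/archive/{quarter}/attachments/{filename} under the KB root."""
    # Assumption: quarters are labelled like "2024Q1".
    quarter = f"{now.year}Q{(now.month - 1) // 3 + 1}"
    return kb_root / "current" / "archive" / quarter / "attachments" / filename


now = datetime(2024, 2, 3, 14, 5, 9)
print(zettelkasten_path(Path("kb"), "My Analysis", now).as_posix())
# kb/scratch/2024-02-03/140509_my_analysis.md
print(archive_path(Path("kb"), "paper.pdf", now).as_posix())
# kb/current/archive/2024Q1/attachments/paper.pdf
```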
Flow Registry
Flows are registered for CLI discovery using the @register_flow decorator:
from gaius.flows import GaiusFlow, register_flow


@register_flow("article-curation")
class ArticleCurationFlow(GaiusFlow):
    ...
Registered flows can be listed and invoked from the CLI or MCP tools.
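A decorator-based registry of this kind is commonly a dictionary keyed by the CLI name. The sketch below shows one possible shape under that assumption; `_FLOW_REGISTRY`, `list_flows`, and `get_flow` are hypothetical names, the `register_flow` shown here is a stand-in rather than the one in `gaius.flows`, and the flow class is a plain placeholder so that the example runs without Gaius installed.

```python
from typing import Callable, Dict, List, Type, TypeVar

T = TypeVar("T")

# Hypothetical module-level registry mapping CLI names to flow classes.
_FLOW_REGISTRY: Dict[str, type] = {}


def register_flow(name: str) -> Callable[[Type[T]], Type[T]]:
    """Class decorator that records a flow class under a CLI-facing name."""
    def decorator(cls: Type[T]) -> Type[T]:
        if name in _FLOW_REGISTRY:
            raise ValueError(f"flow name already registered: {name}")
        _FLOW_REGISTRY[name] = cls
        return cls  # the class itself is returned unchanged
    return decorator


def list_flows() -> List[str]:
    """Return registered flow names in sorted order."""
    return sorted(_FLOW_REGISTRY)


def get_flow(name: str) -> type:
    """Look up a flow class by its registered name (KeyError if unknown)."""
    return _FLOW_REGISTRY[name]


@register_flow("article-curation")
class ArticleCurationFlow:  # stand-in; the real class inherits from GaiusFlow
    pass


print(list_flows())
```

Rejecting duplicate names at registration time is a design choice in this sketch: it surfaces a naming collision at import rather than letting one flow silently shadow another.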
Available Flows
| Flow | Purpose | Typical Duration |
|---|---|---|
| ArticleCurationFlow | End-to-end article research and card publication | ~2 min |
| ArxivDoclingFlow | Fetch and convert arXiv papers to markdown | ~30s |
| ClouderaDocsFlow | Sync Cloudera documentation archives | varies |
See Article Curation for the full 11-step pipeline.
Configuration
Key environment variables:
| Variable | Purpose |
|---|---|
| METAFLOW_SERVICE_URL | Metaflow service endpoint (http://localhost:8180) |
| METAFLOW_DATASTORE_SYSROOT_S3 | MinIO path for flow artifacts |
| METAFLOW_DEFAULT_METADATA | Metadata backend (postgresql) |
| GAIUS_KB_ROOT | Knowledge base root directory |
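Because a missing variable tends to show up as an obscure failure deep inside a run, it can help to check all four up front. The following is a small sketch of such a check; `REQUIRED_VARS` and `missing_vars` are hypothetical names, and treating every variable in the table as required is an assumption.

```python
import os
from typing import List, Mapping

# Variable names taken from the table above.
REQUIRED_VARS = (
    "METAFLOW_SERVICE_URL",
    "METAFLOW_DATASTORE_SYSROOT_S3",
    "METAFLOW_DEFAULT_METADATA",
    "GAIUS_KB_ROOT",
)


def missing_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Missing configuration: " + ", ".join(missing))
    else:
        print("All required variables are set")
```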
Running Flows
# Via Metaflow CLI
python -m metaflow.cli run ArticleCurationFlow --article ai-reasoning-weekly
# Via Gaius CLI
uv run gaius-cli --cmd "/article curate ai-reasoning-weekly"
# Via MCP tool
uv run gaius-cli --cmd "/fetch_paper 2312.12345"
K8s Prerequisites
- kubectl and k9s are Nix-managed via devenv.nix (not the system RKE2 binary)
- KUBECONFIG must be set to ~/.config/kube/rke2.yaml (never use fallback syntax)
- K8s pods need pg_hba.conf entries for the 10.42.0.0/16 and 10.43.0.0/16 subnets
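When a pod cannot connect to PostgreSQL, a first step is to confirm that its source address falls inside one of the two subnets listed above, which appear to correspond to the usual RKE2 defaults for pod and service CIDRs. The sketch below uses the standard `ipaddress` module; `covered_by_pg_hba` is a hypothetical helper, and it checks subnet membership only, not the actual contents of pg_hba.conf.

```python
import ipaddress

# Subnets named in the prerequisites above.
ALLOWED_SUBNETS = [
    ipaddress.ip_network("10.42.0.0/16"),
    ipaddress.ip_network("10.43.0.0/16"),
]


def covered_by_pg_hba(client_ip: str) -> bool:
    """Return True if the address lies in one of the allowed subnets."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ALLOWED_SUBNETS)


print(covered_by_pg_hba("10.42.3.17"))    # True
print(covered_by_pg_hba("192.168.1.10"))  # False
```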