OpenTelemetry

Gaius uses the OpenTelemetry SDK for distributed tracing and metric instrumentation. The engine centralizes all OTel export through EngineMetrics, ensuring a single source of truth for operational telemetry.

Instrumentation

The EngineMetrics singleton, initialized at engine startup, creates the OTel instruments and exposes a recording method for each kind of event:

from gaius.engine.metrics import EngineMetrics

metrics = EngineMetrics.get_instance()
metrics.record_inference(model="reasoning", latency_ms=150, tokens=500)
metrics.record_gpu_memory(gpu_id=0, used_mb=12000, total_mb=24000)
metrics.record_healing_attempt(endpoint="reasoning", tier=0, success=True)

Metric Categories

| Category | Instruments | Type |
|---|---|---|
| Inference | inference_count, inference_latency, inference_tokens | Counter, Histogram |
| GPU | gpu_memory_used, gpu_utilization, gpu_flops_utilization | Gauge (observable callbacks) |
| Endpoints | endpoint_healthy, endpoint_requests | Gauge, Counter |
| Healing | healing_attempts, healing_escalations, incidents_active | Counter, Gauge |
| Pipeline | pipeline_cards_published, pipeline_pending_cards | Counter, Gauge |
| Errors | error_total, exception_caught_total | Counter |

Metric Naming

Metrics follow a double-prefix convention due to OTel Collector namespace configuration:

gaius_gaius_<metric_name>_<unit>

The first gaius_ comes from the OTel Collector namespace config; the second from SDK metric naming (gaius. becomes gaius_ after export). PromQL queries in the OBSERVE_METRICS registry use this full prefix.
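
The naming rule above can be sketched as a small helper that derives the exported name from an SDK-side name. The function exported_name, the example metric gaius.inference_latency, and the milliseconds unit suffix are illustrative assumptions; the suffix a given metric actually receives depends on the exporter configuration.

```python
def exported_name(sdk_name: str, unit_suffix: str = "", namespace: str = "gaius") -> str:
    """Build the Prometheus-side name for an SDK metric (hypothetical helper).

    Dots in the SDK name become underscores on export, and the Collector
    namespace is prepended, which yields the double prefix.
    """
    flat = sdk_name.replace(".", "_")
    parts = [namespace, flat]
    if unit_suffix:
        parts.append(unit_suffix)
    return "_".join(parts)


name = exported_name("gaius.inference_latency", "milliseconds")
print(name)  # gaius_gaius_inference_latency_milliseconds

# A PromQL query must use the full double-prefixed name. Histogram series
# carry the _bucket suffix in Prometheus.
query = f"histogram_quantile(0.95, sum by (le) (rate({name}_bucket[5m])))"
print(query)
```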

Export Pipeline

EngineMetrics --> OTel SDK --> OTLP Exporter --> OTel Collector --> Prometheus

The OTel Collector runs as a sidecar. It receives OTLP from the engine and forwards metrics to Prometheus, either by remote write or by exposing a scrape endpoint. GPU metrics use observable callbacks, which the SDK invokes on each collection cycle.

Makespan Tracing

For long-running operations (evolution cycles, research flows), Gaius uses makespan tracing: a parent span covers the entire operation, with child spans for each phase. This enables latency attribution across multi-step workflows without excessive span cardinality.

Source

Engine metrics: src/gaius/engine/metrics.py. Observability sources: src/gaius/observability/sources/.