Inference

The inference layer routes requests across multiple backends: vLLM for local GPU models, optillm for reasoning enhancement, and external APIs (xAI, Cerebras) for cloud-based inference.

Backend Router

The BackendRouter selects the appropriate backend based on capability requirements:

```python
class BackendRouter:
    async def route_inference(
        self,
        model: str,
        prompt: str,
        max_tokens: int,
        technique: str = "",  # optillm technique; empty for direct inference
    ) -> str: ...
```
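The selection logic can be illustrated with a minimal sketch. The model names, capability sets, and routing rules below are assumptions for illustration, not the actual implementation:

```python
# Hypothetical capability tables -- the real router's model inventory
# is not shown in this document.
LOCAL_MODELS = {"llama-3.1-70b", "qwen2.5-72b"}          # assumed vLLM-hosted IDs
EXTERNAL_MODELS = {"grok-2": "xai", "llama3.1-8b": "cerebras"}  # assumed cloud IDs

def select_backend(model: str, technique: str = "") -> str:
    """Pick a backend name: any reasoning technique routes via optillm,
    which in turn proxies to vLLM."""
    if technique:                       # reasoning enhancement requested
        return "optillm"
    if model in LOCAL_MODELS:
        return "vllm"
    if model in EXTERNAL_MODELS:
        return EXTERNAL_MODELS[model]   # cloud API (xAI or Cerebras)
    raise ValueError(f"no backend for model {model!r}")

print(select_backend("llama-3.1-70b"))         # vllm
print(select_backend("llama-3.1-70b", "moa"))  # optillm
print(select_backend("grok-2"))                # xai
```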

Backends

| Backend | Purpose | Hardware |
|---|---|---|
| vLLM | Local model inference | 6x NVIDIA GPUs |
| optillm | Reasoning enhancement (CoT, BoN, MoA) | Proxies to vLLM |
| xAI (Grok) | External API inference | Cloud |
| Cerebras | External API inference | Cloud |
| Nomic | Text embeddings | 1 GPU |

optillm Techniques

| Technique | Description |
|---|---|
| cot_reflection | Chain-of-thought with reflection |
| bon | Best-of-N sampling |
| moa | Mixture of Agents |
| rto | Round-trip optimization |
| z3 | Z3 solver integration |
| leap | Learn from examples |
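optillm commonly selects a technique by prefixing it to the model name (e.g. `moa-<model>`). Assuming that convention holds here, a small helper can validate the technique string before building the request:

```python
# Sketch assuming the "<technique>-<model>" naming convention; the valid
# set below mirrors the table above.
VALID_TECHNIQUES = {"cot_reflection", "bon", "moa", "rto", "z3", "leap"}

def optillm_model(technique: str, base_model: str) -> str:
    """Build an optillm model string that encodes the requested technique."""
    if technique not in VALID_TECHNIQUES:
        raise ValueError(f"unknown optillm technique {technique!r}")
    return f"{technique}-{base_model}"

print(optillm_model("moa", "qwen2.5-72b"))  # moa-qwen2.5-72b
```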

Request Flow

Client → gRPC → Scheduler → BackendRouter → Backend
                                           ↗ vLLM (local)
                                          ↗ optillm → vLLM
                                         ↗ xAI API (cloud)

All inference requests pass through the gRPC engine, which provides centralized authentication, audit logging, and resource management before a backend is selected.
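The centralized audit-logging step can be sketched as a wrapper around any backend call. This is a hedged illustration of the pattern, not the engine's actual code; the function and logger names are invented:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("inference.audit")

def with_audit(backend: str, model: str, fn, *args):
    """Run a backend call, recording outcome and latency for the audit log."""
    start = time.monotonic()
    try:
        result = fn(*args)
        log.info("backend=%s model=%s status=ok elapsed=%.3fs",
                 backend, model, time.monotonic() - start)
        return result
    except Exception:
        log.exception("backend=%s model=%s status=error", backend, model)
        raise

# Stand-in for a real backend call:
out = with_audit("vllm", "qwen2.5-72b", lambda p: p.upper(), "hello")
print(out)  # HELLO
```

Because every request funnels through one wrapper, adding per-request accounting or quota checks becomes a single-point change rather than a per-backend one.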
