Development Guide
Building and Running
Critical: Always Use --no-sync
uv run --no-sync python main.py
The --no-sync flag prevents uv from re-resolving and reinstalling dependencies before running. This is required because flash-attn, flash-linear-attention (fla), mamba-ssm, and causal-conv1d are patched CUDA extensions that were built manually with corrected CXX11 ABI flags. Running uv run without --no-sync will clobber these patched builds with incompatible PyPI wheels.
Smoke Tests
# Model instantiation and forward pass shapes
uv run --no-sync python main.py
# Training loop validation (tiny model, synthetic data)
uv run --no-sync python train.py --smoke-test --model-size tiny --epochs 3
Multi-GPU Training
# 6x RTX 4090 training
uv run --no-sync torchrun --nproc_per_node=6 train.py \
--model-size small \
--epochs 100 \
--batch-size 64 \
--lr 1e-4
Training uses DDP (DistributedDataParallel), AMP with bf16, a cosine LR schedule with linear warmup, and a load-balancing loss that regularizes dynamic chunking.
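For reference, the warmup-plus-cosine schedule can be expressed as a single LR multiplier. The sketch below is illustrative; the warmup_steps/total_steps names and values are assumptions, not the actual train.py arguments.

import math

def warmup_cosine(step, warmup_steps, total_steps, min_ratio=0.1):
    """LR multiplier: linear ramp from 0 to 1, then cosine decay to min_ratio."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_ratio + (1 - min_ratio) * 0.5 * (1 + math.cos(math.pi * progress))

# Illustrative wiring with PyTorch (values are placeholders):
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# scheduler = torch.optim.lr_scheduler.LambdaLR(
#     optimizer, lambda s: warmup_cosine(s, warmup_steps=1_000, total_steps=100_000))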
CUDA Extension Build Notes
The devenv/Nix environment provides GCC 15, which sets _GLIBCXX_USE_CXX11_ABI=1. However, PyTorch’s cu124 wheels are built with _GLIBCXX_USE_CXX11_ABI=0. This ABI mismatch causes segfaults when CUDA extensions link against the wrong ABI.
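To confirm which ABI the installed torch wheel reports before building anything against it:

uv run --no-sync python -c "import torch; print(torch.compiled_with_cxx11_abi())"

This prints False for _GLIBCXX_USE_CXX11_ABI=0 (the cu124 wheels) and True for =1.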
Patching Procedure
Both mamba-ssm and flash-attn have a CachedWheelsCommand in their setup.py that downloads prebuilt wheels from GitHub releases, bypassing local compilation. To force a local build with the correct ABI:
- Set environment variables to force a local build:

  export MAMBA_FORCE_BUILD=TRUE
  export FLASH_ATTENTION_FORCE_BUILD=TRUE

- Use `env -i` with the system GCC 11 to get the correct ABI:

  env -i PATH=/usr/bin:$PATH HOME=$HOME \
    pip install --no-build-isolation /tmp/mamba_src/mamba_ssm-2.3.1/

- Patch `setup.py` in each extension to add an explicit `_abi_flag` matching torch's ABI (see the sketch below).
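The shape of that `setup.py` patch is roughly as follows; this is a sketch of the idea, not the exact diff:

import torch

# The extension must be compiled with the same ABI the installed torch wheel reports.
_abi_flag = f"-D_GLIBCXX_USE_CXX11_ABI={int(torch._C._GLIBCXX_USE_CXX11_ABI)}"

# In each setup.py, this flag is then appended to both compiler flag lists of
# every CUDAExtension, e.g.:
#   extra_compile_args={"cxx": [..., _abi_flag], "nvcc": [..., _abi_flag]}
print(_abi_flag)  # e.g. -D_GLIBCXX_USE_CXX11_ABI=0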
Patched source trees are kept in /tmp/mamba_src/ and /tmp/flash_src/. See docs/notes/2026-03-28/010808_deps_smoke_train.md for the full step-by-step procedure.
Verifying the Build
After patching, verify that the extensions load correctly:
uv run --no-sync python -c "import mamba_ssm; print('mamba-ssm OK')"
uv run --no-sync python -c "import flash_attn; print('flash-attn OK')"
uv run --no-sync python -c "from fla.ops.rwkv7 import chunk_rwkv7; print('fla OK')"
Adding New Block Types
The architecture supports mixed block types (Mamba2, MHA, RWKV-7, RWKV-8 ROSA) within a single model. To add a new block type:
1. Implement the Mixer Class
Create a new module that implements three methods:
import torch.nn as nn

class MyNewMixer(nn.Module):
    def forward(self, hidden_states, inference_params=None, **kwargs):
        """Full-sequence forward pass. Input: (B, L, D). Output: (B, L, D)."""
        ...

    def step(self, hidden_states, inference_params):
        """Single-token autoregressive step. Input: (B, 1, D). Output: (B, 1, D)."""
        ...

    def allocate_inference_cache(self, batch_size, max_seqlen, dtype=None, **kwargs):
        """Allocate KV cache or recurrent state for inference."""
        ...
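For reference, a minimal toy mixer that satisfies this interface (a position-wise linear map with no sequence mixing, useful only for checking the plumbing; the class name is illustrative):

import torch.nn as nn

class PointwiseMixer(nn.Module):
    """Toy mixer: position-wise linear layer, no cross-token mixing."""

    def __init__(self, d_model, layer_idx=None, device=None, dtype=None):
        super().__init__()
        self.layer_idx = layer_idx
        self.proj = nn.Linear(d_model, d_model, device=device, dtype=dtype)

    def forward(self, hidden_states, inference_params=None, **kwargs):
        # (B, L, D) -> (B, L, D)
        return self.proj(hidden_states)

    def step(self, hidden_states, inference_params):
        # (B, 1, D) -> (B, 1, D); a stateless mixer has no cache to update.
        return self.proj(hidden_states)

    def allocate_inference_cache(self, batch_size, max_seqlen, dtype=None, **kwargs):
        # Nothing to cache for a stateless mixer.
        return None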
2. Register in create_block()
Add the new type to src/aegir/modules/block.py:
def create_block(arch, d_model, ...):
    if arch in ("x", "X"):  # new block type code
        from my_module import MyNewMixer
        mixer_cls = partial(MyNewMixer, **factory_kwargs, layer_idx=layer_idx)
        ...
Convention: lowercase letter = mixer only (no MLP), uppercase = mixer + SwiGLU MLP.
3. Add to Isotropic Forward Loop
In src/aegir/modules/isotropic.py, add the new block type to:
- The regex pattern that parses layout strings (example below):

  layout_parse = re.findall(r"([mMtTrRwWxX])(\d+)", arch_layout)

- The forward loop’s block-type dispatch:

  elif arch in ("x", "X"):
      layer_mixer_kwargs = {}  # or whatever kwargs your mixer needs
      if hidden_states.dim() == 2:
          hidden_states = hidden_states.unsqueeze(0)
          residual = None if residual is None else residual.unsqueeze(0)
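For example, the parsing step above turns a layout string into (block-code, repeat-count) pairs; the string here is hypothetical:

import re

arch_layout = "m4T2x1"  # hypothetical layout string using the block-type letters
layout_parse = re.findall(r"([mMtTrRwWxX])(\d+)", arch_layout)
print(layout_parse)  # [('m', '4'), ('T', '2'), ('x', '1')]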
4. Test
# Verify the new block type instantiates and runs
uv run --no-sync python main.py
Project Structure
aegir/
  main.py                -- Smoke tests
  train.py               -- Training script (DDP, AMP, cosine LR)
  src/aegir/
    models/
      config.py          -- AegirConfig, SSMConfig, AttnConfig, RWKVConfig
      aegir.py           -- Recursive hierarchical backbone
      heads.py           -- AegirForCausalLM, AegirForColumnAnnotation
    modules/
      block.py           -- Block factory (create_block)
      isotropic.py       -- Flat block stack with mixed types
      dc.py              -- Dynamic chunking (RoutingModule, ChunkLayer, DeChunkLayer)
      rwkv7_tmix.py      -- RWKV-7 full TimeMix (fla kernels)
      rwkv.py            -- RWKV-8 ROSA time mixing + relu^2 channel mixing
      rosa.py            -- ROSA suffix automaton (CPU-based)
      mlp.py             -- SwiGLU MLP
    swarm/
      state_fusion.py    -- RWKVStateFusion (3 modes)
      alignment.py       -- AlignmentProjection (cross-agent state mapping)
      specialist.py      -- FrozenSpecialist wrapper
      orchestrator.py    -- SwarmOrchestrator (K2.5 PARL)
    data/
      serialization.py   -- Table-to-byte-sequence serialization
      context_select.py  -- MMR context column selection
      table_dataset.py   -- PyTorch dataset for table benchmarks
    utils/
      train.py           -- Load balancing loss, F1 metrics, param grouping
  docs/                  -- mdbook documentation (this book)
  ref/                   -- Reference papers
Documentation
Build and serve the documentation locally:
mdbook build docs/
mdbook serve docs/ # serves at http://localhost:3000
The documentation uses mdbook with katex (math), mermaid (diagrams), and d2 (architecture diagrams) plugins, all provisioned by devenv.