Phase 1: Supervised Bootstrapping
Train the base Aegir model on Column Type Annotation (CTA) and Column Property Annotation (CPA) benchmarks with byte-level input. This phase establishes baseline performance and validates the hierarchical architecture on real tabular data.
Objective
Produce a single Aegir checkpoint that achieves competitive F1 scores on standard CTA/CPA benchmarks, operating directly on raw byte sequences (no external tokenizer).
Target Datasets
| Dataset | Task | Tables | Columns | Label Classes |
|---|---|---|---|---|
| SOTAB-CTA | Column Type Annotation | ~50k | ~500k | 91 semantic types |
| GitTables | CTA (large-scale) | ~1.5M | ~15M | Schema.org types |
| WikiTables | CTA/CPA | ~1.7M | ~6M | DBpedia ontology |
Baseline F1 Targets
These targets are based on published results from SOTAB and Retrieve-and-Verify:
| Benchmark | Metric | Target F1 |
|---|---|---|
| SOTAB-CTA (easy) | Macro F1 | > 0.85 |
| SOTAB-CTA (hard) | Macro F1 | > 0.65 |
| SOTAB-CPA | Macro F1 | > 0.75 |
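Macro F1 averages per-class F1 with equal weight, so rare semantic types count as much as frequent ones. A minimal scoring sketch with scikit-learn, for illustration only; the benchmarks' official evaluation scripts may differ:

```python
from sklearn.metrics import f1_score

# Three columns, two semantic types. Macro F1 averages per-class F1 with
# equal weight, so the entirely missed "Country" class drags the score down.
y_true = ["Person", "Country", "Person"]
y_pred = ["Person", "Person", "Person"]
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # 0.4
```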
Byte-Level Input
Aegir operates on raw byte sequences (vocab_size=65536 to cover byte values plus special tokens). Tables are serialized into a linear byte stream with role markers distinguishing the target column from context columns.
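For illustration, a minimal sketch of mapping a serialized table string to byte-level input IDs. The ID layout assumed here (raw bytes at 0-255, special tokens just above) is a placeholder for this sketch, not the actual layout used by Aegir:

```python
# Illustrative byte-level encoding. Assumes raw bytes occupy IDs 0-255 and
# special/role-marker tokens sit above that range; Aegir's real ID layout
# and special-token set may differ.
SPECIALS = {"[CLS]": 256, "[SEP]": 257, "[PAD]": 258}

def encode(serialized: str) -> list[int]:
    ids = [SPECIALS["[CLS]"]]
    for segment in serialized.split("[SEP]"):
        ids.extend(segment.strip().encode("utf-8"))  # raw UTF-8 bytes, no tokenizer
        ids.append(SPECIALS["[SEP]"])
    return ids

print(encode("name: Alice | Bob [SEP] country: DE | FR"))
```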
Dynamic chunking learns tokenization from raw bytes. The RoutingModule in the hierarchical backbone predicts chunk boundaries based on cosine similarity between adjacent hidden states. This means the model discovers its own sub-word units during training, adapting segmentation to the statistics of tabular data rather than relying on a fixed tokenizer trained on natural language.
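A minimal sketch of cosine-similarity boundary scoring in the spirit of H-Net-style dynamic chunking; the projection names and the exact mapping from similarity to probability are assumptions, and the actual RoutingModule may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoundaryScorer(nn.Module):
    """Scores each position with a boundary probability derived from the
    cosine similarity between projections of adjacent hidden states:
    dissimilar neighbours -> likely start of a new chunk."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.k = nn.Linear(d_model, d_model, bias=False)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, d_model)
        sim = F.cosine_similarity(self.q(h[:, 1:]), self.k(h[:, :-1]), dim=-1)
        p = 0.5 * (1.0 - sim)                  # low similarity -> high boundary prob
        first = torch.ones_like(p[:, :1])      # position 0 always opens a chunk
        return torch.cat([first, p], dim=1)    # (batch, seq_len)

scorer = BoundaryScorer(d_model=128)
probs = scorer(torch.randn(2, 16, 128))
print(probs.shape)  # torch.Size([2, 16])
```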
Serialization Format
Tables are serialized using the format in src/aegir/data/serialization.py:
[CLS] col_name: val1 | val2 | val3 [SEP] ctx_col1: v1 | v2 [SEP] ctx_col2: ...
The target column comes first, followed by context columns selected via MMR (Maximal Marginal Relevance), which balances relevance to the target column against redundancy among already-selected columns while staying within the byte budget.
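A minimal sketch of MMR-based context-column selection under a byte budget; the column embeddings, the lambda trade-off, and the stopping rule are illustrative assumptions rather than the actual logic in src/aegir/data/serialization.py:

```python
import numpy as np

def mmr_select(target_vec, ctx_vecs, ctx_bytes, byte_budget, lam=0.7):
    """Greedy MMR: prefer columns relevant to the target column, penalise
    similarity to already-selected context, stop when the next pick would
    exceed the serialized byte budget."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    selected, used, remaining = [], 0, list(range(len(ctx_vecs)))
    while remaining:
        def score(i):
            relevance = cos(ctx_vecs[i], target_vec)
            redundancy = max((cos(ctx_vecs[i], ctx_vecs[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        if used + ctx_bytes[best] > byte_budget:
            break
        selected.append(best)
        used += ctx_bytes[best]
        remaining.remove(best)
    return selected  # indices of context columns, in selection order
```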
Training Configuration
Single-GPU (Development)
uv run --no-sync python train.py \
--model-size tiny \
--epochs 30 \
--batch-size 32 \
--lr 3e-4
Multi-GPU with DDP
uv run --no-sync torchrun --nproc_per_node=6 train.py \
--model-size small \
--epochs 100 \
--batch-size 64 \
--lr 1e-4
Training uses:
- DDP (DistributedDataParallel) across GPUs
- AMP (Automatic Mixed Precision) with bf16
- Cosine LR schedule with linear warmup (see the sketch after this list)
- Load balancing loss adapted from H-Net to regularize dynamic chunking
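The warmup-plus-cosine schedule can be sketched as follows; the warmup length, total steps, and decay-to-zero shape are illustrative, and the training script's scheduler may differ:

```python
import math
import torch

def cosine_with_warmup(optimizer, warmup_steps: int, total_steps: int):
    """Linear warmup from 0 to the base LR, then cosine decay towards 0."""
    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Usage: step the scheduler once per optimizer step.
model = torch.nn.Linear(8, 8)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
sched = cosine_with_warmup(opt, warmup_steps=1_000, total_steps=100_000)
```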
Model Sizes
| Size | d_model (per stage) | Layers | Parameters | Use Case |
|---|---|---|---|---|
| tiny | [128, 192, 192] | ~10 | ~2M | Smoke tests, CI |
| small | [256, 384, 384] | ~20 | ~15M | Development, ablations |
| base | [512, 768, 768] | ~40 | ~120M | Benchmark evaluation |
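A hypothetical sketch of how the per-stage widths from the table might be expressed as a config; the names (AegirConfig, MODEL_SIZES) are illustrative and not taken from the codebase:

```python
from dataclasses import dataclass, field

@dataclass
class AegirConfig:
    """Illustrative hyperparameters; d_models lists one width per stage of
    the hierarchical backbone, matching the table above."""
    d_models: list[int] = field(default_factory=lambda: [256, 384, 384])
    vocab_size: int = 65536

MODEL_SIZES = {
    "tiny":  AegirConfig(d_models=[128, 192, 192]),
    "small": AegirConfig(d_models=[256, 384, 384]),
    "base":  AegirConfig(d_models=[512, 768, 768]),
}
```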
Success Criteria
Phase 1 is complete when:
- The base model meets or exceeds F1 targets on SOTAB-CTA/CPA
- Dynamic chunking converges to stable boundary predictions, with no degenerate all-boundary or no-boundary patterns (see the monitoring sketch below)
- The trained checkpoint can be frozen and used as a specialist in Phase 3
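One way to make the chunking criterion checkable is to log the boundary rate during training; a minimal sketch, with thresholds that are illustrative assumptions rather than values from the training code:

```python
import torch

def boundary_rate(probs: torch.Tensor, threshold: float = 0.5) -> float:
    """Fraction of positions predicted as chunk boundaries."""
    return (probs > threshold).float().mean().item()

# Degenerate chunking shows up as a rate near 1.0 (every byte is a boundary,
# no compression) or near 0.0 (one chunk per sequence). The 2%-60% band is an
# arbitrary illustrative range, not a value from the Aegir training code.
def is_degenerate(rate: float, lo: float = 0.02, hi: float = 0.60) -> bool:
    return not (lo < rate < hi)
```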