Phase 3: PARL Training

Train the SwarmOrchestrator using Parallel Agent Reinforcement Learning (PARL), following the K2.5 framework (arXiv:2602.02276). The primary model learns to route inputs to frozen specialists and fuse their recurrent states, and is optimized via token-level clipping RL.

Setup

Primary Model

A fresh Aegir model initialized from the Phase 1 checkpoint. All parameters are trainable. The primary model learns to:

  • Process the input table and produce annotations
  • Decide which specialists to activate via the SpecialistRouter
  • Integrate specialist states through RWKVStateFusion

Frozen Specialists

One or more Phase 1 checkpoints frozen with requires_grad_(False). Each specialist is wrapped in a FrozenSpecialist that:

  • Runs forward passes with torch.no_grad()
  • Extracts recurrent states from its RWKV layers
  • Optionally applies AlignmentProjection if its architecture differs from the primary

Initially, Phase 3 uses a single specialist (the best Phase 1 checkpoint). Additional specialists with different training data or hyperparameters are added incrementally.
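The wrapper described above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the backbone's return contract (logits plus a list of per-layer recurrent states), the `ToyBackbone` stand-in, and passing `AlignmentProjection` as an optional module are all assumptions for the sake of a runnable example.

```python
import torch
import torch.nn as nn

class FrozenSpecialist(nn.Module):
    """Wraps a frozen Phase 1 checkpoint (sketch)."""

    def __init__(self, backbone: nn.Module, projection: nn.Module = None):
        super().__init__()
        self.backbone = backbone.eval()
        self.backbone.requires_grad_(False)   # freeze every parameter
        self.projection = projection          # optional AlignmentProjection

    @torch.no_grad()                          # specialists never backprop
    def forward(self, tokens: torch.Tensor):
        # Assumed contract: backbone returns (logits, list of RWKV states)
        _, states = self.backbone(tokens)
        if self.projection is not None:
            # Project states into the primary model's dimensions
            states = [self.projection(s) for s in states]
        return states

# Stand-in backbone so the sketch is runnable end to end
class ToyBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(10, 4)
    def forward(self, tokens):
        h = self.emb(tokens)
        return h, [h.mean(dim=1)]  # (logits, one "recurrent state")

spec = FrozenSpecialist(ToyBackbone())
states = spec(torch.tensor([[1, 2, 3]]))
```

Because the forward pass runs under `torch.no_grad()`, the returned states carry no autograd history; only the primary's fusion of those states is trainable.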

Token-Level Clipping RL

K2.5 uses a variant of PPO where the clipping objective is applied at the token level rather than the trajectory level. For each token position t:

L_t = min(
    rho_t * A_t,
    clip(rho_t, 1-eps, 1+eps) * A_t
)

where:

  • rho_t = pi_new(a_t | s_t) / pi_old(a_t | s_t) is the per-token importance ratio
  • A_t is the advantage estimate at position t
  • eps is the clipping range (starts at 0.2, narrows to 0.1 over training)

Token-level clipping provides finer-grained credit assignment than trajectory-level clipping. For column annotation, this means the router receives distinct gradient signal for each column boundary token, each type-indicative value, and each structural separator.
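The per-token objective above translates directly into a batched loss. A minimal sketch, assuming log-probabilities and advantages are already shaped `(batch, seq)`; the function name and signature are illustrative, not the project's API:

```python
import torch

def token_level_ppo_loss(logp_new: torch.Tensor,
                         logp_old: torch.Tensor,
                         advantages: torch.Tensor,
                         eps: float = 0.2) -> torch.Tensor:
    """Token-level clipped surrogate, negated for gradient descent.

    logp_new / logp_old: log pi(a_t | s_t) under the new / old policy, (batch, seq)
    advantages: A_t per token position, (batch, seq)
    eps: clipping range (the doc anneals it from 0.2 down to 0.1)
    """
    rho = torch.exp(logp_new - logp_old)              # per-token importance ratio
    unclipped = rho * advantages
    clipped = torch.clamp(rho, 1 - eps, 1 + eps) * advantages
    # min() applied per token, then averaged; negate since optimizers minimize
    return -torch.min(unclipped, clipped).mean()
```

With rho_t = 1.5, A_t = 1, and eps = 0.2, the clipped branch wins and the loss is -1.2, so further increasing the ratio yields no extra reward, which is exactly the clipping behaviour described above.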

Routing as Action Space

The “action” at each routing decision point is the specialist activation vector:

a = sigmoid(W_router @ h)   # continuous in [0, 1]^num_specialists

The policy pi(a | s) is parameterized by the router weights. The RL objective encourages the router to activate specialists when they improve annotation quality and deactivate them when they don’t.
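The routing head is a single linear map followed by a sigmoid, as in the formula above. A sketch with illustrative dimensions; the real `SpecialistRouter` may carry extra structure:

```python
import torch
import torch.nn as nn

class SpecialistRouter(nn.Module):
    """Maps a hidden state h to continuous activation scores in [0, 1]."""

    def __init__(self, hidden_dim: int, num_specialists: int):
        super().__init__()
        self.w_router = nn.Linear(hidden_dim, num_specialists)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # a = sigmoid(W_router @ h), one score per specialist
        return torch.sigmoid(self.w_router(h))

router = SpecialistRouter(hidden_dim=8, num_specialists=3)
scores = router(torch.randn(2, 8))   # (batch, num_specialists) in [0, 1]
```

Because the scores are continuous rather than hard 0/1 choices, the fusion step can weight each specialist's state by its score, keeping the whole routing path differentiable.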

Critical-Steps Optimization

Rather than minimizing total FLOPs or wall-clock time, PARL optimizes the critical path – the longest sequential dependency chain in the computation.

critical_path = max(
    primary_forward_time,
    max(specialist_forward_times for activated specialists)
)

Specialist forward passes run in parallel (they are independent). The critical path is therefore the maximum of the primary's forward time and the slowest activated specialist's, not their sum. This means:

  • Activating an additional specialist is free in critical-path terms unless it becomes the new bottleneck
  • The optimizer penalizes only sequential dependencies (e.g., if the primary must wait for specialist state before proceeding)
  • This naturally encourages parallel specialist activation over sequential reasoning in the primary
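The max-not-sum property is easy to see numerically. A tiny helper, purely illustrative:

```python
def critical_path(primary_time: float,
                  specialist_times: list,
                  activated: list) -> float:
    """Critical path under fully parallel specialist execution:
    the max of the primary's forward time and the slowest
    *activated* specialist, never the sum."""
    active = [t for t, on in zip(specialist_times, activated) if on]
    return max([primary_time] + active)

# Primary takes 1.0; specialists take 0.5, 2.0, 0.8
times = [0.5, 2.0, 0.8]

# Activating the 0.5s and 0.8s specialists costs nothing extra:
a = critical_path(1.0, times, [True, False, True])   # 1.0

# Activating the 2.0s specialist makes it the new bottleneck:
b = critical_path(1.0, times, [True, True, True])    # 2.0
```

This is why the reward shaping only has to penalize sequential dependencies: adding a parallel specialist changes the critical path only when that specialist is slower than everything else already running.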

Training Loop

for each batch:
    1. Run primary model through first layer to get routing_hidden
    2. Compute specialist activation scores
    3. Run activated specialists (parallel, no_grad)
    4. Fuse specialist states into primary's recurrent state
    5. Complete primary forward pass
    6. Compute r_perf from annotation accuracy
    7. Compute r_parallel from activation statistics
    8. Compute r_finish from annotation completeness
    9. Combine rewards with annealed lambdas
   10. Compute token-level PPO loss and update primary + router + fusion
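Step 9 combines the three reward terms with annealed weights. The linear schedule and the start/end lambda values below are assumptions for illustration; the document specifies only that the lambdas are annealed:

```python
def combined_reward(r_perf: float, r_parallel: float, r_finish: float,
                    step: int, total_steps: int,
                    lam_parallel=(0.5, 0.1),
                    lam_finish=(0.5, 0.1)) -> float:
    """r = r_perf + lam_p(t) * r_parallel + lam_f(t) * r_finish,
    with each lambda linearly annealed from its start to end value.
    Schedule and (start, end) pairs are illustrative assumptions."""
    frac = min(step / total_steps, 1.0)
    lam_p = lam_parallel[0] + frac * (lam_parallel[1] - lam_parallel[0])
    lam_f = lam_finish[0] + frac * (lam_finish[1] - lam_finish[0])
    return r_perf + lam_p * r_parallel + lam_f * r_finish
```

Annealing the auxiliary weights toward small values keeps annotation accuracy (r_perf) dominant late in training, which is one way to get the smooth curves the success criteria ask for.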

Budget-Limited vs Standard Scaling

PARL training alternates between two modes:

Budget-limited phase: The router has a hard cap on the number of specialists it can activate per batch. This encourages selective, high-value routing decisions. The cap starts low (1 specialist) and gradually increases.

Standard scaling phase: No activation cap. The router is free to activate as many specialists as it wants, paying only the r_parallel penalty for inefficient routing. This phase tests whether the router has learned meaningful selectivity.

The alternation prevents the router from converging to a trivial “activate everything” strategy during standard scaling while still allowing it to learn from unrestricted experimentation.
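The budget-limited cap can be enforced by keeping only the top-k activation scores per example. A sketch of one possible gating; the actual mechanism may differ:

```python
import torch

def apply_activation_cap(scores: torch.Tensor, cap: int) -> torch.Tensor:
    """Budget-limited mode: keep the `cap` highest specialist scores
    per example and zero out the rest (illustrative gating)."""
    if cap >= scores.size(-1):
        return scores                      # standard scaling: no cap
    topk = torch.topk(scores, cap, dim=-1)
    mask = torch.zeros_like(scores)
    mask.scatter_(-1, topk.indices, 1.0)   # 1.0 at the top-cap positions
    return scores * mask

scores = torch.tensor([[0.9, 0.2, 0.5]])
capped = apply_activation_cap(scores, cap=1)   # only the 0.9 survives
```

Raising `cap` over training, then removing it entirely, reproduces the alternation described above: the hard budget forces selectivity early, and the uncapped phase checks that selectivity persists when only the r_parallel penalty remains.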

Success Criteria

Phase 3 is complete when:

  1. The primary model with specialist fusion exceeds the standalone Phase 1 baseline by a meaningful margin (target: +2-5 F1 points on SOTAB-CTA hard split)
  2. The router activates specialists selectively (not all-on or all-off) and the activation pattern varies with input content
  3. The lambda annealing schedule produces smooth training curves without reward collapse