
K2.5 PARL Orchestrator

The SwarmOrchestrator coordinates a trainable primary Aegir model with multiple frozen specialist agents, following the Parallel Agent Reinforcement Learning (PARL) pattern from Kimi K2.5 (arXiv:2602.02276). Implementation is in src/aegir/swarm/orchestrator.py.

Architecture

                    +-------------------+
                    | SwarmOrchestrator |
                    +-------------------+
                            |
             +--------------+--------------+
             |              |              |
     SpecialistRouter   Primary      FrozenSpecialists
     (sigmoid gates)    (trainable)   (frozen params)
             |              |              |
             |              |    +---------+---------+
             |              |    |         |         |
             |              |  Spec_0   Spec_1   Spec_N
             |              |    |         |         |
             +--> activation -->  state fusion  <----+
                  mask           (RWKVStateFusion)

The primary model is the only component whose parameters are updated during PARL training. Specialists are frozen checkpoints that contribute their recurrent states when activated by the router.

SpecialistRouter

The router decides which specialists to activate for a given input. It maps the primary agent’s hidden representation to per-specialist activation scores:

scores = sigmoid(W_router @ hidden_states)   # (B, num_specialists)
activation_mask = scores > threshold          # default threshold = 0.5

Sigmoid gating (rather than softmax) allows zero, one, or multiple specialists to be activated simultaneously. This is critical for the column annotation task where a table may require expertise from several domain specialists, or none at all.
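The gating step above can be sketched in a few lines. The function name `route`, the weight shape, and the NumPy framing are illustrative, not the actual orchestrator.py code:

```python
import numpy as np

def route(hidden_states, w_router, threshold=0.5):
    """Sigmoid-gated routing: each specialist gets an independent
    score, so zero, one, or many specialists can fire per example
    (unlike softmax, which forces specialists to compete)."""
    logits = hidden_states @ w_router.T              # (B, num_specialists)
    scores = 1.0 / (1.0 + np.exp(-logits))           # elementwise sigmoid
    activation_mask = scores > threshold             # boolean (B, num_specialists)
    return scores, activation_mask
```

Because each gate is independent, an all-zeros mask (no specialists) is a valid outcome, which matters for tables that need no domain expertise.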

PARL Reward Structure

The combined reward follows K2.5’s formulation:

r = lambda_1 * r_parallel + lambda_2 * r_finish + r_perf

Reward Components

r_perf – Performance reward. F1 score on the annotation task (CTA or CPA). This is the primary signal driving annotation quality.

r_parallel – Parallelism and load balancing reward. Encourages efficient specialist utilization: activate specialists when they help, avoid activating them when they don’t. Adapted from H-Net’s lb_loss which penalizes unbalanced routing across experts.

r_finish – Completion quality reward. All columns in a table must be annotated, and the router must not degenerate into always-on or always-off patterns. Penalizes incomplete annotations and trivial routing strategies.
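Combining the three components per the formula above is straightforward; the function name is illustrative and the default weights are the early-phase values from the orchestrator constructor:

```python
def parl_reward(r_perf, r_parallel, r_finish,
                lambda_parallel=0.3, lambda_finish=0.1):
    """Combined PARL reward:
    r = lambda_1 * r_parallel + lambda_2 * r_finish + r_perf.
    Defaults match the early-phase constructor values."""
    return lambda_parallel * r_parallel + lambda_finish * r_finish + r_perf
```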

Lambda Annealing Schedule

Following K2.5, the lambda weights anneal over training:

Phase   lambda_1 (parallel)   lambda_2 (finish)   Rationale
Early   0.3                   0.1                 Encourage exploration of specialist activation
Mid     0.1                   0.3                 Shift focus to completion quality
Late    0.05                  0.05                Let r_perf dominate for final accuracy

The initial values (lambda_parallel=0.3, lambda_finish=0.1) are set in the orchestrator constructor. Annealing is managed by the training loop.
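A piecewise schedule matching the table can be sketched as below. The phase boundaries at one third and two thirds of training are an assumption; the source only names the Early/Mid/Late phases:

```python
def anneal_lambdas(progress):
    """Return (lambda_parallel, lambda_finish) for a training
    progress fraction in [0, 1]. Weights follow the phase table;
    the 1/3 and 2/3 boundaries are assumed, not from the source."""
    if progress < 1 / 3:        # Early: encourage routing exploration
        return 0.3, 0.1
    if progress < 2 / 3:        # Mid: shift focus to completion quality
        return 0.1, 0.3
    return 0.05, 0.05           # Late: let r_perf dominate
```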

Token-Level Clipping RL

K2.5 uses a variant of PPO with token-level clipping rather than trajectory-level. This provides finer-grained credit assignment:

  • Each token’s routing decision gets its own clipped surrogate objective
  • Critical tokens (column boundaries, type-indicative values) receive higher weight
  • The clipping range narrows over training to stabilize converged policies
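A sketch of the per-token clipped surrogate described above. The epsilon value, the weighting scheme, and the NumPy framing are illustrative; the source does not give the exact objective code:

```python
import numpy as np

def token_clipped_objective(logp_new, logp_old, advantages,
                            token_weights, eps=0.2):
    """PPO-style surrogate clipped per token rather than per
    trajectory. token_weights up-weight critical tokens such as
    column boundaries; eps=0.2 is a conventional starting value
    that would narrow over training."""
    ratios = np.exp(logp_new - logp_old)                    # (T,) per-token ratios
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps)
    per_token = np.minimum(ratios * advantages, clipped * advantages)
    return np.sum(token_weights * per_token) / np.sum(token_weights)
```

Clipping each token's ratio independently means one off-policy token cannot distort the update for an entire trajectory.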

Critical-Steps Optimization

Rather than minimizing total computation, the orchestrator minimizes the critical path – the longest chain of sequential dependencies. Specialist activations that can run in parallel do not increase the critical path even if they increase total FLOPs. This encourages the router to prefer parallel specialist activation over sequential reasoning in the primary model when both achieve similar accuracy.
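The distinction between critical path and total compute can be illustrated with per-agent cost estimates; the function and cost units are hypothetical:

```python
def critical_path_cost(primary_cost, specialist_costs, activation_mask):
    """Specialists activated in parallel contribute max(), not sum():
    the critical path is the primary cost plus the slowest activated
    specialist, while total compute adds every activated specialist."""
    active = [cost for cost, on in zip(specialist_costs, activation_mask) if on]
    critical = primary_cost + (max(active) if active else 0.0)
    total = primary_cost + sum(active)
    return critical, total
```

Activating two specialists in parallel raises total FLOPs but leaves the critical path at the slower of the two, which is why the router is rewarded for parallel activation over sequential reasoning in the primary model.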

Forward Pass

orchestrator = SwarmOrchestrator(
    primary_model=primary,
    specialists=[spec_cta, spec_cpa, spec_geo],
    fusion=RWKVStateFusion(num_heads=8, head_size=64, num_agents=3),
    d_model=512,
    activation_threshold=0.5,
)

result = orchestrator(
    input_ids=tokens,
    mask=mask,
    routing_hidden=pooled_hidden,  # from primary's first layer
)

# result["output"]           -- primary model output
# result["specialist_outputs"] -- list of activated specialist results
# result["activation_mask"]  -- (B, num_specialists) boolean mask

When routing_hidden is None, specialist activation is skipped entirely and only the primary model runs. This allows the same orchestrator to be used in both supervised pre-training (no specialists) and PARL training (with specialists).
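The two-mode dispatch can be sketched as follows. The dict keys mirror the result shape above, but the internals are illustrative, not the actual orchestrator code:

```python
def forward(primary, specialists, router, x, routing_hidden=None):
    """When routing_hidden is None, skip routing entirely (supervised
    pre-training); otherwise gate each frozen specialist on the
    router's boolean mask (PARL training)."""
    if routing_hidden is None:
        return {"output": primary(x),
                "specialist_outputs": [],
                "activation_mask": None}
    activation_mask = router(routing_hidden)       # e.g. [True, False, ...]
    specialist_outputs = [spec(x) for spec, on
                          in zip(specialists, activation_mask) if on]
    return {"output": primary(x),
            "specialist_outputs": specialist_outputs,
            "activation_mask": activation_mask}
```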