To minimize the difference, we need to find
Relax the problem to:
We introduce an
We construct an approximation by independently sampling
The error
Using the arithmetic–geometric mean inequality, we have
where equality holds when
Therefore, the optimal distribution is
Equality holds when
the optimal solution is to select the
Spatial Explanation: The larger a vector's magnitude, the less likely it is to be canceled out during summation, and thus the more significant its contribution becomes.
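A hedged reconstruction of the argument sketched above, under the standard importance-sampling view of sparse MoE (the notation $\rho_i$ for gate scores and $e_i$ for expert outputs is mine, not necessarily the slide's):

$$
y=\sum_{i=1}^{n}\rho_i e_i=\mathbb{E}_{i\sim p}\!\left[\frac{\rho_i e_i}{p_i}\right],
\qquad
\hat{y}=\frac{1}{k}\sum_{j=1}^{k}\frac{\rho_{i_j} e_{i_j}}{p_{i_j}},
\quad i_1,\dots,i_k\overset{\text{i.i.d.}}{\sim}p,
$$

$$
\mathbb{E}\,\|\hat{y}-y\|^2
=\frac{1}{k}\left(\sum_{i=1}^{n}\frac{\rho_i^2\|e_i\|^2}{p_i}-\|y\|^2\right)
\;\ge\;\frac{1}{k}\left(\Big(\sum_{i=1}^{n}|\rho_i|\,\|e_i\|\Big)^{2}-\|y\|^2\right),
\qquad
p_i^{\star}=\frac{|\rho_i|\,\|e_i\|}{\sum_{j}|\rho_j|\,\|e_j\|}.
$$

The bound follows from Cauchy–Schwarz (equivalently, the AM–GM inequality applied term by term), with equality exactly at $p^{\star}$; the deterministic analogue is to keep the $k$ terms with the largest $|\rho_i|\,\|e_i\|$, which is the spatial intuition stated above.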
To compute each
Final MoE output:
Spatial Explanation:
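A minimal PyTorch-style sketch of the top-k forward pass described above (class and parameter names such as `TopKMoE`, `n_experts`, `k` are illustrative, not necessarily the slide's exact formulation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k MoE layer: each token is processed by its k highest-scoring experts."""

    def __init__(self, d_model: int, n_experts: int, k: int):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                 # x: [tokens, d_model]
        probs = F.softmax(self.router(x), dim=-1)         # gate scores
        topk_p, topk_idx = probs.topk(self.k, dim=-1)     # keep only the k largest gates
        y = torch.zeros_like(x)
        for slot in range(self.k):                        # weighted sum over the selected experts
            idx, w = topk_idx[:, slot], topk_p[:, slot:slot + 1]
            for e in idx.unique():
                mask = idx == e
                y[mask] += w[mask] * self.experts[int(e)](x[mask])
        return y
```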
Training: MoEs enable significantly more compute-efficient pretraining, but they’ve historically struggled to generalize during fine-tuning, leading to overfitting.
Inference: Although a MoE might have many parameters, only some of them are used during inference. This leads to much faster inference compared to a dense model with the same number of parameters. However, all parameters need to be loaded in RAM, so memory requirements are high.
where
Uniform distribution:
From the load-balancing perspective, the ideal state is that all experts receive an equal share of tokens.
Aux Loss:
However,
Aux Loss:
Gradient:
Compare the two formulations:
The two formulations yield the same gradient.
However, equation (2) reaches its minimum when
Begin from the negative of maximum entropy
Straight-Through Estimator:
Gradient:
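The slide's two formulations are not reproduced above; as a reference point, a sketch of the widely used Switch-Transformer-style balancing loss, which is minimized when tokens are spread uniformly across experts (names are illustrative):

```python
import torch
import torch.nn.functional as F

def load_balancing_aux_loss(router_logits: torch.Tensor, k: int) -> torch.Tensor:
    """Switch-Transformer-style aux loss: n_experts * sum_i f_i * P_i.

    f_i: fraction of top-k assignments that go to expert i (non-differentiable count).
    P_i: mean router probability assigned to expert i (differentiable).
    Both are uniform (1 / n_experts) at the minimum.
    """
    n_experts = router_logits.size(-1)
    probs = F.softmax(router_logits, dim=-1)                       # [tokens, n_experts]
    topk_idx = probs.topk(k, dim=-1).indices                       # [tokens, k]
    assigned = F.one_hot(topk_idx, n_experts).sum(dim=1).float()   # [tokens, n_experts]
    f = assigned.mean(dim=0) / k        # fraction of assignments per expert
    P = probs.mean(dim=0)               # mean gate probability per expert
    return n_experts * torch.sum(f * P)
```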
Will the Aux Loss exacerbate MoE Models' instability during training?


Stage 1: Learning to route
Stage 2: Router Frozen

The auxiliary loss introduces interference gradients, which can disturb the optimization direction of the primary language modeling objective.
An important observation: The balance among experts can be achieved through a single bias term.
This approach allows token routing to remain primarily guided by the gating logits
Loss:
STE with
Gradient:
Gradient:
Update:
When
An iterative process of token routing and bias updating.
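A sketch of that route-then-update loop, along the lines of DeepSeek-V3's auxiliary-loss-free balancing: the bias only shifts the top-k selection, while the output weights still come from the original gate scores. The sign-based update rule and step size below are assumptions, not the slide's exact formula.

```python
import torch

def route_with_bias(scores, bias, k):
    """Select top-k by (score + bias), but weight expert outputs by the raw scores."""
    topk_idx = (scores + bias).topk(k, dim=-1).indices   # bias influences selection only
    gate = torch.gather(scores, -1, topk_idx)            # gating weights stay bias-free
    return topk_idx, gate

def update_bias(bias, topk_idx, n_experts, gamma=1e-3):
    """Nudge each expert's bias toward the average load after every batch."""
    load = torch.bincount(topk_idx.flatten(), minlength=n_experts).float()
    bias -= gamma * torch.sign(load - load.mean())       # overloaded -> lower bias, underloaded -> raise
    return bias
```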



The error
Activating exactly
The purpose of sparsity is approximation, not hard selection.
Therefore, we can consider activating a different number of experts for each token.
Routing:
Optimization goals of
Update rule:
Update rule:
Each update to
However, the number of activated experts may still exceed the budget.
If we want the average number of activated experts per token to be
If we want the number of expert activations per token to never exceed
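The slide does not spell out the mechanism; one common realization of per-token variable activation is threshold (top-p-style) routing with a hard cap, sketched below under that assumption (the rule shown is not necessarily the slide's):

```python
import torch
import torch.nn.functional as F

def dynamic_expert_selection(router_logits, p=0.6, k_max=8):
    """Activate, per token, the smallest expert set whose cumulative probability reaches p,
    never exceeding k_max experts (assumed top-p-style rule)."""
    probs = F.softmax(router_logits, dim=-1)
    sorted_p, sorted_idx = probs.sort(dim=-1, descending=True)
    cum = sorted_p.cumsum(dim=-1)
    # keep an expert if the probability mass accumulated *before* it is still below p
    keep = (cum - sorted_p) < p
    keep[..., k_max:] = False                     # hard per-token budget
    return sorted_idx, sorted_p * keep            # zero out weights of unselected experts
```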

Residual perspective:
To treat shared experts as residual components.
DeepSeek: To ensure that each routed expert focuses on distinct aspects.
Geometric perspective:
Previous assumption: all routed experts are orthogonal.
The residuals satisfy the orthogonality assumption more easily.
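A sketch of the residual view with shared experts (DeepSeekMoE-style layout; the `shared` / `routed` split and module names are illustrative, and `TopKMoE` refers to the earlier sketch):

```python
import torch.nn as nn

class MoEWithSharedExperts(nn.Module):
    """Output = residual + always-on shared experts + sparsely routed experts.
    The shared path captures common knowledge, so routed experts can specialize
    on the smaller, more nearly orthogonal residual."""

    def __init__(self, d_model, n_shared, n_routed, k):
        super().__init__()
        ffn = lambda: nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                    nn.Linear(4 * d_model, d_model))
        self.shared = nn.ModuleList([ffn() for _ in range(n_shared)])
        self.routed = TopKMoE(d_model, n_routed, k)   # sparse part, as sketched earlier

    def forward(self, x):
        y = x                                          # residual connection
        for expert in self.shared:
            y = y + expert(x)                          # shared experts: every token, every time
        return y + self.routed(x)                      # routed experts model what remains
```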
Routing fluctuations: Expert-selection instability amplifies importance-sampling ratio variance, causing frequent clipping and degrading training stability.
Variance mismatch: Using token-level importance ratios to correct sequence-level advantages introduces systematic variance misalignment, which is magnified in MoE and further destabilizes GRPO training.
For rollout-update discrepancies.
To cache the activated experts in
For each token

Reusing cached routing decisions incurs additional memory and communication overhead, and may also constrain the effective capacity of the MoE model.
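A minimal sketch of the replay idea (hypothetical interface; real systems cache and ship per-token routing decisions between the rollout and training engines):

```python
import torch

class ReplayableRouter(torch.nn.Module):
    """Router that records the experts chosen during rollout and replays them during training,
    so the policy-gradient update sees the same routing as the sampled trajectory."""

    def __init__(self, d_model, n_experts, k):
        super().__init__()
        self.proj = torch.nn.Linear(d_model, n_experts, bias=False)
        self.k = k

    def forward(self, x, replay_idx=None):
        scores = self.proj(x).softmax(dim=-1)
        if replay_idx is None:                        # rollout: route freely, remember the choice
            idx = scores.topk(self.k, dim=-1).indices
        else:                                         # training: reuse the cached rollout routing
            idx = replay_idx
        gate = torch.gather(scores, -1, idx)          # gates stay differentiable either way
        return idx, gate

# usage sketch:
#   idx, _ = router(hidden)                  -> store idx alongside the trajectory at rollout time
#   _, gate = router(hidden, replay_idx=idx) -> reuse it during the training forward
```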

Inconsistency between inference engine and training engine
| Category | Inference Engine (SGLang) | Training Engine (Megatron) | Why Router Differs |
|---|---|---|---|
| (A) Kernel & Computation | Fused kernels; paged attention; per-token execution | TP/EP kernels; GEMM-based attention; batch execution | Tiny Q/K/V numeric shifts → top-k expert changes |
| (B) KV-Cache Behavior | Block-based PagedAttention KV cache; dynamic scheduling | Dense KV cache; strict teacher forcing | Different hidden states → different router logits |
| (C) Parallelism (TP/EP) | Often merged weights or light TP | Full Tensor Parallel + Expert Parallel + All-to-All | Weight sharding/layout mismatch → logit drift |
| (D) Non-determinism | Deterministic inference; dropout disabled | Dropout enabled; CUDA reduction order nondeterminism; repeated forwards drift | Forward instability → inconsistent routing |

Training-inference inconsistency.

where
GSPO applies clipping to entire responses instead of individual tokens to exclude the overly "off-policy" samples from gradient estimation.
GMPO has a narrower value range than GRPO:
The training process of GMPO experiences lower variance in the optimization objective.
More stable policy updates. Less sensitive to outliers.
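A sketch of how the sequence-level ratio used by GSPO differs from GRPO's token-level ratios, computed from per-token log-probabilities (clipping bound and tensor names are illustrative):

```python
import torch

def grpo_token_ratios(logp_new, logp_old):
    """GRPO: one importance ratio per token."""
    return (logp_new - logp_old).exp()                              # [batch, seq_len]

def gspo_sequence_ratio(logp_new, logp_old, mask):
    """GSPO: a single length-normalized ratio per response, so clipping removes
    whole overly off-policy samples rather than individual tokens."""
    delta = ((logp_new - logp_old) * mask).sum(-1) / mask.sum(-1)   # mean token log-ratio
    return delta.exp()                                              # [batch]

def clipped_objective(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate, applied at whichever granularity `ratio` has."""
    return torch.minimum(ratio * advantage,
                         ratio.clamp(1 - eps, 1 + eps) * advantage).mean()
```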
Routing Replay constrains router updates and incurs significant memory and communication overhead.
GSPO does not fundamentally resolve routing distribution drift, and its sequence-level
clipping can over-prune tokens, potentially discarding useful gradient information.
Instead of fully constraining the router, RSPO introduces a router shift ratio, computed from router scores between the current and old policies.
This ratio quantifies the degree of routing deviation for each token and is used to softly rescale IS weights.
Averaging the scores of the top-K experts that were activated under the old policy.
The larger the routing change is, the smaller the value of
When
When
If
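The exact RSPO formula is not reproduced on the slide; one plausible reading of the description above, with hypothetical names (not the paper's definition):

```python
import torch

def router_shift_ratio(scores_new, scores_old, old_topk_idx, eps=1e-8):
    """Per-token router shift ratio: the current-policy router scores averaged over the experts
    the *old* policy activated, relative to the old scores on those same experts.
    Large routing changes push the ratio toward 0."""
    new_on_old = torch.gather(scores_new, -1, old_topk_idx).mean(-1)
    old_on_old = torch.gather(scores_old, -1, old_topk_idx).mean(-1)
    return (new_on_old / (old_on_old + eps)).clamp(max=1.0)

def rescaled_is_weight(is_weight, shift_ratio):
    """Softly down-weight tokens whose routing drifted, instead of hard-constraining the router."""
    return is_weight * shift_ratio
```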

For each token, its MoE-related communication frequency is proportional to the number of devices covered by its target experts.
The number of activated experts can be large.
To additionally ensure that the target experts of each token will be distributed on at most
Findings: when
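A sketch of device-limited (group-limited) routing in the style of DeepSeek-V2/V3: first pick a token's top-M devices by their best affinity score, then run top-k only over experts hosted on those devices (names and layout assumptions are illustrative):

```python
import torch

def device_limited_topk(scores, k, n_devices, m):
    """scores: [tokens, n_experts], experts laid out contiguously per device.
    Restricting each token to M devices bounds its all-to-all communication fan-out."""
    tokens, n_experts = scores.shape
    per_dev = n_experts // n_devices
    dev_scores = scores.view(tokens, n_devices, per_dev).amax(-1)   # best expert per device
    top_dev = dev_scores.topk(m, dim=-1).indices                    # [tokens, m]
    # mask out experts that live on non-selected devices
    dev_of_expert = torch.arange(n_experts, device=scores.device) // per_dev
    allowed = (dev_of_expert.unsqueeze(0).unsqueeze(-1) == top_dev.unsqueeze(1)).any(-1)
    masked = scores.masked_fill(~allowed, float("-inf"))
    return masked.topk(k, dim=-1).indices
```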
First, compute the average computational budget for each device.
Then, drop the tokens with the lowest affinity scores on each device until the computational budget is reached.
To ensure that the tokens belonging to approximately 10% of the training sequences will never be dropped.
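A sketch of that device-level dropping rule as described: each device keeps its highest-affinity assignments up to a capacity derived from the average load, and assignments from an exempt subset of sequences are never dropped (helper and argument names are hypothetical):

```python
import torch

def drop_tokens_by_affinity(affinity, device_id, n_devices, capacity_factor=1.0, exempt=None):
    """affinity: [n_assignments] routing scores; device_id: target device of each assignment;
    exempt: bool mask of assignments that must never be dropped (e.g. ~10% of sequences)."""
    keep = torch.ones_like(affinity, dtype=torch.bool)
    capacity = int(capacity_factor * len(affinity) / n_devices)   # average budget per device
    for d in range(n_devices):
        on_d = (device_id == d).nonzero(as_tuple=True)[0]
        kept_count = len(on_d)
        for j in affinity[on_d].argsort():                        # lowest affinity first
            if kept_count <= capacity:
                break
            idx = on_d[j]
            if exempt is not None and exempt[idx]:
                continue                                          # protected assignment
            keep[idx] = False
            kept_count -= 1
    return keep
```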
where
where
Ensuring that each device transmits at most
Simultaneously, the communication balance loss is employed to encourage each device to receive around
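The slide's formulas are not reproduced above; as a reference point, the DeepSeek-V2-style communication balance loss takes the form below ($D$ devices, device limit $M$, $T$ tokens; the assumption is that the slide uses this formulation):

$$
\mathcal{L}_{\text{CommBal}}=\alpha\sum_{i=1}^{D} f''_i\,P''_i,
\qquad
f''_i=\frac{D}{M\,T}\sum_{t=1}^{T}\mathbb{1}\{\text{token }t\text{ is sent to device }i\},
\qquad
P''_i=\sum_{j\in\mathcal{E}_i}P_j,
$$

where $\mathcal{E}_i$ is the set of experts hosted on device $i$ and $P_j$ is the average routing probability of expert $j$. The loss is minimized when each device receives roughly $\tfrac{MT}{D}$ token transmissions, matching the statement above.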
In DeepSeek-V2, softmax is applied over all experts, and the top-k probabilities are used directly as weights.
In DeepSeek-V3, softmax is not applied; normalization is instead applied only among the selected top-k weights.
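A side-by-side sketch of the two gating schemes (DeepSeek-V3 additionally scores experts with a sigmoid rather than a softmax; the selection bias term is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def gate_v2_style(logits, k):
    """DeepSeek-V2 style: softmax over *all* experts, then take the top-k probabilities as weights."""
    probs = F.softmax(logits, dim=-1)
    w, idx = probs.topk(k, dim=-1)
    return idx, w                                   # weights need not sum to 1

def gate_v3_style(logits, k):
    """DeepSeek-V3 style: no global softmax; score each expert independently (sigmoid),
    then normalize only among the selected top-k."""
    scores = torch.sigmoid(logits)
    w, idx = scores.topk(k, dim=-1)
    return idx, w / w.sum(dim=-1, keepdim=True)     # normalization restricted to the chosen experts
```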
Reasons:
To ensure that each token will be sent to at most
where
Yuxuan Wang
2025-11-21