To minimize the difference, we need to find
Relax the problem to:
We introduce an
We construct an approximation by independently sampling
The error
Using the arithmetic–geometric mean inequality, we have
where equality holds when
Therefore, the optimal distribution is
Equality holds when
the optimal solution is to select the
Spatial Explanation: The larger a vector's magnitude, the less likely it is to be canceled out during summation, and thus the more significant its contribution becomes.
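A hedged reconstruction of the argument sketched above, under the standard importance-sampling view of sparse MoE (the notation $\rho_i$ for gate scores and $e_i$ for expert outputs is mine, not necessarily the slide's):

$$
y=\sum_{i=1}^{n}\rho_i e_i=\mathbb{E}_{i\sim p}\!\left[\frac{\rho_i e_i}{p_i}\right],
\qquad
\hat{y}=\frac{1}{k}\sum_{j=1}^{k}\frac{\rho_{i_j} e_{i_j}}{p_{i_j}},
\quad i_1,\dots,i_k\overset{\text{i.i.d.}}{\sim}p,
$$

$$
\mathbb{E}\,\|\hat{y}-y\|^2
=\frac{1}{k}\left(\sum_{i=1}^{n}\frac{\rho_i^2\|e_i\|^2}{p_i}-\|y\|^2\right)
\;\ge\;\frac{1}{k}\left(\Big(\sum_{i=1}^{n}|\rho_i|\,\|e_i\|\Big)^{2}-\|y\|^2\right),
\qquad
p_i^{\star}=\frac{|\rho_i|\,\|e_i\|}{\sum_{j}|\rho_j|\,\|e_j\|}.
$$

The bound follows from Cauchy–Schwarz (equivalently, the AM–GM inequality applied term by term), with equality exactly at $p^{\star}$; the deterministic analogue is to keep the $k$ terms with the largest $|\rho_i|\,\|e_i\|$, which is the spatial intuition stated above.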
To compute each
Final MoE output:
Spatial Explanation:
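A minimal PyTorch-style sketch of the top-k forward pass described above (class and parameter names such as `TopKMoE`, `n_experts`, `k` are illustrative, not necessarily the slide's exact formulation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k MoE layer: each token is processed by its k highest-scoring experts."""

    def __init__(self, d_model: int, n_experts: int, k: int):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                 # x: [tokens, d_model]
        probs = F.softmax(self.router(x), dim=-1)         # gate scores
        topk_p, topk_idx = probs.topk(self.k, dim=-1)     # keep only the k largest gates
        y = torch.zeros_like(x)
        for slot in range(self.k):                        # weighted sum over the selected experts
            idx, w = topk_idx[:, slot], topk_p[:, slot:slot + 1]
            for e in idx.unique():
                mask = idx == e
                y[mask] += w[mask] * self.experts[int(e)](x[mask])
        return y
```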
Training: MoEs enable significantly more compute-efficient pretraining, but they’ve historically struggled to generalize during fine-tuning, leading to overfitting.
Inference: Although a MoE might have many parameters, only some of them are used during inference. This leads to much faster inference compared to a dense model with the same number of parameters. However, all parameters need to be loaded in RAM, so memory requirements are high.
where
Uniform distribution:
From the load-balancing perspective, the ideal state is that all experts receive an equal share of tokens.
Aux Loss:
However,
Aux Loss:
Gradient:
Compare the two formulations:
The two formulations yield the same gradient.
However, equation (2) reaches its minimum when
Begin from the negative of maximum entropy
Straight-Through Estimator:
Gradient:
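The slide's two formulations are not reproduced above; as a reference point, a sketch of the widely used Switch-Transformer-style balancing loss, which is minimized when tokens are spread uniformly across experts (names are illustrative):

```python
import torch
import torch.nn.functional as F

def load_balancing_aux_loss(router_logits: torch.Tensor, k: int) -> torch.Tensor:
    """Switch-Transformer-style aux loss: n_experts * sum_i f_i * P_i.

    f_i: fraction of top-k assignments that go to expert i (non-differentiable count).
    P_i: mean router probability assigned to expert i (differentiable).
    Both are uniform (1 / n_experts) at the minimum.
    """
    n_experts = router_logits.size(-1)
    probs = F.softmax(router_logits, dim=-1)                       # [tokens, n_experts]
    topk_idx = probs.topk(k, dim=-1).indices                       # [tokens, k]
    assigned = F.one_hot(topk_idx, n_experts).sum(dim=1).float()   # [tokens, n_experts]
    f = assigned.mean(dim=0) / k        # fraction of assignments per expert
    P = probs.mean(dim=0)               # mean gate probability per expert
    return n_experts * torch.sum(f * P)
```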
Will the Aux Loss exacerbate MoE Models' instability during training?


Stage 1: Learning to route
Stage 2: Router Frozen

The auxiliary loss introduces interference gradients, which can disturb the optimization direction of the primary language modeling objective.
An important observation: The balance among experts can be achieved through a single bias term.
This approach allows token routing to remain primarily guided by the gating logits
Loss:
STE with
Gradient:
Gradient:
Update:
When
An iterative process of token routing and bias updating.
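A sketch of that route-then-update loop, along the lines of DeepSeek-V3's auxiliary-loss-free balancing: the bias only shifts the top-k selection, while the output weights still come from the original gate scores. The sign-based update rule and step size below are assumptions, not the slide's exact formula.

```python
import torch

def route_with_bias(scores, bias, k):
    """Select top-k by (score + bias), but weight expert outputs by the raw scores."""
    topk_idx = (scores + bias).topk(k, dim=-1).indices   # bias influences selection only
    gate = torch.gather(scores, -1, topk_idx)            # gating weights stay bias-free
    return topk_idx, gate

def update_bias(bias, topk_idx, n_experts, gamma=1e-3):
    """Nudge each expert's bias toward the average load after every batch."""
    load = torch.bincount(topk_idx.flatten(), minlength=n_experts).float()
    bias -= gamma * torch.sign(load - load.mean())       # overloaded -> lower bias, underloaded -> raise
    return bias
```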



The error
Activating exactly
The purpose of sparsity is approximation, not hard selection.
Therefore, we can consider activating a different number of experts for each token.
Routing:
Optimization goals of
Update rule:
Update rule:
Each update to
However, the number of activated experts may still exceed the budget.
If we want the average number of activated experts per token to be
If we want the number of expert activations per token to never exceed
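The slide does not spell out the mechanism; one common realization of per-token variable activation is threshold (top-p-style) routing with a hard cap, sketched below under that assumption (the rule shown is not necessarily the slide's):

```python
import torch
import torch.nn.functional as F

def dynamic_expert_selection(router_logits, p=0.6, k_max=8):
    """Activate, per token, the smallest expert set whose cumulative probability reaches p,
    never exceeding k_max experts (assumed top-p-style rule)."""
    probs = F.softmax(router_logits, dim=-1)
    sorted_p, sorted_idx = probs.sort(dim=-1, descending=True)
    cum = sorted_p.cumsum(dim=-1)
    # keep an expert if the probability mass accumulated *before* it is still below p
    keep = (cum - sorted_p) < p
    keep[..., k_max:] = False                     # hard per-token budget
    return sorted_idx, sorted_p * keep            # zero out weights of unselected experts
```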

Residual perspective:
To treat shared experts as residual components.
DeepSeek: To ensure that each routed expert focuses on distinct aspects.
Geometric perspective:
Previous assumption: all routed experts are orthogonal.
The residuals satisfy the orthogonality assumption more easily.
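A sketch of the residual view with shared experts (DeepSeekMoE-style layout; the `shared` / `routed` split and module names are illustrative, and `TopKMoE` refers to the earlier sketch):

```python
import torch.nn as nn

class MoEWithSharedExperts(nn.Module):
    """Output = residual + always-on shared experts + sparsely routed experts.
    The shared path captures common knowledge, so routed experts can specialize
    on the smaller, more nearly orthogonal residual."""

    def __init__(self, d_model, n_shared, n_routed, k):
        super().__init__()
        ffn = lambda: nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                    nn.Linear(4 * d_model, d_model))
        self.shared = nn.ModuleList([ffn() for _ in range(n_shared)])
        self.routed = TopKMoE(d_model, n_routed, k)   # sparse part, as sketched earlier

    def forward(self, x):
        y = x                                          # residual connection
        for expert in self.shared:
            y = y + expert(x)                          # shared experts: every token, every time
        return y + self.routed(x)                      # routed experts model what remains
```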
Routing fluctuations: Expert-selection instability amplifies importance-sampling ratio variance, causing frequent clipping and degrading training stability.
Variance mismatch: Using token-level importance ratios to correct sequence-level advantages introduces systematic variance misalignment, which is magnified in MoE and further destabilizes GRPO training.
For rollout-update discrepancies.
To cache the activated experts in
For each token

Reusing cached routing decisions incurs additional memory and communication overhead, and may also constrain the effective capacity of the MoE model.
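A minimal sketch of the replay idea (hypothetical interface; real systems cache and ship per-token routing decisions between the rollout and training engines):

```python
import torch

class ReplayableRouter(torch.nn.Module):
    """Router that records the experts chosen during rollout and replays them during training,
    so the policy-gradient update sees the same routing as the sampled trajectory."""

    def __init__(self, d_model, n_experts, k):
        super().__init__()
        self.proj = torch.nn.Linear(d_model, n_experts, bias=False)
        self.k = k

    def forward(self, x, replay_idx=None):
        scores = self.proj(x).softmax(dim=-1)
        if replay_idx is None:                        # rollout: route freely, remember the choice
            idx = scores.topk(self.k, dim=-1).indices
        else:                                         # training: reuse the cached rollout routing
            idx = replay_idx
        gate = torch.gather(scores, -1, idx)          # gates stay differentiable either way
        return idx, gate

# usage sketch:
#   idx, _ = router(hidden)                  -> store idx alongside the trajectory at rollout time
#   _, gate = router(hidden, replay_idx=idx) -> reuse it during the training forward
```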

Inconsistency between inference engine and training engine
| Category | Inference Engine (SGLang) | Training Engine (Megatron) | Why Router Differs |
|---|---|---|---|
| (A) Kernel & Computation | Fused kernels; paged attention; per-token execution | TP/EP kernels; GEMM-based attention; batch execution | Tiny Q/K/V numeric shifts → top-k expert changes |
| (B) KV-Cache Behavior | Block-based PagedAttention KV cache; dynamic scheduling | Dense KV cache; strict teacher forcing | Different hidden states → different router logits |
| (C) Parallelism (TP/EP) | Often merged weights or light TP | Full Tensor Parallel + Expert Parallel + All-to-All | Weight sharding/layout mismatch → logit drift |
| (D) Non-determinism | Deterministic inference; dropout disabled | Dropout enabled; CUDA reduction order nondeterminism; repeated forwards drift | Forward instability → inconsistent routing |

Training-inference inconsistency.

where
GSPO applies clipping to entire responses instead of individual tokens to exclude the overly "off-policy" samples from gradient estimation.
GMPO has a narrower value range than GRPO:
The training process of GMPO experiences lower variance in the optimization objective.
More stable policy updates. Less sensitive to outliers.
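A sketch of how the sequence-level ratio used by GSPO differs from GRPO's token-level ratios, computed from per-token log-probabilities (clipping bound and tensor names are illustrative):

```python
import torch

def grpo_token_ratios(logp_new, logp_old):
    """GRPO: one importance ratio per token."""
    return (logp_new - logp_old).exp()                              # [batch, seq_len]

def gspo_sequence_ratio(logp_new, logp_old, mask):
    """GSPO: a single length-normalized ratio per response, so clipping removes
    whole overly off-policy samples rather than individual tokens."""
    delta = ((logp_new - logp_old) * mask).sum(-1) / mask.sum(-1)   # mean token log-ratio
    return delta.exp()                                              # [batch]

def clipped_objective(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate, applied at whichever granularity `ratio` has."""
    return torch.minimum(ratio * advantage,
                         ratio.clamp(1 - eps, 1 + eps) * advantage).mean()
```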
Routing Replay constrains router updates and incurs significant memory and communication overhead.
GSPO does not fundamentally resolve routing distribution drift, and its sequence-level
clipping can over-prune tokens, potentially discarding useful gradient information.
Instead of fully constraining the router, RSPO introduces a router shift ratio, computed from router scores between the current and old policies.
This ratio quantifies the degree of routing deviation for each token and is used to softly rescale IS weights.
Averaging the scores of the top-K experts that were activated under the old policy.
The larger the routing change is, the smaller the value of
When
When
If
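The exact RSPO formula is not reproduced on the slide; one plausible reading of the description above, with hypothetical names (not the paper's definition):

```python
import torch

def router_shift_ratio(scores_new, scores_old, old_topk_idx, eps=1e-8):
    """Per-token router shift ratio: the current-policy router scores averaged over the experts
    the *old* policy activated, relative to the old scores on those same experts.
    Large routing changes push the ratio toward 0."""
    new_on_old = torch.gather(scores_new, -1, old_topk_idx).mean(-1)
    old_on_old = torch.gather(scores_old, -1, old_topk_idx).mean(-1)
    return (new_on_old / (old_on_old + eps)).clamp(max=1.0)

def rescaled_is_weight(is_weight, shift_ratio):
    """Softly down-weight tokens whose routing drifted, instead of hard-constraining the router."""
    return is_weight * shift_ratio
```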

For each token, its MoE-related communication frequency is proportional to the number of devices covered by its target experts.
The number of activated experts can be large.
To additionally ensure that the target experts of each token will be distributed on at most
Findings: when
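A sketch of device-limited (group-limited) routing in the style of DeepSeek-V2/V3: first pick a token's top-M devices by their best affinity score, then run top-k only over experts hosted on those devices (names and layout assumptions are illustrative):

```python
import torch

def device_limited_topk(scores, k, n_devices, m):
    """scores: [tokens, n_experts], experts laid out contiguously per device.
    Restricting each token to M devices bounds its all-to-all communication fan-out."""
    tokens, n_experts = scores.shape
    per_dev = n_experts // n_devices
    dev_scores = scores.view(tokens, n_devices, per_dev).amax(-1)   # best expert per device
    top_dev = dev_scores.topk(m, dim=-1).indices                    # [tokens, m]
    # mask out experts that live on non-selected devices
    dev_of_expert = torch.arange(n_experts, device=scores.device) // per_dev
    allowed = (dev_of_expert.unsqueeze(0).unsqueeze(-1) == top_dev.unsqueeze(1)).any(-1)
    masked = scores.masked_fill(~allowed, float("-inf"))
    return masked.topk(k, dim=-1).indices
```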
First, compute the average computational budget for each device.
Then, drop the tokens with the lowest affinity scores on each device until the computational budget is reached.
To ensure that the tokens belonging to approximately 10% of the training sequences will never be dropped.
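A sketch of that device-level dropping rule as described: each device keeps its highest-affinity assignments up to a capacity derived from the average load, and assignments from an exempt subset of sequences are never dropped (helper and argument names are hypothetical):

```python
import torch

def drop_tokens_by_affinity(affinity, device_id, n_devices, capacity_factor=1.0, exempt=None):
    """affinity: [n_assignments] routing scores; device_id: target device of each assignment;
    exempt: bool mask of assignments that must never be dropped (e.g. ~10% of sequences)."""
    keep = torch.ones_like(affinity, dtype=torch.bool)
    capacity = int(capacity_factor * len(affinity) / n_devices)   # average budget per device
    for d in range(n_devices):
        on_d = (device_id == d).nonzero(as_tuple=True)[0]
        kept_count = len(on_d)
        for j in affinity[on_d].argsort():                        # lowest affinity first
            if kept_count <= capacity:
                break
            idx = on_d[j]
            if exempt is not None and exempt[idx]:
                continue                                          # protected assignment
            keep[idx] = False
            kept_count -= 1
    return keep
```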
where
where
Ensuring that each device transmits at most
Simultaneously, the communication balance loss is employed to encourage each device to receive around
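The slide's formulas are not reproduced above; as a reference point, the DeepSeek-V2-style communication balance loss takes the form below ($D$ devices, device limit $M$, $T$ tokens; the assumption is that the slide uses this formulation):

$$
\mathcal{L}_{\text{CommBal}}=\alpha\sum_{i=1}^{D} f''_i\,P''_i,
\qquad
f''_i=\frac{D}{M\,T}\sum_{t=1}^{T}\mathbb{1}\{\text{token }t\text{ is sent to device }i\},
\qquad
P''_i=\sum_{j\in\mathcal{E}_i}P_j,
$$

where $\mathcal{E}_i$ is the set of experts hosted on device $i$ and $P_j$ is the average routing probability of expert $j$. The loss is minimized when each device receives roughly $\tfrac{MT}{D}$ token transmissions, matching the statement above.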
In DeepSeek-V2, softmax is applied over all experts, and the top-k probabilities are used directly as weights.
In DeepSeek-V3, softmax is not applied; normalization is instead applied only among the selected top-k weights.
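A side-by-side sketch of the two gating schemes (DeepSeek-V3 additionally scores experts with a sigmoid rather than a softmax; the selection bias term is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def gate_v2_style(logits, k):
    """DeepSeek-V2 style: softmax over *all* experts, then take the top-k probabilities as weights."""
    probs = F.softmax(logits, dim=-1)
    w, idx = probs.topk(k, dim=-1)
    return idx, w                                   # weights need not sum to 1

def gate_v3_style(logits, k):
    """DeepSeek-V3 style: no global softmax; score each expert independently (sigmoid),
    then normalize only among the selected top-k."""
    scores = torch.sigmoid(logits)
    w, idx = scores.topk(k, dim=-1)
    return idx, w / w.sum(dim=-1, keepdim=True)     # normalization restricted to the chosen experts
```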
Reasons:
To ensure that each token will be sent to at most
where
Yuxuan Wang
2025-11-21