Towards Balancing the Experts

Yuxuan Wang
2025-11-21

Outline

Some Insights into MoE

  • Spatial Explanation of Routers

Balancing & Stability

  • Balancing Among Experts
  • Training Stability

Mixture of Experts

MoE illustration

Formula of MoE

Traditional FFN:

To chunk weights into pieces:

Only calculate top r terms out of n.
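
A minimal sketch of this decomposition (notation assumed here: $n$ expert slices $E_i$ obtained by chunking $W_1$ and $W_2$, top-$r$ selection with gate values $g_i$):

\[
\mathrm{FFN}(x) = W_2\,\sigma(W_1 x) = \sum_{i=1}^{n} W_{2,i}\,\sigma(W_{1,i}\,x) =: \sum_{i=1}^{n} E_i(x),
\qquad
\mathrm{MoE}(x) = \sum_{i \in \mathrm{Top}\text{-}r} g_i(x)\, E_i(x).
\]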

Towards Sparsity

To minimize the difference, we need to find that satisfy

Relax the problem to:

Sampling

We introduce an -dimensional distribution , then

We construct an approximation by independently sampling times from :

Sampling

The error
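
A sketch with assumed notation: write the exact sum as $y=\sum_{i=1}^{n} v_i$ and draw $m$ indices $i_1,\dots,i_m$ i.i.d. from a distribution $p$ over the $n$ terms. Then

\[
\hat{y} = \frac{1}{m}\sum_{t=1}^{m} \frac{v_{i_t}}{p_{i_t}},
\qquad
\mathbb{E}[\hat{y}] = y,
\qquad
\mathbb{E}\,\|\hat{y}-y\|^2 = \frac{1}{m}\Bigl(\sum_{i=1}^{n} \frac{\|v_i\|^2}{p_i} - \|y\|^2\Bigr),
\]

so minimizing the error amounts to minimizing $\sum_i \|v_i\|^2 / p_i$ over $p$.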

Sampling

Using the arithmetic–geometric mean inequality, we have

where equality holds when

Therefore, the optimal distribution is
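
Concretely (same assumed notation), for any distribution $p$,

\[
\sum_{i=1}^{n} \frac{\|v_i\|^2}{p_i}
= \Bigl(\sum_{i=1}^{n} \frac{\|v_i\|^2}{p_i}\Bigr)\Bigl(\sum_{j=1}^{n} p_j\Bigr)
\;\ge\; \Bigl(\sum_{i=1}^{n} \|v_i\|\Bigr)^2,
\]

where the bound follows by expanding the product and applying AM–GM to each cross term, $\frac{\|v_i\|^2 p_j}{p_i} + \frac{\|v_j\|^2 p_i}{p_j} \ge 2\|v_i\|\,\|v_j\|$. Equality holds when $p_i \propto \|v_i\|$, giving

\[
p_i^{\star} = \frac{\|v_i\|}{\sum_{j=1}^{n} \|v_j\|}.
\]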

Spatial Explanation

Equality holds when ,
the optimal solution is to select the vectors with the largest magnitudes.

Spatial Explanation: The larger the magnitude of a vector, the less likely it is to be canceled out during summation, and thus the more significant its contribution becomes.

What Does the Router Do?

To compute each efficiently.

Final MoE output:
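
A sketch of the gated output under the usual router parameterization (the router weight $W_r$ and gate $g$ are assumed notation):

\[
s(x) = \mathrm{Softmax}(W_r x),
\qquad
\mathrm{MoE}(x) = \sum_{i \in \mathrm{TopK}(s(x))} g_i(x)\, E_i(x),
\qquad
g_i(x) = s_i(x) \ \ (\text{optionally renormalized over the selected experts}),
\]

so the router score $s_i(x)$ plays the role of the sampling weight $p_i$: it should be large exactly for the experts whose outputs have large magnitude, without having to run every expert first.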

Spatial Explanation: ~

Load Balancing (Among Experts)

Challenges

  • Training: MoEs enable significantly more compute-efficient pretraining, but they’ve historically struggled to generalize during fine-tuning, leading to overfitting.

  • Inference: Although a MoE might have many parameters, only some of them are used during inference. This leads to much faster inference compared to a dense model with the same number of parameters. However, all parameters need to be loaded in RAM, so memory requirements are high.

How to Train a Router?

A commonly used approach: Aux Loss

where

Why ?

Uniform distribution:

In the sense of balancing, the ideal state is that all experts receive an equal share of tokens.

Aux Loss:

However, is discontinuous and non-differentiable. Therefore, (P) is used as a smooth approximation of :
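
A minimal sketch of one commonly used form of this loss (Switch-Transformer-style, written in PyTorch; names and the coefficient are illustrative assumptions and may differ from the slide's exact formula):

    import torch

    def aux_loss(router_logits, dispatched_idx, num_experts, alpha=0.01):
        # router_logits: [num_tokens, num_experts] raw gating logits
        # dispatched_idx: [num_tokens] (int64) expert each token was dispatched to
        probs = torch.softmax(router_logits, dim=-1)
        # f_i: fraction of tokens dispatched to expert i (discrete, non-differentiable)
        f = torch.bincount(dispatched_idx, minlength=num_experts).float() / dispatched_idx.numel()
        # P_i: mean router probability of expert i (smooth surrogate for f_i)
        P = probs.mean(dim=0)
        # Minimized when both f and P are uniform, i.e. every expert gets 1/num_experts
        return alpha * num_experts * torch.sum(f * P)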

Straight-Through Estimator

Aux Loss:

Gradient:
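
With the form sketched above, the straight-through trick treats the discrete fraction $f_i$ as a constant (stop-gradient), so the gradient flows only through the smooth term:

\[
\mathcal{L}_{\mathrm{aux}} = \alpha\, n \sum_{i=1}^{n} f_i\, P_i,
\qquad
\frac{\partial \mathcal{L}_{\mathrm{aux}}}{\partial P_i} = \alpha\, n\, f_i ,
\]

so experts that currently receive more tokens get a stronger push toward lower routing probability.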

Is a smaller Aux Loss better?

Compare the two formulations:

The two formulations yield the same gradient.

However, equation (2) reaches its minimum when while equation (1) does not.

Other Aux Loss

Begin from the negative of maximum entropy

Straight-Through Estimator:

Gradient:
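
A sketch of this variant (assuming $\bar{P}_i$ denotes the batch-averaged routing probability of expert $i$): the negative entropy

\[
\mathcal{L}_{\mathrm{ent}} = \sum_{i=1}^{n} \bar{P}_i \log \bar{P}_i,
\qquad
\frac{\partial \mathcal{L}_{\mathrm{ent}}}{\partial \bar{P}_i} = \log \bar{P}_i + 1,
\]

is minimized exactly at the uniform distribution $\bar{P}_i = 1/n$.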

Aux Loss

Will the Aux Loss exacerbate MoE Models' instability during training?

Training Instability


Training Instability 2

StableMoE: 2-Stage Training (DeepSeekMoE, DeepSeek-V2)

Stage 1: Learning to route
Stage 2: Router Frozen

StableMoE

Loss-Free Balancing

The auxiliary loss introduces interference gradients, which can disturb the optimization direction of the primary language modeling objective.

An important observation: The balance among experts can be achieved through a single bias term.

This approach allows token routing to remain primarily guided by the gating logits , while serves as a bias controller to maintain expert load balance, achieving load balancing without auxiliary-loss-induced gradient interference.

Loss-Free Balancing

Loss:

STE with :

Gradient:

Loss-Free Balancing

Gradient:

Update:

When is too large, decrease it slightly; when is too small, increase it slightly.

Loss-Free Balancing

An iterative process of token routing and bias updating.
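
A minimal sketch of one such iteration in PyTorch (variable names and the update-step size are assumptions; the key point is that the bias only affects expert selection, never the gate value):

    import torch

    def loss_free_step(scores, bias, k, step=0.001):
        # scores: [num_tokens, num_experts] gating scores s_i
        # bias:   [num_experts] load-balancing bias b_i (routing only, carries no gradient)
        topk_idx = torch.topk(scores + bias, k, dim=-1).indices       # biased selection
        gates = scores.gather(-1, topk_idx)                           # gate values use raw scores
        # Count how many tokens each expert received in this batch.
        load = torch.bincount(topk_idx.reshape(-1), minlength=bias.numel()).float()
        # Overloaded experts: bias decreased; underloaded experts: bias increased.
        bias = bias + step * torch.sign(load.mean() - load)
        return topk_idx, gates, bias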

Loss-Free Balancing: algorithm illustration (figures)

Training Instability

To Pay More Effort to the Difficult Parts

The error

Activating exactly experts per token is not necessarily optimal.

The purpose of sparsity is approximation, not hard selection.

Therefore, we can consider activating a different number of experts for each token.
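
One way to instantiate this (an assumed formulation, not necessarily the one on the following slides): replace the fixed top-$k$ with a per-token threshold rule,

\[
\mathcal{E}(x) = \{\, i : s_i(x) + b_i > \tau \,\},
\]

so that tokens whose approximation error would otherwise be large (difficult tokens) activate more experts than easy ones.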

To Pay More Effort to the Difficult Parts

Routing:

Optimization goals of :

  • Load balancing
  • Budget control

Update rule:

To Control the Total Budget

Update rule:

Each update to has a total sum of zero.
However, the number of activated experts may still exceed the budget.

If we want the average number of activated experts per token to be :

If we want the number of expert activations per token to never exceed :
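
A rough sketch of both budget variants in PyTorch (everything here is an assumption about how such a controller could look, on top of the threshold routing sketched earlier):

    import torch

    def route_with_budget(scores, bias, tau=0.0, k_max=None):
        # Activate every expert whose biased score clears the threshold.
        mask = (scores + bias) > tau                                 # [num_tokens, num_experts]
        if k_max is not None:
            # Hard cap: never activate more than k_max experts per token.
            capped = torch.topk(scores + bias, k_max, dim=-1).indices
            allow = torch.zeros_like(mask).scatter_(1, capped, True)
            mask = mask & allow
        return mask

    def budget_update(mask, bias, k_avg, step=0.001):
        # Average-budget control: shift every bias by the same amount so the mean
        # number of activated experts per token tracks k_avg (this shift is separate
        # from the zero-sum balancing update).
        actual = mask.float().sum(dim=-1).mean()
        return bias + step * (k_avg - actual)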

Shared Experts


Residual perspective:
To treat shared experts as residual components.
DeepSeek: To ensure that each routed expert focuses on distinct aspects.

Geometric perspective:
Previous assumption: all routed experts are orthogonal.
The residuals more easily satisfy the orthogonality assumption.
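
In formula form (DeepSeekMoE-style layout, with assumed notation $N_s$ shared experts):

\[
h = x + \sum_{s=1}^{N_s} E^{\mathrm{shared}}_s(x) + \sum_{i \in \mathrm{TopK}} g_i(x)\, E_i(x),
\]

the shared experts absorb the common component of every token, and the routed experts only need to model the residual on top of it, which makes the near-orthogonality assumption among routed experts more plausible.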

Stable RL Training of MoE Models

Routing fluctuations: Expert-selection instability amplifies importance-sampling ratio variance, causing frequent clipping and degrading training stability.

Variance mismatch: Using token-level importance ratios to correct sequence-level advantages introduces systematic variance misalignment, which is magnified in MoE and further destabilizes GRPO training.

Routing Replay

For rollout-update discrepancies.

To cache the activated experts in and "replay" these routing patterns in when computing the importance ratios

For each token , both and share the same activated expert network, thereby restoring the stability of token-level importance ratios and ensuring consistent optimization within the same activated subnetwork across gradient updates.
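
A sketch of how the replay can be wired into an MoE forward pass (PyTorch-style; function and cache names are assumptions). During the update step, the cache recorded under the old policy is passed back in, so both policies are evaluated on the same subnetwork:

    import torch

    def moe_forward(x, router, experts, k, routing_cache=None, layer_id=0):
        scores = router(x)                                       # [num_tokens, num_experts]
        if routing_cache is not None and layer_id in routing_cache:
            topk_idx = routing_cache[layer_id]                   # replay cached routing
        else:
            topk_idx = scores.topk(k, dim=-1).indices            # fresh top-k routing
            if routing_cache is not None:
                routing_cache[layer_id] = topk_idx               # record for later replay
        gates = torch.softmax(scores.gather(-1, topk_idx), dim=-1)   # weights of chosen experts
        # Dense-but-simple combine: run every expert, keep only the selected ones.
        all_out = torch.stack([e(x) for e in experts], dim=1)        # [num_tokens, n, d]
        mask = torch.zeros_like(scores).scatter_(1, topk_idx, gates) # [num_tokens, n]
        return torch.einsum('tn,tnd->td', mask, all_out)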

Routing Replay


Reusing routing modes incurs additional memory and communication overhead, and may also constrain the effective capacity of the MoE model.

Rollout Routing Replay


Inconsistency between rollout and old policy?

Inconsistency between inference engine and training engine

Category | Inference Engine (SGLang) | Training Engine (Megatron) | Why Router Differs
(A) Kernel & Computation | Fused kernels; paged attention; per-token execution | TP/EP kernels; GEMM-based attention; batch execution | Tiny Q/K/V numeric shifts → top-k expert changes
(B) KV-Cache Behavior | Block-based PagedAttention KV cache; dynamic scheduling | Dense KV cache; strict teacher forcing | Different hidden states → different router logits
(C) Parallelism (TP/EP) | Often merged weights or light TP | Full Tensor Parallel + Expert Parallel + All-to-All | Weight sharding/layout mismatch → logit drift
(D) Non-determinism | Deterministic inference; dropout disabled | Dropout enabled; CUDA reduction order nondeterminism; repeated forwards drift | Forward instability → inconsistent routing

Training Inference Discrepancy

Rollout Routing Replay

Training-inference inconsistency.

Rollout Routing Replay

Review: GSPO (Group Sequence Policy Optimization)

where

GSPO applies clipping to entire responses instead of individual tokens to exclude the overly "off-policy" samples from gradient estimation.
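
A sketch of the sequence-level ratio and objective, roughly as defined in the GSPO paper:

\[
s_i(\theta) = \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)}\right)^{1/|y_i|}
= \exp\!\left(\frac{1}{|y_i|}\sum_{t=1}^{|y_i|} \log \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})}\right),
\]

\[
\mathcal{J}_{\mathrm{GSPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\Bigl(s_i(\theta)\,\hat{A}_i,\ \mathrm{clip}\bigl(s_i(\theta),\,1-\varepsilon,\,1+\varepsilon\bigr)\,\hat{A}_i\Bigr)\right].
\]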

Review: GMPO (Geometric-Mean Policy Optimization)

GMPO has a narrower value range than GRPO:

The training process of GMPO experiences lower variance in the optimization objective.
More stable policy updates. Less sensitive to outliers.
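
A sketch of the contrast (clipping omitted; token ratios $r_{i,t}$ as usual): GRPO averages per-token terms arithmetically, while GMPO weights the advantage by the geometric mean of the token ratios,

\[
\Bigl(\prod_{t=1}^{|y_i|} r_{i,t}(\theta)\Bigr)^{1/|y_i|} \hat{A}_i,
\qquad
r_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})},
\]

so a single extreme token ratio moves the objective much less than it would under an arithmetic mean.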

RSPO: Router-Shift Policy Optimization

Routing Replay constrains router updates and incurs significant memory and communication overhead.

GSPO does not fundamentally resolve routing distribution drift, and its sequence-level
clipping can over-prune tokens, potentially discarding useful gradient information.

Instead of fully constraining the router, RSPO introduces a router shift ratio, computed from router scores between the current and old policies.

This ratio quantifies the degree of routing deviation for each token and is used to softly rescale IS weights.

RSPO: Router-Shift Policy Optimization

: router shift ratio

: routing score of expert at layer .

Averaging the scores of the top-K experts that were activated under the old policy.

RSPO: Router-Shift Policy Optimization

The larger the routing change is, the smaller the value of becomes.

When is small, the gradient contribution of that token is reduced.

When is large, the gradient contribution of that token is amplified.

If is too small, the gradient of that token will be eliminated.
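
A sketch of one plausible instantiation of the router-shift ratio and the soft rescaling (the exact formula here is an assumption; the intent follows the description above):

    import torch

    def router_shift_ratio(scores_new, scores_old, topk_idx_old, floor=0.1):
        # scores_*:     [num_layers, num_tokens, num_experts] routing scores
        # topk_idx_old: [num_layers, num_tokens, K] experts activated by the old policy
        new_mass = scores_new.gather(-1, topk_idx_old)    # current scores on the old top-K experts
        old_mass = scores_old.gather(-1, topk_idx_old)    # old scores on the same experts
        ratio = (new_mass / old_mass.clamp_min(1e-8)).mean(dim=(0, 2))   # [num_tokens]
        # Tokens whose routing has shifted too far contribute no gradient at all.
        ratio = torch.where(ratio < floor, torch.zeros_like(ratio), ratio)
        return ratio

    # Soft rescaling of the per-token importance weight (instead of replaying the router):
    #   weight_t = token_is_ratio_t * router_shift_ratio_t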

RSPO: Router-Shift Policy Optimization

RSPO

Some Tricks in DeepSeek

DeepSeek-V2

  • Device-Limited Routing
  • Auxiliary Loss (Experts & Devices)
  • Token-Dropping

Device-Limited Routing

For each token, its MoE-related communication frequency is proportional to the number of devices covered by its target experts.

The number of activated experts can be large.

To additionally ensure that the target experts of each token will be distributed on at most devices.

Findings: when , device-limited routing achieves performance roughly on par with unrestricted top-K routing.
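
A sketch of device-limited top-K in PyTorch (the grouping-by-highest-affinity rule follows DeepSeek-V2's description; tensor layout and tie-breaking are assumptions):

    import torch

    def device_limited_topk(scores, expert_to_device, k, max_devices):
        # scores: [num_tokens, num_experts]; expert_to_device: [num_experts] device id per expert
        num_tokens, _ = scores.shape
        num_devices = int(expert_to_device.max()) + 1
        # Per token, score each device by its best expert affinity.
        device_scores = torch.empty(num_tokens, num_devices)
        for d in range(num_devices):
            device_scores[:, d] = scores[:, expert_to_device == d].max(dim=-1).values
        kept = device_scores.topk(max_devices, dim=-1).indices          # [num_tokens, M]
        # Mask out experts living on devices that were not kept, then take top-K.
        on_kept = (expert_to_device.view(1, -1, 1) == kept.unsqueeze(1)).any(-1)
        masked = scores.masked_fill(~on_kept, float('-inf'))
        return masked.topk(k, dim=-1).indices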

Token-Dropping

First to compute the average computational budget for each device.

To drop tokens with the lowest affinity scores on each device until reaching the computational budget.

To ensure that the tokens belonging to approximately 10% of the training sequences will never be dropped.

Device-Level Balance Loss

where

Communication Balance Loss

where

Ensuring that each device transmits at most tokens to other devices.

Simultaneously, the communication balance loss is employed to encourage each device to receive around tokens from other devices.
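
A sketch of both losses along the lines of the DeepSeek-V2 description ($\mathcal{E}_i$ is the set of experts on device $i$, $D$ devices, $M$ the device limit, $T$ tokens; coefficients and exact normalization are assumptions):

\[
\mathcal{L}_{\mathrm{DevBal}} = \alpha_2 \sum_{i=1}^{D} f'_i\, P'_i,
\qquad
f'_i = \frac{1}{|\mathcal{E}_i|}\sum_{j \in \mathcal{E}_i} f_j,
\qquad
P'_i = \sum_{j \in \mathcal{E}_i} P_j,
\]

\[
\mathcal{L}_{\mathrm{CommBal}} = \alpha_3 \sum_{i=1}^{D} f''_i\, P''_i,
\qquad
f''_i = \frac{D}{M T}\sum_{t=1}^{T} \mathbb{1}(\text{token } t \text{ is sent to device } i),
\qquad
P''_i = \sum_{j \in \mathcal{E}_i} P_j.
\]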

Softmax?

In DeepSeek-V2, softmax is applied among all the top-k weights.

In DeepSeek-V3, softmax is not applied; normalization is applied among the selected top-k weights.

Reasons:

  • Softmax routing is numerically unstable at scale.
  • Softmax exacerbates load imbalance.
  • Softmax makes routing sensitive to backend differences.
  • Removing softmax enables deterministic, parallel-friendly gating.
  • The routing-shift bias b_i integrates more effectively with sigmoid-based gating than with normalized probabilities.
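
To make the contrast concrete, a sketch of the two gating schemes as usually reported ($u_t$ is the token hidden state, $e_i$ the expert centroid; notation assumed):

\[
\text{V2:}\quad s_{i,t} = \mathrm{Softmax}_i(u_t^{\top} e_i), \qquad g_{i,t} = s_{i,t}\ \text{for } i \in \mathrm{TopK};
\]
\[
\text{V3:}\quad s_{i,t} = \sigma(u_t^{\top} e_i), \qquad g_{i,t} = \frac{s_{i,t}}{\sum_{j \in \mathrm{TopK}} s_{j,t}}\ \text{for } i \in \mathrm{TopK},
\]

with the loss-free balancing bias $b_i$ added to $s_{i,t}$ only when selecting the top-K experts.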

DeepSeek-V3

  • Node-Limited Routing
  • Complementary Sequence-Wise Aux Loss

Node-Limited Routing

To ensure that each token will be sent to at most nodes, which are selected according to the sum of the highest / affinity scores of the experts distributed on each node.

Complementary Sequence-Wise Aux Loss

where
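
A sketch of the usual sequence-wise form (following DeepSeek-V3's description; notation assumed: a sequence of $T$ tokens, $N_r$ routed experts, $K_r$ activated per token, normalized scores $s'_{i,t}$, and a very small coefficient $\alpha$):

\[
\mathcal{L}_{\mathrm{bal}} = \alpha \sum_{i=1}^{N_r} f_i\, P_i,
\qquad
f_i = \frac{N_r}{K_r T} \sum_{t=1}^{T} \mathbb{1}\bigl(i \in \mathrm{TopK}_t\bigr),
\qquad
P_i = \frac{1}{T}\sum_{t=1}^{T} s'_{i,t},
\]

computed within each sequence, so it only discourages extreme imbalance inside a single sequence rather than enforcing a global constraint.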

Towards Balancing the Experts

Yuxuan Wang
2025-11-21