GSPO replaces token-level RL clipping with sequence-level optimization, fixing MoE training collapse

The Qwen team at Alibaba has published Group Sequence Policy Optimization (GSPO), a new reinforcement learning algorithm designed to replace GRPO as the training backbone for large language models. The post describes GRPO as exhibiting “severe instability issues during long training” that lead to “irreversible model collapse,” preventing further performance gains with increased compute. GSPO was developed to address this directly and is credited as the training algorithm behind the latest Qwen3 models, including Instruct, Coder, and Thinking variants.

The core distinction is where the importance ratio is computed. GRPO computes importance ratios at the token level; GSPO computes them at the sequence level. According to the post, this change produces benefits in stability, efficiency, and infrastructure simplicity.

The sequence-level optimization objective

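GSPO defines its importance ratio as the sequence likelihood ratio between the current and old policy, raised to the power of one over the response length. This is the geometric mean of the per-token probability ratios, or equivalently the exponential of the mean per-token log-probability ratio. The length normalization reduces variance and keeps the ratio in a comparable numerical range across responses of different lengths.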
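The definition can be sketched in a few lines of Python. This is a minimal illustration rather than the Qwen team's implementation: it assumes the per-token log-probabilities of one sampled response under both policies are already available as plain lists, and the function names are hypothetical.

```python
import math

def sequence_importance_ratio(logp_new, logp_old):
    """Length-normalized sequence-level importance ratio.

    logp_new / logp_old: per-token log-probabilities of the same sampled
    response under the current and the old policy (hypothetical inputs).
    Returns exp(mean(logp_new - logp_old)), i.e. the sequence likelihood
    ratio raised to the power 1/length, which is the geometric mean of the
    per-token probability ratios.
    """
    if len(logp_new) != len(logp_old) or not logp_new:
        raise ValueError("expected two non-empty lists of equal length")
    mean_log_ratio = sum(n - o for n, o in zip(logp_new, logp_old)) / len(logp_new)
    return math.exp(mean_log_ratio)

def token_importance_ratios(logp_new, logp_old):
    """Per-token ratios, the quantity a token-level objective works with."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
```

For a two-token response whose token ratios are 2 and 0.5, the sequence-level ratio is their geometric mean, 1: the two token-level swings cancel instead of each being weighed and clipped separately.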

The objective clips this sequence-level ratio within a fixed range, so clipping is applied once per response rather than once per token, and a clipped response is excluded from the gradient as a whole. The post reports an empirical observation: the fraction of tokens clipped in GSPO is “two orders of magnitude higher” than in GRPO, yet GSPO still achieves higher training efficiency. The team interprets this as evidence that “GRPO’s token-level optimization objective is noisy and inefficient, while GSPO’s sequence-level approach provides a more reliable and effective learning signal.”
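A sketch of the clipped sequence-level surrogate for one group of responses to the same prompt follows. It assumes group-normalized advantages (reward minus group mean, divided by group standard deviation) and takes the clipping range as explicit arguments, since the article does not state the values used. The function names and the choice of population standard deviation are illustrative assumptions.

```python
import statistics

def group_advantages(rewards):
    """Group-normalized advantages: (r - mean) / std over one prompt's responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std, an assumption of this sketch
    if std == 0.0:
        return [0.0 for _ in rewards]  # identical rewards carry no learning signal
    return [(r - mean) / std for r in rewards]

def clipped_sequence_surrogate(seq_ratios, rewards, eps_low, eps_high):
    """Clipped sequence-level surrogate, averaged over a group of responses.

    seq_ratios: one length-normalized importance ratio per response.
    eps_low / eps_high: clipping range (illustrative parameters).
    Returns (objective value, per-response flags marking active clipping).
    """
    advantages = group_advantages(rewards)
    total = 0.0
    clipped = []
    for s, a in zip(seq_ratios, advantages):
        s_clipped = min(max(s, 1.0 - eps_low), 1.0 + eps_high)
        unclipped_term = s * a
        clipped_term = s_clipped * a
        total += min(unclipped_term, clipped_term)
        # Clipping is active only when the clipped term is strictly smaller;
        # in that case the response contributes no gradient.
        clipped.append(clipped_term < unclipped_term)
    return total / len(seq_ratios), clipped
```

Because the flag is set per response, every token of a clipped response is dropped together. That is one plausible reading of why the clipped-token fraction is so much higher than under per-token clipping.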

Experiments used a cold-start model fine-tuned from Qwen3-30B-A3B-Base. Performance curves on AIME’24, LiveCodeBench, and CodeForces are shown against GRPO as the baseline. The post reports that GSPO “demonstrates significantly higher training efficiency than GRPO, achieving better performance under the same training cost.”

Fixing Mixture-of-Experts RL

The post identifies a specific failure mode that blocked GRPO from training large MoE models: expert activation volatility. In MoE architectures, different experts are activated for different inputs, and the experts chosen for the same response can change after a gradient update, so routing differs between the old and current policy. Token-level importance ratios then fluctuate sharply, and the post says this noise prevents training from converging.

The Qwen team’s previous workaround was a technique called Routing Replay: caching the expert activations from the old policy and replaying them during importance ratio computation for the current policy. While functional, this required “additional memory and communication overhead and may limit the actual capacity of MoE models,” according to the post.

GSPO eliminates the need for Routing Replay entirely. Because GSPO only uses sequence-level likelihoods rather than token-level ones, it is insensitive to the specific routing patterns of individual tokens. The post describes this as “completely eliminating the dependency on Routing Replay,” which both simplifies training infrastructure and allows MoE models to use their full routing capacity.
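The averaging effect behind this claim can be illustrated with a toy experiment. The sketch below does not simulate an MoE model. It assumes, purely for illustration, that routing changes appear as independent zero-mean Gaussian noise on each token's log-probability ratio, and compares the worst token-level ratio with the length-normalized sequence-level ratio. The function name and parameters are hypothetical.

```python
import math
import random

def simulate_ratio_noise(num_tokens, sigma, seed=0):
    """Toy model: each per-token log-ratio is zero-mean Gaussian noise (an assumption).

    Returns (max |token ratio - 1|, |sequence ratio - 1|), where the sequence
    ratio is the length-normalized geometric mean of the token ratios.
    """
    rng = random.Random(seed)
    log_ratios = [rng.gauss(0.0, sigma) for _ in range(num_tokens)]
    token_dev = max(abs(math.exp(x) - 1.0) for x in log_ratios)
    seq_ratio = math.exp(sum(log_ratios) / num_tokens)
    return token_dev, abs(seq_ratio - 1.0)

token_dev, seq_dev = simulate_ratio_noise(num_tokens=512, sigma=0.3)
print(f"largest token-level deviation from 1: {token_dev:.3f}")
print(f"sequence-level deviation from 1:      {seq_dev:.3f}")
```

Since the mean log-ratio always lies between the smallest and largest per-token log-ratio, the sequence-level ratio can never deviate from 1 by more than the worst token-level ratio. Under this noise model it deviates far less, because independent fluctuations average out over the response.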

Infrastructure implications

The sequence-level design has a secondary infrastructure benefit. Because of precision discrepancies between training engines and inference engines, token-level methods typically recompute the old policy’s likelihoods with the training engine rather than trusting the values the inference engine produced during rollout. GSPO’s sequence-level objective is less sensitive to such per-token discrepancies, so it can potentially consume the likelihoods returned by the inference engine directly and skip that recomputation step.

The post highlights this as “particularly beneficial in scenarios such as partial rollout, multi-turn RL, and training-inference disaggregated frameworks.” The team also describes GSPO as “fundamentally more tolerant to precision discrepancies.”

Connection to Qwen3

The post states that GSPO was “successfully applied to the large-scale RL training of the latest Qwen3 models,” and attributes part of those models’ reasoning improvements to the algorithm. The paper is published as arXiv:2507.18071, authored by twelve researchers at Alibaba.