GSPO replaces token-level RL clipping with sequence-level optimization, fixing MoE training collapse
Qwen introduces Group Sequence Policy Optimization (GSPO), a new RL algorithm that computes importance ratios and applies clipping per sequence rather than per token, removing the instability and infrastructure overhead that blocked GRPO from scaling to large Mixture-of-Experts language models.
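To make the token-level versus sequence-level distinction concrete, here is a minimal sketch of a sequence-level clipped objective in plain Python. It assumes the importance ratio is the length-normalized sequence likelihood ratio, exp(mean of per-token log-prob differences), and that advantages are rewards normalized within a group of responses. The function names, the `eps` value, and the use of population standard deviation are illustrative assumptions, not Qwen's implementation.

```python
import math

def sequence_ratio(logp_new, logp_old):
    """Length-normalized sequence importance ratio (assumed form):
    exp(mean over tokens of logp_new - logp_old)."""
    assert logp_new and len(logp_new) == len(logp_old)
    mean_diff = sum(n - o for n, o in zip(logp_new, logp_old)) / len(logp_new)
    return math.exp(mean_diff)

def group_advantages(rewards):
    """Normalize rewards within one group of sampled responses.
    Population std and the 1e-8 guard are assumptions of this sketch."""
    g = len(rewards)
    mean_r = sum(rewards) / g
    std_r = math.sqrt(sum((r - mean_r) ** 2 for r in rewards) / g)
    return [(r - mean_r) / (std_r + 1e-8) for r in rewards]

def sequence_clipped_objective(logps_new, logps_old, rewards, eps=0.2):
    """Average clipped surrogate over a group, with one ratio per sequence.
    eps is a hypothetical clipping range chosen for illustration."""
    advantages = group_advantages(rewards)
    total = 0.0
    for lp_new, lp_old, adv in zip(logps_new, logps_old, advantages):
        s = sequence_ratio(lp_new, lp_old)
        s_clipped = min(max(s, 1.0 - eps), 1.0 + eps)
        # The whole response is clipped (or not) as a unit,
        # instead of clipping each token's ratio separately.
        total += min(s * adv, s_clipped * adv)
    return total / len(rewards)
```

Because a single ratio governs each response, one outlier token cannot by itself push a response outside the clipping range, which is the intuition behind moving the clipping decision from tokens to sequences.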