Higher-order optimization algorithms have been used in neural network training for over a decade, but recent LLM results have renewed interest. The Muon optimizer — short for MomentUm Orthogonalized by Newton-Schulz — was used to train Kimi K2 and GLM-5, two of today’s more capable open-source models. Now NVIDIA’s developer blog details how the company has built comprehensive infrastructure to run Muon and similar emerging optimizers at large scale inside Megatron Core.
The core claim: with the right engineering, Muon’s training throughput on GB300 hardware is within a small margin of AdamW, not dramatically worse.
What the benchmarks show
NVIDIA ran throughput comparisons on Kimi K2 and Qwen3 30B-A3B on the GB300 NVL72 system. For Kimi K2, they used 256 GB300 GPUs in a PP4DP64EP64 configuration; for Qwen3 30B-A3B, eight GPUs with DP8EP8. The measurements were made with NeMo Megatron Bridge 26.02, a PyTorch-native library that provides pretraining, SFT, and LoRA for LLMs and VLMs.
The results show only a small training throughput loss for Muon compared to AdamW. Model FLOPs utilization is reported as higher with Muon when the FLOPs from the Newton-Schulz matrix multiplications, which make up the optimizer's orthogonalization step, are counted.
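The MFU comparison hinges on that accounting choice: MFU divides the FLOPs performed per second by the hardware's peak FLOPs per second, so adding the optimizer's Newton-Schulz matmuls to the numerator while step time barely changes raises the reported figure. A toy calculation with made-up numbers (not NVIDIA's measurements) illustrates the effect:

```python
# Illustrative only: all numbers below are hypothetical, not benchmark results.
fwd_bwd_flops = 6.0e15        # model forward/backward FLOPs per step (assumed)
optimizer_flops = 0.2e15      # Newton-Schulz matmul FLOPs per step (assumed)
step_time_s = 10.0            # wall-clock seconds per step (assumed)
peak_flops_per_s = 1.0e15     # aggregate peak throughput of the GPUs (assumed)

mfu_excluding_opt = fwd_bwd_flops / (step_time_s * peak_flops_per_s)
mfu_including_opt = (fwd_bwd_flops + optimizer_flops) / (step_time_s * peak_flops_per_s)
print(mfu_excluding_opt, mfu_including_opt)   # ~0.60 vs ~0.62
```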
Why scaling Muon is hard
Muon’s orthogonalization step — using Newton-Schulz iteration or eigen decomposition — is what makes it different from element-wise optimizers like AdamW. That step is also what creates scaling challenges.
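For concreteness, here is a minimal sketch of the Newton-Schulz orthogonalization commonly used for Muon, with the coefficients from the public Muon reference implementation; it is not the Megatron Core code, which adds the mixed-precision and distribution machinery discussed below, and the helper name is just for illustration.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D gradient/momentum matrix G.

    Sketch only: coefficients follow the public Muon reference implementation.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)        # Frobenius norm bounds the spectral norm by 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                   # work in the wide orientation so X @ X.T stays small
        X = X.T
    for _ in range(steps):
        A = X @ X.T                  # symmetric product (a SYRK candidate)
        B = b * A + c * (A @ A)      # A @ A is also symmetric
        X = a * X + B @ X            # general matmul
    if transposed:
        X = X.T
    return X.to(G.dtype)
```

The property that matters for everything below is that the iteration operates on the full gradient or momentum matrix of a layer, not on individual elements.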
The preconditioning step adds both compute and memory overhead. Mixed-precision training and gradient accumulation can introduce numerical instability at lower precisions. And distributing orthogonalized updates across thousands of GPUs can create communication bottlenecks that erode the efficiency gains. The post describes NVIDIA's approach as balanced across generality, throughput, and implementation complexity, with the intent that the same infrastructure support Muon, SOAP, and other complex optimizers rather than being tuned narrowly for Muon.
The key infrastructure innovation is a layer-wise distributed optimizer. Traditional element-wise distributed optimizers work for AdamW because they can partition optimizer states and gradients evenly across GPUs. Muon cannot use this approach: it requires gradients for an entire layer to compute the weight update for that layer. If weights and optimizer states are sliced across data-parallel ranks, each GPU has only a shard and cannot independently compute the preconditioner.
The layer-wise optimizer distributes entire layers to specific data-parallel ranks. Each GPU owns full layers, which means it has everything needed for the Newton-Schulz computation. The tradeoff is variable-size communication: because whole layers vary in size, the all-gather operations use all_gatherv rather than fixed-size all_gather. This layer-wise distributed optimizer is now integrated into Megatron Core and available in the open-source codebase at layer_wise_optimizer.py.
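A rough sketch of the idea, not the actual layer_wise_optimizer.py code: ownership is assigned per whole parameter tensor, each data-parallel rank runs the Muon update only on the tensors it owns, and the updated weights are then redistributed. The helper names are hypothetical, the example assumes data parallelism is the only parallelism in play, and a per-tensor broadcast from the owner stands in for a true variable-size all-gather.

```python
import torch.distributed as dist

def assign_layers_round_robin(named_params, world_size):
    """Map each whole parameter tensor to an owning data-parallel rank.
    (The real optimizer balances by layer size; plain round-robin shown here.)"""
    return {name: i % world_size for i, (name, _) in enumerate(named_params)}

def layer_wise_step(model, muon_update_fn, owners, rank):
    """Each rank updates only the layers it owns, so it holds the full gradient
    and momentum needed for Newton-Schulz, then redistributes the new weights."""
    for name, p in model.named_parameters():
        if owners[name] == rank:
            muon_update_fn(p)          # full-matrix Muon update, computed locally
    for name, p in model.named_parameters():
        # Stand-in for the all_gatherv described in the post: every rank
        # receives the updated tensor from its owner.
        dist.broadcast(p.data, src=owners[name])
```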
Handling tensor parallelism
Tensor parallelism (TP) splits individual weight matrices across multiple GPUs. This creates a specific problem for Muon: the momentum buffer for a weight matrix is sharded, but the Newton-Schulz step needs the full matrix to compute the orthogonalization.
NVIDIA implemented TensorParallelMuon with three modes. Duplicated mode all-gathers momentum buffers across the TP domain so each GPU can run the full Newton-Schulz iteration; one all-gather per update regardless of iteration count, trading compute redundancy for fewer communication rounds. Distributed mode spreads the Newton-Schulz computation across GPUs, with an all-reduce after the first matrix multiplication of each iteration — more communication, less redundant compute. Blockwise mode skips cross-GPU communication entirely by doing orthogonalization only on the local shard; computationally cheapest but mathematically different from full-matrix orthogonalization.
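To make the communication pattern of duplicated mode concrete, here is a hedged sketch that reuses the newton_schulz_orthogonalize helper from the earlier example. It assumes the weight is sharded evenly along its first dimension; the function name is hypothetical, not NVIDIA's API.

```python
import torch
import torch.distributed as dist

def duplicated_mode_update(momentum_shard: torch.Tensor,
                           tp_group, tp_rank: int, tp_size: int,
                           ns_steps: int = 5) -> torch.Tensor:
    """Duplicated-mode sketch: one all-gather rebuilds the full momentum matrix
    on every tensor-parallel rank, every rank runs Newton-Schulz redundantly,
    and each keeps only its own slice of the orthogonalized update."""
    gathered = [torch.empty_like(momentum_shard) for _ in range(tp_size)]
    dist.all_gather(gathered, momentum_shard, group=tp_group)  # one collective per update
    full_momentum = torch.cat(gathered, dim=0)

    update = newton_schulz_orthogonalize(full_momentum, steps=ns_steps)

    rows = momentum_shard.shape[0]                 # slice back to the local shard
    return update[tp_rank * rows:(tp_rank + 1) * rows]
```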
Additional optimizations described in the post include communication hiding (delaying parameter all-gathers to overlap with the next forward pass), round-robin load balancing across layer sizes, and a SYRK optimization that maps two of the three matrix multiplications in a Newton-Schulz iteration to symmetric rank-K updates, saving roughly half the floating-point operations in those steps.
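The SYRK saving is visible in the Newton-Schulz sketch above: X @ X.T and A @ A both produce symmetric results, so a symmetric rank-K kernel only needs to compute one triangle, while B @ X remains a general matmul. A rough FLOP count, illustrative rather than a kernel-level cost model:

```python
def ns_iteration_flops(n, m, use_syrk=False):
    """Rough FLOPs for one Newton-Schulz iteration on an n x m matrix (n <= m),
    following the sketch above."""
    gemm = lambda a, b, c: 2 * a * b * c               # general matmul
    syrk = lambda a, b: a * a * b                      # symmetric output: ~half a GEMM
    xxt = syrk(n, m) if use_syrk else gemm(n, m, n)    # A = X @ X.T
    aa  = syrk(n, n) if use_syrk else gemm(n, n, n)    # A @ A (A is symmetric)
    bx  = gemm(n, n, m)                                # B @ X stays a general matmul
    return xxt + aa + bx

print(ns_iteration_flops(4096, 4096) / ns_iteration_flops(4096, 4096, use_syrk=True))  # ~1.5
```

For a square matrix this drops the per-iteration cost from roughly 6n^3 to 4n^3 FLOPs, which is about half the work in the two symmetric products, consistent with the post's description.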
What this means in practice
The engineering here addresses a real bottleneck. Muon was used in production to train competitive models, but doing that without purpose-built distributed optimizer infrastructure meant accepting communication overhead or falling back to simpler configurations. NVIDIA’s integration into Megatron Core makes the tooling available to anyone running NeMo-based training at scale.
The benchmark results do not show Muon outperforming AdamW on throughput — they show it is close enough that its potential training efficiency advantages are not offset by optimizer overhead. Whether the optimization trajectory of Muon-trained models justifies the engineering complexity is a separate question, one the production use in Kimi K2 and GLM-5 has already started answering.