End-to-end FP8 in RL training: NeMo RL achieves 48% speedup over BF16 baseline

NVIDIA has published a post on NeMo RL describing how applying FP8 precision end-to-end across both the generation and training phases of reinforcement learning delivers throughput gains while maintaining accuracy. Applying FP8 only to one phase, the post explains, creates a numerical disagreement problem that the end-to-end approach is designed to resolve.

According to the post, end-to-end FP8 on linear layers achieves a consistent 15–25% throughput improvement over BF16. With FP8 extended to KV cache and attention, the total speedup reaches approximately 48% over the BF16 baseline.

The numerical disagreement problem

NVIDIA NeMo RL uses vLLM for rollouts and Megatron Core for training. The two engines use distinct CUDA kernels, which introduces numerical differences between them. The post quantifies this as the token multiplicative probability error: the mean, over tokens, of the exponentiated absolute difference between the log-probabilities produced by the training and inference frameworks. A value of 1.0 indicates perfect agreement, and the post states that values below roughly 1.03 to 1.05 are acceptable in practice.
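The metric as described can be sketched in a few lines of Python. This is an illustration of the definition given above, not NeMo RL's implementation; the function name and the sample log-probabilities are hypothetical.

```python
import math

def token_mult_prob_error(train_logprobs, gen_logprobs):
    """Mean of exp(|logp_train - logp_gen|) over tokens; 1.0 means exact agreement."""
    if len(train_logprobs) != len(gen_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    if not train_logprobs:
        raise ValueError("need at least one token")
    total = sum(math.exp(abs(t - g)) for t, g in zip(train_logprobs, gen_logprobs))
    return total / len(train_logprobs)

# Hypothetical per-token log-probabilities from the two engines.
train = [-0.51, -1.20, -0.05, -2.30]
gen = [-0.50, -1.22, -0.05, -2.27]
error = token_mult_prob_error(train, gen)
print(f"token multiplicative probability error: {error:.4f}")
```

Because each term is the exponential of a non-negative number, the metric can never fall below 1.0, which is why 1.0 is the perfect score.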

The problem with applying FP8 only to generation while keeping training in BF16 is that quantization in the generation engine creates a distribution mismatch the training engine does not see. The post reports that for this mixed configuration, importance sampling can narrow the accuracy gap but cannot close it. With FP8 applied to both generation and training, importance sampling completely closes the remaining gap from the BF16 baseline. The post describes the symmetric use of FP8 across both engines as what makes the mismatch correctable.
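The post's own correction is not reproduced here, but the general idea of importance sampling between two engines can be sketched as follows. The sketch assumes a per-token weight equal to the ratio of training-engine to generation-engine probabilities, truncated at a cap; the cap value, the function names, and the simplified loss are all hypothetical and stand in for the full GRPO objective.

```python
import math

def truncated_importance_weights(train_logprobs, gen_logprobs, cap=2.0):
    """Per-token ratio pi_train / pi_gen, truncated at `cap` (assumed value)."""
    return [min(math.exp(t - g), cap) for t, g in zip(train_logprobs, gen_logprobs)]

def reweighted_loss(train_logprobs, gen_logprobs, advantages, cap=2.0):
    """Simplified policy-gradient surrogate with importance weights.

    Each token's term is scaled by how much more (or less) likely the
    training engine finds it than the generation engine did.
    """
    weights = truncated_importance_weights(train_logprobs, gen_logprobs, cap)
    terms = [-w * a * t for w, a, t in zip(weights, advantages, train_logprobs)]
    return sum(terms) / len(terms)

# Hypothetical values for three sampled tokens.
train_lp = [-0.40, -1.10, -2.00]
gen_lp = [-0.42, -1.05, -2.00]
adv = [0.5, -0.2, 1.0]
print(truncated_importance_weights(train_lp, gen_lp))
print(reweighted_loss(train_lp, gen_lp, adv))
```

When the two engines agree exactly, every weight is 1.0 and the correction has no effect; the weights only depart from 1.0 where the engines disagree.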

Results on dense and MoE models

On Llama 3.1 8B Instruct trained with GRPO on a math dataset for 4,000 steps, the FP8 recipe achieves 15–25% throughput improvement over BF16 with matching accuracy after importance sampling. The post reports a BF16 baseline validation accuracy of 0.616 and an end-to-end FP8 result of 0.613 after importance sampling.

The post explains why the speedup falls short of the theoretical 2x: FP8 provides 2x peak throughput only for linear layers, while attention, normalization, non-linear functions, and output projections remain in BF16. Additional quantization kernels inserted before linear layers also add overhead. The post notes that fusing those kernels in vLLM could push the speedup to approximately 1.25x for linear layers alone.
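The reasoning is an instance of Amdahl's law, and a small calculation makes it concrete. In the sketch below, the fraction of step time spent in linear layers and the quantization overhead are hypothetical inputs chosen for illustration; they are not figures from the post.

```python
def end_to_end_speedup(linear_fraction, linear_speedup=2.0, overhead_fraction=0.0):
    """Amdahl-style estimate of overall speedup.

    linear_fraction: share of baseline time spent in linear layers (0 to 1).
    linear_speedup: speedup applied to that share only (2.0 for peak FP8).
    overhead_fraction: extra time, as a share of baseline time, added by
        quantization kernels.
    """
    if not 0.0 <= linear_fraction <= 1.0:
        raise ValueError("linear_fraction must be between 0 and 1")
    if linear_speedup <= 0.0 or overhead_fraction < 0.0:
        raise ValueError("linear_speedup must be positive and overhead non-negative")
    new_time = (1.0 - linear_fraction) + linear_fraction / linear_speedup + overhead_fraction
    return 1.0 / new_time

# Hypothetical profile: 60% of time in linear layers, with and without overhead.
for overhead in (0.0, 0.05):
    s = end_to_end_speedup(0.6, linear_speedup=2.0, overhead_fraction=overhead)
    print(f"overhead={overhead:.2f} -> overall speedup {s:.2f}x")
```

The overall figure reaches 2x only if all of the time is spent in linear layers and quantization is free, which is why measured gains sit well below the peak.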

For Qwen3-30B, a mixture-of-experts model trained for 600 steps, experiments show matching accuracy curves for FP8 and BF16. The post notes that the speed gain for the MoE case is still being investigated.

FP8 for KV cache and attention

KV cache growth and attention computation can dominate rollout time in RL workflows with long output sequences. The post describes extending FP8 to KV cache and attention operations using per-tensor scaling.

The challenge specific to RL is that policy weights change at every training step, requiring quantization scales to track current policy state. NeMo RL addresses this by recalibrating Query, Key, and Value scales at the end of each training step using the updated policy weights, then synchronizing those scales to vLLM for the next rollout phase. The post reports calibration overhead of approximately 2–3% of total step time.
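A per-tensor scale is commonly derived from the largest absolute value observed in the tensor. The sketch below assumes that approach and the FP8 E4M3 format, whose largest finite value is 448; the post as summarized here does not specify either detail, and the function and key names are hypothetical rather than NeMo RL or vLLM APIs.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def per_tensor_scale(values, fp8_max=FP8_E4M3_MAX, eps=1e-12):
    """One scale for the whole tensor so that max |value| maps to fp8_max."""
    amax = max(abs(v) for v in values)
    return max(amax, eps) / fp8_max

def recalibrate_qkv_scales(q_values, k_values, v_values):
    """Recompute Q, K and V scales from activations of the updated policy."""
    return {
        "q_scale": per_tensor_scale(q_values),
        "k_scale": per_tensor_scale(k_values),
        "v_scale": per_tensor_scale(v_values),
    }

# Hypothetical flattened activations collected after a training step.
scales = recalibrate_qkv_scales(
    q_values=[0.5, -3.2, 1.1],
    k_values=[-7.0, 2.4, 0.3],
    v_values=[0.9, -0.1, 12.8],
)
print(scales)  # these values would then be sent to the rollout engine
```

Because the scale depends on the activation range, a scale computed for an earlier policy can clip or waste precision once the weights move, which is the reason for recalibrating every step.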

On Qwen3-8B-Base using GRPO, adding KV cache and attention quantization on top of linear FP8 delivers approximately 30% additional speedup on the rollout stage, bringing the total improvement to approximately 48% over the BF16 baseline. The post notes the gains are most pronounced at longer response lengths, where attention computation makes up a larger share of total work.

Implementation

According to the post, the FP8 recipe is derived from the block-wise quantized FP8 approach in the DeepSeek-V3 Technical Report. Linear layers use FP8 math; all other modules remain in BF16. Setting kv_cache_dtype in the vLLM configuration triggers automatic QKV scale recalibration and synchronization in NeMo RL’s training loop.
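Block-wise quantization assigns a separate scale to each tile of a matrix instead of one scale for the whole tensor. The sketch below illustrates only that scaling step on a small matrix: it uses a 2x2 block for readability (the DeepSeek-V3 report uses 128x128 blocks for weights), assumes the E4M3 maximum of 448, and does not simulate rounding to FP8. All function names are hypothetical.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def blockwise_scales(matrix, block=2):
    """Return one scale per (block x block) tile, based on the tile's max |value|."""
    rows, cols = len(matrix), len(matrix[0])
    scales = []
    for r0 in range(0, rows, block):
        row_scales = []
        for c0 in range(0, cols, block):
            amax = max(
                abs(matrix[r][c])
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            )
            row_scales.append(max(amax, 1e-12) / FP8_E4M3_MAX)
        scales.append(row_scales)
    return scales

def scale_to_fp8_range(matrix, scales, block=2):
    """Divide each element by its tile's scale so every tile fits the FP8 range."""
    return [
        [matrix[r][c] / scales[r // block][c // block] for c in range(len(matrix[0]))]
        for r in range(len(matrix))
    ]

# Hypothetical weights with one outlier (100.0) in the top-right tile.
weights = [
    [0.1, -0.2, 100.0, 0.3],
    [0.4, 0.5, -0.6, 0.7],
    [0.8, -0.9, 1.0, 1.1],
    [1.2, 1.3, -1.4, 1.5],
]
tile_scales = blockwise_scales(weights)
scaled = scale_to_fp8_range(weights, tile_scales)
print(tile_scales)
```

The outlier inflates only the scale of its own tile, so the other tiles keep their resolution; with a single per-tensor scale, every element would share the coarse scale set by the outlier.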