NVIDIA Blackwell delivers over 150 tokens/sec/user on DeepSeek-V4-Pro in initial testing

NVIDIA’s developer blog covers Blackwell platform support for DeepSeek-V4 on day zero of the model’s release, including initial throughput numbers and deployment options. The post covers both the model family’s architecture and the serving infrastructure available for teams building on it.

DeepSeek-V4-Pro has 1.6 trillion total parameters and 49 billion active parameters. DeepSeek-V4-Flash has 284 billion total parameters and 13 billion active parameters. Both models support a context window of up to 1 million tokens and a maximum output of up to 384,000 tokens, according to the post.

Architecture and inference economics

The V4 family uses a mixture-of-experts architecture with three attention mechanisms. Compressed Sparse Attention (CSA) uses dynamic sequence compression to reduce KV cache memory and applies sparsification. DeepSeek Sparse Attention reduces computational overhead through attention matrix sparsification. Heavily Compressed Attention (HCA) consolidates KV entries across sets of tokens into single compressed entries, achieving further KV cache reduction.

Together, the post states these achieve a 73% reduction in per-token inference FLOPs and a 90% reduction in KV cache memory burden compared with DeepSeek-V3.2.

Blackwell performance

Out-of-the-box testing of DeepSeek-V4-Pro on NVIDIA GB200 NVL72 demonstrated over 150 tokens per second per user, according to the post. NVIDIA also used vLLM’s Day 0 NVIDIA Blackwell B300 recipe to produce a throughput snapshot across latency and throughput configurations. The post states these numbers are expected to improve as NVIDIA optimizes its stack, citing Dynamo, NVFP4, optimized CUDA kernels, and advanced parallelization techniques as areas of ongoing work.

Both models are available through NVIDIA GPU-accelerated endpoints on build.nvidia.com as part of the NVIDIA Developer Program, and through NVIDIA NIM on day zero.

Serving frameworks

Two open-source serving frameworks are covered. SGLang offers serving recipes for DeepSeek-V4 on Blackwell and Hopper hardware tuned for low-latency, balanced, and maximum-throughput profiles, along with recipes for long-context workloads and prefill/decode disaggregation. vLLM provides single-node and multinode recipes for Blackwell and Hopper, including multinode prefill/decode disaggregation scaling to 100 or more GPUs, with support for tool calling, reasoning, and speculative decoding.

Agent harnesses

The post identifies three agent integrations. NVIDIA NemoClaw allows users to run OpenClaw in a secure OpenShell environment to create a long-running personal assistant powered by DeepSeek-V4 for tasks including code generation and autonomous support. The NVIDIA AI-Q Blueprint, described as a deep research assistant based on LangChain Deep Agents, is extensible to add DeepSeek-V4 for orchestration and planning. The NVIDIA Data Explorer Agent, written with NeMo Agent Toolkit, won first place in the DABstep benchmark and supports switching to DeepSeek-V4.

The post states: “As open models reach the frontier of intelligence, the enterprise focus is pivoting from model selection to infrastructure strategy.”