After several months of delay and speculation, DeepSeek released DeepSeek-V4 Pro and DeepSeek-V4 Flash, its first major architecture refresh since V3 in December 2024 and R1 in January 2025. The release comes with a 58-page technical report, MIT licensing, both Base and Instruct checkpoints, and first-party support for Huawei Ascend chips. Latent Space’s AINews coverage synthesizes the community reaction, benchmark results, and technical claims across the release.
Two models, one architecture
The V4 family uses a two-tier structure. V4 Pro has 1.6 trillion total parameters with 49 billion active per forward pass; V4 Flash has 284 billion total with 13 billion active. Both support a 1 million token context window, up from 128K in V3.2, a jump that multiple community observers described as the headline achievement of the release. The context expansion is enabled by two new attention mechanisms: Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA).
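As a quick back-of-envelope on what those counts imply for sparsity (the parameter figures are from the release; the active-fraction percentages are derived here, not quoted):

```python
# Back-of-envelope sparsity check using the released parameter counts.
pro_total, pro_active = 1_600e9, 49e9     # V4 Pro: 1.6T total, 49B active
flash_total, flash_active = 284e9, 13e9   # V4 Flash: 284B total, 13B active

print(f"V4 Pro active fraction:   {pro_active / pro_total:.1%}")      # ~3.1%
print(f"V4 Flash active fraction: {flash_active / flash_total:.1%}")  # ~4.6%
```

Flash is the denser of the two in relative terms, activating about 4.6% of its weights per token against Pro's roughly 3.1%.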
According to the technical report, the efficiency gains from CSA and HCA are substantial: at 1M tokens, V4 requires only 27% of the FLOPs and 10% of the KV cache memory of DeepSeek V3.2. Community analysis by @ZhihuFrontier quantified the KV cache more precisely: at 1M context it comes to 9.62 GiB per sequence in bf16, versus 83.9 GiB for V3.2, an 8.7x reduction. FP4 index caching and FP8 attention caching yield roughly another 2x reduction on top of that.
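A minimal sketch checking that arithmetic; the per-token figure is derived here and is not quoted in the report:

```python
# Verify the reported KV cache reduction and derive per-token cost at 1M context.
GIB = 2**30

v4_cache = 9.62 * GIB    # V4, 1M-token sequence, bf16 (per @ZhihuFrontier)
v32_cache = 83.9 * GIB   # V3.2, same sequence length

print(f"reduction: {v32_cache / v4_cache:.1f}x")                  # 8.7x
print(f"V4 KV cache per token: {v4_cache / 1e6 / 1024:.1f} KiB")  # ~10.1 KiB
```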
The checkpoint format is mixed FP4 and FP8: MoE expert weights in FP4; attention, norm, and router weights in FP8. @LambdaAPI reported that the full model fits on a single 8xB200 node. Training used Moonshot’s Muon optimizer and the Manifold Constrained Hyper-Connections (mHC) approach from a DeepSeek paper published in January. Multiple community sources put the training scale at 32 to 33 trillion tokens, roughly 20 tokens per parameter for the Pro model.
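The single-node claim is easy to sanity-check. A back-of-envelope sketch, assuming the FP4 expert weights dominate the 1.6T total (the exact FP4/FP8 split is not public):

```python
# Rough checkpoint-size estimate under the mixed FP4/FP8 format.
# Assumption: MoE expert weights in FP4 (0.5 bytes each) dominate the
# 1.6T total; the FP8 attention/norm/router remainder adds little.
total_params = 1.6e12
checkpoint_bytes = total_params * 0.5   # ~0.8 TB if FP4-dominated
node_hbm = 8 * 192e9                    # 8x B200 at 192 GB HBM each

print(f"checkpoint estimate: ~{checkpoint_bytes / 1e12:.1f} TB")  # ~0.8 TB
print(f"8xB200 node HBM:      {node_hbm / 1e12:.2f} TB")          # 1.54 TB

# Tokens per parameter at the reported 32-33T-token training scale.
print(f"tokens/param: ~{32.5e12 / total_params:.0f}")             # ~20
```

Even with the FP8 tensors and runtime overhead on top, that leaves headroom on the node’s roughly 1.5 TB of HBM, consistent with the @LambdaAPI report.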
DeepSeek also released both the Base and Instruct versions, which AINews describes as “incredibly rarely” done for models at this scale and as “surely” setting the stage for community fine-tuning and a possible future reasoning model, though V4 already exposes three reasoning modes, including a hybrid thinking/non-thinking configuration.
Where V4 lands on evals
The most structured independent benchmark synthesis in AINews came from @ArtificialAnlys, whose Artificial Analysis Intelligence Index placed V4 Pro Max at 52, up 10 points from V3.2’s 42, making it the #2 open-weights reasoning model behind Kimi K2.6 at 54. V4 Flash Max scored 47, which the coverage positions at roughly “Claude Sonnet 4.6 max level intelligence.”
On GDPval-AA, an agentic real-world task evaluation, V4 Pro scored 1,554, leading all open-weight models listed, ahead of GLM-5.1 at 1,535, MiniMax-M2.7 at 1,514, and Kimi K2.6 at 1,484. The Text Arena debut placed V4 at #2 open overall, with a #1 ranking in Medical and Healthcare, #15 in Creative Writing, and #18 in Multi-Turn.
The caveat that AINews flags prominently is token volume. Running the Artificial Analysis Index on V4 Pro consumed 190 million output tokens at a cost of $1,071, despite the low per-token pricing of $1.74/$3.48 per million input/output tokens. V4 Flash ran the same index using 240 million output tokens for $113. The AINews framing is direct: “cheap per-token pricing does not imply cheap total task cost if the model spills huge token volumes.”
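A short sketch of that arithmetic; the implied input volume is back-calculated here and is not a reported figure:

```python
# Why cheap per-token pricing can still mean an expensive eval run.
in_price, out_price = 1.74, 3.48   # $/M tokens, V4 Pro
out_tokens_m = 190                 # reported output volume, millions of tokens
total_cost = 1071.0                # reported total cost, USD

output_cost = out_tokens_m * out_price                   # output tokens alone
implied_input_m = (total_cost - output_cost) / in_price  # back-calculated

print(f"output-token cost:    ${output_cost:,.0f}")       # $661
print(f"implied input tokens: ~{implied_input_m:,.0f}M")  # ~236M
```

Output tokens alone account for roughly $661 of the $1,071 bill, with the remainder consistent with a comparably large input volume.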
On hallucination, V4 Pro showed a 94% hallucination rate on AA-Omniscience despite an 11-point improvement over V3.2, while V4 Flash came in at 96%. These numbers sit alongside the positive eval results as a meaningful practical limitation.
The broader assessment from community evaluators quoted in AINews: V4 is “definitely better than GLM-5.1 but not quite Opus 4.7, GPT-5.4 or Gemini 3.1 Pro.” @ArtificialAnlys described the overall family as “roughly a Gemini 3.1, GPT 5.4, Opus 4.6 level model.”
Huawei Ascend and geopolitics
DeepSeek’s day-0 support for Huawei Ascend chips is a deliberate move to reduce dependence on NVIDIA hardware subject to US export controls. AINews notes that Ascend chip supply currently amounts to only roughly a quarter of H100 supply, but frames the milestone as progress toward what the coverage calls “Chinese total independence” in the AI supply chain.
@Reuters reported via community posts that DeepSeek indicated V4 Pro pricing could fall “sharply” once Huawei Ascend 950 supernodes are deployed at scale in the second half of the year. On the other end of the hardware spectrum, community members also reported MLX quantizations running on Macs with 256 GB of RAM, and @simonw raised questions about viability on smaller machines.
Day-0 inference support was broad: @SemiAnalysis_ covered deployments across H200, MI355, B200, B300, and GB200/300 systems; NVIDIA reported 150+ tokens per second per user for agentic workflows on Blackwell Ultra; and vLLM published day-0 V4 Pro performance numbers. Third-party API availability moved quickly through Together, Baseten, and Nous Research.
Pricing alongside the release: V4 Pro at $1.74/$3.48 and V4 Flash at $0.14/$0.28 per million input/output tokens, with cache-hit pricing also available. @scaling01 viewed the Flash pricing as a preview of future cheap coding model economics, describing it as a “Mythos-level” cost point.
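Putting the Pro and Flash numbers side by side makes that point concrete; the ratios below are derived from the figures quoted above, not stated in the coverage:

```python
# Pro vs Flash on the same Artificial Analysis Index run (figures quoted above).
pro_cost, flash_cost = 1071.0, 113.0   # total index cost, USD
pro_out_m, flash_out_m = 190, 240      # output tokens, millions

print(f"per-token price ratio (Pro/Flash): {1.74 / 0.14:.1f}x")             # 12.4x
print(f"total-cost ratio on the index:     {pro_cost / flash_cost:.1f}x")   # 9.5x
print(f"Flash output volume vs Pro:        {flash_out_m / pro_out_m:.2f}x") # 1.26x
```

The total-cost gap (9.5x) comes in below the per-token gap (12.4x) because Flash emitted about 1.26x as many output tokens, echoing the token-volume caveat above.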