Chinese AI lab DeepSeek has released the first models in its V4 series: DeepSeek-V4-Pro and DeepSeek-V4-Flash, both available under an MIT license on Hugging Face. The pair follow DeepSeek’s V3.2 and V3.2 Speciale, which came out last December. Both new models support 1-million-token context windows and use a mixture-of-experts (MoE) architecture.

The numbers are large. V4-Pro has 1.6 trillion total parameters with 49 billion active per token; V4-Flash has 284 billion total with 13 billion active. According to Simon Willison’s write-up, V4-Pro is likely now the largest open-weights model available: larger than Kimi K2.6 at 1.1 trillion parameters and GLM-5.1 at 754 billion, and more than twice the size of DeepSeek’s own V3.2 at 685 billion. The raw file sizes on Hugging Face reflect this: Pro weighs in at 865GB, Flash at 160GB.

The pricing is the real story

What Willison calls out most forcefully is the cost. DeepSeek’s pricing for V4-Flash is $0.14 per million input tokens and $0.28 per million output tokens. For V4-Pro, it’s $1.74 per million input and $3.48 per million output.
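
To make those rates concrete, here is a quick back-of-the-envelope cost calculation. The workload below (a 50,000-token input and 2,000-token output, run 10,000 times) is hypothetical and chosen purely for illustration; only the per-million-token prices come from DeepSeek's published pricing.

```python
# Rough cost estimate at DeepSeek's published V4 prices.
# The example workload is hypothetical, for illustration only.

PRICES = {  # USD per million tokens: (input, output)
    "v4-flash": (0.14, 0.28),
    "v4-pro": (1.74, 3.48),
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the published rates."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

requests = 10_000
for model in PRICES:
    per_request = cost_usd(model, input_tokens=50_000, output_tokens=2_000)
    print(f"{model}: ${per_request:.4f}/request, "
          f"${per_request * requests:,.2f} for {requests:,} requests")
```

At those rates, the hypothetical workload comes to roughly $76 on V4-Flash versus about $940 on V4-Pro, which is the kind of gap that makes the tier split matter.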

A comparison table in the post places those numbers in context. V4-Flash at $0.14 input is cheaper than GPT-5.4 Nano ($0.20), Gemini 3.1 Flash-Lite ($0.25), Gemini 3 Flash Preview ($0.50), and GPT-5.4 Mini ($0.75). V4-Pro at $1.74 input undercuts Gemini 3.1 Pro ($2.00), GPT-5.4 ($2.50), Claude Sonnet 4.6 ($3.00), Claude Opus 4.7 ($5.00), and GPT-5.5 ($5.00). As Willison summarizes: “DeepSeek-V4-Flash is the cheapest of the small models, beating even OpenAI’s GPT-5.4 Nano. DeepSeek-V4-Pro is the cheapest of the larger frontier models.”

That pricing is possible in part because of DeepSeek’s efficiency work, particularly on long-context workloads. The paper reports that at a 1-million-token context, V4-Pro requires only 27% of the per-token FLOPs and 10% of the KV cache size of V3.2. V4-Flash pushes further: 10% of the per-token FLOPs and 7% of the KV cache. These are substantial reductions, the kind that make large-context jobs meaningfully cheaper to serve, and that saving flows directly into the price.
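
To see why KV cache size dominates long-context serving economics, here is a minimal sketch of the standard KV-cache memory formula. The layer count, head count, head dimension, and dtype below are hypothetical placeholders, not DeepSeek's actual architecture; the only figures taken from the paper are the 10% and 7% reduction ratios.

```python
# KV cache for a standard transformer grows linearly with context length:
#   bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value * tokens
# All architecture numbers below are hypothetical, for illustration only.

def kv_cache_gb(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9

# A made-up GQA-style baseline model at a 1M-token context:
baseline = kv_cache_gb(layers=61, kv_heads=8, head_dim=128, tokens=1_000_000)
print(f"hypothetical baseline at 1M tokens: {baseline:.0f} GB")

# The paper reports V4-Pro at 10% and V4-Flash at 7% of V3.2's KV cache,
# so whatever the true baseline is, the cache shrinks accordingly:
print(f"V4-Pro  (10%): {baseline * 0.10:.0f} GB")
print(f"V4-Flash (7%): {baseline * 0.07:.0f} GB")
```

Whatever the real baseline number is, cutting the cache to a tenth means a server can hold roughly ten times as many long-context sessions in the same memory, which is where the serving cost advantage comes from.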

Where the model sits relative to frontier

DeepSeek’s self-reported benchmarks describe a model that is competitive with frontier offerings, though with a clear acknowledgment of the gap. According to the paper, “DeepSeek-V4-Pro-Max demonstrates superior performance relative to GPT-5.2 and Gemini-3.0-Pro on standard reasoning benchmarks.” However, “its performance falls marginally short of GPT-5.4 and Gemini-3.1-Pro, suggesting a developmental trajectory that trails state-of-the-art frontier models by approximately 3 to 6 months.”

That framing — trailing the absolute frontier by three to six months — is an unusually specific and candid self-assessment for a model release announcement. It positions V4-Pro not as a frontier-beater but as a near-frontier model at a fraction of the price of the actual frontier. Whether that is the right trade-off depends on the use case.

Willison ran both models through his standard informal benchmark: generating an SVG of a pelican riding a bicycle. He links to the outputs. Flash produced “an excellent bicycle — good frame shape, nice chain, even has a reflector on the front wheel. Pelican has a mean looking expression but has its wings on the handlebars and feet on the pedals.” Pro’s result was mixed: “Another solid bicycle, albeit the spokes are a little jagged and the frame is compressed a bit. Pelican has gone a bit wrong — it has a VERY large body, only one wing, a weirdly hairy backside.”

Informal as this is, Willison has been running the same pelican test across DeepSeek versions since V3-0324, which gives it some longitudinal value as a rough consistency check.

Running locally

Willison tested both models via OpenRouter using his llm-openrouter plugin, rather than running them locally. He notes that a quantized version of Flash might fit on his 128GB M5 MacBook Pro, and that Pro could potentially run if active experts are streamed from disk — though this is speculative.
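
For anyone who wants to reproduce that setup, here is a minimal sketch using the llm library's Python API, mirroring the llm-openrouter path Willison describes. The model ID below is a guess at what the OpenRouter slug might look like, not a confirmed identifier; check openrouter.ai for the real one.

```python
# Minimal sketch of querying the model through OpenRouter via the
# llm library's Python API, mirroring Willison's llm-openrouter setup.
# Prerequisites (run once in a shell):
#   llm install llm-openrouter
#   llm keys set openrouter
import llm

# ASSUMPTION: the OpenRouter slug below is hypothetical; look up the
# actual model identifier on openrouter.ai/models before using.
model = llm.get_model("openrouter/deepseek/deepseek-v4-flash")
response = model.prompt("Generate an SVG of a pelican riding a bicycle")
print(response.text())
```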

He says he is watching Unsloth’s Hugging Face page for quantized versions, which the Unsloth team typically produces quickly for large open-weights releases. For researchers and developers interested in running the model locally, the 160GB Flash file is the more plausible starting point; the 865GB Pro would require either substantial RAM or creative streaming approaches.
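
As a rough sanity check on that claim: quantized weight size is approximately parameter count times bits per weight. The sketch below applies that arithmetic to Flash's 284 billion parameters at a few common bit widths; the 128GB figure is Willison's machine, and everything else is ordinary arithmetic, ignoring the per-block overhead real quantized files carry.

```python
# Rough quantized-weight sizes for V4-Flash (284B total parameters).
# Real quantized files (e.g. GGUF) carry per-block overhead, so treat
# these as lower-bound estimates rather than exact file sizes.
TOTAL_PARAMS = 284e9
MACHINE_RAM_GB = 128  # Willison's M5 MacBook Pro

for bits in (8, 4, 3, 2):
    size_gb = TOTAL_PARAMS * bits / 8 / 1e9
    fits = "fits" if size_gb < MACHINE_RAM_GB else "does not fit"
    print(f"{bits}-bit: ~{size_gb:.0f} GB ({fits} in {MACHINE_RAM_GB} GB RAM)")
```

Only around 3 bits per weight do Flash's weights alone squeeze under 128GB, which is consistent with Willison's hedged "might fit"; the KV cache and the operating system still need headroom on top of that.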

The MIT license means both models can be used without the usage restrictions that apply to some other open-weights releases, which matters for commercial deployment and fine-tuning. Combined with the pricing and efficiency numbers, that permissive licensing makes these two models likely to draw serious attention from anyone building applications where inference cost is a meaningful variable.