DeepSeek V4: The Model That Refuses to Stop Shocking the World
1.6 trillion parameters. 1 million token context. A Hybrid Attention Architecture Silicon Valley didn't see coming. One year after R1 rattled markets, Hangzhou is back — and this time it's bigger in every sense.
- The DeepSeek Story: From Startup to Sputnik
- V4 at a Glance: What Just Dropped
- Architecture Deep Dive: Mixture of Experts
- Multi-Head Latent Attention (MLA): The Memory Miracle
- Hybrid Attention Architecture: The V4 Innovation
- The 1 Million Token Context Window
- Benchmark Performance: How Does V4 Actually Score?
- Pricing: The Cost-Destruction Machine
- The Chip Question: Huawei Ascend & Nvidia Restrictions
- V4 vs. The Competition
- The Distillation Controversy
- Geopolitics, Bans, and the Open-Source Gambit
- Verdict: What V4 Actually Means
The DeepSeek Story: From Startup to Sputnik
In the summer of 2023, a small Hangzhou-based quant trading firm called High-Flyer quietly spun out an AI research team with a seemingly impossible mandate: build frontier large language models without frontier compute. DeepSeek — the team and later the company — was handed a budget that would be laughed out of any Silicon Valley AI lab. What they built instead redefined what "efficient" means in the context of large language models.
The world noticed in December 2024, when DeepSeek V3 arrived with 671 billion parameters, a then-jaw-dropping architecture, and a training cost claim of just $5.5 million — a fraction of the hundreds of millions being spent by OpenAI, Google, and Anthropic on comparable models. V3 outperformed every other open-weight model and nipped at the heels of frontier closed models.
Then came R1, in January 2025. A reasoning model trained with reinforcement learning that matched OpenAI's o1 on benchmark after benchmark. The training cost: under $6 million. The market reaction: NVIDIA lost $600 billion in market cap in a single day. Marc Andreessen called it "AI's Sputnik moment." The US government launched an immediate investigation into whether DeepSeek had violated chip export controls to acquire the Nvidia H800 chips used in training.
Now it's April 2026. DeepSeek has been quiet since R1 — dealing with personnel departures, regulatory scrutiny from both US and Chinese governments, and the pressure of following up arguably the most impactful single AI release in history. V4 is what they built in that silence.
V4 at a Glance: What Just Dropped
DeepSeek released the model as DeepSeek V4-Pro and DeepSeek V4-Flash. V4-Pro is a larger model aimed at more demanding tasks, while V4-Flash is a smaller version designed to respond faster and cost less to run.
The Pro model has 1.6 trillion total parameters (49 billion active), making it the biggest open-weight model available, outstripping Moonshot AI's Kimi K2.6 (1.1 trillion) and MiniMax's M1 (456 billion), and more than doubling DeepSeek V3.2 (671 billion). The jump from V3.2 to V4-Pro is not an incremental model update: it more than doubles total model capacity and completely overhauls how the model handles attention at long contexts.
The open-source wager: Both V4 models are available for download and local deployment on Hugging Face — under the same liberal license as prior DeepSeek releases. This "open everything" strategy is not altruism; it is a calculated geopolitical move. Open models build global adoption infrastructure that closed US models cannot replicate. Running DeepSeek on your own hardware means your data never touches DeepSeek's servers — a compelling answer to national security concerns in some markets.
Architecture Deep Dive: Mixture of Experts
To understand V4, you first need to understand the two-pillar architecture that DeepSeek pioneered in V3 and has now supercharged in V4: Mixture of Experts (MoE) and Multi-Head Latent Attention (MLA). Together, they explain why DeepSeek can build 1.6 trillion parameter models that don't cost a trillion dollars to run.
The genius of MoE is that it decouples two things that were previously coupled: model knowledge capacity and inference compute cost. A dense 1.6 trillion parameter model would have to activate all 1.6 trillion parameters for every single token generated — making it essentially unusable at commercial scale without extreme hardware. With MoE, a 1.6 trillion parameter model activates only 49 billion parameters per token. The other 1.55 trillion sit dormant, waiting for the type of input they specialize in.
DeepSeek V3's MoE uses 1 shared expert (always active, every token) plus 256 routed experts (8 selected per token). The shared expert is crucial — it solves the token-dropping problem that plagued earlier MoE designs. In naive MoE, low-confidence tokens can get poorly routed and end up undertrained. The shared expert guarantees every token has at least one strong processing path regardless of routing confidence.
DeepSeek V4-Pro scales this dramatically: with 1.6T total parameters — more than double V3.2's 671B — the model's potential knowledge capacity is enormous, yet its inference cost (governed by active parameters) remains comparable to a 49B dense model. This is the single architectural fact that makes V4's pricing ($0.145/M input tokens) possible.
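A minimal sketch of the routing described above (one always-on shared expert plus top-8 selection from 256 routed experts) in plain Python. The gating detail of softmax-normalizing over the selected scores follows common top-k MoE practice, not DeepSeek's published implementation:

```python
import math
import random

def route_token(router_logits, k=8):
    """Pick the top-k routed experts for one token (the shared expert is
    always active and needs no routing decision). Gating weights are a
    softmax over the selected scores only, as in common top-k MoE gating."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i])[-k:]
    top = top[::-1]                                   # best expert first
    exps = [math.exp(router_logits[i] - router_logits[top[0]]) for i in top]
    z = sum(exps)
    return top, [e / z for e in exps]

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(256)]     # 256 routed experts
ids, weights = route_token(logits, k=8)
assert len(ids) == 8 and abs(sum(weights) - 1.0) < 1e-9

# Why MoE decouples capacity from compute: only the shared expert plus
# 8 routed experts run per token, a thin slice of the full model.
total_params, active_params = 1.6e12, 49e9
print(f"active fraction: {active_params / total_params:.1%}")  # → 3.1%
```

That ~3% active fraction is the whole trick: per-token compute is governed by the 49B active parameters, while knowledge capacity is governed by the 1.6T total.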
Multi-Head Latent Attention: The Memory Miracle
The second pillar of DeepSeek's architecture is MLA — Multi-Head Latent Attention, first introduced in DeepSeek V2 and refined in every major release since. To understand why MLA matters, you first need to understand the KV cache problem that plagues all transformer-based large language models at long context lengths.
When a transformer model generates text token-by-token, it needs to "remember" every previous token in the conversation to compute attention correctly. This is done through a KV (Key-Value) cache — stored representations of every prior token at every layer of the model. For standard multi-head attention (MHA), the memory required for this cache grows proportionally with sequence length. At 1 million token context windows, this becomes catastrophically expensive.
MLA attacks this problem through a low-rank joint compression of keys and values. Instead of storing a full-size Key and Value vector for every token at every layer, MLA compresses both into a single low-dimensional latent vector via a learned down-projection matrix. Only this compressed latent is stored in cache. At attention time, it's decompressed back to full-size K and V via an up-projection. The KV cache footprint shrinks by 98.6% — not a rounding error, a transformation.
This is why DeepSeek V4 can offer a 1 million token context window at commercially viable prices. Without MLA, a 1M token context on V4-Pro would require hundreds of gigabytes of GPU memory just for the KV cache, making it economically impossible to offer at $0.145 per million input tokens.
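Back-of-the-envelope arithmetic makes the compression concrete. The dimensions below are illustrative assumptions (128 heads with 128-dim K and V per layer, a 576-dim latent, 61 layers, 2-byte fp16), not V4's published config, but they show why an uncompressed 1M-token cache is untenable on a single node and why a ~98% reduction changes the economics:

```python
def kv_cache_gib(seq_len, n_layers, dim_per_token_per_layer, bytes_per=2):
    """KV-cache size in GiB: one stored vector of `dim` values
    per token per layer, at `bytes_per` bytes per value (fp16 = 2)."""
    return seq_len * n_layers * dim_per_token_per_layer * bytes_per / 2**30

layers = 61
mha_dim = 2 * 128 * 128   # K and V: 128 heads x 128 head_dim each (assumed)
mla_dim = 576             # one compressed latent per token (assumed)

full = kv_cache_gib(1_000_000, layers, mha_dim)
compressed = kv_cache_gib(1_000_000, layers, mla_dim)
print(f"MHA cache: {full:8.1f} GiB")        # multi-terabyte: no single node
print(f"MLA cache: {compressed:8.1f} GiB")  # fits in server-class memory
print(f"saved: {1 - compressed / full:.1%}")  # → saved: 98.2% with these dims
```

With these assumed dimensions the reduction comes out near the 98.6% figure cited above; the exact number depends on the real head count and latent width.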
Hybrid Attention Architecture: What Makes V4 Different
DeepSeek singled out a technique it dubbed Hybrid Attention Architecture, which it said improves the ability of an AI platform to remember queries across long conversations. This is V4's signature architectural innovation — the reason it represents a genuine leap beyond V3 rather than a simple scaling exercise.
The core problem it solves is a fundamental tension in transformer attention: full attention (where every token attends to every other token) provides perfect recall but scales quadratically with sequence length. Sparse attention (where tokens only attend to a local window) is linear but "forgets" distant context. Prior DeepSeek models used MLA to reduce the memory cost of full attention, but the computational cost still scaled with sequence length.
The Hybrid Attention Architecture is DeepSeek's answer: alternate between full MLA attention layers and sparse local attention layers within the same model. Full attention layers appear at regular intervals, providing global context recall — the model's "memory refresh" across the entire 1M token window. The majority of layers use cheaper local attention, drastically reducing per-layer compute for the bulk of the processing. The result is a model that can genuinely remember a conversation from token 1 to token 1,000,000 — not just the recent window — without the quadratic compute cost of full attention at every layer.
This is meaningfully different from prior approaches. Earlier long-context models either truncated context, compressed it lossily, or simply used sliding windows that effectively forgot older content. V4's Hybrid Attention maintains true long-range dependencies across million-token contexts, which matters enormously for the codebase analysis and long-document reasoning use cases DeepSeek is targeting.
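To see where the savings come from, the sketch below counts attention score computations under a hypothetical hybrid schedule (one full-attention layer every four, sliding-window layers in between). The interval and the 4,096-token window are illustrative assumptions, since DeepSeek hasn't published V4's exact layout:

```python
def layer_schedule(n_layers, full_every=4):
    """Hypothetical hybrid schedule: one full-attention layer every
    `full_every` layers, sliding-window attention everywhere else."""
    return ["full" if i % full_every == 0 else "local" for i in range(n_layers)]

def attended_pairs(seq_len, kind, window=4096):
    """Query-key pairs scored by one layer under a causal mask."""
    if kind == "full":                    # token t attends to t+1 positions
        return seq_len * (seq_len + 1) // 2
    # local: token t attends to at most `window` recent positions
    return sum(min(t + 1, window) for t in range(seq_len))

sched = layer_schedule(60)                # 60 layers: 15 full, 45 local
seq = 100_000                             # 100K-token prompt for illustration
hybrid = sum(attended_pairs(seq, kind) for kind in sched)
dense = attended_pairs(seq, "full") * len(sched)
print(f"hybrid vs all-full-attention compute: {hybrid / dense:.1%}")  # → 31.0%
```

The gap widens with sequence length: full layers scale quadratically while local layers scale linearly, so at 1M tokens the sparse layers become almost free relative to the periodic full "memory refresh" layers.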
The 1 Million Token Context Window
DeepSeek pushed the 1 million-token context window — a leap that allows entire codebases or long documents to be sent as a single prompt. To contextualize this: 1 million tokens is roughly 750,000 words — the equivalent of 3–4 average novels, an entire medium-sized codebase, a semester of research papers, or 80+ hours of meeting transcripts. This is no longer a model that helps you with tasks; it's a model that can ingest your entire work product at once and reason across all of it.
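A quick way to sanity-check whether your material fits in that window is the common rough heuristic of ~4 characters per token for English text; real tokenizers vary, especially on code, so treat this as an estimate only:

```python
def rough_token_count(text, chars_per_token=4.0):
    """Crude token estimate via the ~4-chars-per-token heuristic."""
    return int(len(text) / chars_per_token)

def fits_in_context(total_chars, context=1_000_000, chars_per_token=4.0):
    """Would material of this size fit in a 1M-token window (roughly)?"""
    return total_chars / chars_per_token <= context

# A ~3 MB codebase (~3M characters) estimates to ~750K tokens: fits.
assert fits_in_context(3_000_000)
# A ~5 MB dump (~1.25M tokens) would still need chunking at 1M context.
assert not fits_in_context(5_000_000)
```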
Benchmark Performance: How Does V4 Actually Score?
DeepSeek says both models are more efficient and performant than DeepSeek V3.2 due to architectural improvements, and have almost "closed the gap" with current leading models, both open and closed, on reasoning benchmarks. The honest picture from DeepSeek's own benchmark release is more nuanced than that summary suggests.
Coding & Mathematics — Category Leaders
DeepSeek V4-Pro beats all rival open models on math and coding. On LiveCodeBench and competitive programming benchmarks, V4-Pro and V4-Flash both claim performance "comparable to GPT-5.4" — the current top closed-source coding model. This is V4's strongest category and where DeepSeek's architectural innovations in long-context reasoning pay clearest dividends: multi-file codebase analysis, long-chain mathematical proofs, and complex agentic coding tasks.
Reasoning — Near Frontier
The company claims V4-Pro outperforms its open-source peers across reasoning benchmarks, and outstrips OpenAI's GPT-5.2 and Gemini 3.0 Pro on some tasks. Both V4 models include a dedicated reasoning mode — like R1, the model generates explicit chain-of-thought reasoning tokens before producing its final answer. DeepSeek claims this mode substantially improves performance on multi-step mathematical proofs, logical puzzles, and complex coding challenges.
World Knowledge — The Honest Gap
The models seem to fall slightly behind frontier models in knowledge tests, specifically OpenAI's GPT-5.4 and Google's latest Gemini 3.1 Pro. This lag suggests a "developmental trajectory that trails state-of-the-art frontier models by approximately 3 to 6 months." This is a remarkably candid self-assessment from DeepSeek — and it reflects a genuine strategic tradeoff. DeepSeek's architecture optimizes for reasoning efficiency and coding capability; richer world knowledge requires more diverse pretraining data at scale, an area where OpenAI and Google still have meaningful advantages through their proprietary training pipelines and web access.
Overall Benchmark Table
| Model | Coding | Reasoning/Math | World Knowledge | Open-weight? | Context |
|---|---|---|---|---|---|
| V4-Pro | ≈ GPT-5.4 ✓ | Beats GPT-5.2 ✓ | 3–6mo lag | Yes | 1M |
| V4-Flash | ≈ GPT-5.4 ✓ | Near frontier | 3–6mo lag | Yes | 1M |
| GPT-5.4 (OpenAI) | SOTA | SOTA | SOTA | No (closed) | 128K |
| Gemini 3.1 Pro | Strong | SOTA | SOTA | No (closed) | 1M |
| Kimi K2.6 | Strong | Strong | Good | Yes | 128K |
| DeepSeek V3.2 | Good | Good | Good | Yes | 128K |
Pricing: The Cost-Destruction Machine, Again
If DeepSeek's architecture story is impressive, its pricing story is the part that genuinely disrupts the AI industry's business model. V4 follows the same playbook as V3 and R1: deliver near-frontier capability at a price that makes competing products look embarrassing.
The smaller V4 Flash model costs $0.14 per million input tokens and $0.28 per million output tokens, undercutting GPT-5.4 Nano, Gemini 3.1 Flash, GPT-5.4 Mini, and Claude Haiku 4.5. The larger V4 Pro model, meanwhile, costs $0.145 per million input tokens and $3.48 per million output tokens, also undercutting Gemini 3.1 Pro, GPT-5.5, Claude Opus 4.7, and GPT-5.4.
The nuclear option: V4-Flash's $0.28/M output price for a near-frontier model is within the cost range of running GPT-3.5 class models just two years ago. For high-volume production applications — search re-ranking, document classification, agentic coding pipelines — this price point changes the ROI calculus entirely. DeepSeek isn't just winning on benchmarks; it's making the business case for US closed-source models significantly harder to justify.
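Using the per-million-token prices quoted above, a small helper makes the per-request math concrete. Prices are as reported in this article; check current rate cards before relying on them:

```python
def request_cost(in_tokens, out_tokens, in_price, out_price):
    """Cost in USD for one request; prices are USD per million tokens."""
    return (in_tokens * in_price + out_tokens * out_price) / 1e6

# A typical long-context request: 50K tokens in, 2K tokens out.
v4_flash = request_cost(50_000, 2_000, 0.14, 0.28)
v4_pro = request_cost(50_000, 2_000, 0.145, 3.48)
print(f"Flash ${v4_flash:.4f}, Pro ${v4_pro:.4f}")  # → Flash $0.0076, Pro $0.0142
```

At these rates a high-volume pipeline processing a million such requests per month would spend thousands, not hundreds of thousands, of dollars — which is exactly the ROI shift described above.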
The Chip Question: Huawei Ascend & the Export Control Maze
Wei Sun, principal analyst at market analysis firm Counterpoint Research, highlighted that V4 is run on domestic chips from Huawei and Cambricon. "It allows AI systems to be built and deployed without relying solely on Nvidia, which is why V4 could ultimately have an even bigger impact than R1 — accelerating adoption domestically and contributing to faster global AI development overall."
The chip situation around DeepSeek V4 is more complex than the headline suggests. V3.2 notably reverted to Nvidia hardware after experimenting with Huawei during training — suggesting Huawei's Ascend chips were insufficiently competitive for model training at that point. V4 appears to use Huawei Ascend 950 for inference (serving requests) while training provenance remains unclear. The significance of the Huawei confirmation is strategic rather than technical: it demonstrates that China can now serve frontier-class AI models at scale without depending on Nvidia for inference infrastructure.
Market reaction: After DeepSeek announced V4's release, shares of Chinese contract chip manufacturers rose sharply in Hong Kong, with SMIC surging 9% and Hua Hong Semiconductor rising 15%. The market interpreted V4 as validation that China's domestic chip ecosystem can support frontier AI inference — a significant shift from the narrative that Nvidia dominance was unassailable.
V4 vs. The Competition: Where DeepSeek Wins and Loses
The competitive landscape that V4 enters is very different from the one R1 shook in January 2025. V4's debut is unlikely to have the same market impact as R1, because traders have already priced in the reality that Chinese AI is competitive and cheaper to use. That doesn't mean V4 is unimportant — it means the industry has recalibrated its expectations upward for what DeepSeek can build.
The Distillation Controversy
The launch comes a day after the US accused China of stealing American AI labs' IP on an industrial scale using thousands of proxy accounts. DeepSeek itself has been accused by Anthropic and OpenAI of "distilling," essentially copying, their AI models. This accusation — that DeepSeek used the outputs of closed frontier models to train its own — has shadowed the company since R1's release.
Model distillation is a legitimate and widely-used machine learning technique where a smaller or different model is trained to mimic the outputs of a larger "teacher" model. The controversy is that the terms of service of OpenAI, Anthropic, and Google explicitly prohibit using their model outputs to train competing AI systems. The US White House's Office of Science and Technology Policy has accused unnamed Chinese entities of conducting "industrial-scale" campaigns through proxy accounts to extract training data from US AI APIs.
DeepSeek has not publicly responded to these specific accusations regarding V4. The situation represents a fundamental tension in the AI landscape: open benchmarks and API access make it technically straightforward to distill from frontier models, but the legal and ethical lines around what constitutes legitimate research versus IP theft remain hotly contested.
Geopolitics, Bans, and the Open-Source Gambit
Some countries banned government agencies from using DeepSeek, including Italy, the United States, and South Korea, citing national security concerns. Germany also banned DeepSeek in Apple and Google app stores in 2025, citing illegal transfer of user data to China. These bans reflect real concerns: DeepSeek's privacy policy, like most Chinese consumer internet services, is subject to Chinese data laws that include government access provisions.
And yet: the open-weight release of V4 is a perfect answer to this concern for any organization capable of self-hosting. Download the weights, run the model on your own infrastructure in your own country, and you have none of the data sovereignty concerns of using DeepSeek's API. This is the open-source gambit: by releasing model weights, DeepSeek maximizes global adoption while sidestepping many of the privacy arguments that governments have used to justify bans.
According to Stanford's 2026 AI Index, Chinese companies have "effectively closed" the AI performance gap with their US rivals. While American proprietary models like Claude, ChatGPT and Gemini remain at the top of the industry ladder for now, the gap is narrowing. V4 is evidence that this assessment is correct — and that the pace of narrowing is accelerating, not slowing.
Verdict: What DeepSeek V4 Actually Means
DeepSeek V4 is not another R1. It won't crash stock markets overnight or force a Senate hearing by Monday morning. The industry has priced in Chinese AI competitiveness — and that recalibration is itself the backdrop V4 launches into. What V4 represents instead is something more durable: evidence that the gap-closing is structural, not episodic.
The Hybrid Attention Architecture is a genuine research contribution — not just a scaling exercise. The 1 million token context at $0.14 per million tokens is a commercial fact that no closed-source competitor can currently match on price. The 1.6 trillion parameter open-weight model is the largest ever released to the public, available for any organization in the world to download, run, and fine-tune. And the inference cluster running on Huawei Ascend 950 chips is a proof-of-concept for an AI supply chain that doesn't depend on Nvidia or US export licenses.
The real story of V4 isn't any single benchmark. It's that a company working under chip restrictions, regulatory scrutiny from two governments, personnel departures, and the impossible pressure of following up R1 still delivered the world's biggest open-weight model — with a novel attention architecture, million-token context, and pricing that undercuts the competition by an order of magnitude. The "Sputnik moment" framing from 2025 implied a single alarming event. V4 suggests something more patient and more consequential: a systematic program of efficient frontier AI research that keeps delivering, regardless of what the US throws at it.
For developers and enterprises: V4 deserves immediate evaluation for coding, long-document analysis, and agentic applications. The pricing and context length are genuinely differentiated. The knowledge gap is real and matters for certain use cases. For AI strategists and policymakers: the open-source delivery mechanism makes national bans increasingly performative. The weights are already on Hugging Face. That genie does not return to the bottle.