Breaking · Just Released

DeepSeek V4:
The Model That Refuses
to Stop Shocking the World

1.6 trillion parameters. 1 million token context. A Hybrid Attention Architecture Silicon Valley didn't see coming. One year after R1 rattled markets, Hangzhou is back — and this time it's bigger in every sense.

40 min read · V4-Pro · V4-Flash · MoE · MLA · Hybrid Attention · Published April 25, 2026
PREVIEW RELEASED APRIL 24, 2026 — DeepSeek V4-Pro and V4-Flash are preview models. Full technical paper not yet published. This analysis draws on Hugging Face model cards, benchmark results released by DeepSeek, and independent technical coverage. Some architectural details remain under NDA or undisclosed.
V4-Pro Parameters
1.6T
49B active per token
(MoE: ~3% utilisation)
V4-Flash Parameters
284B
13B active per token
Low-latency variant
Context Window
1M
tokens — entire codebases
or book-length documents
V4-Flash Input Price
$0.14
per million tokens
cheapest frontier-class API
Biggest Open Model
#1
Beats Kimi K2.6 (1.1T)
MiniMax M1 (456B)
Knowledge Lag vs SOTA
3–6mo
behind GPT-5.4 & Gemini 3.1
(self-reported)
// TABLE OF CONTENTS
  1. The DeepSeek Story: From Startup to Sputnik
  2. V4 at a Glance: What Just Dropped
  3. Architecture Deep Dive: Mixture of Experts
  4. Multi-Head Latent Attention (MLA): The Memory Miracle
  5. Hybrid Attention Architecture: The V4 Innovation
  6. The 1 Million Token Context Window
  7. Benchmark Performance: How Does V4 Actually Score?
  8. Pricing: The Cost-Destruction Machine
  9. The Chip Question: Huawei Ascend & Nvidia Restrictions
  10. V4 vs. The Competition
  11. The Distillation Controversy
  12. Geopolitics, Bans, and the Open-Source Gambit
  13. Verdict: What V4 Actually Means
01
Background

The DeepSeek Story: From Startup to Sputnik

In the summer of 2023, a small Hangzhou-based quant trading firm called High-Flyer quietly spun out an AI research team with a seemingly impossible mandate: build frontier large language models without frontier compute. DeepSeek — the team and later the company — was handed a budget that would be laughed out of any Silicon Valley AI lab. What they built instead redefined what "efficient" means in the context of large language models.

The world noticed in December 2024, when DeepSeek V3 arrived with 671 billion parameters, a then-jaw-dropping architecture, and a training cost claim of just $5.5 million — a fraction of the hundreds of millions being spent by OpenAI, Google, and Anthropic on comparable models. V3 outperformed every other open-weight model and nipped at the heels of frontier closed models.

Then came R1, in January 2025. A reasoning model trained with reinforcement learning that matched OpenAI's o1 on benchmark after benchmark. The training cost: under $6 million. The market reaction: NVIDIA lost $600 billion in market cap in a single day. Marc Andreessen called it "AI's Sputnik moment." The US government launched an immediate investigation into whether DeepSeek had violated chip export controls to acquire the Nvidia H800 chips used in training.

Now it's April 2026. DeepSeek has been quiet since R1 — dealing with personnel departures, regulatory scrutiny from both US and Chinese governments, and the pressure of following up arguably the most impactful single AI release in history. V4 is what they built in that silence.

Dec 2024
DeepSeek V3
671B parameters (37B active). MoE + MLA architecture. $5.5M training cost claim. Best open-weight model at launch.
671B params 37B active MoE + MLA 128K context
Jan 2025
DeepSeek R1 ⚡ The Sputnik
Reasoning model via GRPO RL. Matched OpenAI o1 on AIME, MATH-500, SWE-bench. Sub-$6M training cost. Crashed NVIDIA by $600B market cap in one day.
Market crash GRPO RL Chain-of-thought <$6M cost
Sep 2025
DeepSeek V3.2 / V3.2-Exp
Added DeepSeek Sparse Attention (DSA). Incremental improvement. Used as infrastructure warm-up for V4. Returned to Nvidia chips after Huawei experiment.
DSA attention Sparse attn. Infra testbed
Apr 24, 2026 — TODAY
DeepSeek V4-Pro & V4-Flash
1.6T parameters (49B active). 1M token context. Hybrid Attention Architecture. Biggest open-weight model ever released. Huawei Ascend 950 chip cluster deployed for inference.
1.6T params 1M context Hybrid attention Huawei Ascend
02
What Just Dropped

V4 at a Glance: What Just Dropped

DeepSeek released the model as DeepSeek V4-Pro and DeepSeek V4-Flash. V4-Pro is a larger model aimed at more demanding tasks, while V4-Flash is a smaller version designed to respond faster and cost less to run.

V4-Flash
Speed variant · Low latency · Cost-critical apps
Total params284 billion
Active per token13 billion
Context window1M tokens
ArchitectureMoE + Hybrid Attn
Reasoning modeYes
MultimodalText only
Input price$0.14/M tokens
Output price$0.28/M tokens
Open-weightYes (HuggingFace)

The Pro model has a total of 1.6 trillion parameters (49 billion active), making it the biggest open-weight model available: it outstrips Moonshot AI's Kimi K2.6 (1.1 trillion) and MiniMax's M1 (456 billion), and is more than double the size of DeepSeek V3.2 (671 billion). The jump from V3.2 to V4-Pro is not an incremental model update — it's a more than doubling of total model capacity and a complete architectural overhaul of how the model handles attention at long contexts.

The open-source wager: Both V4 models are available for download and local deployment on Hugging Face — under the same liberal license as prior DeepSeek releases. This "open everything" strategy is not altruism; it is a calculated geopolitical move. Open models build global adoption infrastructure that closed US models cannot replicate. Running DeepSeek on your own hardware means your data never touches DeepSeek's servers — a compelling answer to national security concerns in some markets.

03
Architecture

Architecture Deep Dive: Mixture of Experts

To understand V4, you first need to understand the two-pillar architecture that DeepSeek pioneered in V3 and has now supercharged in V4: Mixture of Experts (MoE) and Multi-Head Latent Attention (MLA). Together, they explain why DeepSeek can build 1.6 trillion parameter models that don't cost a trillion dollars to run.

[Diagram: dense model (all FFN parameters active for every token, 100% activation, high cost) vs. DeepSeek V4 MoE (router selects 8 of 256 experts plus 1 always-active shared expert, ~3% of parameters active, low cost)]
Fig. 1 — Mixture of Experts vs. Dense model. V4-Pro's 1.6T parameters are divided into 256 routed experts plus 1 always-active shared expert. A router selects 8 specialists per token — meaning only ~49B parameters (3%) are active at any time. This decouples model capacity (what it knows) from inference cost (what it costs to run).

The genius of MoE is that it decouples two things that were previously coupled: model knowledge capacity and inference compute cost. In a dense model, a 1.6 trillion parameter model would require activating all 1.6 trillion parameters for every single token generated — making it essentially unusable at commercial scale without extreme hardware. With MoE, a 1.6 trillion parameter model activates only 49 billion parameters per token. The other 1.55 trillion sit dormant, waiting for the type of input they specialize in.

DeepSeek V3's MoE uses 1 shared expert (always active, every token) plus 256 routed experts (8 selected per token). The shared expert is crucial — it solves the token-dropping problem that plagued earlier MoE designs. In naive MoE, low-confidence tokens can get poorly routed and end up undertrained. The shared expert guarantees every token has at least one strong processing path regardless of routing confidence.

DeepSeek V4-Pro scales this dramatically: with 1.6T total parameters — more than double V3.2's 671B — the model's potential knowledge capacity is enormous, yet its inference cost (governed by active parameters) remains comparable to a 49B dense model. This is the single architectural fact that makes V4's pricing ($0.145/M input tokens) possible.
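As a concrete illustration, here is a minimal NumPy sketch of top-k expert routing with an always-active shared expert. The dimensions, softmax gating, and linear "experts" are illustrative assumptions for demonstration only; DeepSeek has not published V4's actual router details.

```python
import numpy as np

def moe_forward(x, router_W, experts, shared_expert, k=8):
    """Route one token through a simplified MoE layer.

    x            : (d,) token hidden state
    router_W     : (d, n_experts) router weights
    experts      : list of callables, each (d,) -> (d,)
    shared_expert: always-active callable (d,) -> (d,)
    """
    logits = x @ router_W                        # one score per expert
    top_k = np.argsort(logits)[-k:]              # indices of the k best experts
    gates = np.exp(logits[top_k] - logits[top_k].max())
    gates /= gates.sum()                         # softmax over selected experts only
    out = shared_expert(x)                       # shared expert sees every token
    for g, i in zip(gates, top_k):
        out = out + g * experts[i](x)            # weighted sum of routed experts
    return out

# Toy demo: 256 tiny linear "experts", 8 activated per token
rng = np.random.default_rng(0)
d, n_experts = 64, 256
Ws = [rng.standard_normal((d, d)) * 0.01 for _ in range(n_experts)]
experts = [lambda x, W=W: x @ W for W in Ws]
W_shared = rng.standard_normal((d, d)) * 0.01
shared = lambda x: x @ W_shared
router_W = rng.standard_normal((d, n_experts))

y = moe_forward(rng.standard_normal(d), router_W, experts, shared)
print(y.shape)  # (64,)
```

Note the decoupling the section describes: the parameter count grows with `n_experts`, but per-token compute is fixed by `k + 1` expert evaluations.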

04
Attention Innovation

Multi-Head Latent Attention: The Memory Miracle

The second pillar of DeepSeek's architecture is MLA — Multi-Head Latent Attention, first introduced in DeepSeek V2 and refined in every major release since. To understand why MLA matters, you first need to understand the KV cache problem that plagues all transformer-based large language models at long context lengths.

When a transformer model generates text token-by-token, it needs to "remember" every previous token in the conversation to compute attention correctly. This is done through a KV (Key-Value) cache — stored representations of every prior token at every layer of the model. For standard multi-head attention (MHA), the memory required for this cache grows proportionally with sequence length. At 1 million token context windows, this becomes catastrophically expensive.

[Diagram: standard MHA caches full K and V for every token (~82KB per token per layer; ~40GB of KV cache at 128K context, ~320GB at 1M, requiring 32+ high-end GPUs) vs. MLA, which down-projects the hidden state h_t to a latent c via a W_DKV matrix and caches only that (~1.15KB per token per layer; ~500MB at 1M context, fitting on 4 GPUs), up-projecting back to K and V at attention time]
Fig. 2 — Standard MHA vs. MLA KV cache comparison. MLA compresses K and V tensors into a low-dimensional latent vector before caching. At inference time, these are decompressed back to full size for attention computation. The result: 98.6% memory reduction, from roughly 82KB to roughly 1.15KB per token per layer — the difference between needing 32+ GPUs and 4 GPUs for a 1M-token context session.

MLA achieves this through a low-rank joint compression of keys and values. Instead of storing a full-size Key and Value vector for every token at every layer, MLA compresses both into a single low-dimensional latent vector via a learned down-projection matrix. Only this compressed latent is stored in cache. At attention time, it's decompressed back to full-size K and V via an up-projection. The KV cache footprint shrinks by 98.6% — not a rounding error, a transformation.

This is why DeepSeek V4 can offer a 1 million token context window at commercially viable prices. Without MLA, a 1M token context on V4-Pro would require hundreds of gigabytes of GPU memory just for the KV cache, making it economically impossible to offer at $0.145 per million input tokens.
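The down-project/cache/up-project scheme can be sketched in a few lines of NumPy. The dimensions below (`d_model`, `d_latent`, head counts) are illustrative assumptions chosen only to show where the saving comes from; V4's true values are undisclosed, and the resulting 64× ratio here is the same order of magnitude as the article's quoted 98.6% reduction, not an exact match.

```python
import numpy as np

d_model, d_latent = 7168, 512    # illustrative dims, not V4's actual values
n_heads, d_head = 128, 128

rng = np.random.default_rng(1)
W_DKV = rng.standard_normal((d_model, d_latent)) * 0.01          # down-projection
W_UK = rng.standard_normal((d_latent, n_heads * d_head)) * 0.01  # up-project to K
W_UV = rng.standard_normal((d_latent, n_heads * d_head)) * 0.01  # up-project to V

h_t = rng.standard_normal(d_model)  # hidden state for one incoming token

# Cache only the compressed latent, never the full K and V
c_t = h_t @ W_DKV                   # (512,) — this is all that is stored

# At attention time, decompress back to full-size keys/values
k_t = (c_t @ W_UK).reshape(n_heads, d_head)
v_t = (c_t @ W_UV).reshape(n_heads, d_head)

full_kv_floats = 2 * n_heads * d_head   # MHA stores K and V per token per layer
mla_floats = d_latent                   # MLA stores the latent only
print(f"compression: {full_kv_floats / mla_floats:.0f}x")  # 64x with these dims
```

The cache cost per token drops from `2 * n_heads * d_head` numbers to `d_latent` numbers, while the up-projections are ordinary matrix multiplies paid only at attention time.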

05
V4's New Innovation

Hybrid Attention Architecture: What Makes V4 Different

DeepSeek singled out a technique it dubbed Hybrid Attention Architecture, which it said improves the ability of an AI platform to remember queries across long conversations. This is V4's signature architectural innovation — the reason it represents a genuine leap beyond V3 rather than a simple scaling exercise.

The core problem it solves is a fundamental tension in transformer attention: full attention (where every token attends to every other token) provides perfect recall but scales quadratically with sequence length. Sparse attention (where tokens only attend to a local window) is linear but "forgets" distant context. Prior DeepSeek models used MLA to reduce the memory cost of full attention, but the computational cost still scaled with sequence length.

[Diagram: V4 layer stack alternating full MLA attention layers (attend over the entire 1M context; perfect recall, higher cost) with runs of sparse local attention layers (nearby-token windows only; fast, efficient, local memory), with full layers recurring at intervals as a "global context refresh"]
Fig. 3 — Hybrid Attention Architecture in DeepSeek V4. Full MLA attention layers appear at strategic intervals in the layer stack, providing global context recall across the full 1M token window. The majority of layers use sparse local attention for efficiency. This "dial" between full and sparse attention is the architectural innovation that enables 1M context at practical cost.

The Hybrid Attention Architecture is DeepSeek's answer: alternate between full MLA attention layers and sparse local attention layers within the same model. Full attention layers appear at regular intervals, providing global context recall — the model's "memory refresh" across the entire 1M token window. The majority of layers use cheaper local attention, drastically reducing per-layer compute for the bulk of the processing. The result is a model that can genuinely remember a conversation from token 1 to token 1,000,000 — not just the recent window — without the quadratic compute cost of full attention at every layer.

This is meaningfully different from prior approaches. Earlier long-context models either truncated context, compressed it lossily, or simply used sliding windows that effectively forgot older content. V4's Hybrid Attention maintains true long-range dependencies across million-token contexts, which matters enormously for the codebase analysis and long-document reasoning use cases DeepSeek is targeting.
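A toy sketch of the interleaving: a per-layer schedule plus the two causal mask types. The one-full-layer-in-four ratio and the 128-token local window below are illustrative guesses, not disclosed V4 hyperparameters.

```python
import numpy as np

def layer_schedule(n_layers, full_every=4):
    """Assign each layer full or sparse-local attention.

    full_every=4 (one global layer per four) is purely illustrative;
    DeepSeek has not published V4's actual interleaving ratio.
    """
    return ["full" if i % full_every == 0 else "sparse"
            for i in range(n_layers)]

def attention_mask(seq_len, kind, window=128):
    """Causal mask: full layers see all prior tokens, sparse layers a local window."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    causal = j <= i
    if kind == "full":
        return causal
    return causal & (i - j < window)  # only the most recent `window` tokens

sched = layer_schedule(12)
print(sched[:5])  # ['full', 'sparse', 'sparse', 'sparse', 'full']

m_full = attention_mask(1024, "full")
m_sparse = attention_mask(1024, "sparse", window=128)
print(m_sparse.sum() / m_full.sum())  # fraction of attention pairs a sparse layer keeps
```

The efficiency argument is visible in the last line: sparse layers evaluate only a small fraction of the quadratic attention pairs, while the periodic full layers keep distant tokens reachable.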

06
Context Window

The 1 Million Token Context Window

DeepSeek pushed the 1 million-token context window — a leap that allows entire codebases or long documents to be sent as a single prompt. To contextualize this: 1 million tokens is roughly 750,000 words — the equivalent of 3–4 average novels, an entire medium-sized codebase, a semester of research papers, or 80+ hours of meeting transcripts. This is no longer a model that helps you with tasks; it's a model that can ingest your entire work product at once and reason across all of it.
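Whether a given codebase actually fits in the window is easy to estimate with the common rough heuristic of ~4 characters per token; real tokenizer counts vary by language and content, and this sketch assumes nothing about DeepSeek's actual tokenizer.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough heuristic; real tokenizers vary

def estimate_tokens(root, exts=(".py", ".js", ".ts", ".go", ".rs")):
    """Roughly estimate how many tokens a codebase would consume as one prompt."""
    total_chars = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in exts:
            total_chars += len(path.read_text(errors="ignore"))
    return total_chars // CHARS_PER_TOKEN

# Example usage: does this repo fit in a 1M-token window?
# tokens = estimate_tokens("path/to/repo")
# print(tokens, "fits" if tokens < 1_000_000 else "needs chunking")
```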

1M tokens ≈
750K
words — 3–4 novels
or a full codebase
vs. GPT-4o
8×
larger context than
128K token models
V3.2 context
128K
tokens — V4 is
8× larger
MLA memory savings
98.6%
KV cache reduction
makes 1M viable
07
Benchmarks

Benchmark Performance: How Does V4 Actually Score?

DeepSeek says both models are more efficient and performant than DeepSeek V3.2 due to architectural improvements, and have almost "closed the gap" with current leading models, both open and closed, on reasoning benchmarks. The honest picture from DeepSeek's own benchmark release is more nuanced than that summary suggests.

Coding & Mathematics — Category Leaders

DeepSeek-V4-Pro beats all rival open models for maths and coding. On LiveCodeBench and competitive programming benchmarks, V4-Pro and V4-Flash both claim performance "comparable to GPT-5.4" — the current top closed-source coding model. This is V4's strongest category and where DeepSeek's architectural innovations in long-context reasoning pay clearest dividends: multi-file codebase analysis, long-chain mathematical proofs, and complex agentic coding tasks.

// CODING BENCHMARK — LiveCodeBench (approx. composite, normalized)
DeepSeek V4-Pro
~93
GPT-5.4 (OpenAI)
~95
Gemini 3.1 Pro
~91
DeepSeek V4-Flash
~88
Kimi K2.6
~84
DeepSeek V3.2
~79
Approximate composite from DeepSeek-published benchmarks. Exact methodology not yet disclosed in full paper.

Reasoning — Near Frontier

The company claims its new V4-Pro model outperforms its open-source peers across reasoning benchmarks, and outstrips OpenAI's GPT-5.2 and Gemini 3.0 Pro on some tasks. Both V4 models include a dedicated reasoning mode — like R1, the model generates explicit chain-of-thought reasoning tokens before producing its final answer. DeepSeek claims this mode substantially improves performance on multi-step mathematical proofs, logical puzzles, and complex coding challenges.

World Knowledge — The Honest Gap

The models seem to fall slightly behind frontier models in knowledge tests, specifically OpenAI's GPT-5.4 and Google's latest Gemini 3.1 Pro. This lag suggests a "developmental trajectory that trails state-of-the-art frontier models by approximately 3 to 6 months." This is a remarkably candid self-assessment from DeepSeek — and it reflects a genuine strategic tradeoff. DeepSeek's architecture optimizes for reasoning efficiency and coding capability; richer world knowledge requires more diverse pretraining data at scale, an area where OpenAI and Google still have meaningful advantages through their proprietary training pipelines and web access.

Overall Benchmark Table

Model | Coding | Reasoning/Math | World Knowledge | Open-weight? | Context
V4-Pro | ≈ GPT-5.4 ✓ | Beats GPT-5.2 ✓ | 3–6mo lag | Yes | 1M
V4-Flash | ≈ GPT-5.4 ✓ | Near frontier | 3–6mo lag | Yes | 1M
GPT-5.4 (OpenAI) | SOTA | SOTA | SOTA | No (closed) | 128K
Gemini 3.1 Pro | Strong | SOTA | SOTA | No (closed) | 1M
Kimi K2.6 | Strong | Strong | Good | Yes | 128K
DeepSeek V3.2 | Good | Good | Good | Yes | 128K
08
Pricing

Pricing: The Cost-Destruction Machine, Again

If DeepSeek's architecture story is impressive, its pricing story is the part that genuinely disrupts the AI industry's business model. V4 follows the same playbook as V3 and R1: deliver near-frontier capability at a price that makes competing products look embarrassing.

Claude Opus 4.7 (input)
$15.00
per million tokens (est.)
Out: ~$75/M · Closed-source
GPT-5.5 (input)
~$10.00
per million tokens (est.)
Out: ~$30/M · Closed-source
Gemini 3.1 Pro (input)
~$3.50
per million tokens (est.)
Out: ~$10/M · Closed-source
Self-hosted V4-Pro
$0.00
API cost (infra only)
Open-weight; run on your hardware

The smaller V4 Flash model costs $0.14 per million input tokens and $0.28 per million output tokens, undercutting GPT-5.4 Nano, Gemini 3.1 Flash, GPT-5.4 Mini, and Claude Haiku 4.5. The larger V4 Pro model, meanwhile, costs $0.145 per million input tokens and $3.48 per million output tokens, also undercutting Gemini 3.1 Pro, GPT-5.5, Claude Opus 4.7, and GPT-5.4.

The nuclear option: V4-Flash's $0.28/M output price for a near-frontier model is within the cost range of running GPT-3.5 class models just two years ago. For high-volume production applications — search re-ranking, document classification, agentic coding pipelines — this price point changes the ROI calculus entirely. DeepSeek isn't just winning on benchmarks; it's making the business case for US closed-source models significantly harder to justify.
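At these prices the arithmetic is worth doing explicitly. The sketch below uses the per-million-token prices quoted in this article (closed-model figures are the estimates given above) against a hypothetical high-volume monthly workload.

```python
def api_cost(input_tokens, output_tokens, in_price, out_price):
    """Total USD cost given per-million-token input/output prices."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# (input $/M, output $/M) — prices as reported in this article; closed models are estimates
models = {
    "V4-Flash":              (0.14, 0.28),
    "V4-Pro":                (0.145, 3.48),
    "Gemini 3.1 Pro (est.)": (3.50, 10.00),
    "GPT-5.5 (est.)":        (10.00, 30.00),
}

# Hypothetical agentic-coding month: 2B input tokens, 200M output tokens
for name, (inp, outp) in models.items():
    print(f"{name:24s} ${api_cost(2e9, 2e8, inp, outp):>10,.2f}")
# e.g. V4-Flash ≈ $336 vs. GPT-5.5 ≈ $26,000 for the same workload
```

The two-orders-of-magnitude spread at the bottom of the loop is the "ROI calculus" change described above.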

09
Chip Strategy

The Chip Question: Huawei Ascend & the Export Control Maze

🔴
Huawei Ascend 950 "Supernode" clusters now confirmed for V4 inference
DeepSeek partnered with Chinese tech giant Huawei, which confirmed that its latest AI computing cluster, powered by its "Ascend 950" chips, can support DeepSeek's V4 model. Huawei supports the AI startup with its "Supernode" technology by combining large clusters of its Ascend 950 chips to provide more computing power. It remains unclear how extensively Huawei's chips were used in training V4, versus Nvidia hardware. DeepSeek has been restricted from directly purchasing Nvidia's most advanced AI chips due to Washington's ever-shifting export controls.

Wei Sun, principal analyst at market analysis firm Counterpoint Research, highlighted that V4 is run on domestic chips from Huawei and Cambricon. "It allows AI systems to be built and deployed without relying solely on Nvidia, which is why V4 could ultimately have an even bigger impact than R1 — accelerating adoption domestically and contributing to faster global AI development overall."

The chip situation around DeepSeek V4 is more complex than the headline suggests. V3.2 notably reverted to Nvidia hardware after experimenting with Huawei during training — suggesting Huawei's Ascend chips were insufficiently competitive for model training at that point. V4 appears to use Huawei Ascend 950 for inference (serving requests) while training provenance remains unclear. The significance of the Huawei confirmation is strategic rather than technical: it demonstrates that China can now serve frontier-class AI models at scale without depending on Nvidia for inference infrastructure.

Market reaction: After DeepSeek announced V4's release, shares of Chinese contract chip manufacturers rose sharply in Hong Kong, with SMIC surging 9% and Hua Hong Semiconductor rising 15%. The market interpreted V4 as validation that China's domestic chip ecosystem can support frontier AI inference — a significant shift from the narrative that Nvidia dominance was unassailable.

10
Competitive Landscape

V4 vs. The Competition: Where DeepSeek Wins and Loses

The competitive landscape that V4 enters is very different from the one R1 shook in January 2025. V4's debut is unlikely to have the same market impact as R1, because traders have already priced in the reality that Chinese AI is competitive and cheaper to use. That doesn't mean V4 is unimportant — it means the industry has recalibrated its expectations upward for what DeepSeek can build.

Best for Coding
V4-Pro / V4-Flash
Both models claim performance "comparable to GPT-5.4" on coding benchmarks, at API prices up to two orders of magnitude lower. For pure coding applications this is a no-brainer.
Best for World Knowledge
GPT-5.4 / Gemini 3.1 Pro
DeepSeek admits a 3–6 month lag here. For tasks requiring deep world knowledge, current events, or broad factual Q&A, US closed-source models still lead.
Best Open-Weight Overall
DeepSeek V4-Pro
Biggest open-weight model ever at 1.6T params. Beats Kimi K2.6, MiniMax M1, and all prior open models on coding. Freely downloadable.
Best Price/Performance
V4-Flash
$0.14/M input tokens for near-frontier performance. No other model in this performance tier comes within 5× of this price point via API.
Best Long-Context
V4 (tied Gemini 3.1)
1M token context window matches Gemini 3.1 Pro. V4's Hybrid Attention Architecture may offer better coherence at extreme lengths — not yet independently verified.
Best for Multimodal
GPT-5.4 / Gemini 3.1
V4 is text-only. No vision, audio, or video. This is a significant gap versus US frontier models that now handle images, audio, and video natively.
11
Controversy

The Distillation Controversy

The launch comes a day after the US accused China of stealing American AI labs' IP on an industrial scale using thousands of proxy accounts. DeepSeek itself has been accused by Anthropic and OpenAI of "distilling," essentially copying, their AI models. This accusation — that DeepSeek used the outputs of closed frontier models to train its own — has shadowed the company since R1's release.

Model distillation is a legitimate and widely-used machine learning technique where a smaller or different model is trained to mimic the outputs of a larger "teacher" model. The controversy is that the terms of service of OpenAI, Anthropic, and Google explicitly prohibit using their model outputs to train competing AI systems. The US White House's Office of Science and Technology Policy has accused unnamed Chinese entities of conducting "industrial-scale" campaigns through proxy accounts to extract training data from US AI APIs.

DeepSeek has not publicly responded to these specific accusations regarding V4. The situation represents a fundamental tension in the AI landscape: open benchmarks and API access make it technically straightforward to distill from frontier models, but the legal and ethical lines around what constitutes legitimate research versus IP theft remain hotly contested.

12
Geopolitics

Geopolitics, Bans, and the Open-Source Gambit

Several governments, including Italy, the United States, and South Korea, have banned their agencies from using DeepSeek, citing national security concerns. Germany also ordered DeepSeek removed from the Apple and Google app stores in 2025, citing illegal transfer of user data to China. These bans reflect real concerns: DeepSeek's privacy policy, like that of most Chinese consumer internet services, is subject to Chinese data laws that include government access provisions.

And yet: the open-weight release of V4 is a perfect answer to this concern for any organization capable of self-hosting. Download the weights, run the model on your own infrastructure in your own country, and you have none of the data sovereignty concerns of using DeepSeek's API. This is the open-source gambit: by releasing model weights, DeepSeek maximizes global adoption while sidestepping many of the privacy arguments that governments have used to justify bans.

According to Stanford's 2026 AI Index, Chinese companies have "effectively closed" the AI performance gap with their US rivals. While American proprietary models like Claude, ChatGPT and Gemini remain at the top of the industry ladder for now, the gap is narrowing. V4 is evidence that this assessment is correct — and that the pace of narrowing is accelerating, not slowing.

13
Final Analysis

Verdict: What DeepSeek V4 Actually Means

DeepSeek V4 is not another R1. It won't crash stock markets overnight or force a Senate hearing by Monday morning. The industry has already priced in Chinese AI competitiveness, and that recalibration is itself part of R1's legacy. What V4 represents instead is something more durable: evidence that the gap-closing is structural, not episodic.

The Hybrid Attention Architecture is a genuine research contribution — not just a scaling exercise. The 1 million token context at $0.14 per million tokens is a commercial fact that no closed-source competitor can currently match on price. The 1.6 trillion parameter open-weight model is the largest ever released to the public, available for any organization in the world to download, run, and fine-tune. And the inference cluster running on Huawei Ascend 950 chips is a proof-of-concept for an AI supply chain that doesn't depend on Nvidia or US export licenses.

The real story of V4 isn't any single benchmark. It's that a company working under chip restrictions, regulatory scrutiny from two governments, personnel departures, and the impossible pressure of following up R1 still delivered the world's biggest open-weight model — with a novel attention architecture, million-token context, and pricing that undercuts the competition by an order of magnitude. The "Sputnik moment" framing from 2025 implied a single alarming event. V4 suggests something more patient and more consequential: a systematic program of efficient frontier AI research that keeps delivering, regardless of what the US throws at it.

For developers and enterprises: V4 deserves immediate evaluation for coding, long-document analysis, and agentic applications. The pricing and context length are genuinely differentiated. The knowledge gap is real and matters for certain use cases. For AI strategists and policymakers: the open-source delivery mechanism makes national bans increasingly performative. The weights are already on Hugging Face. That genie does not return to the bottle.