DeepSeek V4 Preview Lands with 1M Context: V4-Pro 1.6T / V4-Flash 284B Open-Sourced under MIT, FLOPs Cut to 27% of V3.2
On April 24, 2026, DeepSeek released the long-rumored Preview build of the V4 series.
The lineup is a two-pronged release: DeepSeek-V4-Pro (1.6T total parameters / 49B active) and DeepSeek-V4-Flash (284B / 13B active). Both support a 1M context window, and the weights are up on Hugging Face under MIT.
The API was updated at the same time, and chat.deepseek.com now exposes a switch between Expert Mode (deliberation) and Instant Mode (fast response) on day one.
Over the past week or two, frontier-class Chinese models have been landing back-to-back: Tencent Hy3-preview and Ant Ling-2.6-flash, Qwen3.6-Max-Preview and Kimi K2.6, Xiaomi MiMo-V2.5 and V2.5-Pro, and Zhipu AI’s GLM-5.1.
In every comparison, “DeepSeek-V3 family” has been the reference point on the benchmark axis. Now the actual DeepSeek team has stepped onto the same ring, in the form of V4 Preview.
Two-model positioning
V4-Pro and V4-Flash share the same architectural lineage, split into a “frontier tier” and an “efficiency tier.”
```mermaid
flowchart LR
V4[DeepSeek V4 Preview] --> P[V4-Pro<br/>1.6T / 49B active<br/>Top tier<br/>Think Max ready]
V4 --> F[V4-Flash<br/>284B / 13B active<br/>Fast & low-cost<br/>Economy-focused]
P --> PC[chat.deepseek.com<br/>Expert Mode]
F --> FC[chat.deepseek.com<br/>Instant Mode]
P --> API[API updated day one]
F --> API
```
V4-Pro is positioned as the top-tier cloud model “rivaling the world’s top closed-source models,” and V4-Flash is the “fast, efficient, and economical option.”
Unlike the Qwen3.6 setup, where Max-Preview was kept closed and only the 35B-A3B was opened, DeepSeek is shipping both Pro and Flash as open weights under MIT—frontier tier included.
Specs and base architecture
The numbers from the model cards, side by side.
| Item | V4-Pro | V4-Flash |
|---|---|---|
| Total parameters | 1.6T | 284B |
| Active parameters / token | 49B | 13B |
| Context length | 1M | 1M |
| Architecture | Fine-grained MoE | Fine-grained MoE |
| Precision | FP4 + FP8 Mixed | FP4 + FP8 Mixed |
| Pre-training tokens | 32T+ | 32T+ |
| License | MIT | MIT |
| Post-training | Two stages (expert SFT + GRPO → on-policy distillation merge) | Same |
The FP4+FP8 mixed precision is quietly load-bearing. MoE expert parameters are kept in FP4, everything else in FP8.
1.6T total parameters sounds enormous, but with FP4/FP8 the on-disk footprint comes in at less than half of a naive FP16 layout.
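A quick back-of-envelope check makes the claim concrete. The ~90% expert share below is an assumption for illustration, not a model-card figure; only the FP4-for-experts / FP8-for-the-rest split is from the card:

```python
def weight_footprint_gb(total_params, expert_frac, expert_bytes=0.5, other_bytes=1.0):
    """Approximate weight storage in GB: experts in FP4 (0.5 B/param), the rest in FP8 (1 B/param)."""
    return total_params * (expert_frac * expert_bytes + (1 - expert_frac) * other_bytes) / 1e9

# Assumption: ~90% of V4-Pro's 1.6T parameters sit in MoE experts.
mixed = weight_footprint_gb(1.6e12, expert_frac=0.90)
fp16 = 1.6e12 * 2 / 1e9  # naive FP16 baseline: 2 bytes per parameter
print(f"FP4+FP8 mixed: ~{mixed:.0f} GB vs naive FP16: ~{fp16:.0f} GB")
```

Under those assumptions the mixed layout lands around 880 GB against 3.2 TB for naive FP16, consistent with the "less than half" framing and the 800GB–1TB sizing later in this article.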
The total-to-active ratio reads more clearly next to the other models from the same week—Hy3 preview (295B/21B) and Ling-2.6-flash (104B/7.4B).
| Model | Total / Active | Context | License |
|---|---|---|---|
| DeepSeek-V4-Pro | 1.6T / 49B | 1M | MIT |
| DeepSeek-V4-Flash | 284B / 13B | 1M | MIT |
| DeepSeek-V3 family | 671B / 37B | 128K | MIT |
| Tencent Hy3 preview | 295B / 21B | 256K | Custom (Hy Community) |
| Zhipu GLM-5.1 | 744B / 40B | 200K | MIT |
| Ant Ling-2.6-flash | 104B / 7.4B | — | MIT |
V3 sat at "671B / 37B active, 128K," so V4-Pro roughly 2.4x's the total parameter count, holds active parameters at 49B, and pushes context out to 1M in a single jump.
The “flagship = grow total, keep active flat-ish, bet hard on long context” direction matches what GLM-5.1 did at 744B/40B by adding DSA to reach 200K.
The headline: CSA + HCA hybrid attention
The biggest change in the V4 series is the attention mechanism redesign.
The model card calls it a Hybrid Attention Architecture that combines two attention types.
| Name | Short | Role |
|---|---|---|
| Compressed Sparse Attention | CSA | Main attention—compression and sparsification combined |
| Heavily Compressed Attention | HCA | Sub-attention—aggressive compression at long-context scale |
Where Hy3 preview pushed on GQA and MTP, V4 takes a more aggressive step: split attention itself into two and run them as a hybrid.
The effective cost at 1M context comes out as follows.
| Metric | vs DeepSeek-V3.2 |
|---|---|
| Per-token inference FLOPs | 27% |
| KV cache | 10% |
To actually use a 1M context as a 1M context, plain attention runs into both quadratic compute and an exploding KV cache.
V4 attacks both by reshaping the attention hierarchy itself. Among models that put “1M support” on the spec sheet, V4 lands in the class that can actually use 1M without compute or memory blowing up.
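To see why the KV cache is the binding constraint at 1M, a generic estimator helps. The layer and head dimensions below are hypothetical placeholders for illustration, not V4's actual config; only the "10% of V3.2" ratio comes from the model card:

```python
def kv_cache_gb(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=1):
    """Plain-attention KV cache: one K and one V vector per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

# Hypothetical dims (not from the model card), FP8 cache entries (1 byte/elem):
full = kv_cache_gb(seq_len=1_000_000, n_layers=60, n_kv_heads=8, head_dim=128)
print(f"plain attention at 1M: ~{full:.0f} GB; at the reported 10%: ~{full * 0.10:.0f} GB")
```

Even with these modest placeholder dimensions, a plain-attention 1M cache runs to roughly 123 GB per sequence; cutting that to 10% is the difference between "fits on a node" and "doesn't."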
mHC takes care of residuals
The other new piece is Manifold-Constrained Hyper-Connections (mHC).
The standard Transformer residual connection is replaced with a hyper-connection constrained on a manifold, with the goal of stabilizing inter-layer signal propagation.
Swapping out residuals as a direction itself echoes Moonshot’s “Block AttnRes” depth-wise hyper-connections that landed in Kimi Linear.
mHC adds the “manifold constraint” on top, designed to preserve gradient flow and representational power even at the deeper, longer 1M-class scale.
Muon optimizer
Both pre-training and post-training explicitly use the Muon optimizer.
Picking Muon over the AdamW family is a trend that keeps showing up in Kimi and Qwen post-training designs, and it’s becoming the practical answer when you want both convergence and training stability past the 1T-parameter mark.
Three reasoning modes
Both V4-Pro and V4-Flash let the user pick from three modes: Non-Think / Think High / Think Max.
| Mode | Behavior | Intended use |
|---|---|---|
| Non-Think | Quick, intuitive | Daily tasks, simple responses |
| Think High | Explicit logical analysis, longer deliberation | Planning, complex reasoning |
| Think Max | Pulls maximum reasoning capacity | Research-grade tasks, ceiling-finding workloads |
Think Max mode is recommended with a context window of at least 384K reserved.
The thinking trace itself emits a long token stream, so a typical 128K window will get cut off mid-thought.
In the chat UI, Expert Mode maps roughly onto Think High / Think Max, and Instant Mode maps onto Non-Think. On the API side, the request specifies thinking_mode directly.
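On the API side, the mode switch would look roughly like the following. The request body assumes the usual OpenAI-compatible chat-completions shape; only `thinking_mode` itself is mentioned in the source, and the exact mode string values are assumptions, not confirmed API spec:

```python
import json

# Hypothetical request body: OpenAI-compatible shape assumed for illustration.
payload = {
    "model": "deepseek-v4-pro",
    "messages": [{"role": "user", "content": "Plan a week-long benchmark run."}],
    # Assumed values mirroring the three modes: "non_think" / "think_high" / "think_max"
    "thinking_mode": "think_high",
}
print(json.dumps(payload, indent=2))
```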
Benchmarks
The headline numbers, lifted straight from the model cards.
V4-Pro (top configuration: V4-Pro-Max)
| Benchmark | V4-Pro | Notes |
|---|---|---|
| MMLU-Pro | 87.5 | General knowledge / reasoning |
| GPQA Diamond | 90.1 | Graduate-level QA |
| SimpleQA-Verified | 57.9 | Factual |
| LiveCodeBench | 93.5 | Top among compared models |
| Codeforces Rating | 3206 | Competitive programming, top among compared models |
| SWE Verified | 80.6 | Real-repo SWE tasks |
| BrowseComp | 83.4 | Browser-operation agent |
| Toolathlon | 51.8 | Tool-use agent |
| MRCR 1M | 83.5 MMR | Long-context comprehension (1M tokens) |
| CorpusQA 1M | 62.0 ACC | 1M QA |
LiveCodeBench 93.5 and Codeforces 3206 are top-tier even by frontier standards—within trading-blow distance of Claude Opus 4.6 and Gemini 3.1 Pro High.
SWE Verified 80.6 is a hair behind Claude and ties Gemini 3.1 Pro High.
On the other side, SimpleQA-Verified 57.9 and GPQA 90.1 still trail Gemini 3.1 Pro High (75.6 and 94.3 respectively). The push is more about “thinking and acting” than raw factual density.
V4-Flash
V4-Flash puts up surprisingly aggressive numbers for the “smaller” model.
| Benchmark | V4-Flash | V4-Pro |
|---|---|---|
| MMLU-Pro (Non-Think) | 83.0 | 82.9 |
| SimpleQA-Verified (Max Mode) | 34.1 | 57.9 |
| LiveCodeBench (Max Mode) | 91.6 | 93.5 |
| MRCR 1M | 78.7 | 83.5 |
Non-Think MMLU-Pro is essentially tied with V4-Pro, and LiveCodeBench is within a hair.
The gap opens up on “knowledge-density tasks (SimpleQA)” and “1M long-context reading,” but for coding and short-to-medium reasoning, V4-Flash is more than enough. The role split is clean.
Going for this range with only 13B active puts V4-Flash in direct contention with Ant Ling-2.6-flash, which staked out the agent-focused efficiency slot at 104B / 7.4B active. The density race in the Flash tier just got tighter.
Chat template moves to a custom encoder
Starting with V4, the Jinja chat template is gone, replaced by a Python-based custom encoder.
```python
from encoding_dsv4 import encode_messages, parse_message_from_completion_text

messages = [
    {"role": "user", "content": "hello"},
    {"role": "assistant", "content": "Hello! I am DeepSeek.", "reasoning_content": "thinking..."},
    {"role": "user", "content": "1+1=?"},
]

# Build the prompt string with the custom encoder, then tokenize as usual.
prompt = encode_messages(messages, thinking_mode="thinking")

import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V4-Pro")
tokens = tokenizer.encode(prompt)
```
reasoning_content is carried directly inside conversation history, so multi-turn handoff of Think-mode reasoning traces becomes natural—but the receiving side (Transformers / vLLM / SGLang and so on) has to align with this encoder.
When grafting V4 into an existing pipeline, the chat template plumbing should be the first thing to suspect.
The recommended sampling defaults are temperature = 1.0, top_p = 1.0, which is the opposite direction from the usual “lower the temperature for stability” school. Worth keeping in mind.
Distribution and license
- Weights: `deepseek-ai/DeepSeek-V4-Pro` / `deepseek-ai/DeepSeek-V4-Flash` (Hugging Face, MIT)
- Base versions: `-Base` suffix, FP8-only
- API: official DeepSeek API updated day one, switchable between Expert Mode and Instant Mode
- Web UI: chat.deepseek.com
- ModelScope: a Flash-version mirror is up
Shipping under MIT all the way down to the Base version is the same playbook as DeepSeek-OCR—DeepSeek’s open-by-default stance carries through to the V4 generation.
For now this is “Preview”-labeled, so behavior and numbers may shift in the official release.
V4-Pro’s 1.6T / 49B configuration could be read as “yet another Chinese model stacking parameters.” In practice, it’s a fairly applied frontier model: CSA+HCA pulls 1M-context FLOPs down to 27%, while still putting up Codeforces 3206 / LiveCodeBench 93.5.
With Flash hitting the same 1M context at 13B active, the baseline of “Chinese open frontier” has clearly moved up from the V3 benchmark era.
Can you run this at home?
A rough sizing for whether home GPU rigs can touch these models.
V4-Pro (1.6T / 49B active)
Even with FP4+FP8 mixed precision, 1.6T total parameters demands roughly 800GB–1TB of storage / memory just to load the weights.
Holding that takes on the order of ten H100 80GB cards just for the weights, before KV cache and activations. A single Mac Studio M3 Ultra 512GB cannot hold it.
Touching V4-Pro on a personal rig is not realistic. The fastest path is the official DeepSeek API or chat.deepseek.com.
V4-Flash (284B / 13B active)
This one lands around 140–160GB with FP4-heavy placement.
Once third-party Q4-equivalent quantizations land on Hugging Face, that number can come down further.
- RTX 5090 32GB alone: nowhere near fits in VRAM. CPU offload is mandatory, with non-active experts pushed out to DDR5 192GB-class system memory. Whether it runs at usable speed is doubtful
- Mac Studio M3 Ultra 512GB: one of the few personal-tier setups that can hold the full model in unified memory. With 13B active in an MoE, single-digit tok/s should be reachable
- 2–3x H100 80GB: tensor-parallel layout works, but the power, noise, and heat are not home-friendly
Even with FLOPs down to 27%, expanding a full 1M context into KV cache eats memory steadily.
For personal rigs, the safer path is to verify stability at 128K–256K first and only then stretch the context out.
What about my own setup
Mapping this onto the machines I actually own.
- V4-Pro: out of the question. Even at FP4+FP8, the 1.6T weights are effectively 800GB–1TB-class, which 512GB of unified memory cannot hold. The only options are chat.deepseek.com or the official API
- V4-Flash: about 140–160GB with FP4-heavy placement. A Mac Studio M3 Ultra 512GB-class machine could load it at original size, but my single RTX-class GPU can’t. CPU offload is rough, too—MoE expert routing flips constantly, which jams up PCIe bandwidth, so practical speed is harsh
- Wait for quantization: for personal-hardware folks, the realistic first step is waiting for Unsloth / MLX / GGUF crews to push Q4 / Q3 builds
- 1M operation: KV cache starts to dominate, so even with FP4 weights loaded, easing in from a 128K context is safer than going straight to 1M
A pragmatic sequence: check behavior on chat.deepseek.com → test thinking_mode switching via API → run locally once a quantized build appears.
There's not much point in downloading the original-size weights for personal use.