
DeepSeek V4 Preview Lands with 1M Context: V4-Pro 1.6T / V4-Flash 284B Open-Sourced under MIT, FLOPs Cut to 27% of V3.2

Ikesan

On April 24, 2026, DeepSeek released the long-rumored Preview build of the V4 series.
The lineup is a two-pronged release: DeepSeek-V4-Pro (1.6T total parameters / 49B active) and DeepSeek-V4-Flash (284B / 13B active). Both support a 1M context window, and the weights are up on Hugging Face under MIT.
The API was updated at the same time, and chat.deepseek.com now exposes a switch between Expert Mode (deliberation) and Instant Mode (fast response) on day one.

Over the past week or two, frontier-class Chinese models have been landing back-to-back: Tencent Hy3-preview and Ant Ling-2.6-flash, Qwen3.6-Max-Preview and Kimi K2.6, Xiaomi MiMo-V2.5 and V2.5-Pro, and Zhipu AI’s GLM-5.1.
In every comparison, “DeepSeek-V3 family” has been the reference point on the benchmark axis. Now the actual DeepSeek team has stepped onto the same ring, in the form of V4 Preview.

Two-model positioning

V4-Pro and V4-Flash share the same architectural lineage, split into a “frontier tier” and an “efficiency tier.”

```mermaid
flowchart LR
  V4[DeepSeek V4 Preview] --> P[V4-Pro<br/>1.6T / 49B active<br/>Top tier<br/>Think Max ready]
  V4 --> F[V4-Flash<br/>284B / 13B active<br/>Fast & low-cost<br/>Economy-focused]
  P --> PC[chat.deepseek.com<br/>Expert Mode]
  F --> FC[chat.deepseek.com<br/>Instant Mode]
  P --> API[API updated day one]
  F --> API
```

V4-Pro is positioned as the top-tier cloud model “rivaling the world’s top closed-source models,” and V4-Flash is the “fast, efficient, and economical option.”
Unlike the Qwen3.6 setup, where Max-Preview was kept closed and only the 35B-A3B was opened, DeepSeek is shipping both Pro and Flash as open weights under MIT—frontier tier included.

Specs and base architecture

The numbers from the model cards, side by side.

| Item | V4-Pro | V4-Flash |
|---|---|---|
| Total parameters | 1.6T | 284B |
| Active parameters / token | 49B | 13B |
| Context length | 1M | 1M |
| Architecture | Fine-grained MoE | Fine-grained MoE |
| Precision | FP4 + FP8 mixed | FP4 + FP8 mixed |
| Pre-training tokens | 32T+ | 32T+ |
| License | MIT | MIT |
| Post-training | Two stages (expert SFT + GRPO → on-policy distillation merge) | Same |

The FP4+FP8 mixed precision is quietly load-bearing. MoE expert parameters are kept in FP4, everything else in FP8.
1.6T total parameters sounds enormous, but with FP4/FP8 the on-disk footprint comes in at less than half of a naive FP16 layout.
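As a back-of-envelope check, the footprint math works out roughly as follows. The 90/10 expert/non-expert split is an illustrative assumption, not a published figure:

```python
# Back-of-envelope weight footprint for V4-Pro (1.6T params).
# The ~90% expert share is an illustrative assumption, not a published figure.
TOTAL_PARAMS = 1.6e12
EXPERT_SHARE = 0.90   # assumed fraction of parameters living in MoE experts
FP4_BYTES = 0.5       # 4 bits per parameter
FP8_BYTES = 1.0       # 8 bits per parameter
FP16_BYTES = 2.0

mixed_gb = TOTAL_PARAMS * (EXPERT_SHARE * FP4_BYTES + (1 - EXPERT_SHARE) * FP8_BYTES) / 1e9
fp16_gb = TOTAL_PARAMS * FP16_BYTES / 1e9

print(f"FP4/FP8 mixed: ~{mixed_gb:.0f} GB")  # ~880 GB
print(f"naive FP16:    ~{fp16_gb:.0f} GB")   # ~3200 GB
```

Under these assumptions the mixed-precision layout lands well under half the naive FP16 size, consistent with the sub-1TB figure quoted later for V4-Pro.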

The total-to-active ratio reads more clearly next to the other models from the same week—Hy3 preview (295B/21B) and Ling-2.6-flash (104B/7.4B).

| Model | Total / Active | Context | License |
|---|---|---|---|
| DeepSeek-V4-Pro | 1.6T / 49B | 1M | MIT |
| DeepSeek-V4-Flash | 284B / 13B | 1M | MIT |
| DeepSeek-V3 family | 671B / 37B | 128K | MIT |
| Tencent Hy3 preview | 295B / 21B | 256K | Custom (Hy Community) |
| Zhipu GLM-5.1 | 744B / 40B | 200K | MIT |
| Ant Ling-2.6-flash | 104B / 7.4B | | MIT |

V3 sat at “671B / 37B active, 128K,” so V4-Pro roughly 2.4x’s the total parameter count, holds active parameters down at 49B, and pushes context all the way out to 1M in one jump.
The “flagship = grow total, keep active flat-ish, bet hard on long context” direction matches what GLM-5.1 did at 744B/40B by adding DSA to reach 200K.
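A quick sanity check on those ratios, straight from the spec-sheet numbers (treating both context lengths as round token counts):

```python
# Scaling jump from V3 (671B / 37B active, 128K) to V4-Pro (1.6T / 49B active, 1M)
v3_total, v4_total = 671e9, 1.6e12
v3_active, v4_active = 37e9, 49e9
v3_ctx, v4_ctx = 128_000, 1_000_000

print(f"total params:  x{v4_total / v3_total:.1f}")    # x2.4
print(f"active params: x{v4_active / v3_active:.2f}")  # x1.32
print(f"context:       x{v4_ctx / v3_ctx:.1f}")
```

Total parameters scale 2.4x while active parameters grow only about 1.3x, which is the "grow total, keep active flat-ish" pattern in numbers.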

The headline: CSA + HCA hybrid attention

The biggest change in the V4 series is the attention mechanism redesign.
The model card calls it a Hybrid Attention Architecture that combines two attention types.

| Name | Short | Role |
|---|---|---|
| Compressed Sparse Attention | CSA | Main attention: compression and sparsification combined |
| Heavily Compressed Attention | HCA | Sub-attention: aggressive compression at long-context scale |

Where Hy3 preview pushed on GQA and MTP, V4 takes a more aggressive step: split attention itself into two and run them as a hybrid.
The effective cost at 1M context comes out as follows.

| Metric | vs DeepSeek-V3.2 |
|---|---|
| Per-token inference FLOPs | 27% |
| KV cache | 10% |

Running a 1M context as a genuine 1M context means plain attention hits both quadratic compute and an exploding KV cache.
V4 attacks both by reshaping the attention hierarchy itself. Among models that put “1M support” on the spec sheet, V4 lands in the class that can actually use 1M without compute or memory blowing up.
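To get a feel for why the 10% KV figure matters at this scale, here is an illustrative sizing sketch. The layer/head/dim geometry below is an assumed stand-in for a dense GQA-style baseline, not V4's actual configuration:

```python
# Illustrative KV-cache sizing at 1M context. All geometry numbers are
# assumed stand-ins for a dense GQA-style cache, not V4's real architecture.
CTX = 1_000_000
LAYERS = 60      # assumed
KV_HEADS = 8     # assumed
HEAD_DIM = 128   # assumed
BYTES = 1        # FP8 cache entries, assumed

# Dense baseline: K and V stored for every token at every layer
baseline_gb = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES * CTX / 1e9
# V4's reported ratio vs V3.2: KV cache down to 10%
v4_gb = baseline_gb * 0.10

print(f"dense-style baseline at 1M: ~{baseline_gb:.0f} GB")
print(f"at 10% of that:             ~{v4_gb:.1f} GB")
```

Even with made-up geometry, the shape of the result holds: a full 1M dense-style cache is a three-digit-GB object per sequence, and a 10x reduction is the difference between "impossible" and "fits next to the weights."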

mHC takes care of residuals

The other new piece is Manifold-Constrained Hyper-Connections (mHC).
The standard Transformer residual connection is replaced with a hyper-connection constrained on a manifold, with the goal of stabilizing inter-layer signal propagation.

Swapping out residuals as a direction itself echoes Moonshot’s “Block AttnRes” depth-wise hyper-connections that landed in Kimi Linear.
mHC adds the “manifold constraint” on top, designed to preserve gradient flow and representational power even at the deeper, longer 1M-class scale.
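As rough intuition for the general idea only (not DeepSeek's actual mHC, whose details aren't public): a hyper-connection keeps several parallel residual streams and mixes them with learned weights. In this toy sketch, row-normalizing the mixing matrix stands in for the manifold constraint, keeping the mixing step from inflating signal magnitude:

```python
import numpy as np

# Toy sketch of a hyper-connection-style residual (NOT DeepSeek's actual mHC).
# A plain residual is y = x + f(x). A hyper-connection keeps n parallel
# streams and mixes them with learned weights; here a row-normalization
# stands in for the "manifold constraint," so each output stream is a
# convex combination of input streams.
rng = np.random.default_rng(0)
n, d = 4, 16  # number of streams, hidden size

def f(x):
    # Stand-in for a transformer sub-layer
    return np.tanh(x)

def constrain(W):
    # Project mixing weights onto the probability simplex row-wise
    W = np.abs(W)
    return W / W.sum(axis=1, keepdims=True)

W_mix = constrain(rng.normal(size=(n, n)))  # stream-mixing weights
streams = rng.normal(size=(n, d))

layer_out = f(streams.mean(axis=0))     # sub-layer sees the pooled streams
streams = W_mix @ streams + layer_out   # mix streams, then add layer output

# Each row of W_mix sums to 1, so the mixing step alone never amplifies signal
print(np.allclose(W_mix.sum(axis=1), 1.0))
```

The real mHC presumably picks a more principled manifold than the simplex; the point of the sketch is only that constraining the mixing weights is what keeps deep, wide residual routing stable.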

Muon optimizer

Both pre-training and post-training explicitly use the Muon optimizer.
Picking Muon over the AdamW family is a trend that keeps showing up in Kimi and Qwen post-training designs, and it’s becoming the practical answer when you want both convergence and training stability past the 1T-parameter mark.
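For reference, Muon's core move is orthogonalizing the momentum update of each 2D weight matrix via a Newton–Schulz iteration before applying it. A minimal sketch using the classic cubic iteration (the published optimizer uses a tuned quintic polynomial and considerably more machinery; learning rate and beta below are arbitrary):

```python
import numpy as np

def newton_schulz_orth(G, steps=30):
    # Classic cubic Newton-Schulz iteration converging to the orthogonal
    # (polar) factor of G. Frobenius normalization keeps it in the
    # convergence region.
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def muon_step(W, grad, momentum, lr=0.02, beta=0.95):
    # SGD-momentum whose matrix update is orthogonalized before being applied
    momentum = beta * momentum + grad
    W = W - lr * newton_schulz_orth(momentum)
    return W, momentum

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
m = np.zeros_like(W)
W, m = muon_step(W, rng.normal(size=(8, 8)), m)

# The orthogonalized update has (approximately) orthonormal rows
O = newton_schulz_orth(m)
print(np.allclose(O @ O.T, np.eye(8), atol=1e-2))
```

The intuition behind the 1T+ stability claim: orthogonalizing the update equalizes the step across singular directions, so no single direction of a huge weight matrix gets a disproportionately large kick.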

Three reasoning modes

Both V4-Pro and V4-Flash let the user pick from three modes: Non-Think / Think High / Think Max.

| Mode | Behavior | Intended use |
|---|---|---|
| Non-Think | Quick, intuitive | Daily tasks, simple responses |
| Think High | Explicit logical analysis, longer deliberation | Planning, complex reasoning |
| Think Max | Pulls maximum reasoning capacity | Research-grade tasks, ceiling-finding workloads |

Think Max mode is recommended with a context window of at least 384K reserved.
The thinking trace itself emits a long token stream, so a typical 128K window will get cut off mid-thought.

In the chat UI, Expert Mode maps roughly onto Think High / Think Max, and Instant Mode maps onto Non-Think. On the API side, the request specifies thinking_mode directly.
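A sketch of what such a request could look like. Only the thinking_mode field itself comes from the release notes; the model id, mode strings, and endpoint shape below are assumptions in DeepSeek's usual OpenAI-compatible style:

```python
import json

# Hypothetical request payload. "deepseek-v4-pro" and the mode string
# spellings are assumed, not confirmed identifiers.
payload = {
    "model": "deepseek-v4-pro",
    "messages": [{"role": "user", "content": "Plan a 3-step refactor."}],
    "thinking_mode": "think_high",  # e.g. non_think / think_high / think_max (assumed spelling)
    "temperature": 1.0,             # recommended defaults from the model card
    "top_p": 1.0,
}

body = json.dumps(payload)
# Send with any HTTP client, e.g.:
# requests.post("https://api.deepseek.com/chat/completions",
#               headers={"Authorization": f"Bearer {API_KEY}"}, data=body)
```

The practical consequence: mode selection is a per-request field, so one pipeline can route cheap calls to Non-Think and escalate hard ones to Think Max without switching models.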

Benchmarks

The headline numbers, lifted straight from the model cards.

V4-Pro (top configuration: V4-Pro-Max)

| Benchmark | V4-Pro | Notes |
|---|---|---|
| MMLU-Pro | 87.5 | General knowledge / reasoning |
| GPQA Diamond | 90.1 | Graduate-level QA |
| SimpleQA-Verified | 57.9 | Factual |
| LiveCodeBench | 93.5 | Top among compared models |
| Codeforces Rating | 3206 | Competitive programming, top among compared models |
| SWE Verified | 80.6 | Real-repo SWE tasks |
| BrowseComp | 83.4 | Browser-operation agent |
| Toolathlon | 51.8 | Tool-use agent |
| MRCR 1M | 83.5 MMR | Long-context comprehension (1M tokens) |
| CorpusQA 1M | 62.0 ACC | 1M QA |

LiveCodeBench 93.5 and Codeforces 3206 are top-tier even by frontier standards—within trading-blow distance of Claude Opus 4.6 and Gemini 3.1 Pro High.
SWE Verified 80.6 is a hair behind Claude and ties Gemini 3.1 Pro High.
On the other side, SimpleQA-Verified 57.9 and GPQA 90.1 still trail Gemini 3.1 Pro High (75.6 and 94.3 respectively). The push is more about “thinking and acting” than raw factual density.

V4-Flash

V4-Flash puts up surprisingly aggressive numbers for the “smaller” model.

| Benchmark | V4-Flash | V4-Pro |
|---|---|---|
| MMLU-Pro (Non-Think) | 83.0 | 82.9 |
| SimpleQA-Verified (Max Mode) | 34.1 | 57.9 |
| LiveCodeBench (Max Mode) | 91.6 | 93.5 |
| MRCR 1M | 78.7 | 83.5 |

Non-Think MMLU-Pro is essentially tied with V4-Pro, and LiveCodeBench is within a hair.
The gap opens up on “knowledge-density tasks (SimpleQA)” and “1M long-context reading,” but for coding and short-to-medium reasoning, V4-Flash is more than enough. The role split is clean.

Going for this range with only 13B active puts V4-Flash in direct contention with Ant Ling-2.6-flash, which staked out the agent-focused efficiency slot at 104B / 7.4B active. The density race in the Flash tier just got tighter.

Chat template moves to a custom encoder

Starting with V4, the Jinja chat template is gone, replaced by a Python-based custom encoder.

```python
from encoding_dsv4 import encode_messages, parse_message_from_completion_text
from transformers import AutoTokenizer

# Conversation history; reasoning_content carries the Think-mode trace
messages = [
    {"role": "user", "content": "hello"},
    {"role": "assistant", "content": "Hello! I am DeepSeek.", "reasoning_content": "thinking..."},
    {"role": "user", "content": "1+1=?"},
]

# Render the history to a prompt string with the custom encoder
prompt = encode_messages(messages, thinking_mode="thinking")

# Then tokenize with the model's tokenizer as usual
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V4-Pro")
tokens = tokenizer.encode(prompt)
```

reasoning_content is carried directly inside conversation history, so multi-turn handoff of Think-mode reasoning traces becomes natural—but the receiving side (Transformers / vLLM / SGLang and so on) has to align with this encoder.
When grafting V4 into an existing pipeline, the chat template plumbing should be the first thing to suspect.

The recommended sampling defaults are temperature = 1.0, top_p = 1.0, which is the opposite direction from the usual “lower the temperature for stability” school. Worth keeping in mind.

Distribution and license

  • Weights: deepseek-ai/DeepSeek-V4-Pro / deepseek-ai/DeepSeek-V4-Flash (Hugging Face, MIT)
  • Base versions: -Base suffix, FP8-only
  • API: official DeepSeek API updated day one, switchable between Expert Mode and Instant Mode
  • Web UI: chat.deepseek.com
  • ModelScope: a Flash-version mirror is up

Shipping under MIT all the way down to the Base version is the same playbook as DeepSeek-OCR—DeepSeek’s open-by-default stance carries through to the V4 generation.
For now this is “Preview”-labeled, so behavior and numbers may shift in the official release.


V4-Pro’s 1.6T / 49B configuration could be read as “yet another Chinese model stacking parameters.” In practice, it’s a fairly applied frontier model: CSA+HCA pulls 1M-context FLOPs down to 27%, while still putting up Codeforces 3206 / LiveCodeBench 93.5.
With Flash hitting the same 1M context at 13B active, the baseline of “Chinese open frontier” has clearly moved up from the V3 benchmark era.

Can you run this at home?

A rough sizing for whether home GPU rigs can touch these models.

V4-Pro (1.6T / 49B active)

Even with FP4+FP8 mixed precision, 1.6T total parameters demands roughly 800GB–1TB of storage / memory just to load the weights.
That fits onto roughly ten H100 80GB cards stacked together. A single Mac Studio M3 Ultra 512GB cannot hold it.
Touching V4-Pro on a personal rig is not realistic. The fastest path is the official DeepSeek API or chat.deepseek.com.

V4-Flash (284B / 13B active)

This one lands around 140–160GB with FP4-heavy placement.
Once third-party Q4-equivalent quantizations land on Hugging Face, that number can come down further.
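A rough sizing sketch behind those numbers. The 90/10 FP4/FP8 split and the bits-per-weight figure for a Q3-class community build are illustrative assumptions:

```python
# Back-of-envelope for V4-Flash (284B total params).
# The 90/10 FP4/FP8 split and 3.5 bits/param for a Q3-class community
# quantization are illustrative assumptions, not published figures.
PARAMS = 284e9
mixed_gb = PARAMS * (0.90 * 0.5 + 0.10 * 1.0) / 1e9  # FP4 experts + FP8 rest
q3_gb = PARAMS * (3.5 / 8) / 1e9                     # Q3-class community build

print(f"FP4/FP8 mixed: ~{mixed_gb:.0f} GB")  # ~156 GB, inside the 140-160 GB band
print(f"Q3-class:      ~{q3_gb:.0f} GB")     # ~124 GB
```

Note that a typical Q4 build spends more than 4 bits per weight once scales and outliers are included, so the meaningful savings over the original FP4-heavy layout only arrive at Q3-class bit widths.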

  • RTX 5090 32GB alone: nowhere near fits in VRAM. CPU offload is mandatory, with non-active experts pushed out to DDR5 192GB-class system memory. Whether it runs at usable speed is doubtful
  • Mac Studio M3 Ultra 512GB: one of the few personal-tier setups that can hold the full model in unified memory. With 13B active in an MoE, single-digit tok/s should be reachable
  • 2–3x H100 80GB: tensor-parallel layout works, but the power, noise, and heat are not home-friendly

Even with FLOPs down to 27%, expanding a full 1M context into KV cache eats memory steadily.
For personal rigs, the safer path is to verify stability at 128K–256K first and only then stretch the context out.

What about my own setup

Mapping this onto the machines I actually own.

  • V4-Pro: out of the question. Even at FP4+FP8, the 1.6T weights are effectively 800GB–1TB-class, which a 512GB unified memory cannot hold. The only options are chat.deepseek.com or the official API
  • V4-Flash: about 140–160GB with FP4-heavy placement. A Mac Studio M3 Ultra 512GB-class machine could load it at original size, but my single RTX-class GPU can’t. CPU offload is rough, too—MoE expert routing flips constantly, which jams up PCIe bandwidth, so practical speed is harsh
  • Wait for quantization: for personal-hardware folks, the realistic first step is waiting for Unsloth / MLX / GGUF crews to push Q4 / Q3 builds
  • 1M operation: KV cache starts to dominate, so even with FP4 weights loaded, easing in from a 128K context is safer than going straight to 1M

A pragmatic sequence: check behavior on chat.deepseek.com → test thinking_mode switching via API → run locally once a quantized build appears.
There’s not much point in trying to drag down the original-size weights for personal use.