
DeepSeek V4 Preview Lands with 1M Context: V4-Pro 1.6T / V4-Flash 284B Open-Sourced under MIT, FLOPs Cut to 27% of V3.2

Ikesan

On April 24, 2026, DeepSeek released the long-rumored Preview build of the V4 series.
The lineup is a two-pronged release: DeepSeek-V4-Pro (1.6T total parameters / 49B active) and DeepSeek-V4-Flash (284B / 13B active). Both support a 1M context window, and the weights are up on Hugging Face under MIT.
The API was updated at the same time, and chat.deepseek.com now exposes a switch between Expert Mode (deliberation) and Instant Mode (fast response) on day one.

Over the past week or two, frontier-class Chinese models have been landing back-to-back: Tencent Hy3-preview and Ant Ling-2.6-flash, Qwen3.6-Max-Preview and Kimi K2.6, Xiaomi MiMo-V2.5 and V2.5-Pro, and Zhipu AI’s GLM-5.1.
In every comparison, “DeepSeek-V3 family” has been the reference point on the benchmark axis. Now the actual DeepSeek team has stepped onto the same ring, in the form of V4 Preview.

Two-model positioning

V4-Pro and V4-Flash share the same architectural lineage, split into a “frontier tier” and an “efficiency tier.”

```mermaid
flowchart LR
  V4[DeepSeek V4 Preview] --> P[V4-Pro<br/>1.6T / 49B active<br/>Top tier<br/>Think Max ready]
  V4 --> F[V4-Flash<br/>284B / 13B active<br/>Fast & low-cost<br/>Economy-focused]
  P --> PC[chat.deepseek.com<br/>Expert Mode]
  F --> FC[chat.deepseek.com<br/>Instant Mode]
  P --> API[API updated day one]
  F --> API
```

V4-Pro is positioned as the top-tier cloud model “rivaling the world’s top closed-source models,” and V4-Flash is the “fast, efficient, and economical option.”
Unlike the Qwen3.6 setup, where Max-Preview was kept closed and only the 35B-A3B was opened, DeepSeek is shipping both Pro and Flash as open weights under MIT—frontier tier included.

Specs and base architecture

The numbers from the model cards, side by side.

| Item | V4-Pro | V4-Flash |
|---|---|---|
| Total parameters | 1.6T | 284B |
| Active parameters / token | 49B | 13B |
| Context length | 1M | 1M |
| Architecture | Fine-grained MoE | Fine-grained MoE |
| Precision | FP4 + FP8 mixed | FP4 + FP8 mixed |
| Pre-training tokens | 32T+ | 32T+ |
| License | MIT | MIT |
| Post-training | Two stages (expert SFT + GRPO → on-policy distillation merge) | Same |

The FP4+FP8 mixed precision is quietly load-bearing. MoE expert parameters are kept in FP4, everything else in FP8.
1.6T total parameters sounds enormous, but with FP4/FP8 the on-disk footprint comes in at less than half of a naive FP16 layout.
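As a back-of-envelope check, the footprint math works out roughly as follows. The 90/10 expert/non-expert split is an illustrative assumption, not a published figure:

```python
# Back-of-envelope weight footprint for V4-Pro (1.6T params).
# The ~90% expert share is an illustrative assumption, not a published figure.
TOTAL_PARAMS = 1.6e12
EXPERT_SHARE = 0.90   # assumed fraction of parameters living in MoE experts
FP4_BYTES = 0.5       # 4 bits per parameter
FP8_BYTES = 1.0       # 8 bits per parameter
FP16_BYTES = 2.0

mixed_gb = TOTAL_PARAMS * (EXPERT_SHARE * FP4_BYTES + (1 - EXPERT_SHARE) * FP8_BYTES) / 1e9
fp16_gb = TOTAL_PARAMS * FP16_BYTES / 1e9

print(f"FP4/FP8 mixed: ~{mixed_gb:.0f} GB")  # ~880 GB
print(f"naive FP16:    ~{fp16_gb:.0f} GB")   # ~3200 GB
```

Under these assumptions the mixed-precision layout lands well under half the naive FP16 size, consistent with the sub-1TB figure quoted later for V4-Pro.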

The total-to-active ratio reads more clearly next to the other models from the same week—Hy3 preview (295B/21B) and Ling-2.6-flash (104B/7.4B).

| Model | Total / Active | Context | License |
|---|---|---|---|
| DeepSeek-V4-Pro | 1.6T / 49B | 1M | MIT |
| DeepSeek-V4-Flash | 284B / 13B | 1M | MIT |
| DeepSeek-V3 family | 671B / 37B | 128K | MIT |
| Tencent Hy3 preview | 295B / 21B | 256K | Custom (Hy Community) |
| Zhipu GLM-5.1 | 744B / 40B | 200K | MIT |
| Ant Ling-2.6-flash | 104B / 7.4B | | MIT |

V3 sat at “671B / 37B active, 128K,” so V4-Pro roughly 2.4x’s the total parameter count, holds active parameters down at 49B, and pushes context all the way out to 1M in one jump.
The “flagship = grow total, keep active flat-ish, bet hard on long context” direction matches what GLM-5.1 did at 744B/40B by adding DSA to reach 200K.
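A quick sanity check on those ratios, straight from the spec-sheet numbers (treating both context lengths as round token counts):

```python
# Scaling jump from V3 (671B / 37B active, 128K) to V4-Pro (1.6T / 49B active, 1M)
v3_total, v4_total = 671e9, 1.6e12
v3_active, v4_active = 37e9, 49e9
v3_ctx, v4_ctx = 128_000, 1_000_000

print(f"total params:  x{v4_total / v3_total:.1f}")    # x2.4
print(f"active params: x{v4_active / v3_active:.2f}")  # x1.32
print(f"context:       x{v4_ctx / v3_ctx:.1f}")
```

Total parameters scale 2.4x while active parameters grow only about 1.3x, which is the "grow total, keep active flat-ish" pattern in numbers.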

The headline: CSA + HCA hybrid attention

The biggest change in the V4 series is the attention mechanism redesign.
The model card calls it a Hybrid Attention Architecture that combines two attention types.

| Name | Short | Role |
|---|---|---|
| Compressed Sparse Attention | CSA | Main attention: compression and sparsification combined |
| Heavily Compressed Attention | HCA | Sub-attention: aggressive compression at long-context scale |

Where Hy3 preview pushed on GQA and MTP, V4 takes a more aggressive step: split attention itself into two and run them as a hybrid.
The effective cost at 1M context comes out as follows.

| Metric | vs DeepSeek-V3.2 |
|---|---|
| Per-token inference FLOPs | 27% |
| KV cache | 10% |

Running a 1M context as a genuine 1M context means plain attention hits both quadratic compute and an exploding KV cache.
V4 attacks both by reshaping the attention hierarchy itself. Among models that put “1M support” on the spec sheet, V4 lands in the class that can actually use 1M without compute or memory blowing up.
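To get a feel for why the 10% KV figure matters at this scale, here is an illustrative sizing sketch. The layer/head/dim geometry below is an assumed stand-in for a dense GQA-style baseline, not V4's actual configuration:

```python
# Illustrative KV-cache sizing at 1M context. All geometry numbers are
# assumed stand-ins for a dense GQA-style cache, not V4's real architecture.
CTX = 1_000_000
LAYERS = 60      # assumed
KV_HEADS = 8     # assumed
HEAD_DIM = 128   # assumed
BYTES = 1        # FP8 cache entries, assumed

# Dense baseline: K and V stored for every token at every layer
baseline_gb = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES * CTX / 1e9
# V4's reported ratio vs V3.2: KV cache down to 10%
v4_gb = baseline_gb * 0.10

print(f"dense-style baseline at 1M: ~{baseline_gb:.0f} GB")
print(f"at 10% of that:             ~{v4_gb:.1f} GB")
```

Even with made-up geometry, the shape of the result holds: a full 1M dense-style cache is a three-digit-GB object per sequence, and a 10x reduction is the difference between "impossible" and "fits next to the weights."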

mHC takes care of residuals

The other new piece is Manifold-Constrained Hyper-Connections (mHC).
The standard Transformer residual connection is replaced with a hyper-connection constrained on a manifold, with the goal of stabilizing inter-layer signal propagation.

Swapping out residuals as a direction itself echoes Moonshot’s “Block AttnRes” depth-wise hyper-connections that landed in Kimi Linear.
mHC adds the “manifold constraint” on top, designed to preserve gradient flow and representational power even at the deeper, longer 1M-class scale.
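As rough intuition for the general idea only (not DeepSeek's actual mHC, whose details aren't public): a hyper-connection keeps several parallel residual streams and mixes them with learned weights. In this toy sketch, row-normalizing the mixing matrix stands in for the manifold constraint, keeping the mixing step from inflating signal magnitude:

```python
import numpy as np

# Toy sketch of a hyper-connection-style residual (NOT DeepSeek's actual mHC).
# A plain residual is y = x + f(x). A hyper-connection keeps n parallel
# streams and mixes them with learned weights; here a row-normalization
# stands in for the "manifold constraint," so each output stream is a
# convex combination of input streams.
rng = np.random.default_rng(0)
n, d = 4, 16  # number of streams, hidden size

def f(x):
    # Stand-in for a transformer sub-layer
    return np.tanh(x)

def constrain(W):
    # Project mixing weights onto the probability simplex row-wise
    W = np.abs(W)
    return W / W.sum(axis=1, keepdims=True)

W_mix = constrain(rng.normal(size=(n, n)))  # stream-mixing weights
streams = rng.normal(size=(n, d))

layer_out = f(streams.mean(axis=0))     # sub-layer sees the pooled streams
streams = W_mix @ streams + layer_out   # mix streams, then add layer output

# Each row of W_mix sums to 1, so the mixing step alone never amplifies signal
print(np.allclose(W_mix.sum(axis=1), 1.0))
```

The real mHC presumably picks a more principled manifold than the simplex; the point of the sketch is only that constraining the mixing weights is what keeps deep, wide residual routing stable.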

Muon optimizer

Both pre-training and post-training explicitly use the Muon optimizer.
Picking Muon over the AdamW family is a trend that keeps showing up in Kimi and Qwen post-training designs, and it’s becoming the practical answer when you want both convergence and training stability past the 1T-parameter mark.
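For reference, Muon's core move is orthogonalizing the momentum update of each 2D weight matrix via a Newton–Schulz iteration before applying it. A minimal sketch using the classic cubic iteration (the published optimizer uses a tuned quintic polynomial and considerably more machinery; learning rate and beta below are arbitrary):

```python
import numpy as np

def newton_schulz_orth(G, steps=30):
    # Classic cubic Newton-Schulz iteration converging to the orthogonal
    # (polar) factor of G. Frobenius normalization keeps it in the
    # convergence region.
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def muon_step(W, grad, momentum, lr=0.02, beta=0.95):
    # SGD-momentum whose matrix update is orthogonalized before being applied
    momentum = beta * momentum + grad
    W = W - lr * newton_schulz_orth(momentum)
    return W, momentum

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
m = np.zeros_like(W)
W, m = muon_step(W, rng.normal(size=(8, 8)), m)

# The orthogonalized update has (approximately) orthonormal rows
O = newton_schulz_orth(m)
print(np.allclose(O @ O.T, np.eye(8), atol=1e-2))
```

The intuition behind the 1T+ stability claim: orthogonalizing the update equalizes the step across singular directions, so no single direction of a huge weight matrix gets a disproportionately large kick.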

Three reasoning modes

Both V4-Pro and V4-Flash let the user pick from three modes: Non-Think / Think High / Think Max.

| Mode | Behavior | Intended use |
|---|---|---|
| Non-Think | Quick, intuitive | Daily tasks, simple responses |
| Think High | Explicit logical analysis, longer deliberation | Planning, complex reasoning |
| Think Max | Pulls maximum reasoning capacity | Research-grade tasks, ceiling-finding workloads |

Think Max mode is recommended with a context window of at least 384K reserved.
The thinking trace itself emits a long token stream, so a typical 128K window will get cut off mid-thought.

In the chat UI, Expert Mode maps roughly onto Think High / Think Max, and Instant Mode maps onto Non-Think. On the API side, the request specifies thinking_mode directly.
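A sketch of what such a request could look like. Only the thinking_mode field itself comes from the release notes; the model id, mode strings, and endpoint shape below are assumptions in DeepSeek's usual OpenAI-compatible style:

```python
import json

# Hypothetical request payload. "deepseek-v4-pro" and the mode string
# spellings are assumed, not confirmed identifiers.
payload = {
    "model": "deepseek-v4-pro",
    "messages": [{"role": "user", "content": "Plan a 3-step refactor."}],
    "thinking_mode": "think_high",  # e.g. non_think / think_high / think_max (assumed spelling)
    "temperature": 1.0,             # recommended defaults from the model card
    "top_p": 1.0,
}

body = json.dumps(payload)
# Send with any HTTP client, e.g.:
# requests.post("https://api.deepseek.com/chat/completions",
#               headers={"Authorization": f"Bearer {API_KEY}"}, data=body)
```

The practical consequence: mode selection is a per-request field, so one pipeline can route cheap calls to Non-Think and escalate hard ones to Think Max without switching models.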

Benchmarks

The headline numbers, lifted straight from the model cards.

V4-Pro (top configuration: V4-Pro-Max)

| Benchmark | V4-Pro | Notes |
|---|---|---|
| MMLU-Pro | 87.5 | General knowledge / reasoning |
| GPQA Diamond | 90.1 | Graduate-level QA |
| SimpleQA-Verified | 57.9 | Factual |
| LiveCodeBench | 93.5 | Top among compared models |
| Codeforces Rating | 3206 | Competitive programming, top among compared models |
| SWE Verified | 80.6 | Real-repo SWE tasks |
| BrowseComp | 83.4 | Browser-operation agent |
| Toolathlon | 51.8 | Tool-use agent |
| MRCR 1M | 83.5 MMR | Long-context comprehension (1M tokens) |
| CorpusQA 1M | 62.0 ACC | 1M QA |

LiveCodeBench 93.5 and Codeforces 3206 are top-tier even by frontier standards—within trading-blow distance of Claude Opus 4.6 and Gemini 3.1 Pro High.
SWE Verified 80.6 is a hair behind Claude and ties Gemini 3.1 Pro High.
On the other side, SimpleQA-Verified 57.9 and GPQA 90.1 still trail Gemini 3.1 Pro High (75.6 and 94.3 respectively). The push is more about “thinking and acting” than raw factual density.

V4-Flash

V4-Flash puts up surprisingly aggressive numbers for the “smaller” model.

| Benchmark | V4-Flash | V4-Pro |
|---|---|---|
| MMLU-Pro (Non-Think) | 83.0 | 82.9 |
| SimpleQA-Verified (Max Mode) | 34.1 | 57.9 |
| LiveCodeBench (Max Mode) | 91.6 | 93.5 |
| MRCR 1M | 78.7 | 83.5 |

Non-Think MMLU-Pro is essentially tied with V4-Pro, and LiveCodeBench is within a hair.
The gap opens up on “knowledge-density tasks (SimpleQA)” and “1M long-context reading,” but for coding and short-to-medium reasoning, V4-Flash is more than enough. The role split is clean.

Going for this range with only 13B active puts V4-Flash in direct contention with Ant Ling-2.6-flash, which staked out the agent-focused efficiency slot at 104B / 7.4B active. The density race in the Flash tier just got tighter.

Chat template moves to a custom encoder

Starting with V4, the Jinja chat template is gone, replaced by a Python-based custom encoder.

```python
from encoding_dsv4 import encode_messages, parse_message_from_completion_text
from transformers import AutoTokenizer

# Conversation history; reasoning_content carries the Think-mode trace
messages = [
    {"role": "user", "content": "hello"},
    {"role": "assistant", "content": "Hello! I am DeepSeek.", "reasoning_content": "thinking..."},
    {"role": "user", "content": "1+1=?"},
]

# Render the history to a prompt string with the custom encoder
prompt = encode_messages(messages, thinking_mode="thinking")

# Then tokenize with the model's tokenizer as usual
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V4-Pro")
tokens = tokenizer.encode(prompt)
```

reasoning_content is carried directly inside conversation history, so multi-turn handoff of Think-mode reasoning traces becomes natural—but the receiving side (Transformers / vLLM / SGLang and so on) has to align with this encoder.
When grafting V4 into an existing pipeline, the chat template plumbing should be the first thing to suspect.

The recommended sampling defaults are temperature = 1.0, top_p = 1.0, which is the opposite direction from the usual “lower the temperature for stability” school. Worth keeping in mind.

Distribution and license

  • Weights: deepseek-ai/DeepSeek-V4-Pro / deepseek-ai/DeepSeek-V4-Flash (Hugging Face, MIT)
  • Base versions: -Base suffix, FP8-only
  • API: official DeepSeek API updated day one, switchable between Expert Mode and Instant Mode
  • Web UI: chat.deepseek.com
  • ModelScope: a Flash-version mirror is up

Shipping under MIT all the way down to the Base version is the same playbook as DeepSeek-OCR—DeepSeek’s open-by-default stance carries through to the V4 generation.
For now this is “Preview”-labeled, so behavior and numbers may shift in the official release.


V4-Pro’s 1.6T / 49B configuration could be read as “yet another Chinese model stacking parameters.” In practice, it’s a fairly applied frontier model: CSA+HCA pulls 1M-context FLOPs down to 27%, while still putting up Codeforces 3206 / LiveCodeBench 93.5.
With Flash hitting the same 1M context at 13B active, the baseline of “Chinese open frontier” has clearly moved up from the V3 benchmark era.

Can you run this at home?

A rough sizing for whether home GPU rigs can touch these models.

V4-Pro (1.6T / 49B active)

Even with FP4+FP8 mixed precision, 1.6T total parameters demands roughly 800GB–1TB of storage / memory just to load the weights.
That fits onto roughly ten H100 80GB cards stacked together. A single Mac Studio M3 Ultra 512GB cannot hold it.
Touching V4-Pro on a personal rig is not realistic. The fastest path is the official DeepSeek API or chat.deepseek.com.

V4-Flash (284B / 13B active)

This one lands around 140–160GB with FP4-heavy placement.
Once third-party Q4-equivalent quantizations land on Hugging Face, that number can come down further.
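A rough sizing sketch behind those numbers. The 90/10 FP4/FP8 split and the bits-per-weight figure for a Q3-class community build are illustrative assumptions:

```python
# Back-of-envelope for V4-Flash (284B total params).
# The 90/10 FP4/FP8 split and 3.5 bits/param for a Q3-class community
# quantization are illustrative assumptions, not published figures.
PARAMS = 284e9
mixed_gb = PARAMS * (0.90 * 0.5 + 0.10 * 1.0) / 1e9  # FP4 experts + FP8 rest
q3_gb = PARAMS * (3.5 / 8) / 1e9                     # Q3-class community build

print(f"FP4/FP8 mixed: ~{mixed_gb:.0f} GB")  # ~156 GB, inside the 140-160 GB band
print(f"Q3-class:      ~{q3_gb:.0f} GB")     # ~124 GB
```

Note that a typical Q4 build spends more than 4 bits per weight once scales and outliers are included, so the meaningful savings over the original FP4-heavy layout only arrive at Q3-class bit widths.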

  • RTX 5090 32GB alone: nowhere near fits in VRAM. CPU offload is mandatory, with non-active experts pushed out to DDR5 192GB-class system memory. Whether it runs at usable speed is doubtful
  • Mac Studio M3 Ultra 512GB: one of the few personal-tier setups that can hold the full model in unified memory. With 13B active in an MoE, single-digit tok/s should be reachable
  • 2–3x H100 80GB: tensor-parallel layout works, but the power, noise, and heat are not home-friendly

Even with FLOPs down to 27%, expanding a full 1M context into KV cache eats memory steadily.
For personal rigs, the safer path is to verify stability at 128K–256K first and only then stretch the context out.

What about my own setup

Mapping this onto the machines I actually own.

  • V4-Pro: out of the question. Even at FP4+FP8, the 1.6T weights are effectively 800GB–1TB-class, which a 512GB unified memory cannot hold. The only options are chat.deepseek.com or the official API
  • V4-Flash: about 140–160GB with FP4-heavy placement. A Mac Studio M3 Ultra 512GB-class machine could load it at original size, but my single RTX-class GPU can’t. CPU offload is rough, too—MoE expert routing flips constantly, which jams up PCIe bandwidth, so practical speed is harsh
  • Wait for quantization: for personal-hardware folks, the realistic first step is waiting for Unsloth / MLX / GGUF crews to push Q4 / Q3 builds
  • 1M operation: KV cache starts to dominate, so even with FP4 weights loaded, easing in from a 128K context is safer than going straight to 1M

A pragmatic sequence: check behavior on chat.deepseek.com → test thinking_mode switching via API → run locally once a quantized build appears.
There’s not much point in trying to drag down the original-size weights for personal use.