
DeepSeek-V4-Pro-DSpark is not a new model but a speculative-decoding V4-Pro

Ikesan

deepseek-ai/DeepSeek-V4-Pro-DSpark showed up on Hugging Face.
The name alone makes it look like a new Pro version of DeepSeek V4, but the model card states plainly that it is not a new model.
It’s the same DeepSeek-V4-Pro checkpoint with an extra module for speculative decoding bolted on — a derivative repository, not a new base model.

In the DeepSeek V4 Preview post I wrote in April, I went through the 1.6T total parameters, 49B active, 1M context, and the CSA+HCA long-context attention.
DSpark isn’t an announcement that pushes that base capability further; the focus here is the machinery for drafting several tokens ahead at decode time.

What DSpark actually adds

Speculative decoding is an inference technique where a lightweight draft model proposes several tokens ahead and the main model verifies and accepts them.
Instead of calling the main model once per token, each verification pass can emit several tokens, so decode latency shrinks as more drafted tokens get accepted.
Quality is verified by the main model, so the draft side’s job is to be “easy to hit, cheap, and fast.”
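
To put a rough number on "however many tokens get accepted", here is a back-of-the-envelope sketch. It assumes every drafted token is accepted independently with the same probability `alpha`, which is a textbook simplification and not something DSpark's model card states. The function name and parameters are mine.

```python
def expected_tokens_per_step(alpha: float, block_size: int) -> float:
    """Expected tokens emitted per main-model pass.

    Assumption: each drafted token is accepted independently with
    probability alpha. The main model always contributes one token of its
    own (the correction after a miss, or the extra token after a full match).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if block_size < 0:
        raise ValueError("block_size must be non-negative")
    # 1 guaranteed token, plus alpha**i for the chance that draft i survives.
    return sum(alpha ** i for i in range(block_size + 1))


if __name__ == "__main__":
    for alpha in (0.5, 0.7, 0.9):
        print(alpha, round(expected_tokens_per_step(alpha, 5), 2))
```

With a hypothetical `alpha` of 0.8 and a block of 5 this works out to roughly 3.69 tokens per pass. Real speedup is lower than that figure, because the draft head and the wider verification pass both cost time.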

Verification happens in a single forward pass of the main model.
The draft’s candidate sequence is run through the main model, and tokens are accepted from the front for as long as they agree with what the main model would have produced.
The first position that diverges is filled in by the main model’s own output, so under greedy decoding the token sequence that comes out is identical to running the main model alone (with sampling, the standard accept/reject rule preserves the main model’s output distribution instead).
The premise of the method is that you only get faster, without losing quality.
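
The accept-from-the-front rule is easy to check on a toy. The sketch below covers the greedy case only, with a made-up deterministic "main model" (`target_next`) and a made-up draft that is deliberately wrong now and then (`draft_block`). None of these names come from DeepSeek's code.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size, hypothetical


def target_next(prefix: List[int]) -> int:
    """Toy stand-in for the main model's greedy next token."""
    return (sum(prefix) * 31 + len(prefix)) % VOCAB


def draft_block(prefix: List[int], k: int) -> List[int]:
    """Toy draft: copies the target but misses on purpose now and then."""
    ctx = list(prefix)
    block = []
    for _ in range(k):
        tok = target_next(ctx)
        if len(ctx) % 4 == 3:
            tok = (tok + 1) % VOCAB  # inject a miss
        block.append(tok)
        ctx.append(tok)
    return block


def speculative_step(prefix: List[int], k: int) -> List[int]:
    """Accept the matching prefix of the draft, then one main-model token."""
    ctx = list(prefix)
    accepted = []
    # A real system gets all k + 1 target predictions from one forward pass;
    # the toy target is called in a loop only to keep the sketch small.
    for tok in draft_block(prefix, k):
        wanted = target_next(ctx)
        if tok != wanted:
            accepted.append(wanted)  # main model fixes the first miss
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # everything matched: one extra token
    return accepted


def generate_speculative(prompt: List[int], n_new: int, k: int = 5) -> Tuple[List[int], int]:
    """Generate n_new tokens; also return how many verification steps ran."""
    seq = list(prompt)
    steps = 0
    while len(seq) - len(prompt) < n_new:
        seq.extend(speculative_step(seq, k))
        steps += 1
    return seq[: len(prompt) + n_new], steps


def generate_plain(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one main-model call per token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq
```

Calling `generate_speculative([1, 2, 3], 40)` returns the same tokens as `generate_plain([1, 2, 3], 40)` while taking fewer steps than tokens, which is the whole point: the draft only changes how many main-model passes are needed, not what comes out.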

flowchart TD
    A[Confirmed token sequence] --> B[DSpark draft head<br/>proposes 5 tokens at once]
    B --> C[V4-Pro verifies all 5<br/>in a single pass]
    C --> D{How many match<br/>from the front?}
    D -->|All match| E[Accept 5 tokens<br/>main model adds one more]
    D -->|Diverges midway| F[Accept the matching prefix<br/>main model fixes the rest]
    E --> A
    F --> A

DeepSpec is the codebase DeepSeek points to for training and evaluating these draft models.
It handles three kinds of draft: Eagle3, from the externally proposed EAGLE line, alongside DeepSeek’s own DSpark and DFlash.
The pipeline runs in three stages — data prep, training, evaluation. First you regenerate target outputs with the main model and cache them, then train the draft against that cache as the teacher.
The README warns that even the default example, which uses the small Qwen/Qwen3-4B as the target, produces a target cache of around 38TB. Put a 1.6T-class model like V4-Pro in the target slot and the cost is obviously higher. This is not a tool for casually retraining a draft on whatever hardware you have on hand.

DSpark’s config.json for V4-Pro adds settings the regular V4-Pro doesn’t have.

| Key | DSpark value | Meaning |
| --- | --- | --- |
| dspark_block_size | 5 | Candidate tokens drafted at once |
| dspark_target_layer_ids | 58, 59, 60 | Which late layers of the main model the hidden states come from |
| dspark_markov_rank | 512 | Low-rank representation on the Markov head |
| dspark_noise_token_id | 128799 | Dedicated token used to pad the draft input |
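
If you want to tell a DSpark config from a regular V4-Pro one programmatically, a small sketch like the following is enough. The JSON fragment only contains the four values from the table. The rest of the real config.json is left out, and the helper names are mine.

```python
import json

# Only the four DSpark-specific entries from the table; the real
# config.json contains many more keys that are omitted here.
DSPARK_CONFIG_FRAGMENT = """
{
  "dspark_block_size": 5,
  "dspark_target_layer_ids": [58, 59, 60],
  "dspark_markov_rank": 512,
  "dspark_noise_token_id": 128799
}
"""


def dspark_settings(config: dict) -> dict:
    """Return only the dspark_* entries of a parsed config.json."""
    return {key: value for key, value in config.items() if key.startswith("dspark_")}


def looks_like_dspark(config: dict) -> bool:
    """True when the config carries at least one dspark_* entry."""
    return bool(dspark_settings(config))


if __name__ == "__main__":
    parsed = json.loads(DSPARK_CONFIG_FRAGMENT)
    print(dspark_settings(parsed))
    print(looks_like_dspark({"num_nextn_predict_layers": 1}))
```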

In inference/model.py, DSparkBlock lives under the mtp.* namespace, and forward_spec returns the draft candidates, logits, and confidence.
The hidden states taken from the main model’s late layers are merged by main_proj, a draft sequence padded with the noise token is built, and 5 tokens’ worth of candidates are generated.
The Markov head adds a bias derived from the previous token, and the confidence head emits a per-candidate acceptance likelihood.
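
To make "a bias derived from the previous token" concrete, here is one way a low-rank Markov bias could work. This is my own toy reading, not DeepSeek's implementation: the factorization into `left` and `right`, the argmax selection, and all function names are assumptions, and the confidence head is left out entirely.

```python
from typing import List

Matrix = List[List[float]]


def markov_bias(prev_token: int, left: Matrix, right: Matrix) -> List[float]:
    """Bias over the vocabulary, given the previous token.

    left is (vocab x rank) and right is (rank x vocab), so the full
    (vocab x vocab) transition table is never stored: 2 * vocab * rank
    numbers instead of vocab * vocab.
    """
    row = left[prev_token]
    vocab = len(right[0])
    return [sum(row[r] * right[r][v] for r in range(len(row))) for v in range(vocab)]


def draft_with_markov_bias(
    last_token: int, base_logits: Matrix, left: Matrix, right: Matrix
) -> List[int]:
    """Pick one candidate per position, biasing each by the token before it."""
    candidates = []
    prev = last_token
    for logits in base_logits:
        bias = markov_bias(prev, left, right)
        scores = [logit + b for logit, b in zip(logits, bias)]
        prev = max(range(len(scores)), key=scores.__getitem__)
        candidates.append(prev)
    return candidates


if __name__ == "__main__":
    # Toy setup: vocab of 3, rank 1. After token 0, token 1 gets a +5 bias.
    left = [[1.0], [0.0], [0.0]]
    right = [[0.0, 5.0, 0.0]]
    base = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
    print(draft_with_markov_bias(0, base, left, right))
```

In the toy run, the bias flips the first candidate from token 0 to token 1, while the second position is left to its base logits. The rank of 512 in config.json would correspond to the inner dimension of `left` and `right` under this reading.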

File size actually goes up

By the Hugging Face API, the regular DeepSeek-V4-Pro has a usedStorage of about 864.8GB, and the DSpark version about 892.8GB.
Speculative decoding is a mechanism for cutting inference latency, not for shrinking the distributed weight files.
The DSpark version is about 28GB larger, which is the added draft modules.

The model body’s specs are still V4-Pro’s.
1.6T total parameters, 49B active, max position embeddings of 1,048,576, 384 MoE experts, 6 experts per token.
Weights are mostly FP8, with the MoE experts in FP4.

Hugging Face’s parameter count shows about 861.6B for V4-Pro and about 889.5B for the DSpark version. That doesn’t line up with the 1.6T headline, but it’s a matter of how things are counted.
FP4 expert weights are packed two per byte and stored as int8, and the metadata counts packed elements, not individual weights.
In V4-Pro’s breakdown, int8 accounts for about 786.4B elements; unpack that 2x and the experts alone come to about 1.57T. Add the remaining FP8 and BF16 and you land back near the 1.6T headline.
The DSpark version’s element count being higher isn’t because it became a different model; it’s just the draft head added on top.
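
The reconciliation is just arithmetic, so here it is spelled out. The inputs are the figures quoted above; the one assumption is that everything outside the int8 bucket is stored one weight per element.

```python
# Figures quoted above, in billions of stored elements.
V4_PRO_STORED = 861.6
V4_PRO_INT8_PACKED = 786.4  # FP4 experts, two weights per int8 element
DSPARK_STORED = 889.5


def logical_params(stored_total: float, packed_int8: float, weights_per_element: int = 2) -> float:
    """Unpack the int8 bucket and add back the rest (FP8, BF16, ...).

    Assumes the non-int8 remainder holds one weight per stored element.
    """
    remainder = stored_total - packed_int8
    return packed_int8 * weights_per_element + remainder


if __name__ == "__main__":
    experts = V4_PRO_INT8_PACKED * 2
    total = logical_params(V4_PRO_STORED, V4_PRO_INT8_PACKED)
    draft_extra = DSPARK_STORED - V4_PRO_STORED
    print(f"experts after unpacking: {experts:.1f}B")
    print(f"logical total:           {total:.1f}B")
    print(f"extra stored in DSpark:  {draft_extra:.1f}B")
```

That gives about 1,572.8B for the experts, about 1,648B overall, and about 27.9B extra stored elements in the DSpark repository.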

inference/README.md also makes clear the bar for running this locally isn’t low.
You first convert the Hugging Face-format weights into the project’s own format, then run torchrun with, in their example, EXPERTS=256 and MP=4.
To use FP8, you drop expert_dtype: "fp4" from config.json and pass --expert-dtype fp8 at conversion time.
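
The config edit for the FP8 route can be scripted. This sketch only covers the config.json side, working on a parsed dict; the conversion command itself is not reproduced here, and the helper name is mine.

```python
import copy


def config_for_fp8_experts(config: dict) -> dict:
    """Return a copy of a parsed config.json without the FP4 expert setting."""
    updated = copy.deepcopy(config)
    # Only drop the key when it carries the FP4 value the README refers to.
    if updated.get("expert_dtype") == "fp4":
        del updated["expert_dtype"]
    return updated


if __name__ == "__main__":
    original = {"expert_dtype": "fp4", "num_nextn_predict_layers": 1}
    print(config_for_fp8_experts(original))
    print(original)  # the input dict is left untouched
```

Working on a copy keeps the original config around, which is handy if you want to convert both variants from the same download.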

This isn’t a story of “DSpark landing made V4-Pro easier to touch on a personal PC.”
You need a GPU setup that can host V4-Pro first; on top of that, this is a derivative for shaving decode latency.

A separate look-ahead path from the existing MTP

V4-Pro’s config already has num_nextn_predict_layers: 1. This is the main model’s own MTP (multi-token prediction): a single layer that lets the main model also guess one token further ahead while it emits the next one.
DSpark is a separate look-ahead path. In model.py, a ModuleList called self.mtp is built independently of the main model’s layers, and DSparkBlocks sit in it.

The number of stages is set by an inference-side argument, n_mtp_layers, and isn’t written in config.json (the code default is 1).
What config.json does pin down is the source layers. As dspark_target_layer_ids: [58, 59, 60] shows, the hidden states are drawn from the last three of the 61 layers.
DSparkBlock’s main_proj concatenates those three layers’ hidden states (input dim is dim × 3) and merges them into one before handing off to draft generation.
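
The shape logic of that merge is simple enough to show with plain lists. This is a sketch of "concatenate three hidden states, project back to `dim`", not the code in model.py: the bias-free linear layer, the toy dimension of 2, and the function names are all assumptions.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def concat_hidden_states(per_layer_hidden: Sequence[Vector]) -> Vector:
    """Concatenate per-layer hidden states into one long vector."""
    merged: Vector = []
    for hidden in per_layer_hidden:
        merged.extend(hidden)
    return merged


def linear(x: Vector, weight: List[Vector]) -> Vector:
    """Bias-free linear layer; weight has shape (out_dim, in_dim)."""
    if any(len(row) != len(x) for row in weight):
        raise ValueError("weight rows must match the input length")
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def merge_target_layers(
    hidden_by_layer: Dict[int, Vector], layer_ids: Sequence[int], weight: List[Vector]
) -> Vector:
    """Pick the configured layers, concatenate (dim x 3), project to dim."""
    stacked = concat_hidden_states([hidden_by_layer[i] for i in layer_ids])
    return linear(stacked, weight)


if __name__ == "__main__":
    # Toy hidden size of 2, so the projection input has 2 * 3 = 6 entries.
    hidden = {58: [1.0, 0.0], 59: [0.0, 2.0], 60: [3.0, 3.0]}
    weight = [
        [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
        [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    ]
    print(merge_target_layers(hidden, [58, 59, 60], weight))
```

The toy weight just sums the three layers per coordinate; a trained projection would learn how much each layer contributes.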

DSpark isn’t just a sampling setting.
It takes the hidden states from the last three layers, builds candidates in a dedicated block, and carries a Markov head that adds a previous-token bias plus a confidence head that emits a per-candidate acceptance likelihood.
Structurally it’s close to “a bolt-on draft head for running V4-Pro faster.”

V4-Pro’s body is also designed to cut down the memory side in the first place.
The KV head is num_key_value_heads: 1, effectively multi-query, and the paper puts the KV cache at one-tenth of V3.2’s.
Once you’ve cut memory to the bone, the latency that remains is sequential decoding itself — emitting one token at a time. DSpark is the derivative that goes after exactly that.

In the post comparing Chinese agent models after Claude Fable’s suspension, I treated DeepSeek V4 as “the model that put 1M-context open weights on the table.”
Adding the DSpark version, what comes next is the inference-runtime side of the story: how to make those distributed weights generate fast.
For people hitting it through an API, none of this is visible on the surface, but if you host the model yourself the difference shows up directly in output speed and GPU occupancy time.

No speed numbers to read yet

The model card carries no speed benchmark for the DSpark version itself.
DeepSpec publishes evaluation scripts and dataset names (gsm8k, math500, aime25, humaneval, mbpp, livecodebench, mt-bench, and so on), but the README has no speedup multiplier or acceptance length (how many tokens pass per step on average) for DeepSeek-V4-Pro-DSpark.
The repository was created on 2026-06-27. When I checked back the next day, downloads had climbed into the low hundreds, but for an 893GB distribution that doesn’t necessarily mean that many full runs, and there’s still no third-party benchmark measuring speed.
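
When numbers do appear, acceptance length is the one to look for first. As a reminder of what it measures, here is a minimal sketch that computes it from per-step counts. The counts in the example are invented for illustration, and whether the main model's own token is included varies between reports; this version counts every token emitted in a step.

```python
from typing import Sequence


def acceptance_length(tokens_per_step: Sequence[int]) -> float:
    """Average number of tokens emitted per verification step."""
    if not tokens_per_step:
        raise ValueError("need at least one step")
    if any(count < 1 for count in tokens_per_step):
        raise ValueError("every step emits at least one token")
    return sum(tokens_per_step) / len(tokens_per_step)


if __name__ == "__main__":
    # Invented counts for illustration, not a DSpark measurement.
    print(acceptance_length([1, 4, 4, 6]))
```

Acceptance length is an upper bound on speedup, not the speedup itself, since drafting and the wider verification pass add their own cost.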

So what’s readable at this stage stops at the structure.
The takeaway isn’t a capability comparison against the V4-Pro body; it’s that DeepSeek has started shipping a distribution with the draft head included, aimed at putting a 1.6T-class MoE into production inference.
Under the same MIT license, the inference code, conversion scripts, encoding implementation, and the 66-shard safetensors are all there.
