Together AI announces Mamba-3: ~7x faster long-context inference than Transformers, with complex-valued SSM
Together AI’s Mamba‑3 is a next‑generation SSM (State Space Model) built with a design philosophy different from Mamba‑2. While Mamba‑2 primarily optimized training speed, Mamba‑3 is redesigned with inference‑time latency reduction as the first goal. It has been accepted as an ICLR 2026 conference paper.
Paper: arXiv:2603.15569, code: github.com/state-spaces/mamba.
What is an SSM, and why Transformers are costly at inference
Today’s LLMs (ChatGPT, Claude, Gemini, etc.) are almost all powered by the Transformer architecture. Its core is the Attention mechanism, which “computes relationships among all tokens in the input.”
While highly accurate, Attention has a fatal weakness: as sequences grow longer, computation blows up quadratically.
| Tokens | Approx. attention compute |
|---|---|
| 1,000 | 1,000,000 (1 million) |
| 4,000 | 16,000,000 (16 million) |
| 16,000 | 256,000,000 (256 million) |
| 128,000 | 16,384,000,000 (~16.4 billion) |
When the number of tokens quadruples, the compute is 16×. This is the root cause of slow long‑context inference and a key reason for high API cost.
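As a quick sanity check, the table's pairwise-comparison counts can be reproduced in a few lines. This is a rough cost model of attention (token pairs only, ignoring constant factors and head dimensions):

```python
# Attention compares every token with every other token, so the cost
# scales as n^2 in the number of tokens n.
for n in [1_000, 4_000, 16_000, 128_000]:
    print(f"{n:>7} tokens -> ~{n * n:,} pairwise comparisons")

# 4x the tokens means 16x the compute:
assert (4_000 ** 2) // (1_000 ** 2) == 16
assert 128_000 ** 2 == 16_384_000_000
```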
An SSM (State Space Model) tackles this issue at its root. Instead of “comparing all pairs of tokens” like Attention, it reads the sequence while updating a fixed‑size state.
An intuitive analogy: Transformer vs SSM is like “database” vs “memory”.
- Transformer = database‑like: Keeps all past tokens as a KV cache and queries them as needed. Accurate but storage‑hungry
- SSM = memory‑like: Reads with a fixed‑size memory that is updated over time. Like a human reading a book—not cross‑checking all pages at once, but carrying a running summary (state) while reading page by page. Memory footprint is constant, but fine details get forgotten
```mermaid
graph LR
    subgraph Transformer
        T1[Token 1] <--> T2[Token 2]
        T1 <--> T3[Token 3]
        T1 <--> T4[Token 4]
        T2 <--> T3
        T2 <--> T4
        T3 <--> T4
    end
    subgraph SSM
        S1[Token 1] --> State1[State update]
        State1 --> S2[Token 2]
        S2 --> State2[State update]
        State2 --> S3[Token 3]
        S3 --> State3[State update]
        State3 --> S4[Token 4]
    end
```
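The fixed-size-state idea can be sketched in a few lines. This is a toy linear recurrence for intuition only, not Mamba-3's selective, input-dependent kernels; the matrix names and sizes are illustrative:

```python
import numpy as np

def ssm_scan(xs, A, B, C):
    """Toy linear SSM: the state h is updated once per token, so memory
    stays fixed no matter how long the input is (unlike a growing KV cache)."""
    h = np.zeros(A.shape[0])      # fixed-size state
    ys = []
    for x in xs:                  # one linear pass over the sequence
        h = A @ h + B * x         # fold the new token into the state
        ys.append(C @ h)          # read out from the compressed state
    return np.array(ys)

rng = np.random.default_rng(0)
N = 8                             # state stays N-dimensional at any length
A = 0.9 * np.eye(N)               # decay: older tokens fade from the state
B = rng.standard_normal(N)
C = rng.standard_normal(N)
ys = ssm_scan(rng.standard_normal(1000), A, B, C)
```

Processing 1,000 tokens touches the 8-dimensional state 1,000 times; the memory footprint never grows.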
SSM compute scales linearly with sequence length (4× tokens → 4× compute), so the speed gap vs Transformers widens as the sequence grows. This is why Mamba‑3 is about 7× faster at 16,384 tokens.
History of SSMs (S4 → Mamba → Mamba‑2 → Mamba‑3)
Research on SSMs has progressed step by step. The key figures are CMU’s Albert Gu (founder of modern SSM research; creator of HiPPO, S4, and Mamba; now Chief Scientist at Cartesia AI) and Princeton’s Tri Dao (also the creator of FlashAttention; now Chief Scientist at Together AI). They collaborated during their PhD days at Stanford and have co‑authored the Mamba series.
| Model | Year | Main contribution |
|---|---|---|
| S4 | 2021 | First breakthrough showing SSMs can efficiently model long‑range dependencies; strong results on audio and time series |
| Mamba | Late 2023 | Introduced a “selection mechanism” that adapts SSM parameters to the input; first to match Transformer quality at LLM scale |
| Mamba‑2 | 2024 | Established a mathematical equivalence between SSMs and Attention, enabling reuse of existing GPU kernels (FlashAttention‑style); training sped up significantly |
| Mamba‑3 | 2026 | Focuses on lowering inference latency; with complex‑valued states and a new discretization, achieves the same quality with half the state size |
Up through Mamba‑2 the question was “can we reach Transformer‑level quality?” Mamba‑3 assumes that and shifts the focus to “how fast can we actually run on GPUs?”
Architectural improvements
1. Exponential Trapezoid Discretization
SSMs are defined as continuous‑time systems (differential equations). To run on computers, they must be converted to discrete time (step‑by‑step), a process called discretization.
Different discretizations preserve different amounts of information from the original continuous system. As a rough analogy, consider video frame rate—30 fps vs 60 fps can feel very different. The higher the discretization accuracy, the more faithfully we can reconstruct the underlying continuous signal.
| Discretization | Order | Characteristics |
|---|---|---|
| ZOH (Zero‑Order Hold) | First | Used in the original Mamba. Simplest but less expressive; assumes the input stays constant between steps |
| Exponential Euler | First | Used in Mamba‑2. Similar accuracy to ZOH but more compute‑efficient |
| Exponential Trapezoid | Second | Introduced in Mamba‑3. Accounts for changes in both the input and the state, enabling richer sequence patterns |
What changes when we go from first‑ to second‑order accuracy? In Mamba‑2, limited discretization accuracy had to be compensated by an auxiliary short convolution layer, responsible for capturing very local patterns (n‑gram‑like short‑range dependencies). With exponential trapezoid discretization, the SSM core alone can capture this information, so the short‑convolution layer can be removed entirely. According to the paper, removing it does not hurt accuracy—if anything, it slightly improves it. With one fewer layer, the inference pipeline is simpler and faster.
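The accuracy gap between first- and second-order schemes is easy to see on a toy ODE. The updates below are the textbook explicit-Euler and trapezoid (bilinear) rules, shown for intuition; they are not Mamba-3's exact exponential variants:

```python
import math

# Toy comparison on dh/dt = a*h, whose exact solution is h(t) = exp(a*t).
a, dt, steps = -1.0, 0.1, 50
exact = math.exp(a * dt * steps)

h_euler, h_trap = 1.0, 1.0
for _ in range(steps):
    h_euler = h_euler * (1 + a * dt)                       # 1st-order step
    h_trap = h_trap * (1 + a * dt / 2) / (1 - a * dt / 2)  # 2nd-order step

err_euler = abs(h_euler - exact)
err_trap = abs(h_trap - exact)
# The second-order scheme tracks the continuous solution far more closely
# at the same step size.
```

With the same step size, the trapezoid scheme's error is orders of magnitude smaller, which is the sense in which a higher-order discretization reconstructs the continuous system more faithfully.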
2. Complex‑valued states
Up to Mamba‑2, the state vector was real‑valued. Mamba‑3 extends it to be complex‑valued.
Complex numbers may sound intimidating, but the point here is simple: they represent oscillatory patterns efficiently.
With real‑valued states, tracking “periodically varying information” often requires combining multiple state variables. With complex numbers, a single variable naturally holds both amplitude and phase, letting us represent the same information with fewer state variables. This is the main reason Mamba‑3 matches perplexity with “half the state size.”
| State type | State size | Patterns captured |
|---|---|---|
| Real‑valued (Mamba‑2) | N | Exponential decay/growth |
| Complex‑valued (Mamba‑3) | N/2 | Exponential decay/growth + oscillation/phase rotation |
For example, natural language contains many long‑distance correspondences—matching brackets, if‑then structures, subject–predicate agreement, etc. To follow such distant dependencies, the model must maintain a “waiting for something” state while reading intervening irrelevant tokens. Phase rotation in the complex plane naturally represents this form of “holding.”
This shows up dramatically in state‑tracking benchmarks:
| Task | Mamba‑3 | Mamba‑2 |
|---|---|---|
| Parity (track odd/even) | 100% | 0.9% |
| Arithmetic (no parentheses) | 98.51% | 47.81% |
| Arithmetic (with parentheses) | 87.75% | 0.88% |
Parity asks you to read a sequence of 0s and 1s and track whether the number of 1s seen so far is even or odd. It looks simple but requires keeping the correct state over very long sequences—historically difficult for SSMs. The jump from Mamba‑2’s 0.9% (near random) to Mamba‑3’s 100% shows how much complex‑valued states improved state‑tracking.
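The intuition for why a phase-rotating state solves parity can be sketched directly: let each observed 1 rotate the state's phase by π and read parity off the sign. This illustrates the mechanism only; it is not the trained model:

```python
# Parity with a single complex state: each "1" applies e^{i*pi} (a half-turn),
# so the sign of the state encodes whether the count of 1s is even or odd.
def parity_via_phase(bits):
    state = 1 + 0j
    for b in bits:
        if b == 1:
            state *= -1            # e^{i*pi}: rotate the phase by pi
    return 0 if state.real > 0 else 1

assert parity_via_phase([1, 0, 1, 1]) == 1   # three 1s -> odd
assert parity_via_phase([1, 1, 0, 0]) == 0   # two 1s -> even
```

A purely real, decaying state has no such "half-turn" to exploit; it can only shrink or grow, which is why real-valued SSMs struggled here.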
Implementation‑wise, the complex state matrix is represented as block‑diagonal 2×2 rotation matrices applied to real vectors—a “RoPE trick.” For positional encoding Mamba‑3 adopts RoPE (Rotary Position Embedding), widely used in Transformers and mathematically consistent with complex‑valued representations. Removing RoPE collapses parity accuracy to 2.27%, confirming that this mechanism is essential.
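The "RoPE trick" rests on a standard identity: multiplying a complex number by e^{iθ} is the same as applying a 2×2 rotation matrix to its (real, imaginary) pair, which is what lets a "complex" state run entirely on real-valued GPU kernels. A quick check of the identity:

```python
import cmath
import numpy as np

# Complex multiply by e^{i*theta} == 2x2 rotation on the (real, imag) pair.
theta = 0.7
z = 1.5 - 0.4j
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])

v = rot @ np.array([z.real, z.imag])   # real-valued 2x2 rotation
w = z * cmath.exp(1j * theta)          # complex multiplication

assert np.allclose(v, [w.real, w.imag])
```

Stacking such 2×2 blocks along the diagonal gives the block-diagonal rotation structure described above.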
3. MIMO SSM (Multi‑Input Multi‑Output)
Mamba‑2 used a SISO (Single‑Input Single‑Output) structure: each channel (feature dimension) runs as an independent SSM, so the state of channel A never affects channel B.
```mermaid
graph TD
    subgraph SISO[SISO structure]
        direction LR
        C1[Channel 1] --> SSM1[SSM] --> O1[Output 1]
        C2[Channel 2] --> SSM2[SSM] --> O2[Output 2]
        C3[Channel 3] --> SSM3[SSM] --> O3[Output 3]
    end
    subgraph MIMO[MIMO structure]
        direction LR
        M1[Channel 1] --> MSSM[SSM<br/>r=4] --> MO1[Output 1]
        M2[Channel 2] --> MSSM --> MO2[Output 2]
        M3[Channel 3] --> MSSM --> MO3[Output 3]
    end
```
In MIMO (Multi‑Input Multi‑Output), channels interact: combinations of input channels influence the outputs. Here r=4 denotes rank‑4 coupling—a low‑rank approximation rather than full coupling. Fully coupling all channels would explode compute, so information is exchanged through a 4‑dimensional subspace.
Attention in Transformers already has interaction across all dimensions within a head. SISO SSMs lacked this, and MIMO supplies it.
A neat technical point: although MIMO can increase FLOPs by up to 4× vs Mamba‑2, the measured latency stays almost the same. The reason is that decode on modern GPUs is memory‑bound, not compute‑bound: latency is dominated by moving data, so extra arithmetic is nearly free. MIMO raises arithmetic intensity (ops per byte moved) from about 2.5 ops/byte toward Θ(r) ops/byte, putting otherwise idle compute units to work.
Empirically, MIMO improves accuracy by about +1.2 points over SISO, while increasing decode latency only slightly (~8%).
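The SISO-vs-MIMO contrast can be sketched as follows. Here `D` and `r` are illustrative sizes; the point is that rank-r coupling mixes channels through an r-dimensional subspace at a fraction of the cost of a full D×D map:

```python
import numpy as np

rng = np.random.default_rng(0)
D, r = 64, 4                      # D channels coupled through a rank-r subspace
x = rng.standard_normal(D)        # one token's features

# SISO-style: each channel is scaled independently -> no cross-channel mixing.
siso_out = rng.standard_normal(D) * x

# MIMO-style (rank-r): project the D channels down to r, then back up.
# Full coupling would need a DxD map (D*D parameters); rank-r costs only 2*D*r.
down = rng.standard_normal((r, D))
up = rng.standard_normal((D, r))
mimo_out = up @ (down @ x)        # channels interact via the r-dim subspace

coupling = up @ down              # effective DxD map, but only rank r
assert coupling.shape == (D, D)
assert np.linalg.matrix_rank(coupling) == r
```

With D = 64 and r = 4, the low-rank coupling uses 512 parameters instead of 4,096 while still letting every channel influence every output.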
Other changes include adding a QKNorm normalization layer for training stability and unifying the MLP to the standard design shared by Transformers and Gated DeltaNet.
Mamba‑2 vs Mamba‑3
| Item | Mamba‑2 | Mamba‑3 |
|---|---|---|
| Design goal | Maximize training speed | Minimize inference latency |
| Discretization | Exponential Euler (1st‑order) | Exponential trapezoid (2nd‑order) |
| State type | Real‑valued | Complex‑valued (implemented via RoPE) |
| IO structure | SISO only | Choose SISO or MIMO |
| Short convolution | Present (required) | Not needed (removed) |
| Normalization | RMSNorm | QKNorm |
| Perplexity at state size 128 | Baseline | Matches at state size 64 (half) |
| Parity task | 0.9% | 100% |
| Single‑token decode latency | 0.203 ms | 0.156 ms |
Inference speed benchmark (H100 GPU, 1.5B model)
Comparison of prefill + decode latency (seconds). “Prefill” processes the input; “decode” generates output tokens sequentially. Measured with batch size 128.
| Model | 512 | 1024 | 2048 | 4096 | 16384 |
|---|---|---|---|---|---|
| Transformer (vLLM, Llama‑3.2‑1B) | 4.45 | 9.60 | 20.37 | 58.64 | 976.50 |
| Mamba‑2 | 4.66 | 9.32 | 18.62 | 37.22 | 149.02 |
| Mamba‑3 SISO | 4.39 | 8.78 | 17.57 | 35.11 | 140.61 |
| Mamba‑3 MIMO (r=4) | 4.74 | 9.48 | 18.96 | 37.85 | 151.81 |
With short inputs (512 tokens) there’s barely any difference. The gap emerges on long contexts: at 16,384 tokens, the Transformer takes about 16 minutes, whereas Mamba‑3 SISO takes about 2.3 minutes—roughly 6.9× faster. Transformer latency grows 16.7× from 4,096 → 16,384 (quadratic), while Mamba‑3 SISO grows only 4.0× (linear).
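The quoted ratios follow directly from the table; a quick recomputation (latencies in seconds, copied from the rows above):

```python
# Headline ratios from the H100 latency table (seconds).
transformer = {4096: 58.64, 16384: 976.50}
mamba3_siso = {4096: 35.11, 16384: 140.61}

speedup = transformer[16384] / mamba3_siso[16384]
tf_growth = transformer[16384] / transformer[4096]    # 4x tokens
ssm_growth = mamba3_siso[16384] / mamba3_siso[4096]   # 4x tokens

assert round(speedup, 1) == 6.9     # ~7x faster at 16,384 tokens
assert round(tf_growth, 1) == 16.7  # quadratic: 4x tokens -> ~16x time
assert round(ssm_growth, 1) == 4.0  # linear: 4x tokens -> ~4x time
```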
SISO is the fastest across all sequence lengths. MIMO decodes slightly slower but yields higher accuracy, so you can choose based on your use case.
Accuracy benchmark (1.5B scale)
Average accuracy on downstream tasks.
| Model | ARC‑E | ARC‑C | HellaSwag | PIQA | WinoGrande | OBQA | Avg. |
|---|---|---|---|---|---|---|---|
| Transformer | 74.0 | 40.4 | 60.6 | 73.8 | 58.7 | 29.6 | 55.4 |
| Gated DeltaNet | 75.3 | 41.2 | 61.3 | 74.3 | 58.0 | 31.6 | 55.8 |
| Mamba‑2 | 75.3 | 41.8 | 61.4 | 73.6 | 57.5 | 32.6 | 55.7 |
| Mamba‑3 SISO | 75.9 | 42.7 | 61.9 | 73.6 | 59.4 | 32.0 | 56.4 |
| Mamba‑3 MIMO | 76.5 | 44.5 | 62.3 | 75.3 | 60.6 | 32.6 | 57.6 |
What each benchmark measures:
| Benchmark | Description |
|---|---|
| ARC‑E / ARC‑C | Elementary‑level science questions (Easy and Challenge). Common‑sense reasoning |
| HellaSwag | Choose the most plausible continuation of a sentence. Everyday scenario understanding |
| PIQA | Physical commonsense (e.g., “How to warm a cup?”) |
| WinoGrande | Resolving pronoun references. Discourse understanding |
| OBQA | Elementary‑school‑level science knowledge |
Mamba‑3 MIMO improves +2.2 points over a Transformer and +1.9 points over Mamba‑2. At the 1.5B scale it overturns the conventional wisdom that “SSMs trail Transformers in accuracy.”
Tasks SSMs struggle with (precise retrieval)
On retrieval benchmarks, SSMs still lag behind Transformers. The core limitation is that SSMs compress information into a fixed‑size state.
Transformers retain all past tokens in the KV cache, so they can retrieve “the exact value from 1,000 tokens ago.” SSMs keep only a fixed‑size state; older information is overwritten by new information. Like human memory, recent content is crisp while distant details become fuzzy.
Examples include tasks such as “answer with the exact numerical value appearing in the third paragraph of this document” or “extract all spans that contain a specific keyword from a long input,” as well as RAG usage that requires quoting precisely from many retrieved passages.
The paper suggests that future mainstream models will be hybrids combining SSMs with global Attention. This is already an industry trend (see below).
Multi‑layer kernel implementations
To balance speed, usability, and accuracy control, Mamba‑3 uses different GPU kernels for different phases.
A GPU kernel is a low‑level program that runs on the GPU. Even for the same math, speed can vary drastically depending on implementation and memory‑access patterns.
| Kernel | Usage | Why this implementation |
|---|---|---|
| Triton | Prefill kernel | Python‑like way to write GPU kernels; prioritizes developer velocity and readability |
| TileLang | For MIMO | Channels interact in MIMO, so precise control of memory hierarchy (registers → shared → global) matters; TileLang expresses tile‑level memory management declaratively |
| CuTe DSL | Decode kernel | Directly uses Hopper‑specific instructions (e.g., TMA) on NVIDIA GPUs like H100 to minimize decode latency |
Having a decode kernel tuned for Hopper‑generation GPUs (e.g., H100) contributes to the measured inference speed. Conversely, older GPUs (e.g., A100) may not achieve the same speeds.
Who already uses SSMs
Mamba‑3’s results are reported at the 1.5B scale, but the SSM architecture is already in commercial use by multiple companies. The idea that “SSMs are still just research” is outdated.
SSM–Transformer hybrid models
Hybrids that combine SSM speed with Transformer retrieval precision are becoming mainstream.
| Company | Product | Scale | Notes |
|---|---|---|---|
| AI21 Labs | Jamba 1.5 | 398B (94B active) MoE | First commercial SSM hybrid; 256K context; available as an NVIDIA NIM and via API |
| NVIDIA | Nemotron‑H | 8B / 47B / 56B | Replaces 92% of Attention layers with Mamba‑2 blocks; achieves 3× throughput vs Transformer |
| NVIDIA | Nemotron 3 Super | 120B (12B active) MoE | Designed for agentic inference |
| IBM | Bamba → Granite 4.0 | 8B | Matches Llama‑3.1 8B quality with 1/7 the data; tech rolling into the next Granite 4.0 |
| Zyphra | Zamba2 | 1.2B / 2.7B / 7.4B | On‑device focus; 2× inference speed and 27% less memory |
Pure SSM models
| Company | Product | Scale | Notes |
|---|---|---|---|
| TII (UAE) | Falcon Mamba 7B | 7B | Best open‑source SSM so far; outperforms Llama‑3.1 8B and Mistral 7B |
| Cartesia AI | Sonic 3 (TTS) / Rene (LLM) | 1.3B | Real‑time speech AI in 42 languages; Albert Gu (SSM founder) is Chief Scientist |
One notable point: NVIDIA itself adopts an SSM hybrid in its inference stack (Nemotron‑H). The GPU vendor pushing a Transformer alternative makes sense because SSMs make GPUs more efficient. If the same hardware can serve more requests, demand doesn’t fall—new use cases open up.
Who benefits, and how
LLM API providers (Together AI, AWS Bedrock, etc.)
You can process more requests per GPU. Workloads heavy on long inputs/outputs—document summarization, code generation, long‑form translation—see large throughput gains per GPU. That feeds directly into lower service prices.
Concretely, if Mamba‑3 is ~7× faster than a Transformer at 16,384 tokens, you could potentially handle roughly 7× more requests per GPU. Since cloud GPU time dominates LLM service cost, the impact is substantial.
Edge and on‑device inference
The fixed‑size state of SSMs is a natural fit for memory‑constrained devices (phones, IoT). Transformer KV caches grow with sequence length; SSM state stays constant regardless of input length.
Zyphra’s Zamba2 follows exactly this path, optimizing 1.2B–7.4B models for on‑device inference and cutting memory by 27%. Running quality LLMs on phones becomes realistic by choosing SSMs where Transformer memory is tight.
Real‑time speech and video
The Mamba family is intrinsically well‑suited to continuous signals (speech, sensor data). S4 first shined on speech recognition and time‑series forecasting. Mamba‑3’s low‑latency inference can power real‑time ASR, simultaneous translation, and live captioning with lower latency than Transformers.
Cartesia AI’s Sonic 3 already offers SSM‑based real‑time TTS in 42 languages, demonstrating SSM practicality in this domain.
Applications needing long context
Because latency grows only linearly with tokens, tasks like repo‑wide code analysis, book‑length summarization/translation, or chatbots retaining very long histories become practical at lengths that were unrealistic with Transformers.
AI21 Labs’ Jamba 1.5 supports 256K context specifically because it is an SSM hybrid. Handling 256K tokens with a pure Transformer would consume tens of GBs of memory just for the KV cache.
Current limitations
That said, Transformers won’t be replaced overnight.
- As noted, SSMs lag on tasks requiring precise retrieval. They are not ideal for RAG generation where exact citation is mandatory
- Compared to Transformers, the SSM ecosystem is smaller—fewer optimization stacks (vLLM, TensorRT‑LLM) and finetuning tools (e.g., LoRA), though NVIDIA’s adoption in Nemotron‑H should accelerate tooling
- The Mamba‑3 paper reports experiments from 180M to 1.5B; results at 70B–hundreds of billions are not reported. While large‑scale hybrids exist (AI21’s 398B Jamba 1.5, NVIDIA’s 56B Nemotron‑H), pure‑SSM results at that scale are still to come
- Even the paper notes that hybrids (SSM + Attention) are the most promising for practical use