Together AI announces Mamba-3: ~7x faster long-context inference than Transformers, with complex-valued SSM
Together AI’s Mamba‑3 is a next‑generation SSM (State Space Model) built with a design philosophy different from Mamba‑2. While Mamba‑2 primarily optimized training speed, Mamba‑3 is redesigned with inference‑time latency reduction as the first goal. It has been accepted as an ICLR 2026 conference paper.
Paper: arXiv:2603.15569, code: github.com/state-spaces/mamba.
What is an SSM, and why Transformers are costly at inference
Today’s LLMs (ChatGPT, Claude, Gemini, etc.) are almost all powered by the Transformer architecture. Its core is the Attention mechanism, which “computes relationships among all tokens in the input.”
While highly accurate, Attention has a fatal weakness: as sequences grow longer, computation blows up quadratically.
| Tokens | Approx. attention compute |
|---|---|
| 1,000 | 1,000,000 (1 million) |
| 4,000 | 16,000,000 (16 million) |
| 16,000 | 256,000,000 (256 million) |
| 128,000 | 16,384,000,000 (~16.4 billion) |
When the number of tokens quadruples, the compute is 16×. This is the root cause of slow long‑context inference and a key reason for high API cost.
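As a quick sanity check, the table's pairwise-comparison counts can be reproduced in a few lines. This is a rough cost model of attention (token pairs only, ignoring constant factors and head dimensions):

```python
# Attention compares every token with every other token, so the cost
# scales as n^2 in the number of tokens n.
for n in [1_000, 4_000, 16_000, 128_000]:
    print(f"{n:>7} tokens -> ~{n * n:,} pairwise comparisons")

# 4x the tokens means 16x the compute:
assert (4_000 ** 2) // (1_000 ** 2) == 16
assert 128_000 ** 2 == 16_384_000_000
```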
An SSM (State Space Model) tackles this issue at its root. Instead of “comparing all pairs of tokens” like Attention, it reads the sequence while updating a fixed‑size state.
An intuitive analogy: Transformer vs SSM is like “database” vs “memory”.
- Transformer = database‑like: Keeps all past tokens as a KV cache and queries them as needed. Accurate but storage‑hungry
- SSM = memory‑like: Reads with a fixed‑size memory that is updated over time. Like a human reading a book—not cross‑checking all pages at once, but carrying a running summary (state) while reading page by page. Memory footprint is constant, but fine details get forgotten
```mermaid
graph LR
    subgraph Transformer
        T1[Token 1] <--> T2[Token 2]
        T1 <--> T3[Token 3]
        T1 <--> T4[Token 4]
        T2 <--> T3
        T2 <--> T4
        T3 <--> T4
    end
    subgraph SSM
        S1[Token 1] --> State1[State update]
        State1 --> S2[Token 2]
        S2 --> State2[State update]
        State2 --> S3[Token 3]
        S3 --> State3[State update]
        State3 --> S4[Token 4]
    end
```
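The fixed-size-state idea can be sketched in a few lines. This is a toy linear recurrence for intuition only, not Mamba-3's selective, input-dependent kernels; the matrix names and sizes are illustrative:

```python
import numpy as np

def ssm_scan(xs, A, B, C):
    """Toy linear SSM: the state h is updated once per token, so memory
    stays fixed no matter how long the input is (unlike a growing KV cache)."""
    h = np.zeros(A.shape[0])      # fixed-size state
    ys = []
    for x in xs:                  # one linear pass over the sequence
        h = A @ h + B * x         # fold the new token into the state
        ys.append(C @ h)          # read out from the compressed state
    return np.array(ys)

rng = np.random.default_rng(0)
N = 8                             # state stays N-dimensional at any length
A = 0.9 * np.eye(N)               # decay: older tokens fade from the state
B = rng.standard_normal(N)
C = rng.standard_normal(N)
ys = ssm_scan(rng.standard_normal(1000), A, B, C)
```

Processing 1,000 tokens touches the 8-dimensional state 1,000 times; the memory footprint never grows.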
SSM compute scales linearly with sequence length (4× tokens → 4× compute), so the speed gap vs Transformers widens as the sequence grows. This is why Mamba‑3 is about 7× faster at 16,384 tokens.
History of SSMs (S4 → Mamba → Mamba‑2 → Mamba‑3)
Research on SSMs has progressed step by step. The key figures are CMU’s Albert Gu (founder of modern SSM research; creator of HiPPO, S4, and Mamba; now Chief Scientist at Cartesia AI) and Princeton’s Tri Dao (also the creator of FlashAttention; now Chief Scientist at Together AI). They collaborated during their PhD days at Stanford and have co‑authored the Mamba series.
| Model | Year | Main contribution |
|---|---|---|
| S4 | 2021 | First breakthrough showing SSMs can efficiently model long‑range dependencies; strong results on audio and time series |
| Mamba | Late 2023 | Introduced a “selection mechanism” that adapts SSM parameters to the input; first to match Transformer quality at LLM scale |
| Mamba‑2 | 2024 | Established a mathematical equivalence between SSMs and Attention, enabling reuse of existing GPU kernels (FlashAttention‑style); training sped up significantly |
| Mamba‑3 | 2026 | Focuses on lowering inference latency; with complex‑valued states and a new discretization, achieves the same quality with half the state size |
Up through Mamba‑2 the question was “can we reach Transformer‑level quality?” Mamba‑3 assumes that and shifts the focus to “how fast can we actually run on GPUs?”
Architectural improvements
1. Exponential Trapezoid Discretization
SSMs are defined as continuous‑time systems (differential equations). To run on computers, they must be converted to discrete time (step‑by‑step), a process called discretization.
Different discretizations preserve different amounts of information from the original continuous system. As a rough analogy, consider video frame rate—30 fps vs 60 fps can feel very different. The higher the discretization accuracy, the more faithfully we can reconstruct the underlying continuous signal.
| Discretization | Order | Characteristics |
|---|---|---|
| ZOH (Zero‑Order Hold) | First | Used in the original Mamba. Simplest but less expressive; assumes the input stays constant between steps |
| Exponential Euler | First | Used in Mamba‑2. Similar accuracy to ZOH but more compute‑efficient |
| Exponential Trapezoid | Second | Introduced in Mamba‑3. Accounts for changes in both the input and the state, enabling richer sequence patterns |
What changes when we go from first‑ to second‑order accuracy? In Mamba‑2, limited discretization accuracy had to be compensated by an auxiliary short convolution layer, responsible for capturing very local patterns (n‑gram‑like short‑range dependencies). With exponential trapezoid discretization, the SSM core alone can capture this information, so the short‑convolution layer can be removed entirely. According to the paper, removing it does not hurt accuracy—if anything, it slightly improves it. With one fewer layer, the inference pipeline is simpler and faster.
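The accuracy gap between first- and second-order schemes is easy to see on a toy ODE. The updates below are the textbook explicit-Euler and trapezoid (bilinear) rules, shown for intuition; they are not Mamba-3's exact exponential variants:

```python
import math

# Toy comparison on dh/dt = a*h, whose exact solution is h(t) = exp(a*t).
a, dt, steps = -1.0, 0.1, 50
exact = math.exp(a * dt * steps)

h_euler, h_trap = 1.0, 1.0
for _ in range(steps):
    h_euler = h_euler * (1 + a * dt)                       # 1st-order step
    h_trap = h_trap * (1 + a * dt / 2) / (1 - a * dt / 2)  # 2nd-order step

err_euler = abs(h_euler - exact)
err_trap = abs(h_trap - exact)
# The second-order scheme tracks the continuous solution far more closely
# at the same step size.
```

With the same step size, the trapezoid scheme's error is orders of magnitude smaller, which is the sense in which a higher-order discretization reconstructs the continuous system more faithfully.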
2. Complex‑valued states
Up to Mamba‑2, the state vector was real‑valued. Mamba‑3 extends it to be complex‑valued.
Complex numbers may sound intimidating, but the point here is simple: they represent oscillatory patterns efficiently.
With real‑valued states, tracking “periodically varying information” often requires combining multiple state variables. With complex numbers, a single variable naturally holds both amplitude and phase, letting us represent the same information with fewer state variables. This is the main reason Mamba‑3 matches perplexity with “half the state size.”
| State type | State size | Patterns captured |
|---|---|---|
| Real‑valued (Mamba‑2) | N | Exponential decay/growth |
| Complex‑valued (Mamba‑3) | N/2 | Exponential decay/growth + oscillation/phase rotation |
For example, natural language contains many long‑distance correspondences—matching brackets, if‑then structures, subject–predicate agreement, etc. To follow such distant dependencies, the model must maintain a “waiting for something” state while reading intervening irrelevant tokens. Phase rotation in the complex plane naturally represents this form of “holding.”
This shows up dramatically in state‑tracking benchmarks:
| Task | Mamba‑3 | Mamba‑2 |
|---|---|---|
| Parity (track odd/even) | 100% | 0.9% |
| Arithmetic (no parentheses) | 98.51% | 47.81% |
| Arithmetic (with parentheses) | 87.75% | 0.88% |
Parity asks you to read a sequence of 0s and 1s and track whether the number of 1s seen so far is even or odd. It looks simple but requires keeping the correct state over very long sequences—historically difficult for SSMs. The jump from Mamba‑2’s 0.9% (near random) to Mamba‑3’s 100% shows how much complex‑valued states improved state‑tracking.
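The intuition for why a phase-rotating state solves parity can be sketched directly: let each observed 1 rotate the state's phase by π and read parity off the sign. This illustrates the mechanism only; it is not the trained model:

```python
# Parity with a single complex state: each "1" applies e^{i*pi} (a half-turn),
# so the sign of the state encodes whether the count of 1s is even or odd.
def parity_via_phase(bits):
    state = 1 + 0j
    for b in bits:
        if b == 1:
            state *= -1            # e^{i*pi}: rotate the phase by pi
    return 0 if state.real > 0 else 1

assert parity_via_phase([1, 0, 1, 1]) == 1   # three 1s -> odd
assert parity_via_phase([1, 1, 0, 0]) == 0   # two 1s -> even
```

A purely real, decaying state has no such "half-turn" to exploit; it can only shrink or grow, which is why real-valued SSMs struggled here.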
Implementation‑wise, the complex state matrix is represented as block‑diagonal 2×2 rotation matrices applied to real vectors—a “RoPE trick.” For positional encoding Mamba‑3 adopts RoPE (Rotary Position Embedding), widely used in Transformers and mathematically consistent with complex‑valued representations. Removing RoPE collapses parity accuracy to 2.27%, confirming that this mechanism is essential.
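The "RoPE trick" rests on a standard identity: multiplying a complex number by e^{iθ} is the same as applying a 2×2 rotation matrix to its (real, imaginary) pair, which is what lets a "complex" state run entirely on real-valued GPU kernels. A quick check of the identity:

```python
import cmath
import numpy as np

# Complex multiply by e^{i*theta} == 2x2 rotation on the (real, imag) pair.
theta = 0.7
z = 1.5 - 0.4j
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])

v = rot @ np.array([z.real, z.imag])   # real-valued 2x2 rotation
w = z * cmath.exp(1j * theta)          # complex multiplication

assert np.allclose(v, [w.real, w.imag])
```

Stacking such 2×2 blocks along the diagonal gives the block-diagonal rotation structure described above.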
3. MIMO SSM (Multi‑Input Multi‑Output)
Mamba‑2 used a SISO (Single‑Input Single‑Output) structure: each channel (feature dimension) runs as an independent SSM, so the state of channel A never affects channel B.
```mermaid
graph TD
    subgraph SISO[SISO structure]
        direction LR
        C1[Channel 1] --> SSM1[SSM] --> O1[Output 1]
        C2[Channel 2] --> SSM2[SSM] --> O2[Output 2]
        C3[Channel 3] --> SSM3[SSM] --> O3[Output 3]
    end
    subgraph MIMO[MIMO structure]
        direction LR
        M1[Channel 1] --> MSSM[SSM<br/>r=4] --> MO1[Output 1]
        M2[Channel 2] --> MSSM --> MO2[Output 2]
        M3[Channel 3] --> MSSM --> MO3[Output 3]
    end
```
In MIMO (Multi‑Input Multi‑Output), channels interact: combinations of input channels influence the outputs. Here r=4 denotes rank‑4 coupling—a low‑rank approximation rather than full coupling. Fully coupling all channels would explode compute, so information is exchanged through a 4‑dimensional subspace.
Attention in Transformers already has interaction across all dimensions within a head. SISO SSMs lacked this, and MIMO supplies it.
A neat technical point: although MIMO can increase FLOPs by up to 4× vs Mamba‑2, the measured latency stays almost the same. The reason is that decode on modern GPUs is memory‑bound, not compute‑bound: latency is dominated by moving data, so extra arithmetic is nearly free. MIMO raises arithmetic intensity (ops per byte moved) from about 2.5 ops/byte toward Θ(r) ops/byte, putting otherwise idle compute units to work.
Empirically, MIMO improves accuracy by about +1.2 points over SISO, while increasing decode latency only slightly (~8%).
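The SISO-vs-MIMO contrast can be sketched as follows. Here `D` and `r` are illustrative sizes; the point is that rank-r coupling mixes channels through an r-dimensional subspace at a fraction of the cost of a full D×D map:

```python
import numpy as np

rng = np.random.default_rng(0)
D, r = 64, 4                      # D channels coupled through a rank-r subspace
x = rng.standard_normal(D)        # one token's features

# SISO-style: each channel is scaled independently -> no cross-channel mixing.
siso_out = rng.standard_normal(D) * x

# MIMO-style (rank-r): project the D channels down to r, then back up.
# Full coupling would need a DxD map (D*D parameters); rank-r costs only 2*D*r.
down = rng.standard_normal((r, D))
up = rng.standard_normal((D, r))
mimo_out = up @ (down @ x)        # channels interact via the r-dim subspace

coupling = up @ down              # effective DxD map, but only rank r
assert coupling.shape == (D, D)
assert np.linalg.matrix_rank(coupling) == r
```

With D = 64 and r = 4, the low-rank coupling uses 512 parameters instead of 4,096 while still letting every channel influence every output.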
Other changes include adding a QKNorm normalization layer for training stability and unifying the MLP to the standard design shared by Transformers and Gated DeltaNet.
Mamba‑2 vs Mamba‑3
| Item | Mamba‑2 | Mamba‑3 |
|---|---|---|
| Design goal | Maximize training speed | Minimize inference latency |
| Discretization | Exponential Euler (1st‑order) | Exponential trapezoid (2nd‑order) |
| State type | Real‑valued | Complex‑valued (implemented via RoPE) |
| IO structure | SISO only | Choose SISO or MIMO |
| Short convolution | Present (required) | Not needed (removed) |
| Normalization | RMSNorm | QKNorm |
| Perplexity at state size 128 | Baseline | Matches at state size 64 (half) |
| Parity task | 0.9% | 100% |
| Single‑token decode latency | 0.203 ms | 0.156 ms |
Inference speed benchmark (H100 GPU, 1.5B model)
Comparison of prefill + decode latency (seconds). “Prefill” processes the input; “decode” generates output tokens sequentially. Measured with batch size 128.
| Model | 512 | 1024 | 2048 | 4096 | 16384 |
|---|---|---|---|---|---|
| Transformer (vLLM, Llama‑3.2‑1B) | 4.45 | 9.60 | 20.37 | 58.64 | 976.50 |
| Mamba‑2 | 4.66 | 9.32 | 18.62 | 37.22 | 149.02 |
| Mamba‑3 SISO | 4.39 | 8.78 | 17.57 | 35.11 | 140.61 |
| Mamba‑3 MIMO (r=4) | 4.74 | 9.48 | 18.96 | 37.85 | 151.81 |
With short inputs (512 tokens) there’s barely any difference. The gap emerges on long contexts: at 16,384 tokens, the Transformer takes about 16 minutes, whereas Mamba‑3 SISO takes about 2.3 minutes—roughly 6.9× faster. Transformer latency grows 16.7× from 4,096 → 16,384 (quadratic), while Mamba‑3 SISO grows only 4.0× (linear).
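The quoted ratios follow directly from the table; a quick recomputation (latencies in seconds, copied from the rows above):

```python
# Headline ratios from the H100 latency table (seconds).
transformer = {4096: 58.64, 16384: 976.50}
mamba3_siso = {4096: 35.11, 16384: 140.61}

speedup = transformer[16384] / mamba3_siso[16384]
tf_growth = transformer[16384] / transformer[4096]    # 4x tokens
ssm_growth = mamba3_siso[16384] / mamba3_siso[4096]   # 4x tokens

assert round(speedup, 1) == 6.9     # ~7x faster at 16,384 tokens
assert round(tf_growth, 1) == 16.7  # quadratic: 4x tokens -> ~16x time
assert round(ssm_growth, 1) == 4.0  # linear: 4x tokens -> ~4x time
```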
SISO is the fastest across all sequence lengths. MIMO decodes slightly slower but yields higher accuracy, so you can choose based on your use case.
Accuracy benchmark (1.5B scale)
Average accuracy on downstream tasks.
| Model | ARC‑E | ARC‑C | HellaSwag | PIQA | WinoGrande | OBQA | Avg. |
|---|---|---|---|---|---|---|---|
| Transformer | 74.0 | 40.4 | 60.6 | 73.8 | 58.7 | 29.6 | 55.4 |
| Gated DeltaNet | 75.3 | 41.2 | 61.3 | 74.3 | 58.0 | 31.6 | 55.8 |
| Mamba‑2 | 75.3 | 41.8 | 61.4 | 73.6 | 57.5 | 32.6 | 55.7 |
| Mamba‑3 SISO | 75.9 | 42.7 | 61.9 | 73.6 | 59.4 | 32.0 | 56.4 |
| Mamba‑3 MIMO | 76.5 | 44.5 | 62.3 | 75.3 | 60.6 | 32.6 | 57.6 |
What each benchmark measures:
| Benchmark | Description |
|---|---|
| ARC‑E / ARC‑C | Elementary‑level science questions (Easy and Challenge). Common‑sense reasoning |
| HellaSwag | Choose the most plausible continuation of a sentence. Everyday scenario understanding |
| PIQA | Physical commonsense (e.g., “How to warm a cup?”) |
| WinoGrande | Resolving pronoun references. Discourse understanding |
| OBQA | Elementary‑school‑level science knowledge |
Mamba‑3 MIMO improves +2.2 points over a Transformer and +1.9 points over Mamba‑2. At the 1.5B scale it overturns the conventional wisdom that “SSMs trail Transformers in accuracy.”
Tasks SSMs struggle with (precise retrieval)
On retrieval benchmarks, SSMs still lag behind Transformers. The core limitation is that SSMs compress information into a fixed‑size state.
Transformers retain all past tokens in the KV cache, so they can retrieve “the exact value from 1,000 tokens ago.” SSMs keep only a fixed‑size state; older information is overwritten by new information. Like human memory, recent content is crisp while distant details become fuzzy.
Examples include tasks such as “answer with the exact numerical value appearing in the third paragraph of this document” or “extract all spans that contain a specific keyword from a long input,” as well as RAG usage that requires quoting precisely from many retrieved passages.
The paper suggests that future mainstream models will be hybrids combining SSMs with global Attention. This is already an industry trend (see below).
Multi‑layer kernel implementations
To balance speed, usability, and accuracy control, Mamba‑3 uses different GPU kernels for different phases.
A GPU kernel is a low‑level program that runs on the GPU. Even for the same math, speed can vary drastically depending on implementation and memory‑access patterns.
| Kernel | Usage | Why this implementation |
|---|---|---|
| Triton | Prefill kernel | Python‑like way to write GPU kernels; prioritizes developer velocity and readability |
| TileLang | For MIMO | Channels interact in MIMO, so precise control of memory hierarchy (registers → shared → global) matters; TileLang expresses tile‑level memory management declaratively |
| CuTe DSL | Decode kernel | Directly uses Hopper‑specific instructions (e.g., TMA) on NVIDIA GPUs like H100 to minimize decode latency |
Having a decode kernel tuned for Hopper‑generation GPUs (e.g., H100) contributes to the measured inference speed. Conversely, older GPUs (e.g., A100) may not achieve the same speeds.
Who already uses SSMs
Mamba‑3’s results are reported at the 1.5B scale, but the SSM architecture is already in commercial use by multiple companies. The idea that “SSMs are still just research” is outdated.
SSM–Transformer hybrid models
Hybrids that combine SSM speed with Transformer retrieval precision are becoming mainstream.
| Company | Product | Scale | Notes |
|---|---|---|---|
| AI21 Labs | Jamba 1.5 | 398B (94B active) MoE | First commercial SSM hybrid; 256K context; available as an NVIDIA NIM and via API |
| NVIDIA | Nemotron‑H | 8B / 47B / 56B | Replaces 92% of Attention layers with Mamba‑2 blocks; achieves 3× throughput vs Transformer |
| NVIDIA | Nemotron 3 Super | 120B (12B active) MoE | Designed for agentic inference |
| IBM | Bamba → Granite 4.0 | 8B | Matches Llama‑3.1 8B quality with 1/7 the data; tech rolling into the next Granite 4.0 |
| Zyphra | Zamba2 | 1.2B / 2.7B / 7.4B | On‑device focus; 2× inference speed and 27% less memory |
Pure SSM models
| Company | Product | Scale | Notes |
|---|---|---|---|
| TII (UAE) | Falcon Mamba 7B | 7B | Best open‑source SSM so far; outperforms Llama‑3.1 8B and Mistral 7B |
| Cartesia AI | Sonic 3 (TTS) / Rene (LLM) | 1.3B | Real‑time speech AI in 42 languages; Albert Gu (SSM founder) is Chief Scientist |
One notable point: NVIDIA itself adopts an SSM hybrid in its inference stack (Nemotron‑H). The GPU vendor pushing a Transformer alternative makes sense because SSMs make GPUs more efficient. If the same hardware can serve more requests, demand doesn’t fall—new use cases open up.
Who benefits, and how
LLM API providers (Together AI, AWS Bedrock, etc.)
You can process more requests per GPU. Workloads heavy on long inputs/outputs—document summarization, code generation, long‑form translation—see large throughput gains per GPU. That feeds directly into lower service prices.
Concretely, if Mamba‑3 is ~7× faster than a Transformer at 16,384 tokens, you could potentially handle roughly 7× more requests per GPU. Since cloud GPU time dominates LLM service cost, the impact is substantial.
Edge and on‑device inference
The fixed‑size state of SSMs is a natural fit for memory‑constrained devices (phones, IoT). Transformer KV caches grow with sequence length; SSM state stays constant regardless of input length.
Zyphra’s Zamba2 follows exactly this path, optimizing 1.2B–7.4B models for on‑device inference and cutting memory by 27%. Running quality LLMs on phones becomes realistic by choosing SSMs where Transformer memory is tight.
Real‑time speech and video
The Mamba family is intrinsically well‑suited to continuous signals (speech, sensor data). S4 first shined on speech recognition and time‑series forecasting. Mamba‑3’s low‑latency inference can power real‑time ASR, simultaneous translation, and live captioning with lower latency than Transformers.
Cartesia AI’s Sonic 3 already offers SSM‑based real‑time TTS in 42 languages, demonstrating SSM practicality in this domain.
Applications needing long context
Because latency grows only linearly with tokens, tasks like repo‑wide code analysis, book‑length summarization/translation, or chatbots retaining very long histories become practical at lengths that were unrealistic with Transformers.
AI21 Labs’ Jamba 1.5 supports 256K context specifically because it is an SSM hybrid. Handling 256K tokens with a pure Transformer would consume tens of GBs of memory just for the KV cache.
Current limitations
That said, Transformers won’t be replaced overnight.
- As noted, SSMs lag on tasks requiring precise retrieval. They are not ideal for RAG generation where exact citation is mandatory
- Compared to Transformers, the SSM ecosystem is smaller—fewer optimization stacks (vLLM, TensorRT‑LLM) and finetuning tools (e.g., LoRA), though NVIDIA’s adoption in Nemotron‑H should accelerate tooling
- The Mamba‑3 paper reports experiments from 180M to 1.5B; results at 70B–hundreds of billions are not reported. While large‑scale hybrids exist (AI21’s 398B Jamba 1.5, NVIDIA’s 56B Nemotron‑H), pure‑SSM results at that scale are still to come
- Even the paper notes that hybrids (SSM + Attention) are the most promising for practical use