
LFM2.5 - a hybrid architecture that's neither Transformer nor Mamba

Ikesan

The usual story about LLM architecture is that Transformers dominate, with SSM-based models such as Mamba as the main alternative. That framing got old fast, so I looked at something that is neither: Liquid AI’s LFM2.5.

What Liquid AI is

Liquid AI is an AI startup spun out of MIT’s CSAIL, the Computer Science and Artificial Intelligence Laboratory. It was founded in 2023 and builds foundation models for edge devices based on research into liquid neural networks.

The company’s name comes from “liquid neural networks,” an architecture inspired by the nervous system of nematodes: the networks change their behavior dynamically in response to input, and that idea carries over into the LFM architecture.

LFM2.5 architecture

The core of LFM2.5 is a hybrid of attention and short-range convolutions.

For the 1.2B model, out of 16 total layers:

  • 10 layers: Double-gated LIV convolution for short-range dependencies
  • 6 layers: Grouped Query Attention (GQA) for long-range dependencies

Attention blocks account for only about 37% of the model. The remaining 63% is made up of cheaper convolution blocks.

That split matters. CPU prefill and decode run about 2x faster than similarly sized Transformer models, and because only the 6 attention layers keep a KV cache, memory efficiency improves as well.
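The KV-cache saving falls straight out of the layer split: only the attention layers store keys and values. A back-of-the-envelope sketch (the head counts and dimensions below are assumed for illustration, not LFM2.5’s published config):

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Bytes needed to cache K and V for every attention layer (fp16 by default)."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical GQA config: 8 KV heads of dim 64, 32k context, fp16.
full_transformer = kv_cache_bytes(n_attn_layers=16, n_kv_heads=8, head_dim=64, seq_len=32768)
hybrid = kv_cache_bytes(n_attn_layers=6, n_kv_heads=8, head_dim=64, seq_len=32768)

print(hybrid / full_transformer)  # 6/16 of the cache -> 0.375
```

Whatever the real head configuration is, the ratio is fixed by the layer counts: a 6-of-16 hybrid carries 37.5% of the KV cache of an all-attention model at the same width.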

What LIV convolution is

LIV stands for “Linear Input-Varying.” As the name suggests, it is a linear operator, applied here as a short convolution, whose behavior varies with the input while the compute cost stays linear in sequence length.

The processing flow looks like this:

input x -> linear transform -> [component1, component2, component3]
component1 -> depthwise Conv1D (kernel size = 3)
output = Linear(component1_conv * gate(component2) * component3)

Because the kernel size is just 3, the compute cost stays linear. At the same time, the double-gated structure gives it much higher expressiveness than a static convolution filter.

The main difference from ordinary convolutions is the input-dependent gating. The same weights can extract different features depending on the input context. That is a good fit for Liquid AI’s core idea of dynamic behavior.
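The flow above can be sketched in a few lines of NumPy. This is a hypothetical minimal implementation of a double-gated short convolution, not Liquid AI’s actual code; the exact shapes and the sigmoid gate are assumptions:

```python
import numpy as np

def liv_conv_block(x, W_in, kernel, W_out):
    """Minimal double-gated short-convolution sketch (details assumed).

    x      : (seq_len, d)  token features
    W_in   : (d, 3*d)      projection into the three components
    kernel : (3, d)        depthwise causal Conv1D weights, kernel size 3
    W_out  : (d, d)        output projection
    """
    seq_len, d = x.shape
    b, c, g = np.split(x @ W_in, 3, axis=-1)           # component1..3
    padded = np.vstack([np.zeros((2, d)), b])          # causal left-padding
    # depthwise Conv1D: output at t mixes positions t-2, t-1, t per channel
    conv = sum(padded[k:k + seq_len] * kernel[k] for k in range(3))
    gate = 1.0 / (1.0 + np.exp(-c))                    # sigmoid gate on component2
    return (conv * gate * g) @ W_out                   # Linear(conv * gate * component3)

rng = np.random.default_rng(0)
d, seq = 8, 5
y = liv_conv_block(rng.normal(size=(seq, d)),
                   rng.normal(size=(d, 3 * d)),
                   rng.normal(size=(3, d)),
                   rng.normal(size=(d, d)))
print(y.shape)  # (5, 8)
```

Because every position looks back only two steps, compute and state are constant per token, unlike attention, whose per-token cost grows with sequence length. The gate is where the input dependence lives: the same kernel weights produce different features depending on what `component2` and `component3` extract from the context.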

Why short convolutions instead of SSMs?

This was the most interesting part.

Liquid AI ran hardware-in-the-loop architecture search to find the best layer mix across quality, latency, and memory. The search space included GQA, short convolutions, linear attention, S4, Mamba, and Mamba2.

The result:

When a small number of GQA blocks are available, cheap gated short convolutions give a quality-latency-memory tradeoff that is as good as or better than adding linear attention, SSMs, or long convolutions.

In other words, if attention already covers the long-range part, you do not need SSMs for the local part. Short convolutions are simpler, work better with CPU caches, and are faster on real devices.

Liquid AI’s position is that although SSMs can theoretically handle long dependencies in linear time, short convolutions plus a small number of attention blocks are more practical on edge hardware.

The LFM2.5 family

The public LFM2.5 lineup centers on the 1.2B class.

Model | Parameters | Use case
LFM2.5-1.2B-Base | 1.17B | Base model for fine-tuning
LFM2.5-1.2B-Instruct | 1.17B | General instruction following
LFM2.5-1.2B-Thinking | 1.17B | Reasoning-focused, with chain-of-thought output
LFM2.5-1.2B-JP | 1.17B | Japanese-specific
LFM2.5-VL-1.6B | 1.6B | Vision-language
LFM2.5-Audio-1.5B | 1.5B | Voice interaction, ASR, TTS

LFM2 covered a much wider range from 350M to 8.3B, but LFM2.5 is focused on the 1.2B class, which is where on-device AI demand is strongest.

Changes from LFM2:

  • Pretraining tokens: 10T -> 28T (2.8x)
  • Post-training: large-scale multi-stage reinforcement learning added
  • Japanese-specific and Thinking models added
  • Architecture itself unchanged

The context length is 32,768 tokens. It is distributed in Safetensors, GGUF, ONNX, and MLX formats.

Benchmarks

Text (LFM2.5-1.2B-Instruct)

Benchmark | LFM2.5-1.2B | Qwen3-1.7B | Llama 3.2 1B | Gemma 3 1B
GPQA | 38.89 | 34.85 | 16.57 | 24.24
MMLU-Pro | 44.35 | 42.91 | 20.80 | 14.04
IFEval | 86.23 | 73.68 | 52.37 | 63.25
AIME25 | 14.00 | 9.33 | 0.33 | 1.00

With 1.17B parameters, it beats Qwen3-1.7B, which is 47% larger, on almost every metric.

Japanese (LFM2.5-1.2B-JP)

Benchmark | LFM2.5-JP | LFM2.5-Instruct | Qwen3-1.7B | Llama 3.2 1B
JMMLU | 50.7 | 47.7 | 47.7 | 34.0
M-IFEval (ja) | 58.1 | 41.8 | 40.3 | 24.1
GSM8K (ja) | 56.0 | 46.8 | 46.0 | 25.2

The Japanese-specific version beats Qwen3-1.7B on every metric. For a local Japanese model in the 1.2B class, that is very strong.

Inference speed on edge devices

Device | Framework | Prefill | Decode | Memory
AMD Ryzen AI 9 HX 370 (CPU) | llama.cpp | - | 239 tok/s | <1GB
Snapdragon X Elite (NPU) | NexaML | 2,591 tok/s | 63 tok/s | 0.9GB
Galaxy S25 Ultra (CPU, Q4) | llama.cpp | 335 tok/s | 70 tok/s | 719MB

For comparison on the Galaxy S25 Ultra, Qwen3-1.7B does 181 tok/s prefill, 40 tok/s decode, and uses 1,306MB of memory. LFM2.5 is almost 2x faster with half the memory.
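The “almost 2x, half the memory” summary comes straight from those figures:

```python
# Galaxy S25 Ultra (CPU, Q4) figures quoted above
lfm = {"prefill": 335, "decode": 70, "mem_mb": 719}
qwen = {"prefill": 181, "decode": 40, "mem_mb": 1306}

prefill_speedup = lfm["prefill"] / qwen["prefill"]   # ~1.85x
decode_speedup = lfm["decode"] / qwen["decode"]      # 1.75x
memory_ratio = lfm["mem_mb"] / qwen["mem_mb"]        # ~0.55, i.e. roughly half
```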

Getting usable speed in under 700MB on a phone is a big deal for on-device AI work.

Community use case: Z-Image-Engineer V4

One community model fine-tuned from LFM2.5-1.2B-Base is BennyDaBall/LFM2.5-1.2B-Z-Image-Engineer-V4.

It specializes in automatic expansion of image-generation prompts. If you enter a short phrase like “neon samurai,” it turns that into a 200-250 word prompt with lighting, lens settings, composition, and mood.

It is used mainly in workflows for Z-Image Turbo and Flux2 Klein, and there is even a ComfyUI custom node. It also works in LM Studio and Ollama.

The original series was developed on a Qwen3-4B base, but the LFM2.5-1.2B version is about 3x faster than the Qwen3 version and only about 700MB when quantized to Q4. It was fully fine-tuned on a 55,000-item dataset and uses a custom regularization method called SMART Training.

This is a good example of LFM2.5’s strength: even a 1.2B model can be very practical when the task is specialized.

License note

LFM2 was Apache 2.0-based, but LFM2.5 switched to the proprietary LFM Open License v1.0. Check the terms before using it.