
LFM2.5 - a hybrid architecture that's neither Transformer nor Mamba

Ikesan

The usual story about LLM architecture is that Transformers dominate, with SSM-based models such as Mamba as the main alternative. That framing got old fast, so I looked at something that is neither: Liquid AI’s LFM2.5.

What Liquid AI is

Liquid AI is an AI startup spun out of MIT’s CSAIL, the Computer Science and Artificial Intelligence Laboratory. It was founded in 2023 and builds foundation models for edge devices based on research into liquid neural networks.

The company’s name comes from “liquid neural networks,” an architecture inspired by the nervous system of nematodes: the networks change their behavior dynamically in response to input, and that idea carries over into the LFM architecture.

LFM2.5 architecture

The core of LFM2.5 is a hybrid of attention and short-range convolutions.

For the 1.2B model, out of 16 total layers:

  • 10 layers: Double-gated LIV convolution for short-range dependencies
  • 6 layers: Grouped Query Attention (GQA) for long-range dependencies

Attention blocks account for only about 37% of the model. The remaining 63% is made up of cheaper convolution blocks.

That split matters. CPU prefill and decode run about 2x faster than similarly sized Transformer models, and because only the 6 attention layers keep a KV cache, memory efficiency improves as well.
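The KV-cache saving falls straight out of the layer split: only the attention layers store keys and values. A back-of-the-envelope sketch (the head counts and dimensions below are assumed for illustration, not LFM2.5’s published config):

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Bytes needed to cache K and V for every attention layer (fp16 by default)."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical GQA config: 8 KV heads of dim 64, 32k context, fp16.
full_transformer = kv_cache_bytes(n_attn_layers=16, n_kv_heads=8, head_dim=64, seq_len=32768)
hybrid = kv_cache_bytes(n_attn_layers=6, n_kv_heads=8, head_dim=64, seq_len=32768)

print(hybrid / full_transformer)  # 6/16 of the cache -> 0.375
```

Whatever the real head configuration is, the ratio is fixed by the layer counts: a 6-of-16 hybrid carries 37.5% of the KV cache of an all-attention model at the same width.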

What LIV convolution is

LIV stands for “Linear Input-Varying.” As the name suggests, it is a linear operator, applied here as a short convolution, whose behavior varies with the input while the compute cost stays linear in sequence length.

The processing flow looks like this:

input x -> linear transform -> [component1, component2, component3]
component1 -> depthwise Conv1D (kernel size = 3)
output = Linear(component1_conv * gate(component2) * component3)

Because the kernel size is just 3, the compute cost stays linear. At the same time, the double-gated structure gives it much higher expressiveness than a static convolution filter.

The main difference from ordinary convolutions is the input-dependent gating. The same weights can extract different features depending on the input context. That is a good fit for Liquid AI’s core idea of dynamic behavior.
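The flow above can be sketched in a few lines of NumPy. This is a hypothetical minimal implementation of a double-gated short convolution, not Liquid AI’s actual code; the exact shapes and the sigmoid gate are assumptions:

```python
import numpy as np

def liv_conv_block(x, W_in, kernel, W_out):
    """Minimal double-gated short-convolution sketch (details assumed).

    x      : (seq_len, d)  token features
    W_in   : (d, 3*d)      projection into the three components
    kernel : (3, d)        depthwise causal Conv1D weights, kernel size 3
    W_out  : (d, d)        output projection
    """
    seq_len, d = x.shape
    b, c, g = np.split(x @ W_in, 3, axis=-1)           # component1..3
    padded = np.vstack([np.zeros((2, d)), b])          # causal left-padding
    # depthwise Conv1D: output at t mixes positions t-2, t-1, t per channel
    conv = sum(padded[k:k + seq_len] * kernel[k] for k in range(3))
    gate = 1.0 / (1.0 + np.exp(-c))                    # sigmoid gate on component2
    return (conv * gate * g) @ W_out                   # Linear(conv * gate * component3)

rng = np.random.default_rng(0)
d, seq = 8, 5
y = liv_conv_block(rng.normal(size=(seq, d)),
                   rng.normal(size=(d, 3 * d)),
                   rng.normal(size=(3, d)),
                   rng.normal(size=(d, d)))
print(y.shape)  # (5, 8)
```

Because every position looks back only two steps, compute and state are constant per token, unlike attention, whose per-token cost grows with sequence length. The gate is where the input dependence lives: the same kernel weights produce different features depending on what `component2` and `component3` extract from the context.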

Why short convolutions instead of SSMs?

This was the most interesting part.

Liquid AI ran hardware-in-the-loop architecture search to find the best layer mix across quality, latency, and memory. The search space included GQA, short convolutions, linear attention, S4, Mamba, and Mamba2.

The result:

When a small number of GQA blocks are available, cheap gated short convolutions give a quality-latency-memory tradeoff that is as good as or better than adding linear attention, SSMs, or long convolutions.

In other words, if attention already covers the long-range part, you do not need SSMs for the local part. Short convolutions are simpler, work better with CPU caches, and are faster on real devices.

Liquid AI’s position is that although SSMs can theoretically handle long dependencies in linear time, short convolutions plus a small number of attention blocks are more practical on edge hardware.

The LFM2.5 family

The public LFM2.5 lineup centers on the 1.2B class.

Model | Parameters | Use case
LFM2.5-1.2B-Base | 1.17B | Base model for fine-tuning
LFM2.5-1.2B-Instruct | 1.17B | General instruction following
LFM2.5-1.2B-Thinking | 1.17B | Reasoning-focused, with chain-of-thought output
LFM2.5-1.2B-JP | 1.17B | Japanese-specific
LFM2.5-VL-1.6B | 1.6B | Vision-language
LFM2.5-Audio-1.5B | 1.5B | Voice interaction, ASR, TTS

LFM2 covered a much wider range from 350M to 8.3B, but LFM2.5 is focused on the 1.2B class, which is where on-device AI demand is strongest.

Changes from LFM2:

  • Pretraining tokens: 10T -> 28T (2.8x)
  • Post-training: large-scale multi-stage reinforcement learning added
  • Japanese-specific and Thinking models added
  • Architecture itself unchanged

The context length is 32,768 tokens. It is distributed in Safetensors, GGUF, ONNX, and MLX formats.

Benchmarks

Text (LFM2.5-1.2B-Instruct)

Benchmark | LFM2.5-1.2B | Qwen3-1.7B | Llama 3.2 1B | Gemma 3 1B
GPQA | 38.89 | 34.85 | 16.57 | 24.24
MMLU-Pro | 44.35 | 42.91 | 20.80 | 14.04
IFEval | 86.23 | 73.68 | 52.37 | 63.25
AIME25 | 14.00 | 9.33 | 0.33 | 1.00

With 1.17B parameters, it beats Qwen3-1.7B, which is 47% larger, on almost every metric.

Japanese (LFM2.5-1.2B-JP)

Benchmark | LFM2.5-JP | LFM2.5-Instruct | Qwen3-1.7B | Llama 3.2 1B
JMMLU | 50.7 | 47.7 | 47.7 | 34.0
M-IFEval (ja) | 58.1 | 41.8 | 40.3 | 24.1
GSM8K (ja) | 56.0 | 46.8 | 46.0 | 25.2

The Japanese-specific version beats Qwen3-1.7B on every metric. For a local Japanese model in the 1.2B class, that is very strong.

Inference speed on edge devices

Device | Framework | Prefill | Decode | Memory
AMD Ryzen AI 9 HX 370 (CPU) | llama.cpp | - | 239 tok/s | <1GB
Snapdragon X Elite (NPU) | NexaML | 2,591 tok/s | 63 tok/s | 0.9GB
Galaxy S25 Ultra (CPU, Q4) | llama.cpp | 335 tok/s | 70 tok/s | 719MB

For comparison on the Galaxy S25 Ultra, Qwen3-1.7B does 181 tok/s prefill, 40 tok/s decode, and uses 1,306MB of memory. LFM2.5 is almost 2x faster with half the memory.
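The “almost 2x, half the memory” summary comes straight from those figures:

```python
# Galaxy S25 Ultra (CPU, Q4) figures quoted above
lfm = {"prefill": 335, "decode": 70, "mem_mb": 719}
qwen = {"prefill": 181, "decode": 40, "mem_mb": 1306}

prefill_speedup = lfm["prefill"] / qwen["prefill"]   # ~1.85x
decode_speedup = lfm["decode"] / qwen["decode"]      # 1.75x
memory_ratio = lfm["mem_mb"] / qwen["mem_mb"]        # ~0.55, i.e. roughly half
```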

Getting usable speed in under 700MB on a phone is a big deal for on-device AI work.

Community use case: Z-Image-Engineer V4

One community model fine-tuned from LFM2.5-1.2B-Base is BennyDaBall/LFM2.5-1.2B-Z-Image-Engineer-V4.

It specializes in automatic expansion of image-generation prompts. If you enter a short phrase like “neon samurai,” it turns that into a 200-250 word prompt with lighting, lens settings, composition, and mood.

It is used mainly in workflows for Z-Image Turbo and Flux2 Klein, and there is even a ComfyUI custom node. It also works in LM Studio and Ollama.

The original series was developed on a Qwen3-4B base, but the LFM2.5-1.2B version is about 3x faster than the Qwen3 version and only about 700MB when quantized to Q4. It was fully fine-tuned on a 55,000-item dataset and uses a custom regularization method called SMART Training.

This is a good example of LFM2.5’s strength: even a 1.2B model can be very practical when the task is specialized.

License note

LFM2 was Apache 2.0-based, but LFM2.5 switched to the proprietary LFM Open License v1.0. Check the terms before using it.