
LFM2.5 1.2B JP on M1 Max 64GB: 208 tok/s decode, JSON OK, name hallucinated

Ikesan

I ran Liquid AI’s LFM2.5-1.2B-JP-202606 on an M1 Max with 64 GB of memory.
With llama.cpp (GGUF Q4_K_M) it decodes at 208 tok/s, and short-text reformatting finishes in about 0.2 seconds.
JSON structure is preserved, but the Q4_K_M build misspells its own model name as “FDM2.5”, and tool calls never fired in my llama.cpp and Ollama runs.

I previously wrote up the LFM2.5 hybrid architecture.
That article focused on the stack of 10 double-gated LIV convolution layers plus 6 GQA layers, which is neither a Transformer nor Mamba.
This time the topic is not structure: I ran the 1.2B model locally to see its speed and practical limits.

LFM2.5, not the old LFM2-1.2B

Liquid’s documentation marks LFM2-1.2B as “deprecated” and directs users to LFM2.5-1.2B-Instruct.
No reason to dig into the old version.
Search results still surface the old repo, so watch out when downloading.

The LFM2.5 1.2B family splits into Base, Instruct, Thinking, JP, VL, and Audio.
For local Japanese chat and lightweight tool use, LFM2.5-1.2B-JP-202606 is the one.
The Hugging Face model card positions the 202606 release as the latest, with improvements in knowledge, instruction-following, math, code, and tool use over the prior LFM2.5-1.2B-JP.

Numbers from the card: 1.17B parameters, 16 layers (10 LIV convolution + 6 GQA), vocabulary 65,536, trained on 31.5T tokens, 32,768-token context, English + Japanese.
Training data cutoff is mid-2024, so recent events are out.
Recommended generation settings: temperature=0.1, top_k=50, repetition_penalty=1.05.
The card itself warns against fine-grained knowledge tasks, so the right test inputs are explicit instructions, structured output, and bilingual short assists—not trivia.

What the JP benchmarks say

The card compares against Qwen3-1.7B, Llama-3.2-1B, and Gemma-3-1B on Japanese benchmarks.
The score breakdown shows strengths and weaknesses directly.

| Benchmark | Score | What it measures |
| --- | --- | --- |
| J-MIFEval | 79.08 | Japanese instruction-following |
| J-GSM8K | 62.20 | Japanese arithmetic word problems |
| JMMLU | 54.19 | Japanese knowledge (multiple choice) |
| JHumanEval+ | 49.39 | Code generation |
| JMMLU-ProX | 36.23 | Harder knowledge/reasoning |
| Domain average | 53.11 | |

Instruction-following (J-MIFEval) at 79 is high; knowledge-heavy (JMMLU-ProX) drops to 36.
The card’s “not recommended for knowledge tasks” refers to this gap.
So my tests focus on whether the model holds structure under formatting/restructuring instructions, not knowledge queries.

GGUF and MLX at ~700 MB

There’s a GGUF for the Japanese version.
LiquidAI/LFM2.5-1.2B-JP-GGUF offers Q4_0 at 696 MB, Q4_K_M at 731 MB, Q8_0 at 1.25 GB.
The 202606 version also has LiquidAI/LFM2.5-1.2B-JP-202606-GGUF.

On Mac, MLX is worth checking too.
Liquid states LFM2.5 supports llama.cpp, MLX, vLLM, ONNX, etc.
When I ran Qwen3.6-35B-A3B on Ollama, even that large MoE fit in GPU memory on the M1 Max 64GB.
At 1.2B, memory isn’t the constraint, so I’m looking at startup cost, short-response latency, and whether longer Japanese input degrades speed.

Because memory isn’t a bottleneck on the M1 Max 64GB, there is room to test Q8_0 (1.25 GB) as a quality ceiling alongside the smartphone-oriented Q4_K_M.
To match smartphone conditions instead, stick to Q4_K_M and compare.

For llama.cpp, load the GGUF directly.

llama-server -hf LiquidAI/LFM2.5-1.2B-JP-202606-GGUF:Q4_K_M \
  -c 4096 \
  --jinja \
  --temp 0.1 \
  --top-k 50 \
  --repeat-penalty 1.05 \
  --port 8765
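
Once the server is up, the test prompts can be sent to its OpenAI-compatible chat endpoint. The following is a minimal sketch, not the exact client used for the measurements: the helper names `build_chat_request` and `send_chat` are hypothetical, the `max_tokens` value is my own choice, and it assumes the server from the command above is reachable at `127.0.0.1:8765`.

```python
import json
import urllib.request

# Assumption: the llama-server from the command above is listening locally.
BASE_URL = "http://127.0.0.1:8765"


def build_chat_request(user_text, system_text=None, max_tokens=256):
    """Build an OpenAI-style chat completion payload as a plain dict."""
    messages = []
    if system_text:
        messages.append({"role": "system", "content": system_text})
    messages.append({"role": "user", "content": user_text})
    # Temperature, top-k and repeat penalty are already set by the server
    # flags, so only request-specific fields are sent here.
    return {"messages": messages, "max_tokens": max_tokens}


def send_chat(payload, base_url=BASE_URL, timeout=60):
    """POST the payload to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Building the payload needs no running server; send_chat(payload) does.
payload = build_chat_request(
    "次の文章を、意味を変えずに短くして。\n\n"
    "M1 MaxでローカルLLMを試すときは、速度だけでなく起動時間とテンプレート崩れも見る。"
)
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

Calling `send_chat(payload)` against the running server returns the assistant text. Passing `system_text` adds a system message in front of the user turn.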

If your Ollama supports HF GGUF directly, try the same model from Ollama.
Whether Ollama’s chat template works as expected with this model is itself part of the test.
LFM2.5 uses a ChatML-style template with <|tool_call_start|> / <|tool_call_end|> tokens for tool calls.

ollama run hf.co/LiquidAI/LFM2.5-1.2B-JP-202606-GGUF:Q4_K_M

For MLX, use the 202606 MLX repo and compare against GGUF.
On Apple Silicon, GGUF/Metal and MLX sometimes behave differently in speed and memory, so I send the same prompts.

Actual test results

Environment

| Item | Value |
| --- | --- |
| Machine | M1 Max 64GB |
| OS | macOS 26.5 (Darwin 25.5.0) |
| llama.cpp (ggml) | 0.10.1 (Homebrew) |
| Ollama | 0.17.7 |
| mlx-lm | 0.31.3 |
| Model | LiquidAI/LFM2.5-1.2B-JP-202606-GGUF Q4_K_M (694MB) / Q8_0 (1.25GB) |
| Context | 4096 / 32768 |

Startup and memory

From llama.cpp logs.

llama-server -hf LiquidAI/LFM2.5-1.2B-JP-202606-GGUF:Q4_K_M \
  -c 4096 --jinja --temp 0.1 --top-k 50 --repeat-penalty 1.05 --port 8765

| Item | ctx 4096 | ctx 32768 |
| --- | --- | --- |
| Resident memory (RSS) | 838 MB | 1,171 MB |
| KV cache | | 384 MB (f16) |
| Metal placement | All 17 layers GPU (MTL0) | All 17 layers GPU (MTL0) |
| Flash Attention | auto → enabled | auto → enabled |
| Fused Gated Delta Net | enabled (autoregressive + chunked) | enabled (autoregressive + chunked) |
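
The 384 MB (f16) KV cache figure can be sanity-checked with a back-of-the-envelope calculation. This sketch assumes only the 6 GQA layers hold a per-token KV cache, and it assumes 8 KV heads with a head dimension of 64. Those two head values are not stated in this article, so treat them as assumptions to verify against the model config.

```python
def kv_cache_bytes(n_ctx, n_attn_layers=6, n_kv_heads=8, head_dim=64,
                   bytes_per_elem=2):
    """Estimate KV cache size for the attention layers only.

    n_attn_layers=6 comes from the model card numbers quoted above;
    n_kv_heads and head_dim are assumed values, not taken from the article.
    bytes_per_elem=2 corresponds to an f16 cache.
    """
    # K and V each store n_kv_heads * head_dim values per token per layer.
    per_token = 2 * n_attn_layers * n_kv_heads * head_dim * bytes_per_elem
    return n_ctx * per_token


for n_ctx in (4096, 32768):
    mib = kv_cache_bytes(n_ctx) / 2**20
    print(f"ctx {n_ctx}: {mib:.0f} MiB")
```

Under these assumptions the estimate is 48 MiB at ctx 4096 and 384 MiB at ctx 32768, which matches the logged value for the larger context and roughly matches the RSS difference between the two runs.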

GGUF metadata has n_ctx_train = 128,000.
The model card says 32,768, but the GGUF has 128k baked in.
Starting with ctx 32768 triggers a note: n_ctx_seq (32768) < n_ctx_train (128000).

Architecture name is lfm2, tokenizer type is gpt2 (pre: lfm2).
Vocabulary 65,536, template is ChatML format (<|im_start|> / <|im_end|>).

Short-text speed

Prompt sent

次の文章を、意味を変えずに短くして。

M1 MaxでローカルLLMを試すときは、速度だけでなく起動時間とテンプレート崩れも見る。

Output (llama.cpp, Ollama, MLX all produced the same result)

M1 MaxでローカルLLMを試す際は、速度・起動時間・テンプレート崩れを確認する。

All three runtimes returned the same text. At temperature 0.1 the output is highly reproducible, though not strictly deterministic.

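| Runtime | prompt tok/s | decode tok/s | total ms |
| --- | --- | --- | --- |
| llama.cpp ctx 4096 (1st) | 1,544 | 208 | 164 |
| llama.cpp ctx 4096 (2nd) | 1,194 | 209 | 192 |
| llama.cpp ctx 32768 | 1,175 | 204 | 197 |
| Ollama (Q4_K_M) | 664 | 208 | 285 |
| MLX (safetensors, f32 load) | 57 | 126 | 1,220 |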

llama.cpp GGUF Q4_K_M is fastest on M1 Max.
Ollama matches decode speed but halves prompt processing.
MLX is clearly slower at 126 tok/s decode. Likely because it loads safetensors directly without quantization.

Does it break JSON?

Prompt sent

次の入力からJSONだけを返して。
キーは title, tags, summary。
tagsは日本語の短い文字列配列。

入力:
LFM2.5 1.2B JP 202606をM1 Maxで試す。GGUFとMLXで速度を比べ、JSON出力と日英翻訳も見る。

Q4_K_M output

{
  "title": "FDM2.5 速度比較実験",
  "tags": ["FDM2.5", "速度比較", "M1 Max"],
  "summary": "GGUFとMLXを用いてFDM2.5の速度を比較し、JSON出力と日本語翻訳を提供する。"
}

Q8_0 output

{
  "title": "LFM2.5 1.2B JP 202606をM1 Maxで試す",
  "tags": ["Firmware", "Performance", "JP", "202606", "M1 Max"],
  "summary": "GGUFとMLXを使用して速度を比較し、JSON出力と日英翻訳を提供します。"
}

Both parse successfully as valid JSON. Structure is preserved.
However, Q4_K_M rewrites “LFM2.5” as “FDM2.5” (reproduced in both llama.cpp and Ollama).
Q8_0 produces the correct “LFM2.5”, which suggests that 4-bit quantization costs precision on alphanumeric proper nouns, although a single prompt is not enough to prove it.

Tag quality is mediocre for both. Q4_K_M follows the “Japanese tags” instruction but mangles the model name; Q8_0 produces English tags.
At 1.2B, accurately preserving proper nouns while perfectly following instructions hits a wall.
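
A check like this is easy to automate. The sketch below is one possible validator, not the script used for this article: it parses the output, verifies the three requested keys, and tests whether the model name from the input survived. The function name `check_output` is hypothetical, and the sample string is the Q4_K_M output shown above.

```python
import json

REQUIRED_KEYS = {"title", "tags", "summary"}


def check_output(raw, must_contain="LFM2.5"):
    """Return a small report on structure and proper-noun preservation."""
    report = {"parses": False, "keys_ok": False, "tags_ok": False,
              "name_ok": False}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return report
    report["parses"] = True
    if not isinstance(data, dict):
        return report
    # Exactly the requested keys, no extras and none missing.
    report["keys_ok"] = set(data) == REQUIRED_KEYS
    tags = data.get("tags")
    report["tags_ok"] = isinstance(tags, list) and all(
        isinstance(tag, str) for tag in tags
    )
    # The model name from the input should appear somewhere in the output.
    report["name_ok"] = must_contain in raw
    return report


q4_output = """{
  "title": "FDM2.5 速度比較実験",
  "tags": ["FDM2.5", "速度比較", "M1 Max"],
  "summary": "GGUFとMLXを用いてFDM2.5の速度を比較し、JSON出力と日本語翻訳を提供する。"
}"""

print(check_output(q4_output))
```

For the Q4_K_M sample, the structural checks pass and only `name_ok` is false, which is the failure pattern described above: valid JSON with a corrupted proper noun.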

Bilingual input

Prompt sent

次の作業メモを、あとで再実行できる短い手順に直して。

llama-server was started with --jinja, but the response still included tool_call tokens.
Maybe the chat template is not applied by the client.
M1 Max 64GB / Q4_K_M / ctx 4096.

Output (Q4_K_M, Ollama)

**修正手順:**

1. **設定ファイルの確認**
   - `--jinja` オプションが正しく適用されているか再確認する。
   - チャットテンプレートがクライアント側で正しく読み込まれているか検証する。

2. **応答ログの分析**
   - 出力ログを抽出し、`tool_call tokens` が含まれる箇所を特定する。
   - 該当部分がテンプレート未適用の結果かどうかを判断する。

3. **設定パラメータの調整**
   - `--jinja` オプションの使用方法を再確認し、必要に応じて正しい構文に修正する。
   - 例:`--jinja` の代わりに `--template` を使用するなど。

(rest omitted; 376 tokens in total)

It can take bilingual notes and convert them to steps.
Extremely verbose though. The 1.2B model tends to write long unless you explicitly say “keep it short.”
Technical terms (--jinja, tool_call tokens) are preserved intact, so there’s potential as a log-cleaning utility.

Decode speed was 211 tok/s, same as short text. Longer output just extends wall time (1.7 seconds).

Tool call behavior

Text-prompt format test.

利用可能なツール:
get_weather(city: string) -> string

ユーザー:
東京の天気を調べて、短く答えて。

Output

東京の天気は晴れで、気温は約20℃です。

It answers directly instead of calling the tool.
Same behavior when passing tools via the OpenAI-compatible API.

{
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "指定した都市の現在の天気を返す",
      "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}
    }
  }]
}

Result: no tool_call field, content directly outputs “東京の現在の天気は晴れで、気温は約20℃です。”

The model card lists tool calling as a supported use case, but via llama.cpp/Ollama the tool call format never appeared.
Either the <|tool_call_start|> / <|tool_call_end|> special tokens don’t fire through GGUF, or it’s a template issue.
For agent use, you’d need to verify whether special tokens work via HuggingFace transformers separately.
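
To tell "no tool call at all" apart from "tool call emitted as raw text", the response can be inspected in two steps. This is a sketch under assumptions: it expects an OpenAI-style response dict with `choices[0].message`, the function name `find_tool_calls` is hypothetical, and the raw-token example in the usage lines is invented for illustration, since no such output was observed in these runs.

```python
import re

# Special tokens named in the model's chat template description above.
TOOL_CALL_RE = re.compile(
    r"<\|tool_call_start\|>(.*?)<\|tool_call_end\|>", re.DOTALL
)


def find_tool_calls(response):
    """Classify a chat completion response by how tool calls appear."""
    message = response["choices"][0]["message"]
    structured = message.get("tool_calls") or []
    if structured:
        # The server parsed the call into the structured field.
        return {"kind": "structured", "calls": structured}
    raw = TOOL_CALL_RE.findall(message.get("content") or "")
    if raw:
        # The model emitted the tokens, but they stayed in plain content.
        return {"kind": "raw_tokens", "calls": [r.strip() for r in raw]}
    return {"kind": "none", "calls": []}


# Shape of the result observed in this article: plain content only.
observed = {"choices": [{"message": {
    "role": "assistant",
    "content": "東京の現在の天気は晴れで、気温は約20℃です。",
}}]}
print(find_tool_calls(observed))

# Hypothetical case where the special tokens leak into the content.
leaked = {"choices": [{"message": {
    "role": "assistant",
    "content": '<|tool_call_start|>[get_weather(city="東京")]<|tool_call_end|>',
}}]}
print(find_tool_calls(leaked))
```

A `raw_tokens` result would point at a template or parsing issue on the serving side, while `none` means the model never attempted a call.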

Does it actually use 32768 tokens?

Restarted with ctx 32768 and fed an entire blog post: 8,867 bytes (2,529 tokens).

以下の記事の要点を3行で抽出して。

(full text of the LFM2.5 hybrid architecture article)

Output

1. Liquid AIはMIT CSAIL由来の「液体ニューラルネットワーク」を基に、エッジデバイス向けに
   Attention+短距離畳み込みのハイブリッドアーキテクチャを採用したスタートアップ。
2. LFM2.5の核心はAttentionと短距離畳み込みの組み合わせで、1.2Bモデルでは全16レイヤーの
   うち10層がDouble-gated LIV convolution、6層がGrouped Query Attentionを用い、
   CPUでのprefill/デコード速度がTransformerの約2倍、メモリ効率も優れる。
3. 従来のSSM系に代わる「どっちでもないやつ」として、GQA・短距離畳み込み・線形Attentionの
   最適なトレードオフを実現。(cut off at 256 tokens)

| Item | Result |
| --- | --- |
| input tokens | 2,529 |
| prompt processing | 2,628 tok/s (962ms) |
| decode | 204 tok/s |
| Summary quality | Captures the article’s core points. No hallucination |

No speed degradation at ~2,500 input tokens.
Prefill (input processing) is actually faster here (2,628 tok/s) than with short text (about 1,200 tok/s at roughly 50 tokens).
A likely explanation is that a longer prompt fills the batch and amortizes fixed per-request overhead; caching may also play a part.

Short-text speed at ctx 4096 vs ctx 32768 is essentially identical (decode 208 vs 204 tok/s), so widening the context window adds minimal overhead.
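
The measured rates translate into wall time with simple division. The sketch below is only an estimate: it treats prefill and decode as the whole cost, ignores startup and sampling overhead, and uses a hypothetical helper name, `estimate_seconds`.

```python
def estimate_seconds(prompt_tokens, prompt_tps, output_tokens, decode_tps):
    """Split a request into prefill time and decode time, in seconds."""
    prefill = prompt_tokens / prompt_tps
    decode = output_tokens / decode_tps
    return prefill, decode


# Long-input summary run: 2,529 prompt tokens, output cut off at 256 tokens.
prefill_s, decode_s = estimate_seconds(2529, 2628, 256, 204)
print(f"prefill {prefill_s:.2f} s, decode {decode_s:.2f} s")

# Bilingual memo run: 376 output tokens at 211 tok/s (prompt time ignored).
_, memo_decode_s = estimate_seconds(0, 1, 376, 211)
print(f"memo decode {memo_decode_s:.2f} s")
```

The prefill estimate reproduces the logged 962 ms. For a model this fast, output length dominates wall time, which is why capping the response length matters more than the context size.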

Q4_K_M vs Q8_0

| Item | Q4_K_M | Q8_0 |
| --- | --- | --- |
| File size | 694 MB | 1.25 GB |
| decode (Ollama) | 208 tok/s | 157 tok/s |
| JSON syntax | OK | OK |
| Model name hallucination | Yes (LFM→FDM) | No |
| prompt processing (Ollama) | 664 tok/s | 516 tok/s |

Q8_0 decodes about 25% slower but did not show the proper-noun hallucination in this test.
With M1 Max 64GB, 1.25 GB is trivial, so Q8_0 is viable for accuracy.
Perceived difference is “instant” vs “almost instant.” Both finish short text in a few hundred milliseconds.

Conversation naturalness

I saw claims that “LFM2.5-1.2B-JP has natural conversation,” so I checked.

Without a system prompt, casual chat returns bullet-pointed polite Japanese.

Input: 今日めっちゃ暑いんだけど、なんか涼しくなる方法ない?

Output:
暑いときの涼しさ対策は、以下の方法が有効です:
1. **水分補給** こまめに水や水分補給できる飲み物を摂ることで体温調節を助けます。
2. **風通しの良い場所へ移動** 窓を開ける、扇風機を使う (rest omitted)

With system prompt “カジュアルに話す友達です。敬語は使わないで。” (Be a casual friend. Don’t use polite language.), it switches.

Output: 日差し遮るやつ持って行けばいいよ!日傘とか帽子で直接光を避けよう。水分補給も忘れずに、
冷たい飲み物飲んで体を冷やすのが一番だよ。窓の外を見て、風通しのいい場所へ移動するのもアリ。
ちょっとした工夫で楽になるはず!

Multi-turn context carry-over.

Input: 家の中にいるんだけどエアコンつけたくないんだよね、電気代やばくて

Output: 窓を開けて通風させるか、薄手のカーテンで風通し作るのがいい。湿気取りの除湿機使えば
快適になるよ。自然の風を活かすだけでも効果あるから、それがベスト!

Instruction-following on system prompts is good. “Don’t use polite language” and it actually switches to casual Japanese.
For 1.2B, the tone switching is smooth—no stray polite forms leaking in.
The “natural” claim probably refers to this: small models often ignore tone instructions and revert to polite forms. This one doesn’t.

Content depth is shallow though. As a chat companion, reactions are thin and repetitive.
“Natural conversation” means “tone control works,” not “interesting replies.”

Knowledge limits and hallucination

Training data cutoff is mid-2024.
I asked “What’s Claude 4 Opus?” and got this.

**Claude 4 Opus** は、Microsoftが開発したオープンソースの自然言語処理(NLP)モデルで…
- **開発者**: Microsoft Research
- **リリース日**: 2023年(オープンソースとして公開)
- **パラメータ数**: 約1.5億パラメータ

Complete nonsense. It can’t say “I don’t know” and confidently fabricates.
1.2B lacks both knowledge capacity and the ability to express uncertainty.
Knowledge queries are outside this model’s scope—the card’s “not recommended for fine-grained knowledge tasks” means exactly this.

Fine-tuning feasibility

Is 1.2B a size an individual can fine-tune?
With LoRA/QLoRA, absolutely.

  • Weights at fp16 ~2.4 GB, quantized ~700 MB
  • QLoRA training VRAM: ~4–8 GB
  • M1 Max 64GB Unified Memory makes MLX-based training realistic
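
The weight sizes in the list follow from the parameter count. This is a rough sketch that counts weights only, treats MB and GB as decimal units, and assumes the GGUF files contain little besides weights. It says nothing about optimizer state or activations during training.

```python
N_PARAMS = 1.17e9  # parameter count quoted from the model card above


def weight_bytes(n_params, bits_per_weight):
    """Storage needed for the weights alone at a given precision."""
    return n_params * bits_per_weight / 8


def effective_bits(file_bytes, n_params=N_PARAMS):
    """Average bits per weight implied by a file size."""
    return file_bytes * 8 / n_params


fp16_gb = weight_bytes(N_PARAMS, 16) / 1e9
q4_bits = effective_bits(694e6)   # Q4_K_M file size from this article
q8_bits = effective_bits(1.25e9)  # Q8_0 file size from this article

print(f"fp16 weights: {fp16_gb:.2f} GB")
print(f"Q4_K_M: {q4_bits:.2f} bits/weight, Q8_0: {q8_bits:.2f} bits/weight")
```

The fp16 figure comes out at about 2.34 GB, and the two GGUF files imply roughly 4.7 and 8.5 bits per weight. The gap above the nominal 4 and 8 bits is what you would expect from quantization scales and other metadata stored alongside the weights.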

What LLM fine-tuning actually involves breaks into three broad approaches.

| Method | What it does | Use case |
| --- | --- | --- |
| SFT (Supervised Fine-Tuning) | Train on input/expected-output pairs | Strengthening instruction-following (“only return JSON”, “use this tone”) |
| DPO / RLHF | Steer the model with preferred/rejected response pairs | Safety tuning, tone adjustment |
| Continued pre-training | Re-inject knowledge via additional corpus | Domain knowledge (legal, medical, internal terminology) |

For 1.2B fine-tuning, SFT to embed “response patterns specific to your use case” is realistic.
For example: specialize it for extracting metadata from your blog posts, or train it to only emit a specific JSON format.
Continued pre-training at 1.2B risks destroying existing knowledge (catastrophic forgetting), so caution is needed.

Liquid AI’s LFM2.5 license is lfm1.0, which permits commercial use and distribution of derivatives under its terms.
I see no license barrier to fine-tuning it and using the result as your own model, provided you stay within those terms.

Horse racing prediction (thought experiment)

If it’s a size you can fine-tune yourself, could you feed it horse racing data and have it output predictions?
Run continued pre-training on race results, bloodlines, and track conditions, then SFT the pattern “input race info → output prediction format.” Put it on a phone and get local predictions while watching the paddock.

There are many walls though.

  • 1.2B has limited knowledge capacity. Memorizing entire bloodline tables is impossible—you’d pass them at inference time (RAG-like approach)
  • Continued pre-training with racing knowledge risks destroying Japanese ability (catastrophic forgetting)
  • Horse racing “correct answers” are probabilistic. Past winning patterns don’t transfer when future races have different track/condition variables
  • Real-time odds and day-of horse weight don’t exist at training time, requiring a separate mechanism to pass them at inference

The practical approach would be not to fine-tune the 1.2B model with racing knowledge, but to use 1.2B as a “structured output engine”—pass all race info as text input and have it output predictions in a fixed format.
The prediction logic doesn’t need to live inside the model. You’re betting on input data volume and prompt engineering.
With 32,768 tokens of context, one race’s entry table, past records, and bloodlines should all fit.
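
Whether a given race sheet fits can be estimated before sending it. The sketch below reuses the one ratio measured in this article (8,867 bytes of blog text became 2,529 tokens) as a rough bytes-per-token figure. That ratio comes from a single mixed Japanese and English article, so it is an approximation rather than a tokenizer count, and the helper names and the output reserve of 1,024 tokens are my own assumptions.

```python
# Ratio measured in the long-input test above: 8,867 bytes -> 2,529 tokens.
BYTES_MEASURED = 8867
TOKENS_MEASURED = 2529
BYTES_PER_TOKEN = BYTES_MEASURED / TOKENS_MEASURED


def estimated_tokens(text):
    """Rough token estimate from UTF-8 byte length."""
    return round(len(text.encode("utf-8")) / BYTES_PER_TOKEN)


def fits(text, n_ctx=32768, reserve_for_output=1024):
    """True if the text likely fits while leaving room for the reply."""
    return estimated_tokens(text) <= n_ctx - reserve_for_output


sample = "あ" * 1000  # 3,000 bytes of Japanese text as a stand-in input
print(f"{BYTES_PER_TOKEN:.2f} bytes/token")
print(estimated_tokens(sample), fits(sample))
```

At about 3.5 bytes per token, the budget works out to roughly 111 KB of input text. For an exact count, tokenize with the model's own tokenizer instead.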

Can the same 1.2B run on a smartphone?

Separate from the M1 Max story, this size was designed for phones in the first place.
LFM2.5-1.2B-JP-202606’s model card lists “on-device personal assistant” as a recommended use case.
Liquid AI designed LFM2 from the start for edge (phones and embedded devices, not servers), and ships the LEAP SDK and Apollo app for mobile execution.

Memory is fine.
The Japanese GGUF is 696 MB at Q4_0, 731 MB at Q4_K_M—under 1 GB at 4-bit quantization.
LEAP’s Android SDK requires 3 GB+ device RAM; the Apollo app requires 4 GB+. Any phone from the past few years fits.

Smartphone speed (Thinking variant benchmarks, same 1.2B architecture)

The Thinking variant (same 1.2B architecture) has published benchmarks that give a ballpark.
Not JP-202606’s exact numbers, but same layer structure and size, so the order of magnitude transfers.

| Chip (device) | Runtime | prefill | decode | Memory |
| --- | --- | --- | --- | --- |
| Snapdragon 8 Elite for Galaxy (S25 Ultra) | llama.cpp / CPU / Q4_0 | 336 tok/s | 70 tok/s | 720MB |
| Snapdragon 8 Elite (ROG Phone 9 Pro) | NexaML / NPU | 4,391 tok/s | 82 tok/s | 0.9GB |
| Apple A19 Pro | Cactus / CPU / INT8 | 420 tok/s | 64 tok/s | 1,128MB |

Decode (generation) lands in the 64–82 tok/s range on both CPU and NPU, so generation speed doesn’t vary much across runtimes.
The difference shows in prefill (input processing), where NPU (the phone’s dedicated AI processing unit) gains an order of magnitude, meaning long inputs start generating faster.

NPU-equipped phones are common

“Do phones even have NPUs?” you might think, but since 2024 virtually every flagship has one.

| Chip | Example devices | NPU name |
| --- | --- | --- |
| Snapdragon 8 Gen 3 / 8 Elite | Galaxy S24 Ultra / S25 Ultra, ROG Phone 9 | Hexagon NPU |
| Apple A17 Pro / A18 Pro / A19 Pro | iPhone 15 Pro / 16 Pro / 17 Pro | Neural Engine (16 cores) |
| Google Tensor G4 | Pixel 9 Pro | Google TPU |
| MediaTek Dimensity 9300 | vivo X100 Pro | APU 790 |

These NPUs handle not just image processing and camera AI but also LLM inference.
However, “running LLMs on NPU” requires a dedicated runtime—it’s a different path from llama.cpp CPU execution.
For LFM2.5, NPU execution is available through NexaML and the LEAP SDK, so supported chips can use it.

The M1 Max’s measured 208 tok/s decode is about 3× smartphone CPU execution (64–70 tok/s).
Even on NPU, phone decode is ~82 tok/s, so pure generation speed goes to the M1 Max.
But the phone’s advantage is that it fits in your pocket and is always on.
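
The ratios behind that comparison are a one-liner each. This sketch only divides the decode rates quoted in this article. Keep in mind that the phone numbers come from the Thinking variant and different quantizations, so the ratios are an order-of-magnitude guide.

```python
M1_MAX_DECODE_TPS = 208  # llama.cpp, Q4_K_M, measured in this article

# Decode rates from the published Thinking-variant table quoted above.
phone_decode_tps = {
    "S25 Ultra (llama.cpp, CPU)": 70,
    "ROG Phone 9 Pro (NexaML, NPU)": 82,
    "A19 Pro (Cactus, CPU)": 64,
}

ratios = {
    name: M1_MAX_DECODE_TPS / tps for name, tps in phone_decode_tps.items()
}

for name, ratio in ratios.items():
    print(f"{name}: M1 Max is {ratio:.2f}x faster at decode")
```

The CPU runs land at roughly 3× and the NPU run at about 2.5×, which is a smaller gap than the difference in device class might suggest.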

The problem is how to load the JP version

“Can 1.2B run on a phone?” is mostly solved; what remains is “how to get JP-202606 specifically on the phone.”
Two main execution routes on the phone side.

One is the official route: download from the Apollo app (iOS/Android, fully local) or LEAP SDK model library.
However, the library mainly lists generic versions like LFM2-1.2B, and LFM2.5-1.2B-JP-202606 may not be listed directly.

The other is loading GGUF directly.
LEAP’s Android SDK has loadSimpleModel for arbitrary on-disk GGUFs, and generic llama.cpp-based GGUF runner apps also work with the same ~700 MB file.
To try the JP version, this is the route.

flowchart TD
    A[Run LFM2.5-1.2B-JP-202606<br/>on a phone] --> B{Is the JP build in<br/>the official library?}
    B -- Yes --> C[Download directly via<br/>Apollo / LEAP]
    B -- No --> D[Prepare the JP<br/>GGUF Q4_K_M]
    D --> E[Load with loadSimpleModel or<br/>a generic GGUF app]

Minimum requirements: Android 12+, device RAM 3 GB+ (LEAP SDK spec).
iOS goes through Apollo or Swift SDK.

Size and architecture are phone-native from the start, so performance should not be the bottleneck.
The only friction is whether to wait for JP-202606 to appear in the official library or manually place the GGUF yourself.
Running the same Q4_K_M on a phone gives a direct order-of-magnitude comparison against M1 Max decode speed.

Browser execution

At ~700 MB, it technically runs in a browser via WebGPU/WebAssembly.
Routes include wllama (llama.cpp Wasm build) loading the GGUF directly, or MLC LLM’s WebGPU version.
The initial 700 MB download and 3–5× speed penalty vs native make it impractical.
It does work as a “no-install LLM” tech demo at this size.

Comparison target: probably TinySwallow, not Qwen

The official bench compares against Qwen3-1.7B and Llama-3.2-1B-Instruct.
But for a small local Japanese helper, the comparison target isn’t a large model like Qwen3.6 35B.
TinySwallow, Gemma 3 1B, Sarashina 1B-class, Hy-MT2 1.8B—“light but different purpose” models are closer.

In my Hy-MT2 1.8B on M1 Max article, I tested it as a translation-specific model with JSON, SRT, HTML, and terminology constraints.
LFM2.5-1.2B-JP-202606 isn’t translation-specific, so rather than head-to-head comparison, I’m looking at short-text formatting, structured output, bilingual memo handling, and light tool calls.

What matters most at 1.2B is not the ceiling of intelligence but the lightness of failure.
If it starts at 700 MB, responds quickly for tens of tokens, and mostly preserves JSON, it has a place as a resident local auxiliary model.
