Running a Non-Qwen MoE on SwiftLM: Ling-flash-2.0 MXFP4 on M1 Max 64GB
The previous hands-on, Running SwiftLM on M1 Max 64GB and Comparing It to Ollama and MLX-lm, only covered Qwen. Both Qwen3.6-35B-A3B and Qwen3.5-122B-A10B shipped in MLX 4-bit form under mlx-community/ and were effectively guaranteed to load under mlx-swift-lm.
This time I wanted to see how far SwiftLM goes with a non-Qwen MoE. The candidate that came up was Ling-flash-2.0, the predecessor of the new Ling-2.6-flash covered in the Tencent Hy3-preview and Ant Ling-2.6-flash same-week release news. Ling-2.6-flash itself is API-only as of this writing, but the 2.0 version has been open since last year under MIT with the same bailing_moe family. It’s a 100B-total / 6.1B-active MoE with a 32K context (YaRN-extendable to 128K). An official 4-bit version under mlx-community/ didn’t exist yet, but the personal repo exdysa/Ling-flash-2.0-MLX-MXFP4 has a 54.7GB MXFP4-quantized MLX build. That’s what I ran.
Pre-flight check
SwiftLM delegates to mlx-swift-lm for everything model-related, so the run is a hard no if mlx-swift-lm doesn’t support bailing_moe and MXFP4. I already knew SwiftLM is a thin layer over mlx-swift from the previous Qwen write-up, but architecture-level support lives in mlx-swift-lm.
I checked both before starting. Fortunately both were there.
bailing_moe support
$ ls ~/projects/SwiftLM/mlx-swift-lm/Libraries/MLXLLM/Models/ | grep -i bail
BailingMoe.swift
BailingMoe.swift is present, 364 lines. In LLMModelFactory.swift it’s registered as:
"bailing_moe": create(BailingMoeConfiguration.self, BailingMoeModel.init),
The comment at the top says this is ported from the Python bailing_moe.py in mlx-lm. Reading through, the V2-specific features are implemented too.
| V2 feature | HF config key | Swift implementation |
|---|---|---|
| Expert grouping | n_group | ✅ nGroup |
| Router bias correction | moe_router_enable_expert_bias | ✅ expertBias as MLXArray |
| Shared experts | num_shared_experts | ✅ sharedExperts |
| First K dense replace | first_k_dense_replace | ✅ firstKDenseReplace |
Ling-flash-2.0’s config.json declares architectures: ["BailingMoeV2ForCausalLM"] but model_type: "bailing_moe". mlx-swift-lm switches on model_type, so V1/V2 differences are absorbed on the implementation side.
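To make the dispatch concrete, here is a toy sketch of a model_type-keyed factory in the style the Swift code implies. The registry contents and function names are illustrative stand-ins, not the actual table in LLMModelFactory.swift:

```python
# Toy model_type-keyed factory, mirroring the dispatch style described above.
# Registry contents are stand-ins for illustration only.
config = {
    "architectures": ["BailingMoeV2ForCausalLM"],  # V2 marker -- NOT used for dispatch
    "model_type": "bailing_moe",                   # the key the factory switches on
}

registry = {
    "bailing_moe": "BailingMoeModel",  # one entry absorbs both V1 and V2
}

def resolve(cfg, reg):
    # Dispatch strictly on model_type; V2-specific features are handled
    # inside the model class via config keys like n_group and expert bias.
    return reg[cfg["model_type"]]

print(resolve(config, registry))  # BailingMoeModel
```

The `architectures` field never enters the lookup, which is why a V2 checkpoint loads through the single `bailing_moe` registration.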
MXFP4 support
$ grep -rE "mxfp4|MXFP4" ~/projects/SwiftLM/mlx-swift/ | head -5
Source/Cmlx/mlx/mlx/backend/metal/kernels/fp_quantized.metal
Source/Cmlx/mlx/mlx/primitives.cpp
Source/python/tests/test_quantized.py
fp_quantized.metal is present at the Metal kernel level, and both quantize(mode="mxfp4") and dequantize pass their tests. Also, LLMModelFactory.swift already ships with mlx-community/gpt-oss-20b-MXFP4-Q8 registered as a ModelConfiguration — there’s prior evidence of MXFP4 being exercised in practice, just not with bailing_moe.
The bailing_moe × MXFP4 combination specifically wasn’t validated anywhere I could find, but the building blocks are all present.
Downloading the model
54.7GB across 11 shards.
HF_HUB_ENABLE_HF_TRANSFER=1 hf download exdysa/Ling-flash-2.0-MLX-MXFP4
hf_transfer pulled at 60-70 MB/s. Done in about 14 minutes.
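The timing is self-consistent; 54.7GB at 60-70 MB/s lands right on the observed ~14 minutes:

```python
# Back-of-envelope check on the download time.
size_mb = 54.7 * 1000            # ~54,700 MB (decimal GB)
fast_min = size_mb / 70 / 60     # minutes at 70 MB/s
slow_min = size_mb / 60 / 60     # minutes at 60 MB/s
print(f"{fast_min:.0f}-{slow_min:.0f} minutes")  # 13-15 minutes
```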
Key fields of config.json:
{
  "architectures": ["BailingMoeV2ForCausalLM"],
  "model_type": "bailing_moe",
  "num_experts": 256,
  "num_experts_per_tok": 8,
  "num_shared_experts": 1,
  "n_group": 8,
  "first_k_dense_replace": 1,
  "moe_router_enable_expert_bias": true,
  "hidden_size": 4096,
  "num_hidden_layers": 32,
  "num_attention_heads": 32,
  "num_key_value_heads": 4,
  "max_position_embeddings": 32768,
  "quantization": { "group_size": 32, "bits": 4, "mode": "mxfp4" }
}
Top-8 of 256 experts, plus one shared. It also carries the DeepSeek-V3-style n_group grouped routing. Qwen3.6-35B-A3B and Qwen3.5-122B-A10B run a looser ungrouped top-8-of-128 layout, so Ling’s 256 experts with group-8 routing is a first for SwiftLM here.
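To make the grouped routing concrete, here is a toy sketch of group-limited top-k selection with the Ling shapes (256 experts, n_group=8, top-8, router bias used for selection only). The number of groups kept per token (`topk_group`) and the exact scoring are assumptions; the real values live in the config and in BailingMoe.swift:

```python
import numpy as np

# Toy sketch of DeepSeek-V3-style grouped top-k routing with Ling shapes.
# topk_group=4 is an assumed value, not taken from the real config.
num_experts, n_group, top_k, topk_group = 256, 8, 8, 4
group_size = num_experts // n_group  # 32 experts per group

rng = np.random.default_rng(0)
scores = rng.random(num_experts)                  # router affinities for one token
expert_bias = rng.normal(0, 0.01, num_experts)    # moe_router_enable_expert_bias

biased = scores + expert_bias
# Step 1: rank groups by their best biased expert score, keep the top groups.
group_scores = biased.reshape(n_group, group_size).max(axis=1)
keep_groups = np.argsort(group_scores)[-topk_group:]
mask = np.full(num_experts, -np.inf)
for g in keep_groups:
    mask[g * group_size:(g + 1) * group_size] = 0.0
# Step 2: top-8 experts within the surviving groups (bias affects selection only).
chosen = np.argsort(biased + mask)[-top_k:]
# Gating weights come from the unbiased scores of the chosen experts.
weights = scores[chosen] / scores[chosen].sum()
print(len(chosen))  # 8 experts, all inside the kept groups; shared expert is always-on
```

The shared expert (num_shared_experts=1) sits outside this selection entirely and runs for every token.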
Starting the server
Reusing the --product SwiftLM binary from the previous run. I deliberately did not pass --stream-experts this time — 54.7GB sits below the 64GB unified memory, so I was betting no overcommit mitigation was needed.
cd ~/projects/SwiftLM
nohup .build/release/SwiftLM \
--model exdysa/Ling-flash-2.0-MLX-MXFP4 \
--port 5413 > /tmp/swiftlm-ling.log 2>&1 &
Startup log (the important parts):
[SwiftLM] Loading model: exdysa/Ling-flash-2.0-MLX-MXFP4
[SwiftLM] ⚠️ Memory strategy: SWAP-ASSISTED (1.0× overcommit, cache limited to 2MB)
[SwiftLM] Model exceeds RAM by 1%. macOS swap will be used. Expect 2-4× slowdown.
[SwiftLM] Auto-partitioning: 31/32 layers on GPU
[SwiftLM] Loading LLM (large language model)...
[SwiftLM] Loaded model configuration. Inferred tool call format: nil
[SwiftLM] ⚠️ Model does not support layer partitioning (architecture not yet adapted)
[SwiftLM] Model loaded. Starting HTTP server on 127.0.0.1:5413
[SwiftLM] ✅ Ready. Listening on http://127.0.0.1:5413
{"event":"ready","engine":"mlx","partition":{
"strategy":"swap_assisted","model_weight_gb":54.7,"kv_cache_gb":0.3,
"total_required_gb":65.9,"system_ram_gb":68.7,"overcommit_ratio":1.02,
"gpu_layers":31,"cpu_layers":1,"total_layers":32,
"estimated_tok_s":3.3}}
Up in about two minutes, and the model loads fine. What’s unexpected is that the strategy isn’t full_gpu but swap_assisted. The log reports total_required_gb of 65.9GB, noticeably more than the 55GB of weights plus KV cache, so some runtime overhead is evidently baked into the estimate, and an overcommit_ratio of 1.02, which implies the usable budget is smaller than the reported 68.7GB system_ram_gb. Either way, SwiftLM’s policy is conservative: over budget by even 1% means swap-assisted.
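A hypothetical reconstruction of the decision rule, inferred purely from the log lines (the function name and the hard 1.0 threshold are my assumptions, not SwiftLM source):

```python
# Inferred strategy pick: no grace margin above an overcommit ratio of 1.0.
# This mirrors the observed behavior; it is not SwiftLM's actual code.
def pick_strategy(overcommit_ratio: float) -> str:
    return "full_gpu" if overcommit_ratio <= 1.0 else "swap_assisted"

print(pick_strategy(1.02))  # swap_assisted -- even 2% over drops to swap
print(pick_strategy(0.98))  # full_gpu
```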
And then there’s this warning:
⚠️ Model does not support layer partitioning (architecture not yet adapted)
bailing_moe has a model implementation in BailingMoe.swift, but it’s not registered in SwiftLM’s layer-partitioning strategy (the one that splits layers between GPU and CPU). The post-startup config says gpu_layers: 31 / cpu_layers: 1, but that’s the auto-partitioner’s nominal calculation; at actual model execution time it almost certainly isn’t applied.
On top of that, ssd_stream=disabled and turbo_kv=disabled. Neither kicks in unless you explicitly pass --stream-experts or --turbo-kv.
Smoke test: swap goes full, prefill drops to 0.3 tok/s
The server was up, so I sent 1+1は? as the minimum prompt:
curl -s http://localhost:5413/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "exdysa/Ling-flash-2.0-MLX-MXFP4",
"stream": false,
"messages": [{"role":"user","content":"1+1は?"}]
}'
Server log at this point:
srv slot_launch: id 0 | prompt=23t | thinking=false | prefilling...
srv slot update: id 0 | prefill done | n_tokens=23, t=74.85s, 0.3t/s |
OS_RAM=51.1GB | MEM_DEMAND=99.5GB | GPU_MEM=50.9GB
Prefill of 0.3 tok/s (23 tokens in 74.85 seconds). Qwen3.6-35B-A3B did around 40 tok/s in the previous run, so this is two orders of magnitude slower. MEM_DEMAND=99.5GB is wild — model 54.7GB + KV 0.3GB = 55GB should have been the ceiling, but actual demand almost doubled.
vm_stat and sysctl vm.swapusage confirmed:
vm.swapusage: total = 14336.00M used = 13176.50M free = 1159.50M
Pages wired down: 2,628,154 (≈ 42 GB)
Swap is maxed out. The entire 14GB macOS swap pool is used and it’s still not enough. Once M1 Max hits its swap ceiling it’s a strong candidate for a Watchdog kernel panic. The server keeps responding, but each token now takes seconds.
The most plausible explanation for MEM_DEMAND ballooning to almost double the model size is the MXFP4 dequantization path. MXFP4 stores 4 bits plus a shared scale per 32-element block; at inference time it’s expanded to fp16. 54.7GB of MXFP4 dequantized to fp16 would be 218GB in the naive case. Per-layer lazy expansion should be the real behavior — but the Model does not support layer partitioning warning is exactly saying that didn’t happen: the per-layer strategy isn’t wired up, so the expansion buffers got allocated too aggressively.
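The expansion factor is easy to pin down from the block layout. The 218GB figure ignores the per-block scale overhead; accounting for it (in the OCP MX spec, MXFP4 is 32 FP4 values sharing one 8-bit E8M0 scale per block) gives roughly 3.8×:

```python
# MXFP4 block layout: 32 values at 4 bits each plus one shared 8-bit
# scale per 32-element block; fp16 is 16 bits per value.
block = 32
mxfp4_bits_per_block = block * 4 + 8   # 136 bits stored per block
fp16_bits_per_block = block * 16       # 512 bits per block after dequant
expansion = fp16_bits_per_block / mxfp4_bits_per_block

model_gb = 54.7
print(f"{expansion:.2f}x -> {model_gb * expansion:.0f} GB fully dequantized")
print(f"ignoring scales: {model_gb * 4:.0f} GB")  # the naive 4x figure
```

Either way the fully materialized fp16 tensor set is 3-4× the 64GB machine, so anything short of strict per-layer expansion was bound to spill.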
I killed the first run here.
Restarting with --stream-experts flips everything
Same binary, this time with --stream-experts:
pkill -f "SwiftLM --model"
nohup .build/release/SwiftLM \
--model exdysa/Ling-flash-2.0-MLX-MXFP4 \
--stream-experts \
--port 5413 > /tmp/swiftlm-ling-stream.log 2>&1 &
Startup log:
[SwiftLM] Enabled Async SSD Streaming on directory: 494c709b...
[SwiftLM] 💾 Memory strategy: SSD STREAMING (page-cache managed, 50GB RAM budget, no swap)
[SwiftLM] SSD Streaming active: Bypassing CPU auto-partitioning (forcing all layers to GPU)
[SwiftLM] Loaded model configuration. Inferred tool call format: nil
[SwiftLM] ⚠️ Model does not support SSD expert streaming
[SwiftLM] 🧠 Auto-calibration (Wisdom) bypassed for SSD Streaming
[SwiftLM] ✅ Ready. Listening on http://127.0.0.1:5413
{"event":"ready","partition":{
"strategy":"ssd_streaming","model_weight_gb":54.7,"kv_cache_gb":0.3,
"total_required_gb":65.9,"system_ram_gb":68.7,"overcommit_ratio":1.02,
"gpu_layers":32,"cpu_layers":0,"ssd_stream":true,
"estimated_tok_s":3.3}}
Yet another warning: ⚠️ Model does not support SSD expert streaming. So bailing_moe is unsupported both for layer partitioning and for SSD expert streaming. And yet — the observed behavior is completely different from the first run.
Same 1+1は? with max_tokens=30:
srv slot update: id 0 | prefill done | n_tokens=23, t=4.64s, 5.0t/s |
OS_RAM=6.1GB | MEM_DEMAND=6.1GB | GPU_MEM=5.9GB
srv slot done: id 0 | gen_tokens=30 | OS_RAM=6.0GB | MEM_DEMAND=6.0GB
Everything flipped on its own.
| Item | Without --stream-experts | With --stream-experts |
|---|---|---|
| strategy | swap_assisted | ssd_streaming |
| prefill (cold) | 0.3 t/s (75s) | 5.0 t/s (4.6s) |
| OS_RAM | 51.1 GB | 6.1 GB |
| MEM_DEMAND | 99.5 GB | 6.1 GB |
| GPU_MEM | 50.9 GB | 5.9 GB |
⚠️ Model does not support SSD expert streaming says one thing, but SSD STREAMING (page-cache managed, 50GB RAM budget, no swap) is a hint the lower layers absolutely act on. In other words, the real effect of --stream-experts is not “fire the SSD expert streaming path”, but “don’t eager-load the whole model into RAM — run under mmap and let the page cache handle it” as a pressure setting for the macOS kernel. Models like Ling — which, left alone, end up blowing up through MXFP4 dequantization — need that hint. Qwen didn’t.
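The mmap behavior in question is ordinary virtual-memory semantics and easy to demonstrate outside SwiftLM. This is a generic illustration, not SwiftLM code:

```python
import mmap, os, tempfile

# Generic demonstration of mmap + page cache: mapping a file reserves
# address space without committing RAM; pages fault in only when touched,
# and the kernel can evict clean pages under pressure instead of swapping.
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.truncate(64 * 1024 * 1024)  # 64 MB of sparse "weights", costs no RAM

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        first, last = m[0], m[len(m) - 1]  # touch exactly two pages
print(first, last)  # 0 0 -- everything else stays unread on disk
```

With an eager read the whole 64MB would be resident immediately; with mmap only the touched pages are, which is exactly the difference between the two SwiftLM runs scaled up to 54.7GB.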
Measuring speed on BST insert
Warm, using the same prompt as the previous Qwen hands-on:
curl -s http://localhost:5413/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model":"exdysa/Ling-flash-2.0-MLX-MXFP4",
"stream":false,
"max_tokens":300,
"messages":[{"role":"user","content":"Pythonで、二分探索木に値を挿入する関数 insert(root, val) を書いて。短く。"}]
}'
Output (excerpt):
class Node:
    def __init__(self, val):
        self.val = val
        self.left = None
        self.right = None

def insert(root, val):
    if not root:
        return Node(val)
    if val< root.val:
        root.left = insert(root.left, val)
    else:
        root.right = insert(root.right, val)
    return root
Server log:
srv slot update: id 0 | prefill done | n_tokens=46, t=2.27s, 20.2t/s
srv slot done: id 0 | gen_tokens=248 | OS_RAM=6.9GB | GPU_MEM=5.9GB
- 46 prompt tokens + 248 generated tokens, wall 38.0 seconds
- prefill 20.2 t/s (warm)
- generation 6.95 tok/s (248 tokens across roughly 35.7 seconds)
- memory stable at OS_RAM 6.9GB / GPU_MEM 5.9GB
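The per-phase numbers reconcile with the wall clock:

```python
# Reconciling the reported throughput with the 38.0s wall clock.
prefill_s, gen_tokens, wall_s = 2.27, 248, 38.0
gen_s = wall_s - prefill_s          # ~35.7s attributed to generation
gen_tps = gen_tokens / gen_s
print(f"{gen_tps:.2f} tok/s")       # ~6.94, matching the reported 6.95
```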
What’s interesting is the val< root.val in the generated code — the same missing-space quirk the Qwen3.6-35B-A3B run on SwiftLM produced. Same SwiftLM sampler, same glitch, regardless of the underlying model. That adds one more data point to the hypothesis from the previous article that this is a SwiftLM-side sampler thing, not something specific to mlx-lm.
Speed comparison
Stacking today’s numbers on the existing measurements:
| Model | Runtime | generation |
|---|---|---|
| Qwen3.6-35B-A3B (UD-MLX-4bit) | Ollama GGUF | 27 tok/s |
| Qwen3.6-35B-A3B (UD-MLX-4bit) | MLX-lm | 54 tok/s |
| Qwen3.6-35B-A3B (UD-MLX-4bit) | SwiftLM | 20 tok/s |
| Qwen3.5-122B-A10B-4bit | SwiftLM --stream-experts | 4.25 tok/s |
| Ling-flash-2.0 MXFP4 (100B-A6.1B) | SwiftLM --stream-experts | 6.95 tok/s |
Active 6.1B and total 54.7GB mean less memory pressure than Qwen3.5-122B-A10B, hence the higher number there. Against Qwen3.6-35B-A3B’s 20 tok/s on SwiftLM, the drop is roughly proportional to the total parameter difference (100B vs 35B). The broader pattern — SwiftLM getting half or less of mlx-lm’s speed — reproduces on a non-Qwen architecture too. That makes it more likely the gap between mlx-swift and mlx-lm is model-agnostic.
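Whether total parameter count is the right denominator for a memory-bound MoE is debatable, but the proportionality claim checks out numerically:

```python
# If generation speed scaled inversely with total parameter count:
qwen_tps, qwen_total_b, ling_total_b = 20.0, 35, 100
predicted = qwen_tps * qwen_total_b / ling_total_b
print(predicted)  # 7.0 tok/s predicted vs. 6.95 measured
```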
/metrics and the mysterious zero SSD I/O
After BST + persona, /metrics reports:
swiftlm_tokens_per_second 5.84
swiftlm_memory_active_bytes 6,327,000,048 (6.3 GB)
swiftlm_memory_peak_bytes 7,155,477,540 (7.2 GB)
swiftlm_ssd_throughput_mbps 0.0
swiftlm_ssd_bytes_read_total 0
swiftlm_ssd_chunks_total 0
swiftlm_ssd_bytes_read_total: 0. Same conclusion as Qwen3.5-122B-A10B in the previous hands-on: the ssd_stream=true flag is set, but SwiftLM’s native SSD chunk path never fires.
The implication is different this time though.
- Previous: Qwen3.5-122B-A10B ran fine without `--stream-experts`; even with the flag set, mmap + page cache was enough on its own, so SSD streaming was effectively nominal.
- Today: Ling-flash-2.0 dies in swap without `--stream-experts`. With the flag it runs, but it’s still mmap + page cache, with no real SSD reads.
So the real role of --stream-experts is not to stream experts from SSD, but to switch SwiftLM’s memory allocation strategy into ssd_streaming (50GB RAM budget + no swap + page-cache-managed). Ling needs that hint; Qwen didn’t.
Kana-chan persona, two turns
Same system prompt as the previous hands-on, two turns (skipping the short-answer Turn 3 this time).
Turn 1: self-introduction
はいっ!私はかなちゃんよ~!いつも元気にしてるかな?
最近はAIの仕組みを色々と調べてたり、ゲームの企画を考えたりしてるの~!特に「AIとゲームを融合させた面白いやつ」作りたかったんだよね~💡
…
(語尾に「〜だよ」「〜かな」を自然に使ってみました~!)
(Roughly: “Hi! I’m Kana-chan! Doing well as always? Lately I’ve been looking into how AI works and sketching game ideas! I especially wanted to build ‘something fun that fuses AI and games’ 💡 … (I used 〜だよ and 〜かな naturally in the sentence endings!)”)
prompt 89 / completion 142, wall 22.18s, prefill 42.3 t/s, generation ~7.1 t/s.
Three things jump out:
- The first-person pronoun is `私は` instead of the requested `わたし`
- Emojis (💡, 😊) appear even though the prompt didn’t ask for them
- At the end, a meta comment: “(I used 〜だよ and 〜かな naturally in the sentence endings!)”, narrating that it satisfied the instruction
Qwen3.6-35B-A3B on SwiftLM held the わたし pronoun correctly, so this reads as Ling-flash-2.0’s Japanese instruction-following being weaker (or the MXFP4 quantization taking a slice off that specific capability).
Turn 2: three favorite games
わたしの好きなゲーム3つを紹介するね!
- 「Stardew Valley」——田舎でんぼん稼ぎと冒険、とても癒やしいよ〜!
- 「AI: The Somnium Files」——未来とAIが絡んだ謎解きADV、考えさせられるんだよ〜
- 「Undertale」——感動的なストーリーと面白い戦い方、何度でも遊べちゃうの♪
(Roughly: “Let me introduce three games I love! Stardew Valley: country でんぼん稼ぎ [sic] and adventure, so soothing! AI: The Somnium Files: a mystery ADV weaving together the future and AI, really makes you think. Undertale: a moving story and fun combat, endlessly replayable♪”)
prompt 137 / completion 130, wall 18.36s, prefill 101.6 t/s, generation ~7.6 t/s.
- The pronoun returns to `わたし`
- Proper nouns (Stardew Valley / AI: The Somnium Files / Undertale) are typographically clean; the `Elden Ring` → `elden ring` case mismatch Qwen produced on SwiftLM does not appear
- `でんぼん稼ぎ` (probably a mis-generation of `でん農稼ぎ` or `田んぼ稼ぎ`, meaning something like “paddy-field earnings”) sneaks in as a broken Japanese word
Prefill hits 101.6 t/s on Turn 2, prompt-cache hit. SwiftLM’s cache pathway works fine on Ling too.
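Why a cache hit helps is simple prefix arithmetic. A toy sketch, with stand-in token ids; the real cache granularity inside SwiftLM is unknown to me:

```python
def uncached_tokens(prompt, cached_prefixes):
    """Tokens that still need prefill: everything past the longest shared prefix."""
    best = 0
    for prefix in cached_prefixes:
        n = 0
        while n < min(len(prefix), len(prompt)) and prefix[n] == prompt[n]:
            n += 1
        best = max(best, n)
    return len(prompt) - best

turn1 = list(range(89))                  # Turn 1 prompt: 89 tokens (ids are stand-ins)
turn2 = turn1 + list(range(1000, 1048))  # Turn 2 reuses the prefix, adds 48 new tokens
print(uncached_tokens(turn2, [turn1]))   # 48 -- only the new suffix is prefilled
```

With most of the 137-token Turn-2 prompt already cached, the measured prefill rate over the full prompt naturally jumps well above the cold number.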
Wrap-up
Going in, I was counting on BailingMoe.swift being enough for “it’ll just work”. What actually happened was two ⚠️ architecture not yet adapted warnings and an instant swap-death on the first run. The warnings lie (the model still runs), while the flag that looks the loosest (--stream-experts) turns out to be the one that saves the memory allocation. That double inversion is the most interesting part of the Ling run. The actual role of --stream-experts isn’t SSD streaming; it’s a memory-model hint for SwiftLM. That’s the one takeaway I wouldn’t have seen without stepping outside the Qwen lineup.
On speed, generation of 6.95 tok/s is clearly slower than the 20 tok/s Qwen3.6-35B-A3B produced on SwiftLM — the 100B vs 35B total parameter gap explains it. Matching mlx-lm speeds through SwiftLM is still elusive, even on a different architecture. The Japanese instruction drift (pronoun slip, broken words, meta commentary) is a Ling-flash-2.0 property and is independent of SwiftLM.
When Ling-2.6-flash’s weights ship, I want to run them through the same setup. 2.6 is supposed to have hybrid linear attention on top, and whether the existing bailing_moe path in mlx-swift-lm covers that is the next question.