Scaling Qwen3.5-35B-A3B from 4K to 65K Context with Only 800MB Extra VRAM
In the previous article, I got Qwen3.5-35B-A3B running on llama-server + Vulkan. It hit 53 t/s with Q6_K, which was perfectly usable, but since I was just testing things out, I left ctx-size at the default 4096.
I’ve been running this model as an API, accessed from a VPS over Tailscale. Full conversation logs get sent with each request, so at ctx-size 4096, the context overflows mid-conversation and responses start going off the rails. Normally, bumping up the context length means KV cache eats into VRAM and speed tanks — that’s how it works with standard Transformer models. But Qwen3.5-35B-A3B doesn’t play by those rules.
Setup
| Component | Spec |
|---|---|
| PC | GMKtec EVO-X2 |
| CPU | AMD Ryzen AI Max+ 395 (Zen 5 / 16C 32T) |
| GPU | Radeon 8060S (RDNA 3.5 / gfx1151) |
| Memory | 64GB unified memory (BIOS: 32GB VRAM / 32GB system) |
| llama.cpp | b8183 Vulkan + --no-mmap |
| Model | Qwen3.5-35B-A3B Q6_K (26.55 GiB) |
--no-mmap is required to avoid the mmap double-mapping issue on unified memory APUs, as verified in the previous article.
Why Qwen3.5-35B-A3B’s Architecture is KV Cache Friendly
Standard Transformer models have Attention in every layer, and each layer consumes KV cache. KV cache grows proportionally with context length, eating up VRAM.
Qwen3.5-35B-A3B is different. It’s an SSM (State Space Model, Mamba-family) + Attention hybrid, where 30 of 40 layers are SSM and only 10 layers are full Attention. With full_attention_interval = 4, Attention kicks in once every 4 layers.
```mermaid
graph TD
    A["Layer 1-3: SSM"] --> B["Layer 4: Attention"]
    B --> C["Layer 5-7: SSM"]
    C --> D["Layer 8: Attention"]
    D --> E["..."]
    E --> F["Layer 37-39: SSM"]
    F --> G["Layer 40: Attention"]
    style B fill:#f96,stroke:#333
    style D fill:#f96,stroke:#333
    style G fill:#f96,stroke:#333
```
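The layout above follows directly from the interval. A quick sanity check (the exact position of the Attention layer within each block of four is my assumption, read off the diagram):

```python
# With full_attention_interval = 4, every 4th of the 40 layers is Attention.
attention_layers = [i for i in range(1, 41) if i % 4 == 0]
print(attention_layers)       # [4, 8, 12, 16, 20, 24, 28, 32, 36, 40]
print(len(attention_layers))  # 10 Attention layers, 30 SSM layers
```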
Only Attention Layers Consume KV Cache
SSM layers maintain a recurrent state, but it’s a fixed-size buffer (251 MiB) that doesn’t depend on context length. Whether the context is 4K or 256K tokens, SSM layer memory consumption stays the same.
Only the 10 Attention layers consume KV cache. Compared to a standard 40-layer all-Attention model, KV cache memory consumption is roughly 1/4.
| Buffer | Size | ctx-size dependent |
|---|---|---|
| KV cache (10 Attention layers) | Proportional to ctx-size | Yes |
| RS buffer (recurrent state for 30 SSM layers) | 251 MiB fixed | No |
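The KV cache numbers can be reproduced with back-of-the-envelope arithmetic. This sketch assumes 2 KV heads of head_dim 256 per Attention layer (head_dim=256 matches the Qwen3.5 figure cited later; the KV head count is my guess, chosen because it reproduces the measured sizes):

```python
def kv_cache_bytes(ctx, n_attn_layers=10, n_kv_heads=2, head_dim=256,
                   bytes_per_elt=2):  # 2 bytes per element at f16
    # K and V each store ctx * n_kv_heads * head_dim elements per Attention
    # layer; SSM layers contribute nothing that scales with ctx.
    return 2 * n_attn_layers * ctx * n_kv_heads * head_dim * bytes_per_elt

print(kv_cache_bytes(4096) // 2**20)   # 80  -> matches the 4K measurement
print(kv_cache_bytes(32768) // 2**20)  # 640 -> matches the 32K measurement
```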
What’s the RS Buffer?
The RS buffer (Recurrent State Buffer) is the memory region that holds the internal state of SSM layers. While Attention layers keep the Key/Value of all input tokens, SSM layers compress past information into a fixed-size state vector.
SSM processes input tokens one by one, updating the state vector as it goes. The state vector size is determined at model design time (for Qwen3.5-35B-A3B: d_state=128, d_inner=4096). No matter how long the context gets, the state vector size stays the same — old information gets overwritten by new information. This is fundamentally different from KV cache, and one of the reasons SSM has an advantage with long contexts.
RS buffer size depends on n_parallel (number of concurrent processing slots). More slots means more state vectors, but increasing ctx-size doesn’t change anything.
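To make the fixed-size state concrete, here is a toy recurrence (not the actual Mamba kernel — a real SSM layer computes input-dependent transition and mixing matrices) using the d_state=128 / d_inner=4096 dimensions above:

```python
import numpy as np

d_state, d_inner = 128, 4096  # from the model config cited above

rng = np.random.default_rng(0)
# Toy per-channel decay factors; stand-in for the learned SSM dynamics.
A = rng.uniform(0.9, 0.99, size=(d_inner, d_state)).astype(np.float32)
state = np.zeros((d_inner, d_state), dtype=np.float32)

def step(state, x_t):
    """Fold one token into the state; the buffer size never changes."""
    return A * state + x_t[:, None]  # old info decays, new info mixes in

for _ in range(1000):  # process 1,000 tokens -- or a million, same memory
    state = step(state, rng.standard_normal(d_inner).astype(np.float32))

print(state.nbytes / 2**20)  # 2.0 -> a fixed 2 MiB per toy state (f32)
```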
Benchmarks
Measured with 4 parallel slots at different ctx-sizes.
| ctx-size | KV Cache | Total VRAM | Generation Speed |
|---|---|---|---|
| 4,096 | 80 MiB | 27.6 GB | 53.7 t/s |
| 32,768 | 640 MiB | 28.2 GB | 49.9 t/s |
| 65,536 (q8_0 KV) | ~870 MiB (estimated) | ~28.9 GB | 53.6 t/s |
Increasing ctx-size from 4K to 32K (8x) added only 560 MiB of KV cache, and the speed drop was minimal: 53.7 to 49.9 t/s. With a standard Transformer model, scaling like this would pile on several GB of KV cache and tank throughput.
At 65K with q8_0 KV quantization, KV cache stayed around 870 MiB while maintaining 53.6 t/s — essentially identical to the 4K baseline.
Memory Budget Estimates
Actual measurements with 32GB/32GB BIOS split.
| Category | Usage / headroom |
|---|---|
| System RAM | ~13GB in use (Parsec + Tailscale + OS) -> 19GB remaining |
| VRAM | Model + ctx + compute at ~29GB / 32GB -> ~3GB remaining |
The Vulkan0 driver reports free: 46,522 MiB, but this reflects the total physical free memory across unified memory. It doesn’t account for the logical VRAM/system boundary set in BIOS. The actual available headroom is much less, as shown above.
KV Cache Size Estimates
KV cache size estimates at f16 (default) with 4 parallel slots.
| ctx-size | KV Cache (f16) | KV Cache (q8_0) |
|---|---|---|
| 32,768 | 640 MiB | ~320 MiB |
| 65,536 | 1,280 MiB | ~640 MiB |
| 131,072 | 2,560 MiB | ~1,280 MiB |
| 262,144 | 5,120 MiB | ~2,560 MiB |
The model’s n_ctx_train = 262,144, so it natively supports up to 256K tokens. Even at 131K, KV cache is only 2.5 GB (f16). VRAM-wise, even 262K is feasible.
This is where the SSM+Attention hybrid with only 10 Attention layers really pays off. A standard 40-layer full-Attention model would need 20+ GB of KV cache at 262K.
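That comparison works out with simple per-layer arithmetic, assuming (hypothetically) 2 KV heads of head_dim 256 per Attention layer — a guess that reproduces the f16 numbers in the table above:

```python
# ASSUMPTION: 512 KV elements per token per layer (2 heads x head_dim 256).
ctx, kv_width, f16 = 262_144, 2 * 256, 2  # tokens, elems/token/layer, bytes

hybrid = 2 * 10 * ctx * kv_width * f16  # K+V, only 10 Attention layers
full   = 2 * 40 * ctx * kv_width * f16  # K+V, all 40 layers Attention

print(hybrid / 2**30, "GiB")  # 5.0  -> the table's 5,120 MiB row
print(full / 2**30, "GiB")    # 20.0 -> the full-Attention "20+ GB" case
```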
KV Cache Quantization
llama-server has options to quantize the Key and Value caches separately.
```
--cache-type-k q8_0   # Quantize Key to q8_0
--cache-type-v q8_0   # Quantize Value to q8_0
```
q8_0 is roughly half the size of f16. Results with q8_0 KV at ctx=65536:
| Metric | f16 KV (ctx=32K) | q8_0 KV (ctx=65K) |
|---|---|---|
| KV Cache | 640 MiB | ~870 MiB |
| Generation Speed | 49.9 t/s | 53.6 t/s |
Doubled the context length, but KV cache only grew by ~230 MiB. Speed actually went up. No perceptible quality degradation from q8_0 quantization.
The quality impact of KV cache quantization depends on the model architecture and task. Tasks that are sensitive to KV cache precision — like long document summarization or mathematical reasoning — tend to be more affected. For conversational use, q8_0 is fine in practice.
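For intuition on why q8_0 holds up: ggml's q8_0 stores each block of 32 values as one f16 scale plus 32 int8 codes, so 34 bytes replace 64 bytes of f16 — a bit better than half. A toy re-implementation (not llama.cpp's actual kernel):

```python
import numpy as np

QK = 32  # q8_0 block size in ggml

def quantize_q8_0(x):
    """Blocks of 32 floats -> (int8 codes, one f16 scale per block)."""
    blocks = x.reshape(-1, QK)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0  # avoid div-by-zero on all-zero blocks
    q = np.clip(np.round(blocks / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_q8_0(q, scale):
    return (q.astype(np.float32) * scale.astype(np.float32)).ravel()

x = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
q, s = quantize_q8_0(x)
max_err = np.abs(dequantize_q8_0(q, s) - x).max()
# 34 bytes/block (32 int8 + 2-byte scale) vs 64 bytes/block at f16:
print(q.nbytes + s.nbytes, "bytes vs", x.astype(np.float16).nbytes)  # 1088 vs 2048
```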
TurboQuant
As covered in the TurboQuant article, TurboQuant (Google Research, ICLR 2026) compresses KV cache down to 3-4 bits — less than half the size of q8_0.
Implementation Status by Backend
| Backend | Status |
|---|---|
| Metal | Working (tq3/tq4) |
| CUDA | Working (98.8% of q8_0 speed) |
| CPU | Working (zero speed penalty) |
| Vulkan | Early prototype stage |
| upstream llama.cpp | Not merged |
TurboQuant isn’t usable yet on the Radeon 8060S (Vulkan). The Vulkan implementation is in its early stages and hasn’t been merged into llama.cpp mainline. Development is ongoing in llama.cpp Discussion #20969.
QJL vs WHT Verification (Discussion #20969)
In the same Discussion #20969, three independent implementations reproduced and verified TurboQuant’s algorithm. scos-lab (Python, 8 models tested), Arclabs001 (YATQ, PyTorch), and TheTom (turboquant_plus, Metal) all reported that “QJL (Quantized Johnson-Lindenstrauss) is counterproductive in practice, and MSE-only + WHT (Walsh-Hadamard Transform) works best.” In scos-lab’s GPT-2 (head_dim=64) testing with 3-bit quantization, MSE-only showed a perplexity degradation of +7.6%, while the paper’s QJL approach spiked to +300%. QJL removes bias but amplifies variance, and softmax then magnifies that variance.
However, this conclusion was revised on March 30. Arclabs001 and AmesianX showed that replacing the initial implementors’ random orthogonal matrices (QR decomposition) with WHT (Walsh-Hadamard Transform) and combining QJL’s SRHT with independent sign patterns reversed the results. On Qwen3.5 (head_dim=256), WHT+QJL achieved a PPL of 6.54 (just +2.3% over the f16 baseline of 6.39), while QJL-free MSE-only landed at 7.54 (+18%). On the other hand, with head_dim=128, adding QJL degraded results by +65%, confirming the initial reports.
The takeaway is head_dim-dependent: WHT+QJL for 256+, MSE-only + WHT for 128 and below. The TurboQuant paper’s adoption of WHT-based methods was the right call — the initial implementors’ use of random orthogonal matrices was the root cause of poor performance.
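The WHT itself is cheap to sketch. The idea: rotate each head vector with a sign-randomized orthonormal Hadamard transform so outliers spread evenly across dimensions before quantization, then undo the rotation after dequantization. A minimal illustration with crude 8-bit rounding (not the actual TurboQuant code, and not 3-bit):

```python
import numpy as np

def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform; len(v) must be a power of 2.
    The Hadamard matrix is symmetric and orthonormal, so fwht is self-inverse."""
    v = v.copy()
    h, n = 1, len(v)
    while h < n:
        for i in range(0, n, 2 * h):
            a, b = v[i:i + h].copy(), v[i + h:i + 2 * h].copy()
            v[i:i + h], v[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return v / np.sqrt(n)

rng = np.random.default_rng(0)
x = rng.standard_normal(256)                 # one head_dim=256 vector
sign = rng.choice([-1.0, 1.0], size=256)     # random sign flips (SRHT-style)

y = fwht(sign * x)                           # rotate before quantizing
scale = np.abs(y).max() / 127.0
y_hat = np.round(y / scale) * scale          # crude 8-bit quantization
x_hat = sign * fwht(y_hat)                   # undo rotation: D, then H

rel_err = np.linalg.norm(x_hat - x) / np.linalg.norm(x)
print(f"relative L2 error: {rel_err:.4f}")
```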
What This Means for This Setup
Currently running q8_0 KV at ctx=65K, holding 53.6 t/s with no practical issues. When TurboQuant (3-4bit KV) becomes available, KV cache memory drops to less than half, making 131K or 262K context lengths comfortable within VRAM limits.
That said, Qwen3.5-35B-A3B’s KV cache is already small (only 10 Attention layers), so TurboQuant’s benefit isn’t as dramatic as it would be for a full-Attention model. At 262K, f16 KV is only 5.1 GB, so keeping it at 2.5 GB with q8_0 is plenty for now.
Production Config
Ended up running with ctx-size 65536 and q8_0 KV quantization.
```
llama-server.exe \
  -m "Qwen3.5-35B-A3B-Q6_K.gguf" \
  --port 8080 \
  --ctx-size 65536 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --reasoning-budget 0 \
  --n-gpu-layers 99 \
  --no-mmap
```
Fits within ~29GB VRAM while maintaining 53 t/s. No more context overflow on API calls with conversation logs.
Related Articles
- Qwen 3.5 Failed on Radeon 8060S Because of AMD Drivers
- Radeon 8060S Vulkan Broke After AMD Driver Update
- Optimizing VRAM and Memory Allocation on Strix Halo
- Exposing a Local LLM as an External API via VPN
- Hypura NVMe Streaming and TurboQuant KV Cache Quantization
- Running a 397B Parameter Model on a 48GB MacBook with Flash-MoE