Scaling Qwen3.5-35B-A3B from 4K to 65K Context with Only 800MB Extra VRAM
In the previous article, I got Qwen3.5-35B-A3B running on llama-server + Vulkan. It hit 53 t/s with Q6_K, which was perfectly usable, but since I was just testing things out, I left ctx-size at the default 4096.
I’ve been running this model as an API, accessed from a VPS over Tailscale. Full conversation logs get sent with each request, so at ctx-size 4096, the context overflows mid-conversation and responses start going off the rails. Normally, bumping up the context length means KV cache eats into VRAM and speed tanks — that’s how it works with standard Transformer models. But Qwen3.5-35B-A3B doesn’t play by those rules.
Setup
| Component | Spec |
|---|---|
| PC | GMKtec EVO-X2 |
| CPU | AMD Ryzen AI Max+ 395 (Zen 5 / 16C 32T) |
| GPU | Radeon 8060S (RDNA 3.5 / gfx1151) |
| Memory | 64GB unified memory (BIOS: 32GB VRAM / 32GB system) |
| llama.cpp | b8183 Vulkan + --no-mmap |
| Model | Qwen3.5-35B-A3B Q6_K (26.55 GiB) |
--no-mmap is required to avoid the mmap double-mapping issue on unified memory APUs, as verified in the previous article.
Why Qwen3.5-35B-A3B’s Architecture is KV Cache Friendly
Standard Transformer models have Attention in every layer, and each layer consumes KV cache. KV cache grows proportionally with context length, eating up VRAM.
Qwen3.5-35B-A3B is different. It’s an SSM (State Space Model, Mamba-family) + Attention hybrid, where 30 of 40 layers are SSM and only 10 layers are full Attention. With full_attention_interval = 4, Attention kicks in once every 4 layers.
```mermaid
graph TD
    A["Layer 1-3: SSM"] --> B["Layer 4: Attention"]
    B --> C["Layer 5-7: SSM"]
    C --> D["Layer 8: Attention"]
    D --> E["..."]
    E --> F["Layer 37-39: SSM"]
    F --> G["Layer 40: Attention"]
    style B fill:#f96,stroke:#333
    style D fill:#f96,stroke:#333
    style G fill:#f96,stroke:#333
```
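The layout above follows directly from the interval. A quick sanity check (the exact position of the Attention layer within each block of four is my assumption, read off the diagram):

```python
# With full_attention_interval = 4, every 4th of the 40 layers is Attention.
attention_layers = [i for i in range(1, 41) if i % 4 == 0]
print(attention_layers)       # [4, 8, 12, 16, 20, 24, 28, 32, 36, 40]
print(len(attention_layers))  # 10 Attention layers, 30 SSM layers
```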
Only Attention Layers Consume KV Cache
SSM layers maintain a recurrent state, but it’s a fixed-size buffer (251 MiB) that doesn’t depend on context length. Whether the context is 4K or 256K tokens, SSM layer memory consumption stays the same.
Only the 10 Attention layers consume KV cache. Compared to a standard 40-layer all-Attention model, KV cache memory consumption is roughly 1/4.
| Buffer | Size | ctx-size dependent |
|---|---|---|
| KV cache (10 Attention layers) | Proportional to ctx-size | Yes |
| RS buffer (recurrent state for 30 SSM layers) | 251 MiB fixed | No |
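The KV cache numbers can be reproduced with back-of-the-envelope arithmetic. This sketch assumes 2 KV heads of head_dim 256 per Attention layer (head_dim=256 matches the Qwen3.5 figure cited later; the KV head count is my guess, chosen because it reproduces the measured sizes):

```python
def kv_cache_bytes(ctx, n_attn_layers=10, n_kv_heads=2, head_dim=256,
                   bytes_per_elt=2):  # 2 bytes per element at f16
    # K and V each store ctx * n_kv_heads * head_dim elements per Attention
    # layer; SSM layers contribute nothing that scales with ctx.
    return 2 * n_attn_layers * ctx * n_kv_heads * head_dim * bytes_per_elt

print(kv_cache_bytes(4096) // 2**20)   # 80  -> matches the 4K measurement
print(kv_cache_bytes(32768) // 2**20)  # 640 -> matches the 32K measurement
```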
What’s the RS Buffer?
The RS buffer (Recurrent State Buffer) is the memory region that holds the internal state of SSM layers. While Attention layers keep the Key/Value of all input tokens, SSM layers compress past information into a fixed-size state vector.
SSM processes input tokens one by one, updating the state vector as it goes. The state vector size is determined at model design time (for Qwen3.5-35B-A3B: d_state=128, d_inner=4096). No matter how long the context gets, the state vector size stays the same — old information gets overwritten by new information. This is fundamentally different from KV cache, and one of the reasons SSM has an advantage with long contexts.
RS buffer size depends on n_parallel (number of concurrent processing slots). More slots means more state vectors, but increasing ctx-size doesn’t change anything.
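To make the fixed-size state concrete, here is a toy recurrence (not the actual Mamba kernel — a real SSM layer computes input-dependent transition and mixing matrices) using the d_state=128 / d_inner=4096 dimensions above:

```python
import numpy as np

d_state, d_inner = 128, 4096  # from the model config cited above

rng = np.random.default_rng(0)
# Toy per-channel decay factors; stand-in for the learned SSM dynamics.
A = rng.uniform(0.9, 0.99, size=(d_inner, d_state)).astype(np.float32)
state = np.zeros((d_inner, d_state), dtype=np.float32)

def step(state, x_t):
    """Fold one token into the state; the buffer size never changes."""
    return A * state + x_t[:, None]  # old info decays, new info mixes in

for _ in range(1000):  # process 1,000 tokens -- or a million, same memory
    state = step(state, rng.standard_normal(d_inner).astype(np.float32))

print(state.nbytes / 2**20)  # 2.0 -> a fixed 2 MiB per toy state (f32)
```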
Benchmarks
Measured with 4 parallel slots at different ctx-sizes.
| ctx-size | KV Cache | Total VRAM | Generation Speed |
|---|---|---|---|
| 4,096 | 80 MiB | 27.6 GB | 53.7 t/s |
| 32,768 | 640 MiB | 28.2 GB | 49.9 t/s |
| 65,536 (q8_0 KV) | ~870 MiB (estimated) | ~28.9 GB | 53.6 t/s |
Increasing ctx-size from 4K to 32K (8x) added only 560 MiB of KV cache, and the speed drop was minimal: 53.7 to 49.9 t/s. With a standard Transformer model, scaling like this would pile on several GB of KV cache and tank throughput.
At 65K with q8_0 KV quantization, KV cache stayed around 870 MiB while maintaining 53.6 t/s — essentially identical to the 4K baseline.
Memory Budget Estimates
Actual measurements with 32GB/32GB BIOS split.
| Category | Usage / headroom |
|---|---|
| System RAM | ~13GB in use (Parsec + Tailscale + OS) -> 19GB remaining |
| VRAM | Model + ctx + compute at ~29GB / 32GB -> ~3GB remaining |
The Vulkan0 driver reports free: 46,522 MiB, but this reflects the total physical free memory across unified memory. It doesn’t account for the logical VRAM/system boundary set in BIOS. The actual available headroom is much less, as shown above.
KV Cache Size Estimates
KV cache size estimates at f16 (default) with 4 parallel slots.
| ctx-size | KV Cache (f16) | KV Cache (q8_0) |
|---|---|---|
| 32,768 | 640 MiB | ~320 MiB |
| 65,536 | 1,280 MiB | ~640 MiB |
| 131,072 | 2,560 MiB | ~1,280 MiB |
| 262,144 | 5,120 MiB | ~2,560 MiB |
The model’s n_ctx_train = 262,144, so it natively supports up to 256K tokens. Even at 131K, KV cache is only 2.5 GB (f16). VRAM-wise, even 262K is feasible.
This is where the SSM+Attention hybrid with only 10 Attention layers really pays off. A standard 40-layer full-Attention model would need 20+ GB of KV cache at 262K.
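That comparison works out with simple per-layer arithmetic, assuming (hypothetically) 2 KV heads of head_dim 256 per Attention layer — a guess that reproduces the f16 numbers in the table above:

```python
# ASSUMPTION: 512 KV elements per token per layer (2 heads x head_dim 256).
ctx, kv_width, f16 = 262_144, 2 * 256, 2  # tokens, elems/token/layer, bytes

hybrid = 2 * 10 * ctx * kv_width * f16  # K+V, only 10 Attention layers
full   = 2 * 40 * ctx * kv_width * f16  # K+V, all 40 layers Attention

print(hybrid / 2**30, "GiB")  # 5.0  -> the table's 5,120 MiB row
print(full / 2**30, "GiB")    # 20.0 -> the full-Attention "20+ GB" case
```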
KV Cache Quantization
llama-server has options to quantize the Key and Value caches separately.
```
--cache-type-k q8_0   # Quantize Key to q8_0
--cache-type-v q8_0   # Quantize Value to q8_0
```
q8_0 is roughly half the size of f16. Results with q8_0 KV at ctx=65536:
| Metric | f16 KV (ctx=32K) | q8_0 KV (ctx=65K) |
|---|---|---|
| KV Cache | 640 MiB | ~870 MiB |
| Generation Speed | 49.9 t/s | 53.6 t/s |
Doubled the context length, but KV cache only grew by ~230 MiB. Speed actually went up. No perceptible quality degradation from q8_0 quantization.
The quality impact of KV cache quantization depends on the model architecture and task. Tasks that are sensitive to KV cache precision — like long document summarization or mathematical reasoning — tend to be more affected. For conversational use, q8_0 is fine in practice.
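For intuition on why q8_0 holds up: ggml's q8_0 stores each block of 32 values as one f16 scale plus 32 int8 codes, so 34 bytes replace 64 bytes of f16 — a bit better than half. A toy re-implementation (not llama.cpp's actual kernel):

```python
import numpy as np

QK = 32  # q8_0 block size in ggml

def quantize_q8_0(x):
    """Blocks of 32 floats -> (int8 codes, one f16 scale per block)."""
    blocks = x.reshape(-1, QK)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0  # avoid div-by-zero on all-zero blocks
    q = np.clip(np.round(blocks / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_q8_0(q, scale):
    return (q.astype(np.float32) * scale.astype(np.float32)).ravel()

x = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
q, s = quantize_q8_0(x)
max_err = np.abs(dequantize_q8_0(q, s) - x).max()
# 34 bytes/block (32 int8 + 2-byte scale) vs 64 bytes/block at f16:
print(q.nbytes + s.nbytes, "bytes vs", x.astype(np.float16).nbytes)  # 1088 vs 2048
```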
TurboQuant
As covered in the TurboQuant article, TurboQuant (Google Research, ICLR 2026) compresses KV cache down to 3-4 bits — less than half the size of q8_0.
Implementation Status by Backend
| Backend | Status |
|---|---|
| Metal | Working (tq3/tq4) |
| CUDA | Working (98.8% of q8_0 speed) |
| CPU | Working (zero speed penalty) |
| Vulkan | Early prototype stage |
| upstream llama.cpp | Not merged |
TurboQuant isn’t usable yet on the Radeon 8060S (Vulkan). The Vulkan implementation is in its early stages and hasn’t been merged into llama.cpp mainline. Development is ongoing in llama.cpp Discussion #20969.
QJL vs WHT Verification (Discussion #20969)
In the same Discussion #20969, three independent implementations reproduced and verified TurboQuant’s algorithm. scos-lab (Python, 8 models tested), Arclabs001 (YATQ, PyTorch), and TheTom (turboquant_plus, Metal) all reported that “QJL (Quantized Johnson-Lindenstrauss) is counterproductive in practice, and MSE-only + WHT (Walsh-Hadamard Transform) works best.” In scos-lab’s GPT-2 (head_dim=64) testing with 3-bit quantization, MSE-only showed a perplexity degradation of +7.6%, while the paper’s QJL approach spiked to +300%. QJL removes bias but amplifies variance, and softmax then magnifies that variance.
However, this conclusion was revised on March 30. Arclabs001 and AmesianX showed that replacing the initial implementors’ random orthogonal matrices (QR decomposition) with WHT (Walsh-Hadamard Transform) and combining QJL’s SRHT with independent sign patterns reversed the results. On Qwen3.5 (head_dim=256), WHT+QJL achieved a PPL of 6.54 (just +2.3% over the f16 baseline of 6.39), while QJL-free MSE-only landed at 7.54 (+18%). On the other hand, with head_dim=128, adding QJL degraded results by +65%, confirming the initial reports.
The takeaway is head_dim-dependent: WHT+QJL for 256+, MSE-only + WHT for 128 and below. The TurboQuant paper’s adoption of WHT-based methods was the right call — the initial implementors’ use of random orthogonal matrices was the root cause of poor performance.
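The WHT itself is cheap to sketch. The idea: rotate each head vector with a sign-randomized orthonormal Hadamard transform so outliers spread evenly across dimensions before quantization, then undo the rotation after dequantization. A minimal illustration with crude 8-bit rounding (not the actual TurboQuant code, and not 3-bit):

```python
import numpy as np

def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform; len(v) must be a power of 2.
    The Hadamard matrix is symmetric and orthonormal, so fwht is self-inverse."""
    v = v.copy()
    h, n = 1, len(v)
    while h < n:
        for i in range(0, n, 2 * h):
            a, b = v[i:i + h].copy(), v[i + h:i + 2 * h].copy()
            v[i:i + h], v[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return v / np.sqrt(n)

rng = np.random.default_rng(0)
x = rng.standard_normal(256)                 # one head_dim=256 vector
sign = rng.choice([-1.0, 1.0], size=256)     # random sign flips (SRHT-style)

y = fwht(sign * x)                           # rotate before quantizing
scale = np.abs(y).max() / 127.0
y_hat = np.round(y / scale) * scale          # crude 8-bit quantization
x_hat = sign * fwht(y_hat)                   # undo rotation: D, then H

rel_err = np.linalg.norm(x_hat - x) / np.linalg.norm(x)
print(f"relative L2 error: {rel_err:.4f}")
```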
What This Means for This Setup
Currently running q8_0 KV at ctx=65K, holding 53.6 t/s with no practical issues. When TurboQuant (3-4bit KV) becomes available, KV cache memory drops to less than half, making 131K or 262K context lengths comfortable within VRAM limits.
That said, Qwen3.5-35B-A3B’s KV cache is already small (only 10 Attention layers), so TurboQuant’s benefit isn’t as dramatic as it would be for a full-Attention model. At 262K, f16 KV is only 5.1 GB, so keeping it at 2.5 GB with q8_0 is plenty for now.
Production Config
Ended up running with ctx-size 65536 and q8_0 KV quantization.
```
llama-server.exe \
  -m "Qwen3.5-35B-A3B-Q6_K.gguf" \
  --port 8080 \
  --ctx-size 65536 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --reasoning-budget 0 \
  --n-gpu-layers 99 \
  --no-mmap
```
Fits within ~29GB VRAM while maintaining 53 t/s. No more context overflow on API calls with conversation logs.
Related Articles
- Qwen 3.5 Failed on Radeon 8060S Because of AMD Drivers
- Radeon 8060S Vulkan Broke After AMD Driver Update
- Optimizing VRAM and Memory Allocation on Strix Halo
- Exposing a Local LLM as an External API via VPN
- Hypura NVMe Streaming and TurboQuant KV Cache Quantization
- Running a 397B Parameter Model on a 48GB MacBook with Flash-MoE