
Not All Bits Are Equal: There is no universal solution for memory allocation in reasoning models

Ikesan

About This Paper

I came across this paper on X and thought it was interesting enough to introduce here.

Not All Bits Are Equal: Scale-Dependent Memory Optimization Strategies for Reasoning Models (arXiv:2510.10964, October 2025)

The authors are a joint research team from Krafton, UW-Madison, and UC Berkeley. The lead author, Dimitris Papailiopoulos, is an associate professor at UW-Madison and also affiliated with Microsoft Research’s AI Frontiers Lab.

What the Paper Investigates

The paper asks a simple question: when you have a fixed memory budget, how should you allocate it to maximize the accuracy of a reasoning model?

There are three main places to spend that budget:

  • Model weights: parameter count and quantization bit width
  • KV cache: the memory used to store intermediate states during inference
  • Test-time compute: the number of inference tokens, or how long you let the model think

For example, with the same memory budget, is it smarter to run a 32B model quantized to 4 bits and give it 14k tokens, or to run an 8B model at 16-bit and give it 30k tokens? For non-reasoning models, “4-bit quantization is the universal answer” has often been a reasonable rule of thumb, but reasoning models are not that simple.
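To make the trade-off concrete, here is a back-of-envelope comparison of the two configurations above. The KV-cache dimensions (layer counts, KV heads, head size) are my own assumptions, not figures from the paper, so treat the totals as illustrative only.

```python
# Rough memory footprint: weights + KV cache (illustrative numbers only).
# Weight memory = params * bits / 8; KV cache scales with layers, KV heads,
# head_dim, and the token budget. The model dimensions below are guesses.

def weight_bytes(params_b: float, bits: int) -> float:
    """Weight memory in GB for a model with params_b billion parameters."""
    return params_b * 1e9 * bits / 8 / 1e9

def kv_bytes(layers: int, kv_heads: int, head_dim: int,
             tokens: int, bits: int = 16) -> float:
    """KV cache in GB: 2 (K and V) * layers * kv_heads * head_dim * tokens."""
    return 2 * layers * kv_heads * head_dim * tokens * bits / 8 / 1e9

# Option A: 32B model at 4-bit with a 14k-token budget (assumed dimensions)
a = weight_bytes(32, 4) + kv_bytes(layers=64, kv_heads=8, head_dim=128, tokens=14_000)
# Option B: 8B model at 16-bit with a 30k-token budget (assumed dimensions)
b = weight_bytes(8, 16) + kv_bytes(layers=36, kv_heads=8, head_dim=128, tokens=30_000)

print(f"32B@4bit + 14k tokens: about {a:.1f} GB")
print(f"8B@16bit + 30k tokens: about {b:.1f} GB")
```

Under these assumptions both options land near 20 GB, which is exactly why the question is non-trivial: the budgets match, but the allocation differs.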

Experimental Setup

The authors ran 1,700 combinations of variables on the Qwen3 family (0.6B to 32B):

  • Model size: 0.6B / 1.7B / 4B / 8B / 14B / 32B
  • Weight quantization: 4-bit / 8-bit / 16-bit (GPTQ)
  • Inference token budget: 2k to 30k (budget forcing)
  • Parallel scaling: majority voting (Maj@K, K = 1 to 16)
  • KV cache compression: eviction (R-KV, StreamingLLM) / quantization (HQQ 2/4/8-bit)

The benchmarks were AIME25 for mathematical reasoning and GPQA-Diamond for knowledge-intensive reasoning.
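The sweep is easy to picture as a cross product of the variables above. Note the full grid is larger than the roughly 1,700 runs reported, so the paper presumably evaluates a subset; the token-budget grid points below are my own guesses.

```python
# Sketch of the experimental grid (not the paper's exact run list).
from itertools import product

sizes = ["0.6B", "1.7B", "4B", "8B", "14B", "32B"]
weight_bits = [4, 8, 16]                                  # GPTQ
token_budgets = [2_000, 6_000, 10_000, 14_000, 22_000, 30_000]  # assumed points
maj_k = [1, 2, 4, 8, 16]                                  # Maj@K
kv_schemes = ["none", "R-KV", "StreamingLLM", "HQQ-2", "HQQ-4", "HQQ-8"]

grid = list(product(sizes, weight_bits, token_budgets, maj_k, kv_schemes))
print(len(grid))   # full cross product: 6 * 3 * 6 * 5 * 6 = 3240
```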

Key Findings

The 8-bit 4B model becomes the boundary

The paper’s most important finding is that the optimal strategy flips around an effective size of 8-bit 4B (about 4.2 GB).

  • Below 8-bit 4B: memory should be spent on weight precision and size. Letting the model think longer gives little return.
  • At or above 8-bit 4B: memory should be shifted to the inference token budget. It is better to let the model think longer, even if that means reducing weight precision.

This threshold is not arbitrary. It lines up with the point where weight memory usage exceeds the KV cache.
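A quick sanity check of that alignment, using assumed Qwen3-4B-like dimensions (36 layers, 8 KV heads, head size 128; these are my guesses, not paper figures): an 8-bit 4B model carries about 4 GB of weights, and the KV cache only catches up at a fairly long context.

```python
# Back-of-envelope: when does KV cache memory overtake weight memory?

def weight_gb(params_b: float, bits: int) -> float:
    return params_b * bits / 8                 # billions of params -> GB

def kv_gb(tokens: int, layers: int = 36, kv_heads: int = 8,
          head_dim: int = 128, bits: int = 16) -> float:
    return 2 * layers * kv_heads * head_dim * tokens * bits / 8 / 1e9

w = weight_gb(4, 8)                            # 8-bit 4B -> 4.0 GB of weights
bytes_per_token = 2 * 36 * 8 * 128 * 2         # 16-bit K and V per token
crossover = w * 1e9 / bytes_per_token          # tokens where KV equals weights

print(f"weights: {w:.1f} GB, KV @16k tokens: {kv_gb(16_000):.2f} GB")
print(f"KV overtakes weights at about {crossover:,.0f} tokens")
```

Below this boundary, weights dominate the budget and precision is the scarce resource; above it, the KV cache (and thus thinking length) is where marginal memory pays off.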

The optimal choice changes with the task

For mathematical reasoning (AIME25), 4-bit quantization is almost always a bad choice. Running an 8B model at 16-bit outperforms running a 14B model at 4-bit. It looks as if numerical precision in the weights is directly tied to reasoning ability, and quantization itself reduces the model’s ability to take advantage of test-time compute.

For knowledge-intensive reasoning (GPQA-Diamond), 4-bit quantization is widely effective. In this setting, parameter count matters more than precision: a model that can store more knowledge in its parameters beats one given a longer reasoning budget.

In other words, the best quantization strategy changes depending on what you are asking the model to solve.

Majority voting only helps large models

Majority voting (Maj@K) consumes K times as much KV cache. The experiments show:

  • At or above 8-bit 4B: majority voting works efficiently from a memory perspective. The best K increases as the memory budget increases.
  • Below 8-bit 4B: it is better to let the model think sequentially for longer. Majority voting wastes memory.
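For readers unfamiliar with Maj@K, the mechanism itself is trivial; the memory cost is the subtle part. A minimal sketch (toy answers standing in for real sampled traces):

```python
# Majority voting (Maj@K): sample K answers, return the most common one.
from collections import Counter

def maj_at_k(answers: list[str]) -> str:
    """Pick the most common final answer among K sampled traces."""
    return Counter(answers).most_common(1)[0][0]

# K = 5 sampled answers. Each sample holds its own KV cache while it is
# being generated, so memory (not just compute) scales with K.
samples = ["42", "41", "42", "42", "40"]
print(maj_at_k(samples))   # prints "42"
```

That K-fold KV cache is why small models, which cannot spare the memory, do better spending it on one longer sequential trace instead.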

KV cache compression improves the Pareto frontier

Not only weight quantization, but also KV cache compression is effective across all model sizes. The best method depends on model scale:

  • Smaller models (below 8-bit 8B): KV cache eviction is better. R-KV reduces memory with almost no loss.
  • Larger models (8-bit 8B and above): eviction and quantization are about equally effective.
  • 2-bit quantization needs caution: regardless of model size, the accuracy drop is large.
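To illustrate what eviction means here, a StreamingLLM-style sketch: keep a few "attention sink" tokens from the start plus a sliding window of recent tokens, and drop everything in between. (R-KV uses a smarter, redundancy-aware policy; this only shows the basic idea, and the sink/window sizes are arbitrary.)

```python
# StreamingLLM-style KV cache eviction sketch (illustrative parameters).

def evict(cache_positions: list[int], n_sink: int = 4, window: int = 1024) -> list[int]:
    """Return the token positions kept after eviction."""
    if len(cache_positions) <= n_sink + window:
        return cache_positions                  # nothing to evict yet
    # Keep the first n_sink tokens plus the most recent `window` tokens.
    return cache_positions[:n_sink] + cache_positions[-window:]

positions = list(range(10_000))                 # pretend 10k cached tokens
kept = evict(positions)
print(len(kept))                                # 4 sinks + 1024 recent = 1028
```

The appeal for small models is that this caps KV memory at a constant regardless of how long the model thinks.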

Latency is dominated by generation length

End-to-end latency is almost proportional to the number of generated tokens. For example:

  • 14B model at 4-bit: 130.1 seconds to generate 10k tokens
  • 14B model at 16-bit: 137.7 seconds to generate 6k tokens

If latency matters, 8-bit often sits in the best balance between speed and accuracy. 4-bit does not even show up on the latency Pareto frontier.
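The two data points above also imply quite different per-token decode speeds, which is worth making explicit:

```python
# Per-token decode latency implied by the paper's two 14B measurements.
ms_4bit = 130.1 / 10_000 * 1000    # 4-bit: about 13 ms/token
ms_16bit = 137.7 / 6_000 * 1000    # 16-bit: about 23 ms/token

print(f"4-bit: {ms_4bit:.1f} ms/token, 16-bit: {ms_16bit:.1f} ms/token")
```

Within one configuration latency scales with token count, but across precisions the per-token cost differs, so 4-bit's speed advantage is eaten by the longer traces it needs.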

Batch inference changes the strategy

When the batch size is 16 and weights are shared:

  • The 0.6B model disappears entirely from the Pareto frontier
  • The 4B 8-bit model always remains on the Pareto frontier in the 1 to 2 GB per generation range, which is a good configuration for mobile devices
  • The optimal model size shifts overall toward larger models

Practical Decision Flow

Here is a condensed version of the paper’s decision rule.

If the effective size is below 8-bit 4B:

  • Spend memory on weight precision and size
  • Choose 8-bit or higher for math-heavy tasks
  • Compress the KV cache with eviction
  • Do not use majority voting; let the model think sequentially for longer

If the effective size is 8-bit 4B or above:

  • Increase the inference token budget until it saturates
  • Use majority voting and increase K as budget allows
  • Use either eviction or quantization for the KV cache, whichever you prefer

Common cautions:

  • Avoid 4-bit quantization for mathematical reasoning
  • Prioritize parameter count for knowledge-heavy tasks
  • If latency matters, 8-bit is often the sweet spot
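The whole decision flow condenses into a few branches. A toy helper, with the caveat that the 4.2 GB threshold comes from the paper but the function signature and wording are my own:

```python
# The decision flow above as a toy helper (hypothetical API, paper's thresholds).

def allocation_advice(effective_gb: float, task: str,
                      latency_sensitive: bool = False) -> list[str]:
    tips = []
    if effective_gb < 4.2:                      # below the 8-bit 4B boundary
        tips += ["spend memory on weight precision and size",
                 "compress the KV cache with eviction",
                 "think sequentially; skip majority voting"]
    else:                                       # at or above the boundary
        tips += ["raise the token budget until accuracy saturates",
                 "use majority voting; grow K with the budget",
                 "eviction or KV quantization both work"]
    if task == "math":
        tips.append("avoid 4-bit weights")
    elif task == "knowledge":
        tips.append("prioritize parameter count; 4-bit is fine")
    if latency_sensitive:
        tips.append("8-bit is usually the sweet spot")
    return tips

for tip in allocation_advice(3.0, "math"):
    print("-", tip)
```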

Takeaways

I had lazily assumed that “4-bit quantization is good enough,” so the claim that this does not hold for reasoning models was striking. In particular, the idea that 4-bit quantization can degrade the model’s reasoning ability itself is obvious in hindsight, but much more convincing when shown quantitatively.

That said, these results are limited to the Qwen3 family and to the two benchmarks AIME25 and GPQA-D. The threshold will probably shift if the architecture or task changes, so it should not be applied blindly to other models. Even so, the broad conclusion that memory allocation optimization is scale-dependent is valuable when you are running a local LLM.