Power Sampling: unlocking LLM reasoning without reinforcement learning
Since OpenAI’s o1 and DeepSeek-R1 arrived, the industry default has been to boost LLM reasoning with reinforcement learning (RL). The usual recipe is to take a pretrained model and post-train it with GRPO or RLHF so that it produces chain-of-thought-style reasoning.
But a recent Medium post by AI researcher Haitham Bou Ammar, published at the end of January 2026, pushes back on that idea.
The claim is simple: RL may not be teaching LLMs a brand-new reasoning ability. It may just be making an existing ability easier to surface. If that is true, then maybe you can get similar gains without expensive RL, simply by changing the sampling strategy at inference time.
what RL is actually doing
Base models are trained on massive amounts of web text, and that corpus already contains logical reasoning. In other words, the “correct reasoning path” is already latent inside the model.
The problem is that the probability of the correct path and the probability of a safe but shallow answer are often close, so normal sampling tends to pick the shallow one. The web has far more ordinary conversation than deep reasoning, so the model often falls back to ordinary language.
RL, in this framing, is not creating new circuits. It is sharpening the existing probability distribution and making the correct reasoning path more likely.
how Power Sampling works
The idea behind Power Sampling is that the distribution an RL-trained model converges to can be approximated by taking the base model’s probability distribution and raising it to a power: roughly, sample from q(x) ∝ p(x)^α for some α > 1, then renormalize.
Normally, you sample the next token directly from the model’s output distribution. Power Sampling artificially widens the gap between likely and unlikely tokens before sampling.
| | correct path | safe path | other |
|---|---|---|---|
| normal sampling | 30% | 25% | 45% |
| after Power Sampling | 80% | 15% | ~5% |
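The reweighting in the table can be sketched numerically. The snippet below is my own illustration, not the paper’s exact algorithm: it raises a toy distribution over candidate paths to a power α and renormalizes, showing how a modest lead turns into a dominant one.

```python
def power_sharpen(probs, alpha):
    """Raise a probability distribution to the power alpha and renormalize."""
    powered = [p ** alpha for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

# Toy distribution over five candidate reasoning paths:
# the "correct" path leads only narrowly, at 30%.
base = [0.30, 0.25, 0.20, 0.15, 0.10]

sharpened = power_sharpen(base, alpha=4.0)
# The 30% path now carries roughly 57% of the probability mass,
# while the long tail of unlikely paths collapses toward zero.
print([round(p, 3) for p in sharpened])
```

Note that α = 1 recovers ordinary sampling, and larger α concentrates ever more mass on the single most likely path.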
It amplifies the small confidence gap the model already had and turns it into a much larger one. A good mental model is contrast enhancement in image processing.
This is not just “pick the most likely token.” It is amplifying the weak signal the model already has for the correct answer. And unlike temperature tuning, which rescales each next-token distribution independently, the power is applied to the probability of whole reasoning paths, favoring continuations that are coherent across many steps rather than merely safe at each step.
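One way to see the sequence-level idea in code: if the target is q(x) ∝ p(x)^α over whole outputs, it can be approximated by drawing several candidate answers from the base model and resampling them with weights p(x)^(α−1), i.e. self-normalized importance sampling. This is a hedged sketch of that standard trick, not the post’s exact method; the candidate texts and log-probabilities below are stand-ins for real model outputs.

```python
import math
import random

def resample_power(candidates, alpha, rng=random):
    """candidates: list of (text, total_logprob) pairs sampled from the base model.
    Returns one candidate drawn approximately from q(x) proportional to p(x)**alpha,
    using importance weights p(x)**(alpha - 1)."""
    log_w = [(alpha - 1.0) * lp for _, lp in candidates]
    m = max(log_w)  # subtract the max log-weight for numerical stability
    weights = [math.exp(lw - m) for lw in log_w]
    total = sum(weights)
    # Weighted (categorical) draw over the candidates.
    r = rng.random() * total
    for (text, _), w in zip(candidates, weights):
        r -= w
        if r <= 0:
            return text
    return candidates[-1][0]

# Stand-in candidates: log-probabilities are illustrative, not from a real model.
candidates = [
    ("shallow but safe answer", -12.0),
    ("step-by-step reasoning path", -11.5),  # slightly more likely overall
    ("off-topic continuation", -15.0),
]
print(resample_power(candidates, alpha=4.0))
```

With α = 4, the reasoning path’s small 0.5-nat lead in log-probability becomes a large lead in the resampling weights, so it is selected most of the time.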
the results
Bou Ammar and colleagues backed this up with experiments.
Compared methods:
- base model + Power Sampling, with no extra training
- RL-trained model, post-trained with GRPO
Benchmarks:
- MATH500
- GPQA
Result: simply applying Power Sampling to the base model matched the RL-trained model, and in some cases beat it.
If RL were really teaching entirely new reasoning circuits from scratch, that result would be hard to explain. The fact that a non-trained model can catch up through sampling alone supports the idea that the correct reasoning path was already there all along.
as an amplifier for intuition
There is a nice cognitive-science analogy here. In Daniel Kahneman’s System 1 / System 2 framework, System 1 is fast intuition and System 2 is slow, deliberate reasoning.
For LLMs, the base model already has a weak signal pointing to the right answer. That is the model’s “intuition.” The problem is that noise from the web overwhelms it.
Power Sampling works by stripping away some of that noise and exposing the model’s intuition more cleanly. The result is not “it became smarter.” It is closer to “it stopped hesitating.”
practical impact
This is not just an academic curiosity.
Lower RL training cost. If reasoning ability can be recovered by changing inference-time sampling, there may be less need for weeks of extra GPU-based RL training.
Better fit for local LLMs. Because the approach changes the inference algorithm rather than the model itself, it can also be applied to local models such as Llama and Qwen.
A shift toward inference-time compute. The trend in AI is moving from training to inference. Instead of making the model smarter ahead of time, you spend compute when the answer is generated.
what remains unclear
There are still open questions. The benchmark results are encouraging, but we do not yet know whether the method generalizes to harder multi-step tasks such as long code generation or messy real-world problems.
The extra inference-time cost of Power Sampling also needs to be weighed against RL’s one-time training cost, and its latency impact in production deserves more study.
Even so, the bigger message matters: RL may be less about creating new reasoning and more about surfacing what the model already knows. That shifts attention from “how should we train?” to “how should we infer?”