TRL v1.0 is a major release that gives LLM post-training a stable foundation
Hugging Face’s LLM post-training library TRL (Transformers Reinforcement Learning) has released v1.0, its first official stable version, more than six years after the first code went public. With 3 million monthly downloads and more than 17,800 GitHub stars, the library is signaling that it has moved from “work in progress” to “a stable component you can build on.”
The background is that tools built on top of TRL, such as Unsloth and Axolotl, are already in real use. The project has now made the implicit “please do not break this” rule explicit by turning it into versioning policy. Under semantic versioning, the major version changes on breaking changes, the minor version when features are added, and the patch version when bugs are fixed, so the number alone tells you the scale of a change.
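For downstream projects, the practical upshot is that you can pin against the next major version and upgrade freely within it; a typical constraint (the exact bounds are your choice) looks like:

```shell
# Under semver, breaking changes only arrive with a major bump,
# so any 1.x release is safe to pick up under this pin.
pip install "trl>=1.0,<2.0"
```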
The Stable/Experimental Split
The core of v1.0 is a two-tier design that splits the library’s API into “stable” and “experimental.”
```python
from trl import SFTTrainer                      # stable
from trl.experimental.orpo import ORPOTrainer   # experimental
```
Stable includes SFT, DPO, Reward Modeling, RLOO, GRPO, and their derivatives. These are covered by semantic-versioning stability guarantees, so if behavior changes significantly, the major version will move and users will know in advance.
Experimental contains more than 75 methods. KTO, ORPO, CPO, SimPO, IPO, GKD, SDFT, SDPO, GOLD, and many others live there. The list is long, but they are all variations on how to steer LLM behavior. The tradeoff is that they may change without notice, but they also track cutting-edge research immediately.
That split reflects the whole design philosophy of the library.
Intentionally Allowing Code Duplication
TRL’s design goes against what many programming books teach. The usual advice is to centralize shared logic, but TRL intentionally lets each trainer keep its own slightly similar code.
```python
# Pattern TRL avoids: a shared base class
class OfflineTrainer(Trainer):
    def some_common_method(self): ...

class DPOTrainer(OfflineTrainer): ...
class KTOTrainer(OfflineTrainer): ...

# Pattern TRL uses: each trainer carries its own copy
class DPOTrainer(Trainer):
    def some_common_method(self): ...

class KTOTrainer(Trainer):
    def some_common_method(self): ...
```
Why? Because the post-training field changes too quickly. What looked shared six months ago can become a liability today. DPO and KTO may look similar on the surface, but the input data shape differs: one uses pairs of good and bad answers, while the other may use a single answer. If you over-centralize that logic, fixing one trainer can break the other.
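To make the data-shape difference concrete, here is one training example in each format, using the field names from TRL's dataset documentation (the prompt and answers are made up):

```python
# DPO learns from preference pairs: one prompt, a chosen and a rejected answer.
dpo_example = {
    "prompt": "What is the capital of France?",
    "chosen": "The capital of France is Paris.",
    "rejected": "The capital of France is London.",
}

# KTO learns from single answers tagged as desirable or not; no pairing needed.
kto_example = {
    "prompt": "What is the capital of France?",
    "completion": "The capital of France is Paris.",
    "label": True,  # this completion is desirable
}
```

A shared base class would have to paper over exactly this kind of difference in its data handling, which is the coupling TRL avoids.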
As the article puts it, in a domain where assumptions are overturned constantly, duplication can actually be the right design choice.
My earlier comparison of 16 open-source RL libraries showed how different those designs can be. TRL is making the same kind of design choice to survive that diversity.
What Post-Training Actually Means
The article throws around terms like SFT, DPO, and GRPO, so it helps to restate what post-training is for.
LLM training is usually split into two broad stages:
```mermaid
graph LR
    A["Pre-training"] --> B["Post-training"]
    B --> C[Deployment]
```
Pre-training is where a model learns language patterns from huge amounts of internet text. It takes trillions of tokens, thousands of GPUs, and months of compute. Models like Meta’s LLaMA and Mistral are born here. They can predict the next token, but they still do not really know how to answer questions or follow instructions.
Post-training is where you teach that pretrained model the behavior you actually want. TRL covers this stage. The five stable families do it in different ways:
| Method | What it does | Data required |
|---|---|---|
| SFT | Show the model a correct answer and make it imitate it | Question-answer pairs |
| Reward Modeling | Train a model that scores responses for quality | Human judgments that answer A is better than B |
| DPO | Learn directly from good/bad answer pairs without a scoring model | Good and bad response pairs |
| RLOO | Generate multiple answers and learn by comparing them to one another | Questions plus a scoring rule |
| GRPO | Generate answer groups and learn from relative ranking inside the group | Questions plus a scoring rule |
The classic workflow is to start with SFT so the model learns to follow instructions, then use RLHF-style methods to tune it toward more human-preferred responses.
```mermaid
graph TD
    A[Base model] --> B["SFT teaches instruction following"]
    B --> C{"Choose a finishing method"}
    C -->|Build a reward model| D[Train reward model]
    D --> E["Use PPO / RLOO / GRPO for RL"]
    C -->|Skip reward model| F["Use DPO directly"]
    E --> G[Usable model]
    F --> G
```
Recently, models like DeepSeek-R1 and OpenAI o1 have pushed post-training toward “make the model reason better,” not just “make the model sound nicer.” GRPO’s inclusion in the stable tier reflects that shift.
What GRPO Is
Among the stable methods, GRPO (Group Relative Policy Optimization) is the most noteworthy. DeepSeek introduced it in the DeepSeekMath paper, and it later became widely known through DeepSeek-R1.
In PPO (Proximal Policy Optimization), the model that generates answers needs a separate critic model that scores those answers. That critic is often nearly as large as the main model, so GPU memory and compute costs are effectively doubled.
GRPO removes the critic entirely. Instead, it generates multiple answers to the same prompt and scores them relative to one another. Think of it like firing the teacher who says “80 points or you fail” and replacing them with one who ranks students relative to the rest of the class.
| Item | PPO | GRPO |
|---|---|---|
| Critic model | Required, usually as large as the main model | Not required |
| Evaluation basis | The critic’s predicted score | Group average score |
| GPU memory use | Large (two models’ worth) | Smaller (one model) |
| Cost efficiency | Baseline | Up to 18x more efficient |
DeepSeek reported that GRPO matched or beat PPO-style training in math benchmarks, improving GSM8K from 82.9% to 88.2% and MATH from 46.8% to 51.7%.
```mermaid
graph TD
    A[Input a question] --> B[Model generates<br/>G answers in a batch]
    B --> C[Score each answer]
    C --> D[Convert scores to<br/>relative ranking inside the group]
    D --> E[Compute which answers<br/>were better]
    E --> F[Update the model to<br/>reinforce good patterns]
    F --> G[Control drift so the model<br/>does not move too far]
    G --> A
```
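The relative-ranking step in that loop fits in a few lines. This is a sketch of the group-normalization idea from the DeepSeekMath paper, not TRL's actual implementation: each answer's reward is scored against its own group's mean and standard deviation.

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize each reward against its group's mean and std (GRPO-style)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:                        # every answer scored the same:
        return [0.0] * len(rewards)     # no relative signal to learn from
    return [(r - mean) / std for r in rewards]

# Four sampled answers to one prompt: two correct (+1), two wrong (-1)
group_relative_advantages([1.0, -1.0, 1.0, -1.0])  # → [1.0, -1.0, 1.0, -1.0]
```

The correct answers end up with positive advantage and the wrong ones with negative advantage, using nothing but the group itself, which is why no critic model is needed.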
Positioning Against Other Libraries
TRL is aiming for ease of adoption, not maximum performance. The comparison table in the article is telling:
| Library | Monthly downloads | Semver | Required scale |
|---|---|---|---|
| TRL | 3 million | Stable | Low (single GPU possible) |
| OpenRLHF | 3,600 | Partial | High |
| veRL | 102k | Partial | High |
| LLaMA-Factory | 363k | Partial | Medium |
TRL wins on distribution because it fits naturally into the Hugging Face ecosystem. It supports LoRA/QLoRA, integrates with HF Hub, and handles VLM training too.
What it does not natively do is tensor parallelism across multiple GPUs. It can use DeepSpeed/FSDP for distributed training, but at very large scale, tools like verl or OpenRLHF still have an edge.
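For multi-GPU runs, the usual route is Hugging Face Accelerate with a DeepSpeed or FSDP backend; a typical launch (the script name is a placeholder) looks like:

```shell
# Data-parallel training across 8 GPUs with DeepSpeed via Accelerate
accelerate launch --num_processes 8 --use_deepspeed train_grpo.py
```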
The Asynchronous GRPO Roadmap
The top item on the roadmap is asynchronous GRPO. Right now, TRL’s GRPO flow is sequential: generate answers, compute scores, update the model. If the reasoning chain is long, the model-update GPUs sit idle and utilization can fall to around 60%.
Asynchronous GRPO runs generation and model updates on separate GPUs at the same time. The generator keeps producing answers while the trainer keeps consuming them. It is the factory conveyor-belt model: the next stage does not wait for the previous one to fully stop.
If generation gets too far ahead, the generator is throttled so the model state does not drift. The design already exists, and it matches the async split architectures used by libraries like verl and AReaL.
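The throttling mechanism is essentially a bounded producer-consumer queue. A toy sketch, with plain Python threads standing in for the generator and trainer GPUs:

```python
import queue
import threading

rollouts = queue.Queue(maxsize=2)  # bound = how far generation may run ahead
trained_on = []

def generator(num_batches):
    for step in range(num_batches):
        rollouts.put(f"batch-{step}")  # blocks once 2 batches are already waiting
    rollouts.put(None)                 # sentinel: generation finished

def trainer():
    while (batch := rollouts.get()) is not None:
        trained_on.append(batch)       # stand-in for a model update

gen = threading.Thread(target=generator, args=(5,))
tr = threading.Thread(target=trainer)
gen.start(); tr.start()
gen.join(); tr.join()
print(trained_on)  # every batch consumed, in order
```

The `maxsize` bound is the throttle: when the trainer falls behind, `put()` blocks and the generator pauses, so the policy being trained never drifts too far from the one that produced the rollouts.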
Other roadmap items include promoting experimental methods like KTO, SDFT, SDPO, and GOLD into stable, adding MoE parallel support, and making it easier for AI agents to monitor training runs. The interesting bit there is automatic detection of anomalies, such as low GPU utilization or a collapsing reward standard deviation.
```
[TRL] WARNING: VRAM utilization at 34%. Consider increasing
per_device_train_batch_size from 4 to 16.
[TRL] WARNING: Group reward std is 0.01 (near zero).
Advantage signal has collapsed.
```
The goal is to let humans or AI agents catch training problems early without babysitting the run.
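The reward-std check in particular is easy to state precisely. A minimal sketch (the threshold value is an assumption, not a TRL default):

```python
import statistics

def reward_signal_collapsed(group_rewards, threshold=0.05):
    # If every answer in a group earns (almost) the same reward, the
    # group-relative advantage is (almost) zero and learning stalls.
    return statistics.pstdev(group_rewards) < threshold

reward_signal_collapsed([1.0, 1.0, 1.0, 1.0])    # → True: nothing to rank
reward_signal_collapsed([1.0, -1.0, 1.0, -1.0])  # → False: healthy spread
```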
What v1.0 Means
Only SFT, DPO, GRPO, RLOO, and Reward Modeling made it into Stable. The other 75+ methods are still Experimental. Given how quickly the field has exploded since DeepSeek-R1 made GRPO popular in 2024, v1.0 is not saying “the field is done changing.” It is saying “here is what is stable and here is what is still moving.”
What Do People Actually Train?
Now that the methods are out of the way, it helps to talk about what TRL is used for in practice.
There are two common cases.
The first is specialization. Suppose you want a chatbot for internal documents. You can start from LLaMA 3.1 8B, use SFT on thousands to tens of thousands of internal Q&A examples, and then use DPO to reflect user preferences. With QLoRA, the whole workflow can run on a gaming PC with an RTX 4090 and 24GB of VRAM. The practical strength of TRL is that you can experiment without needing university- or lab-scale hardware.
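A sketch of the first stage of that workflow, assuming a JSONL file of internal Q&A conversations (the file name and LoRA settings are illustrative; adding 4-bit quantization via bitsandbytes would turn this LoRA setup into QLoRA):

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Hypothetical internal dataset: one {"messages": [...]} record per line
dataset = load_dataset("json", data_files="internal_qa.jsonl", split="train")

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32),  # train small adapters, not all weights
    args=SFTConfig(output_dir="./sft_output"),
)
trainer.train()
```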
The second is reasoning improvement. For tasks like math or programming, where correctness can be checked automatically, GRPO is a good fit.
```python
from trl import GRPOTrainer, GRPOConfig

# Reward function: +1 for correct, -1 for incorrect.
# `verify` stands for a task-specific checker you supply, e.g. one that
# extracts the final answer and compares it to the reference solution.
def reward_fn(completions, **kwargs):
    return [1.0 if verify(c) else -1.0 for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B",
    reward_funcs=[reward_fn],
    train_dataset=dataset,  # your prompt dataset (required)
    args=GRPOConfig(
        num_generations=8,  # generate 8 responses per prompt
        output_dir="./grpo_output",
    ),
)
trainer.train()
```
The reward function automatically checks whether the answer is correct, and reinforcement learning expands the model’s ability to follow the right chain of reasoning. If correctness can be checked mechanically, the whole thing can run end-to-end without human labeling.
DeepSeek reported more than a 5-point gain on GSM8K with this approach, which shows that even a small model can be pushed into stronger reasoning with the right post-training stack.