Design patterns for LLM asynchronous training seen in 16 open source RL libraries
A comparative analysis of 16 existing open source RL (reinforcement learning) training libraries, including verl, PRIME-RL, AReaL, and SkyRL, was published on the Hugging Face blog; it organizes their design choices into seven axes.
Fundamental synchronous bottleneck
GRPO training in current TRL (Transformer Reinforcement Learning) runs synchronously: each step blocks through the full pipeline of sampling prompts, generating completions, computing rewards, updating gradients, and pushing weights to the inference server.
The dominant cost is generation. With the standard GRPO settings (64 prompts × G = 8 completions = 512 rollouts), the generation time per step for each output length is roughly as follows.
| Output length | Total number of tokens | 7B/H100 | 32B/H100 |
|---|---|---|---|
| 2K | Approx. 1 million | 3 minutes | 14 minutes |
| 8K | Approx. 4 million | 11 minutes | 56 minutes |
| 32K | Approx. 16 million | 45 minutes | 3.7 hours |
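The token counts in the table can be sanity-checked directly from the rollout configuration. A minimal sketch, assuming "2K" means 2048 tokens per completion:

```python
# Back-of-the-envelope check of the rollout token counts above.
# Assumes 64 prompts x G = 8 completions = 512 rollouts per step
# (the standard GRPO setting cited in the text).
PROMPTS = 64
COMPLETIONS_PER_PROMPT = 8  # G = 8
ROLLOUTS = PROMPTS * COMPLETIONS_PER_PROMPT  # 512

def total_tokens(output_len: int) -> int:
    """Total generated tokens for one GRPO step."""
    return ROLLOUTS * output_len

for k in (2, 8, 32):
    print(f"{k}K output -> {total_tokens(k * 1024):,} tokens")
# 2K -> 1,048,576 (~1M); 8K -> ~4M; 32K -> ~16M, matching the table
```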
Inference runs while the training GPUs wait; then training runs while inference stops; then inference starts again. Repeating this cycle drops effective GPU utilization to around 60%, and the longer the CoT (chain-of-thought) traces become, the worse it gets.
sequenceDiagram
participant P as Prompt sampling
participant G as Generation (minutes to 3.7 hours)
participant R as Reward computation
participant T as Training step
participant S as Weight sync
P->>G: 64 prompts
note over G: GPU busy
G->>R: 512 rollouts
note over T: GPU idle
R->>T: samples with rewards
T->>S: updated weights
note over G: GPU idle
S-->>G: next step
An asynchronous design decouples generation from training and runs them in parallel: while the inference pool keeps generating, the training pool keeps drawing samples from a buffer and updating the policy. GPU utilization can exceed 95%.
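The overlap can be sketched with a plain asyncio producer/consumer pair; the queue size, function names, and placeholder work are all illustrative, not taken from any of the libraries:

```python
# Minimal sketch of the async pattern: an inference pool keeps producing
# rollouts into a bounded queue while the trainer consumes them, so
# neither side blocks for a full step.
import asyncio

async def generator(queue: asyncio.Queue, n_rollouts: int) -> None:
    for i in range(n_rollouts):
        await asyncio.sleep(0)              # stands in for vLLM/SGLang generation
        await queue.put({"rollout_id": i})  # blocks only when the buffer is full

async def trainer(queue: asyncio.Queue, n_rollouts: int) -> int:
    steps = 0
    for _ in range(n_rollouts):
        await queue.get()                   # draws from the buffer as samples arrive
        steps += 1                          # stands in for reward + gradient update
    return steps

async def main() -> int:
    queue: asyncio.Queue = asyncio.Queue(maxsize=4)  # bounded buffer (see axis 2)
    gen = asyncio.create_task(generator(queue, 16))
    steps = await trainer(queue, 16)
    await gen
    return steps

print(asyncio.run(main()))  # 16
```

The bounded queue is what keeps the two sides loosely coupled: generation only pauses when the buffer is full, and training only pauses when it is empty.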
7 design axes
Organized systematically, the differences between the 16 libraries fall along seven axes.
1. Fundamentals of orchestration
Eight of the 16 libraries use Ray. The rest are split among asyncio plus multiprocessing, Redis-backed queues and pub/sub, and HTTP microservices.
There’s a reason Ray holds the majority. The inference pool and training pool form a heterogeneous setup with different GPU requirements, and Ray automates resource management and failure recovery across them. Zero-copy transfers through its shared-memory object store also pay off in large-scale environments.
On the other hand, a growing number of libraries choose native Python. PipelineRL (ServiceNow) and ART (OpenPipe) are implemented with asyncio and multiprocessing and run simply without depending on Ray, which makes them easier to operate in single-node to few-node environments.
2. Rollout buffer depth
The design also differs in how generated samples are buffered.
| Buffer type | Depth | Example libraries |
|---|---|---|
| No buffer (synchronous) | 0 | TRL (current) |
| Double buffer | 1 | verifiers-rl, SLIME |
| Bounded queue | 2–K | SkyRL, verl, NeMo-RL |
| Unbounded stream | ∞ | PipelineRL, Atropos |
The deeper the buffer, the higher the throughput, but the staler the samples sitting in it become.
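A toy simulation (hypothetical, not from any library) shows how depth directly bounds staleness when the trainer consumes in FIFO order and the policy version advances once per training step:

```python
# Toy model of the depth/staleness trade-off: with a FIFO buffer of
# depth K, a sample generated under policy version v is trained on at
# a version of at most v + K, so buffer depth caps staleness.
from collections import deque

def max_staleness(depth: int, n_steps: int = 100) -> int:
    buffer: deque = deque()
    version = 0
    worst = 0
    for _ in range(n_steps):
        buffer.append(version)       # generator tags sample with current version
        if len(buffer) > depth:      # trainer consumes once the buffer exceeds depth
            produced_at = buffer.popleft()
            worst = max(worst, version - produced_at)
        version += 1                 # each training step bumps the policy version
    return worst

print([max_staleness(d) for d in (0, 1, 4)])  # [0, 1, 4]
```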
3. Weight synchronization protocol
There are four main methods for transferring post-update weights to the inference server, and they differ widely in speed.
- NCCL broadcast (most common): per-parameter broadcast, roughly 500ms for a 32B model. Simple but slow.
- Bucketed NCCL (verl): parameters sent in 1GB buckets, roughly 20ms for the same 32B model. A 25x improvement.
- CUDA IPC (NeMo-RL, MILES): assumes inference and training share a node. Zero-copy, ultra-low latency.
- LoRA adapter-only sync (AReaL, OAT, SkyRL): the most effective. The base weights stay fixed on the inference server, and only the trained LoRA adapters are transferred.
Full weight sync (32B model):
32B params × 2 bytes (bf16) = 64GB
NCCL transfer: ~500ms
LoRA adapter-only sync (rank-32):
~50MB
NCCL transfer: ~1ms
Difference: 500x reduction in transfer time
This reduction, roughly three orders of magnitude in payload and about 500x in transfer time, eliminates much of the complexity surrounding weight synchronization.
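The arithmetic behind the comparison above, with the ~50MB rank-32 LoRA payload taken from the text rather than derived:

```python
# Rough arithmetic behind the weight-sync comparison: a 32B-parameter
# model in bf16 vs. the ~50MB rank-32 LoRA figure quoted in the text.
FULL_PARAMS = 32e9
BYTES_BF16 = 2

full_bytes = FULL_PARAMS * BYTES_BF16   # 64 GB of full weights
lora_bytes = 50e6                       # ~50 MB (quoted, not derived here)

print(f"full sync payload: {full_bytes / 1e9:.0f} GB")
print(f"payload reduction: ~{full_bytes / lora_bytes:,.0f}x")  # ~1,280x
```

The payload shrinks by roughly 1300x, while the quoted transfer times (500ms vs. ~1ms) give about a 500x speedup, since small transfers are dominated by fixed overheads rather than bandwidth.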
4. Staleness management
There are three orthogonal strategies for dealing with stale samples in the buffer.
Version rejection: record the policy version at generation time in each sample, and discard the sample if the gap to the current version exceeds a threshold. Adopted by NeMo-RL (max_trajectory_age_steps), PipelineRL, and others.
Depth bounding: bound the buffer size, which automatically caps staleness. Adopted by SkyRL (max_staleness_steps), Atropos, and others.
Importance sampling (IS) correction: reweight samples generated under an older policy and use them for training. Adopted by verl, MILES, ROLL, and others; a PPO-style clipped IS ratio is common.
The most robust configuration in production is a hybrid: PRIME-RL (PrimeIntellect) implements version rejection, depth bounding, and IS correction combined.
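A minimal sketch of such a hybrid, combining version rejection with a PPO-style clipped IS weight; the function names, thresholds, and sample fields are illustrative, not PRIME-RL's API:

```python
# Hybrid staleness handling: reject samples older than a version
# threshold, then apply a PPO-style clipped importance-sampling ratio
# to the survivors. All names here are illustrative.
import math

def clipped_is_weight(logp_new: float, logp_old: float, eps: float = 0.2) -> float:
    """Clip exp(logp_new - logp_old) to [1 - eps, 1 + eps], PPO style."""
    ratio = math.exp(logp_new - logp_old)
    return max(1 - eps, min(ratio, 1 + eps))

def filter_and_weight(samples, current_version: int, max_age: int = 2):
    out = []
    for s in samples:
        if current_version - s["version"] > max_age:
            continue  # version rejection: too stale, drop it
        w = clipped_is_weight(s["logp_new"], s["logp_old"])
        out.append((s["version"], w))
    return out

samples = [
    {"version": 9, "logp_old": -1.0, "logp_new": -1.0},  # fresh, ratio 1.0
    {"version": 5, "logp_old": -1.0, "logp_new": -0.2},  # too stale, dropped
]
print(filter_and_weight(samples, current_version=10))  # [(9, 1.0)]
```

Depth bounding is the third layer and lives in the buffer itself (a bounded queue), so it does not appear in this per-sample filter.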
5. Partial rollout process
Another important design decision is how to handle rollouts that are mid-generation when the weights are updated.
- PipelineRL: swaps weights between forward passes; generation continues uninterrupted on the new weights (finest granularity)
- verl: saves the prefix in the KV cache and resumes after the weight update
- SkyRL/SLIME: aborts the in-flight request and resubmits it from the prompt
- NeMo-RL/ROLL: blocks per training step (coarsest granularity)
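The finest-granularity end of this spectrum can be illustrated with a toy decode loop that checks for a pushed weight version between steps; everything here is schematic (no real decoding or weight loading happens):

```python
# Toy illustration of mid-stream weight swapping (PipelineRL style):
# the generation loop checks for a new policy version between decode
# steps, so an in-flight rollout finishes on updated weights instead
# of being aborted and resubmitted.
def generate_with_midstream_swap(n_tokens: int, weight_updates: dict):
    """weight_updates maps decode-step index -> new policy version."""
    version = 0
    trace = []
    for step in range(n_tokens):
        if step in weight_updates:       # swap weights between forward passes
            version = weight_updates[step]
        trace.append(version)            # token `step` produced under `version`
    return trace

# One 8-token rollout; the trainer pushes version 1 after token 3.
print(generate_with_midstream_swap(8, {3: 1}))  # [0, 0, 0, 1, 1, 1, 1, 1]
```

A single rollout can thus mix tokens from multiple policy versions, which is exactly why this strategy is usually paired with the IS corrections from axis 4.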
6. LoRA support
Almost all of the 16 libraries support LoRA, but they differ in whether they implement adapter-only sync. SLIME and TorchForge do not support LoRA at all.
LoRA on MoE (Mixture of Experts) models raises a new problem. On a dense model, LoRA means roughly 20 adapters across the attention layers, but applying LoRA to an MoE with 64 experts means distributing more than 200 adapters under expert parallelism. Currently only ART has a clear MoE LoRA implementation.
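A rough counting sketch behind the dense-vs-MoE adapter numbers, under assumed target modules (two LoRA-targeted projections per expert MLP, two MoE layers); real counts depend entirely on the target-module configuration:

```python
# Hypothetical adapter counting for LoRA on MoE. The shapes here are
# assumptions for illustration, not any library's actual config: each
# expert MLP gets LoRA on 2 projections (e.g. up/down), across 2 MoE
# layers, so adapters scale linearly with the expert count.
def moe_adapter_count(n_experts: int, adapters_per_expert: int = 2,
                      n_moe_layers: int = 2) -> int:
    return n_experts * adapters_per_expert * n_moe_layers

print(moe_adapter_count(64))  # 256 adapters to shard across EP ranks
```

Unlike the handful of attention-layer adapters in the dense case, these adapters follow their experts and must be sharded and synced under expert parallelism, which is what makes MoE LoRA hard.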
7. Distributed training backend and parallelization
| Library | Backend | TP+PP+EP compatible | MoE |
|---|---|---|---|
| verl | FSDP+Megatron | ✅ | ✅ |
| NeMo-RL | FSDP2+Megatron | ✅ | ✅ |
| SkyRL | FSDP+Megatron | ✅ | ✅ |
| PRIME-RL | FSDP2 | ✅ (PP not supported) | ✅ |
| AReaL | FSDP2+Megatron | ✅ | ✅ |
| PipelineRL | DeepSpeed | ❌ | ❌ |
| open-instruct | DeepSpeed | ❌ | ❌ |
Since DeepSpeed ZeRO-3 does not support expert parallelism, an MoE model must AllGather all experts, losing the advantage of sparsity. For training MoE models at scale, the Megatron path is essential.
Key indicators for the 16 libraries

| Library | Organization | Stars | Orchestration | LoRA | MoE |
|---|---|---|---|---|---|
| verl | ByteDance | 19.7K | Ray | ✅ | ✅ |
| ART | OpenPipe | 9.0K | asyncio+mp | ✅ | ✅ |
| SLIME | THUDM | 4.6K | Ray | ❌ | ✅ |
| AReaL | inclusionAI | 4.3K | asyncio/Ray | ✅ | ✅ |
| verifiers-rl | PrimeIntellect | 3.9K | asyncio+threading | ✅ | ❌ |
| open-instruct | AI2 | 3.6K | Ray | ❌ | ❌ |
| ROLL | Alibaba | 2.9K | Ray | ✅ | ✅ |
| Tunix | Google | 2.2K | asyncio+JAX | ✅ | ❌ |
| SkyRL | NovaSky-AI | 1.7K | Ray | ✅ | ✅ |
| NeMo-RL | NVIDIA | 1.4K | Ray | ✅ | ✅ |
| PRIME-RL | PrimeIntellect | 1.1K | asyncio+FS/ZMQ | ✅ | ✅ |
| MILES | radixark | 1.0K | Ray | ✅ | ✅ |
| Atropos | NousResearch | 0.9K | HTTP | ✅ | ❌ |
| TorchForge | Meta | 0.6K | Monarch (PyTorch) | ❌ | ❌ |
| OAT | SAIL-SG | 0.6K | Ray | ✅ | ❌ |
| PipelineRL | ServiceNow | 0.4K | asyncio+Redis | ✅ | ❌ |
Open issues presented by DeepSeek-V3.2
Some open issues affect all 16 libraries, and the case of DeepSeek-V3.2 made them concrete.
Routing mismatch problem: expert routing is computed slightly differently at inference (vLLM) and training (Megatron) time, so floating-point rounding differences can select different experts. Solving this requires recording the routing path taken during inference in each sample and forcibly replaying it during training.
Sampling mask problem: when part of the vocabulary is masked during top-p sampling at generation time, the importance-sampling ratio is undefined unless the same mask is applied during training. This likewise requires saving the inference-time mask and reapplying it during training.
No library currently supports either of these; they remain challenges for next-generation training frameworks to solve.
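The sampling-mask issue can be illustrated with a toy masked softmax over a 4-token vocabulary: store the inference-time mask with the sample and reapply it when computing training-time log-probs, so the IS ratio stays well defined. All names and numbers here are illustrative:

```python
# Sketch of sampling-mask replay: the vocabulary mask used at inference
# time is saved with the rollout and reapplied at training time, so
# both log-probs are normalized over the same token set.
import math

def masked_log_softmax(logits, mask):
    # mask[i] = True means token i was sampleable at inference time
    kept = [l for l, m in zip(logits, mask) if m]
    z = math.log(sum(math.exp(l) for l in kept))
    return [l - z if m else float("-inf") for l, m in zip(logits, mask)]

inference_logits = [2.0, 1.0, 0.5, -1.0]
mask = [True, True, False, False]     # e.g. tokens pruned by top-p
sample = {"token": 1, "mask": mask}   # mask saved alongside the rollout

# Training side: logits differ slightly under the training stack, but
# replaying the stored mask keeps the IS ratio well defined.
train_logits = [2.1, 0.9, 0.6, -1.2]
logp_old = masked_log_softmax(inference_logits, sample["mask"])[sample["token"]]
logp_new = masked_log_softmax(train_logits, sample["mask"])[sample["token"]]
ratio = math.exp(logp_new - logp_old)
print(round(ratio, 3))
```

Without the replay, the training-time softmax would normalize over the full vocabulary, including tokens the policy could never have sampled, and the resulting ratio would not correspond to any valid proposal distribution.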
Selection guidelines by application
If you need maximum throughput (100M+ tokens/day), use verl (ByteDance): full Megatron + EP support, and roughly 20ms weight synchronization via bucketed NCCL.
If simplicity is your priority, choose PipelineRL or ART: native Python implementations without Ray, easy to run even on a small GPU setup.
If you want strict staleness management, use PRIME-RL (PrimeIntellect): a three-layer hybrid of version rejection, depth bounding, and IS correction, with process-reward support.
If you need MoE model training, use verl, NeMo-RL (NVIDIA official), or SLIME (from THUDM, an organization close to DeepSeek).
If you want to support any inference server, use Atropos (NousResearch): its FastAPI-based HTTP microservices work with any vLLM/SGLang/OpenAI-API-compatible server, at the cost of extra latency.