
Design patterns for LLM asynchronous training seen in 16 open source RL libraries

Ikesan

A comparative analysis of 16 existing open source RL (reinforcement learning) training libraries was published on the Hugging Face blog. It covers verl, PRIME-RL, AReaL, SkyRL, and others, and organizes their design choices along seven axes.

Fundamental synchronous bottleneck

GRPO training in current TRL (Transformer Reinforcement Learning) runs synchronously. Each step blocks through the whole pipeline: sampling prompts, generating completions, computing rewards, updating gradients, and pushing weights to the inference server.

The bottleneck is generation time. With standard GRPO settings (64 prompts × G = 8 completions = 512 rollouts), the time required per step at each output length is roughly as follows.

| Output length | Total tokens | 7B / H100 | 32B / H100 |
|---|---|---|---|
| 2K | ~1 million | 3 minutes | 14 minutes |
| 8K | ~4 million | 11 minutes | 56 minutes |
| 32K | ~16 million | 45 minutes | 3.7 hours |
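The token totals in the table come from simple multiplication; a quick sanity check (output lengths assumed to be powers of two):

```python
# Sanity-check the rollout arithmetic above: 64 prompts x G = 8 completions.
prompts, group_size = 64, 8
rollouts = prompts * group_size              # 512 rollouts per step

for label, output_len in [("2K", 2048), ("8K", 8192), ("32K", 32768)]:
    total_tokens = rollouts * output_len
    print(f"{label}: ~{total_tokens / 1e6:.1f}M tokens generated per step")
```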

Generation runs while the training GPUs wait; training runs while the inference GPUs wait. Repeating this cycle drops effective GPU utilization to around 60%, and the longer the CoT (chain-of-thought) traces get, the worse it becomes.

sequenceDiagram
    participant P as Prompt sampling
    participant G as Generation (minutes to 3.7 hours)
    participant R as Reward computation
    participant T as Training step
    participant S as Weight sync

    P->>G: 64 prompts
    note over G: GPU busy
    G->>R: 512 rollouts
    note over T: GPU idle
    R->>T: samples with rewards
    T->>S: updated weights
    note over G: GPU idle
    S-->>G: to the next step

An asynchronous design decouples generation from training and runs them in parallel: the inference pool keeps generating while the training pool keeps pulling samples from a buffer and updating the policy. GPU utilization can exceed 95%.
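This producer/consumer decoupling can be sketched with a bounded asyncio queue. A toy model: the loop bodies are stand-ins, and real libraries run these loops on separate inference and training GPU pools with actual weight transfer in between.

```python
import asyncio

# Toy decoupled pipeline: generation and training overlap through a buffer.
consumed = []

async def generator(buffer: asyncio.Queue, steps: int):
    for i in range(steps):
        await asyncio.sleep(0)               # stands in for slow rollout generation
        await buffer.put({"rollout": i})     # blocks only when the buffer is full

async def trainer(buffer: asyncio.Queue, steps: int):
    for _ in range(steps):
        sample = await buffer.get()          # trains as soon as a sample is ready
        consumed.append(sample)              # ...rewards, gradients, weight push...

async def main():
    buffer = asyncio.Queue(maxsize=4)        # bounded rollout buffer
    await asyncio.gather(generator(buffer, 16), trainer(buffer, 16))

asyncio.run(main())
print(len(consumed))  # → 16
```

The queue's `maxsize` is exactly the "rollout buffer depth" discussed as axis 2: it decides how far generation may run ahead of training.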

7 design axes

The differences between the 16 libraries can be organized along seven axes.

1. Fundamentals of orchestration

Eight of the 16 libraries use Ray. The rest are split among asyncio + multiprocessing, Redis-queue pub/sub, and HTTP microservices.

There is a reason Ray is in the majority: the inference pool and training pool are heterogeneous, with different GPU requirements, and Ray automates their resource management and failure recovery. Zero-copy transfers through its shared-memory object store also pay off at scale.

On the other hand, a growing number of libraries choose plain Python. PipelineRL (ServiceNow) and ART (OpenPipe) are implemented with asyncio and multiprocessing and run simply, with no dependency on Ray. This is easier to operate in single- to few-node environments.

2. Rollout buffer depth

The design changes depending on how the generated samples are stored.

| Buffer type | Depth | Examples |
|---|---|---|
| No buffer (synchronous) | 0 | TRL (current) |
| Double buffer | 1 | verifiers-rl, SLIME |
| Bounded queue | 2 to K | SkyRL, verl, NeMo-RL |
| Unbounded stream | ∞ | PipelineRL, Atropos |

The deeper the buffer, the higher the throughput, but the staler the buffered samples become.
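The trade-off can be made concrete with a toy FIFO model, assuming the generator always keeps the buffer topped up and the policy version bumps once per training step (this is not any specific library's scheduler; real systems add generation latency on top):

```python
from collections import deque

def max_staleness(depth, steps=100):
    """Toy FIFO pipeline: worst-case age of a consumed sample, in policy versions."""
    buffer, version, worst = deque(), 0, 0
    for _ in range(steps):
        while len(buffer) < depth:         # generator tops the buffer up
            buffer.append(version)         # tag sample with current policy version
        worst = max(worst, version - buffer.popleft())  # trainer takes the oldest
        version += 1                       # weights updated after this step
    return worst

print([max_staleness(d) for d in (1, 2, 8)])  # → [0, 1, 7]
```

In this idealized model the worst-case sample age is depth − 1: a bounded queue implicitly caps staleness, an unbounded stream does not.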

3. Weight synchronization protocol

There are four main methods for transferring post-training weights to the inference server, and there are large differences in speed.

NCCL Broadcast (most common): Per-parameter broadcast, approximately 500ms for 32B models. Simple but slow.

Bucketed NCCL (verl): Send parameters in 1GB buckets. Approximately 20ms for the same 32B model. 25x improvement.

CUDA IPC (NeMo-RL, MILES): Assumes inference and training are colocated on the same node. Zero-copy, ultra-low latency.

LoRA adapter-only sync (AReaL, OAT, SkyRL): The most effective. The base weights stay resident on the inference server unchanged; only the trained LoRA adapter is transferred.

Full weight sync (32B model):
  32B params × 2 bytes (bf16) = 64 GB
  NCCL transfer: ~500 ms

LoRA adapter-only sync (rank 32):
  ~50 MB
  NCCL transfer: ~1 ms

Reduction: ~1000× in data transferred (~500× in transfer time)

This nearly 1000x reduction eliminates much of the complexity surrounding weight synchronization.
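Returning to bucketed NCCL: the 25× speedup is mostly launch-overhead arithmetic. One collective per parameter means hundreds or thousands of small NCCL calls; 1 GB buckets mean a few dozen. A framework-free sketch of the grouping step (parameter names and sizes are illustrative, not verl's API):

```python
def bucket_params(param_sizes, bucket_bytes=1 << 30):
    """Greedily pack (name, byte_size) pairs into buckets of at most ~1 GB.

    Each bucket would be flattened into one contiguous buffer and sent with a
    single NCCL broadcast, replacing one broadcast per parameter.
    """
    buckets, current, used = [], [], 0
    for name, size in param_sizes:
        if current and used + size > bucket_bytes:
            buckets.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        buckets.append(current)
    return buckets

# Illustrative shapes: 256 tensors of 256 MB each = 64 GB (a 32B model in bf16)
params = [(f"layer{i}.weight", 256 << 20) for i in range(256)]
print(len(bucket_params(params)))  # → 64 broadcasts instead of hundreds
```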

4. Staleness management

There are three orthogonal strategies for handling stale samples in the buffer.

Version rejection: Records the policy version at the time of generation in the sample, and discards it if the difference from the current version exceeds a threshold. Adopted by NeMo-RL (max_trajectory_age_steps), PipelineRL, etc.

Depth bounding: Limit the buffer size and automatically set a staleness upper limit. Adopted by SkyRL (max_staleness_steps), Atropos, etc.

Importance sampling (IS) correction: Apply weight correction to samples generated with the old policy and use them for training. Adopted by verl, MILES, ROLL, etc. PPO style clipped IS ratio is common.

The most robust configuration in production is a hybrid: PRIME-RL (PrimeIntellect) implements version rejection + depth bounding + IS in combination.
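A hedged sketch of how the strategies compose at training time. The function name, thresholds, and signature are illustrative, not PRIME-RL's actual API; depth bounding is assumed to happen upstream in a bounded buffer, so only the per-sample checks appear here.

```python
import math

def accept_and_weight(sample_version, current_version,
                      logp_new, logp_old,
                      max_age=4, clip=0.2):
    """Version rejection + PPO-style clipped importance-sampling weight.

    Hypothetical helper: returns None to drop the sample, else the IS weight.
    """
    if current_version - sample_version > max_age:
        return None                        # version rejection: sample too stale
    ratio = math.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    return max(1.0 - clip, min(ratio, 1.0 + clip))  # clipped IS weight

print(accept_and_weight(3, 10, -1.0, -1.0))   # age 7 > 4 → None
print(accept_and_weight(9, 10, -0.9, -1.0))   # ratio e^0.1 ≈ 1.11, inside the clip
print(accept_and_weight(9, 10, -0.5, -1.0))   # ratio e^0.5 ≈ 1.65, clipped to 1.2
```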

5. Partial rollout handling

Another important design decision is how to handle rollouts that are in the process of being generated when updating weights.

  • PipelineRL: Swap weights between forward passes. Generation continues uninterrupted with new weights (finest granularity)
  • verl: save prefix in KV cache and resume after weight update
  • SkyRL/SLIME: Abort ongoing request and resubmit from prompt
  • NeMo-RL/ROLL: Block by training step (coarsest granularity)
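The abort-and-resubmit strategy (the SkyRL/SLIME row above) maps naturally onto task cancellation. A minimal asyncio sketch, with a hypothetical stand-in for the inference-server decode call (not either library's API):

```python
import asyncio
import contextlib

async def generate(prompt, weights_version):
    await asyncio.sleep(0.05)                 # stands in for a long decode request
    return f"{prompt} completed@v{weights_version}"

async def rollout_with_restart(prompt, weight_updates: asyncio.Queue):
    version = 0
    while True:
        decode = asyncio.create_task(generate(prompt, version))
        update = asyncio.create_task(weight_updates.get())
        done, _ = await asyncio.wait({decode, update},
                                     return_when=asyncio.FIRST_COMPLETED)
        if decode in done:                    # finished before any weight update
            update.cancel()
            with contextlib.suppress(asyncio.CancelledError):
                await update
            return decode.result()
        decode.cancel()                       # abort the in-flight request...
        version = update.result()             # ...adopt the new weights...
        # ...and resubmit from the original prompt (the KV cache is discarded)

async def main():
    updates = asyncio.Queue()
    rollout = asyncio.create_task(rollout_with_restart("p0", updates))
    await asyncio.sleep(0.01)                 # weight update lands mid-generation
    await updates.put(1)
    print(await rollout)                      # → p0 completed@v1

asyncio.run(main())
```

Discarding the in-flight work is what distinguishes this from verl's KV-cache resume and PipelineRL's mid-generation weight swap: it is the simplest option, at the cost of wasted decode tokens.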

6. LoRA support

Almost all of the 16 libraries are compatible with LoRA, but they differ in whether they implement adapter-only sync. SLIME and TorchForge do not support LoRA at all.

LoRA on MoE (Mixture of Experts) models poses a new problem. On a dense model, LoRA adds roughly 20 adapters across the attention layers, but applying LoRA to a 64-expert MoE scatters 200+ adapters across expert parallelism. Currently only ART has a clear MoE LoRA implementation.

7. Distributed training backend and parallelization

| Library | Backend | TP+PP+EP | MoE |
|---|---|---|---|
| verl | FSDP + Megatron | | |
| NeMo-RL | FSDP2 + Megatron | | |
| SkyRL | FSDP + Megatron | | |
| PRIME-RL | FSDP2 | ✅ (PP not supported) | |
| AReaL | FSDP2 + Megatron | | |
| PipelineRL | DeepSpeed | | |
| open-instruct | DeepSpeed | | |

Since DeepSpeed ZeRO-3 does not support expert parallelism, an MoE model must all-gather all experts, losing the advantage of sparsity. For large-scale MoE training, the Megatron path is essential.

List of key indicators for 16 libraries

Library        | Organization     | Stars | Orchestration    | LoRA | MoE
verl           | ByteDance        | 19.7K | Ray              | ✅   | ✅
ART            | OpenPipe         |  9.0K | asyncio+mp       | ✅   | ✅
SLIME          | THUDM            |  4.6K | Ray              | ❌   | ✅
AReaL          | inclusionAI      |  4.3K | asyncio/Ray      | ✅   | ✅
verifiers-rl   | PrimeIntellect   |  3.9K | asyncio+threading| ✅   | ❌
open-instruct  | AI2              |  3.6K | Ray              | ❌   | ❌
ROLL           | Alibaba          |  2.9K | Ray              | ✅   | ✅
Tunix          | Google           |  2.2K | asyncio+JAX      | ✅   | ❌
SkyRL          | NovaSky-AI       |  1.7K | Ray              | ✅   | ✅
NeMo-RL        | NVIDIA           |  1.4K | Ray              | ✅   | ✅
PRIME-RL       | PrimeIntellect   |  1.1K | asyncio+FS/ZMQ   | ✅   | ✅
MILES          | radixark         |  1.0K | Ray              | ✅   | ✅
Atropos        | NousResearch     |  0.9K | HTTP             | ✅   | ❌
TorchForge     | Meta             |  0.6K | Monarch(PyTorch) | ❌   | ❌
OAT            | SAIL-SG          |  0.6K | Ray              | ✅   | ❌
PipelineRL     | ServiceNow       |  0.4K | asyncio+Redis    | ✅   | ❌

Open issues presented by DeepSeek-V3.2

There are open issues affecting all 16 libraries. The case of DeepSeek-V3.2 made this clear.

Routing preservation problem: Expert routing is computed slightly differently between inference (vLLM) and training (Megatron), so floating-point rounding differences can select different experts. Solving this requires recording the routing paths chosen at inference time in each sample and forcibly replaying them during training.

Sampling-mask preservation problem: If generation used top-p sampling with part of the vocabulary masked, the importance-sampling ratio is undefined unless the same mask is applied during training. This likewise requires saving the mask at inference time and reapplying it during training.
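Why the ratio becomes undefined: a token masked to probability zero at sampling time gets a nonzero probability from the unmasked training-side softmax, and every unmasked token's probability is rescaled by the renormalization. A toy illustration in plain softmax arithmetic (no specific library's code):

```python
import math

def softmax(logits, masked=None):
    """Softmax with an optional sampling-time vocabulary mask (toy model)."""
    exps = [0.0 if masked and i in masked else math.exp(x)
            for i, x in enumerate(logits)]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
p_sampled = softmax(logits, masked={2})  # inference side: token 2 masked out
p_train   = softmax(logits)              # training side: the mask was not saved

# Token 2 was impossible at sampling time but has nonzero training probability,
# so its IS ratio p_train[2] / p_sampled[2] divides by zero -- and every other
# token's ratio is silently biased by the renormalization.
print(p_sampled[2])                       # → 0.0
print(round(p_train[0] / p_sampled[0], 3))
```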

No library currently supports either. Both remain open challenges for next-generation training frameworks.

Selection guidelines by application

If you need maximum throughput (100M+ tokens/day), use verl (ByteDance). It fully supports Megatron + EP, and its bucketed NCCL achieves roughly 20 ms weight synchronization.

If simplicity is your priority, choose PipelineRL or ART. Native Python implementation without Ray, easy to use even in a small GPU environment.

If you want to strictly manage staleness, use PRIME-RL (PrimeIntellect). It is a 3-layer hybrid of version rejection + depth bounding + IS, and also supports process rewards.

If you need MoE model training, use verl, NeMo-RL (NVIDIA official), or SLIME (THUDM is an organization close to DeepSeek).

If you want to support any inference server, use Atropos (NousResearch). You can use any vLLM/SGLang/OpenAI API compatible server with FastAPI-based HTTP microservices. The latency will be higher accordingly.