MegaTrain Trains a 120B-Parameter LLM on a Single GPU at Full Precision
A paper that directly overturns the assumption that LLM training requires GPU clusters has been posted to arXiv.
A research team from the University of Notre Dame and Lehigh University published MegaTrain, a system that trains a 120B-parameter LLM on a single GPU at full precision (BF16/FP32 mixed, no quantization).
The code is also available on GitHub.
This blog has previously covered building a LoRA training environment and async RL training architecture patterns, both of which took GPU VRAM constraints as a given.
MegaTrain removes that assumption entirely.
Until now, the rule of thumb was: if the model doesn’t fit in VRAM, quantize it or fall back to LoRA. MegaTrain adds full-precision training as a third option.
Breaking Down LLM Training Memory
Why does GPU VRAM become the bottleneck?
When you decompose training memory consumption into its components, optimizer state turns out to be far larger than the parameters themselves.
For BF16 mixed-precision training with the Adam optimizer, per-parameter memory consumption breaks down as follows:
| Component | Data Type | Bytes/Parameter | Description |
|---|---|---|---|
| Parameters (weights) | BF16 | 2 | Model weights |
| Gradients | BF16 | 2 | Computed during the backward pass |
| Adam first moment (m) | FP32 | 4 | Exponential moving average of gradients |
| Adam second moment (v) | FP32 | 4 | Exponential moving average of squared gradients |
| Master weights | FP32 | 4 | FP32 copy needed for updates |
| Total | — | 16 | |
Adam maintains a moving average of gradients (first moment) and a moving average of squared gradients (second moment) for each parameter, adaptively adjusting the learning rate.
These two FP32 tensors alone consume 8 bytes per parameter.
Add the need to write BF16-computed weights back to FP32 master weights, and the total reaches 16 bytes per parameter.
Calculated by model size (excluding activations):
| Model | Parameters | Required Memory (16B/param) | H200 VRAM (141GB) | A100 VRAM (40/80GB) |
|---|---|---|---|---|
| Qwen2.5-7B | 7.6B | ~122 GB | barely fits | OOM |
| Qwen2.5-14B | 14.7B | ~235 GB | OOM | OOM |
| Qwen2.5-32B | 32.5B | ~520 GB | OOM | OOM |
| Llama-70B | 70B | ~1,120 GB | OOM | OOM |
| GPT-OSS-120B | 120B | ~1,920 GB | OOM | OOM |
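The per-model figures above follow directly from the 16-byte rule. A minimal sketch (parameter counts taken from the table):

```python
# Per-parameter cost of BF16 mixed-precision training with Adam:
# BF16 weights (2) + BF16 gradients (2) + FP32 m (4) + FP32 v (4)
# + FP32 master weights (4) = 16 bytes per parameter.
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4  # = 16

def training_memory_gb(num_params: float) -> float:
    """Memory for parameters + optimizer state, excluding activations."""
    return num_params * BYTES_PER_PARAM / 1e9

models = {
    "Qwen2.5-7B": 7.6e9,
    "Qwen2.5-14B": 14.7e9,
    "Qwen2.5-32B": 32.5e9,
    "Llama-70B": 70e9,
    "GPT-OSS-120B": 120e9,
}
for name, n in models.items():
    print(f"{name}: {training_memory_gb(n):,.1f} GB")
```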
Even at 7B, adding activations pushes past the H200’s 141GB.
Activations are intermediate results from the forward pass, needed for gradient computation in the backward pass.
At 14B and above, the required memory exceeds any single GPU’s VRAM.
Meanwhile, CPU memory on a single server can reach 1.5TB to 2TB.
Even 1,920GB for 120B fits in a 2TB server.
Then there’s the activation problem.
Activations grow proportionally with batch size and sequence length, reaching tens to hundreds of gigabytes.
Activation checkpointing (recomputing intermediate results on demand instead of storing them) trades compute time for memory savings.
MegaTrain’s layer-wise execution naturally limits resident activations to a single layer.
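To put rough numbers on this trade-off, a back-of-the-envelope sketch (the layer count, batch size, and per-layer footprint formula here are illustrative assumptions, not the paper’s figures):

```python
# Rough activation footprint: batch * seq_len * hidden * bytes per layer.
# The real footprint depends on the architecture (attention maps, MLP
# intermediates), so treat the per-layer term as an order-of-magnitude guess.
def activation_gb(layers, batch, seq_len, hidden, bytes_per_elem=2):
    per_layer = batch * seq_len * hidden * bytes_per_elem
    return layers * per_layer / 1e9, per_layer / 1e9

full, single = activation_gb(layers=28, batch=8, seq_len=4096, hidden=3584)
print(f"all layers resident: {full:.1f} GB")    # naive storage
print(f"one layer resident:  {single:.2f} GB")  # layer-wise streaming
```

Under these assumptions, keeping only one layer’s activations resident cuts the footprint by the layer count; checkpointing pays that back as recomputation during the backward pass.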
Limitations of Existing Approaches: ZeRO to FSDP
Multiple approaches have been proposed for this memory problem.
The most prominent are Microsoft’s DeepSpeed ZeRO series and PyTorch’s FSDP (Fully Sharded Data Parallelism).
ZeRO’s Three Stages
ZeRO (Zero Redundancy Optimizer) is built on the design philosophy of “eliminating redundant copies of the same data across multiple GPUs,” divided into three stages:
| Stage | Partitioned | Redundancy Eliminated | Memory Reduction |
|---|---|---|---|
| ZeRO-1 | Optimizer state | Adam’s m, v | Up to 4x |
| ZeRO-2 | + Gradients | Gradient tensors | Up to 8x |
| ZeRO-3 | + Parameters | Weights partitioned too | Up to Nx (GPU count) |
ZeRO-1 partitions only Adam’s two moments across GPUs.
Each GPU holds all parameter weights and gradients but only its share of optimizer state.
ZeRO-2 adds gradient partitioning, and ZeRO-3 partitions the parameters themselves.
ZeRO-3 theoretically requires only 1/N memory with N GPUs, but every computation needs an all-gather (a collective communication operation that assembles distributed data across all GPUs) to collect parameters from other GPUs, increasing communication costs.
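The stage-by-stage savings can be checked with a quick calculation. A sketch based on the standard ZeRO memory accounting (weights 2B, gradients 2B, optimizer state 12B per parameter; `n_gpus` is the degree of partitioning):

```python
# Per-parameter bytes held on each GPU under the ZeRO stages, for BF16
# mixed precision with Adam. A partitioned component costs its full size
# divided by the number of GPUs.
def zero_bytes_per_param(stage: int, n_gpus: int) -> float:
    weights, grads, opt = 2.0, 2.0, 12.0  # opt = FP32 m + v + master weights
    if stage >= 3:
        weights /= n_gpus  # ZeRO-3: partition parameters too
    if stage >= 2:
        grads /= n_gpus    # ZeRO-2: partition gradients
    if stage >= 1:
        opt /= n_gpus      # ZeRO-1: partition optimizer state
    return weights + grads + opt

for stage in (0, 1, 2, 3):
    b = zero_bytes_per_param(stage, n_gpus=8)
    print(f"ZeRO-{stage}, 8 GPUs: {b:.2f} bytes/param")
```

As `n_gpus` grows, ZeRO-1 approaches 4 bytes/param (the 4x bound), ZeRO-2 approaches 2 bytes/param (8x), and ZeRO-3 approaches 16/N.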
ZeRO-Offload and ZeRO-Infinity
If ZeRO-3 still can’t fit the model with available GPUs, ZeRO-Offload and ZeRO-Infinity use CPU memory and NVMe storage as overflow destinations.
graph TD
subgraph "ZeRO-Offload / ZeRO-Infinity"
GPU["GPU VRAM<br/>Primary storage"]
CPU["CPU Memory<br/>Spill target"]
NVMe["NVMe SSD<br/>Further spill target"]
GPU -->|overflow| CPU
CPU -->|overflow| NVMe
end
subgraph "MegaTrain"
CPU2["CPU Memory<br/>Primary storage<br/>All parameters resident"]
GPU2["GPU VRAM<br/>Compute device<br/>Temporary buffers only"]
CPU2 -->|"Layer-wise streaming"| GPU2
GPU2 -->|"Gradient evacuation"| CPU2
end
ZeRO-Offload runs optimizer updates on the CPU while executing forward/backward passes on the GPU.
ZeRO-Infinity adds NVMe offloading to ZeRO-3, theoretically supporting unlimited model sizes.
But both retain the “GPU memory is primary storage” design.
Because the design spills whatever doesn’t fit on the GPU over to the CPU, transfers across the PCIe bus become random-access-like and fail to use the available bandwidth efficiently.
As a result, throughput drops sharply as model size grows.
The benchmark result in which MegaTrain achieves 12.2x the throughput of ZeRO-3 on a 14B model traces back to this structural problem.
FSDP
PyTorch’s official FSDP (Fully Sharded Data Parallelism) implements the same philosophy as ZeRO-3 natively in PyTorch.
It shards parameters, gradients, and optimizer state across GPUs.
Like ZeRO-3, it all-gathers the needed parameters at compute time, so GPU-to-GPU communication remains the structural bottleneck.
MegaTrain’s Design: CPU Memory as Primary Storage
MegaTrain inverts this primary/secondary relationship.
All parameters and optimizer state live in CPU memory; the GPU serves only as a “transient compute device.”
The core design consists of three elements:
- Pipelined double buffering (hiding CPU-GPU data transfers behind computation)
- Stateless Layer Templates (eliminating PyTorch’s computation graph to reduce memory overhead)
- Layer-wise streaming execution (limiting resident activations to a single layer)
graph TD
subgraph "CPU Memory (1.5-2TB)"
P["All Parameters<br/>BF16"]
O["Optimizer State<br/>FP32 m, v"]
M["Master Weights<br/>FP32"]
end
subgraph "GPU VRAM (buffers only)"
B0["Buffer 0"]
B1["Buffer 1"]
CS["ComputeStream"]
end
P -->|"WeightStream<br/>Layer i+1 transfer"| B1
B0 -->|"Compute"| CS
CS -->|"GradientStream<br/>Gradient evacuation"| O
O -->|"CPU-side<br/>parameter update"| P
Virtually no data persists in GPU VRAM.
With just two buffer slots, any model size can be trained.
In the scaling experiments discussed later, GPU memory allocation was fixed at just 3.83GB while scaling from 7.6B to 43B parameters.
Pipelined Double Buffering
Won’t CPU-to-GPU data transfer become a bottleneck?
MegaTrain’s answer is pipelined double buffering using three CUDA streams.
A CUDA stream is an asynchronous command queue executed on the GPU.
Operations submitted to different streams can run concurrently.
MegaTrain exploits this to run computation, transfer, and evacuation simultaneously.
| CUDA Stream | Role | Direction |
|---|---|---|
| ComputeStream | Forward/backward computation | Within GPU |
| WeightStream | Parameter transfer | CPU → GPU |
| GradientStream | Gradient evacuation | GPU → CPU |
Two buffer slots on the GPU alternate in ping-pong fashion:
graph LR
subgraph "Time t"
A["Buffer 0<br/>Layer i computing"] --> B["Buffer 1<br/>Layer i+1 transferring"]
end
subgraph "Time t+1"
C["Buffer 0<br/>Gradient evac + i+2 transfer"] --> D["Buffer 1<br/>Layer i+1 computing"]
end
B --> D
A --> C
While ComputeStream computes layer i on buffer 0, WeightStream transfers layer i+1’s parameters to buffer 1.
When computation finishes, GradientStream evacuates buffer 0’s gradients to the CPU, and the next computation begins on buffer 1.
This cycle minimizes the time the GPU spends idle waiting for data.
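The benefit of the overlap can be illustrated with a toy timing model (the per-layer numbers below are illustrative, not measurements from the paper):

```python
# Toy timing model for layer-wise streaming over L layers.
# Without double buffering, each layer waits for its transfer, then computes.
# With it, layer i+1's transfer overlaps layer i's compute, so each step
# costs max(compute, transfer) after the first transfer is paid once.
def step_time_ms(layers, compute_ms, transfer_ms, double_buffered):
    if not double_buffered:
        return layers * (transfer_ms + compute_ms)
    return transfer_ms + layers * max(compute_ms, transfer_ms)

# Illustrative: 80 layers, 100 ms compute per layer, 18.8 ms per-layer
# transfer (roughly one 600 MB layer over PCIe Gen4).
naive = step_time_ms(80, 100, 18.8, double_buffered=False)
piped = step_time_ms(80, 100, 18.8, double_buffered=True)
print(f"no overlap: {naive:.0f} ms, pipelined: {piped:.0f} ms")
```

As long as compute exceeds transfer, the pipelined time collapses to essentially pure compute; once transfer dominates, the pipeline is transfer-bound instead.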
Does Bandwidth Suffice? Transfer Performance by Hardware
This pipeline works on the condition that “parameter transfer time is completely hidden behind the previous layer’s compute time.”
A single Transformer layer’s compute time ranges from tens to hundreds of milliseconds. Here’s how actual hardware transfer times compare:
| Hardware | CPU-GPU Link | Bandwidth (unidirectional) | 1 Layer Params (7B, ~600MB) | Transfer Time |
|---|---|---|---|---|
| GH200 | NVLink-C2C | 450 GB/s | ~600 MB | ~1.3 ms |
| H200 | PCIe Gen5 x16 | 64 GB/s | ~600 MB | ~9.4 ms |
| A100 PCIe | PCIe Gen4 x16 | 32 GB/s | ~600 MB | ~18.8 ms |
| RTX 3090 | PCIe Gen4 x16 | 32 GB/s | ~600 MB | ~18.8 ms |
GH200’s NVLink-C2C at 450GB/s (unidirectional) is an order of magnitude faster; transfers are hidden almost instantly.
Even on PCIe Gen4 with A100 or RTX 3090, larger batch sizes push single-layer compute times to hundreds of milliseconds, making the 18.8ms transfer easily concealable.
However, with extremely small batch sizes or small models where computation is light, transfer time may exceed compute time.
MegaTrain’s ablation study shows that disabling double buffering reduces throughput by 31.3% (266.3 → 182.91 TFLOPS).
The narrower the bandwidth, the larger this gap would grow.
Why GH200’s NVLink-C2C Is Special
GH200 uses an integrated architecture that differs from standard GPU servers.
A Grace CPU (Arm) and Hopper GPU are packaged together, connected directly via NVLink-C2C (Chip-to-Chip).
Standard PCIe connections route through CPU → PCIe switch → GPU, adding latency at each hop.
NVLink-C2C minimizes the physical distance between CPU and GPU, delivering 900GB/s bidirectional bandwidth.
That’s roughly 7x PCIe Gen5’s 128GB/s bidirectional.
MegaTrain’s “CPU primary, GPU secondary” design is particularly well-suited to the GH200 architecture.
With CPU-GPU transfers never becoming a bottleneck, GPU compute capacity can be used to near-full utilization.
Stateless Layer Templates
In standard PyTorch autograd, a computation graph is built during the forward pass and metadata is retained until the backward pass.
The graph includes inter-tensor dependencies, input/output shapes for each operation, and pointers to backward functions.
As models grow, the memory consumed by this graph itself becomes non-negligible.
Difference from Standard PyTorch Models
In a standard PyTorch model, nn.Module holds weight tensors directly as nn.Parameter:
# Standard PyTorch (conceptual)
Layer1.weight → Tensor(GPU) # Permanently resident on GPU
Layer1.bias → Tensor(GPU)
Layer2.weight → Tensor(GPU)
...
Layer80.weight → Tensor(GPU) # All layers consuming GPU memory simultaneously
With 80 layers, all 80 layers’ parameters occupy GPU memory at once.
MegaTrain’s Stateless Layer Template holds the computation logic (CUDA kernels) for Attention and MLP blocks as templates, but holds no pointers to weights:
# MegaTrain Stateless Template (conceptual)
TemplateA.compute_logic → CUDA kernels # Computation logic only
TemplateA.weight_slot → None # Parameters unbound
TemplateB.compute_logic → CUDA kernels
TemplateB.weight_slot → None
# Dynamic binding at runtime
TemplateA.Bind(Buffer0) → Execute layer 1
TemplateB.Bind(Buffer1) → Execute layer 2 (already transferred in parallel)
TemplateA.Bind(Buffer0) → Execute layer 3 (buffer 0 reused)
When parameters arrive via streaming, the Bind primitive is called to dynamically map buffer views to the template’s input slots.
While template A executes layer 1, layer 2’s parameters are being bound to template B.
This design has two benefits:
- No persistent computation graph is needed, eliminating metadata overhead
- GPU-resident parameters are fixed at two buffer slots, making memory consumption independent of model size
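The Bind idea can be sketched in plain Python. This is a conceptual model only; `StatelessLinear` and the buffer names are hypothetical, not the paper’s API:

```python
# A stateless template holds only compute logic; weights are bound to a
# buffer view just before execution, and the buffer slot is then reused.
class StatelessLinear:
    """y = x @ W, with W supplied at call time rather than stored inside."""
    def __call__(self, x, weight_buffer):
        rows, cols = len(weight_buffer), len(weight_buffer[0])
        return [sum(x[i] * weight_buffer[i][j] for i in range(rows))
                for j in range(cols)]

template = StatelessLinear()        # one template, no weights inside
buffer0 = [[1.0, 0.0], [0.0, 1.0]]  # "layer 1" weights streamed into slot 0
buffer1 = [[2.0, 0.0], [0.0, 2.0]]  # "layer 2" weights streamed into slot 1

h = template([3.0, 4.0], buffer0)   # bind buffer 0, run layer 1
h = template(h, buffer1)            # bind buffer 1, run layer 2
print(h)  # → [6.0, 8.0]
```

However many layers the model has, only the two buffers and the template objects ever exist on the compute side.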
Training MoE Models: The GPT-OSS-120B Case
GPT-OSS-120B, used in the benchmarks, is a Mixture of Experts (MoE) model with 128 experts.
In MoE, a router network selects the optimal experts for each input token, activating only a subset (typically 2-8) rather than all experts.
Even with 120B parameters, only a small fraction of experts are used per token, so effective compute is less than a dense model.
The distinctive challenge of MoE training is that the number of experts inflates total parameter count.
The majority of 120B parameters are the weights of 128 experts; shared layers (attention, router, etc.) are comparatively small.
Fitting all 128 experts’ parameters simultaneously in GPU VRAM is virtually impossible on any single current GPU.
MegaTrain’s layer-wise streaming works the same way for each MoE expert.
After the router selects experts, only the selected experts’ parameters are transferred from CPU to GPU for computation, and gradients are returned to the CPU.
Unselected experts remain dormant in CPU memory, consuming no GPU VRAM.
The same idea on the inference side was implemented in Hypura’s NVMe expert streaming.
Hypura streams expert weights from NVMe SSDs to the GPU (Metal) on demand, achieving a 99.5% LRU cache hit rate.
MegaTrain streams from CPU memory on the training side.
Both avoid loading everything into GPU VRAM, streaming only the needed parameters on demand.
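The expert-streaming idea reduces to top-k selection plus on-demand transfer. A conceptual sketch (the names and scoring values are illustrative, not from either system):

```python
# Sketch of MoE expert streaming: the router scores all experts, but only
# the top-k selected experts' weights would be copied CPU -> GPU; the rest
# stay resident in CPU memory.
def select_experts(router_scores, k=2):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return sorted(ranked[:k])

scores = [0.1, 0.7, 0.05, 0.9, 0.02, 0.3, 0.15, 0.4]  # 8 experts
active = select_experts(scores, k=2)
print(f"transfer experts {active}; the other "
      f"{len(scores) - len(active)} stay in CPU memory")
```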
Benchmark Results
Experiments used the Qwen2.5 series (7B/14B/32B/72B) and GPT-OSS-120B (MoE, 128 experts).
The evaluation dataset was MetaMathQA (~395K English math problems).
Performance on GH200
| Model | TFLOPS | Notes |
|---|---|---|
| Qwen2.5-7B | 284 | |
| Qwen2.5-14B | 264 | 1.84x ZeRO-3 Offload |
| Qwen2.5-32B | 250+ | |
A scaling experiment was also conducted, increasing from 28 layers (7.6B parameters) to 180 layers (43B parameters) while keeping device memory allocation fixed at just 3.83GB.
At 56 layers, FSDP drops to 43 TFLOPS while MegaTrain sustains 264 TFLOPS — a 6.14x gap.
FSDP’s all-gather communication overhead accumulates as layer count grows, degrading performance, whereas MegaTrain’s streaming pipeline maintains constant throughput regardless of layer count.
Performance on A100 PCIe (40GB)
The A100 has narrower bandwidth via PCIe Gen4 compared to GH200 or H200, but MegaTrain’s advantage was still overwhelming:
| Model | MegaTrain | Gemini | ZeRO-3 |
|---|---|---|---|
| 7B | 128 TFLOPS | 52.8 (2.42x) | 36.0 (3.56x) |
| 14B | 122 TFLOPS | 15.0 (8.13x) | 10.0 (12.20x) |
| 32B | 114 TFLOPS | OOM | OOM |
12.2x the throughput of DeepSpeed ZeRO-3 on the 14B model. At 32B, baselines ran out of memory while MegaTrain continued training at 114 TFLOPS.
Gemini takes the approach of “overlapping CPU-side optimizer updates with GPU-CPU transfers,” similar in lineage to ZeRO-Offload.
The large gap with MegaTrain comes from Gemini only overlapping optimizer updates, while MegaTrain pipelines the forward/backward computation itself.
Consumer GPU Experiments
| GPU | Model | Batch Size | TFLOPS |
|---|---|---|---|
| RTX A6000 (48GB) | 14B | 9 | 56.82 |
| RTX 3090 (24GB) | 14B | 3 | 30.19 |
| RTX A6000 | 7B | 12 | 55.73 |
| RTX 3090 | 7B | 5 | 35.09 |
Full-precision training of a 14B model runs on an RTX 3090 with 24GB VRAM.
Previously, this size would have required compromising with LoRA or QLoRA.
LoRA is a parameter-efficient method that updates only low-rank adapter matrices; QLoRA adds 4-bit quantization of the base model for further memory savings.
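The "0.1-1% of total" figure in the table is easy to verify: a rank-r adapter on a d_in × d_out weight matrix adds only r·(d_in + d_out) trainable parameters. A quick check with illustrative dimensions:

```python
# Trainable-parameter ratio of a LoRA adapter on one weight matrix:
# a full update touches d_in * d_out values; LoRA trains only the two
# low-rank factors, r * (d_in + d_out) values.
def lora_ratio(d_in: int, d_out: int, rank: int) -> float:
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return lora / full

# Illustrative: a 4096x4096 projection with rank-8 adapters.
r = lora_ratio(4096, 4096, rank=8)
print(f"LoRA trains {r:.2%} of this matrix's parameters")  # → 0.39%
```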
Accuracy Verification
If you claim full-precision training, you need to show that precision isn’t degraded.
Results on MetaMathQA:
| Model | ZeRO-3 Offload | ZeRO-Infinity | MegaTrain |
|---|---|---|---|
| 7B | 88.93% | 88.97% | 88.99% |
| 14B | 92.41% | 92.36% | 92.52% |
MegaTrain scores slightly higher.
This confirms that parameter streaming and buffering have no effect on numerical precision.
The result is expected — MegaTrain performs no quantization or approximation; it simply copies the same BF16/FP32 data between CPU and GPU.
Ultra-Long Context Training (GH200)
An experiment extending context length from 1K to 512K on Qwen2.5-7B:
| Context | Batch Size | Step Time | TFLOPS | Device Memory |
|---|---|---|---|---|
| 1K | 158 | 27.05s | 284.7 | 74.2 GB |
| 8K | 20 | 27.3s | 283.2 | 74.5 GB |
| 64K | 2 | 55.3s | 331.3 | 77.1 GB |
| 512K | 1 | 871.4s | 407.4 | 81.9 GB |
Attention computation scales quadratically with context length, but layer-wise execution limits resident activations to a single layer.
Device memory staying at 81.9GB for 512K is a result of leveraging GH200’s unified memory architecture.
TFLOPS increasing from 284.7 at 1K to 407.4 at 512K may seem counterintuitive.
This happens because longer contexts increase the proportion of attention computation (Q×K^T matrix multiplication), raising Tensor Core utilization.
At shorter contexts, communication and buffer operations account for a relatively larger share.
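The shift toward attention-dominated compute can be seen with rough per-layer FLOP counts (standard Transformer estimates; the hidden size is an illustrative assumption for a 7B-class model):

```python
# Rough forward-pass FLOPs per Transformer layer:
#   dense matmuls (QKV/out projections + MLP): ~24 * s * d^2
#   attention score/value matmuls (Q K^T, A V): ~4 * s^2 * d
# where s = sequence length, d = hidden size.
def attention_flop_fraction(seq_len: int, hidden: int) -> float:
    attn = 4 * seq_len**2 * hidden
    dense = 24 * seq_len * hidden**2
    return attn / (attn + dense)

d = 3584  # illustrative hidden size, roughly 7B-class
for s in (1024, 8192, 65536, 524288):
    frac = attention_flop_fraction(s, d)
    print(f"context {s:>7}: attention is {frac:.1%} of layer FLOPs")
```

The fraction simplifies to s / (s + 6d), so once the context is much longer than the hidden size, attention matmuls dominate and Tensor Core utilization rises accordingly.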
Full-Precision Training vs Parameter-Efficient Methods
MegaTrain makes “full-precision training” a realistic option, but that doesn’t make LoRA or QLoRA obsolete.
The two serve different purposes.
| Aspect | Full-Precision (MegaTrain) | LoRA / QLoRA |
|---|---|---|
| Update scope | All parameters | Low-rank adapters only (0.1-1% of total) |
| Precision | Highest (no quantization) | Slight degradation from base model quantization |
| Throughput | Must transfer all parameters | Fewer update parameters, faster |
| Hardware requirements | Large CPU memory (1-2TB) | Runs with limited GPU VRAM |
| Best suited for | Pretraining, domain adaptation | Task-specific fine-tuning |
As covered in building a LoRA training environment, LoRA runs on a Mac mini M4 Pro’s 24GB unified memory with ease.
But LoRA only updates a subset of parameters, making it unsuitable for training that fundamentally reshapes the base model’s knowledge structure — domain adaptation or continued pretraining.
MegaTrain fills this gap.
It meets the demand for “full-precision training with all parameters, but without a GPU cluster” using a consumer GPU plus ample CPU memory.
The Reality of “Single GPU” and Future Directions
Hacker News commenters noted that “calling H200 + 1.5TB host memory a single GPU is a stretch.”
Fair point: a GH200 with high-capacity memory is not a common setup, and 341 tokens/second on a 14B model is slow compared to inference-optimized systems.
But the value of this research isn’t in speed competition.
First, it exposed the structural inefficiency of existing frameworks.
ZeRO-3 Offload managing only 10 TFLOPS on a 14B model while MegaTrain achieves 122 TFLOPS on the same A100 means the 12x performance gap was a “software design problem,” not a “hardware limitation.”
It wouldn’t be surprising to see DeepSpeed or FSDP incorporate this design going forward.
Second, the shift in memory hierarchy design philosophy itself.
Hypura’s NVMe streaming incorporated SSDs into the memory hierarchy on the inference side.
Async RL training separated generation from training, pushing GPU utilization above 95%.
MegaTrain makes CPU memory primary on the training side.
The paper mentions two extension directions.
One is multi-GPU deployment, combining tensor parallelism and expert parallelism.
The other is tiered storage with SSDs added to the memory hierarchy, targeting trillion-parameter training.
If 120B works on a single GPU, reaching 1T on a 4-8 GPU setup is entirely plausible.