
MegaTrain Trains a 120B-Parameter LLM on a Single GPU at Full Precision

Ikesan

A paper that directly overturns the assumption that LLM training requires GPU clusters has been posted to arXiv.
A research team from the University of Notre Dame and Lehigh University published MegaTrain, a system that trains a 120B-parameter LLM on a single GPU at full precision (BF16/FP32 mixed, no quantization).
The code is also available on GitHub.

This blog has previously covered building a LoRA training environment and async RL training architecture patterns, both of which took GPU VRAM constraints as a given.
MegaTrain removes that assumption entirely.
Until now, a model that didn’t fit in VRAM meant quantizing or falling back to LoRA; MegaTrain adds full-precision training as a third option.

Breaking Down LLM Training Memory

Why does GPU VRAM become the bottleneck?
When you decompose training memory consumption into its components, optimizer state turns out to be far larger than the parameters themselves.

For BF16 mixed-precision training with the Adam optimizer, per-parameter memory consumption breaks down as follows:

| Component | Data Type | Bytes/Parameter | Description |
|---|---|---|---|
| Parameters (weights) | BF16 | 2 | Model weights |
| Gradients | BF16 | 2 | Computed during the backward pass |
| Adam first moment (m) | FP32 | 4 | Exponential moving average of gradients |
| Adam second moment (v) | FP32 | 4 | Exponential moving average of squared gradients |
| Master weights | FP32 | 4 | FP32 copy needed for updates |
| **Total** | | **16** | |

Adam maintains a moving average of gradients (first moment) and a moving average of squared gradients (second moment) for each parameter, adaptively adjusting the learning rate.
These two FP32 tensors alone consume 8 bytes per parameter.
Add the need to write BF16-computed weights back to FP32 master weights, and the total reaches 16 bytes per parameter.
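The 16-byte figure and the per-model totals below can be sanity-checked with simple arithmetic — a minimal sketch (model sizes from the paper's benchmarks; decimal GB):

```python
# Per-parameter memory for BF16 mixed-precision training with Adam.
BYTES = {
    "weights_bf16": 2,
    "grads_bf16": 2,
    "adam_m_fp32": 4,          # first moment
    "adam_v_fp32": 4,          # second moment
    "master_weights_fp32": 4,  # FP32 copy for updates
}
bytes_per_param = sum(BYTES.values())
print(bytes_per_param)  # 16

# Example: Qwen2.5-7B (7.6B parameters), excluding activations.
params = 7.6e9
print(f"{params * bytes_per_param / 1e9:.0f} GB")  # ~122 GB
```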

Calculated by model size (excluding activations):

| Model | Parameters | Required Memory (16 B/param) | H200 VRAM (141GB) | A100 VRAM (40/80GB) |
|---|---|---|---|---|
| Qwen2.5-7B | 7.6B | ~122 GB | barely fits | OOM |
| Qwen2.5-14B | 14.7B | ~235 GB | OOM | OOM |
| Qwen2.5-32B | 32.5B | ~520 GB | OOM | OOM |
| Llama-70B | 70B | ~1,120 GB | OOM | OOM |
| GPT-OSS-120B | 120B | ~1,920 GB | OOM | OOM |

Even at 7B, adding activations pushes past the H200’s 141GB.
Activations are intermediate results from the forward pass, needed for gradient computation in the backward pass.
At 14B and above, no single GPU’s VRAM can hold the training state.

Meanwhile, CPU memory on a single server can reach 1.5TB to 2TB.
Even 1,920GB for 120B fits in a 2TB server.

Then there’s the activation problem.
Activations grow proportionally with batch size and sequence length, reaching tens to hundreds of gigabytes.
Activation checkpointing (recomputing intermediate results on demand instead of storing them) trades compute time for memory savings.
MegaTrain’s layer-wise execution naturally limits resident activations to a single layer.
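Activation checkpointing is available in stock PyTorch; a minimal sketch (toy block, illustrative shapes) showing that only the block’s input is stored and intermediates are recomputed during backward:

```python
import torch
from torch.utils.checkpoint import checkpoint

# A toy block: with checkpointing, its intermediate activations are
# discarded after the forward pass and recomputed during backward.
block = torch.nn.Sequential(
    torch.nn.Linear(64, 64),
    torch.nn.GELU(),
    torch.nn.Linear(64, 64),
)

x = torch.randn(8, 64, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)  # stores only x, not intermediates
y.sum().backward()                             # recompute happens here
print(x.grad.shape)  # gradients flow as usual
```

The trade is one extra forward pass per checkpointed block in exchange for not holding its intermediates in memory.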

Limitations of Existing Approaches: ZeRO to FSDP

Multiple approaches have been proposed for this memory problem.
The most prominent are Microsoft’s DeepSpeed ZeRO series and PyTorch’s FSDP (Fully Sharded Data Parallelism).

ZeRO’s Three Stages

ZeRO (Zero Redundancy Optimizer) is built on the design philosophy of “eliminating redundant copies of the same data across multiple GPUs,” divided into three stages:

| Stage | Partitioned | Redundancy Eliminated | Memory Reduction |
|---|---|---|---|
| ZeRO-1 | Optimizer state | Adam’s m, v | Up to 4x |
| ZeRO-2 | + Gradients | Gradient tensors | Up to 8x |
| ZeRO-3 | + Parameters | Weights partitioned too | Up to Nx (GPU count) |

ZeRO-1 partitions only Adam’s two moments across GPUs.
Each GPU holds all parameter weights and gradients but only its share of optimizer state.
ZeRO-2 adds gradient partitioning, and ZeRO-3 partitions the parameters themselves.

ZeRO-3 theoretically requires only 1/N memory with N GPUs, but every computation needs an all-gather (a collective communication operation that assembles distributed data across all GPUs) to collect parameters from other GPUs, increasing communication costs.
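The table’s reduction factors follow from standard ZeRO memory accounting; a small sketch of per-GPU bytes per parameter (16 B total, as broken down earlier):

```python
# Per-GPU bytes/param under each ZeRO stage (Adam + BF16 mixed precision).
def zero_bytes(stage, n_gpus):
    w, g, opt = 2, 2, 12  # weights, grads, optimizer state (m + v + master)
    if stage == 1:
        return w + g + opt / n_gpus        # only optimizer state sharded
    if stage == 2:
        return w + (g + opt) / n_gpus      # + gradients sharded
    if stage == 3:
        return (w + g + opt) / n_gpus      # everything sharded

for s in (1, 2, 3):
    print(f"ZeRO-{s}: {zero_bytes(s, 64):.2f} B/param on 64 GPUs")
```

As GPU count grows, ZeRO-1 approaches 4 B/param (a 4x reduction from 16), ZeRO-2 approaches 2 B/param (8x), and ZeRO-3 scales as 16/N.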

ZeRO-Offload and ZeRO-Infinity

If ZeRO-3 still can’t fit the model with available GPUs, ZeRO-Offload and ZeRO-Infinity use CPU memory and NVMe storage as overflow destinations.

```mermaid
graph TD
    subgraph "ZeRO-Offload / ZeRO-Infinity"
        GPU["GPU VRAM<br/>Primary storage"]
        CPU["CPU Memory<br/>Spill target"]
        NVMe["NVMe SSD<br/>Further spill target"]
        GPU -->|overflow| CPU
        CPU -->|overflow| NVMe
    end
    subgraph "MegaTrain"
        CPU2["CPU Memory<br/>Primary storage<br/>All parameters resident"]
        GPU2["GPU VRAM<br/>Compute device<br/>Temporary buffers only"]
        CPU2 -->|"Layer-wise streaming"| GPU2
        GPU2 -->|"Gradient evacuation"| CPU2
    end
```

ZeRO-Offload runs optimizer updates on the CPU while executing forward/backward passes on the GPU.
ZeRO-Infinity adds NVMe offloading to ZeRO-3, theoretically supporting unlimited model sizes.

But both retain the “GPU memory is primary storage” design.
Because data that doesn’t fit on the GPU is spilled to the CPU on demand, transfers over the PCIe bus become random-access-like and fail to use the available bandwidth efficiently.
As a result, throughput drops sharply as model size grows.

The benchmark result discussed later — MegaTrain achieving 12.2x the throughput of ZeRO-3 on a 14B model — stems from this structural problem.

FSDP

PyTorch’s official FSDP (Fully Sharded Data Parallelism) implements the same philosophy as ZeRO-3 natively in PyTorch.
It shards parameters, gradients, and optimizer state across GPUs.
Like ZeRO-3, it all-gathers the needed parameters at compute time, so GPU-to-GPU communication remains the structural bottleneck.

MegaTrain’s Design: CPU Memory as Primary Storage

MegaTrain inverts this primary/secondary relationship.
All parameters and optimizer state live in CPU memory; the GPU serves only as a “transient compute device.”

The core design consists of three elements:

  1. Pipelined double buffering (hiding CPU-GPU data transfers behind computation)
  2. Stateless Layer Templates (eliminating PyTorch’s computation graph to reduce memory overhead)
  3. Layer-wise streaming execution (limiting resident activations to a single layer)
```mermaid
graph TD
    subgraph "CPU Memory (1.5-2TB)"
        P["All Parameters<br/>BF16"]
        O["Optimizer State<br/>FP32 m, v"]
        M["Master Weights<br/>FP32"]
    end
    subgraph "GPU VRAM (buffers only)"
        B0["Buffer 0"]
        B1["Buffer 1"]
        CS["ComputeStream"]
    end
    P -->|"WeightStream<br/>Layer i+1 transfer"| B1
    B0 -->|"Compute"| CS
    CS -->|"GradientStream<br/>Gradient evacuation"| O
    O -->|"CPU-side<br/>parameter update"| P
```

Virtually no data persists in GPU VRAM.
With just two buffer slots, any model size can be trained.
In the scaling experiments discussed later, GPU memory allocation was fixed at just 3.83GB while scaling from 7.6B to 43B parameters.

Pipelined Double Buffering

Won’t CPU-to-GPU data transfer become a bottleneck?
MegaTrain’s answer is pipelined double buffering using three CUDA streams.

A CUDA stream is an asynchronous command queue executed on the GPU.
Operations submitted to different streams can run concurrently.
MegaTrain exploits this to run computation, transfer, and evacuation simultaneously.

| CUDA Stream | Role | Direction |
|---|---|---|
| ComputeStream | Forward/backward computation | Within GPU |
| WeightStream | Parameter transfer | CPU → GPU |
| GradientStream | Gradient evacuation | GPU → CPU |

Two buffer slots on the GPU alternate in ping-pong fashion:

```mermaid
graph LR
    subgraph "Time t"
        A["Buffer 0<br/>Layer i computing"] --> B["Buffer 1<br/>Layer i+1 transferring"]
    end
    subgraph "Time t+1"
        C["Buffer 0<br/>Gradient evac + i+2 transfer"] --> D["Buffer 1<br/>Layer i+1 computing"]
    end
    B --> D
    A --> C
```

While ComputeStream computes layer i on buffer 0, WeightStream transfers layer i+1’s parameters to buffer 1.
When computation finishes, GradientStream evacuates buffer 0’s gradients to the CPU, and the next computation begins on buffer 1.
This cycle minimizes the time the GPU spends idle waiting for data.
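The ping-pong alternation can be sketched as a plain-Python schedule (no GPU required); the step structure and buffer indices are illustrative, following the description above rather than MegaTrain’s actual code:

```python
# Sketch of the double-buffer schedule: at each step, one buffer computes
# layer i (ComputeStream) while the other receives layer i+1 (WeightStream),
# and layer i's gradients are evacuated to the CPU (GradientStream).
def schedule(num_layers):
    events = []
    for i in range(num_layers):
        compute_buf = i % 2          # buffer holding layer i's weights
        prefetch_buf = (i + 1) % 2   # buffer receiving layer i+1
        step = {"compute": (compute_buf, i)}
        if i + 1 < num_layers:
            step["prefetch"] = (prefetch_buf, i + 1)
        step["evacuate"] = (compute_buf, i)  # gradients of layer i -> CPU
        events.append(step)
    return events

for step in schedule(4):
    print(step)
```

The buffers strictly alternate: layer i computes on buffer i mod 2, so a buffer is refilled only after its layer’s gradients have been evacuated.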

Does Bandwidth Suffice? Transfer Performance by Hardware

This pipeline works on the condition that “parameter transfer time is completely hidden behind the previous layer’s compute time.”
A single Transformer layer’s compute time ranges from tens to hundreds of milliseconds. Here’s how actual hardware transfer times compare:

| Hardware | CPU-GPU Link | Bandwidth (unidirectional) | 1-Layer Params (7B) | Transfer Time |
|---|---|---|---|---|
| GH200 | NVLink-C2C | 450 GB/s | ~600 MB | ~1.3 ms |
| H200 | PCIe Gen5 x16 | 64 GB/s | ~600 MB | ~9.4 ms |
| A100 PCIe | PCIe Gen4 x16 | 32 GB/s | ~600 MB | ~18.8 ms |
| RTX 3090 | PCIe Gen4 x16 | 32 GB/s | ~600 MB | ~18.8 ms |

GH200’s NVLink-C2C at 450GB/s (unidirectional) is an order of magnitude faster; transfers are hidden almost instantly.
Even on PCIe Gen4 with A100 or RTX 3090, larger batch sizes push single-layer compute times to hundreds of milliseconds, making the 18.8ms transfer easily concealable.

However, with extremely small batch sizes or small models where computation is light, transfer time may exceed compute time.
MegaTrain’s ablation study shows that disabling double buffering reduces throughput by 31.3% (266.3 → 182.91 TFLOPS).
The narrower the bandwidth, the larger this gap would grow.
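The transfer times above are just bytes over bandwidth; a quick check (1 GB/s moves 1 MB per millisecond):

```python
# Transfer time for one layer's parameters over a CPU-GPU link.
def transfer_ms(megabytes, gb_per_s):
    return megabytes / gb_per_s  # 1 GB/s = 1 MB/ms

layer_mb = 600  # ~600 MB per layer for a 7B model
for link, bw in [("NVLink-C2C (GH200)", 450),
                 ("PCIe Gen5 x16", 64),
                 ("PCIe Gen4 x16", 32)]:
    print(f"{link}: {transfer_ms(layer_mb, bw):.1f} ms")
```

Against single-layer compute times of tens to hundreds of milliseconds, even the ~18.8 ms PCIe Gen4 transfer fits comfortably inside the compute window.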

GH200 uses an integrated architecture that differs from standard GPU servers.
A Grace CPU (Arm) and Hopper GPU are packaged together, connected directly via NVLink-C2C (Chip-to-Chip).

Standard PCIe connections route through CPU → PCIe switch → GPU, adding latency at each hop.
NVLink-C2C minimizes the physical distance between CPU and GPU, delivering 900GB/s bidirectional bandwidth.
That’s roughly 7x PCIe Gen5’s 128GB/s bidirectional.

MegaTrain’s “CPU primary, GPU secondary” design is particularly well-suited to the GH200 architecture.
With CPU-GPU transfers never becoming a bottleneck, GPU compute capacity can be used to near-full utilization.

Stateless Layer Templates

In standard PyTorch autograd, a computation graph is built during the forward pass and metadata is retained until the backward pass.
The graph includes inter-tensor dependencies, input/output shapes for each operation, and pointers to backward functions.
As models grow, the memory consumed by this graph itself becomes non-negligible.

Difference from Standard PyTorch Models

In a standard PyTorch model, nn.Module holds weight tensors directly as nn.Parameter:

```
# Standard PyTorch (conceptual)
Layer1.weight → Tensor(GPU)   # Permanently resident on GPU
Layer1.bias   → Tensor(GPU)
Layer2.weight → Tensor(GPU)
...
Layer80.weight → Tensor(GPU)  # All layers consuming GPU memory simultaneously
```

With 80 layers, all 80 layers’ parameters occupy GPU memory at once.

MegaTrain’s Stateless Layer Template holds the computation logic (CUDA kernels) for Attention and MLP blocks as templates, but holds no pointers to weights:

```
# MegaTrain Stateless Template (conceptual)
TemplateA.compute_logic → CUDA kernels  # Computation logic only
TemplateA.weight_slot   → None          # Parameters unbound

TemplateB.compute_logic → CUDA kernels
TemplateB.weight_slot   → None

# Dynamic binding at runtime
TemplateA.Bind(Buffer0) → Execute layer 1
TemplateB.Bind(Buffer1) → Execute layer 2 (already transferred in parallel)
TemplateA.Bind(Buffer0) → Execute layer 3 (buffer 0 reused)
```

When parameters arrive via streaming, the Bind primitive is called to dynamically map buffer views to the template’s input slots.
While template A executes layer 1, layer 2’s parameters are being bound to template B.

This design has two benefits:

  • No persistent computation graph is needed, eliminating metadata overhead
  • GPU-resident parameters are fixed at two buffer slots, making memory consumption independent of model size
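The bind-then-execute pattern can be illustrated with a hypothetical template class; `LinearTemplate` and `bind` are illustrative names, not MegaTrain’s API, and the tensors stand in for CPU-resident layer weights:

```python
import torch

# A stateless template: compute logic with no owned weights; parameters
# are bound to a buffer view immediately before each call.
class LinearTemplate:
    def __init__(self):
        self.weight = None  # no parameters owned

    def bind(self, buffer_view):
        self.weight = buffer_view  # map streamed weights into the slot
        return self

    def forward(self, x):
        return x @ self.weight.T

template = LinearTemplate()
cpu_weights = [torch.randn(64, 64) for _ in range(3)]  # "CPU-resident" layers
x = torch.randn(8, 64)
for w in cpu_weights:                # stream layer by layer
    x = template.bind(w).forward(x)  # one template reused for every layer
print(x.shape)
```

One template object serves every layer, so per-layer module state (and the graph metadata that comes with it) never accumulates.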

Training MoE Models: The GPT-OSS-120B Case

GPT-OSS-120B, used in the benchmarks, is a Mixture of Experts (MoE) model with 128 experts.
In MoE, a router network selects the optimal experts for each input token, activating only a subset (typically 2-8) rather than all experts.
Even with 120B total parameters, only a small fraction of experts is active per token, so the effective compute is far lower than that of a dense model of the same size.

The distinctive challenge of MoE training is that the number of experts inflates total parameter count.
The majority of 120B parameters are the weights of 128 experts; shared layers (attention, router, etc.) are comparatively small.
Fitting all 128 experts’ parameters simultaneously in GPU VRAM is virtually impossible on any single current GPU.

MegaTrain’s layer-wise streaming works the same way for each MoE expert.
After the router selects experts, only the selected experts’ parameters are transferred from CPU to GPU for computation, and gradients are returned to the CPU.
Unselected experts remain dormant in CPU memory, consuming no GPU VRAM.
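The routing step that determines which expert weights need to be streamed can be sketched with top-k gating; sizes and the router here are toy values for illustration, not GPT-OSS-120B’s configuration:

```python
import torch

# Top-k routing: only the experts selected for this batch would need
# their weights streamed from CPU; the rest stay dormant in CPU memory.
num_experts, k, d = 8, 2, 16
router = torch.nn.Linear(d, num_experts)   # toy router network
tokens = torch.randn(4, d)                 # 4 input tokens

scores = router(tokens)
topk = scores.topk(k, dim=-1).indices      # k experts chosen per token
needed = torch.unique(topk)                # union across the batch
print(f"{needed.numel()} of {num_experts} experts streamed to GPU")
```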

The same idea on the inference side was implemented in Hypura’s NVMe expert streaming.
Hypura streams expert weights from NVMe SSDs to the GPU (Metal) on demand, achieving a 99.5% LRU cache hit rate.
MegaTrain streams from CPU memory on the training side.
Both avoid loading everything into GPU VRAM, streaming only the needed parameters on demand.

Benchmark Results

Experiments used the Qwen2.5 series (7B/14B/32B/72B) and GPT-OSS-120B (MoE, 128 experts).
The evaluation dataset was MetaMathQA (~395K English math problems).

Performance on GH200

| Model | TFLOPS | Notes |
|---|---|---|
| Qwen2.5-7B | 284 | |
| Qwen2.5-14B | 264 | 1.84x ZeRO-3 Offload |
| Qwen2.5-32B | 250+ | |

A scaling experiment was also conducted, increasing from 28 layers (7.6B parameters) to 180 layers (43B parameters) while keeping device memory allocation fixed at just 3.83GB.

At 56 layers, FSDP drops to 43 TFLOPS while MegaTrain sustains 264 TFLOPS — a 6.14x gap.
FSDP’s all-gather communication overhead accumulates as layer count grows, degrading performance, whereas MegaTrain’s streaming pipeline maintains constant throughput regardless of layer count.

Performance on A100 PCIe (40GB)

The A100 has narrower bandwidth via PCIe Gen4 compared to GH200 or H200, but MegaTrain’s advantage was still overwhelming:

| Model | MegaTrain | Gemini | ZeRO-3 |
|---|---|---|---|
| 7B | 128 TFLOPS | 52.8 (2.42x) | 36.0 (3.56x) |
| 14B | 122 TFLOPS | 15.0 (8.13x) | 10.0 (12.20x) |
| 32B | 114 TFLOPS | OOM | OOM |

12.2x the throughput of DeepSpeed ZeRO-3 on the 14B model. At 32B, baselines ran out of memory while MegaTrain continued training at 114 TFLOPS.

Gemini takes the approach of “overlapping CPU-side optimizer updates with GPU-CPU transfers,” similar in lineage to ZeRO-Offload.
The large gap with MegaTrain comes from Gemini only overlapping optimizer updates, while MegaTrain pipelines the forward/backward computation itself.

Consumer GPU Experiments

| GPU | Model | Batch Size | TFLOPS |
|---|---|---|---|
| RTX A6000 (48GB) | 14B | 9 | 56.82 |
| RTX 3090 (24GB) | 14B | 3 | 30.19 |
| RTX A6000 | 7B | 12 | 55.73 |
| RTX 3090 | 7B | 5 | 35.09 |

Full-precision training of a 14B model runs on an RTX 3090 with 24GB VRAM.
Previously, this size would have required compromising with LoRA or QLoRA.
LoRA is a parameter-efficient method that updates only low-rank adapter matrices; QLoRA adds 4-bit quantization of the base model for further memory savings.
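The “low-rank” in LoRA translates directly into parameter counts; a rough arithmetic sketch with illustrative sizes (rank and hidden dimension are example values, not tied to any specific model):

```python
# LoRA replaces updates to a d×d weight with two rank-r factors (d×r, r×d),
# adding 2*d*r trainable parameters instead of d*d.
d, r = 4096, 16
full_layer = d * d           # params updated by full fine-tuning
lora_adapter = 2 * d * r     # params updated by LoRA
print(f"{lora_adapter / full_layer:.1%} of the layer's parameters")
```

At this rank the adapter touches well under 1% of the layer, which is why LoRA fits in small VRAM budgets but cannot reshape the full weight matrix the way full-precision training can.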

Accuracy Verification

If you claim full-precision training, you need to show that precision isn’t degraded.
Results on MetaMathQA:

| Model | ZeRO-3 Offload | ZeRO-Infinity | MegaTrain |
|---|---|---|---|
| 7B | 88.93% | 88.97% | 88.99% |
| 14B | 92.41% | 92.36% | 92.52% |

MegaTrain scores slightly higher.
This confirms that parameter streaming and buffering have no effect on numerical precision.
The result is expected — MegaTrain performs no quantization or approximation; it simply copies the same BF16/FP32 data between CPU and GPU.

Ultra-Long Context Training (GH200)

An experiment extending context length from 1K to 512K on Qwen2.5-7B:

| Context | Batch Size | Step Time | TFLOPS | Device Memory |
|---|---|---|---|---|
| 1K | 158 | 27.05s | 284.7 | 74.2 GB |
| 8K | 20 | 27.3s | 283.2 | 74.5 GB |
| 64K | 2 | 55.3s | 331.3 | 77.1 GB |
| 512K | 1 | 871.4s | 407.4 | 81.9 GB |

Attention computation scales quadratically with context length, but layer-wise execution limits resident activations to a single layer.
Device memory staying at 81.9GB for 512K is a result of leveraging GH200’s unified memory architecture.

TFLOPS increasing from 284.7 at 1K to 407.4 at 512K may seem counterintuitive.
This happens because longer contexts increase the proportion of attention computation (Q×K^T matrix multiplication), raising Tensor Core utilization.
At shorter contexts, communication and buffer operations account for a relatively larger share.

Full-Precision Training vs Parameter-Efficient Methods

MegaTrain makes “full-precision training” a realistic option, but that doesn’t make LoRA or QLoRA obsolete.
The two serve different purposes.

| Aspect | Full-Precision (MegaTrain) | LoRA / QLoRA |
|---|---|---|
| Update scope | All parameters | Low-rank adapters only (0.1-1% of total) |
| Precision | Highest (no quantization) | Slight degradation from base model quantization |
| Throughput | Must transfer all parameters | Fewer update parameters, faster |
| Hardware requirements | Large CPU memory (1-2TB) | Runs with limited GPU VRAM |
| Best suited for | Pretraining, domain adaptation | Task-specific fine-tuning |

As covered in building a LoRA training environment, LoRA runs on a Mac mini M4 Pro’s 24GB unified memory with ease.
But LoRA only updates a subset of parameters, making it unsuitable for training that fundamentally reshapes the base model’s knowledge structure — domain adaptation or continued pretraining.

MegaTrain fills this gap.
It meets the demand for “full-precision training with all parameters, but without a GPU cluster” using a consumer GPU plus ample CPU memory.

The Reality of “Single GPU” and Future Directions

Hacker News commenters noted that “calling H200 + 1.5TB host memory a single GPU is a stretch.”
Fair point — a GH200 with high-capacity memory is not a common setup, and 341 tokens/second for a 14B model is slow compared to inference-specialized optimization.

But the value of this research isn’t in speed competition.

First, it exposed the structural inefficiency of existing frameworks.
ZeRO-3 Offload managing only 10 TFLOPS on a 14B model while MegaTrain achieves 122 TFLOPS on the same A100 means the 12x performance gap was a “software design problem,” not a “hardware limitation.”
It wouldn’t be surprising to see DeepSpeed or FSDP incorporate this design going forward.

Second, the shift in memory hierarchy design philosophy itself.
Hypura’s NVMe streaming incorporated SSDs into the memory hierarchy on the inference side.
Async RL training separated generation from training, pushing GPU utilization above 95%.
MegaTrain makes CPU memory primary on the training side.

The paper mentions two extension directions.
One is multi-GPU deployment, combining tensor parallelism and expert parallelism.
The other is tiered storage with SSDs added to the memory hierarchy, targeting trillion-parameter training.
If 120B works on a single GPU, reaching 1T on a 4-8 GPU setup is entirely plausible.