MegaTrain flips the GPU-centric paradigm by treating CPU memory as primary storage and the GPU as a transient compute device, enabling full-precision training of 100B+ parameter LLMs on a single GPU at up to 12.2x the throughput of DeepSpeed ZeRO-3.
How to configure the split between VRAM and main memory on the GMKtec EVO-X2 (Strix Halo) for local LLM inference. A 29.6 GB model ran fine with only 8 GB of dedicated VRAM.
An explanation of why Qwen-Image-Edit's VAE is so heavy, how HunyuanImage 2.1 chose a 32x high-compression VAE instead, and how Kohya's memory-optimization work fits in.