#GPU

8 articles

Tech Apr 16, 2026 14 min

How Far Has AMD ROCm Come in Catching Up to CUDA?

Based on EE Times' interview with AMD AI Software VP Anush Elangovan, we assess the ROCm vs CUDA ecosystem gap. Includes hands-on experience with ROCm breaking four times on Strix Halo, plus practical guidance on choosing between NVIDIA, AMD, and Apple Silicon.

AMD NVIDIA ROCm CUDA GPU AI Infrastructure PyTorch MLX Apple Silicon

Tech Apr 9, 2026 15 min

MegaTrain Trains a 120B-Parameter LLM on a Single GPU at Full Precision

MegaTrain flips the GPU-centric paradigm by treating CPU memory as primary storage and the GPU as a transient compute device, enabling full-precision training of 100B+ LLMs on a single GPU at up to 12.2x the throughput of DeepSpeed ZeRO-3.

LLM Machine Learning GPU DeepSpeed Memory Optimization

Tech Apr 3, 2026 8 min

Running Lemonade on Strix Halo (EVO-X2): Vulkan Shared Memory Leaks and ROCm Stability

Real-world testing of AMD Lemonade v10.0.1 on Ryzen AI Max+ 395. Covers running LLM, image generation, speech recognition, and TTS simultaneously, NPU Hybrid execution, Vulkan vs ROCm benchmarks, and the discovery of shared memory leaks.

AMD Local LLM Vulkan ROCm NPU llama.cpp GPU Inference Optimization Benchmark Experiment

Tech Apr 3, 2026 8 min

AMD's Lemonade Local AI Server Bundles GPU, NPU, and Multi-Modal Inference Under One Roof

Lemonade is AMD's open-source local AI server that manages multiple backends like llama.cpp and FastFlowLM across GPU/NPU/CPU, serving text, image, and audio generation through an OpenAI-compatible API.

AMD Local LLM NPU GPU llama.cpp Inference Optimization ROCm Vulkan

Tech Mar 28, 2026 updated 14 min

Radeon 8060S (gfx1151) Vulkan Broke Again After AMD Driver Update

After updating to AMD Software 26.3.1 on a GMKtec EVO-X2 (Ryzen AI Max+ 395), the Vulkan backend fails to allocate device memory properly and falls back to CPU. Covers the investigation and a workaround: changing the BIOS VRAM allocation from 48GB/16GB to 32GB/32GB.

AMD Vulkan GPU llama.cpp LLM Experiment

Tech Mar 18, 2026 updated 8 min

ComfyUI on Blackwell GPUs (RTX 5090 / RTX PRO 6000): why sm_120 fails and the PyTorch Nightly fix that works

Why ComfyUI breaks on NVIDIA Blackwell (sm_120) GPUs with 'no kernel image is available for execution' errors, and a working setup using PyTorch Nightly, xformers removal, SageAttention, and NVFP4 quantization. Tested on RTX PRO 6000 Blackwell.

ComfyUI NVIDIA GPU Blackwell Image Generation

Tech Feb 26, 2026 updated 5 min

ComfyUI + WAI-Illustrious on RTX 4060 Laptop (8GB VRAM): 1024x1024 in 15s, no --lowvram, LoRA still fits

Setup notes for running WAI-Illustrious SDXL v16 on ComfyUI with an 8GB RTX 4060 Laptop. 1024x1024 generates in ~15 seconds without --lowvram, and a LoRA still loads. CUDA 12.8 portable build and path gotchas included.

ComfyUI Stable Diffusion Image Generation GPU Benchmark

Tech Feb 18, 2026 2 min

Rust async/await Runs on GPUs as VectorWare Demonstrates the First Implementation

VectorWare has announced the first implementation of Rust's Future trait and async/await running on GPUs by adapting the Embassy executor to a GPU environment.

Rust GPU async Programming