Based on EE Times' interview with AMD AI Software VP Anush Elangovan, we assess the ROCm vs CUDA ecosystem gap. Includes hands-on experience with ROCm breaking four times on Strix Halo, plus practical guidance on choosing between NVIDIA, AMD, and Apple Silicon.
Benchmarking NII's LLM-jp-4-32B-A3B-thinking on EVO-X2 (Ryzen AI Max+ 395) with ROCm. 62.9 t/s vs Qwen3.5-35B-A3B's 44.7 t/s. Covers thinking control issues, KV cache trade-offs, knowledge cutoff, Japanese quality comparisons, code generation tests, and training data composition.
Lemonade is AMD's open-source local AI server that manages multiple backends like llama.cpp and FastFlowLM across GPU/NPU/CPU, serving text, image, and audio generation through an OpenAI-compatible API.
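Because the server speaks the OpenAI wire format, any generic HTTP client can talk to it. A minimal sketch of building a chat-completions request, assuming a local endpoint; the base URL, port, and model name here are placeholders, not confirmed Lemonade defaults, so check your server's startup log for the real values:

```python
import json

# Assumed local endpoint -- replace with what your server actually prints on start.
BASE_URL = "http://localhost:8000/api/v1"

def build_chat_request(model: str, prompt: str) -> dict:
    """Build a /chat/completions payload in the OpenAI wire format."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

payload = build_chat_request("llama-3.2-3b-instruct", "Hello!")
body = json.dumps(payload)
# To actually send it (requires a running server):
#   req = urllib.request.Request(
#       f"{BASE_URL}/chat/completions", data=body.encode(),
#       headers={"Content-Type": "application/json"})
#   resp = urllib.request.urlopen(req)
print(body)
```

Because the payload shape is standardized, the same client code works unchanged whether the backend underneath is llama.cpp on GPU or FastFlowLM on NPU.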
Qwen3.5-35B-A3B is an SSM+Attention hybrid where only 10 of 40 layers use KV cache. Expanding ctx-size from 4096 to 65536 on llama-server added just 800MB VRAM with zero speed loss. Includes q8_0 KV quantization benchmarks and TurboQuant status.
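The cheap context expansion follows directly from the architecture: KV cache size scales linearly with context length, but only the 10 attention layers pay for it. A back-of-envelope sketch; the head count and head dimension below are hypothetical illustration values, not taken from the model card:

```python
def kv_cache_bytes(ctx: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 2) -> int:
    """KV cache size: two tensors (K and V) per attention layer, f16 by default."""
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem

# Hypothetical dimensions for illustration (NOT from the Qwen3.5 model card):
ATTN_LAYERS = 10   # only 10 of the 40 layers carry a KV cache
KV_HEADS = 4
HEAD_DIM = 128

small = kv_cache_bytes(4096, ATTN_LAYERS, KV_HEADS, HEAD_DIM)
large = kv_cache_bytes(65536, ATTN_LAYERS, KV_HEADS, HEAD_DIM)
full = kv_cache_bytes(65536, 40, KV_HEADS, HEAD_DIM)  # if all 40 layers used attention
print(f"ctx=4096:  {small / 2**20:.0f} MiB")
print(f"ctx=65536: {large / 2**20:.0f} MiB")
print(f"all-attention at ctx=65536 would need: {full / 2**20:.0f} MiB")
```

The exact megabyte figures depend on the real head geometry and cache dtype, but the 4x saving from caching only 10 of 40 layers holds regardless, and quantizing the cache to q8_0 halves whatever remains.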
After updating to AMD Software 26.3.1 on a GMKtec EVO-X2 (Ryzen AI Max+ 395), the Vulkan backend fails to allocate device memory properly and falls back to CPU. Investigation and a workaround: changing the BIOS VRAM allocation from 48GB/16GB to 32GB/32GB.
All of huihui-ai's abliterated Qwen3.5 variants produced garbage tokens, and the abliterated GLM-4.7-Flash had a broken chat template. The official model with thinking disabled turned out to be the right answer.
How to configure VRAM/main memory split on the GMKtec EVO-X2 (Strix Halo) for local LLM inference. A 29.6GB model ran fine with just 8GB of dedicated VRAM.
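A model larger than the dedicated carve-out can still load because the driver spills into shared system memory (GTT on Linux, "shared GPU memory" on Windows), conventionally capped around half of RAM. A rough sketch under that half-of-RAM assumption, using a 64GB machine as in the BIOS-split example:

```python
def gpu_addressable_gib(dedicated_vram_gib: float, system_ram_gib: float,
                        gtt_fraction: float = 0.5) -> float:
    """Rough upper bound on GPU-addressable memory: dedicated VRAM plus
    shared system memory (GTT), assumed here to be half of the remaining RAM."""
    remaining_ram = system_ram_gib - dedicated_vram_gib
    return dedicated_vram_gib + gtt_fraction * remaining_ram

# 64 GB unified memory with only 8 GB carved out as dedicated VRAM:
limit = gpu_addressable_gib(8, 64)
model_gib = 29.6
print(f"addressable ~{limit:.0f} GiB -> 29.6 GiB model fits: {model_gib < limit}")
```

The 0.5 GTT fraction is a common default, not a guarantee; on Linux it can be raised with the `amdgpu.gttsize` kernel parameter, which is why a small dedicated split often works fine for inference.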
Building an NSFW-capable local LLM on the GMKtec EVO-X2 (Strix Halo). Getting GPU inference at ~11 tokens/s with LM Studio and MS3.2-24B-Magnum-Diamond.