#Benchmark

7 articles

Tech Apr 13, 2026 6 min

How 8 AI Agent Benchmarks Were Gamed to Near-Perfect Scores Without Solving a Single Task

UC Berkeley's RDI team demonstrated that major benchmarks including SWE-bench and WebArena can be manipulated to near-perfect scores without completing any tasks. They identified 7 vulnerability patterns and released BenchJack, an automated benchmark attack tool.

AI AI Agent Benchmark Security

Tech Apr 3, 2026 8 min

Running Lemonade on Strix Halo (EVO-X2): Vulkan Shared Memory Leaks and ROCm Stability

Real-world testing of AMD Lemonade v10.0.1 on Ryzen AI Max+ 395. Covers running LLM, image generation, speech recognition, and TTS simultaneously, NPU Hybrid execution, Vulkan vs ROCm benchmarks, and the discovery of shared memory leaks.

AMD Local LLM Vulkan ROCm NPU llama.cpp GPU Inference Optimization Benchmark Experiment

Tech Mar 31, 2026 updated 8 min

Qwen3.5-35B-A3B on llama-server (Vulkan + Strix Halo): 4K → 65K context for only 800MB more VRAM

Qwen3.5-35B-A3B is an SSM+Attention hybrid where only 10 of 40 layers consume KV cache. Going from ctx-size 4096 to 65536 on llama-server + Vulkan added just 800MB VRAM with zero throughput loss. Tested on Strix Halo (Ryzen AI Max+ 395), with q8_0 KV quant benchmarks.

LLM Local LLM llama.cpp AMD Vulkan KV Cache Qwen Benchmark

Tech Mar 26, 2026 6 min

ARC-AGI-3 announced: frontier AI scores under 1% on interactive reasoning

François Chollet et al. have published a new benchmark, ARC-AGI-3. As of March 2026, every frontier LLM scores below 1% on its interactive tasks, which require autonomously exploring an unknown environment with an unknown goal.

AI Benchmark AGI Claude

Tech Feb 26, 2026 updated 5 min

ComfyUI + WAI-Illustrious on RTX 4060 Laptop (8GB VRAM): 1024x1024 in 15s, no --lowvram, LoRA still fits

Setup notes for running WAI-Illustrious SDXL v16 on ComfyUI with an 8GB RTX 4060 Laptop. 1024x1024 generates in ~15 seconds without --lowvram, and a LoRA still loads. CUDA 12.8 portable build and path gotchas included.

ComfyUI Stable Diffusion Image Generation GPU Benchmark

Tech Feb 24, 2026 8 min

Large-Scale Unauthorized Distillation of Claude and the Collapse of SWE-bench Hit on the Same Day

Anthropic accused three Chinese AI companies of distilling Claude, and on the same day OpenAI retired SWE-bench Verified. Training fraud and evaluation flaws were both exposed on February 23, 2026.

AI Security Anthropic DeepSeek Benchmark LLM OpenAI SWE-bench

Tech Feb 19, 2026 updated 5 min

How IT-Bench and MAST expose enterprise AI agent failure modes

Using IBM and UC Berkeley's IT-Bench benchmark and the MAST failure taxonomy, this article examines why enterprise AI agents fail. It covers the 11% success rate on SRE tasks and the 0% success rate on FinOps tasks, plus the Replit production database deletion incident.

AI AI Agents IBM Benchmark Enterprise