UC Berkeley's RDI team demonstrated that major benchmarks including SWE-bench and WebArena can be manipulated to near-perfect scores without completing any tasks. They identified 7 vulnerability patterns and released BenchJack, an automated benchmark attack tool.
Qwen3.5-35B-A3B is an SSM+Attention hybrid where only 10 of 40 layers consume KV cache. Going from ctx-size 4096 to 65536 on llama-server + Vulkan added just 800MB VRAM with zero throughput loss. Tested on Strix Halo (Ryzen AI Max+ 395), with q8_0 KV quant benchmarks.
François Chollet et al. publish new benchmark ARC-AGI-3. As of March 2026, all Frontier LLMs have achieved less than 1% of the interactive task of autonomously exploring an unknown environment with an unknown goal.
Setup notes for running WAI-Illustrious SDXL v16 on ComfyUI with an 8GB RTX 4060 Laptop. 1024x1024 generates in ~15 seconds without --lowvram, and a LoRA still loads. CUDA 12.8 portable build and path gotchas included.
Anthropic accused three Chinese AI companies of distilling Claude, and on the same day OpenAI retired SWE-bench Verified. Training fraud and evaluation flaws exposed simultaneously on February 23, 2026.
Using IBM and UC Berkeley's IT-Bench benchmark and the MAST failure taxonomy, this article examines why enterprise AI agents fail. It covers the reality of 11% SRE success and 0% FinOps success, plus the Replit production database deletion incident.