UC Berkeley's RDI team demonstrated that major benchmarks including SWE-bench and WebArena can be manipulated to near-perfect scores without completing any tasks. They identified 7 vulnerability patterns and released BenchJack, an automated benchmark attack tool.
Qwen3.5-35B-A3B is an SSM+Attention hybrid where only 10 of 40 layers use KV cache. Expanding ctx-size from 4096 to 65536 on llama-server added just 800MB VRAM with zero speed loss. Includes q8_0 KV quantization benchmarks and TurboQuant status.
François Chollet et al. publish new benchmark ARC-AGI-3. As of March 2026, all Frontier LLMs have achieved less than 1% of the interactive task of autonomously exploring an unknown environment with an unknown goal.
Anthropic accused three Chinese AI companies of distilling Claude, and on the same day OpenAI retired SWE-bench Verified. Training fraud and evaluation flaws exposed simultaneously on February 23, 2026.
Using IBM and UC Berkeley's IT-Bench benchmark and the MAST failure taxonomy, this article examines why enterprise AI agents fail. It covers the reality of 11% SRE success and 0% FinOps success, plus the Replit production database deletion incident.