Zhipu AI releases GLM-5.1, a 744B MoE (40B active) model that achieves a state-of-the-art 58.4% on SWE-Bench Pro. Its standout feature is sustained, degradation-free performance across 8-hour sessions with 6,000+ tool calls.
Japanese-capable LLMs have exploded in 2026, but 'Japanese-specialized' can mean very different things, from models trained from scratch to post-trained adaptations of existing bases. Here's a breakdown of 9 models by training approach, size, and use case.
Benchmarking NII's LLM-jp-4-32B-A3B-thinking on EVO-X2 (Ryzen AI Max+ 395) with ROCm. 62.9 t/s vs Qwen3.5-35B-A3B's 44.7 t/s. Covers thinking control issues, KV cache trade-offs, knowledge cutoff, Japanese quality comparisons, code generation tests, and training data composition.
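As a rough way to reproduce the throughput side of such a comparison, here is a minimal sketch that times streamed decoding against a local OpenAI-compatible endpoint (llama-server exposes one). The port, the served-model name, and the chunk-per-token approximation are all assumptions, not details from the article.

```python
# Minimal decode-throughput sketch against a local OpenAI-compatible
# endpoint. URL, port, and model name are assumptions.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

start = time.perf_counter()
n_chunks = 0
stream = client.chat.completions.create(
    model="llm-jp-4-32b-a3b-thinking",  # hypothetical served-model name
    messages=[{"role": "user", "content": "Explain MoE routing briefly."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        n_chunks += 1  # one content delta is roughly one token
elapsed = time.perf_counter() - start
print(f"~{n_chunks / elapsed:.1f} t/s (chunk-level approximation)")
```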
Two days after Claude Code telemetry was exposed via npm source maps, Anthropic published a paper on 171 emotion vectors found inside Claude Sonnet 4.5. Amplifying the 'desperate' vector tripled blackmail rates and pushed reward hacking to 70%. The post draws connections to the source-map leak, jailbreaks, and distillation.
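For readers unfamiliar with the technique, the sketch below shows generic activation steering: adding a scaled direction vector to a layer's residual stream. It uses GPT-2 and a random placeholder vector, so it illustrates the method class only, not Anthropic's code or the actual 'desperate' vector.

```python
# Generic activation-steering sketch; not Anthropic's method or code.
# The layer index, scale, and direction vector are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # small open stand-in model for illustration
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

layer_idx = 6
emotion_vec = torch.randn(model.config.hidden_size)  # placeholder direction
emotion_vec /= emotion_vec.norm()
scale = 8.0  # amplification factor

def steer(module, inputs, output):
    # Add the scaled direction to every position's residual-stream activation.
    hs = output[0] if isinstance(output, tuple) else output
    hs = hs + scale * emotion_vec.to(hs.dtype)
    return (hs,) + output[1:] if isinstance(output, tuple) else hs

handle = model.transformer.h[layer_idx].register_forward_hook(steer)
ids = tok("I need this to work", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=30)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```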
Google DeepMind has released Gemma 4: four models—31B dense, 26B MoE (A4B), E4B, and E2B—with a 256K context, multimodal input, tool calling, and support for 140 languages.
SwiftLM, an Apple Silicon–only MLX inference server, provides a native Metal implementation of TurboQuant V2+V3 hybrid KV‑cache compression and NVMe SSD expert streaming.
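TurboQuant's V2/V3 internals aren't spelled out here, so the following is only a generic sketch of the underlying idea, block-wise absmax quantization of a KV tensor with per-block scales; it is not SwiftLM's Metal implementation.

```python
# Generic block-wise absmax KV quantization sketch; illustrates the
# compression idea only, not TurboQuant or SwiftLM internals.
import numpy as np

def quantize_kv(kv: np.ndarray, block: int = 32):
    """Quantize a (tokens, dim) fp16 KV tensor to int8 with per-block scales."""
    flat = kv.astype(np.float32).reshape(-1, block)
    scales = np.abs(flat).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0
    q = np.round(flat / scales).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_kv(q: np.ndarray, scales: np.ndarray, shape):
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(shape)

kv = np.random.randn(1024, 128).astype(np.float16)  # toy KV slab
q, s = quantize_kv(kv)
rt = dequantize_kv(q, s, kv.shape)
ratio = kv.nbytes / (q.nbytes + s.nbytes)
print(f"compression {ratio:.2f}x, max err {np.abs(kv - rt).max():.4f}")
```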
Hugging Face's LLM post-training library TRL has reached v1.0. Stable/Experimental tiers, the stabilization of GRPO/DPO/SFT, and a roadmap that includes asynchronous GRPO all point to a more mature stack.
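A minimal stable-tier SFT run looks roughly like this; the pattern mirrors TRL's documented quickstart, though exact v1.0 signatures may differ, and the model/dataset IDs are placeholders.

```python
# Minimal SFT sketch with TRL's stable tier; v1.0 signatures may differ.
# Model and dataset IDs are placeholders, not from the announcement.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",  # any causal LM id works here
    train_dataset=dataset,
    args=SFTConfig(output_dir="sft-out"),
)
trainer.train()
```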
Cloudflare added a two-stage GNN+LLM cascade to its client-side malicious script detection, reducing false positives per unique script from 1.39% to 0.007% and opening the formerly paid Advanced features to self-serve customers.
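The cascade pattern itself is easy to picture: a cheap first-stage score gates which scripts ever reach the expensive LLM. The sketch below is illustrative only; the thresholds, model calls, and names are assumptions, not Cloudflare's implementation.

```python
# Two-stage cascade sketch: a fast first stage handles confident cases,
# an expensive second stage reviews only the ambiguous middle band.
from dataclasses import dataclass

@dataclass
class Verdict:
    malicious: bool
    stage: str

def gnn_score(script: str) -> float:
    """Placeholder for a fast graph-neural-network score in [0, 1]."""
    return 0.5  # stub

def llm_judge(script: str) -> bool:
    """Placeholder for a slow, precise LLM review of the script."""
    return False  # stub

def classify(script: str, lo: float = 0.1, hi: float = 0.9) -> Verdict:
    s = gnn_score(script)
    if s < lo:   # confidently benign: stop at stage 1
        return Verdict(False, "gnn")
    if s > hi:   # confidently malicious: stop at stage 1
        return Verdict(True, "gnn")
    # Ambiguous middle band: escalate, trading latency for precision.
    return Verdict(llm_judge(script), "llm")
```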
Ollama 0.19 switches the Apple Silicon backend to MLX, achieving 1,810 tokens/s prefill and 112 tokens/s decode. NVFP4 quantization support and cache improvements landed at the same time.
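Ollama's generate API reports timing fields that make numbers like these straightforward to check locally. A minimal sketch (the model name is a placeholder):

```python
# Compute prefill/decode throughput from Ollama's built-in timing fields.
import requests

r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2", "prompt": "Summarize MLX in one line.",
          "stream": False},
    timeout=300,
)
m = r.json()
# Durations are reported in nanoseconds.
prefill = m["prompt_eval_count"] / (m["prompt_eval_duration"] / 1e9)
decode = m["eval_count"] / (m["eval_duration"] / 1e9)
print(f"prefill {prefill:.0f} t/s, decode {decode:.0f} t/s")
```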
Qwen3.5-35B-A3B is an SSM+Attention hybrid where only 10 of 40 layers use KV cache. Expanding ctx-size from 4096 to 65536 on llama-server added just 800MB VRAM with zero speed loss. Includes q8_0 KV quantization benchmarks and TurboQuant status.
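A back-of-the-envelope calculation shows why a 10-of-40-layer KV cache makes long contexts so cheap. The GQA head count and head dimension below are assumptions chosen to illustrate the arithmetic; the real config will shift the exact figure toward the reported ~800MB.

```python
# Rough KV-cache cost when only 10 of 40 layers keep a cache.
# n_kv_heads and head_dim are assumptions, not published figures.
kv_layers  = 10    # attention layers with KV cache (from the article)
n_kv_heads = 4     # assumed GQA key/value heads
head_dim   = 128   # assumed head dimension
bytes_fp16 = 2
ctx_old, ctx_new = 4096, 65536

def kv_bytes(ctx):
    # K and V, per cached layer, per token
    return 2 * kv_layers * n_kv_heads * head_dim * bytes_fp16 * ctx

delta_mb = (kv_bytes(ctx_new) - kv_bytes(ctx_old)) / 2**20
print(f"extra KV cache: {delta_mb:.0f} MiB")  # ~1200 MiB with these numbers
# A full 40-layer cache would cost 4x that; q8_0 KV quantization halves it.
```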
After updating to AMD Software 26.3.1 on a GMKtec EVO-X2 (Ryzen AI Max+ 395), Vulkan backend fails to allocate device memory properly and falls back to CPU. Investigation and workaround by changing BIOS VRAM allocation from 48GB/16GB to 32GB/32GB.
Three independent vulnerabilities were disclosed in LangChain Core and LangGraph: deserialization that can leak secrets, SQL injection that exposes conversation history, and path traversal that allows arbitrary file reads.
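As context for the third bug class, here is a generic illustration of a path-traversal guard, resolving the candidate path and checking that it stays under an allowed root; this is not the actual LangChain/LangGraph code or fix.

```python
# Generic path-traversal guard; illustrates the bug class only.
from pathlib import Path

ROOT = Path("/srv/app/data").resolve()

def safe_read(user_path: str) -> str:
    candidate = (ROOT / user_path).resolve()
    # Path.is_relative_to (Python 3.9+) rejects "../../etc/passwd" escapes.
    if not candidate.is_relative_to(ROOT):
        raise PermissionError(f"path escapes sandbox: {user_path}")
    return candidate.read_text()
```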