OpenAI shipped GPT-5.5 and GPT-5.5 Pro on the API. A practical rundown of the 1M+ context, the new reasoning.effort default, image input behavior, prompt caching, and pricing.
A hands-on run of inclusionAI's Ling-flash-2.0 (100B / 6.1B active, MXFP4 quant, 54.7GB) on SwiftLM via mlx-swift-lm on an M1 Max 64GB. Covers the bailing_moe and MXFP4 support check in mlx-swift, the startup surprise, and what --stream-experts actually saves.
A hands-on build and run of the Swift-based LLM inference server SwiftLM on an M1 Max 64GB. Covers Qwen3.6-35B-A3B and Qwen3.5-122B-A10B, with the same BST, BBS, and persona tests used in the existing Ollama and MLX-lm write-ups.
TRACER, a recent arXiv paper, takes the input/output logs of an LLM classification endpoint and reuses them as training data. To cut inference cost, it then swaps in a lightweight surrogate, but only on the input regions that pass a parity gate. The surrogate absorbs 83–100% of traffic on a 77-class intent dataset and 100% on a 150-class one, while correctly refusing to deploy on an NLI task. That refusal behavior is the interesting part.
Japan's Digital Agency released parts of Gennai, the generative AI platform it runs for central-government staff, on GitHub under MIT / CC BY 4.0. The web app and cloud-specific AI templates for AWS, Azure, and Google Cloud are bundled together so local governments and private companies can redeploy the same stack.
DeepSeek V4 Preview ships V4-Pro (1.6T/49B active) and V4-Flash (284B/13B active) as open weights under MIT, both with 1M context. CSA+HCA hybrid attention, mHC, and the Muon optimizer cut per-token FLOPs at 1M tokens to 27% of V3.2. Also covers the day-one API and the chat.deepseek.com mode switch.
Two open-weight Chinese MoEs landed within 24 hours: Ant Ling-2.6-flash (104B/7.4B active, 7x token-efficiency claim) and Tencent Hy3-preview (295B/21B active, frontier-tier open weights). Specs, licenses, and how they line up against DeepSeek-V3 and GLM-4.5.
Xiaomi launched two MiMo-V2.5 models at once. MiMo-V2.5-Pro posts frontier-tier scores of 57.2 on SWE-bench Pro, 63.8 on Claw-Eval, and 72.9 on τ3-Bench, while MiMo-V2.5 brings native omnimodality plus a 1M context. Both are API-only for now; open weights are promised but unscheduled.
NVIDIA's build.nvidia.com serves a free inference API that covers 100+ models including MiniMax M2.7, GLM-5, Kimi K2.5, DeepSeek, GPT-OSS, and Sarvam-M. Because integrate.api.nvidia.com/v1 is OpenAI-compatible, OpenClaw, OpenCode, Zed, and Cursor can call it directly.
The NotebookLM clone open-notebook assumes Docker and cloud APIs by default. I installed SurrealDB natively, ran four processes in tmux, and wired everything through Ollama's qwen3.6:35b and bge-m3. I fed it the Qwen3.6 benchmark article I wrote this morning, and it answered with the correct numbers.
I tried Qwen3.6-27B on both Ollama and MLX. Ollama couldn't load the GGUF with the embedded VL projector, while MLX ran it at 11 tok/s. As a side test, 35B-A3B under MLX ran roughly 2× faster than the Ollama GGUF. I also had both models build a BBS to gauge intent handling.
A hub for the 5-article series that organizes the math symbols found in AI and LLM articles, with the goal of reading them rather than solving them. Covers equations, vectors and matrices, probability and statistics, derivatives, and gradient descent with backprop, plus a reading-order guide for different backgrounds.