Building an NSFW-capable local LLM on the GMKtec EVO-X2 (Strix Halo), with GPU inference running at ~11 tokens/s using LM Studio and MS3.2-24B-Magnum-Diamond.
MioTTS from Aratako is a family of 0.1B to 2.6B Japanese-English TTS models built from scratch around the custom MioCodec. Its key feature is that it runs directly in llama.cpp and Ollama.
Liquid AI's LFM2.5 combines short-range convolutions with attention in a hybrid architecture that is optimized for edge devices without relying on SSMs. This article covers the architecture, benchmarks, and community use cases.
A technical overview of Qwen3-TTS from Alibaba's Qwen team: one-line pip install, 3-second voice cloning, natural-language voice design, and support for 10 languages including Japanese. Released under the Apache 2.0 license.
A technical walkthrough of Alibaba's Qwen3-Omni-30B-A3B, an omni-modal model that activates only 3B of its 30B parameters and responds with speech to text, image, audio, and video inputs. The article covers the Thinker–Talker architecture, benchmarks, and the overall Qwen3 MoE family.
A technical look at ByteDance's UI-TARS-1.5-7B, which beats OpenAI CUA and Claude 3.7 by a wide margin at identifying GUI elements from screenshots, and can run locally with a desktop app.
A technical overview of Alibaba's Qwen3-Coder-Next, an ultra-efficient MoE with 80B parameters but only 3B activated that runs even on a single RTX 4090. It brings 70%+ SWE-Bench performance to local use.
A paper explains that two seemingly mysterious Transformer behaviors, heavy attention on specific tokens and unusually large activations in specific dimensions, are actually manifestations of the same mechanism.
OpenRouter ships :free models and a Free Router endpoint. I tested both, covering rate limits (50 requests/day, rising to 1,000/day after a $10 top-up), tool-calling failures on free models, and which workloads they actually fit.
I looked into PageIndex, a RAG system that builds hierarchical document trees using only LLM reasoning, without chunking or vector databases. I also consider how it fits with layout detection and OCR pipelines.