Ollama 0.19 switches the Apple Silicon backend to MLX, achieving 1,810 tokens/s prefill and 112 tokens/s decode. NVFP4 quantization support and cache improvements landed at the same time.
A three-stage pipeline of BERT perplexity scan → LLM judgment → escalation, packaged as a cross-platform Python tool. The installer automatically downloads llama-server and the GGUF models.
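The post itself carries the details, but a minimal sketch of the first stage, scoring each OCR line with BERT masked-LM pseudo-perplexity so only suspicious lines reach the LLM judge, might look like this. The checkpoint name and threshold are placeholders, not the tool's actual defaults.

```python
# Sketch of a pseudo-perplexity scan; checkpoint and threshold are illustrative.
import math
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL = "bert-base-multilingual-cased"  # placeholder; the post fine-tunes LUKE/BERT
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL).eval()

def pseudo_perplexity(text: str) -> float:
    """Mask each token in turn and average the negative log-likelihood."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    nlls = []
    for i in range(1, len(ids) - 1):            # skip [CLS] / [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        nlls.append(-log_probs[ids[i]].item())
    return math.exp(sum(nlls) / len(nlls))

# Only lines above an empirically chosen threshold are sent to the LLM judge.
needs_review = pseudo_perplexity("Some OCR output line") > 40.0  # threshold is illustrative
```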
Every variant of huihui-ai's abliterated Qwen 3.5 produced garbage tokens, and the abliterated GLM-4.7-Flash shipped with a broken chat template. The official release with thinking disabled turned out to be the right answer.
Experiment log: from LUKE/BERT fill-mask fine-tuning, to perplexity-based error detection, to Qwen2.5 7B correction judgment with human escalation on mismatch. A complete pipeline running on a single RTX 4060 Laptop GPU with 8 GB of VRAM.
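A hedged sketch of the judgment-and-escalation step described in that post: a local Qwen2.5 7B served by llama-server proposes a correction, and a human is only asked when it disagrees with the BERT suggestion. The endpoint URL, prompt, and helper names here are illustrative, not taken from the pipeline's source.

```python
# Sketch of LLM judgment with human escalation on mismatch; assumes llama-server
# is running locally with its OpenAI-compatible /v1/chat/completions endpoint.
import requests

LLAMA_SERVER = "http://localhost:8080/v1/chat/completions"  # assumed local endpoint

def llm_correction(line: str) -> str:
    """Ask the local Qwen2.5 7B instance to correct one OCR line."""
    resp = requests.post(LLAMA_SERVER, json={
        "model": "qwen2.5-7b-instruct",  # whichever GGUF llama-server has loaded
        "messages": [
            {"role": "system", "content": "Fix OCR errors. Return only the corrected line."},
            {"role": "user", "content": line},
        ],
        "temperature": 0.0,
    }, timeout=120)
    return resp.json()["choices"][0]["message"]["content"].strip()

def review(line: str, bert_suggestion: str) -> str:
    """Auto-apply when both models agree; otherwise escalate to a human."""
    llm_suggestion = llm_correction(line)
    if llm_suggestion == bert_suggestion:
        return llm_suggestion
    return input(f"BERT: {bert_suggestion}\nLLM:  {llm_suggestion}\nChoose/edit: ")
```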
Set up the CLI version of NDLOCR-Lite on an Apple Silicon Mac, then tested OCR result correction with Qwen 3.5 and Swallow. Includes experiments with direct image reading and the anchoring effect.
A plan to build an internal help desk RAG system using a Mac mini M4 Pro and Dify. Highlights what's new in Dify circa 2025 and tips for running local LLMs.