#Quantization

3 articles

Tech · May 26, 2026 · 15 min

Hy-MT2 1.8B Q4_K_M on M1 Max 64GB: 1.25bit 440MB build does not load on stock llama.cpp yet

Hands-on with Tencent Hy-MT2 1.8B Q4_K_M (1.08GB) on M1 Max 64GB via llama-server. JSON, SRT, HTML, glossary, and minority-language prompts with full input-output pairs. The 1.25bit 440MB build does not load on stock llama.cpp 8990, and 30B-A3B (hy_v3) is not in the Mac route yet.

AI, LLM, Translation, Local LLM, Hugging Face, Quantization, MoE, Open Source, Mac, Apple Silicon, Experiment

Tech · Mar 25, 2026 · 17 min

Hypura’s NVMe Streaming and TurboQuant’s KV Cache Quantization

Hypura breaks away from llama.cpp’s mmap design and streams even dense models with a three-tier NVMe placement, while TurboQuant eliminates quantization-constant overhead via a polar-coordinate transform. Includes a design comparison with Flash‑MoE and a review of scenarios where KV‑cache compression actually helps.

LLM, Local LLM, Quantization, Apple Silicon, Inference Optimization, KV Cache, Rust

Tech · Jan 30, 2026 · 5 min

Not All Bits Are Equal: There is no universal solution for memory allocation in reasoning models

How should memory be allocated in reasoning models? This paper explains the trade-offs among quantization, KV cache, and test-time compute, based on 1,700 experiments.

LLM, Quantization, Inference, Research