Qwen3.5-35B-A3B is an SSM+Attention hybrid where only 10 of 40 layers use KV cache. Expanding ctx-size from 4096 to 65536 on llama-server added just 800MB VRAM with zero speed loss. Includes q8_0 KV quantization benchmarks and TurboQuant status.
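The small VRAM cost follows from the hybrid layout: KV cache grows linearly with context, but only the attention layers pay for it. A minimal sketch of that arithmetic, where the head counts and dimensions are assumptions for illustration, not the model's published configuration:

```python
def kv_cache_bytes(n_kv_layers, ctx, n_kv_heads, head_dim, bytes_per_elem):
    """Approximate KV cache size: one K and one V tensor per attention layer."""
    return 2 * n_kv_layers * ctx * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical head configuration -- placeholder values, not the real dims.
N_KV_HEADS, HEAD_DIM, F16 = 2, 128, 2

full   = kv_cache_bytes(40, 65536, N_KV_HEADS, HEAD_DIM, F16)  # all 40 layers attention
hybrid = kv_cache_bytes(10, 65536, N_KV_HEADS, HEAD_DIM, F16)  # only 10 attention layers

print(f"full attention cache:  {full / 2**20:.0f} MiB")
print(f"hybrid (10/40) cache:  {hybrid / 2**20:.0f} MiB")
```

Whatever the exact dims, the hybrid cache is 4x smaller than a pure-attention stack at the same context, which is why a 16x context bump costs so little; q8_0 KV quantization would roughly halve it again.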
Flash-MoE is a C/Metal inference engine that runs Qwen3.5-397B-A17B on a MacBook Pro M3 Max at 4.36 tokens/s. With expert streaming from SSD and hand-written Metal shaders, it fits the 209GB model into a 48GB memory budget.
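Expert streaming works because an MoE forward pass touches only a few experts per token, so most of the 209GB can stay on SSD. A toy sketch of the resident-expert bookkeeping (an LRU cache over expert weights; `load_fn` stands in for the SSD read, and all names here are illustrative, not Flash-MoE's actual API):

```python
from collections import OrderedDict

class ExpertCache:
    """LRU cache of MoE expert weights; misses stream from disk (sketch only)."""
    def __init__(self, capacity, load_fn):
        self.capacity = capacity      # max experts resident in memory
        self.load_fn = load_fn        # stand-in for reading one expert off SSD
        self.cache = OrderedDict()
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)   # mark as recently used
            return self.cache[expert_id]
        self.misses += 1
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)      # evict least recently used
        self.cache[expert_id] = self.load_fn(expert_id)
        return self.cache[expert_id]

# Toy usage: pretend each expert's weights are just its id.
cache = ExpertCache(capacity=3, load_fn=lambda i: i)
for eid in [0, 1, 2, 0, 3, 0]:
    cache.get(eid)
print(cache.misses)  # 4: experts 0,1,2,3 each loaded once; hot expert 0 stays resident
```

The design bet is that router choices are skewed enough that a 48GB budget keeps the hot experts resident and the SSD only serves the tail.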
A three-stage pipeline (BERT perplexity scan → LLM judgment → escalation) packaged as a cross-platform Python tool. The installer automatically downloads llama-server and GGUF models.
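The control flow of such a triage pipeline can be sketched as follows; `perplexity_fn` and `judge_fn` are stand-ins for the BERT scorer and the llama-server call, so only the staging logic is real here:

```python
def run_pipeline(sentences, perplexity_fn, judge_fn, ppl_threshold=100.0):
    """Three-stage triage: cheap perplexity scan, then LLM judge, then escalation.

    Stage 1 filters out sentences the cheap scorer considers normal, so the
    expensive LLM (stage 2) only sees suspects; confirmed hits are escalated
    (stage 3) for review.
    """
    escalated = []
    for s in sentences:
        if perplexity_fn(s) < ppl_threshold:
            continue                      # stage 1: looks fine, skip
        if judge_fn(s):                   # stage 2: LLM confirms a real problem
            escalated.append(s)           # stage 3: escalate
    return escalated

# Toy scorers: "perplexity" is just length; the judge flags a literal marker.
flagged = run_pipeline(
    ["ok", "weird sentence ?? " * 10, "x" * 200],
    perplexity_fn=lambda s: len(s),
    judge_fn=lambda s: "??" in s,
)
print(flagged)  # only the long sentence containing "??" survives both stages
```

The point of the staging is cost: the BERT pass is cheap enough to run over everything, and the LLM only judges the small fraction that fails it.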
Using tori29umai’s LoRA to automatically split facial parts: results from batch-processing 28 images, and a log of hitting its limits when attempting finer hair separation.
Set up the CLI version of NDLOCR-Lite on Apple Silicon Mac, then tested OCR result correction with Qwen 3.5 and Swallow. Includes experiments with direct image reading and the anchoring effect.
Configuration for running a Qwen-Image-Layered LoRA that automatically separates facial parts on RunPod. Comparison of RTX 6000 Ada (48GB) and RTX PRO 6000 (96GB).
Tried running the NSFW variant of Qwen-Image-Edit (Phr00t AIO) on RunPod to generate 3-view reference sheets for a 3D model base mesh. A log of failures on RTX 4090 and eventual success on RTX 5090.