An out-of-bounds read in the GGUF loader of Ollama versions before 0.17.1. If your Ollama API is network-accessible, a crafted model file can leak env vars, API keys, system prompts, and conversation fragments from process memory.
Tested connecting MCP servers to local LLMs running under Ollama on an M1 Max 64GB. MCPHost is deprecated, tool calling breaks with quantized models, and the context window fills fast. Includes working TypeScript and Python custom MCP server setups.
Three local image generation engines (WAI-Anima, WAI-IL/SDXL, FLUX.2 Klein 4B) tied together by a thin FastAPI wrapper that takes Japanese prompts. Ollama (gemma3:12b) handles JP→EN, ComfyUI workflows are built on the fly in Python, FLUX.2 runs as an mflux subprocess, and the whole thing is reachable from an iPhone over Tailscale.
Hands-on log of building the DEV article's PDF RAG on an M1 Max 64GB, extending it to images via CLIP, and getting it working in Japanese with bge-m3 + Qwen3.6 35B. Documents the modality gap, the dual inference server crash, and LLM-jp 4-8B's empty chat template silently dropping the system role.
The NotebookLM clone open-notebook assumes Docker and cloud APIs by default. I installed SurrealDB natively, ran four processes in tmux, and wired everything through Ollama's qwen3.6:35b and bge-m3. I fed it the Qwen3.6 benchmark article I wrote this morning, and it answered with the correct numbers.
Tried Qwen3.6-27B on both Ollama and MLX. Ollama couldn't load the VL-projector-embedded GGUF, while MLX ran it at 11 tok/s. As a side test, 35B-A3B under MLX ran roughly 2× faster than the Ollama GGUF. Also had both models build a BBS to gauge intent handling.
A hands-on log of Qwen3.6-35B-A3B under Ollama 0.20.6. Generation speed matches Qwen3.5 at 27 tok/s, but thinking tokens grew 13× for the same prompt. Multi-turn, persona, and a three-tier NSFW probe are included.
LLM safety is a stack of five layers (input filter, system prompt, RLHF, Constitutional AI, output filter), and each provider blocks at different layers. A breakdown of which layers abliterated vs uncensored models cut, and the default censorship level baked into local LLMs.
I tested local Vision LLMs (Gemma 3, Qwen2.5-VL, Llama 3.2 Vision, Gemma 4) to see if they could look at character illustrations and pixel art and generate RPG-style stats in JSON format.
Ollama 0.19 switches the Apple Silicon backend to MLX, achieving 1,810 tokens/s prefill and 112 tokens/s decode. NVFP4 quantization support and cache improvements landed at the same time.
A three-stage pipeline (BERT perplexity scan → LLM judgment → escalation), packaged as a cross-platform Python tool. The installer automatically downloads llama-server and the GGUF models.