Ollama 0.19 switches the Apple Silicon backend to MLX, achieving 1,810 tokens/s prefill and 112 tokens/s decode. NVFP4 quantization support and cache improvements landed at the same time.
A three-stage pipeline of BERT perplexity scan → LLM judgment → escalation, packaged as a cross-platform Python tool. The installer automatically downloads llama-server and the GGUF models.
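The post itself carries the details, but a minimal sketch of the first stage, scoring each OCR line with BERT masked-LM pseudo-perplexity so only suspicious lines reach the LLM judge, might look like this. The checkpoint name and threshold are placeholders, not the tool's actual defaults.

```python
# Sketch of a pseudo-perplexity scan; checkpoint and threshold are illustrative.
import math
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL = "bert-base-multilingual-cased"  # placeholder; the post fine-tunes LUKE/BERT
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL).eval()

def pseudo_perplexity(text: str) -> float:
    """Mask each token in turn and average the negative log-likelihood."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    nlls = []
    for i in range(1, len(ids) - 1):            # skip [CLS] / [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        nlls.append(-log_probs[ids[i]].item())
    return math.exp(sum(nlls) / len(nlls))

# Only lines above an empirically chosen threshold are sent to the LLM judge.
needs_review = pseudo_perplexity("Some OCR output line") > 40.0  # threshold is illustrative
```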
Every variant of huihui-ai's abliterated Qwen 3.5 produced garbage tokens, and the abliterated GLM-4.7-Flash shipped with a broken chat template. The official release with thinking disabled turned out to be the right answer.
Experiment log: from LUKE/BERT fill-mask fine-tuning, to perplexity-based error detection, to Qwen2.5 7B correction judgment with human escalation on mismatch. A complete pipeline running on a single RTX 4060 Laptop GPU with 8 GB of VRAM.
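A hedged sketch of the judgment-and-escalation step described in that post: a local Qwen2.5 7B served by llama-server proposes a correction, and a human is only asked when it disagrees with the BERT suggestion. The endpoint URL, prompt, and helper names here are illustrative, not taken from the pipeline's source.

```python
# Sketch of LLM judgment with human escalation on mismatch; assumes llama-server
# is running locally with its OpenAI-compatible /v1/chat/completions endpoint.
import requests

LLAMA_SERVER = "http://localhost:8080/v1/chat/completions"  # assumed local endpoint

def llm_correction(line: str) -> str:
    """Ask the local Qwen2.5 7B instance to correct one OCR line."""
    resp = requests.post(LLAMA_SERVER, json={
        "model": "qwen2.5-7b-instruct",  # whichever GGUF llama-server has loaded
        "messages": [
            {"role": "system", "content": "Fix OCR errors. Return only the corrected line."},
            {"role": "user", "content": line},
        ],
        "temperature": 0.0,
    }, timeout=120)
    return resp.json()["choices"][0]["message"]["content"].strip()

def review(line: str, bert_suggestion: str) -> str:
    """Auto-apply when both models agree; otherwise escalate to a human."""
    llm_suggestion = llm_correction(line)
    if llm_suggestion == bert_suggestion:
        return llm_suggestion
    return input(f"BERT: {bert_suggestion}\nLLM:  {llm_suggestion}\nChoose/edit: ")
```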
Set up the CLI version of NDLOCR-Lite on an Apple Silicon Mac, then tested OCR result correction with Qwen 3.5 and Swallow. Includes experiments with direct image reading and the anchoring effect.
A plan to build an internal help desk RAG system using a Mac mini M4 Pro and Dify. Highlights what's new in Dify circa 2025 and tips for running local LLMs.