#Local LLM

9 articles

Tech · May 7, 2026 · 8 min

Gemma 4 MTP drafter on M1 Max 64GB: 26B A4B +13%, 31B Dense and E4B got slower

Tested the Gemma 4 MTP drafter on M1 Max 64GB with mlx-vlm 0.5.0. Only the 26B A4B MoE got +13%; 31B Dense and E4B got slower. Switching between code generation and short haiku prompts flips the result.

AI LLM Google Gemma Local LLM Inference MLX Experiment

Tech · May 6, 2026 · updated · 9 min

Gemma 4 MTP drafter: 3x speedup for Dense, limited gains on 26B MoE at batch 1

Reading Google's MTP drafter docs, vLLM recipes, and the AI for Developers guide. The 3x claim holds for 31B Dense but 26B A4B MoE stalls at batch 1 because speculative decoding verification loads extra expert weights per candidate token.

AI LLM Google Gemma Local LLM Inference

Tech · May 5, 2026 · 8 min

Ollama + MCP servers on M1 Max 64GB: MCPHost deprecation, tool calling limits, and a minimal custom server

Tested connecting MCP servers to Ollama local LLMs on M1 Max 64GB. MCPHost is deprecated, tool calling breaks with quantized models, and context fills fast. Includes working TypeScript and Python custom MCP server setups.

Ollama MCP Local LLM LLM AI Agent

Tech · May 2, 2026 · 23 min

Wiring Up a Multimodal Japanese Local RAG with FastAPI, Chroma, Open WebUI, and Ollama on M1 Max

Hands-on log of building the DEV article's PDF RAG on M1 Max 64GB, extending it with images via CLIP, and pushing through Japanese with bge-m3 + Qwen3.6 35B. Documents the modality gap, the dual inference server crash, and LLM-jp 4-8B's empty chat template silently dropping the system role.

AI LLM RAG Local LLM FastAPI llama.cpp Chroma Python Apple Silicon Ollama Japanese LLM Experiment

Tech · May 2, 2026 · updated · 12 min

Reading an Article on Building a Local PDF RAG with FastAPI, llama.cpp, Chroma, and Open WebUI

Notes on a DEV Community article that wires up FastAPI as an OpenAI-compatible RAG API layer with llama.cpp, Chroma, and Open WebUI, plus where the architecture fits and what to watch for.

AI LLM RAG Local LLM FastAPI llama.cpp Chroma Python Docker

Tech · Apr 23, 2026 · 21 min

Running open-notebook on M1 Max Without Docker or Cloud APIs, and Letting qwen3.6:35b Read Its Own Article

The NotebookLM clone open-notebook assumes Docker and cloud APIs by default. I installed SurrealDB natively, ran four processes in tmux, and wired everything through Ollama's qwen3.6:35b and bge-m3. I fed it the Qwen3.6 benchmark article I wrote this morning, and it answered with the correct numbers.

AI LLM Local LLM Ollama Qwen Apple Silicon RAG OSS Experiment

Tech · Apr 20, 2026 · updated · 9 min

Running TRELLIS.2 on Apple Silicon MPS: a CUDA-free port

A port that replaces TRELLIS.2's CUDA-only libraries (flash_attn, nvdiffrast, sparse 3D convolution) with pure-PyTorch equivalents and runs Microsoft's 4B image-to-3D model on an M4 Pro in about 3.5 minutes without any NVIDIA GPU.

AppleSilicon MPS PyTorch 3D Local LLM ML

Tech · Apr 15, 2026 · updated · 11 min

Five layers of LLM safety filters: where abliterated and uncensored models actually intervene

LLM safety stacks five layers (input filter, system prompt, RLHF, Constitutional AI, output filter), and each provider blocks at different layers. A breakdown of where abliterated vs uncensored models cut, and the default censorship level baked into local LLMs.

AI LLM Local LLM Security Gemini Claude Ollama

Tech · Apr 8, 2026 · updated · 7 min

9 Japanese LLMs in April 2026 compared: LLM-jp-4, PLaMo, Nemotron Nano 9B JP, Swallow, Namazu

9 Japanese-specialized LLMs as of April 2026 (LLM-jp-4, trained from scratch on 11.7T tokens; PLaMo; Nemotron Nano 9B JP, #1 sub-10B on Nejumi 4; Swallow 30B-A3B; Namazu), broken down by whether they were scratch-trained, continued pre-trained, or post-trained, with size, license, and benchmark scores.

AI LLM Local LLM Japanese AI