
Ollama + MCP servers on M1 Max 64GB: MCPHost deprecation, tool calling limits, and a minimal custom server


A DEV Community article covered connecting Ollama with MCP servers.
On my M1 Max 64GB running Ollama’s qwen3.6:35b, the LLM itself works fine — but adding MCP servers changes the picture.
Ollama does tool calling, not MCP client behavior. It doesn’t manage server lifecycles or capability negotiation.

The source article mentions MCPFind indexing 832 MCP servers in its ai-ml category.
That number makes it look like local LLMs can hook into anything, but the real question is “who speaks MCP.”

Ollama can call tools but doesn’t speak MCP

Ollama’s official docs describe tool calling via /api/chat with a tools parameter.
The model returns a tool call with function name and arguments, the app executes it, and feeds the result back as a tool-role message.
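
In code, that loop looks roughly like the following sketch against /api/chat. The read_file tool, the runTool stub, and the model tag are placeholders for illustration:

type ToolCall = { function: { name: string; arguments: Record<string, unknown> } };

// One hypothetical tool, described to the model as JSON Schema via the tools parameter.
const tools = [{
  type: "function",
  function: {
    name: "read_file",
    description: "Read a local text file",
    parameters: {
      type: "object",
      properties: { path: { type: "string" } },
      required: ["path"],
    },
  },
}];

// Stand-in executor: a real app would dispatch on the tool name and actually run something.
async function runTool(name: string, args: Record<string, unknown>): Promise<string> {
  return `stub result for ${name}(${JSON.stringify(args)})`;
}

const messages: Array<Record<string, unknown>> = [
  { role: "user", content: "Show me ~/notes/todo.md" },
];

const res = await fetch("http://localhost:11434/api/chat", {
  method: "POST",
  body: JSON.stringify({ model: "qwen2.5:14b", messages, tools, stream: false }),
});
const { message } = await res.json();
messages.push(message); // assistant turn, possibly containing tool_calls

// Ollama only returns the call; executing it is the application's job.
for (const call of (message.tool_calls ?? []) as ToolCall[]) {
  const output = await runTool(call.function.name, call.function.arguments);
  messages.push({ role: "tool", content: output });
}
// A second /api/chat request with the appended tool results produces the final answer.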

The OpenAI-compatible API is available at http://localhost:11434/v1/.
Pointing an existing OpenAI client at Ollama therefore just works, but only up to that boundary.
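
With the official openai npm client, for example (the model tag is just whatever you have pulled locally):

import OpenAI from "openai";

// Ollama ignores the API key, but the client requires one to be set.
const client = new OpenAI({ baseURL: "http://localhost:11434/v1/", apiKey: "ollama" });

const reply = await client.chat.completions.create({
  model: "qwen2.5:14b", // any model tag you have pulled locally
  messages: [{ role: "user", content: "ping" }],
});
console.log(reply.choices[0].message.content);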

MCP is a layer above.
The spec defines hosts, clients, and servers. A host contains multiple clients, each maintaining a stateful JSON-RPC session with an MCP server.
Servers expose not just tools but also resources and prompts, and initialization includes capability negotiation.

flowchart LR
  User[Human] --> Host[MCP host]
  Host --> Client[MCP client]
  Client --> Server[MCP server]
  Host --> Ollama[Ollama<br/>local model]
  Ollama --> Host
  Server --> Local[filesystem / DB / CLI / API]

Ollama is an inference engine, not an MCP host.
You need a bridge that handles server startup, connection, tool listing, tool call execution, and result relay.

Check the MCPHost repo before adopting it

The source article uses MCPHost as its bridge example.

go install github.com/mark3labs/mcphost@latest

You list MCP servers in a config file and start MCPHost with an Ollama model.
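
The config file uses the familiar mcpServers shape, the same convention Claude Desktop established; the filesystem server below is an illustration, and the repo documents the current schema:

{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/Users/me/notes"]
    }
  }
}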

mcphost --config mcp-config.json \
  --model ollama:qwen2.5:14b \
  --ollama-url http://localhost:11434

The architecture is straightforward — MCPHost becomes the MCP host, connects to servers like filesystem, and presents tools to Ollama via tool calling.

However, MCPHost’s GitHub README states it’s no longer actively maintained, with Kit as the successor.
Copying the tutorial without checking the repo means missing this.

MCPHost is fine for a quick local experiment, but for a persistent setup, pick a maintained host.
Continue.dev or a custom host via the official SDK are alternatives, though self-built hosts require implementing the tool call loop, streaming, error retry, and result truncation yourself.
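
For a concrete sense of that loop, here is a sketch of the core of a self-built host, using the TypeScript SDK's client side plus the same /api/chat call as earlier. The filesystem server and model tag are examples; streaming, retries, and result truncation are all omitted:

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// 1. Launch the MCP server as a subprocess and connect over stdio.
const transport = new StdioClientTransport({
  command: "npx",
  args: ["-y", "@modelcontextprotocol/server-filesystem", `${process.env.HOME}/notes`],
});
const mcp = new Client({ name: "tiny-host", version: "0.1.0" });
await mcp.connect(transport);

// 2. Translate the server's tool list into Ollama's tools format.
const { tools } = await mcp.listTools();
const ollamaTools = tools.map((t) => ({
  type: "function",
  function: { name: t.name, description: t.description, parameters: t.inputSchema },
}));

// 3. Run the tool call loop until the model answers without asking for a tool.
const messages: Array<Record<string, unknown>> = [
  { role: "user", content: "List the markdown files in my notes" },
];
for (let turn = 0; turn < 5; turn++) {
  const res = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    body: JSON.stringify({ model: "qwen2.5:14b", messages, tools: ollamaTools, stream: false }),
  });
  const { message } = await res.json();
  messages.push(message);
  if (!message.tool_calls?.length) break; // plain answer, we're done

  // 4. Relay each call to the MCP server and feed the result back as a tool message.
  for (const call of message.tool_calls) {
    const result = await mcp.callTool({ name: call.function.name, arguments: call.function.arguments });
    const text = (result.content ?? []).map((c: any) => (c.type === "text" ? c.text : "")).join("\n");
    messages.push({ role: "tool", content: text });
  }
}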

Filesystem and DB servers fit local LLMs best

When connecting MCP servers to local LLMs, start with local-only servers: filesystem, Git, SQLite.
Model and tools run on the same machine with no external API keys — which aligns with why you’d use a local LLM in the first place.

When I ran open-notebook on M1 Max without Docker or cloud APIs, the satisfying part was that everything stayed local: Ollama, SurrealDB, Worker, and frontend all on one machine.
MCP is the same: the more local your data targets, the more of the local-LLM advantage you keep.

Conversely, GitHub, AWS, and SaaS MCP servers still require API tokens for external services even with a local LLM.
Search and browser servers still make external HTTP requests.
“Not sending documents to a cloud LLM” and “no network traffic at all” are different things.

This connects to the CLI-to-AI interface shift I wrote about.
git status or docker ps are fine as CLI, but internal APIs and device control — where you don’t want the model guessing parameter shapes — benefit from MCP’s schema.

Tool calling accuracy degrades with model size and quantization

The source article lists Qwen2.5, Llama 3.3, Mistral Nemo, and Gemma 3 as tool-calling candidates.
Ollama’s official tool calling examples also use qwen3, making Qwen a natural starting point.

But tool calling requires more than “the model returns plausible JSON.”
Argument types, required fields, nested objects, multi-tool calls, and incorporating previous tool results all need to work reliably.

When I loaded Qwen3.6 on Ollama, generation speed was fine.
But with thinking-token-heavy models, adding MCP tool definitions and tool results stretches total conversation latency significantly.

Smaller or more aggressively quantized models drop closing brackets in their JSON or omit required arguments.
This isn’t Ollama-specific — it’s a classic weakness of function calling on local LLMs.
For anything beyond experimentation, test with your exact model tag, quantization level, context length, and number of MCP server tools.
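
A cheap way to do that is a repeat-call smoke test against /api/chat: send the same request with your real tool definitions a few dozen times and count how often a well-formed call comes back. A sketch, with a placeholder model tag and a hypothetical list_files tool:

const tools = [{
  type: "function",
  function: {
    name: "list_files",
    description: "List files under a directory",
    parameters: {
      type: "object",
      properties: { dir: { type: "string" } },
      required: ["dir"],
    },
  },
}];

const N = 20;
let usable = 0;
for (let i = 0; i < N; i++) {
  const res = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    body: JSON.stringify({
      model: "qwen2.5:14b",
      messages: [{ role: "user", content: "What markdown files are in ~/notes?" }],
      tools,
      stream: false,
    }),
  });
  const { message } = await res.json();
  const call = message.tool_calls?.[0];
  // "Usable" here means: the expected tool was chosen and its required argument is a string.
  if (call?.function?.name === "list_files" && typeof call.function.arguments?.dir === "string") usable++;
}
console.log(`usable tool calls: ${usable}/${N}`);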

More servers means more context consumed

With local LLMs, context pressure from MCP hits harder than with cloud models.
Each MCP server adds tool names, descriptions, and JSON Schema to the model prompt.
Tool results also accumulate in conversation history.

What Claude or GPT can carry casually in 128k or 200k context behaves differently in a 7B–14B model.
A filesystem server returning a large directory listing slows everything after it.
Search servers returning full document text do the same.
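
One mitigation is clamping tool results before they re-enter the prompt. A naive sketch, using a rough four-characters-per-token heuristic rather than a real tokenizer:

// Cap what a single tool result contributes to the next prompt.
function truncateToolResult(text: string, maxTokens = 1000): string {
  const maxChars = maxTokens * 4; // crude chars-per-token estimate, not a tokenizer
  if (text.length <= maxChars) return text;
  return text.slice(0, maxChars) + `\n…[truncated ${text.length - maxChars} chars]`;
}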

In my audit of 50 open-source MCP servers, I covered how adding servers expands attack surface.
With local LLMs, you also get context and latency expansion.
Enabling multiple servers at once is worse than scoping to what the current task needs.

“Local means safe” is also a misunderstanding.
MCP servers run as Node.js or Python processes locally, but vulnerabilities in their code turn model input into an attack vector.
The Flowise CustomMCP CVSS 10.0 RCE is a textbook case — input to the MCP server reached code execution without validation.
Arguments from Ollama’s tool calls pass through the same path, so hallucinated or malicious prompt content can still land.

Before adding a server, check whether its tool definitions include input validation in the JSON Schema and whether it runs as an isolated stdio subprocess.
The 50-server audit covers security patterns in more depth.

Writing your own MCP server

I’ve written about using MCP servers on this blog before, but never about building one.
For a personal server you don’t plan to publish, the official SDKs get you running in a few dozen lines.

The TypeScript SDK’s high-level McpServer API lets you define tools and handlers directly.

import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({
  name: "local-note-search",
  version: "0.1.0",
});

server.tool(
  "search_notes",
  { query: z.string() },
  async ({ query }) => {
    const { execFileSync } = await import("node:child_process");
    try {
      // grep -rl: list files under ~/notes whose contents match the query
      const out = execFileSync("grep", ["-rl", query, `${process.env.HOME}/notes`], {
        encoding: "utf-8",
        timeout: 5000,
      });
      return { content: [{ type: "text", text: out.trim() }] };
    } catch {
      // grep exits non-zero when nothing matches, which execFileSync turns into a throw
      return { content: [{ type: "text", text: "no matches" }] };
    }
  }
);

const transport = new StdioServerTransport();
await server.connect(transport);

server.tool() takes a tool name, Zod schema, and handler.
StdioServerTransport uses stdin/stdout for JSON-RPC, so the server is launched as a subprocess by the MCP host.
No need to set up your own HTTP server.

The Python SDK’s FastMCP is even shorter.

from mcp.server.fastmcp import FastMCP
import subprocess

mcp = FastMCP("local-note-search")

@mcp.tool()
def search_notes(query: str) -> str:
    """Search local notes with grep"""
    r = subprocess.run(["grep", "-rl", query, "/Users/me/notes"],
                       capture_output=True, text=True, timeout=5)
    return r.stdout.strip() or "no matches"

mcp.run()

The decorator registers tools, and type hints auto-generate JSON Schema.
Both run after installing the SDK (npm install @modelcontextprotocol/sdk zod, or pip install mcp) — just add the server’s command and args to your MCP host config.

Getting to “it works” is surprisingly easy, but sustained use is another matter.
Input validation, timeouts, and result size limits need manual implementation.
Via Ollama, hallucinated arguments hit your handler, so even the grep example above should restrict paths.
The vulnerability patterns from the previous section apply to custom servers equally.
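
For the grep handler above, that could mean tightening the Zod schema and pinning the search root. These constraints are illustrative, not a complete hardening:

import { z } from "zod";

// Tighter input schema for search_notes: bounded length, no control characters.
const searchNotesInput = {
  query: z.string().min(1).max(200).regex(/^\P{Cc}+$/u, "no control characters"),
};

// Pin the search root so a hallucinated path argument cannot widen the scope;
// the 5-second timeout in the handler stays as the execution guard.
const NOTES_ROOT = `${process.env.HOME}/notes`;

// Pass searchNotesInput as the schema argument to server.tool() and use NOTES_ROOT in the grep call.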

For unpublished personal servers, testing can be informal — iterate while running.
A local LLM plus custom server completes the tool chain without any cloud API, which suits offline or privacy-focused setups.

Building one reveals the weight of multi-tool servers

Writing a single server.tool() makes it tangible that the tool name, argument JSON Schema, and description all ship to the model together.
One tool’s schema is small, but imagine a typical published server with “filesystem + git + shell + search + browser” — tool definitions alone become hundreds of tokens.

The Zod schema in server.tool() is converted to JSON Schema by the SDK, returned from tools/list, and packed by the MCP host into the model’s system message or tool definitions.
Ten tools means ten schemas and descriptions sent every turn.
At 128k or 200k context this is noise, but at 8k or 32k on Ollama it starts crowding out the actual conversation.
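
You can put a rough number on it by measuring the tool definitions you actually send, again with a crude characters-per-token heuristic:

// Estimate how much of the context window tool definitions alone consume.
function estimateToolOverhead(tools: unknown[], contextTokens = 8192): string {
  const tokens = Math.ceil(JSON.stringify(tools).length / 4); // rough heuristic, not a tokenizer
  const pct = ((tokens / contextTokens) * 100).toFixed(1);
  return `${tools.length} tools ≈ ${tokens} tokens (${pct}% of a ${contextTokens}-token window)`;
}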

One multi-tool server, or single-tool servers enabled on demand? Once you’ve built one yourself, the lightness of the latter becomes concrete.
A one-tool-one-server split means unused schemas never reach the model.
Under local LLM context constraints, that gap directly affects response quality.