Microsoft Ships Foundry Local GA: A ~20MB Native Library That Bundles Local AI Into Any App
The local LLM runtime space is crowded and moving fast.
Ollama switched to an MLX backend and dramatically improved inference speed on Apple Silicon, AMD’s Lemonade arrived as a multi-modal server that bundles GPU and NPU backends, and LM Studio keeps gaining traction with its GUI-first approach (I used it on my EVO-X2).
But all of these are tools that end users install and launch themselves.
If you’re an app developer who wants to embed local inference into your product, you’re stuck with two options: tell users “please install Ollama first” in your README, or link ONNX Runtime or llama.cpp directly and implement model management yourself.
The former hurts user experience. The latter is expensive to build.
Microsoft’s newly GA’d “Foundry Local” fills this gap.
Add a package from your language’s package manager, and your app ships with AI inference built in — no additional setup required from users.
What “Bundle” Actually Means
“You can bundle it into your app” is vague, so here’s what concretely happens.
Foundry Local is distributed through standard package managers:
| Language | Source |
|---|---|
| C# | NuGet (Microsoft.AI.Foundry.Local) |
| Python | PyPI (foundry-local) |
| JavaScript / TypeScript | npm |
| Rust | crates.io |
For C#, running dotnet add package pulls in the NuGet package, and at build time, the Foundry Local Core native libraries (.dll / .so / .dylib) and ONNX Runtime binaries are copied into the output directory.
This native library adds roughly 20MB — negligible impact on installer size.
Python uses pip install, JavaScript uses npm install, same mechanism.
Models are not bundled with the runtime.
On first launch, the optimal variant for the machine’s hardware is automatically downloaded from the Foundry Catalog and cached locally.
Subsequent runs need no network.
Download resume is supported.
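As a sketch, this download-and-cache lifecycle is also exposed through the Python SDK (method names per the foundry-local SDK docs; the "phi-4-mini" alias is a placeholder, and a running Foundry Local install is assumed):

```python
from foundry_local import FoundryLocalManager

# Start (or attach to) the local service without loading a model yet
manager = FoundryLocalManager()

# Inspect what the catalog offers and what is already cached on this machine
catalog = manager.list_catalog_models()
cached = manager.list_cached_models()
print(f"{len(catalog)} models in catalog, {len(cached)} cached locally")

# Download fetches the variant matching this machine's hardware;
# an interrupted download resumes on the next call
manager.download_model("phi-4-mini")
manager.load_model("phi-4-mini")
```

After the first `download_model` call completes, subsequent launches find the model in the local cache and skip the network entirely.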
Here’s how the setup flow compares to Ollama and LM Studio:
```mermaid
graph TD
    subgraph "Ollama approach"
        A1[User installs<br/>Ollama] --> A2[Starts ollama serve]
        A2 --> A3[App connects<br/>via HTTP]
    end
    subgraph "Foundry Local approach"
        B1[Developer adds<br/>SDK package] --> B2[Build bundles<br/>native library]
        B2 --> B3[User installs<br/>the app]
        B3 --> B4[AI inference<br/>ready immediately]
    end
```
The key difference: in the Foundry Local approach, the user’s workflow has zero AI-related steps.
Install the app, and the inference environment is ready.
When I previously shipped NDLOCR-Lite on iOS with ONNX Runtime Mobile, I had to handle model acquisition, conversion, and optimization all manually.
Foundry Local absorbs that entire burden — the runtime handles model download and hardware selection automatically.
Think of it as a desktop/cross-platform version of that workflow, with the manual steps removed.
In-Process Native Library
Foundry Local’s architecture is fundamentally different from Ollama or LM Studio.
Ollama runs as an independent HTTP server (daemon), and apps communicate via HTTP requests to localhost.
This makes model sharing across apps easy, but forces users to manage a separate process.
Foundry Local runs as an in-process native library.
It loads directly into the app’s process, and communication happens through SDK function calls.
No HTTP overhead, no daemon management.
```mermaid
graph TD
    A[Application] --> B[Foundry Local SDK<br/>Python / C# / JS / Rust]
    B --> C[Foundry Local Core<br/>Native Library<br/>.dll / .so / .dylib]
    C --> D[ONNX Runtime]
    C --> E[Model Manager<br/>Download, Cache, Load]
    D --> F{Hardware Detection}
    F -->|Windows| G[Windows ML<br/>GPU / NPU / CPU auto-select]
    F -->|macOS| H[Metal<br/>Apple Silicon GPU]
    F -->|Linux| I[CPU fallback]
```
“Foundry Local Core” is the heart of this architecture, managing the full lifecycle: model download, load, inference, and unload.
The inference engine underneath is ONNX Runtime.
For development and debugging, there’s also a mode that starts Foundry Local Core as an OpenAI-compatible HTTP server.
Useful for connecting LangChain or OpenAI SDK HTTP clients directly, but in-process SDK usage is recommended for production.
Hardware Acceleration
Hardware acceleration is what makes or breaks local inference performance.
After fighting with VRAM allocation on Strix Halo and comparing Vulkan vs ROCm on Lemonade, I can say that choosing the right backend for the right hardware is a genuine pain point for developers.
Foundry Local delegates this choice to Windows ML (on Windows) or Metal (on macOS), shielding developers from hardware differences.
| Platform | Acceleration | Mechanism |
|---|---|---|
| Windows | GPU / NPU / CPU auto-select | Windows ML detects hardware and picks the optimal ONNX execution provider |
| Windows (Snapdragon X / XDNA2) | NPU | Via vendor NPU execution providers (QNN on Snapdragon X; Vitis AI on AMD XDNA2) |
| macOS (Apple Silicon) | GPU | Via Metal |
| Linux x64 | CPU | Fallback (GPU support on the roadmap) |
Windows ML acts as an OS-level hardware abstraction layer.
The app doesn’t need to know whether it’s running on GPU, NPU, or CPU — Windows ML picks the right ONNX execution provider (DirectML, CUDA, QNN, etc.) automatically.
Execution providers are distributed via Windows Update, so new hardware driver support arrives without app changes.
Models are also compiled into the optimal format for the detected hardware on first download and cached.
The same model uses different binaries depending on whether it targets GPU or CPU.
Not having to worry about per-backend performance differences the way I did with AMD Lemonade is a relief for developers — but the flip side is that there’s no room for fine-grained tuning.
You can’t explicitly choose Vulkan vs ROCm or specify quantization methods like you can with llama.cpp.
Technical Comparison With Ollama and llama.cpp
When viewed as local LLM execution environments, each targets a different layer.
| Aspect | Ollama / LM Studio | llama.cpp (direct) | Foundry Local |
|---|---|---|---|
| Target user | End users running inference themselves | Developers embedding an inference engine | Developers bundling AI into shipped apps |
| Deployment | User installs separately | Library link or subprocess | Included via package manager |
| Process model | Standalone daemon (HTTP) | In-process or server | In-process (native library) |
| Inference engine | llama.cpp / MLX | llama.cpp | ONNX Runtime |
| Model format | GGUF (flexible quantization) | GGUF (flexible quantization) | ONNX (from Foundry Catalog) |
| Model freedom | Any model | Any model | Catalog models only |
| Hardware selection | Manual or backend-dependent | Explicit via build flags | Auto via Windows ML / Metal |
| Package size | Hundreds of MB+ | Library itself is lightweight | ~20MB (excluding models) |
Microsoft claims “ONNX Runtime is on average 3.9x faster than llama.cpp, and up to 13.4x faster for long sequences.”
These numbers come from ONNX-optimized models, though — not a direct comparison with GGUF quantized models. Take them with appropriate context.
The biggest trade-off is model freedom.
With Ollama or LM Studio, you can download any GGUF from Hugging Face and run it.
Foundry Local only supports models registered in the Foundry Catalog, currently limited to these families:
- Phi (Microsoft, small and performant)
- Qwen family
- DeepSeek
- Mistral
- Whisper (speech-to-text)
- GPT OSS variants
The major open models are covered, but niche or fine-tuned models won’t work.
If you want to run custom models like I did on my EVO-X2, Foundry Local isn’t the right tool.
Resource Requirements
| Requirement | Minimum | Recommended |
|---|---|---|
| RAM | 8 GB | 16 GB |
| Free disk | 3 GB | 15 GB |
On machines with less than 16GB RAM, avoid models larger than 7B parameters.
Models consume RAM (or VRAM) while loaded, and you need to explicitly unload them to reclaim memory.
Like when I tuned VRAM allocation on Strix Halo, unified memory systems are sensitive to how VRAM and main memory are split.
Foundry Local delegates hardware selection to Windows ML, so manual tuning isn’t available.
On the other hand, it’s designed to run reasonably well on machines whose users don’t know how to allocate VRAM manually.
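Reclaiming that memory explicitly might look like this (a sketch against the Python SDK; `unload_model` is taken from the SDK docs, and the alias is a placeholder):

```python
from foundry_local import FoundryLocalManager

alias = "phi-4-mini"
manager = FoundryLocalManager(alias)  # downloads (if needed) and loads the model

try:
    # ... run inference while the model is resident in RAM / VRAM ...
    pass
finally:
    # Explicitly unload to return the model's memory to the system;
    # otherwise it stays resident for the lifetime of the service
    manager.unload_model(alias)
```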
Developer API
Foundry Local provides an OpenAI-compatible API with chat completions and transcription endpoints.
Existing code written against the OpenAI SDK can switch to local inference with minimal changes.
```python
from foundry_local import FoundryLocalManager
import openai

# Starts the local service and downloads/loads the model if needed
manager = FoundryLocalManager("phi-4-mini")

# Works directly with the OpenAI client
client = openai.OpenAI(base_url=manager.endpoint, api_key=manager.api_key)
response = client.chat.completions.create(
    # The catalog id differs per hardware variant, so resolve it from the alias
    model=manager.get_model_info("phi-4-mini").id,
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```
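Because the endpoint is OpenAI-compatible, streaming works the same way it does against the cloud API (a sketch under the same assumptions as above; `stream=True` is standard OpenAI client usage):

```python
from foundry_local import FoundryLocalManager
import openai

manager = FoundryLocalManager("phi-4-mini")
client = openai.OpenAI(base_url=manager.endpoint, api_key=manager.api_key)

# stream=True yields tokens as they are generated -- useful for the
# interactive, low-latency UI cases that local inference targets
stream = client.chat.completions.create(
    model=manager.get_model_info("phi-4-mini").id,
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```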
SDKs are available for Python, JavaScript (TypeScript), C#, and Rust.
For C#, there’s also a dedicated Microsoft.AI.Foundry.Local.WinML package with tighter Windows ML integration.
When to Use It
Foundry Local makes sense when these conditions overlap:
| Condition | Why |
|---|---|
| Can’t ask users to set up AI tools | In enterprise desktop or consumer apps, “please install Ollama” is too high a barrier. Everything must be in the installer |
| Can’t send data to the cloud | Summarizing internal documents, processing medical data, classifying legal documents — privacy and compliance requirements rule out cloud APIs |
| Offline support needed | Factories, ships, remote fieldwork — environments without reliable connectivity. After the initial download, it runs offline |
| Latency matters | No network round-trip means faster responses. Suited for real-time assistants and interactive UI features |
Conversely, Foundry Local is unnecessary in these cases:
| Case | Why |
|---|---|
| Users are technical | If they can install Ollama or LM Studio themselves, just assume an existing Ollama install and talk to its API |
| Model freedom matters | The GGUF ecosystem (Ollama / llama.cpp) offers far more choices |
| Fine-grained GPU tuning needed | Use llama.cpp directly |
| Web or server-side use case | Cloud APIs or server-side Ollama is more straightforward |
Roadmap
Enhancements announced at GA:
- Real-time speech transcription (Whisper streaming)
- GPU acceleration on Linux
- Broader NPU support
- Expanded model catalog
- Shared model cache across apps (so multiple apps don’t re-download the same model)
The GitHub repository is at microsoft/Foundry-Local.
Documentation is available on Microsoft Learn.
Worth noting: Azure AI Foundry is a separate cloud platform despite the similar name.
Foundry Local runs entirely on-device with no cloud dependency and requires no Azure subscription.