Microsoft Ships Foundry Local GA: A ~20MB Native Library That Bundles Local AI Into Any App
The local LLM runtime space is crowded and moving fast.
Ollama switched to an MLX backend and dramatically improved inference speed on Apple Silicon, AMD’s Lemonade arrived as a multi-modal server that bundles GPU and NPU backends, and LM Studio keeps gaining traction with its GUI-first approach (I used it on my EVO-X2).
But all of these are tools that end users install and launch themselves.
If you’re an app developer who wants to embed local inference into your product, you’re stuck with two options: tell users “please install Ollama first” in your README, or link ONNX Runtime or llama.cpp directly and implement model management yourself.
The former hurts user experience. The latter is expensive to build.
Microsoft’s newly GA’d “Foundry Local” fills this gap.
Add a package from your language’s package manager, and your app ships with AI inference built in — no additional setup required from users.
What “Bundle” Actually Means
“You can bundle it into your app” is vague, so here’s what concretely happens.
Foundry Local is distributed through standard package managers:
| Language | Source |
|---|---|
| C# | NuGet (Microsoft.AI.Foundry.Local) |
| Python | PyPI (foundry-local) |
| JavaScript / TypeScript | npm |
| Rust | crates.io |
For C#, running dotnet add package pulls in the NuGet package, and at build time, the Foundry Local Core native libraries (.dll / .so / .dylib) and ONNX Runtime binaries are copied into the output directory.
This native library adds roughly 20MB — negligible impact on installer size.
Python uses pip install, JavaScript uses npm install, same mechanism.
Models are not bundled with the runtime.
On first launch, the optimal variant for the machine’s hardware is automatically downloaded from the Foundry Catalog and cached locally.
Subsequent runs need no network.
Download resume is supported.
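As a sketch, this download-and-cache lifecycle is also exposed through the Python SDK (method names per the foundry-local SDK docs; the "phi-4-mini" alias is a placeholder, and a running Foundry Local install is assumed):

```python
from foundry_local import FoundryLocalManager

# Start (or attach to) the local service without loading a model yet
manager = FoundryLocalManager()

# Inspect what the catalog offers and what is already cached on this machine
catalog = manager.list_catalog_models()
cached = manager.list_cached_models()
print(f"{len(catalog)} models in catalog, {len(cached)} cached locally")

# Download fetches the variant matching this machine's hardware;
# an interrupted download resumes on the next call
manager.download_model("phi-4-mini")
manager.load_model("phi-4-mini")
```

After the first `download_model` call completes, subsequent launches find the model in the local cache and skip the network entirely.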
Here’s how the setup flow compares to Ollama and LM Studio:
```mermaid
graph TD
    subgraph "Ollama approach"
        A1[User installs<br/>Ollama] --> A2[Starts ollama serve]
        A2 --> A3[App connects<br/>via HTTP]
    end
    subgraph "Foundry Local approach"
        B1[Developer adds<br/>SDK package] --> B2[Build bundles<br/>native library]
        B2 --> B3[User installs<br/>the app]
        B3 --> B4[AI inference<br/>ready immediately]
    end
```
The key difference: in the Foundry Local approach, the user’s workflow has zero AI-related steps.
Install the app, and the inference environment is ready.
When I previously shipped NDLOCR-Lite on iOS with ONNX Runtime Mobile, I had to handle model acquisition, conversion, and optimization all manually.
Foundry Local absorbs that entire burden — the runtime handles model download and hardware selection automatically.
Think of it as a desktop/cross-platform version of that workflow, with the manual steps removed.
In-Process Native Library
Foundry Local’s architecture is fundamentally different from Ollama or LM Studio.
Ollama runs as an independent HTTP server (daemon), and apps communicate via HTTP requests to localhost.
This makes model sharing across apps easy, but forces users to manage a separate process.
Foundry Local runs as an in-process native library.
It loads directly into the app’s process, and communication happens through SDK function calls.
No HTTP overhead, no daemon management.
```mermaid
graph TD
    A[Application] --> B[Foundry Local SDK<br/>Python / C# / JS / Rust]
    B --> C[Foundry Local Core<br/>Native Library<br/>.dll / .so / .dylib]
    C --> D[ONNX Runtime]
    C --> E[Model Manager<br/>Download, Cache, Load]
    D --> F{Hardware Detection}
    F -->|Windows| G[Windows ML<br/>GPU / NPU / CPU auto-select]
    F -->|macOS| H[Metal<br/>Apple Silicon GPU]
    F -->|Linux| I[CPU fallback]
```
“Foundry Local Core” is the heart of this architecture, managing the full lifecycle: model download, load, inference, and unload.
The inference engine underneath is ONNX Runtime.
For development and debugging, there’s also a mode that starts Foundry Local Core as an OpenAI-compatible HTTP server.
Useful for connecting LangChain or OpenAI SDK HTTP clients directly, but in-process SDK usage is recommended for production.
Hardware Acceleration
Hardware acceleration is what makes or breaks local inference performance.
After fighting with VRAM allocation on Strix Halo and comparing Vulkan vs ROCm on Lemonade, I can say that choosing the right backend for the right hardware is a genuine pain point for developers.
Foundry Local delegates this choice to Windows ML (on Windows) or Metal (on macOS), shielding developers from hardware differences.
| Platform | Acceleration | Mechanism |
|---|---|---|
| Windows | GPU / NPU / CPU auto-select | Windows ML detects hardware and picks the optimal ONNX execution provider |
| Windows (Snapdragon X / XDNA2) | NPU | Via vendor NPU execution providers (QNN on Snapdragon X; Vitis AI on AMD XDNA2) |
| macOS (Apple Silicon) | GPU | Via Metal |
| Linux x64 | CPU | Fallback (GPU support on the roadmap) |
Windows ML acts as an OS-level hardware abstraction layer.
The app doesn’t need to know whether it’s running on GPU, NPU, or CPU — Windows ML picks the right ONNX execution provider (DirectML, CUDA, QNN, etc.) automatically.
Execution providers are distributed via Windows Update, so new hardware driver support arrives without app changes.
Models are also compiled into the optimal format for the detected hardware on first download and cached.
The same model uses different binaries depending on whether it targets GPU or CPU.
Not having to worry about per-backend performance differences the way I did with AMD Lemonade is a relief for developers — but the flip side is that there’s no room for fine-grained tuning.
You can’t explicitly choose Vulkan vs ROCm or specify quantization methods like you can with llama.cpp.
Technical Comparison With Ollama and llama.cpp
When viewed as local LLM execution environments, each targets a different layer.
| Aspect | Ollama / LM Studio | llama.cpp (direct) | Foundry Local |
|---|---|---|---|
| Target user | End users running inference themselves | Developers embedding an inference engine | Developers bundling AI into shipped apps |
| Deployment | User installs separately | Library link or subprocess | Included via package manager |
| Process model | Standalone daemon (HTTP) | In-process or server | In-process (native library) |
| Inference engine | llama.cpp / MLX | llama.cpp | ONNX Runtime |
| Model format | GGUF (flexible quantization) | GGUF (flexible quantization) | ONNX (from Foundry Catalog) |
| Model freedom | Any model | Any model | Catalog models only |
| Hardware selection | Manual or backend-dependent | Explicit via build flags | Auto via Windows ML / Metal |
| Package size | Hundreds of MB+ | Library itself is lightweight | ~20MB (excluding models) |
Microsoft claims “ONNX Runtime is on average 3.9x faster than llama.cpp, and up to 13.4x faster for long sequences.”
These numbers come from ONNX-optimized models, though — not a direct comparison with GGUF quantized models. Take them with appropriate context.
The biggest trade-off is model freedom.
With Ollama or LM Studio, you can download any GGUF from Hugging Face and run it.
Foundry Local only supports models registered in the Foundry Catalog, currently limited to these families:
- Phi (Microsoft, small and performant)
- Qwen family
- DeepSeek
- Mistral
- Whisper (speech-to-text)
- GPT OSS variants
The major open models are covered, but niche or fine-tuned models won’t work.
If you want to run custom models like I did on my EVO-X2, Foundry Local isn’t the right tool.
Resource Requirements
| Requirement | Minimum | Recommended |
|---|---|---|
| RAM | 8 GB | 16 GB |
| Free disk | 3 GB | 15 GB |
On machines with less than 16GB RAM, avoid models larger than 7B parameters.
Models consume RAM (or VRAM) while loaded, and you need to explicitly unload them to reclaim memory.
Like when I tuned VRAM allocation on Strix Halo, unified memory systems are sensitive to how VRAM and main memory are split.
Foundry Local delegates hardware selection to Windows ML, so manual tuning isn’t available.
On the other hand, it’s designed to run reasonably well on machines whose users don’t know how to allocate VRAM manually.
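Reclaiming that memory explicitly might look like this (a sketch against the Python SDK; `unload_model` is taken from the SDK docs, and the alias is a placeholder):

```python
from foundry_local import FoundryLocalManager

alias = "phi-4-mini"
manager = FoundryLocalManager(alias)  # downloads (if needed) and loads the model

try:
    # ... run inference while the model is resident in RAM / VRAM ...
    pass
finally:
    # Explicitly unload to return the model's memory to the system;
    # otherwise it stays resident for the lifetime of the service
    manager.unload_model(alias)
```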
Developer API
Foundry Local provides an OpenAI-compatible API with chat completions and transcription endpoints.
Existing code written against the OpenAI SDK can switch to local inference with minimal changes.
```python
from foundry_local import FoundryLocalManager
import openai

# Starts the local service and downloads/loads the model if needed
manager = FoundryLocalManager("phi-4-mini")

# Works directly with the OpenAI client
client = openai.OpenAI(base_url=manager.endpoint, api_key=manager.api_key)
response = client.chat.completions.create(
    # The catalog id differs per hardware variant, so resolve it from the alias
    model=manager.get_model_info("phi-4-mini").id,
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```
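Because the endpoint is OpenAI-compatible, streaming works the same way it does against the cloud API (a sketch under the same assumptions as above; `stream=True` is standard OpenAI client usage):

```python
from foundry_local import FoundryLocalManager
import openai

manager = FoundryLocalManager("phi-4-mini")
client = openai.OpenAI(base_url=manager.endpoint, api_key=manager.api_key)

# stream=True yields tokens as they are generated -- useful for the
# interactive, low-latency UI cases that local inference targets
stream = client.chat.completions.create(
    model=manager.get_model_info("phi-4-mini").id,
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```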
SDKs are available for Python, JavaScript (TypeScript), C#, and Rust.
For C#, there’s also a dedicated Microsoft.AI.Foundry.Local.WinML package with tighter Windows ML integration.
When to Use It
Foundry Local makes sense when these conditions overlap:
| Condition | Why |
|---|---|
| Can’t ask users to set up AI tools | In enterprise desktop or consumer apps, “please install Ollama” is too high a barrier. Everything must be in the installer |
| Can’t send data to the cloud | Summarizing internal documents, processing medical data, classifying legal documents — privacy and compliance requirements rule out cloud APIs |
| Offline support needed | Factories, ships, remote fieldwork — environments without reliable connectivity. After the initial download, it runs offline |
| Latency matters | No network round-trip means faster responses. Suited for real-time assistants and interactive UI features |
Conversely, Foundry Local is unnecessary in these cases:
| Case | Why |
|---|---|
| Users are technical | If they can install Ollama or LM Studio themselves, just assume an existing Ollama install and talk to its API |
| Model freedom matters | The GGUF ecosystem (Ollama / llama.cpp) offers far more choices |
| Fine-grained GPU tuning needed | Use llama.cpp directly |
| Web or server-side use case | Cloud APIs or server-side Ollama is more straightforward |
Roadmap
Enhancements announced at GA:
- Real-time speech transcription (Whisper streaming)
- GPU acceleration on Linux
- Broader NPU support
- Expanded model catalog
- Shared model cache across apps (so multiple apps don’t re-download the same model)
The GitHub repository is at microsoft/Foundry-Local.
Documentation is available on Microsoft Learn.
Worth noting: Azure AI Foundry is a separate cloud platform despite the similar name.
Foundry Local runs entirely on-device with no cloud dependency and requires no Azure subscription.