
NVIDIA NIM opens free hosted inference across 100+ models on an OpenAI-compatible endpoint that OpenClaw and Cursor plug into directly

Ikesan

A post making the rounds on X led with the classic provocation: “Why is nobody talking about this?”
The claim was that NVIDIA is giving away fully free hosted inference for about 80 models through build.nvidia.com — MiniMax M2.7, GLM 5.1, Kimi 2.5, DeepSeek 3.2, GPT-OSS-120B, Sarvam-M, and more.

Setup is a few lines, and since it is OpenAI API compatible you can drop it straight into OpenClaw, OpenCode, Zed IDE, the Hermes agent, or Cursor IDE.
I went back through NIM to sort out what is actually “free” and where the limits start biting.

What NIM’s free inference API actually is

Since 2024, NVIDIA has been shipping NIM (NVIDIA Inference Microservices) as a brand that covers both inference containers and hosted inference APIs.
The build.nvidia.com portal that is getting attention now is the “hosted inference catalog” half of that offering.

Sign in, join the NVIDIA Developer Program, and you can generate an API key that starts with nvapi-.
With that key, hitting https://integrate.api.nvidia.com/v1 returns OpenAI Chat Completions–compatible responses from whatever model is listed in the catalog.

The catalog holds 100+ models as of April 2026, spanning a very wide set of categories.
The “around 80” in the original tweet is roughly the count of general-purpose LLMs individual users can invoke from signup — add embeddings, ASR, and simulation models and you clear 100 without much effort.

| Area | Representative models |
|---|---|
| General LLMs | NVIDIA Nemotron family, Meta Llama 3/4 family, Mistral family, Google Gemma family |
| Chinese LLMs | DeepSeek-V3.2, Moonshot Kimi-K2.5, Zhipu GLM-5 family, MiniMax-M2.5/M2.7 |
| Open research LLMs | OpenAI GPT-OSS-20B/120B, Sarvam-M (India-focused multilingual) |
| Multimodal | Nemotron Nano VL, Llama 3.2-Vision, various VLMs |
| Speech | Nemotron-ASR, Riva, Parakeet-family ASR/TTS |
| Embedding/Retriever | NV-EmbedQA, Reranker, NV-EmbedCode |
| Biology/physics sim | BioNeMo-related (proteins, genomics), Cosmos-family world models |

The “80–100 free models” claim is not an exaggeration.
MiniMax M2.7 really does have its own endpoint at build.nvidia.com/minimaxai/minimax-m2.7, with the official snippet distributed in plain OpenAI SDK style.

Minimal setup

The OpenAI Python SDK works as-is. Just point base_url and api_key at NVIDIA.

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],  # nvapi-... issued from build.nvidia.com
)

response = client.chat.completions.create(
    model="minimaxai/minimax-m2.7",
    messages=[{"role": "user", "content": "Summarize the end of the universe in 300 words"}],
    temperature=0.6,
    max_tokens=1024,
    stream=False,
)

print(response.choices[0].message.content)

JS/TS is the same story — the openai package, LangChain, LlamaIndex, and most OpenAI-compatible clients work by swapping baseURL and nothing else.

The model field is specified as <provider>/<model-name>.
minimaxai/minimax-m2.7, zhipuai/glm-5.1, moonshotai/kimi-k2.5, deepseek-ai/deepseek-v3.2, openai/gpt-oss-120b, sarvamai/sarvam-m — providers are kept under their own namespaces.

What “free” actually includes (credits and rate limits)

Here is where the fine print starts.

| Item | Value |
|---|---|
| Credits granted at signup | 1,000 inference credits |
| Upper limit via increase request | Around 5,000 credits |
| Rate limit | Roughly 40 requests/min per model |
| Enterprise track | 90-day AI Enterprise trial (for self-hosting) |

“Credits” are token-consumption based, and the cost per request depends on model size and type.
Lightweight models like Nemotron Nano let you play for a long time, but heavyweights like Kimi K2.5 or GLM-5.1 eat through credits surprisingly fast.

40 req/min is a prototyping budget.
It is not enough to run a production chatbot, and the Developer Forum is full of rate-limit-increase requests — treat this as a fair-use tier.
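Under a fair-use budget like this you will eventually see HTTP 429s, so it pays to wrap calls in retry logic. A generic retry-with-backoff sketch (deliberately not tied to any particular SDK's exception types, which vary):

```python
import random
import time

def with_backoff(fn, max_tries=5, base_delay=1.0):
    """Call fn(); if it raises an error whose message mentions 429,
    sleep with exponential backoff plus jitter and retry.
    Re-raises on other errors or when retries are exhausted."""
    for attempt in range(max_tries):
        try:
            return fn()
        except Exception as exc:
            if "429" not in str(exc) or attempt == max_tries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)

# Usage (hypothetical): with_backoff(lambda: client.chat.completions.create(...))
```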

And fundamentally, this is NVIDIA’s sales funnel.

flowchart LR
    A["build.nvidia.com<br/>Try hosted inference"] --> B["Pull the NIM container<br/>for the model you liked<br/>onto your own<br/>DGX / Hopper / Blackwell"]
    B --> C["In production, sign<br/>NVIDIA AI Enterprise<br/>with paid support"]
    style A fill:#1e3a8a,color:#fff
    style C fill:#166534,color:#fff

Hosted inference is the entry point. The funnel ends in NVIDIA hardware and licensing — a freemium design.

Wiring OpenClaw / OpenCode / Cursor / Zed to NIM

Because it is OpenAI-compatible, anything that lets you swap in an OpenAI-compatible endpoint will work.

OpenClaw

OpenClaw’s main implementations have been Anthropic-targeted, but forks that read OPENAI_BASE_URL and OPENAI_API_KEY have been multiplying.
After Anthropic locked third-party harnesses out of subscription usage, users looking to move their code-generation backend away from Anthropic have been flipping to NVIDIA NIM or OpenRouter, and NIM rode that wave.

Open-weight models on NIM are not drop-in replacements for Claude Sonnet 4.6 or Opus 4.7.
For long-running code-generation tasks, picking a model strong on long-horizon work like GLM-5.1 or Kimi K2.5, or leaning on DeepSeek-V3.2 for completion-heavy workflows, is the realistic pattern.

OpenCode

OpenCode, the OSS coding agent, already lets you declare OpenAI-compatible providers in ~/.config/opencode/config.json.
Adding NVIDIA NIM is a single block, so the fit is good.
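A sketch of what that block might look like. The exact keys depend on your OpenCode version, so treat the names below as illustrative and check the current OpenCode provider docs:

```json
{
  "provider": {
    "nvidia": {
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "https://integrate.api.nvidia.com/v1",
        "apiKey": "{env:NVIDIA_API_KEY}"
      },
      "models": {
        "minimaxai/minimax-m2.7": {}
      }
    }
  }
}
```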

OpenCode itself had to remove its OAuth integration after a legal demand from Anthropic in March.
Since then it has been steering away from tight coupling with official SDKs and toward diversifying backends across open-weight models — OpenAI-compatible hosts like NIM are a tailwind.

Cursor

Cursor 3 went all-in on an agent-first IDE, and its Models settings let you specify an OpenAI-compatible base URL directly.
Plug in https://integrate.api.nvidia.com/v1 with nvapi-... and the editor’s completion, chat, and agent execution backends flip to NIM-hosted models.

The 40 req/min ceiling evaporates immediately under Cursor autocomplete, so a realistic split is “completion from another provider, agent execution on NIM.”
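If you do route a chatty workload at NIM, a client-side throttle keeps you under the per-model cap proactively instead of reacting to 429s. A minimal single-threaded interval throttle (my own sketch; production setups often use a token bucket per model):

```python
import time

class MinIntervalThrottle:
    """Ensure successive acquire() calls are at least
    60/rate_per_min seconds apart. Single-threaded sketch;
    clock/sleep are injectable for testing."""
    def __init__(self, rate_per_min: float):
        self.interval = 60.0 / rate_per_min
        self.last = float("-inf")

    def acquire(self, now=None, sleep=time.sleep, clock=time.monotonic):
        now = clock() if now is None else now
        wait = self.last + self.interval - now
        if wait > 0:
            sleep(wait)   # block until the interval has elapsed
            now += wait
        self.last = now
        return now

# Usage: throttle = MinIntervalThrottle(40); throttle.acquire() before each request.
```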

Zed IDE / Hermes agent

Zed exposes OpenAI-compatible providers through assistant.providers.
General-purpose tool-calling agents like Hermes drop in too, as long as they are built on the OpenAI Chat Completions API.

Function-calling quality on NIM varies by model.
The Nemotron family and GPT-OSS family follow the OpenAI tools format fairly cleanly, while MoonshotAI’s models lean on their own tagging style — expect to tweak prompt templates on the harness side.
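For reference, the OpenAI Chat Completions tools format those models are expected to emit against looks like this (the get_weather function is a made-up example, not a NIM-provided tool):

```python
# A tool definition in the OpenAI Chat Completions "tools" format.
# get_weather is a hypothetical function for illustration only.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Return current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

# Passed as: client.chat.completions.create(model=..., messages=..., tools=tools)
```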

How NIM compares to other freemium LLM APIs

NIM is not the first free OpenAI-compatible LLM API.
In Japan, Sakura’s AI Engine offers a monthly free tier of 3,000 requests covering Kimi-K2.5 and gpt-oss-120b, and AWS exposes DeepSeek and Mistral through Bedrock’s Mantle engine as an OpenAI-compatible API.
NIM is the hyperscaler-scale entry in that lineage.

| Provider | Endpoint | Free tier | Strength |
|---|---|---|---|
| NVIDIA NIM | integrate.api.nvidia.com/v1 | 1,000–5,000 inference credits, 40 req/min | Sheer model count (100+), funnel to self-hosted NIM |
| Sakura AI Engine | OpenAI-compatible | 3,000 requests/month | Japan-domestic, closed-network ready |
| Amazon Bedrock Mantle | OpenAI-compatible | Depends on AWS credits | IAM, Projects API, tight AWS integration |
| OpenRouter | OpenAI-compatible | Free models per provider | Provider abstraction, fallback |

The picking logic is roughly this.

  • Want to try as many models as possible → NIM
  • Bound by Japan data residency → Sakura
  • Production already on AWS → Bedrock Mantle
  • Need one API across many providers → OpenRouter

Is “lock in and start building now” actually right?

The original X post closes with “free inference, lock it in and start building now, anon. Thank me later.”
The free-lunch energy is real, but for anything operational you have to factor in:

  • Production will die on credit exhaustion. The path is prototype → self-hosted NIM → AI Enterprise eventually.
  • Model names change constantly. MiniMax M2.5 → M2.7, Kimi K2 → K2.5 → K2.6 — recommended models shift on a monthly cadence. If you don’t parameterize the model string via env vars, you will feel it.
  • Rate limits move unilaterally. The Developer Forum is full of “increase my limit” threads; assume fair-use.
  • Safety filters sit on NVIDIA’s side. Raw open-weight model behavior and NIM-routed behavior don’t fully line up.
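The model-name churn in particular is cheap to insure against: read the model string from an environment variable with a default instead of hard-coding it. A sketch (the variable name NIM_MODEL is my own choice):

```python
import os

# Default to whatever the currently recommended model is;
# NIM_MODEL is an arbitrary env-var name chosen for this sketch.
DEFAULT_MODEL = "minimaxai/minimax-m2.7"

def resolve_model() -> str:
    return os.environ.get("NIM_MODEL", DEFAULT_MODEL)

# e.g. NIM_MODEL=moonshotai/kimi-k2.5 python app.py swaps models
# without touching code.
print(resolve_model())
```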

Even with those caveats, “100+ models behind one API key, for free” is genuinely strong.
Not having to open a new cloud account every time a new model drops is a real quality-of-life win for casual experiments.


One aside — most of the services people on X insist “nobody is talking about” have been around for a while. They only register as “usable” now because agent harnesses have finally converged on accepting arbitrary OpenAI-compatible base URLs.
NIM is a textbook example of that.