Sarvam 30B/105B: India’s first open‑source LLM built end‑to‑end domestically
On March 6, 2026, the Indian AI startup Sarvam AI open‑sourced Sarvam 30B and Sarvam 105B. The standout point is that the entire pipeline was completed within India: pretraining, SFT, and RL all ran on domestic compute (infrastructure provided by the IndiaAI mission).
Rather than a simple fine‑tune or a derivative of an existing model, this is India’s first competitive open LLM developed end‑to‑end—from architecture and data to the inference stack.
Model Architecture
Both models are based on a Mixture‑of‑Experts (MoE) transformer. MoE grows the total parameter count while holding per‑token compute roughly constant: only a small subset of experts runs for each token, so capacity increases without pushing inference cost out of a practical range.
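The compute/capacity split can be seen in a toy top‑k routing sketch (illustrative only, not Sarvam's actual router): per token, only `k` experts execute, so compute tracks the active parameters, while memory must still hold every expert's weights, i.e. the total parameters.

```python
# Toy sketch of MoE top-k routing (illustrative; not Sarvam's code).
def route_token(gate_scores, k=2):
    """Pick the top-k experts for one token and normalize their gate weights."""
    top = sorted(range(len(gate_scores)),
                 key=lambda e: gate_scores[e], reverse=True)[:k]
    total = sum(gate_scores[e] for e in top)
    return [(e, gate_scores[e] / total) for e in top]

# Five experts stand in for the 128 in the table; only 2 run for this token.
chosen = route_token([0.3, 0.9, 0.1, 0.7, 0.2], k=2)
print([e for e, _ in chosen])  # → [1, 3]
```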
| Item | Sarvam 30B | Sarvam 105B |
|---|---|---|
| Total parameters | 30B | 105B |
| Active parameters | 2.4B | - |
| Attention type | GQA | MLA |
| Number of experts | 128 (sparse) | 128 (sparse) |
| Pretraining tokens | 16T | 12T |
| Primary use | Real‑time conversation | Composite reasoning and agents |
Looking at attention specifically, the 30B model adopts GQA (Grouped Query Attention) to reduce KV‑cache memory, while the 105B model goes with MLA (Multi‑head Latent Attention). MLA—used by DeepSeek—is a compressed‑attention method that further lowers inference memory requirements for long contexts.
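A back‑of‑envelope comparison shows why the attention choice matters for serving cost. The layer count, head counts, and the 512‑dim MLA latent below are assumed round numbers for illustration, not Sarvam's published configuration.

```python
# Rough per-token KV-cache size (illustrative parameters, not Sarvam's config).
# GQA shrinks the cache by sharing KV heads across query-head groups;
# MLA stores one compressed latent vector per layer instead of full K/V heads.
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    return layers * kv_heads * head_dim * 2 * bytes_per_elem  # K and V

layers, head_dim = 48, 128
mha = kv_bytes_per_token(layers, kv_heads=32, head_dim=head_dim)  # full multi-head
gqa = kv_bytes_per_token(layers, kv_heads=8,  head_dim=head_dim)  # 4:1 sharing
mla = layers * 512 * 2  # assumed 512-dim latent per layer, BF16

print(mha // 1024, gqa // 1024, mla // 1024)  # → 768 192 48 (KiB per token)
```

At a 128K context those per‑token differences multiply into tens of gigabytes, which is why the long‑context 105B model opts for MLA.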
Training Pipeline Design
The pretraining corpus spans code, the web, mathematics, and multilingual content. A notable design choice is using a sigmoid function for the routing score. Compared to the softmax that’s common in MoE, sigmoid yields more stable expert load balancing and reduces the risk of routing collapse during training.
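The difference is easy to see in a toy comparison (not Sarvam's router): softmax normalizes scores across all experts, so raising one expert's logit suppresses every other expert's routing probability, while sigmoid scores each expert independently.

```python
# Softmax couples expert scores; sigmoid scores each expert independently.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

logits  = [2.0, 1.0, 0.5]
boosted = [4.0, 1.0, 0.5]  # raise only expert 0's logit

print(softmax(logits)[1] > softmax(boosted)[1])   # → True: expert 1 loses mass
print(sigmoid(logits[1]) == sigmoid(boosted[1]))  # → True: expert 1 unchanged
```

That independence is what makes load balancing more forgiving: one dominant expert cannot starve the others of routing score by construction.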
The RL pipeline uses an asynchronous GRPO (Group Relative Policy Optimization) architecture, separating generation, reward computation, and policy updates to sustain throughput at MoE scale. KL regularization (a KL‑divergence penalty against a reference model) is intentionally omitted to avoid optimization conflicts between reward maximization and policy anchoring. Reward shaping encourages structured reasoning, concise answers, and correct tool use; the team reports no reward collapse.
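The "group relative" part of GRPO can be sketched in a few lines: rewards for a group of completions sampled from the same prompt are normalized within the group, and the resulting advantage drives the policy update with no KL term, matching the omission described above. This is a simplified illustration, not Sarvam's pipeline.

```python
# Simplified GRPO group-relative advantage (no KL term, per the article).
import statistics

def group_advantages(rewards):
    """Z-score rewards within one prompt's group of sampled completions."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against flat-reward groups
    return [(r - mean) / std for r in rewards]

rewards = [1.0, 0.0, 0.5, 0.5]  # e.g. pass/fail plus partial-credit rewards
adv = group_advantages(rewards)
print([round(a, 2) for a in adv])  # → [1.41, -1.41, 0.0, 0.0]
```

Because advantages are relative within a group, no learned value model is needed, which keeps the asynchronous pipeline simple.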
SFT data includes large volumes of agent traces generated from simulation environments and real‑world repositories, deliberately training capabilities for tool calls, environment reasoning, and multi‑step decision‑making.
Benchmarks
Sarvam 105B
| Benchmark | Sarvam-105B | GPT-OSS-120B | Qwen3-Next-80B | GLM-4.5-Air |
|---|---|---|---|---|
| Math500 | 98.6 | 97.0 | 98.2 | 97.2 |
| LiveCodeBench v6 | 71.7 | 72.3 | 68.7 | 59.5 |
| MMLU | 90.6 | 90.0 | 90.0 | 87.3 |
| GPQA Diamond | 78.7 | 80.1 | 77.2 | 75.0 |
| BrowseComp | 49.5 | - | 38.0 | 21.3 |
| Tau2 (avg) | 68.3 | 65.8 | 55.0 | 53.2 |
| AIME 25 (with tools) | 88.3 (96.7) | 90.0 | 87.8 | 83.3 |
The BrowseComp score of 49.5 and Tau2 score of 68.3 stand out. Tau2 evaluates agent‑style long‑horizon reasoning and task completion, where Sarvam 105B tops the comparison set.
Against current frontier models of similar scale (~100B), Sarvam 105B is competitive. It stands shoulder to shoulder with DeepSeek R1 0528 (AIME25 87.5, HMMT Feb 82.5) and, with 49.5 on BrowseComp, dramatically outperforms R1's score of 3.2.
Sarvam 30B
With just 2.4B active parameters in an efficient MoE setup, it reaches Math500 97.0, MMLU 85.1, AIME25 80.0 (96.7 with tools), and LiveCodeBench 70.0—competitive at the 30B class internationally.
Measured on H100, per‑GPU throughput is 3–6× higher than a Qwen3 baseline; on L40S the advantage is 1.5–3×. On a MacBook Pro M3 with MXFP4 mixed‑precision inference, token generation is 20–40% faster.
Indian Languages and Tokenizer
Supported languages cover the 22 constitutionally recognized languages under the Eighth Schedule of the Indian Constitution, plus English—for a total of 23. Japanese, Chinese, French, and similar languages are out of scope; this is not a general multilingual model. As the company positions it—“sovereign AI for India”—the model targets the Indian market.
India’s 22 scheduled languages span 12 distinct writing systems. Sarvam built a custom tokenizer that supports all of them.
The evaluation metric is the fertility score (average tokens per word); lower is better. It reportedly outperforms other tokenizers especially on low‑resource languages like Odia, Santali, and Manipuri (Meitei). Efficient tokenization for Indian languages directly affects inference cost and latency, translating into tangible cost deltas for India‑focused deployments.
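Concretely, fertility is just tokens divided by words. The `fertility` helper and the toy token splits below are hypothetical, but they show why a script‑aware tokenizer lowers the score:

```python
# Fertility score = average tokens per word; lower means fewer tokens (and
# less inference cost) for the same text. Toy splits stand in for real BPE.
def fertility(words_tokenized):
    """words_tokenized: one list of tokens per word."""
    total_tokens = sum(len(toks) for toks in words_tokenized)
    return total_tokens / len(words_tokenized)

# A tokenizer that shatters each word vs one that keeps words whole.
fragmented = [["na", "ma", "ste"], ["du", "ni", "ya"]]
efficient  = [["namaste"], ["duniya"]]
print(fertility(fragmented), fertility(efficient))  # → 3.0 1.0
```

A fertility of 3.0 instead of 1.0 means roughly triple the tokens per request, so both billing and latency scale accordingly.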
Indian‑language benchmarks were measured via an LLM‑as‑judge protocol. Sarvam 105B achieves a 90% average win rate across dimensions and 84% in STEM/math/coding. Sarvam 30B scores 89% overall and 87% for STEM and related areas.
Release
Weights are distributed on Hugging Face (30B, 105B) and on AI Kosh, an India‑based model hub. Sample inference code for Transformers, vLLM, and SGLang is available on the Hugging Face pages.
Sarvam 30B is already in production in the company’s conversational agent platform Samvaad, while Sarvam 105B powers Indus for composite reasoning and agent workflows.
Hardware Requirements and Local Execution
MoE models are often misunderstood as “lightweight because the active parameters are small,” but inference still needs to keep all expert weights in memory. Active parameters govern compute (speed), while memory consumption scales with total parameters.
Weight Size and Required VRAM
| Model | BF16 weights | Context length |
|---|---|---|
| Sarvam 30B | ~60 GB | 64K (65,536) tokens |
| Sarvam 105B | ~212 GB | 128K tokens |
The figures above are only for weights. Real inference requires additional memory for KV cache and activations, and KV usage grows with longer contexts.
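The weight figures are easy to reproduce: BF16 stores 2 bytes per parameter, so the footprint is simply total parameters times two. This is a sketch only; real usage adds KV cache, activations, and framework overhead on top.

```python
# Weight-only memory footprint for BF16 inference (2 bytes per parameter).
# Total parameters set this footprint even though only the active subset
# of experts runs per token.
def weight_gb(total_params_b, bytes_per_param=2):
    return total_params_b * 1e9 * bytes_per_param / 1e9

print(weight_gb(30))   # → 60.0, matching the ~60 GB in the table
print(weight_gb(105))  # → 210.0, close to the ~212 GB figure
```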
Running Sarvam 30B
Among MoE models, 30B is relatively lightweight and, with quantization, falls within reach of personal GPUs.
The community has published GGUF‑quantized builds, with file sizes by quantization level as follows.
| Quantization | File size | Suggested VRAM |
|---|---|---|
| BF16 | ~64 GB | ≥ 64 GB |
| Q8_0 | ~34 GB | ≥ 40 GB |
| Q6_K | ~26 GB | 32 GB |
| Q4_K_M | ~19 GB | 24 GB |
Q4_K_M fits on a single 24 GB VRAM GPU (RTX 4090, RTX 5090, etc.). With Q6_K, an RTX 5090 with 32 GB VRAM can offload all layers to the GPU.
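These file sizes can be roughly sanity‑checked from the approximate effective bits per weight of each llama.cpp quant type (about 8.5 for Q8_0, 6.56 for Q6_K, 4.8 for Q4_K_M). Actual files come out somewhat larger because of metadata and tensors kept at higher precision.

```python
# Estimate GGUF file size from approximate effective bits per weight.
# The bits-per-weight values are approximate community figures for
# llama.cpp quant types, not exact file-format constants.
def gguf_gb(params_b, bits_per_weight):
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

for name, bits in [("Q8_0", 8.5), ("Q6_K", 6.56), ("Q4_K_M", 4.8)]:
    print(name, gguf_gb(30, bits))
```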
However, support for the sarvam_moe architecture in llama.cpp had not yet been merged as of March 2026. PR #20275 is in progress; until it lands, you need to build from a custom branch. Ollama support also depends on the upstream merge.
vLLM inference isn’t officially supported either; you need Sarvam AI’s fork or a hotpatch_vllm.py‑based patch. For SGLang, the Hugging Face page provides a sample that runs with tp=2 (tensor parallel across 2 GPUs).
Inference has been confirmed on an RTX 5090 (32 GB VRAM, CUDA 13.0) with Q8_0, Q6_K, and Q4_K_M quantizations.
Running Sarvam 105B
At ~212 GB BF16 just for weights, local execution is impractical for individuals.
| Setup | GPU |
|---|---|
| vLLM (official sample) | tp=8 (e.g., H100 80 GB ×8) |
| SGLang (official sample) | tp=4 (e.g., A100 80 GB ×4) |
A GGUF build was not available as of March 2026; some provide FP8 Dynamic variants (e.g., RedHatAI). Even FP8 weighs in at ~106 GB, implying at least two 80 GB GPUs.
For individuals, renting cloud GPU instances is the practical route. Providers like RunPod, Lambda Labs, and Vast.ai offer multi‑GPU H100/A100 instances billed by the hour.
If You Want to Try It Locally
The most realistic near‑term path is to run the Sarvam 30B Q4_K_M quantized build on an RTX 4090/5090. Because a custom llama.cpp build is required, you’ll first need a build environment (CMake, CUDA toolkit).
```shell
# Build from the custom branch
git clone -b add-sarvam-moe https://github.com/sumitchatterjee13/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Run inference
./build/bin/llama-cli -m sarvam-30B-Q4_K_M.gguf -p "Hello" -n 512 -ngl 99
```
Once llama.cpp merges the PR and Ollama adds support, it should become as simple as ollama run sarvam-30b. It’s worth keeping an eye on PR #20275.