
Chroma Context-1 achieves search performance on par with frontier LLMs using only 20B parameters

Ikesan

The standard implementation of RAG is a single-pass structure: retrieve similar documents with a single query, pack them into context, and hand them to the LLM. As I mentioned in "The story of building an in-house help desk with Dify," in practice the flow is to improve accuracy with GraphRAG and reranking. However, the fundamental weakness of the single pass remains: it cannot structurally handle multi-hop questions, where the next query only becomes clear after reading a particular document.

Multi-hop search can be solved by using a frontier LLM as an agent. As NeMo Retriever's Agentic RAG demonstrated with the ReAct loop, the LLM iteratively refines its search query, collecting the necessary information step by step. However, the cost and latency of frontier models are too heavy for production use.

Context-1, released by Chroma on March 26, 2026, directly addresses this cost issue. It is a search-specialized agent model with 20B parameters, based on gpt-oss-20b, whose self-editing context management mechanism was trained with RL. The weights are openly available under the Apache 2.0 license.

What is a self-editing context?

As the agent repeats searches, retrieved documents keep accumulating in the context window. Even if the window itself is widened, as with Claude's 1M context, the problem of accuracy degrading as relevant content gets buried in noise (context rot) does not go away. Several approaches to this problem have emerged.

| Approach | Typical example | Mechanism |
| --- | --- | --- |
| External proxy | Compresr Context Gateway | Inserts a proxy between the agent and the LLM API, reducing tokens with look-ahead summarization and tool-output compression |
| Window expansion | Claude 1M, Gemini 2M | Enlarges the context window itself to absorb accumulation |
| Model built-in | Context-1 | The model itself actively deletes unnecessary passages |

Context-1 is the third, "model built-in," type. The agent has four tools at its disposal.

| Tool | Description |
| --- | --- |
| `search_corpus(query)` | Hybrid BM25 + dense-vector search; after fusion with RRF, returns top chunks through a reranker |
| `grep_corpus(pattern)` | Regex search; returns up to 5 chunks |
| `read_document(doc_id)` | Reads the entire document |
| `prune_chunks(chunk_ids)` | Removes the specified chunks from context |

BM25, used by search_corpus, is a classical ranking function over an inverted index, an evolution of TF-IDF. The hybrid configuration with dense-vector search is widely adopted in current RAG systems: BM25 picks up exact keyword matches while dense vectors cover semantic similarity. RRF (Reciprocal Rank Fusion) combines multiple ranked result lists via reciprocal ranks, and is the same algorithm NeMo Retriever's pipeline uses as a fallback.
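As a concrete illustration of the fusion step, here is a minimal RRF sketch. The document IDs and the constant k=60 are illustrative defaults, not Context-1's actual parameters.

```python
# Minimal sketch of Reciprocal Rank Fusion (RRF): each list contributes
# 1 / (k + rank) per document, and documents are sorted by the sum.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7"]    # exact keyword matches
dense_hits = ["d1", "d5", "d3"]   # semantically similar chunks
fused = rrf([bm25_hits, dense_hits])  # "d1" ranks first: high in both lists
```

Documents that appear in both lists accumulate score from each, which is why a mid-ranked item in two lists can beat the top item of a single list.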

The fourth tool, prune_chunks, is the core. It keeps the context window clean by deleting chunks judged unnecessary over repeated searches. During training, the post-pruning chunks are included in the reward calculation, so deletion accuracy is itself a target of optimization.
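A toy sketch of what self-editing looks like from the context's point of view; the chunk structure and IDs are invented for illustration, not Context-1's internal representation.

```python
# Hypothetical working context: a list of retrieved chunks the agent
# carries between turns.
context = [
    {"id": "c1", "text": "relevant passage"},
    {"id": "c2", "text": "off-topic boilerplate"},
    {"id": "c3", "text": "another relevant passage"},
]

def prune_chunks(context, chunk_ids):
    """Drop the listed chunks so later turns see a cleaner window."""
    drop = set(chunk_ids)
    return [c for c in context if c["id"] not in drop]

context = prune_chunks(context, ["c2"])  # only c1 and c3 remain
```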

```mermaid
flowchart TD
    A[Query input] --> B[Gather information with<br/>search_corpus / grep_corpus]
    B --> C{Enough information<br/>gathered?}
    C -- Insufficient --> D[Read details with<br/>read_document]
    D --> E[Delete unneeded chunks<br/>with prune_chunks]
    E --> F[Append token usage<br/>to trajectory]
    F --> B
    C -- Sufficient --> G[Generate final answer]
    H[Soft threshold exceeded] -.-> E
    I[Hard cutoff reached] -.-> J[Reject tool calls<br/>other than prune]
```

Token budget management is built in. Every turn, the usage is appended to the trajectory (`[Token usage: 14,203/32,768]`); when a soft threshold is exceeded, the model is prompted to prune or generate a final answer, and at the hard cutoff all tool calls other than prune are rejected.
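The policy above can be sketched as a simple turn-level check. The 32,768 budget comes from the article's example; the soft-threshold ratio is an assumption.

```python
# Sketch of the per-turn budget policy: log usage, nudge toward pruning
# past a soft threshold, allow only prune calls past the hard cutoff.
HARD_CUTOFF = 32_768
SOFT_THRESHOLD = int(HARD_CUTOFF * 0.8)  # assumed ratio

def budget_step(trajectory, tokens_used, requested_tool):
    trajectory.append(f"[Token usage: {tokens_used:,}/{HARD_CUTOFF:,}]")
    if tokens_used >= HARD_CUTOFF and requested_tool != "prune_chunks":
        return "reject"        # only pruning is allowed now
    if tokens_used >= SOFT_THRESHOLD:
        return "prompt_prune"  # ask the model to prune or answer
    return "allow"

traj = []
assert budget_step(traj, 14_203, "search_corpus") == "allow"
assert budget_step(traj, 30_000, "read_document") == "prompt_prune"
assert budget_step(traj, 33_000, "read_document") == "reject"
```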

Synthetic data generation pipeline

The key to the design is scalable task generation that does not rely on manual annotation. Data is generated in 4 domains.

| Domain | Data source | Scale |
| --- | --- | --- |
| Web | Collected with Serper and Jina AI Reader, seeded from Wikipedia titles | Large |
| Finance | 2025 SEC filings from 1,707 companies (10-K, 20-F) | Large |
| Legal | USPTO patent publications, January 2026 (1,500 items); §102/§103 rejection citations | Medium |
| Email | Epstein files (released November 2025) + Enron emails, 984 unique threads | 396,510 chunks after augmentation |

The email domain is special: the Epstein corpus alone is too small, so the data was augmented by substituting names and dates in Enron emails. This grew the chunk count from 1,366 to 396,510.
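A toy sketch of this kind of entity-substitution augmentation; the template, names, and dates here are invented, and the article does not describe the exact substitution scheme.

```python
# Mint new email variants by swapping a name and a date in a template.
import itertools

template = "From: Alice Smith\nDate: 2001-05-14\nRe: Q2 forecast"
names = ["Alice Smith", "Bob Jones", "Carol Wu"]
dates = ["2001-05-14", "2000-11-02", "2001-08-30"]

# Cartesian product: 3 names x 3 dates = 9 variants from one template.
augmented = [
    template.replace("Alice Smith", n).replace("2001-05-14", d)
    for n, d in itertools.product(names, dates)
]
```

Even a small substitution vocabulary multiplies a corpus quickly, which is how 1,366 chunks can balloon into hundreds of thousands.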

Extraction-based verification is used to ensure the quality of task generation: an LLM extracts both verbatim quotations from the supporting documents (`document_quotes`) and the corresponding passages from the clues (`clue_quotes`), normalizes both, and checks for matches. The match rate against manual labels is 84.40% for Web, 93% for Finance, and 87.5% for Email.
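The check reduces to normalize-then-compare. The specific normalization steps below (lowercasing, stripping punctuation, collapsing whitespace) are assumptions; the article only says the quotes are normalized and matched.

```python
# Sketch of extraction-based verification: a task passes if every
# clue quote, after normalization, matches some document quote.
import re

def normalize(s):
    s = s.lower()
    s = re.sub(r"[^\w\s]", "", s)  # drop punctuation
    return " ".join(s.split())     # collapse whitespace

def quotes_match(document_quotes, clue_quotes):
    doc = {normalize(q) for q in document_quotes}
    return all(normalize(q) in doc for q in clue_quotes)

ok = quotes_match(
    ["Revenue grew 12% in FY2025."],
    ["revenue grew 12% in fy2025"],
)  # True: the quotes agree up to case and punctuation
```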

Reinforcement learning design

SFT stage

SFT trajectories are generated by running the agent loop with a large model (such as Kimi K2.5). Failed trajectories are also included in the training data, on the judgment that distributional coverage matters more than accuracy. The filtering criteria are as follows.

  • Trajectory recall > 50% and output recall > 40% → used in full
  • Low-recall trajectories included on a probabilistic basis
  • Zero-recall trajectories capped at 5% (to expose failure modes)
  • Trajectories where trajectory recall far exceeds output recall are excluded (to avoid reinforcing the gap between exploration and selection)
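The criteria above can be sketched as a filter function. The 50%/40% thresholds come from the article; the exclusion gap and the sampling probabilities for low- and zero-recall trajectories are assumptions.

```python
# Sketch of the SFT trajectory filter. Thresholds marked "assumed"
# are illustrative; the article does not publish them.
import random

def keep_trajectory(traj_recall, out_recall, rng=random.random):
    if traj_recall - out_recall > 0.4:   # assumed gap threshold:
        return False                     # exploration >> selection
    if traj_recall > 0.5 and out_recall > 0.4:
        return True                      # full use
    if traj_recall == 0.0:
        return rng() < 0.05              # cap zero-recall at ~5%
    return rng() < 0.3                   # assumed low-recall rate
```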

RL stage: CISPO algorithm

The base model is gpt-oss-20b (with LoRA applied). The algorithm is CISPO, a variant of GRPO.

As summarized in Comparison of 16 open source RL libraries, the current mainstream of RL for LLMs is the GRPO family. GRPO updates the policy from relative reward differences within a group and, unlike PPO, needs no critic model, which makes it computationally efficient. CISPO in Context-1 is a variant that differs in how it handles importance-sampling weights outside the clipping range: where GRPO zeroes the gradient, CISPO scales it with a detached coefficient. This lets rare token sequences, such as pruning decisions and query rewrites, still contribute to learning.
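A simplified one-token view of that difference, assuming PPO-style clipping for the GRPO side. Here r is the importance-sampling ratio π_new/π_old for one token, A its group-normalized advantage, and the returned value is the gradient with respect to log π (up to sign convention); the full per-sequence objectives are more involved.

```python
# One-token sketch: gradient contribution under GRPO-style clipping
# vs. CISPO's detached clipped coefficient.
def grpo_token_grad(r, A, eps=0.2):
    # min(r*A, clip(r)*A): when the clipped branch is active,
    # the gradient through log-pi is zero.
    clipped = max(1 - eps, min(1 + eps, r))
    if r * A <= clipped * A:
        return r * A   # unclipped branch active
    return 0.0         # clipped branch active: token learns nothing

def cispo_token_grad(r, A, eps=0.2):
    # The clipped ratio is detached and just scales the REINFORCE
    # gradient, so the token always contributes.
    clipped = max(1 - eps, min(1 + eps, r))
    return clipped * A

# A rare token (ratio far above the clip range) still learns in CISPO:
rare = (grpo_token_grad(3.0, 1.0), cispo_token_grad(3.0, 1.0))
```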

Dr. GRPO's unbiased loss and DAPO's clip-higher were also tried, but CISPO was reported to be the most stable.

Each training step samples 128 queries and rolls each out in 8 independent environments (1,024 trajectories/step). Steps where all 8 rollouts receive the same reward are discarded, because within-group normalization makes the gradient zero.
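The reason uniform-reward groups are useless is mechanical: group-normalized advantages are computed as (reward − group mean) / group std, so identical rewards give all-zero advantages and no learning signal.

```python
# Group-normalized advantages, as used by the GRPO family.
def group_advantages(rewards):
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    if std == 0.0:
        return None  # all rollouts tied: discard, no gradient
    return [(r - mean) / std for r in rewards]

assert group_advantages([1.0] * 8) is None  # step would be discarded
advs = group_advantages([1.0, 0.0, 1.0, 0.0])  # mixed outcomes learn
```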

The reward consists of five components.

| Reward component | Contents |
| --- | --- |
| Outcome reward | Fβ score (β initially weights recall 16×, annealed across epochs to 4×) |
| Process reward | Trajectory recall (was the correct answer reached during exploration?) |
| Final-answer bonus | +1.0 for retrieving a chunk containing the correct answer |
| Repeated-pruning penalty | −0.1 per call beyond 3 consecutive prune calls (capped at 0.5) |
| Turn-count penalty | Increases linearly from 0 to 0.5 between 64 and 128 turns |
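A sketch that combines the five components. The Fβ formula is the standard one; how the components are actually weighted and summed, and the treatment of penalties as simple subtractions, are assumptions.

```python
# Hypothetical composition of the five reward components above.
def f_beta(precision, recall, beta):
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def total_reward(precision, recall, traj_recall, answer_found,
                 consecutive_prunes, turns, beta=16.0):
    r = f_beta(precision, recall, beta)            # outcome reward
    r += traj_recall                               # process reward
    r += 1.0 if answer_found else 0.0              # final-answer bonus
    r -= min(0.5, 0.1 * max(0, consecutive_prunes - 3))   # prune penalty
    r -= min(0.5, max(0, turns - 64) / 64 * 0.5)   # turn penalty, 64->128
    return r
```

With β = 16, recall dominates: f_beta(0.5, 1.0, 16) is above 0.99, which is exactly why β is annealed down once the model reliably finds the answer.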

Training uses a two-axis curriculum: a difficulty curriculum in which Phase 1 covers low-difficulty (few-hop) tasks and Phase 2 high-difficulty (multi-hop) tasks, and a reward curriculum in which the β of Fβ is annealed across epochs from recall-weighted to precision-weighted.

Training ran roughly 5 epochs and 300 steps, with convergence observed around step 230. The search infrastructure replicated collections within Chroma Cloud, serving a peak of 3,000+ queries/sec.

Comparison with base model

The numbers below show how RL changed the agent's behavior.

| Indicator | gpt-oss-20b | Context-1 | Change |
| --- | --- | --- | --- |
| Parallel tool calls / turn | 1.52 | 2.56 | +68% |
| Turns / trajectory | 6.7 | 5.2 | −22% |
| Pruning accuracy | 0.824 | 0.941 | +14% |

Fewer turns, higher pruning accuracy, more parallel searches: RL makes the agent more efficient, gathering more information in fewer steps and discarding the unneeded parts with high precision.

A direct comparison for a single task (Web domain task 1_0) is as follows:

| Indicator | gpt-oss-20b | Context-1 |
| --- | --- | --- |
| Trajectory recall | 0.640 | 0.739 |
| Output recall | 0.361 | 0.641 |
| F1 | 0.307 | 0.487 |
| Final answer found | 0.541 | 0.798 |

Benchmark comparison with Frontier model

Context-1's "4x" is a configuration that combines four rollouts using RRF; because the rollouts run in parallel, it is positioned as "cheaper than a single API call to a frontier model."

Key comparisons on the generated benchmark, measured as Final Answer Found (Web domain restricted to difficulty level 2 and above).

| Model | Web | Finance | Legal | Email |
| --- | --- | --- | --- | --- |
| Context-1 (4x) | 0.97 | 0.82 | 0.95 | 0.98 |
| Context-1 (1x) | 0.88 | 0.64 | 0.89 | 0.92 |
| gpt-5.4 | 0.97 | 0.67 | 0.95 | 0.97 |
| sonnet-4.6 | 0.96 | 0.72 | 0.91 | 0.97 |
| opus-4.6 | 0.98 | 0.84 | 0.94 | 0.98 |
| gemini-3.1-pro | 0.97 | 0.82 | 0.88 | 0.94 |

Although the email domain was not included in the training data, performance improved significantly, confirming generalization. The slightly lower Finance accuracy is attributed to early termination under token-budget constraints; it recovers to 0.82 in the 200k-context / no-prune configuration.

The same trend appears on public benchmarks (BrowseComp+, FRAMES, HotpotQA): Context-1 (4x) nearly saturates HotpotQA and scores 0.96 on FRAMES, on par with frontier models.

Inference is served with vLLM. With the MoE layers quantized to MXFP4 on an NVIDIA B200, throughput reaches 400 to 500 tok/s.