
NVIDIA NeMo Retriever's Agentic RAG ranks first in ViDoRe v3

Ikesan

The NVIDIA NeMo Retriever team announced an agentic search pipeline that combines an LLM agent with a retriever. "Agentic" refers to an approach in which an LLM autonomously repeats cycles of decision and action; in search, it means not finishing with a single query, but examining the results and deciding on the next query. The pipeline took first place in ViDoRe v3 (a visual document retrieval benchmark) with NDCG@10 of 69.22 and second place in BRIGHT (a reasoning-intensive retrieval benchmark) with NDCG@10 of 50.90, demonstrating versatility across domains with the same pipeline.

Why semantic similarity is not enough

Traditional dense retrieval ranks documents by semantic similarity, a method that measures the "closeness of meaning" of texts by the distance between their vectors. Even when keywords don't match, semantically similar documents are still hits, so it is far smarter than simple keyword matching. But it has limits: accuracy tends to drop on multi-hop queries (those that require chaining multiple pieces of information, such as "In what year was the university that author A belongs to founded?") and on queries with ambiguous intent. The differences between search methods are summarized in "Comparison of search methods" in the second half of the article.

LLMs, on the other hand, are good at reasoning but cannot process millions of documents at once.

Agentic search is an architecture that compensates for the weaknesses of both: the search engine processes a large document corpus at high speed, while the LLM uses reasoning to continually refine the search queries.

Pipeline with ReACT loop

The heart of the pipeline is a ReACT (Reasoning + Acting) loop. The LLM agent uses the search engine as a tool and iteratively refines its queries.

```mermaid
flowchart TD
    A[Query input] --> B[LLM plans<br/>think]
    B --> C[Run search<br/>retrieve: query, top_k]
    C --> D{Enough information?}
    D -- No --> E[Refine the query<br/>rephrase / decompose / narrow]
    E --> C
    D -- Yes --> F[Generate final answer<br/>final_results]
    B2[Step limit or context length exceeded] --> G[Fall back to RRF<br/>rank across the whole trajectory]
    G --> F
```
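The loop can be sketched in a few lines. This is an illustrative sketch only: the `llm`/`retriever` interfaces, the step limit, and the simplified fallback are assumptions, not the real NeMo Retriever API.

```python
# Illustrative ReACT-style retrieval loop (not the NeMo Retriever API).
def agentic_search(query, llm, retriever, max_steps=5):
    trajectory = []  # (query, results) for every step, kept for the fallback
    current_query = query
    for _ in range(max_steps):
        results = retriever(current_query, top_k=10)  # act: run the search tool
        trajectory.append((current_query, results))
        decision = llm.decide(query, trajectory)      # think: is this enough?
        if decision["sufficient"]:
            return decision["final_results"]
        current_query = decision["refined_query"]     # rephrase / decompose / narrow
    # Step limit reached: fall back to fusing the whole trajectory
    # (dedup-by-first-seen here; the real pipeline uses RRF for this)
    seen, fused = set(), []
    for _, results in trajectory:
        for doc in results:
            if doc not in seen:
                seen.add(doc)
                fused.append(doc)
    return fused
```

The key design point is that the whole trajectory is retained, so even a failed loop leaves usable ranked evidence behind.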

The actions performed by the agent can be summarized as follows.

| Operation | Description |
| --- | --- |
| Dynamic query adjustment | Modify subsequent queries based on newly discovered information |
| Continuous rephrasing | Keep reshaping the query until useful information is found |
| Complexity decomposition | Break a multipart query into multiple simple queries |

If the loop hits its step limit, the pipeline falls back to RRF (Reciprocal Rank Fusion: an algorithm that merges multiple ranked result lists using the reciprocal of each rank). Because RRF ranks documents across the agent's entire trajectory, accuracy does not plummet when the limit is exceeded.
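RRF itself is simple enough to show concretely. A minimal sketch follows; the constant `k = 60` is the conventional default from the RRF literature, not a value stated in the article.

```python
# Reciprocal Rank Fusion: merge several ranked lists into a single ranking.
# score(d) = sum over lists of 1 / (k + rank of d in that list)
def reciprocal_rank_fusion(ranked_lists, k=60):
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # A document that appears near the top of many lists scores highest
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "a"], ["b", "d"]])
print(fused[0])  # → b (top-ranked in two of the three lists)
```

Because RRF only uses ranks, not raw scores, it can fuse result lists from searches whose scores are not comparable, which is exactly the situation across an agent's trajectory.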

Thread-safe singleton design

Initially, the retriever was implemented as an MCP server (Model Context Protocol: a standard protocol for connecting tools and data sources to LLMs), but several problems arose: server start/stop overhead, network round-trip latency, GPU/memory management complexity, and concurrency bottlenecks. Cumulatively, these reduced experimental throughput.

The remedy was an in-process, thread-safe singleton. The model and corpus embeddings are loaded only once at startup, and all accesses are protected with reentrant locks (RLocks), so multiple threads can access them simultaneously.

```python
# Conceptual singleton structure
import threading

class RetrieverSingleton:
    _instance = None
    _lock = threading.RLock()  # reentrant lock

    @classmethod
    def get_instance(cls):
        with cls._lock:
            if cls._instance is None:
                cls._instance = cls._create()  # initialize only once
            return cls._instance
```

Experiment throughput was significantly improved because there was no network overhead and GPU VRAM could be shared.
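The core property is easy to verify standalone: however many threads race on first access, exactly one instance is ever created. The sketch below restates the conceptual class with a trivial stand-in for the model/corpus load.

```python
import threading

# Restated conceptual singleton so this snippet runs standalone.
class RetrieverSingleton:
    _instance = None
    _lock = threading.RLock()  # reentrant lock

    @classmethod
    def get_instance(cls):
        with cls._lock:
            if cls._instance is None:
                cls._instance = object()  # stand-in for the real model/corpus load
            return cls._instance

# Hammer get_instance from many threads; all must observe the same object.
results = []
threads = [threading.Thread(target=lambda: results.append(RetrieverSingleton.get_instance()))
           for _ in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert len(set(map(id, results))) == 1  # exactly one instance was created
```

Without the lock, two threads could both observe `_instance is None` and load the model twice, which with multi-gigabyte embeddings means doubled VRAM and wasted startup time.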

Benchmark results

ViDoRe v3

ViDoRe v3 is a search accuracy benchmark for visual documents (PDFs containing images, charts, tables, etc.).

| Pipeline | NDCG@10 |
| --- | --- |
| NeMo Agentic (Opus 4.5 + colembed-vl-8b-v2) | 69.22 (#1) |
| Dense Retrieval (colembed-vl-8b-v2) | 64.36 |
| INF-X-Retriever (nemotron specialized tuning) | 62.31 |

BRIGHT

BRIGHT is a benchmark for complex queries that require inference.

| Pipeline | NDCG@10 |
| --- | --- |
| INF-X-Retriever | 63.40 (#1) |
| NeMo Agentic (Opus 4.5 + reasoning-3b) | 50.90 (#2) |
| Dense Retrieval (reasoning-3b) | 38.28 |

INF-X-Retriever leads BRIGHT but relies on benchmark-specific tuning (its ViDoRe v3 entry is a nemotron-specialized variant). NeMo Agentic, by contrast, achieves stable results on both benchmarks with a single pipeline; its strength is the ability to generalize without domain-specific optimization.

Detailed Ablation (ViDoRe v3)

| Agent | Embedding Model | NDCG@10 | Avg. Time | Search Calls |
| --- | --- | --- | --- | --- |
| Opus 4.5 | colembed-vl-8b-v2 | 69.22 | 136.3 s | 9.2 |
| gpt-oss-120b | colembed-vl-8b-v2 | 66.38 | 78.6 s | 2.4 |
| gpt-oss-120b | llama-nemotron-embed-vl-1b-v2 | 62.42 | 78.1 s | 2.5 |
| None (Dense) | colembed-vl-8b-v2 | 64.36 | 0.67 s | — |

Opus 4.5 has the highest accuracy, but it takes 136.3 seconds per query. Compared to Dense’s 0.67 seconds, it is 203 times slower.

On BRIGHT, the accuracy gap between Opus 4.5 and gpt-oss-120b widens to 9.52 points (versus 2.84 points on ViDoRe v3). The deeper the reasoning a task demands, the more a high-end agent is worth.

Costs and trade-offs

Running Opus 4.5 on ViDoRe v3 consumed about 760K input tokens and 6.3K output tokens per query. At the quoted Opus 4.5 pricing ($15 per million input tokens, $75 per million output tokens), that comes to roughly $11.90 per query.
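The figure follows directly from the token counts; a quick arithmetic check, using the article's token counts and quoted per-million-token rates:

```python
# Per-query cost check: token counts and rates are the article's figures.
input_tokens = 760_000
output_tokens = 6_300
cost = input_tokens / 1e6 * 15 + output_tokens / 1e6 * 75
print(f"${cost:.2f} per query")  # → $11.87 per query, matching the ~$11.90 cited
```

Note that the input side dominates: the output tokens contribute under 50 cents, so trimming the context the agent carries between steps is where cost reduction would come from.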

This contrasts with dense retrieval, which operates at almost zero cost. Agentic search pays off where accuracy is paramount (legal document search, medical record collation, etc.), but it is not suited to high-volume processing.

The team recognizes this trade-off and cites moving away from frontier models to distilled lightweight agents as the next step.

The implementation is published on GitHub at NeMo Retriever Library.

Comparison of search methods

This article has mentioned several search methods: "traditional search," "dense search," and "agentic search." Let's sort out the differences between them.

Keyword search (traditional search)

The most classic method: a document matches if it contains the search terms. Typical implementations are TF-IDF (scoring based on word frequency and rarity) and BM25 (an improved TF-IDF with document-length normalization). Elasticsearch's full-text search engine uses this approach.

The underlying mechanism is an inverted index (a mapping from each word to the documents in which it occurs), which makes it fast to find every document containing a given word.
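An inverted index is easy to sketch. This toy version uses naive whitespace tokenization with no stemming or scoring:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {1: "error code E-4021 on boot", 2: "boot loop after update"}
index = build_inverted_index(docs)
print(sorted(index["boot"]))  # → [1, 2]: both documents mention "boot"
```

Lookup is a dictionary access per query term, which is why keyword engines stay fast even over millions of documents.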

It delivers fast, exact keyword matching and is strong for distinctive strings such as "error code E-4021". On the other hand, it is weak with paraphrases and synonyms: a search for "how to raise a dog" will not match a document about "how to raise a pet."

Vector search (Dense Retrieval)

A method that converts text into vectors (embeddings) of hundreds to thousands of dimensions and measures similarity by the distance between vectors. Because the embedding model compresses the "meaning" of the text into a numeric vector, semantically similar texts match even when their wording differs.

```
"How to raise a dog" → [0.23, -0.81, 0.45, ...]  ─┐
                                                   ├─ cosine similarity: 0.91 (close)
"How to raise a pet" → [0.21, -0.78, 0.42, ...]  ─┘

"Car repair"         → [-0.65, 0.33, -0.12, ...] → cosine similarity: 0.12 (far)
```
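The similarity measure itself is plain cosine similarity. The sketch below computes it over toy 3-dimensional vectors (the numbers are illustrative, not outputs of a real embedding model, so the exact values differ from the diagram):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

dog = [0.23, -0.81, 0.45]   # "How to raise a dog"
pet = [0.21, -0.78, 0.42]   # "How to raise a pet"
car = [-0.65, 0.33, -0.12]  # "Car repair"

print(cosine_similarity(dog, pet) > cosine_similarity(dog, car))  # → True
```

Real embeddings have hundreds to thousands of dimensions, but the ranking logic is exactly this: nearest vectors first.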

Typical implementations include pgvector with HNSW (an approximate nearest-neighbor search algorithm) and dedicated vector DBs such as Pinecone and Weaviate.

It handles synonyms and paraphrases well and can find "semantically similar" documents. However, there is a fundamental problem: high similarity does not mean the document answers the question. As noted in the PageIndex article, a question and a document can be semantically similar without the document actually containing the answer. It also cannot handle queries that require reasoning over multiple pieces of information.

Hybrid search

A method that combines keyword search and vector search, merging both result lists (with RRF or similar) into a single ranking. This is the mainstream approach in current RAG systems.

Agentic search (method in this article)

A method that extends hybrid search further by letting an LLM agent control the entire search process. It resembles how humans search on Google: if the first query finds nothing, rephrase it; if partial information turns up, build the next query on it. The LLM automates this trial and error.

| Method | Synonyms | Reasoning | Iterative refinement | Speed | Cost |
| --- | --- | --- | --- | --- | --- |
| Keyword search | x | x | x | Fastest | Almost zero |
| Vector search | o | x | x | Fast | Low |
| Hybrid search | o | x | x | Fast | Low |
| Agentic search | o | o | o | Slow | High |

How the results change for the same query

Let’s take a look at the differences when each method processes the query “Explain the difference between the generations in which the number of CUDA cores increased the most in the evolution of NVIDIA’s GPU architecture.”

| Method | Behavior |
| --- | --- |
| Keyword search | Returns documents containing "NVIDIA", "GPU", and "CUDA core". It gets hits, but cannot handle comparison or reasoning such as "increased the most"; it returns a large pile of potentially related documents. |
| Vector search | Returns documents semantically close to GPU architecture. Documents saying "parallel processing unit" rather than "CUDA core" are also hit, but comparative reasoning is impossible, and documents about individual architectures come back piecemeal. |
| Hybrid search | Merges the keyword and vector results. Coverage widens, but the inability to reason remains. |
| Agentic search | The LLM decomposes and executes the query in stages: "search the list of GPU generations" → "search the CUDA core count for each generation" → "compute the increase" → "identify the generation with the largest increase". |

As the benchmark results in this article show, agentic search is highly accurate but costs 136 seconds and about $11.90 per query. Not every search needs an agent: in practice it makes sense to handle simple queries with keyword or vector search and route only complex queries to agentic search.
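That routing idea can be sketched with a naive complexity heuristic. The signal words and word-count threshold below are made up for illustration; a production router would more likely be a small classifier or a cheap LLM call.

```python
# Toy query router: cheap queries go to dense retrieval, complex ones to the agent.
# The signal list and threshold are illustrative assumptions, not tuned values.
REASONING_SIGNALS = ("difference", "compare", "most", "why", "between")

def route_query(query):
    words = query.lower().split()
    is_complex = len(words) > 12 or any(s in query.lower() for s in REASONING_SIGNALS)
    return "agentic" if is_complex else "dense"

print(route_query("error code E-4021"))                               # → dense
print(route_query("which generation increased CUDA cores the most"))  # → agentic
```

Even a crude router like this keeps the expensive agentic path reserved for the minority of queries that actually need multi-step reasoning.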


Previous search articles: Summary of search data structures / PageIndex: Tree RAG without vector search / AliSQL: Vector search integration into MySQL