Gemini API multimodal File Search as game NPC memory: metadata filters, store tiers, and a cost estimate
Google announced on May 5, 2026 that Gemini API File Search is now multimodal.
Images and text go into the same search target, with custom metadata and page-level citation support added on top.
At face value, it is a RAG upgrade for PDFs and image-heavy internal docs.
From a game engine perspective, though, it looks like a building block for NPCs that “remember” their world.
If conversation logs, screenshots, maps, character sheets, and event progress tables can sit in the same searchable store, the way game AI is built shifts a bit.
What File Search gained
The announcement covers multimodal search that processes images and text together, a key-value metadata system for filtering at query time, and page-level citation that points to the source material behind each answer.
On the docs side, multimodal File Search requires specifying models/gemini-embedding-2 when creating a FileSearchStore.
The default text-only embedding model does not handle images, so you swap it for the multimodal one.
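As a minimal sketch, here is what store creation looks like with the google-genai Python SDK. The create call itself is documented; the field that pins the embedding model is my guess at the parameter name, so verify it against the current reference.

```python
# Minimal sketch with the google-genai SDK. file_search_stores.create is
# the documented call; the embedding-model key below is an ASSUMED field
# name based on the docs' instruction to specify models/gemini-embedding-2.
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

store = client.file_search_stores.create(
    config={
        "display_name": "npc-memory-prototype",
        "embedding_model": "models/gemini-embedding-2",  # assumed key name
    }
)
print(store.name)  # server-assigned, e.g. fileSearchStores/...
```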
Internally, File Search splits uploaded files into chunks, converts them to embeddings, and stores them in a File Search store.
Files on the File API itself expire after 48 hours, but data ingested into a File Search store persists until you delete it manually.
This distinction matters in practice. Think of the store less as a raw file dump and more as a persistent, search-optimized index.
Custom metadata lets you attach key-value pairs like author or year to a file and filter with metadata_filter at search time.
For games, candidates include chapter, stage, character ID, world state, and build number.
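Tagging happens at ingest. A sketch with the same SDK: upload_to_file_search_store and the custom_metadata shape follow the File Search docs, while the file path and the character_id / chapter / scene_id keys are this article's hypothetical scheme.

```python
# Attach game-facing metadata while ingesting a conversation log.
import time
from google import genai

client = genai.Client()
# `store` is the FileSearchStore created in the earlier sketch.

op = client.file_search_stores.upload_to_file_search_store(
    file="logs/elda_ch3_tavern.txt",  # hypothetical log file
    file_search_store_name=store.name,
    config={
        "display_name": "elda-ch3-tavern-log",
        "custom_metadata": [
            {"key": "character_id", "string_value": "elda"},
            {"key": "chapter", "numeric_value": 3},
            {"key": "scene_id", "string_value": "tavern_01"},
        ],
    },
)
while not op.done:  # ingestion runs as a long-running operation
    time.sleep(2)
    op = client.operations.get(op)
```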
Long prompts alone are not enough for NPC memory
In a previous article on AI API character settings, I tested system prompts and structured output across Gemini, Claude, and OpenAI to make game-NPC-style responses.
That was about pinning down personality.
File Search is about externalizing what an NPC knows.
The memory an NPC needs goes beyond chat history: items the player handed over, which quest got abandoned, what part of the screen the player was looking at, what happened in a past cutscene.
This information is scattered across text logs, JSON, screenshots, map images, and dialogue scripts.
Stuffing everything into a long context is possible, but an NPC that reads everything every turn is slow and expensive.
Worse, irrelevant context leads the model to pick up old conversations or state from a different stage and produce weird replies.
Searching only the relevant parts before each turn fits a game loop better.
```mermaid
graph TD
    A[Game state] --> B[Conversation logs]
    A --> C[Screenshots]
    A --> D[Maps and design docs]
    B --> E[File Search store]
    C --> E
    D --> E
    F[NPC input] --> G[Filter by chapter and character via metadata]
    G --> E
    E --> H[Retrieve relevant memories only]
    H --> I[Generate NPC response with Gemini]
```
The approach from the OCR-Memory article tackled a similar problem from a different angle.
OCR-Memory saves agent history as images and retrieves the relevant portion to reconstruct logs.
The Gemini File Search update means you can push mixed image-text search to a managed API without building that retrieval stack yourself.
Metadata does the real work when plugging into a game engine
Multimodal search is the flashy part, but metadata is what actually matters on the game side.
In RPGs and adventure games, the memories an NPC may reference change by chapter, route, affinity level, and world state.
If a chapter-3 NPC pulls up the chapter-5 reveal through search, the game is broken.
Before worrying about image search accuracy, you need boundaries on what gets searched.
File Search metadata filters let you set those boundaries before the API runs the search.
| metadata key | purpose |
|---|---|
| character_id | Isolate each NPC’s memories |
| chapter | Prevent future-event leakage |
| route | Avoid cross-branch scenario bleed |
| scene_id | Bias toward the current location or cutscene |
| build | Separate dev-build and release-build assets |
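To keep those boundaries in one place, something as small as this helps: a deliberately naive helper that composes the table’s fields into the comparison syntax metadata_filter accepts. The field names are this article’s scheme, not anything the API mandates, and real code would want proper quoting and escaping.

```python
# Build a metadata_filter string from current game state. Purely
# illustrative: field names are this article's convention, not the API's.
def memory_filter(character_id: str, chapter: int,
                  route: str, build: str) -> str:
    return " AND ".join([
        f'character_id = "{character_id}"',
        f"chapter <= {chapter}",   # block future-event leakage
        f'route = "{route}"',      # keep scenario branches separate
        f'build = "{build}"',      # dev vs release assets
    ])

# memory_filter("elda", 3, "route_a", "release")
# -> 'character_id = "elda" AND chapter <= 3 AND route = "route_a" AND build = "release"'
```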
If you are integrating this into a custom engine or Godot/Unity, avoid turning the search store into one giant memory box.
Keep event progression and character state in the game’s own canonical data; let File Search handle “pulling related reference material.”
Handing state transitions to the LLM makes reproducibility and debugging painful.
A RAG approach headed the opposite way from PageIndex
In the PageIndex article, I looked at a tree-RAG approach that skips vector search and has the LLM navigate document structure instead.
PageIndex traces which chapter to look at based on hierarchy.
Gemini File Search finds the nearest chunks via embeddings.
Game design docs and scenario scripts have properties of both.
A cleanly structured world-setting bible fits PageIndex-style tree navigation.
Once screenshots, background art, UI mockups, map fragments, and conversation logs are mixed in, hierarchy alone is not enough.
This File Search update is aimed at the latter case.
Images matter here.
Searches like “the tavern with the blue sign,” “the character I met near the red bridge,” or “the quest name shown in this UI panel” were tedious with text-only RAG because you had to preprocess every image.
Captioning images into a separate index is one option, but information is lost at the captioning step.
If File Search can natively embed images, at least for prototyping you can skip that preprocessing.
Not yet a game-ready memory substrate
The official docs list several constraints.
File Search cannot be used with the Live API and cannot currently be combined with Google Search grounding, URL Context, or other tools.
Per-file size limit is 100 MB.
Pricing has three billing points: embedding at index time, embedding at query time, and the document tokens retrieved into the model’s context.
Index-time embedding costs $0.15 per 1M tokens.
With gemini-embedding-2 for multimodal, model-side pricing applies: $0.20 per 1M tokens for text, roughly $0.00012 per image.
Query-time embedding is free. Storage is free.
Retrieved document tokens are charged at the normal model input rate.
For Gemini 2.5 Flash that is $0.30 per 1M input tokens; for 2.5 Pro, $1.25–$2.50.
Store capacity depends on your tier.
| Tier | store limit |
|---|---|
| Free | 1 GB |
| Tier 1 | 10 GB |
| Tier 2 | 100 GB |
| Tier 3 | 1 TB |
Store capacity includes both raw data and generated embeddings.
Google says the total is roughly 3x the input data size.
Free tier’s 1 GB means an effective limit of about 300 MB of source material.
A prototype with a few dozen screenshots and conversation logs fits, but once concept art and video captures enter the picture, Tier 1 becomes necessary fast.
Google recommends keeping each store under 20 GB, so even on Tier 2+ you should split stores by purpose.
Multimodal embedding costs for gemini-embedding-2 by media type:
| media type | indexing cost |
|---|---|
| text | $0.20 / 1M tokens |
| image | $0.00012 / image |
| audio | $0.00016 / second |
| video | $0.00079 / frame |
Images are roughly 0.012 cents each.
Audio and video are supported too, but game use cases will mostly involve text and images.
Model pricing for the LLM that processes search results varies widely.
| model | input / 1M tokens | output / 1M tokens |
|---|---|---|
| 2.5 Flash-Lite | $0.10 | $0.40 |
| 2.5 Flash | $0.30 | $2.50 |
| 2.5 Pro (under 200k) | $1.25 | $10.00 |
| 2.5 Pro (over 200k) | $2.50 | $15.00 |
Flash-Lite input is one-third of Flash.
In a prototype that runs many searches, this gap hits the monthly bill directly.
Pro doubles its input rate past 200k tokens, so watch this boundary when pulling large chunks via File Search.
Batch API gives 50% off on both embedding and model inference.
For dev pipelines where real-time response is not needed, running indexing and inference through Batch cuts costs in half.
A rough estimate for a single-character prototype: 200 conversation entries (about 200k tokens), 50 screenshots, and 50k tokens of setting text in one store.
Indexing costs about $0.05 for text and $0.006 for images, totaling roughly $0.056. One-time cost, negligible.
If each search returns an average of 2,000 tokens and the NPC generates a 200-token reply, Flash-Lite costs about $0.0003 per query.
At 100 searches per day that is $0.03, under $1/month. Bumping to Flash brings it to about $3/month. Affordable for solo dev.
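The same arithmetic as a scratch script, using the rates quoted above, so you can substitute your own token counts:

```python
# Reproduces the estimate above; all rates are this article's quoted prices.
TEXT_INDEX_PER_TOKEN = 0.20 / 1e6    # gemini-embedding-2, text
IMAGE_INDEX_EACH = 0.00012           # per image
FLASH_LITE_IN = 0.10 / 1e6           # 2.5 Flash-Lite input, per token
FLASH_LITE_OUT = 0.40 / 1e6          # 2.5 Flash-Lite output, per token

indexing = (200_000 + 50_000) * TEXT_INDEX_PER_TOKEN + 50 * IMAGE_INDEX_EACH
per_query = 2_000 * FLASH_LITE_IN + 200 * FLASH_LITE_OUT
monthly = per_query * 100 * 30       # 100 searches/day

print(f"indexing:  ${indexing:.3f}")   # ~$0.056, one-time
print(f"per query: ${per_query:.5f}")  # ~$0.00028
print(f"monthly:   ${monthly:.2f}")    # ~$0.84
```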
This is not something to call on every frame. Treat it as an external search triggered at conversation start, quest updates, or longer NPC responses.
Citation has a different flavor in games.
In enterprise RAG, you show the user which page backs the answer.
A game NPC obviously cannot display page numbers, but during development a debug UI that shows citations is valuable.
When an NPC says something odd, being able to trace which script or screenshot it pulled from makes tuning much easier.
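A sketch of that debug hook, assuming the google-genai response shape: grounding_metadata is documented for responses that used a retrieval tool, but the chunk fields read below (retrieved_context with title and text) are my assumption and should be checked against a live response.

```python
# Dump which sources an NPC line leaned on, for a dev-only overlay.
def dump_citations(response) -> None:
    meta = response.candidates[0].grounding_metadata
    if meta is None:
        print("no retrieval used")
        return
    for chunk in meta.grounding_chunks or []:
        ctx = chunk.retrieved_context  # ASSUMED field layout
        if ctx is not None:
            print(f"source:  {ctx.title!r}")
            print(f"excerpt: {(ctx.text or '')[:80]!r}")
```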
For a first prototype, conversation logs and screenshots are enough
No need to build a full NPC memory system from day one.
A prototype with one character, one stage, a few dozen conversation entries, and a few dozen screenshots is enough.
Put them in a Gemini File Search store, filter by character_id and scene_id, and pull relevant chunks before each NPC input.
The first thing to watch is not search accuracy but cross-contamination.
Does the NPC pull memories from a different character?
Does it surface events from a future chapter?
When screenshot-derived information mixes into conversation, does it feel natural?
When search results are appended to the prompt each turn, does the NPC’s tone or JSON output break?
If these pass, expand to more images, maps, event CGs, UI state captures, and quest logs.
If cross-contamination happens even with a small dataset, suspect metadata design and store partitioning before blaming the embedding model.
The dev pipeline will benefit before the runtime does
Using File Search as NPC memory at runtime is fun to imagine, but the place it helps first is during development.
Game dev is a mess of scattered documents.
World-setting text, character sheets, concept art, level-design screenshots, scenario scripts, playtest video captures, QA bug reports.
These end up across Confluence, Google Drive, Slack, and local folders, and answering “which stage had that map with the red bridge” or “which chapter’s script does this character first appear in” is manual work every time.
Multimodal File Search lets you pipe this scattered material into a single API-backed search.
Put concept art and scenario scripts in the same store, and “find scenes where this character appears and related background art” comes back via embeddings.
Add chapter or milestone metadata, and you can filter to “M3 assets only.”
Scenario consistency checking is another use.
In a long-form RPG or adventure game, when a writer on a different chapter needs to check “has this plot thread been resolved yet,” grepping all scripts is painful.
File Search lets you query “the scene where so-and-so learns the truth” in natural language and get back the relevant script chunks and screenshots.
Citation returns page numbers, which reduces the cost of verifying against the source.
During development, a few hundred milliseconds of latency does not stall the game loop.
API call costs are rounding errors next to CI/CD build costs and asset management tool licenses.
Plugging this into a scenario tool or QA review board as a search box, before worrying about runtime integration, is where the ROI shows up first.