Sentence Transformers v5.4 Adds Unified Embeddings for Text, Image, Audio, and Video
Sentence Transformers v5.4, maintained by Hugging Face, was released on April 9, 2026.
The biggest change is that the SentenceTransformer and CrossEncoder APIs, previously text-only, now support multimodal input.
Just pass text, images, audio, or video to the same model.encode() and they get mapped into a shared vector space.
Sentence Transformers is the go-to library for semantic text similarity, search, and clustering, with over 30 million monthly downloads. With this release, another piece of the Hugging Face ecosystem, alongside projects like TRL v1.0, has crossed the modality barrier.
What Changed
Previously, Sentence Transformers was specialized for text-to-text similarity. If you wanted image search, you had to handle CLIP models separately, and audio or video were completely out of scope. In v5.4, VLM (Vision Language Model)-based models are wrapped into Sentence Transformers’ unified interface, enabling embedding generation across modalities through the same API.
Three key changes were made.
- Multimodal embedding model integration. The `SentenceTransformer` class can now accept images, audio, and video directly in `encode()`.
- Multimodal reranker support. The `CrossEncoder` class can score pairs across different modalities.
- `tokenizer_kwargs` renamed to `processor_kwargs`. This unifies preprocessing parameters: not just text tokenization, but also image resizing, audio sampling rate, and so on. The old name still works for backward compatibility.
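The rename keeps the old argument working. As an illustration only (this is not the library's actual implementation), a back-compat shim for this kind of parameter rename typically looks like the following sketch, where `normalize_kwargs` is a hypothetical helper:

```python
import warnings

def normalize_kwargs(processor_kwargs=None, tokenizer_kwargs=None):
    """Map the deprecated tokenizer_kwargs onto processor_kwargs, with a warning."""
    if tokenizer_kwargs is not None:
        warnings.warn(
            "tokenizer_kwargs is deprecated; use processor_kwargs instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # Explicit processor_kwargs entries win over the deprecated ones.
        processor_kwargs = {**tokenizer_kwargs, **(processor_kwargs or {})}
    return processor_kwargs or {}
```

The merge order means code passing both names gets the new name's values, so migrating call sites never silently changes behavior.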
Supported Models
Here’s the full list of models integrated as of v5.4.
Embedding Models
| Model | Parameters | Supported Modalities | Notes |
|---|---|---|---|
| Qwen3-VL-Embedding-2B | 2B | Text, Image, Video | Qwen3-VL based |
| Qwen3-VL-Embedding-8B | 8B | Text, Image, Video | Higher accuracy |
| NVIDIA Nemotron Embed VL 1B v2 | 1.7B | Text, Image | Lightweight |
| NVIDIA Omni Embed Nemotron 3B | 4.7B | Text, Image | NVIDIA’s general-purpose model |
| BGE-VL-base | 0.1B | Text, Image | Ultra-lightweight from BAAI |
| BGE-VL-large | 0.4B | Text, Image | Larger BGE-VL variant |
| BGE-VL-MLLM-S1/S2 | 8B | Text, Image | MLLM integrated version |
Reranker Models
| Model | Parameters | Supported Modalities |
|---|---|---|
| Qwen3-VL-Reranker-2B | 2B | Text, Image, Video |
| Qwen3-VL-Reranker-8B | 8B | Text, Image, Video |
| NVIDIA Nemotron Rerank VL 1B v2 | 2B | Text, Image |
| jina-reranker-m0 | 2B | Text, Image |
If Qwen3-Omni is “an inference model covering all modalities from input to output,” then Qwen3-VL-Embedding/Reranker is “a multimodal model specialized for search and information retrieval.” They share the same Qwen3-VL architecture as their foundation, but serve entirely different purposes.
How Embeddings and Rerankers Work Together
In search systems, embedding models and rerankers play different roles.
```mermaid
flowchart TD
    A[User Query] --> B[Embedding model<br/>vectorizes query]
    B --> C[Fast Top-K candidate<br/>search from corpus]
    C --> D[Reranker performs<br/>precise scoring]
    D --> E[Final ranked results]
    F[Corpus<br/>Text/Image/Video] --> G[Pre-vectorize with<br/>Embedding and store]
    G --> C
```
Embedding models convert each input into a fixed-length vector. Pre-vectorize millions of documents, and you can narrow down candidates quickly through distance computation alone. However, measuring by vector distance alone makes it hard to capture subtle nuances.
Rerankers take query-document pairs as direct input and output relevance scores. They run inference per pair, making them slow but highly accurate. Running a reranker across all pairs of millions of documents is impractical, so the standard approach is a two-stage pipeline: filter with embeddings first, then apply the reranker to the top few dozen results.
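The two-stage flow can be sketched with toy stand-ins in place of the real models (the `embed` and `rerank_score` functions below are placeholders, not the Sentence Transformers API):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus; a real system would call model.encode() to build this index.
corpus = [f"doc {i}" for i in range(1000)]
corpus_vecs = rng.normal(size=(len(corpus), 64))
corpus_vecs /= np.linalg.norm(corpus_vecs, axis=1, keepdims=True)

def embed(text: str) -> np.ndarray:
    # Placeholder for the embedding model: one fixed-length vector per input.
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

def rerank_score(query: str, doc: str) -> float:
    # Placeholder for per-pair reranker inference (slow but precise).
    return float(len(set(query.split()) & set(doc.split())))

# Stage 1: cheap vector search narrows 1000 docs to top-K candidates.
query = "doc 42"
q = embed(query)
top_k = np.argsort(corpus_vecs @ q)[::-1][:50]

# Stage 2: the expensive per-pair scorer runs only on those K candidates.
ranked = sorted(top_k, key=lambda i: rerank_score(query, corpus[i]), reverse=True)
print(len(ranked))  # 50 candidates, reordered by reranker score
```

The key property is the cost split: the stage-1 distance computation touches every document but is a single matrix product, while the stage-2 scorer runs K times regardless of corpus size.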
This two-stage pipeline shares the same architecture as the setup NVIDIA NeMo Retriever used to take first place on ViDoRe v3. NeMo Retriever added a ReAct loop where agents autonomously rewrite search queries, but the fundamental structure is the same Retrieve & Rerank pattern.
Code Examples
Cross-Modal Search with Text and Images
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "Qwen/Qwen3-VL-Embedding-2B",
    revision="refs/pr/23",
)

# Encode images (supports URLs, file paths, and PIL objects)
img_embeddings = model.encode([
    "https://example.com/car.jpg",
    "https://example.com/bee.jpg",
])

# Encode text queries
text_embeddings = model.encode([
    "A green car parked in front of a yellow building",
    "A bee on a flower",
])

# Compute cross-modal similarity
similarities = model.similarity(text_embeddings, img_embeddings)
```
Both text and images go into the same encode().
The modality is automatically detected based on input type.
URLs are downloaded automatically, local file paths are read directly, and PIL.Image objects are processed as-is.
Precise Scoring with Rerankers
```python
from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "Qwen/Qwen3-VL-Reranker-2B",
    revision="refs/pr/11",
)

query = "A green car parked in front of a yellow building"
documents = [
    "https://example.com/car.jpg",         # Image
    "https://example.com/bee.jpg",         # Image
    "A vintage car on a European street",  # Text
    {                                      # Combined text+image
        "text": "A car in a European city",
        "image": "https://example.com/car.jpg",
    },
]

rankings = model.rank(query, documents)
```
The reranker accepts mixed lists containing image-only, text-only, and combined text+image entries.
The dictionary format {"text": ..., "image": ...} lets you combine text and image into a single document—useful for searching slides and screenshots where OCR is unreliable.
Checking Modality Support
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B", revision="refs/pr/23")

print(model.modalities)
# ['text', 'image', 'video', 'message']

print(model.supports("audio"))
# False
```
The modalities property returns the list of modalities a model supports.
Qwen3-VL-based models don’t support audio—they cover text, image, and video.
The Modality Gap Problem
Multimodal embeddings have a known issue called the “modality gap.” Similarity scores between vectors of different modalities are systematically lower than those within the same modality.
For instance, the similarity between a car image and the text “a green car” might be 0.51. Between two texts with similar meaning, scores above 0.8 are common. This isn’t a model defect—it occurs because embeddings of different modalities form separate clusters within the vector space.
In practice, as long as relative ordering is preserved, the embeddings remain usable for search. The ordering—“car image” is more similar to “car text” than to “bee text”—holds. However, when mixing text search and cross-modal search, direct comparison of similarity scores becomes unreliable, and system design needs to account for this.
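The gap and the preserved ordering can be illustrated with a deterministic toy construction (synthetic vectors, not real model output): each embedding is a shared semantic direction plus a modality-specific offset, which is exactly what makes cross-modal cosine scores lower in absolute terms while leaving the ranking intact.

```python
import numpy as np

# Four orthogonal directions stand for "car meaning", "bee meaning",
# "text-ness", and "image-ness". The modality offsets create the gap.
car, bee, text_dir, image_dir = np.eye(4)

text_car = car + text_dir                    # query text about a car
text_green_car = car + 0.2 * bee + text_dir  # near-paraphrase text
image_car = car + image_dir                  # the matching car photo
image_bee = bee + image_dir                  # a non-matching photo

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(round(cos(text_car, text_green_car), 2))  # 0.99: same modality, high
print(round(cos(text_car, image_car), 2))       # 0.5: matching pair, but gapped
print(round(cos(text_car, image_bee), 2))       # 0.0: ordering still correct
```

The absolute scores differ by modality pair, but within the image candidates the matching car still ranks first, which is why thresholding raw cross-modal scores is risky while ranking remains reliable.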
Input Format Reference
Here’s a summary of accepted input formats for each modality.
| Modality | Accepted Formats |
|---|---|
| Text | String |
| Image | PIL.Image, file path, URL, numpy array, torch tensor |
| Audio | File path, URL, numpy array, torch tensor, {"array": ..., "sampling_rate": ...} dict, torchcodec.AudioDecoder |
| Video | File path, URL, numpy array, torch tensor, {"array": ..., "video_metadata": ...} dict, torchcodec.VideoDecoder |
| Combined | Dict with modality names as keys, e.g. {"text": ..., "image": ...} |
URL input support for images and video is convenient during development, but in production it’s safer to pre-download and pass local paths or PIL objects for better error handling.
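A minimal download-and-cache helper along those lines, using only the standard library (`fetch_to_local` is a hypothetical name; a production version would add retries, timeouts per content size, and content-type checks):

```python
import urllib.request
from pathlib import Path

def fetch_to_local(url: str, dest_dir: str, timeout: float = 10.0) -> str:
    """Download an asset once and return a stable local path for encode()."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    local = dest / (url.rsplit("/", 1)[-1] or "asset")
    if not local.exists():  # naive cache: skip files we already fetched
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            local.write_bytes(resp.read())
    return str(local)

# Usage sketch: resolve URLs up front, then hand local paths to model.encode(),
# so a dead link fails here with a clear error rather than mid-batch.
# paths = [fetch_to_local(u, "cache/") for u in image_urls]
# embeddings = model.encode(paths)
```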
Impact on RAG Pipelines
Building multimodal RAG used to require managing text embedding models (e.g. all-MiniLM-L6-v2) and image search models like CLIP separately. Search indices were split by modality, and result merging logic had to be written from scratch.
With v5.4’s multimodal embeddings, text and images go into the same model and the same index. Mintlify’s switch from RAG to ChromaFs was driven by retrieval accuracy issues; multimodal support addresses a different axis, simplifying the overall RAG pipeline architecture. Being able to throw PDF screenshots and slide images into the same search index as text and query across them is quietly effective in document search workflows.
Hardware Requirements
VLM-based models demand significant GPU resources.
| Model Size | Required VRAM | Notes |
|---|---|---|
| 0.1B (BGE-VL-base) | Under 1GB | CPU inference is practical |
| 1-2B | ~8GB | RTX 3060/4060 class |
| 8B | ~20GB | RTX 4090, A100, etc. |
The official documentation explicitly states CPU inference is “extremely slow,” recommending GPU environments or Google Colab. Apple Silicon support isn’t documented, but given PyTorch’s advancing MPS support, smaller models might work.
Relationship with CLIP Models
Sentence Transformers has supported CLIP models (clip-ViT-L-14, etc.) through the SentenceTransformer class for a while.
CLIP support continues in v5.4.
For choosing between CLIP and VLM-based models: CLIP is limited to text and image (ImageNet Zero-Shot accuracy 75.4% with ViT-L-14) but is lightweight and fast. VLM-based models also handle video and excel at complex tasks like document understanding, but are heavier. If you already have a system built on CLIP, there’s no rush to migrate—consider switching when you need video support or document image understanding.
Setup Guide
Basic Installation
```shell
pip install "sentence-transformers>=5.4.0"
```
This pulls in major dependencies like torch, transformers, and huggingface-hub automatically.
If you already have Sentence Transformers installed, pip install -U sentence-transformers will upgrade it.
Additional Dependencies for Multimodal Models
Multimodal functionality requires additional packages depending on the model.
Qwen3-VL-based models (both Embedding and Reranker):
```shell
pip install qwen-vl-utils
```
qwen-vl-utils is a utility for image/video preprocessing that the Qwen3-VL model’s processor depends on internally.
Without it, passing an image to encode() raises an ImportError.
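If you want to fail fast with a clear message instead of a deep `ImportError`, a small guard can check for the optional dependency up front (a generic pattern; note that the pip package `qwen-vl-utils` imports as `qwen_vl_utils`):

```python
import importlib.util

def has_package(module_name: str) -> bool:
    """True if the module can be imported in the current environment."""
    return importlib.util.find_spec(module_name) is not None

# Example guard before sending images through encode():
if not has_package("qwen_vl_utils"):
    print("Install qwen-vl-utils before encoding images with Qwen3-VL models.")
```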
For video input:
```shell
pip install torchcodec
```
torchcodec is PyTorch’s official video decoder for extracting frames from video files.
Not needed if you’re not working with video.
NVIDIA models (Nemotron family):
```shell
pip install timm
```
timm (PyTorch Image Models) is used by NVIDIA’s embedding models as the image encoder.
GPU Environment Setup
As noted above, GPU inference is practically required for VLM-based models. If you have a CUDA toolkit installed, reinstall PyTorch with CUDA support.
```shell
# For CUDA 12.1
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

# For CUDA 12.4
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
```
If using Google Colab, just select a GPU runtime—the CUDA environment is already set up, no additional PyTorch installation needed.
Verifying with a Minimal Setup
Start with the lightest model, BGE-VL-base (0.1B), to verify your environment is correctly configured.
```python
from io import BytesIO

import requests
from PIL import Image
from sentence_transformers import SentenceTransformer

# Verify with the lightest model (first run downloads the weights)
model = SentenceTransformer("BAAI/BGE-VL-base")

# Text embedding
text_emb = model.encode(["test sentence"])
print(f"Text: shape={text_emb.shape}")

# Check supported modalities
print(f"Supported modalities: {model.modalities}")

# Image embedding (URL example)
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
img = Image.open(BytesIO(requests.get(url).content))
img_emb = model.encode([img])
print(f"Image: shape={img_emb.shape}")

# Cross-modal similarity
similarity = model.similarity(text_emb, img_emb)
print(f"Similarity: {similarity}")
```
BGE-VL-base runs on under 1GB of VRAM, so you can even test it on CPU, albeit slowly. Once this works, switch to larger models like Qwen3-VL or Nemotron based on your use case.
Dependency Overview
Which packages you need ultimately depends on your model choice.
| Use Case | Required Package | Notes |
|---|---|---|
| Text only | sentence-transformers>=5.4.0 | No additional deps |
| Qwen3-VL models | + qwen-vl-utils | Image/video preprocessing |
| Qwen3-VL with video | + torchcodec | Frame extraction |
| NVIDIA models | + timm | Image encoder |
| GPU inference | CUDA-enabled torch | Practically required for 2B+ |
For a requirements.txt, it would look like this:
```text
sentence-transformers>=5.4.0
qwen-vl-utils
torchcodec
timm
```
Installing all of them causes no conflicts, so when in doubt, install everything; packages your chosen model doesn’t need simply go unused.