Sentence Transformers v5.4 Adds Unified Embeddings for Text, Image, Audio, and Video
Sentence Transformers v5.4, maintained by Hugging Face, was released on April 9, 2026.
The biggest change is that the SentenceTransformer and CrossEncoder APIs, previously text-only, now support multimodal input.
Just pass text, images, audio, or video to the same model.encode() and they get mapped into a shared vector space.
Sentence Transformers is the go-to library for semantic text similarity, search, and clustering, with over 30 million monthly downloads. With this release, another piece of the Hugging Face ecosystem, alongside projects like TRL v1.0, has crossed the modality barrier.
What Changed
Previously, Sentence Transformers was specialized for text-to-text similarity. If you wanted image search, you had to handle CLIP models separately, and audio or video were completely out of scope. In v5.4, VLM (Vision Language Model)-based models are wrapped into Sentence Transformers’ unified interface, enabling embedding generation across modalities through the same API.
Three key changes were made.
- Multimodal embedding model integration. The `SentenceTransformer` class can now accept images, audio, and video directly in `encode()`.
- Multimodal reranker support. The `CrossEncoder` class can score pairs across different modalities.
- `tokenizer_kwargs` renamed to `processor_kwargs`. This unifies preprocessing parameters: not just text tokenization, but also image resizing, audio sampling rate, and so on. The old name still works for backward compatibility.
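The rename keeps the old argument working. As an illustration only (this is not the library's actual implementation), a back-compat shim for this kind of parameter rename typically looks like the following sketch, where `normalize_kwargs` is a hypothetical helper:

```python
import warnings

def normalize_kwargs(processor_kwargs=None, tokenizer_kwargs=None):
    """Map the deprecated tokenizer_kwargs onto processor_kwargs, with a warning."""
    if tokenizer_kwargs is not None:
        warnings.warn(
            "tokenizer_kwargs is deprecated; use processor_kwargs instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # Explicit processor_kwargs entries win over the deprecated ones.
        processor_kwargs = {**tokenizer_kwargs, **(processor_kwargs or {})}
    return processor_kwargs or {}
```

The merge order means code passing both names gets the new name's values, so migrating call sites never silently changes behavior.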
Supported Models
Here’s the full list of models integrated as of v5.4.
Embedding Models
| Model | Parameters | Supported Modalities | Notes |
|---|---|---|---|
| Qwen3-VL-Embedding-2B | 2B | Text, Image, Video | Qwen3-VL based |
| Qwen3-VL-Embedding-8B | 8B | Text, Image, Video | Higher accuracy |
| NVIDIA Nemotron Embed VL 1B v2 | 1.7B | Text, Image | Lightweight |
| NVIDIA Omni Embed Nemotron 3B | 4.7B | Text, Image | NVIDIA’s general-purpose model |
| BGE-VL-base | 0.1B | Text, Image | Ultra-lightweight from BAAI |
| BGE-VL-large | 0.4B | Text, Image | Larger BGE-VL variant |
| BGE-VL-MLLM-S1/S2 | 8B | Text, Image | MLLM integrated version |
Reranker Models
| Model | Parameters | Supported Modalities |
|---|---|---|
| Qwen3-VL-Reranker-2B | 2B | Text, Image, Video |
| Qwen3-VL-Reranker-8B | 8B | Text, Image, Video |
| NVIDIA Nemotron Rerank VL 1B v2 | 2B | Text, Image |
| jina-reranker-m0 | 2B | Text, Image |
If Qwen3-Omni is “an inference model covering all modalities from input to output,” then Qwen3-VL-Embedding/Reranker is “a multimodal model specialized for search and information retrieval.” They share the same Qwen3-VL architecture as their foundation, but serve entirely different purposes.
How Embeddings and Rerankers Work Together
In search systems, embedding models and rerankers play different roles.
```mermaid
flowchart TD
    A[User Query] --> B[Embedding model<br/>vectorizes query]
    B --> C[Fast Top-K candidate<br/>search from corpus]
    C --> D[Reranker performs<br/>precise scoring]
    D --> E[Final ranked results]
    F[Corpus<br/>Text/Image/Video] --> G[Pre-vectorize with<br/>Embedding and store]
    G --> C
```
Embedding models convert each input into a fixed-length vector. Pre-vectorize millions of documents, and you can narrow down candidates quickly through distance computation alone. However, measuring by vector distance alone makes it hard to capture subtle nuances.
Rerankers take query-document pairs as direct input and output relevance scores. They run inference per pair, making them slow but highly accurate. Running a reranker across all pairs of millions of documents is impractical, so the standard approach is a two-stage pipeline: filter with embeddings first, then apply the reranker to the top few dozen results.
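The two-stage flow can be sketched with toy stand-ins in place of the real models (the `embed` and `rerank_score` functions below are placeholders, not the Sentence Transformers API):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus; a real system would call model.encode() to build this index.
corpus = [f"doc {i}" for i in range(1000)]
corpus_vecs = rng.normal(size=(len(corpus), 64))
corpus_vecs /= np.linalg.norm(corpus_vecs, axis=1, keepdims=True)

def embed(text: str) -> np.ndarray:
    # Placeholder for the embedding model: one fixed-length vector per input.
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

def rerank_score(query: str, doc: str) -> float:
    # Placeholder for per-pair reranker inference (slow but precise).
    return float(len(set(query.split()) & set(doc.split())))

# Stage 1: cheap vector search narrows 1000 docs to top-K candidates.
query = "doc 42"
q = embed(query)
top_k = np.argsort(corpus_vecs @ q)[::-1][:50]

# Stage 2: the expensive per-pair scorer runs only on those K candidates.
ranked = sorted(top_k, key=lambda i: rerank_score(query, corpus[i]), reverse=True)
print(len(ranked))  # 50 candidates, reordered by reranker score
```

The key property is the cost split: the stage-1 distance computation touches every document but is a single matrix product, while the stage-2 scorer runs K times regardless of corpus size.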
This two-stage pipeline shares the same architecture as the setup NVIDIA NeMo Retriever used to take first place on ViDoRe v3. NeMo Retriever added a ReAct loop where agents autonomously rewrite search queries, but the fundamental structure is the same Retrieve & Rerank pattern.
Code Examples
Cross-Modal Search with Text and Images
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "Qwen/Qwen3-VL-Embedding-2B",
    revision="refs/pr/23",
)

# Encode images (supports URLs, file paths, and PIL objects)
img_embeddings = model.encode([
    "https://example.com/car.jpg",
    "https://example.com/bee.jpg",
])

# Encode text queries
text_embeddings = model.encode([
    "A green car parked in front of a yellow building",
    "A bee on a flower",
])

# Compute cross-modal similarity
similarities = model.similarity(text_embeddings, img_embeddings)
```
Both text and images go into the same encode().
The modality is automatically detected based on input type.
URLs are downloaded automatically, local file paths are read directly, and PIL.Image objects are processed as-is.
Precise Scoring with Rerankers
```python
from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "Qwen/Qwen3-VL-Reranker-2B",
    revision="refs/pr/11",
)

query = "A green car parked in front of a yellow building"
documents = [
    "https://example.com/car.jpg",         # Image
    "https://example.com/bee.jpg",         # Image
    "A vintage car on a European street",  # Text
    {                                      # Combined text+image
        "text": "A car in a European city",
        "image": "https://example.com/car.jpg",
    },
]

rankings = model.rank(query, documents)
```
The reranker accepts mixed lists containing image-only, text-only, and combined text+image entries.
The dictionary format {"text": ..., "image": ...} lets you combine text and image into a single document—useful for searching slides and screenshots where OCR is unreliable.
Checking Modality Support
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B", revision="refs/pr/23")

print(model.modalities)
# ['text', 'image', 'video', 'message']

print(model.supports("audio"))
# False
```
The modalities property returns the list of modalities a model supports.
Qwen3-VL-based models don’t support audio—they cover text, image, and video.
The Modality Gap Problem
Multimodal embeddings have a known issue called the “modality gap.” Similarity scores between vectors of different modalities are systematically lower than those within the same modality.
For instance, the similarity between a car image and the text “a green car” might be 0.51. Between two texts with similar meaning, scores above 0.8 are common. This isn’t a model defect—it occurs because embeddings of different modalities form separate clusters within the vector space.
In practice, as long as relative ordering is preserved, the embeddings remain usable for search. The ordering—“car image” is more similar to “car text” than to “bee text”—holds. However, when mixing text search and cross-modal search, direct comparison of similarity scores becomes unreliable, and system design needs to account for this.
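The gap and the preserved ordering can be illustrated with a deterministic toy construction (synthetic vectors, not real model output): each embedding is a shared semantic direction plus a modality-specific offset, which is exactly what makes cross-modal cosine scores lower in absolute terms while leaving the ranking intact.

```python
import numpy as np

# Four orthogonal directions stand for "car meaning", "bee meaning",
# "text-ness", and "image-ness". The modality offsets create the gap.
car, bee, text_dir, image_dir = np.eye(4)

text_car = car + text_dir                    # query text about a car
text_green_car = car + 0.2 * bee + text_dir  # near-paraphrase text
image_car = car + image_dir                  # the matching car photo
image_bee = bee + image_dir                  # a non-matching photo

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(round(cos(text_car, text_green_car), 2))  # 0.99: same modality, high
print(round(cos(text_car, image_car), 2))       # 0.5: matching pair, but gapped
print(round(cos(text_car, image_bee), 2))       # 0.0: ordering still correct
```

The absolute scores differ by modality pair, but within the image candidates the matching car still ranks first, which is why thresholding raw cross-modal scores is risky while ranking remains reliable.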
Input Format Reference
Here’s a summary of accepted input formats for each modality.
| Modality | Accepted Formats |
|---|---|
| Text | String |
| Image | PIL.Image, file path, URL, numpy array, torch tensor |
| Audio | File path, URL, numpy array, torch tensor, {"array": ..., "sampling_rate": ...} dict, torchcodec.AudioDecoder |
| Video | File path, URL, numpy array, torch tensor, {"array": ..., "video_metadata": ...} dict, torchcodec.VideoDecoder |
| Combined | Dict with modality names as keys, e.g. {"text": ..., "image": ...} |
URL input support for images and video is convenient during development, but in production it’s safer to pre-download and pass local paths or PIL objects for better error handling.
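A minimal download-and-cache helper along those lines, using only the standard library (`fetch_to_local` is a hypothetical name; a production version would add retries, timeouts per content size, and content-type checks):

```python
import urllib.request
from pathlib import Path

def fetch_to_local(url: str, dest_dir: str, timeout: float = 10.0) -> str:
    """Download an asset once and return a stable local path for encode()."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    local = dest / (url.rsplit("/", 1)[-1] or "asset")
    if not local.exists():  # naive cache: skip files we already fetched
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            local.write_bytes(resp.read())
    return str(local)

# Usage sketch: resolve URLs up front, then hand local paths to model.encode(),
# so a dead link fails here with a clear error rather than mid-batch.
# paths = [fetch_to_local(u, "cache/") for u in image_urls]
# embeddings = model.encode(paths)
```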
Impact on RAG Pipelines
Building multimodal RAG used to require managing text embedding models (e.g. all-MiniLM-L6-v2) and image search models like CLIP separately. Search indices were split by modality, and result merging logic had to be written from scratch.
With v5.4’s multimodal embeddings, text and images go into the same model and the same index. Mintlify’s switch from RAG to ChromaFs was driven by retrieval accuracy issues; multimodal support addresses a different axis, simplifying the overall RAG pipeline architecture. Being able to throw PDF screenshots and slide images into the same search index as text and query across them is quietly effective in document search workflows.
Hardware Requirements
VLM-based models demand significant GPU resources.
| Model Size | Required VRAM | Notes |
|---|---|---|
| 0.1B (BGE-VL-base) | Under 1GB | CPU inference is practical |
| 1-2B | ~8GB | RTX 3060/4060 class |
| 8B | ~20GB | RTX 4090, A100, etc. |
The official documentation explicitly states CPU inference is “extremely slow,” recommending GPU environments or Google Colab. Apple Silicon support isn’t documented, but given PyTorch’s advancing MPS support, smaller models might work.
Relationship with CLIP Models
Sentence Transformers has supported CLIP models (clip-ViT-L-14, etc.) through the SentenceTransformer class for a while.
CLIP support continues in v5.4.
For choosing between CLIP and VLM-based models: CLIP is limited to text and image (ImageNet Zero-Shot accuracy 75.4% with ViT-L-14) but is lightweight and fast. VLM-based models also handle video and excel at complex tasks like document understanding, but are heavier. If you already have a system built on CLIP, there’s no rush to migrate—consider switching when you need video support or document image understanding.
Setup Guide
Basic Installation
```shell
pip install "sentence-transformers>=5.4.0"
```
This pulls in major dependencies like torch, transformers, and huggingface-hub automatically.
If you already have Sentence Transformers installed, pip install -U sentence-transformers will upgrade it.
Additional Dependencies for Multimodal Models
Multimodal functionality requires additional packages depending on the model.
Qwen3-VL-based models (both Embedding and Reranker):
```shell
pip install qwen-vl-utils
```
qwen-vl-utils is a utility for image/video preprocessing that the Qwen3-VL model’s processor depends on internally.
Without it, passing an image to encode() raises an ImportError.
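If you want to fail fast with a clear message instead of a deep `ImportError`, a small guard can check for the optional dependency up front (a generic pattern; note that the pip package `qwen-vl-utils` imports as `qwen_vl_utils`):

```python
import importlib.util

def has_package(module_name: str) -> bool:
    """True if the module can be imported in the current environment."""
    return importlib.util.find_spec(module_name) is not None

# Example guard before sending images through encode():
if not has_package("qwen_vl_utils"):
    print("Install qwen-vl-utils before encoding images with Qwen3-VL models.")
```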
For video input:
```shell
pip install torchcodec
```
torchcodec is PyTorch’s official video decoder for extracting frames from video files.
Not needed if you’re not working with video.
NVIDIA models (Nemotron family):
```shell
pip install timm
```
timm (PyTorch Image Models) is used by NVIDIA’s embedding models as the image encoder.
GPU Environment Setup
As noted above, GPU inference is practically required for VLM-based models. If you have a CUDA toolkit installed, reinstall PyTorch with CUDA support.
```shell
# For CUDA 12.1
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

# For CUDA 12.4
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
```
If using Google Colab, just select a GPU runtime—the CUDA environment is already set up, no additional PyTorch installation needed.
Verifying with a Minimal Setup
Start with the lightest model, BGE-VL-base (0.1B), to verify your environment is correctly configured.
```python
from io import BytesIO

import requests
from PIL import Image
from sentence_transformers import SentenceTransformer

# Verify with the lightest model (first run downloads the weights)
model = SentenceTransformer("BAAI/BGE-VL-base")

# Text embedding
text_emb = model.encode(["test sentence"])
print(f"Text: shape={text_emb.shape}")

# Check supported modalities
print(f"Supported modalities: {model.modalities}")

# Image embedding (URL example)
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
img = Image.open(BytesIO(requests.get(url).content))
img_emb = model.encode([img])
print(f"Image: shape={img_emb.shape}")

# Cross-modal similarity
similarity = model.similarity(text_emb, img_emb)
print(f"Similarity: {similarity}")
```
BGE-VL-base runs on under 1GB of VRAM, so you can even test it on CPU, albeit slowly. Once this works, switch to larger models like Qwen3-VL or Nemotron based on your use case.
Dependency Overview
Which packages you need ultimately depends on your model choice.
| Use Case | Required Package | Notes |
|---|---|---|
| Text only | sentence-transformers>=5.4.0 | No additional deps |
| Qwen3-VL models | + qwen-vl-utils | Image/video preprocessing |
| Qwen3-VL with video | + torchcodec | Frame extraction |
| NVIDIA models | + timm | Image encoder |
| GPU inference | CUDA-enabled torch | Practically required for 2B+ |
For a requirements.txt, it would look like this:
```text
sentence-transformers>=5.4.0
qwen-vl-utils
torchcodec
timm
```
Installing all of them causes no conflicts, so when in doubt, install everything; packages your chosen model doesn’t need simply go unused.