Lance 3B unified multimodal: 40GB VRAM, RunPod costs, and why weights are split

Ikesan

ByteDance Research published Lance on Hugging Face.
It’s a 3B active parameter model that handles image and video understanding, generation, and editing in a single framework under Apache 2.0.
The model card lists Python 3.10+, CUDA 12.4+, and 40GB+ GPU VRAM for inference.

Not something you casually try on an M1 Max or RTX 4060 Laptop.
But the approach differs from running separate image generation, video generation, and VLM models side by side.
When I built a multimodal Japanese RAG pipeline with FastAPI, Chroma, Open WebUI, and Ollama on an M1 Max, retrieval, image description, and the final response were each handled by a different model.
Lance is a research implementation that consolidates those roles into a unified architecture.

6 tasks through one CLI

The GitHub README has a shell script inference_lance.sh that switches between tasks.
Supported tasks are t2i, t2v, image_edit, video_edit, x2t_image, and x2t_video.
Text-to-image, text-to-video, image editing, video editing, image understanding, and video understanding all go through the same code path.

flowchart TD
    Text[Text] --> Context[Shared<br/>multimodal context]
    Image[Image] --> Context
    Video[Video] --> Context
    Context --> Understand["Understanding<br/>image QA and video QA"]
    Context --> Generate["Generation<br/>image and video generation"]
    Context --> Edit["Editing<br/>image and video editing"]

According to the project page, text, image, and video contexts are treated as a shared sequence, with understanding and generation paths routed to separate experts.
The understanding side uses ViT tokens for semantic comprehension, while the generation side uses clean and noisy VAE latent representations.
Modality-aware RoPE reduces interference between heterogeneous visual tokens.

The video editing demo shows background replacement, object addition, person action changes, and style transfer all on the same model card.
Video understanding examples return action counts, movement directions, anomalies in footage, and short cooking procedure descriptions.

Surprisingly strong video benchmarks for 3B

Lance stays at 3B and raises its benchmark scores through its multi-task learning design rather than through scale.
According to the model card and paper page, it was trained from scratch within a budget of 128 A100 GPUs.
3B sounds lightweight, but that doesn’t mean it runs light locally. The 40GB VRAM requirement puts realistic hardware at A100, L40S, A6000, or H100.

On published benchmarks, Lance scores 0.90 overall on GenEVAL for image generation.
In the same table, Qwen-Image scores 0.87, FLUX.1-dev scores 0.82, and TUNA (another unified model) scores 0.90, so Lance ties TUNA for the top of the unified model group.
DPG-Bench overall is 84.67, below Qwen-Image’s 88.32 and TUNA’s 86.76, though the relational understanding category hits 93.38.

On VBench for video generation, Lance leads the unified model group at 85.11 overall.
TUNA 1.5B scores 84.06 and Show-o2 2B scores 81.34.
Including dedicated generation models, Wan2.1-T2V 14B scores 83.69 and Hunyuan Video scores 83.43, so this 3B unified model outscores dedicated 14B models on the paper’s measurements.

That said, the image understanding demos do contain errors.
On the Hugging Face model card, a solar eclipse description includes an inaccurate mention of Earth’s shadow.
Understanding quality doesn’t come through in benchmark tables alone; you need to read through sample outputs to see the tendencies.
Comparing Lance’s VLM responses side by side with Qwen2.5-VL or Qwen3-VL on the same inputs would surface differences.

A different kind of unification from RAG embeddings

In Sentence Transformers v5.4’s unified text-image-audio-video embeddings, I covered a unified interface for retrieval.
That approach projects inputs into vectors for search; actual responses and image generation are handled by separate models.

Lance reads images and video, produces answers, and if needed generates or edits media with the same model.
Where embedding models unify the retrieval interface, Lance packs the entire production workflow into a single model.

In practice, though, a split architecture is easier to work with in many situations.
Run embeddings on a lightweight model, image understanding on a VLM, final answers on an LLM, and image generation through ComfyUI or an API.
This way you can swap parts based on local memory constraints or quality needs.
Lance’s 40GB VRAM requirement is the direct result of consolidating that split architecture into one model.

Files to look at before running

The GitHub repo contains benchmarks/, config/, data/, modeling/, inference_lance.py, and inference_lance.sh.
Model weights are downloaded from Hugging Face and placed in downloads/.
A Gradio demo starts with python lance_gradio_t2v_v2t.py --gpus 0 --server-port 7860.

Inference parameters are set in inference_lance.sh.
Defaults are 30 denoising steps, 4.0 CFG scale, 50 video frames, and a 480p resolution preset.
Maximum frame count is listed as 121.
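
To keep those numbers in one place when wrapping the script, here is a minimal Python sketch of the defaults with a bounds check on the frame count. The class and field names are hypothetical and are not the variable names used in inference_lance.sh; only the values come from the script's documented defaults.

```python
from dataclasses import dataclass

MAX_FRAMES = 121  # maximum frame count listed for inference_lance.sh


@dataclass(frozen=True)
class LanceSettings:
    """Hypothetical container; field names are illustrative, not the script's own."""

    denoise_steps: int = 30    # default denoising steps
    cfg_scale: float = 4.0     # default CFG scale
    video_frames: int = 50     # default number of video frames
    resolution: str = "480p"   # default resolution preset

    def __post_init__(self) -> None:
        # Reject values outside the documented range before launching a long job.
        if self.denoise_steps < 1:
            raise ValueError("denoise_steps must be at least 1")
        if not 1 <= self.video_frames <= MAX_FRAMES:
            raise ValueError(f"video_frames must be between 1 and {MAX_FRAMES}")


defaults = LanceSettings()
print(defaults)
```

Validating up front is cheap compared with discovering a bad frame count after the weights have already loaded on a paid GPU.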

To try it out, start with the same prompt format shown in the Hugging Face model card samples.
The README also notes that input prompts should follow the provided examples.
Image and video generation models can break differently depending on prompt wording even when the meaning is the same.
Lance appeared to be no exception.

Which GPU on RunPod

Since M1 Max and RTX 4060 Laptop won’t run this, cloud GPUs are the path forward.
Here’s a cost estimate for running on RunPod.

Weight file sizes

The total Hugging Face repo is about 57.4 GB.
Image and video model weights are separate files; downloading both gives this breakdown:

| Component | Size |
| --- | --- |
| Lance_3B (image model) | 24.7 GB |
| Lance_3B_Video (video model) | 28.4 GB |
| Qwen2.5-VL-ViT | 1.34 GB |
| Wan2.2 VAE | 2.82 GB |

For image tasks only, Lance_3B + ViT + VAE comes to about 28.9 GB.
For video tasks only, Lance_3B_Video + ViT + VAE totals about 32.6 GB.
Both together run about 57.3 GB.
The PyTorch + CUDA 12.4 Python environment takes another 4 to 6 GB, so budget at least 70 GB of disk, preferably 100 GB.
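
The totals above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the sizes are the ones listed in the component table, the function name is mine, and the 6 GB environment figure is the upper end of the estimate.

```python
# Component sizes in GB, as listed in the weight table.
SIZES_GB = {
    "Lance_3B": 24.7,
    "Lance_3B_Video": 28.4,
    "Qwen2.5-VL-ViT": 1.34,
    "Wan2.2 VAE": 2.82,
}
# ViT and VAE are needed by both the image and the video checkpoint.
SHARED = ("Qwen2.5-VL-ViT", "Wan2.2 VAE")
ENV_GB = 6.0  # upper end of the PyTorch + CUDA environment estimate


def download_size_gb(*checkpoints: str) -> float:
    """Total size of the chosen checkpoints plus the shared ViT and VAE."""
    names = set(checkpoints) | set(SHARED)
    return round(sum(SIZES_GB[name] for name in names), 2)


image_only = download_size_gb("Lance_3B")
video_only = download_size_gb("Lance_3B_Video")
both = download_size_gb("Lance_3B", "Lance_3B_Video")

print(f"image only: {image_only} GB")
print(f"video only: {video_only} GB")
print(f"both:       {both} GB (+{ENV_GB} GB environment = {both + ENV_GB:.2f} GB)")
```

Both checkpoints plus the environment land in the low 60s of GB, which is why 70 GB is the floor and 100 GB leaves room for generated outputs and caches.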

VRAM estimate

The model card requires 40 GB+.
Roughly estimating VRAM usage with bf16 weight loading, and assuming the weights occupy about as much VRAM as they do on disk: the static weights for the image model plus ViT and VAE come to about 28.9 GB.
During inference, KV cache and intermediate activations add several GB, bringing actual consumption to around 35 to 40 GB.

The video model weighs 28.4 GB, and decoding 50 frames increases activation memory.
Static weights + ViT + VAE total about 32.6 GB, with inference pushing toward 40 GB.
A100 40 GB would be right at the edge or OOM territory; video generation would be difficult.
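
The same back-of-the-envelope estimate can be written as a small headroom check. This is a rough sketch, not a measurement: it assumes weights occupy about as much VRAM as they do on disk, and the 6 to 11 GB runtime overhead is my own reading of the 35 to 40 GB range quoted above, applied to video as well.

```python
# Static footprint in GB (checkpoint + ViT + VAE), taken from the estimates above.
STATIC_GB = {"image": 28.9, "video": 32.6}
# Assumption: KV cache and activations add roughly 6 to 11 GB at inference time.
OVERHEAD_RANGE_GB = (6.0, 11.0)
GPUS_GB = {"A100 40GB": 40, "A6000 / L40S": 48, "A100 / H100 80GB": 80}


def headroom_gb(vram_gb: float, static_gb: float, overhead_gb: float) -> float:
    """VRAM left after weights and runtime overhead; negative means likely OOM."""
    return round(vram_gb - (static_gb + overhead_gb), 2)


for gpu, vram in GPUS_GB.items():
    for task, static in STATIC_GB.items():
        worst = headroom_gb(vram, static, OVERHEAD_RANGE_GB[1])
        verdict = "OOM risk" if worst < 0 else "fits"
        print(f"{gpu:18s} {task:5s} worst-case headroom {worst:6.2f} GB -> {verdict}")
```

Under these assumptions the video model goes negative on a 40 GB card in the worst case while 48 GB cards keep a few GB spare, which matches the reasoning above.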

No quantization (int8/int4) workflow is provided.
requirements.txt includes bitsandbytes, but there’s no quantization documentation, so you’d be on your own.

RunPod GPU candidates

Four GPUs on RunPod are realistic options with 40 GB+ of VRAM.
Prices are Community Cloud estimates (cheaper when available, but preemptible).

| GPU | VRAM | Price range (est.) | Notes |
| --- | --- | --- | --- |
| A6000 | 48 GB | ~$0.76/hr | Best cost efficiency. Image model fits easily, video model fits too |
| L40S | 48 GB | ~$0.74/hr | Similar price to A6000. Ada Lovelace gen, slightly faster inference |
| A100 80GB SXM | 80 GB | ~$1.64/hr | Plenty of VRAM headroom. No worries for video |
| H100 80GB SXM | 80 GB | ~$3.49/hr | Fastest, but pricey. For short bursts |

A100 40 GB might work for the image model but carries high OOM risk for video.
L40S or A6000 at 48 GB is the sweet spot for cost vs. capability.
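
As a small illustration of that choice, the sketch below picks the cheapest candidate that meets a VRAM floor. The prices are the estimates from the table and will drift; the type and function names are hypothetical.

```python
from typing import NamedTuple


class Gpu(NamedTuple):
    name: str
    vram_gb: int
    usd_per_hr: float  # estimated Community Cloud price


# Candidates and estimated prices from the table above.
CANDIDATES = [
    Gpu("A6000", 48, 0.76),
    Gpu("L40S", 48, 0.74),
    Gpu("A100 80GB SXM", 80, 1.64),
    Gpu("H100 80GB SXM", 80, 3.49),
]


def cheapest_gpu(min_vram_gb: float, candidates=CANDIDATES) -> Gpu:
    """Return the lowest-priced GPU with at least min_vram_gb of VRAM."""
    eligible = [gpu for gpu in candidates if gpu.vram_gb >= min_vram_gb]
    if not eligible:
        raise ValueError(f"no candidate has {min_vram_gb} GB of VRAM")
    return min(eligible, key=lambda gpu: gpu.usd_per_hr)


print(cheapest_gpu(48))  # enough for either single checkpoint
print(cheapest_gpu(80))  # extra headroom for video
```

Price alone ignores availability and preemption on Community Cloud, so treat the result as a starting point rather than a recommendation.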

Setup time and session cost

After launching a Pod on RunPod, the first task is downloading model weights and setting up the Python environment.
Pulling 57 GB from Hugging Face takes roughly 20 to 40 minutes depending on RunPod’s network bandwidth.
pip install for 70+ packages adds another 10 to 15 minutes.
First-time setup alone takes 30 minutes to an hour.
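
To see what those download times imply, the sketch below converts them into sustained throughput and sums the setup range. It is plain arithmetic on the figures above (decimal GB, 1 GB = 1000 MB); the function names are mine.

```python
def implied_throughput_mb_s(size_gb: float, minutes: float) -> float:
    """Sustained download speed in MB/s needed to move size_gb in the given minutes."""
    return size_gb * 1000 / (minutes * 60)


def setup_minutes(download_range, install_range):
    """Best-case and worst-case total setup time in minutes."""
    return (download_range[0] + install_range[0],
            download_range[1] + install_range[1])


fast = implied_throughput_mb_s(57, 20)
slow = implied_throughput_mb_s(57, 40)
total = setup_minutes((20, 40), (10, 15))

print(f"implied throughput: {slow:.2f} to {fast:.2f} MB/s")
print(f"first-time setup:   {total[0]} to {total[1]} minutes")
```

The estimate corresponds to roughly 24 to 48 MB/s sustained; a slower link pushes first-time setup past the one-hour mark.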

On subsequent runs, storing weights on a network volume eliminates the download.
Volume storage on RunPod costs $0.07/GB/month, so a 100 GB volume runs $7/month.

As a test session estimate, 3 hours on an L40S costs about $2.20.
If first-time setup stretches the session to 4 hours, it comes to about $3.
That’s enough to launch the Gradio demo and try out both image and video generation.
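
The session and storage figures are simple multiplications, sketched here so they can be rerun with current prices. The $0.74/hr L40S rate and $0.07/GB/month volume rate are the estimates used in this article, not live pricing.

```python
L40S_USD_PER_HR = 0.74          # estimated Community Cloud rate
VOLUME_USD_PER_GB_MONTH = 0.07  # network volume storage rate


def session_cost_usd(hours: float, usd_per_hr: float = L40S_USD_PER_HR) -> float:
    """GPU rental cost for one session."""
    return round(hours * usd_per_hr, 2)


def volume_cost_usd_per_month(size_gb: float,
                              rate: float = VOLUME_USD_PER_GB_MONTH) -> float:
    """Monthly cost of keeping weights on a network volume."""
    return round(size_gb * rate, 2)


print(f"3 h test session:      ${session_cost_usd(3):.2f}")
print(f"4 h including setup:   ${session_cost_usd(4):.2f}")
print(f"100 GB volume / month: ${volume_cost_usd_per_month(100):.2f}")
```

At these rates the monthly volume fee costs more than a couple of test sessions, so a persistent volume only pays off if you expect to come back repeatedly.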

NUM_GPUS in inference_lance.sh

inference_lance.sh has a NUM_GPUS parameter, defaulting to 1.
Launching a multi-GPU Pod on RunPod allows distribution across GPUs.
If a single GPU with 48 GB+ is available, though, distribution isn’t needed.
Multi-GPU distribution applies when using a setup like A100 40 GB x2 to work around the 40 GB limit.

“Unified model” with two separate weight files

Looking at the weight file table, Lance_3B (image model, 24.7 GB) and Lance_3B_Video (video model, 28.4 GB) are separate checkpoints.
Image tasks load Lance_3B; video tasks load Lance_3B_Video.
ViT and VAE are shared, but the Transformer body weights are separate.

Qwen’s lineup was separate models from the start.
Understanding is Qwen2.5-VL, image generation is Qwen-Image, video is Wan2.2.
Names and repositories are distinct, with independent version bumps.
Separate weights are obvious from the naming, and they’re designed to be combined.

Lance brands itself a “unified multimodal model,” but it doesn’t ship a single weight file that handles all tasks across image and video.
What’s unified is the architecture and training pipeline.
Text, image, and video tokens pass through the same Transformer with multi-task learning that boosts cross-task performance. Inference scripts and CLI options are also shared.

However, with two separate weight downloads, the actual experience of downloading and setting up isn’t much different from the Qwen approach.
If you want both image and video, you pull 57 GB of weights.

At inference time, image tasks (t2i, image_edit, x2t_image) load Lance_3B, and video tasks (t2v, video_edit, x2t_video) load Lance_3B_Video.
A continuous workflow going from image understanding to video generation requires a checkpoint swap mid-pipeline.
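Keeping both checkpoints resident would take about 57 GB for weights alone, which leaves limited headroom for activations even on an 80 GB A100, and because each task loads its own checkpoint, every switch between image and video tasks means a reload.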
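
The cost of that swap is easy to see by counting checkpoint loads for a task sequence. The sketch below is illustrative only: the task-to-checkpoint mapping follows the description above, while the function name and the example job list are made up.

```python
# Which checkpoint each CLI task loads, per the description above.
CHECKPOINT_FOR_TASK = {
    "t2i": "Lance_3B",
    "image_edit": "Lance_3B",
    "x2t_image": "Lance_3B",
    "t2v": "Lance_3B_Video",
    "video_edit": "Lance_3B_Video",
    "x2t_video": "Lance_3B_Video",
}


def count_loads(tasks) -> int:
    """Number of checkpoint loads when tasks run in order on one GPU."""
    loads = 0
    current = None
    for task in tasks:
        needed = CHECKPOINT_FOR_TASK[task]
        if needed != current:
            loads += 1
            current = needed
    return loads


# Hypothetical batch of independent jobs that alternates image and video work.
interleaved = ["x2t_image", "t2v", "image_edit", "video_edit"]
# Grouping by checkpoint (stable sort) keeps each group's original order.
grouped = sorted(interleaved, key=lambda task: CHECKPOINT_FOR_TASK[task])

print(count_loads(interleaved), "loads when interleaved")
print(count_loads(grouped), "loads when grouped by checkpoint")
```

Here the interleaved order triggers four loads and the grouped order two. Grouping only helps when the jobs are independent; a pipeline where a video step consumes the output of an image step cannot be reordered this way.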

Relaying from Qwen-Image to Wan2.2 in the Qwen approach involves comparable infrastructure overhead.
Lance’s “unified” applies to the training phase.
Multi-task learning of image and video on the same Transformer leads to representation transfer that lifts benchmarks.
What reaches the inference side is two separate checkpoints reflecting that training benefit.

References