ActionMesh: Meta AI's model for generating animated 3D meshes from video
What Is ActionMesh?
ActionMesh is a model released by Meta AI Research in January 2025 that generates animated 3D meshes from video. It uses an architecture called Temporal 3D Diffusion, which adds a time axis to conventional 3D diffusion models and produces a synchronized sequence of 3D shapes for each frame.
Output Formats
It produces three kinds of output. The mesh outputs use the .glb (glTF binary) format, so they can be imported into major 3D tools such as Blender, Unity, Unreal Engine, and Three.js.
| Output | Format | Description |
|---|---|---|
| Per-frame meshes | mesh_000.glb, mesh_001.glb, … | Static mesh for each time step |
| Animated mesh | animated_mesh.glb | Animation embedded. Requires Blender 3.5.1 for conversion |
| Rendered video | .mp4 | Preview output |
When you open animated_mesh.glb in Blender, the animation is already on the timeline. It does not appear to use bone-based skinning; instead, the per-frame shape changes are likely baked in as morph targets (shape keys in Blender terms), so retargeting the motion to another character rig is not directly possible.
No textures are included, so the output is geometry only. The project page says texturing is possible, but that just means you can apply textures to the exported mesh afterward.
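Since every mesh output is binary glTF, you can sanity-check a batch of exported files without any 3D tooling. The sketch below relies only on the public glTF 2.0 binary header layout (magic bytes `glTF`, version, declared total length); the function name and the `outputs` folder in the usage comment are my own assumptions, not part of the tool:

```python
import struct
from pathlib import Path

def is_valid_glb_header(data: bytes) -> bool:
    """Check the 12-byte glTF binary header: magic, version, declared length."""
    if len(data) < 12:
        return False
    magic, version, length = struct.unpack("<4sII", data[:12])
    return magic == b"glTF" and version == 2 and length >= 12

# Hypothetical usage: scan a folder of per-frame meshes.
# for path in sorted(Path("outputs").glob("mesh_*.glb")):
#     print(path.name, is_valid_glb_header(path.read_bytes()))
```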
Input Constraints
The input can be either an .mp4 file or a folder of sequential PNG files. It only uses 16 to 31 frames, and anything beyond that is ignored.
If you feed it a 6-second 30 fps video, it uses only the first 31 frames, which is roughly one second. Practical workarounds:
- Lower the frame rate with ffmpeg:

  ```
  ffmpeg -i input.mp4 -vf "fps=3" output.mp4
  ```

- Extract evenly spaced PNG frames and point the tool at that folder
The PNG approach gives you more control over which frames are actually used.
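Since at most 31 frames are consumed, it pays to choose which frames those are. A minimal sketch of the even-spacing idea (the 31-frame cap comes from the constraint above; the function name is my own):

```python
def evenly_spaced_indices(total_frames: int, max_frames: int = 31) -> list[int]:
    """Pick at most max_frames frame indices spread evenly across the clip."""
    if total_frames <= max_frames:
        return list(range(total_frames))
    # Spacing chosen so the first and last frames are always included.
    step = (total_frames - 1) / (max_frames - 1)
    return [round(i * step) for i in range(max_frames)]

# A 6-second 30 fps clip has 180 frames; this covers the whole clip
# instead of just the first second.
indices = evenly_spaced_indices(180)
```

You can then export exactly those frames to PNG (for example with OpenCV, or by mapping the indices onto ffmpeg's `select` filter) and point the tool at that folder.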
Runtime Environment
It requires an NVIDIA GPU with CUDA and at least 32 GB of VRAM. The tested GPUs are A100, H100, and H200.
| GPU | VRAM | Notes |
|---|---|---|
| RTX 5090 | 32 GB | Bare minimum. OOM risk depends on the input |
| A100 80GB | 80 GB | Officially tested and comfortable |
| H100 | 80 GB | Benchmark environment. About 115 seconds normally, 45 seconds in fast mode |
It does not run on Apple Silicon such as the M4 Pro because it depends on CUDA and PyTorch3D. There is no quantized version or MLX port yet either.
If you want to try it on RunPod, an A100 80GB instance is the safe choice; pushing an RTX 5090 to its 32 GB limit can easily turn into OOM debugging. If you rent a pod with SSH access, you can drive the whole workflow remotely, from environment setup to pointing the tool at a PNG frame folder.
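Before launching a long run it can be worth checking the 32 GB floor programmatically. A small sketch: the threshold check is plain Python, and the commented-out query assumes a machine with PyTorch and CUDA available (the function name and threshold constant are mine, taken from the requirement stated above):

```python
def enough_vram(total_bytes: int, required_gib: int = 32) -> bool:
    """True if the reported device memory meets the stated 32 GB minimum."""
    return total_bytes >= required_gib * 1024**3

# On an actual CUDA machine (not runnable on Apple Silicon):
# import torch
# total = torch.cuda.get_device_properties(0).total_memory
# print("OK to run:", enough_vram(total))
```

Note that a card marketed as "32 GB" may report slightly less than 32 GiB of usable memory, so treat a borderline result as a warning rather than a green light.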
Combining It with AI Video Generation
If you use a video generation model such as Grok to create a clip where the camera circles around a single character, then pass that video into ActionMesh, you can build a single-image to 3D model pipeline.
Compared with static multi-view reconstruction models such as TripoSR or InstantMesh, the advantages could be:
- If the character moves in the video, temporal cues may make it easier to infer how parts like hair or accessories belong to the body
- It may estimate asymmetric parts, such as a side ponytail, more accurately
That said, the public demos focus on large and smooth shapes such as full human bodies and camels. It is still unclear how well it can reconstruct thin structures like hair. It is more realistic to treat it as a way to get the rough base shape, then refine the details in Blender.
Basic Usage
```
# Setup
git clone https://github.com/facebookresearch/actionmesh.git
cd actionmesh
pip install -r requirements.txt
pip install -e .

# Inference with video input
python inference/video_to_animated_mesh.py --input video.mp4

# Fast mode (slightly lower quality)
python inference/video_to_animated_mesh.py --input video.mp4 --fast

# Input from a PNG frame folder
python inference/video_to_animated_mesh.py --input frames_folder/
```
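To process several clips in a row, the commands above can be wrapped in a small driver. The script path and the `--input`/`--fast` flags are taken from the commands above; everything else (the function names, the batching idea) is my own sketch, not part of the repository:

```python
import subprocess

def build_cmd(input_path: str, fast: bool = False) -> list[str]:
    """Assemble the inference command shown above."""
    cmd = ["python", "inference/video_to_animated_mesh.py", "--input", input_path]
    if fast:
        cmd.append("--fast")
    return cmd

def run_batch(inputs: list[str], fast: bool = False) -> None:
    # Hypothetical batch loop; each input is an .mp4 or a PNG frame folder.
    for item in inputs:
        subprocess.run(build_cmd(item, fast=fast), check=True)
```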