Tech 19 min read

Running TRELLIS.2 on M1 Max 64GB — a hands-on verification log

IkesanContents

Since I covered the CUDA-free port of TRELLIS.2 the other day, I wanted to verify whether it actually runs on the M1 Max 64GB I have on hand.

The benchmark in the original article was about 3.5 minutes on an M4 Pro 24GB. The M1 Max is a generation behind, but it has 32 GPU cores and 400 GB/s of memory bandwidth, so it should hold its own — that was the premise going in. It did run, but whether the speed is practical is a separate question.

Test environment

  • MacBook Pro (Apple M1 Max, 32 GPU cores, 64GB unified memory)
  • macOS Darwin 25.3.0
  • Python 3.11.14 (auto-installed via uv)
  • PyTorch 2.11.0 (MPS available)
  • 124GB free disk (model weights are about 15GB)

The system only had Python 3.13 and 3.12, but setup.sh specifies uv venv .venv --python python3.11, so uv downloaded 3.11.14 automatically and put it inside the venv. The system Python wasn’t touched.

Setup

Clone the repo and run setup.sh.

cd ~/projects
git clone https://github.com/shivampkumar/trellis-mac.git
cd trellis-mac
bash setup.sh

The script does the following.

  1. Creates a venv with uv venv .venv --python python3.11 (downloads 3.11 if missing)
  2. Batch-installs torch, torchvision, transformers, kornia, trimesh, xatlas, etc.
  3. Installs utils3d from a pinned git commit
  4. Shallow-clones upstream microsoft/TRELLIS.2
  5. Runs patches/mps_compat.py to apply MPS-compatibility patches
  6. Checks HuggingFace auth status

The whole thing takes about 4 minutes. Total dependency download is under 1GB, and the venv including Python 3.11 fits in about 800MB.

Patches applied

mps_compat.py rewrites 8 files in the upstream code and adds 2 new ones.

Patched: trellis2/modules/sparse/config.py
Patched: trellis2/modules/sparse/attention/full_attn.py
Patched: trellis2/modules/image_feature_extractor.py
Patched: trellis2/pipelines/rembg/BiRefNet.py
Patched: trellis2/representations/mesh/base.py
Patched: trellis2/models/sc_vaes/fdg_vae.py
Patched: trellis2/pipelines/trellis2_image_to_3d.py
Patched: trellis2/pipelines/base.py
Installed: trellis2/modules/sparse/conv/conv_none.py
Installed: stubs/o_voxel/convert.py

As described in the original article, sdpa and naive are added as sparse attention backends and flash_attn calls are rerouted to torch.nn.functional.scaled_dot_product_attention. The new conv_none.py is the gather-scatter implementation of sparse 3D convolution, and stubs/o_voxel/convert.py is a Python-dict replacement for the CUDA hashmap.

One thing that changed since the original article: nvdiffrast (texture baking) was stubbed out back then, but the current trellis-mac ships with backends/texture_baker.py, a pure-PyTorch implementation that actually bakes PBR textures.

HuggingFace auth and gated models

TRELLIS.2 inference needs access to two gated models.

  • facebook/dinov3-vitl16-pretrain-lvd1689m (image feature extraction)
  • briaai/RMBG-2.0 (background removal)

If your environment can’t take hidden password input at the CLI prompt, it’s easier to create a token up front and pass it as an argument.

hf auth login --token <HF-token>

Access requests are made by clicking “Request access” on each model page. RMBG-2.0 is gated: auto and was approved instantly. DINOv3 is gated: manual, so Meta has to review the request by hand. The form asks for name, date of birth, country, affiliation, and job title.

You can check approval by trying to download via huggingface_hub.

from huggingface_hub import hf_hub_download
for repo in ['facebook/dinov3-vitl16-pretrain-lvd1689m', 'briaai/RMBG-2.0']:
    try:
        hf_hub_download(repo_id=repo, filename='config.json')
        print(f'[OK] {repo}')
    except Exception as e:
        print(f'[NG] {repo}: {type(e).__name__}')

Manual review for DINOv3 takes a few hours on the fast side and up to a day in my experience. Mine came back in a few hours.

Test plan

I ran three input images back-to-back, to see how input affinity to the 3D model affects output quality.

① Official sample (baseline)

TRELLIS.2 official shoe input

assets/shoe_input.png. A photo-like shoe, firmly in the training distribution. Start here to confirm the pipeline runs.

② Anime illustration (Kanachan in a swimsuit)

Kanachan front, swimsuit, background removed

A 1760x2432 high-res anime illustration. Full-body front, limbs not overlapping. Good conditions for shape inference. The interesting parts are how the hair strands and facial features (especially the eyes) get interpreted.

Before feeding it in, I ran RMBG-2.0 to remove the background. TRELLIS.2 calls RMBG-2.0 internally as well, so it goes through the same model either way, but removing it ahead of time lets you eyeball the intermediate. Running RMBG-2.0 standalone on MPS finished the 1760x2432 image in a few seconds.

③ Pixel-art chibi (Kanachan Pixel)

Kanachan chibi pixel art

256x256 pixel art, chibi proportions, completely flat shading. Almost no shading cues, and about as far from the training distribution of a 3D generation model as you can get. I expected it to fall apart — the question is how.

②③ are “Kanachan”, the character that shows up on this blog. I’ve run several 3D experiments with her before.

Up to now the options have been cloud services (Rodin, Hunyuan 3D) or local runs on an NVIDIA GPU. Running locally without NVIDIA is new territory that TRELLIS.2 MPS opens up. Having the same Kanachan inputs across all of these makes cloud vs. local and CUDA vs. MPS comparisons easier.

3D generative models are trained on inputs that have clear shading and perspective — photos, painted illustrations. Flat anime-style coloring, and pixel art especially, sit outside that distribution. That makes them useful for probing “how much does the input have to look like a training example.”

Generation results

Every run used python generate.py <image> --output <out> with defaults (pipeline-type=512, texture-size=1024, seed=42). Each run ran in its own process, and I sampled RSS via ps every 10 seconds to capture memory peaks.

① Shoe (baseline)

ItemValue
Model load340s (first run, includes ~15GB download)
Sparse structure sampling84s
Shape SLat sampling26s
Texture SLat sampling14s
Mesh decode + generation total201s
PBR texture bake3954s (~66 min)
Total~75 min
Output mesh427,543 verts / 865,462 tris
GLB file27MB
Memory peak18.47GB

Generation itself (201s) was roughly on par with the M4 Pro numbers from the original article (sparse + Shape + Texture + mesh-decode totalled 185s).

PBR texture baking completely dominated the wall time: 201s of generation vs. 3954s of baking, with baking taking 95% of the total. The original article assumed nvdiffrast was stubbed and baking was skipped, but backends/texture_baker.py now has a pure-PyTorch baking implementation, and that’s what ran here.

Dropping the output into the lab’s 3D model viewer confirms the shape is recognizably a shoe, but the baked texture is covered in bright-red noise.

Shoe GLB output, texture is noisy red patterns

Opening the raw base color image out_shoe_basecolor.png directly shows it even more clearly: the red accents from the input are picked up but scattered randomly across the UV sheet instead of landing on meaningful surface regions.

Base color UV map is a red chaos

The pure-PyTorch implementation in backends/texture_baker.py is a straightforward UV unwrap (via xatlas) plus voxel-based attribute sampling. Compared to CUDA’s nvdiffrast (a differentiable rasterizer), the bake quality just isn’t there yet. Even after paying the long bake time, the resulting texture isn’t usable.

② Kanachan in a swimsuit (with texture)

ItemValue
Model load128s (cached)
Generation total152s
PBR texture bake5024s (~84 min)
Total~88 min
Output mesh300,373 verts / 671,142 tris
GLB file22MB
Memory peak17.77GB

The model cache cut load time from 340s to 128s (about 2.6x).

An interesting point: generation took 152s, faster than the shoe’s 201s. The input is 1760x2432 but TRELLIS.2 resizes internally, so the real driver is the voxel density of the sparse structure.

TRELLIS.2 represents 3D space as a collection of small cubes called “voxels”. The space is divided dice-style, and only cells with something inside are selected for computation — those “has something” cells are the active voxels. The more active voxels, the more points the convolutions have to chew through. A full-body anime silhouette seen head-on occupies less 3D volume (fewer active voxels) than a shoe, so it finished faster.

Kanachan 3D mesh, front. Shape reads as a humanoid but texture is noisy

Kanachan 3D mesh, back. The ponytail is reconstructed

The texture is noisy, same as the shoe, but the shape was genuinely surprising. From a single 2D anime illustration, the hair, limbs, and ponytail were all correctly lifted into 3D. Facial details (eyes, nose, mouth) are too smoothed to read from these screenshots, but the silhouette and body proportions hold together. The back is absent from the input, so it’s inferred, and yet comes out without obvious breakage. Watching TRELLIS.2’s shape inference work reasonably on an input that isn’t at the center of its training distribution was the most unexpected finding of this run.

③ Chibi pixel-art Kanachan (with texture)

ItemValue
Model load126s (cached)
Generation total166s
PBR texture bake13429s (~224 min)
Total~229 min (3h 49m)
Output mesh415,433 verts / 1,055,998 tris
GLB file28MB

A 256x256 pixel art input produced a mesh with over a million triangles: 1.2× the shoe (865K) and 1.6× the swimsuit (671K). The bake time scaled with it and hit the longest 224 minutes.

3D result of the chibi pixel art, a lumpy broken mesh

Completely broken, as expected. The first thing that came to mind was “this looks like Gekidan Inu Curry art” — the collage-style imagery Gekidan Inu Curry did for the witch-realms in Puella Magi Madoka Magica has exactly this otherworldly noise texture. Pixel art has lots of “boundaries” (sharp color jumps), and the 3D model reads those as fine surface relief, resulting in an over-noisy mesh. You can’t recognize the character. We’re outside the training distribution, and probably DINOv3’s feature extraction can’t meaningfully structure pixel art in the first place.

Chibi seen from the side: almost no depth, basically a flat wall

Viewed from the side, the mesh turns out to have almost no thickness at all. It’s more like a flat wall with noise stuck on it than a 3D object — “2D crushed into the height axis” would be a fair description. A front-facing pixel-art image gives the model nothing to infer the back from, and the depth axis essentially gets abandoned.

④ Shoe (rerun with --no-texture)

To verify that baking really was the dominant cost, I reran the shoe with --no-texture.

ItemValue
Model load133s
Generation total191.6s
PBR texture bake0s (skipped)
Total325s (~5.4 min)
Output mesh427,543 verts / 865,462 tris (identical to ①)
GLB file15MB (vs. ①’s 27MB, vertex colors only)

Same seed, so the mesh matches ① exactly. Skipping texture baking alone dropped the total from 75 min to 5.4 min.

Shoe vertex-color output, clean white mesh with the shape visible

In the viewer, the red texture noise is gone and the shoe’s form reads cleanly. Same mesh as ①, yet skipping the bake made the output look dramatically better. With the current bake quality where it is, --no-texture and vertex colors are the more practical default.

⑤ Kanachan swimsuit (rerun with --no-texture)

From ②’s result I got the impression of “shape is good but the texture noise ruins it”, so I reran the same image with --no-texture.

ItemValue
Model load132s
Generation total167.4s
PBR texture bake0s
Total299s (~5 min)
Output mesh300,373 verts / 671,142 tris (identical to ②)
GLB file11MB

Kanachan 3D, vertex colors, front. The antenna hair, ponytail, swimsuit, and overall proportions all read

Kanachan 3D, vertex colors, three-quarter view. Has real volume

Kanachan 3D, vertex colors, side (shown tipped over)

Face close-up: eye sockets and the nose/mouth relief faintly readable

This was the best result of the whole experiment. The antenna hair tuft on top, the ponytail, the swimsuit’s neckline and leg line — all come through as legitimate mesh relief. The back view isn’t broken either. Up close, the face shows faint depressions for the eyes and slight relief for the nose and mouth — not enough to call it “reads as the character at a glance”, but it’s the most face-like output of any 3D experiment I’ve done.

The texture noise on ② was hiding how good the underlying shape was. Getting a non-broken humanoid 3D mesh from a single anime illustration in about 5 minutes is more than enough for what the M1 Max can do locally. If vertex colors are fine for the use case, this path could actually show up in a real workflow.

⑥ Feeding in a cropped face

To push more of the vertex budget toward the face, I cropped just the head out of the original and fed that in. With the full body, vertices spread across body/clothes/limbs; feed just the head and the same budget should concentrate around the face — that was the hypothesis.

ItemValue
Model load133s
Generation total539.2s (~9 min)
Total~11 min
Output mesh2,002,285 verts / 2,705,690 tris
GLB file53MB

Vertex count did indeed shoot up — 2M vertices against the swimsuit’s 300K, a 6.7× jump.

Face-crop output, front

Face-crop output, close-up

Face-crop output, side profile

Face-crop output, back. Noticeably sparser, the back of the head is squashed

But the “eyes/nose/mouth pop” effect I was hoping for barely materialized. The 6.7× of extra vertices spread out across hair strands and general surface noise, and the eye/nose/mouth relief reads about the same as it did in the full-body version.

The back view is also visibly sparser. The model has to guess information that isn’t in the input, and the back of the head has essentially no cues, so fewer vertices land there than on the front. A single 2D image input inherently leaves the back underfed.

It shouldn’t really be surprising that vertex count doesn’t track semantic importance. From the model’s perspective, “how many points you need to reproduce the surface” is driven by surface area and the complexity of the relief. A face is relatively smooth as a curved surface, so it doesn’t need many vertices; hair tufts and unruly strand tips demand more. The model spends its budget there instead.

If the goal is a crisp face, bumping vertex count isn’t the right lever. The real lever would be adding more input information — multi-view input, for instance. But TRELLIS.2’s run() is designed around a single image; it isn’t trained to fuse multi-view inputs. I’ve used multi-image input with Hyper3D Rodin in the past, but that works because Rodin supports multi-view on the server side. TRELLIS.2 doesn’t give you that option. As of now, I don’t see a realistic path to push face detail higher on the TRELLIS.2 Mac port.

⑦ The high-resolution (1024) pipeline didn’t run

I tried the same Kanachan swimsuit input with --pipeline-type 1024 --no-texture to see if more resolution would surface more face detail. Shape SLat (109s) and Texture SLat (63s) both completed, but it crashed right at mesh decoding.

File ".../trellis2/modules/sparse/conv/conv_none.py", line 117
  t_idx = tgt_idx[mask]
torch.AcceleratorError: index 17020172 is out of bounds: 0, range 0 to 17020172

The gather-scatter sparse 3D convolution (conv_none.py) that is the core of the CUDA-free port is mishandling a bounds check on the larger tensors the 1024 pipeline produces. An off-by-one takes the index just past the end. The 512 pipeline runs the same code path without crashing, so this is presumably a bug that only triggers at specific voxel-grid sizes.

For now, the 512 pipeline is effectively the only one you can run on trellis-mac. The face-detail push hit its ceiling here.

Performance comparison

Versus M4 Pro 24GB (numbers quoted from the original article).

StageM4 Pro 24GBM1 Max 64GBRatio
Model load45s128–133s (cached)2.8× slower
Sparse structure sampling15s84s5.6× slower
Shape SLat sampling90s26s3.5× faster
Texture SLat sampling50s14s3.6× faster
Mesh decode30s~80s (estimated from totals)2.7× slower
Generation total~3.5 min~3.2–3.4 minroughly equal
PBR texture bake(not implemented at article time)66–224 min

Looked at purely as generation, the M1 Max matches the M4 Pro, and on Shape/Texture SLat it’s over 3× faster. Those two stages are dominated by sparse tensor math, and the M1 Max’s 400 GB/s memory bandwidth (vs. M4 Pro’s 273 GB/s) looks like the reason.

Sparse structure sampling and mesh decoding are slower on the M1 Max. These stages run backends/conv_none.py’s gather-scatter sparse 3D convolution, which has a lot of Python-layer loop and index work, so the hardware bandwidth doesn’t get to shine. The original article put it bluntly: “pure-PyTorch sparse convolution is about 10× slower than CUDA’s flex_gemm kernel.” That’s the bottleneck.

Memory and MPS fallback notes

Memory peaks across all three runs stayed in the 17.5–18.5GB range, matching the original article’s “~18GB peak”. With 64GB you can generate without closing other apps.

The important line here is that 16GB Macs don’t run this model. An M4 mini (16GB) can’t cover the 18GB peak and will start swapping, which takes it out of practical use. 24GB is the real minimum — M2/M3/M4 Pro-tier hardware is the usable floor.

On the MPS side, there’s exactly one CPU-fallback warning, for aten::segment_reduce.

UserWarning: The operator 'aten::segment_reduce' is not currently supported
on the MPS backend and will fall back to run on the CPU.
This may have performance implications.

This is a PyTorch MPS backend coverage gap, not a generation-specific thing. Running on an M4 Pro would hit the same warning and the same fallback. The question is whether PyTorch ships the op — it’s not a hardware issue.

M1 Max vs. M4 Pro outlook

Separating hardware differences from software differences to think about whether M4 Pro would close the gap.

Hardware-side:

MetricM1 MaxM4 Pro
Memory bandwidth400 GB/s273 GB/s
GPU cores3220
GPU generationoldernewer (better per-core efficiency)
Native BF16✗ (emulated)
Unified memory (top tier)64GB48GB

Memory bandwidth favors the M1 Max. Native BF16 favors the M4 Pro, but TRELLIS.2 doesn’t depend on BF16 (the original article says so), so it doesn’t hit this workload.

On the software side, missing ops like segment_reduce are a PyTorch MPS coverage gap — the M4 Pro would hit the same CPU fallback. So the story isn’t “optimized for M4”, it’s “undercovered everywhere in equal measure”.

If PyTorch’s MPS coverage filled in and sparse 3D convolution got an MPS-native implementation, the M1 Max could actually come out ahead thanks to its bandwidth advantage. Right now the software bottleneck is doing most of the work, and hardware differences are a secondary concern.

Texture baking is the practical wall

Organized by the numbers, baking takes 90%+ of the total.

RunGenerationBakeTotalBake ratio
① Shoe201s3954s4155s95%
② Swimsuit152s5024s5176s97%
③ Chibi166s13429s13595s99%
④ Shoe --no-texture192s0s325s

Baking is heavy because texture_baker.py runs UV unwrap → voxel attribute sampling → texel write in a Python loop over every triangle. Time scales roughly with triangle count, which is how the million-triangle chibi ended up at 224 minutes.

And the baked result at the end isn’t even usable. Opening the shoe’s _basecolor.png shows that the input’s color information didn’t map correctly onto the UV sheet — what gets baked is noise. Shape generation works and the quality is fine; PBR texture baking is unusable on both time and quality. If your use case is OK with vertex colors, --no-texture drops the total to about 5 minutes, and the output actually looks better.

Practicality and use cases

The fair read is that this MPS port genuinely does something new — it removes the “needs an NVIDIA GPU” precondition. That’s a real unlock.

That said, compared to cloud services like Hyper3D Rodin that complete in tens of seconds to a few minutes, or CUDA setups that do the same in a few minutes, TRELLIS.2 on the M1 Max is workable in about 5 minutes with vertex colors, but becomes 66–224 minutes with PBR textures — too heavy for everyday use.

For vertex-color workflows, this port is a real option when you prefer not to ship your inputs to a cloud service. If you need PBR textures, a cloud service or a CUDA box is still the realistic answer.

There’s a lot of room for software-side improvement. If PyTorch’s MPS op coverage progresses or texture_baker.py gets optimized, the picture changes meaningfully. Down the line, an MLX reimplementation or native Apple sparse-tensor APIs could tilt things considerably.


All screenshots in this article were taken by dropping the GLB files into the lab’s 3D model viewer. If you have the GLB output from a run, you can inspect it the same way.

References: