Can LTX-2 and Wan 2.2 Run on an M1 Max 64GB? I Checked, Then Actually Ran Them
I want to run AI video generation locally on an M1 Max 64GB MacBook Pro. Two candidates: LTX-2 and Wan 2.2. Both are open-source and have received attention as locally runnable models.
Before running them, I researched spec requirements and Mac compatibility. Wan 2.2 ran in GGUF format but took 82 minutes for a 2-second video.
LTX-2
A model from Lightricks that generates video and audio simultaneously. Rather than generating video and audio separately and combining them, a single inference pass produces both.
- Parameters: Dual-stream of 14B video + 5B audio
- Output: FHD to 4K, 25/50fps, 6–10 seconds (up to 20 seconds)
- Audio: Environmental sounds, foley, music, and dialogue generated simultaneously
- License: Open source (GitHub, HuggingFace)
Audio Control
Audio can be specified in detail through prompts. From the official prompt guide:
- Environmental sounds: “rain falling on a tin roof”
- Sound effects: “footsteps on gravel”
- Music: “upbeat jazz piano in the background”
- Dialogue: Specify in quotes with language and accent
Modality-CFG lets you adjust the control balance between video and audio. Uses Gemma 3 as the text encoder and claims multilingual support. Japanese dialogue quality untested.
Mac Compatibility
Official system requirements assume NVIDIA GPU (32GB+ VRAM recommended), but it runs on Apple Silicon.
- MLX port project in progress
- Native macOS app exists (for LTX-Video original)
- FP8 models don’t work on Metal. Must use FP16 (bf16) or GGUF format
Performance (Existing Reports)
| Environment | Generation time | Video length | Resolution |
|---|---|---|---|
| NVIDIA H100 | ~2 seconds | 5 seconds | 768x512 |
| Mac M3 | ~5 minutes | a few seconds | unknown |
| Mac M1 16GB | ~15 minutes | ~4 seconds | unknown |
M1 Max 64GB should be faster than M1 16GB, but nothing close to NVIDIA.
Wan 2.2
A video generation model developed by Alibaba, well-regarded for quality.
- Models: Two sizes — 1.3B (lightweight) and 14B (full)
- Recommended VRAM: 1.3B: 6–12GB, 14B: 24GB+
- Output: 480p to 1080p
- Audio: Not supported (video only)
Mac Compatibility
Officially CUDA-only, Metal/MPS not supported. However, there are ways to run it.
- Community fork with MPS support (M1 Pro working)
- FP8 causes errors on MPS via ComfyUI → can bypass with GGUF format
- Apple Silicon issues reported in ComfyUI issue #9255
Performance (Existing Reports)
| Environment | Model | Generation time | Frames | Resolution |
|---|---|---|---|---|
| NVIDIA RTX 3060 12GB | 14B | ~15 minutes | 81 frames | 840x420 |
| Mac M1 Pro | 1.3B | ~10 minutes | 8 frames | 480p |
M1 Pro takes 10 minutes for 8 frames (about 0.3 seconds at 25fps). A 5-second video would be well over an hour.
Comparison
| | LTX-2 | Wan 2.2 |
|---|---|---|
| Mac compatibility | MLX port in progress; relatively easy to run | Officially unsupported; GGUF workaround needed |
| Setup difficulty | Low | Moderately high |
| Simultaneous audio | Supported | Not supported |
| Generation speed (Mac) | ~7 min for 1.3-sec video (M1 Max 64GB, GGUF) | 82 min for 2-sec video (M1 Max 64GB, GGUF) |
| Video quality | Good | Higher quality (especially 14B) |
| Model size | ~42GB | 1.3B: a few GB / 14B: tens of GB |
For just trying things out on Mac, LTX-2 is overwhelmingly easier. Simultaneous audio makes it interesting for i2v use cases. Wan 2.2 has higher quality but running at practical speeds on Mac is difficult — you’d want NVIDIA GPU for serious use.
Real-World Tests
Wan 2.2: Challenge with GGUF Format
I initially planned to skip this, but found that GGUF format could bypass the FP8 issue and run on Mac.
Four files needed:
| Category | File | Size |
|---|---|---|
| diffusion_models | wan2.2_t2v_high_noise_14B_Q4_K_S.gguf | 8.75 GB |
| diffusion_models | wan2.2_t2v_low_noise_14B_Q4_K_S.gguf | 8.75 GB |
| text_encoders | umt5-xxl-encoder-Q8_0.gguf | 6.04 GB |
| vae | wan_2.1_vae.safetensors | 242 MB |
Total about 24GB. Model from bullerwins/Wan2.2-T2V-A14B-GGUF, text encoder from city96/umt5-xxl-encoder-gguf, VAE from Comfy-Org/Wan_2.2_ComfyUI_Repackaged.
Wan 2.2’s 14B model has a two-stage high-noise / low-noise architecture. Two KSamplerAdvanced nodes are chained in the workflow, each loading a model from a separate checkpoint.
- Stage 1 (high noise): Steps 0–10, noise added. Determines overall composition and layout
- Stage 2 (low noise): Steps 10–20, no noise added. Receives stage 1’s output latent and refines details
Using separately optimized models for high and low noise stages improves quality — that’s why two GGUF files are needed.
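The handoff can be sketched as a toy loop. The two "models" below are stand-ins for the high-noise and low-noise GGUF checkpoints, and the multipliers are purely illustrative; only the step-range split and the latent handoff mirror the real workflow.

```python
# Toy sketch of Wan 2.2's two-stage sampling (not the real samplers):
# two stand-in "models" denoise the same latent over a shared 20-step
# schedule, mirroring the two chained KSamplerAdvanced nodes.

def run_stage(latent, model, start_step, end_step):
    """Run one sampling stage over steps [start_step, end_step)."""
    for step in range(start_step, end_step):
        latent = model(latent, step)
    return latent

# Stand-ins: the high-noise model makes large corrections (composition),
# the low-noise model makes small ones (detail refinement).
high_noise_model = lambda x, s: x * 0.5  # coarse: big moves
low_noise_model = lambda x, s: x * 0.9   # fine: small refinements

latent = 100.0                                       # pretend noisy latent
latent = run_stage(latent, high_noise_model, 0, 10)  # stage 1: steps 0-10
latent = run_stage(latent, low_noise_model, 10, 20)  # stage 2: steps 10-20
print(latent)
```

The key point is that stage 2 receives stage 1's output latent directly; it never sees the original noise.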
Quantization level is Q4_K_S. With 64GB available, Q8_0 would fit in memory, but confirming it works is the priority.
ComfyUI-GGUF Node
ComfyUI’s standard model loader can’t read GGUF. Need to install city96/ComfyUI-GGUF as a custom node. This adds UnetLoaderGGUF and CLIPLoaderGGUF nodes.
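The usual custom-node install looks like this, assuming a default ComfyUI checkout (adjust the path to your setup):

```shell
# Install the ComfyUI-GGUF custom node (adds UnetLoaderGGUF / CLIPLoaderGGUF).
cd ComfyUI/custom_nodes
git clone https://github.com/city96/ComfyUI-GGUF
pip install -r ComfyUI-GGUF/requirements.txt  # pulls in the gguf package
# Restart ComfyUI afterwards so the new nodes are registered.
```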
Workflow
Based on ComfyUI’s official Wan 2.2 T2V 14B workflow, swapping model loaders for GGUF versions:
- UNETLoader × 2 → UnetLoaderGGUF × 2 (with GGUF filenames)
- CLIPLoader → CLIPLoaderGGUF (umt5-xxl GGUF Q8_0)
- VAELoader → wan_2.1_vae.safetensors
- Resolution 1280x704 → 832x480, frame count 57 → 33 (just verifying it runs first)
The VAE Trap
Initially used wan2.2_vae.safetensors (1.41GB), which errored at VAE decode after 80 minutes of sampling:
RuntimeError: Given groups=1, weight of size [48, 48, 1, 1, 1],
expected input[1, 16, 9, 60, 104] to have 48 channels, but got 16 channels instead
wan2.2_vae expects 48-channel input, but the T2V diffusion model outputs 16-channel latent. The official workflow notes also clearly state “This model uses the wan 2.1 VAE.” Wan 2.2 only updated the diffusion model; T2V/I2V VAE still uses the 2.1 version. The correct VAE is wan_2.1_vae.safetensors (242MB). Watch out — the version numbers don’t match.
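A hypothetical pre-decode guard makes the mismatch concrete. The channel counts come from the error message above; `check_vae` is illustrative, not a ComfyUI node.

```python
# Minimal illustration of the VAE trap: the first conv of wan2.2_vae has
# weight shape [48, 48, 1, 1, 1] (expects 48-channel latents), while the
# Wan 2.2 T2V diffusion model outputs 16-channel latents. A guard like
# this would have caught it before the 80-minute sampling run.

VAE_LATENT_CHANNELS = {
    "wan_2.1_vae.safetensors": 16,  # correct for Wan 2.2 T2V/I2V
    "wan2.2_vae.safetensors": 48,   # expects 48 channels -> decode error
}

def check_vae(latent_shape, vae_name):
    """latent_shape: (batch, channels, frames, height, width)."""
    expected = VAE_LATENT_CHANNELS[vae_name]
    got = latent_shape[1]
    if got != expected:
        raise ValueError(
            f"{vae_name} expects {expected}-channel latents, got {got}"
        )

# The failing combination from this run: a [1, 16, 9, 60, 104] latent.
check_vae((1, 16, 9, 60, 104), "wan_2.1_vae.safetensors")  # OK
try:
    check_vae((1, 16, 9, 60, 104), "wan2.2_vae.safetensors")
except ValueError as e:
    print(e)  # prints the channel mismatch message
```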
Result: It worked
After fixing the VAE, the video generated without errors.
Prompt: “a robot is running through a futuristic cyberpunk city with neon signs and darkness with bright HDR lights”
| Phase | Time |
|---|---|
| Stage 1 KSampler (high noise) 10 steps | 41 min 40 sec (250s/step) |
| Stage 2 KSampler (low noise) 10 steps | 39 min 21 sec (236s/step) |
| VAE decode | ~1 min 41 sec |
| Total | 1 hour 22 min 45 sec |
832x480, 33 frames (16fps ≈ 2 seconds) took 82 minutes. It works, but practical speed this isn’t. A 5-second video would be over 3 hours.
LTX-2: Setup
Using ComfyUI’s official “LTX-2 i2v” template. Loading the workflow prompts you to download these models:
| Category | File | Size |
|---|---|---|
| checkpoints | ltx-2-19b-dev-fp8.safetensors | 25.22 GB |
| text_encoders | gemma_3_12B_it_fp4_mixed.safetensors | 8.8 GB |
| loras | ltx-2-19b-distilled-lora-384.safetensors | 7.15 GB |
| loras | ltx-2-19b-lora-camera-control-dolly-left.safetensors | 312 MB |
| latent_upscale_models | ltx-2-spatial-upscaler-x2-1.0.safetensors | unknown |
About 42GB total. The checkpoint alone is 25GB, so depending on your connection this could take a while.
Pre-testing research said FP8 doesn’t work on Metal, but since the official ComfyUI template specifies FP8 checkpoints, I tried it first.
FP8: Doesn’t Work
RuntimeError: Undefined type Float8_e4m3fn
Errors out in 55 seconds. Metal doesn’t implement Float8_e4m3fn, so FP8 checkpoints can’t be used on Apple Silicon. The official ComfyUI template specifies FP8 because it assumes NVIDIA GPU.
LTX-2: Retry with GGUF Format
If FP8 fails, use GGUF like with Wan 2.2. Kijai/LTXV2_comfy distributes GGUF quantized LTX-2 models.
However, LTX-2 has a wider FP8 problem than Wan 2.2. Not just the checkpoint — the official distribution of the text encoder (Gemma 3 12B), gemma_3_12B_it_fp4_mixed.safetensors, also contains FP8 tensors. Both the model body and text encoder need GGUF replacements.
| Category | File | Size | Source |
|---|---|---|---|
| diffusion_models | ltx-2-19b-dev-Q4_K_S.gguf | 11 GB | Kijai/LTXV2_comfy |
| text_encoders | gemma-3-12b-it-Q4_K_M.gguf | 6.8 GB | unsloth/gemma-3-12b-it-GGUF |
| text_encoders | ltx-2-19b-embeddings_connector_distill_bf16.safetensors | 2.7 GB | Kijai/LTXV2_comfy |
| vae | LTX2_video_vae_bf16.safetensors | 2.3 GB | Kijai/LTXV2_comfy |
| loras | ltx-2-19b-distilled-lora-384.safetensors | 7.1 GB | Lightricks/LTX-2 |
~30GB total. More than Wan 2.2’s 24GB, but no problem with 64GB.
Gemma 3 Also Requires GGUF
Trying gemma_3_12B_it_fp4_mixed.safetensors first produced the exact same error as the checkpoint:
RuntimeError: Undefined type Float8_e4m3fn
The name says “fp4_mixed” but it contains mixed FP8 tensors internally. Can’t use on Mac. Download GGUF from unsloth/gemma-3-12b-it-GGUF and load with the DualCLIPLoaderGGUF node (type: ltxv). Specify Gemma 3 GGUF as clip_name1 and embeddings connector as clip_name2.
ComfyUI-GGUF has Gemma 3 support as of PR #402. Latest version is fine.
Workflow Configuration
Unlike Wan 2.2, LTX-2 is a single-stage architecture. One KSampler is enough.
- UnetLoaderGGUF → LoraLoaderModelOnly (distilled LoRA, strength 0.6) → ModelSamplingLTXV (max_shift 2.05, base_shift 0.95)
- DualCLIPLoaderGGUF (type: ltxv): Gemma 3 GGUF + embeddings connector
- LTXVImgToVideo: generates latent from input image (for i2v)
- KSampler: 25 steps, cfg 5.5, euler, normal
- VAELoader → VAEDecode: LTX2_video_vae_bf16.safetensors
The official workflow uses distilled LoRA at strength 0.6. The dev model alone produces unstable output — this LoRA is essentially required.
T2V Test: Works but Quality Is Rough
First, I confirmed basic operation with a 512x288, 33-frame Text-to-Video run.
| Phase | Time |
|---|---|
| KSampler 25 steps | 6 min 7 sec (14.7s/step) |
| VAE decode | ~47 sec |
| Total | 6 min 54 sec |
~12x faster than Wan 2.2’s 82 minutes, though the resolutions differ (512x288 vs 832x480), so it’s not a fair comparison. The output showed a neon-lit cityscape, but the robot subject was blurred beyond recognition, likely because this run used no distilled LoRA and a low resolution.
I2V Test
Added distilled LoRA, raised resolution to 768x512, tested Image-to-Video.
| Phase | Time |
|---|---|
| KSampler 25 steps | ~12 min 50 sec (30.8s/step) |
| VAE decode | ~52 sec |
| Total | 13 min 42 sec |
768x512, 33 frames (25fps ≈ 1.3 seconds) in 13 min 42 sec. Compared to 512x288 T2V (6 min 54 sec), 2.7x the pixels took about 2x the time — not linear scaling.
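A quick check of that scaling claim from the two totals:

```python
# Sanity-check the scaling claim: 768x512 I2V vs 512x288 T2V,
# both at 33 frames.
t2v_pixels = 512 * 288       # T2V smoke-test resolution
i2v_pixels = 768 * 512       # I2V test resolution
t2v_seconds = 6 * 60 + 54    # 6 min 54 sec total
i2v_seconds = 13 * 60 + 42   # 13 min 42 sec total

pixel_ratio = i2v_pixels / t2v_pixels
time_ratio = i2v_seconds / t2v_seconds
print(f"{pixel_ratio:.1f}x pixels, {time_ratio:.1f}x time")
# → 2.7x pixels, 2.0x time
```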
T2V output (no distilled LoRA, 512x288):
Just colored blobs bouncing up and down. The robot subject is nowhere. 7 minutes to produce something achievable with CSS animation.
I2V output (distilled LoRA 0.6, 768x512):
The first attempt (strength 1.0) stayed static and then collapsed, and the second attempt (strength 0.7, modified prompt, 13 min 50 sec) was also a total failure.
Frame-by-frame pixel difference analysis:
| Test | Inter-frame diff | First/second half motion | State |
|---|---|---|---|
| T2V (no LoRA, 512x288) | 14.5 | 14.5 / 14.4 | Whole screen bobbing up and down. Subject unidentifiable |
| I2V 1st (strength 1.0) | 10.5 | 0.2 / 20.8 | Static first half → sudden collapse second half |
| I2V 2nd (strength 0.7) | 0.7 | 0.2 / 1.2 | Almost a still image after 13 min 50 sec |
With I2V at strength 1.0, the input image is locked too tightly so the first half is completely static, then control is lost and it collapses mid-way. Lower to 0.7 and it barely moves. The “just right” range is extremely narrow, or motion generation simply doesn’t work at this quantization level.
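For reference, a minimal sketch of how inter-frame differences like those above can be computed. The exact extraction pipeline used here is an assumption; frames are flat grayscale lists for illustration, where a real pipeline would first dump frames from the video (e.g. with ffmpeg).

```python
# Sketch of the motion metric (assumed): mean absolute difference of
# pixel values between consecutive frames. Near 0 = still image;
# double digits = visible whole-frame motion.

def interframe_diff(frames):
    """Average |pixel delta| between each pair of consecutive frames."""
    diffs = []
    for prev, cur in zip(frames, frames[1:]):
        total = sum(abs(a - b) for a, b in zip(prev, cur))
        diffs.append(total / len(prev))
    return sum(diffs) / len(diffs)

# Toy check: identical frames -> 0.0 (a "still image" like the I2V runs);
# frames alternating by 14 gray levels -> 14.0 (motion, like the T2V run).
static = [[128] * 16] * 4
moving = [[128] * 16, [142] * 16, [128] * 16, [142] * 16]
print(interframe_diff(static), interframe_diff(moving))  # → 0.0 14.0
```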
Why: The Setup Differed from Official
Looking at these results, I initially blamed GGUF quantization, but my workflow configuration was completely different from the official one. The variables weren’t properly isolated.
Examining the official I2V workflow (Lightricks/ComfyUI-LTXVideo’s LTX-2_I2V_Distilled_wLora.json), the sampling pipeline is fundamentally different:
| Item | My workflow | Official workflow |
|---|---|---|
| Model | dev + distilled LoRA | distilled (direct use of distilled model) |
| Sampler | KSampler 25 steps | SamplerCustomAdvanced 2-stage |
| Scheduler | normal | ManualSigmas (hardcoded sigma values) |
| CFG | 5.5 | 1.0 |
| I2V method | LTXVImgToVideo | LTXVImgToVideoInplace × 2 |
The official workflow hides a 2-stage pipeline inside a group node. Stage 1 (8 steps) denoises, then re-conditions with LTXVImgToVideoInplace(strength=1.0), and Stage 2 (3 steps) refines. Sigma values are hardcoded via ManualSigmas — no LTXVScheduler.
Of course dev + LoRA didn’t work — sigma values are tuned for the distilled model, so plugging in a different model can’t work correctly. This was “wrong model” level failure, not “quantization degradation.”
LTX-2: Retest with Official Workflow
Verifying with fully official-aligned setup, only swapping model loaders for GGUF. This isolates just GGUF quantization and Apple Silicon-specific issues.
Distilled Model GGUF
The official workflow uses the distilled model directly, not the dev model + distilled LoRA. Downloaded ltx-2-19b-distilled-Q4_K_M.gguf (12GB) from Kijai/LTXV2_comfy.
| Category | File | Size |
|---|---|---|
| diffusion_models | ltx-2-19b-distilled-Q4_K_M.gguf | 12 GB |
| text_encoders | gemma-3-12b-it-Q4_K_M.gguf | 6.8 GB |
| text_encoders | ltx-2-19b-embeddings_connector_distill_bf16.safetensors | 2.7 GB |
| vae | LTX2_video_vae_bf16.safetensors | 2.3 GB |
The previous ltx-2-19b-dev-Q4_K_S.gguf (11GB) and distilled LoRA (7.1GB) are no longer needed. Removing the LoRA also simplifies the workflow.
Official 2-Stage Pipeline: NaN on MPS
Faithfully reproduced the official I2V workflow sampling pipeline, only swapping model loaders for GGUF versions.
- UnetLoaderGGUF: ltx-2-19b-distilled-Q4_K_M.gguf (no LoRA)
- DualCLIPLoaderGGUF (type: ltxv): Gemma 3 GGUF + embeddings connector
- ResizeImagesByLongerEdge (768) → LTXVPreprocess (crf=33): input image preprocessing
- LTXVImgToVideoInplace (strength=0.6): initial I2V conditioning
- Stage 1: ManualSigmas 1., 0.99375, 0.9875, 0.98125, 0.975, 0.909375, 0.725, 0.421875, 0.0 (8 steps), CFGGuider cfg=1.0, SamplerCustomAdvanced
- LTXVImgToVideoInplace (strength=1.0): stage 1 output re-conditioning
- Stage 2: ManualSigmas 0.909375, 0.725, 0.421875, 0.0 (3 steps), CFGGuider cfg=1.0, SamplerCustomAdvanced (seed=420 fixed)
- VAEDecode → save
The only difference from official is the node type for model loaders (UNETLoader/CLIPLoader → UnetLoaderGGUF/DualCLIPLoaderGGUF) and filenames. Pipeline structure, sigma values, CFG, strength values all match official exactly.
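One detail worth noting from those hardcoded sigma values: stage 2’s schedule is exactly the tail of stage 1’s, so after re-conditioning, the pipeline re-runs the last segment of the same noise schedule rather than continuing from zero.

```python
# The ManualSigmas values from the official workflow: stage 2 resumes
# from sigma 0.909375 and replays the tail of stage 1's schedule after
# the LTXVImgToVideoInplace re-conditioning step.
stage1 = [1.0, 0.99375, 0.9875, 0.98125, 0.975,
          0.909375, 0.725, 0.421875, 0.0]
stage2 = [0.909375, 0.725, 0.421875, 0.0]

assert stage2 == stage1[-4:]
print("stage 2 resumes at sigma", stage2[0])
```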
Result: sampling completes 8+3 steps, but RuntimeWarning: invalid value encountered in cast occurs at VAE decode. Output is a 1-frame garbage image (17KB). The diffusion model’s output latent contains NaN/Inf that VAE can’t decode.
The dev GGUF + KSampler configuration generated 33-frame video in the same MPS environment, so GGUF+MPS itself isn’t broken. SamplerCustomAdvanced + ManualSigmas 2-stage pipeline appears incompatible with the MPS backend.
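A cheap guard before the decode would catch this failure mode early. Sketched here on plain floats; `has_invalid` is hypothetical, and with real tensors `torch.isnan(latent).any()` / `torch.isinf(latent).any()` do the same job.

```python
import math

# Hypothetical pre-decode guard for the NaN/Inf failure mode: scan the
# latent for invalid values before handing it to the VAE, instead of
# discovering the problem via a garbage 1-frame output.

def has_invalid(values):
    return any(math.isnan(v) or math.isinf(v) for v in values)

good_latent = [0.1, -0.3, 0.7]
bad_latent = [0.1, float("nan"), float("inf")]  # what MPS produced
print(has_invalid(good_latent), has_invalid(bad_latent))  # → False True
```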
Switch to Single-Stage KSampler
Since the official pipeline doesn’t work, tried the single-stage KSampler configuration with the distilled model — which has a working track record.
- UnetLoaderGGUF → ModelSamplingLTXV → KSampler (10 steps, cfg 1.0, euler, normal)
- LTXVImgToVideo (strength 0.7)
- 768x512, 33 frames
No NaN, 33 frames generated.
Average inter-frame difference is 1.19. Some frames show max 10.3 motion, but overall nearly static. The dev GGUF + KSampler T2V (CFG 5.5, 25 steps) showed inter-frame diff 14.5 with actual motion — the distilled model + CFG 1.0 combination just doesn’t generate motion through KSampler. The distilled model is tuned for the official 2-stage pipeline, and KSampler’s simple scheduler doesn’t work as expected.
LTX-2 Conclusion
Running LTX-2 on M1 Max 64GB requires GGUF format (FP8 doesn’t work on Metal). Simple GGUF + KSampler configuration generates video but quality doesn’t reach usable levels. The officially recommended 2-stage SamplerCustomAdvanced + ManualSigmas pipeline produces NaN on the MPS backend and doesn’t work.
| Configuration | Works | Quality |
|---|---|---|
| dev GGUF + KSampler T2V | Works (7 min) | Subject unidentifiable |
| dev GGUF + LoRA + KSampler I2V | Works (14 min) | Motion breaks or static |
| distilled GGUF + official 2-stage pipeline | NaN (fails) | - |
| distilled GGUF + KSampler I2V | Works (3 min) | Essentially a still image |
With the official pipeline failing on MPS, proper video generation with LTX-2 on Mac is currently not feasible. NVIDIA GPU is needed.