
Can LTX-2 and Wan 2.2 Run on an M1 Max 64GB? I Checked, Then Actually Ran Them

I want to run AI video generation locally on an M1 Max 64GB MacBook Pro. Two candidates: LTX-2 and Wan 2.2. Both are open-source and have received attention as locally runnable models.

Before running them, I researched spec requirements and Mac compatibility. Wan 2.2 ran in GGUF format but took 82 minutes for a 2-second video.

LTX-2

A model developed by Lightricks that generates video and audio simultaneously. Rather than generating video and audio separately and combining them, a single inference pass produces both.

  • Parameters: Dual-stream of 14B video + 5B audio
  • Output: FHD to 4K, 25/50fps, 6–10 seconds (up to 20 seconds)
  • Audio: Environmental sounds, foley, music, and dialogue generated simultaneously
  • License: Open source (GitHub, HuggingFace)

Audio Control

Audio can be specified in detail through prompts. From the official prompt guide:

  • Environmental sounds: “rain falling on a tin roof”
  • Sound effects: “footsteps on gravel”
  • Music: “upbeat jazz piano in the background”
  • Dialogue: Specify in quotes with language and accent

Modality-CFG lets you adjust the control balance between video and audio. Uses Gemma 3 as the text encoder and claims multilingual support. Japanese dialogue quality untested.

Mac Compatibility

Official system requirements assume NVIDIA GPU (32GB+ VRAM recommended), but it runs on Apple Silicon.

Performance (Existing Reports)

| Environment | Generation time | Video length | Resolution |
|---|---|---|---|
| NVIDIA H100 | ~2 seconds | 5 seconds | 768x512 |
| Mac M3 | ~5 minutes | a few seconds | unknown |
| Mac M1 16GB | ~15 minutes | ~4 seconds | unknown |

M1 Max 64GB should be faster than M1 16GB, but nothing close to NVIDIA.

Wan 2.2

A video generation model developed by Alibaba, well-regarded for quality.

  • Models: Two sizes — 1.3B (lightweight) and 14B (full)
  • Recommended VRAM: 1.3B: 6–12GB, 14B: 24GB+
  • Output: 480p to 1080p
  • Audio: Not supported (video only)

Mac Compatibility

Officially CUDA-only, Metal/MPS not supported. However, there are ways to run it.

  • Community fork with MPS support (M1 Pro working)
  • FP8 causes errors on MPS via ComfyUI → can bypass with GGUF format
  • Apple Silicon issues reported in ComfyUI issue #9255

Performance (Existing Reports)

| Environment | Model | Generation time | Frames | Resolution |
|---|---|---|---|---|
| NVIDIA RTX 3060 12GB | 14B | ~15 minutes | 81 frames | 840x420 |
| Mac M1 Pro | 1.3B | ~10 minutes | 8 frames | 480p |

M1 Pro takes 10 minutes for 8 frames (about 0.3 seconds at 25fps). A 5-second video would be well over an hour.

Comparison

| | LTX-2 | Wan 2.2 |
|---|---|---|
| Mac compatibility | MLX port available. Relatively easy | Officially unsupported. GGUF workaround needed |
| Setup difficulty | Low | Moderately high |
| Simultaneous audio | Supported | Not supported |
| Generation speed (Mac) | ~7 min for 1.3-sec video (M1 Max 64GB, GGUF) | 82 min for 2-sec video (M1 Max 64GB, GGUF) |
| Video quality | Good | Higher quality (especially 14B) |
| Model size | ~42GB | 1.3B: a few GB / 14B: tens of GB |

For just trying things out on Mac, LTX-2 is overwhelmingly easier. Simultaneous audio makes it interesting for i2v use cases. Wan 2.2 has higher quality but running at practical speeds on Mac is difficult — you’d want NVIDIA GPU for serious use.

Real-World Tests

Wan 2.2: Challenge with GGUF Format

I initially planned to skip this, but found that GGUF format could bypass the FP8 issue and run on Mac.

Four files needed:

| Category | File | Size |
|---|---|---|
| diffusion_models | wan2.2_t2v_high_noise_14B_Q4_K_S.gguf | 8.75 GB |
| diffusion_models | wan2.2_t2v_low_noise_14B_Q4_K_S.gguf | 8.75 GB |
| text_encoders | umt5-xxl-encoder-Q8_0.gguf | 6.04 GB |
| vae | wan_2.1_vae.safetensors | 242 MB |

Total about 24GB. Model from bullerwins/Wan2.2-T2V-A14B-GGUF, text encoder from city96/umt5-xxl-encoder-gguf, VAE from Comfy-Org/Wan_2.2_ComfyUI_Repackaged.
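As a sketch of the download step, the files can be scripted with `huggingface_hub`. The repo IDs and filenames come from the table above, but the subdirectory layout inside each repo may differ from the bare filenames, so verify paths before fetching:

```python
# Manifest of the four Wan 2.2 GGUF files (repo IDs, filenames, and sizes
# in GB from the table above; paths inside each repo may differ).
FILES = [
    ("bullerwins/Wan2.2-T2V-A14B-GGUF", "wan2.2_t2v_high_noise_14B_Q4_K_S.gguf", 8.75),
    ("bullerwins/Wan2.2-T2V-A14B-GGUF", "wan2.2_t2v_low_noise_14B_Q4_K_S.gguf", 8.75),
    ("city96/umt5-xxl-encoder-gguf", "umt5-xxl-encoder-Q8_0.gguf", 6.04),
    ("Comfy-Org/Wan_2.2_ComfyUI_Repackaged", "wan_2.1_vae.safetensors", 0.242),
]

total_gb = sum(size for _, _, size in FILES)
print(f"total download: {total_gb:.1f} GB")  # total download: 23.8 GB

# To actually fetch (requires `pip install huggingface_hub`):
# from huggingface_hub import hf_hub_download
# for repo_id, filename, _ in FILES:
#     hf_hub_download(repo_id=repo_id, filename=filename, local_dir="models")
```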

Wan 2.2’s 14B model has a two-step high noise / low noise architecture. Two KSamplerAdvanced nodes are chained in the workflow, each loading models from separate checkpoints.

  • Stage 1 (high noise): Steps 0–10, noise added. Determines overall composition and layout
  • Stage 2 (low noise): Steps 10–20, no noise added. Receives stage 1’s output latent and refines details

Using separately optimized models for high and low noise stages improves quality — that’s why two GGUF files are needed.
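The hand-off between the two samplers can be sketched in Python. The keyword names follow ComfyUI's KSamplerAdvanced inputs; `sample` below is a stand-in stub that just records its calls, not the real sampler:

```python
# Sketch of how two KSamplerAdvanced nodes split one 20-step schedule.
# `sample` is a hypothetical stub standing in for the actual node.
def sample(model, latent, *, steps, start_at_step, end_at_step,
           add_noise, return_with_leftover_noise):
    latent["trace"].append((model, start_at_step, end_at_step, add_noise))
    return latent

def wan22_two_stage(latent, total_steps=20, switch=10):
    # Stage 1: high-noise expert sets composition; leftover noise is
    # kept so stage 2 can continue denoising from the same schedule.
    latent = sample("high_noise_14B", latent, steps=total_steps,
                    start_at_step=0, end_at_step=switch,
                    add_noise=True, return_with_leftover_noise=True)
    # Stage 2: low-noise expert refines details; no new noise added.
    latent = sample("low_noise_14B", latent, steps=total_steps,
                    start_at_step=switch, end_at_step=total_steps,
                    add_noise=False, return_with_leftover_noise=False)
    return latent

result = wan22_two_stage({"trace": []})
print(result["trace"])
# [('high_noise_14B', 0, 10, True), ('low_noise_14B', 10, 20, False)]
```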

Quantization level is Q4_K_S. With 64GB available, Q8_0 would fit in memory, but confirming it works is the priority.

ComfyUI-GGUF Node

ComfyUI’s standard model loader can’t read GGUF. Need to install city96/ComfyUI-GGUF as a custom node. This adds UnetLoaderGGUF and CLIPLoaderGGUF nodes.

Workflow

Based on ComfyUI’s official Wan 2.2 T2V 14B workflow, swapping model loaders for GGUF versions:

  • UNETLoader × 2 → UnetLoaderGGUF × 2 (with GGUF filenames)
  • CLIPLoader → CLIPLoaderGGUF (umt5-xxl GGUF Q8_0)
  • VAELoader → wan_2.1_vae.safetensors
  • Resolution 1280x704 → 832x480, frame count 57 → 33 (just verifying it runs first)

The VAE Trap

Initially used wan2.2_vae.safetensors (1.41GB), which errored at VAE decode after 80 minutes of sampling:

RuntimeError: Given groups=1, weight of size [48, 48, 1, 1, 1],
expected input[1, 16, 9, 60, 104] to have 48 channels, but got 16 channels instead

wan2.2_vae expects 48-channel input, but the T2V diffusion model outputs 16-channel latent. The official workflow notes also clearly state “This model uses the wan 2.1 VAE.” Wan 2.2 only updated the diffusion model; T2V/I2V VAE still uses the 2.1 version. The correct VAE is wan_2.1_vae.safetensors (242MB). Watch out — the version numbers don’t match.
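The mismatch reduces to a channel-count contract between the diffusion model's latent and the VAE's first convolution. A tiny compatibility check (hypothetical helper, not a ComfyUI API; channel counts from the error message above):

```python
# Latent channels produced by the diffusion model vs. channels the VAE's
# first layer expects. Numbers come from the RuntimeError above.
LATENT_CHANNELS = {"wan2.2_t2v_diffusion": 16}
VAE_INPUT_CHANNELS = {"wan2.2_vae": 48, "wan_2.1_vae": 16}

def vae_compatible(model: str, vae: str) -> bool:
    return LATENT_CHANNELS[model] == VAE_INPUT_CHANNELS[vae]

print(vae_compatible("wan2.2_t2v_diffusion", "wan2.2_vae"))   # False -> RuntimeError at decode
print(vae_compatible("wan2.2_t2v_diffusion", "wan_2.1_vae"))  # True
```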

Result: It worked

After fixing the VAE, the video generated without errors.

Prompt: “a robot is running through a futuristic cyberpunk city with neon signs and darkness with bright HDR lights”

| Phase | Time |
|---|---|
| Stage 1 KSampler (high noise) 10 steps | 41 min 40 sec (250 s/step) |
| Stage 2 KSampler (low noise) 10 steps | 39 min 21 sec (236 s/step) |
| VAE decode | ~1 min 41 sec |
| Total | 1 hour 22 min 45 sec |

832x480, 33 frames (16fps ≈ 2 seconds) took 82 minutes. It works, but practical speed this isn’t. A 5-second video would be over 3 hours.
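The "over 3 hours" figure follows from a back-of-envelope extrapolation, assuming sampling time scales linearly with frame count (optimistic, since attention cost grows faster than linear):

```python
# Extrapolate the measured 82-minute run to a 5-second clip, assuming
# linear scaling with frame count (an optimistic lower bound).
measured_min = 82.75      # 1 h 22 min 45 s for the run above
measured_frames = 33      # 16 fps, about 2 seconds
target_frames = 5 * 16    # 5 seconds at 16 fps

est_min = measured_min * target_frames / measured_frames
print(f"estimated: {est_min / 60:.1f} hours")  # estimated: 3.3 hours
```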

LTX-2: Setup

Using ComfyUI’s official “LTX-2 i2v” template. Loading the workflow prompts downloading these models:

| Category | File | Size |
|---|---|---|
| checkpoints | ltx-2-19b-dev-fp8.safetensors | 25.22 GB |
| text_encoders | gemma_3_12B_it_fp4_mixed.safetensors | 8.8 GB |
| loras | ltx-2-19b-distilled-lora-384.safetensors | 7.15 GB |
| loras | ltx-2-19b-lora-camera-control-dolly-left.safetensors | 312 MB |
| latent_upscale_models | ltx-2-spatial-upscaler-x2-1.0.safetensors | unknown |

About 42GB total. The checkpoint alone is 25GB, so depending on your connection this could take a while.

Pre-testing research said FP8 doesn’t work on Metal, but since the official ComfyUI template specifies FP8 checkpoints, I tried it first.

FP8: Doesn’t Work

RuntimeError: Undefined type Float8_e4m3fn

Errors out in 55 seconds. Metal doesn’t implement Float8_e4m3fn, so FP8 checkpoints can’t be used on Apple Silicon. The official ComfyUI template specifies FP8 because it assumes NVIDIA GPU.

LTX-2: Retry with GGUF Format

If FP8 fails, use GGUF like with Wan 2.2. Kijai/LTXV2_comfy distributes GGUF quantized LTX-2 models.

However, LTX-2 has a wider FP8 problem than Wan 2.2. Not just the checkpoint — the official distribution of the text encoder (Gemma 3 12B), gemma_3_12B_it_fp4_mixed.safetensors, also contains FP8 tensors. Both the model body and text encoder need GGUF replacements.

| Category | File | Size | Source |
|---|---|---|---|
| diffusion_models | ltx-2-19b-dev-Q4_K_S.gguf | 11 GB | Kijai/LTXV2_comfy |
| text_encoders | gemma-3-12b-it-Q4_K_M.gguf | 6.8 GB | unsloth/gemma-3-12b-it-GGUF |
| text_encoders | ltx-2-19b-embeddings_connector_distill_bf16.safetensors | 2.7 GB | Kijai/LTXV2_comfy |
| vae | LTX2_video_vae_bf16.safetensors | 2.3 GB | Kijai/LTXV2_comfy |
| loras | ltx-2-19b-distilled-lora-384.safetensors | 7.1 GB | Lightricks/LTX-2 |

~30GB total. More than Wan 2.2’s 24GB, but no problem with 64GB.

Gemma 3 Also Requires GGUF

Trying gemma_3_12B_it_fp4_mixed.safetensors first produced the exact same error as the checkpoint:

RuntimeError: Undefined type Float8_e4m3fn

The name says “fp4_mixed” but it contains mixed FP8 tensors internally. Can’t use on Mac. Download GGUF from unsloth/gemma-3-12b-it-GGUF and load with the DualCLIPLoaderGGUF node (type: ltxv). Specify Gemma 3 GGUF as clip_name1 and embeddings connector as clip_name2.

ComfyUI-GGUF has Gemma 3 support as of PR #402. Latest version is fine.

Workflow Configuration

Unlike Wan 2.2, LTX-2 is a single-stage architecture. One KSampler is enough.

  • UnetLoaderGGUF → LoraLoaderModelOnly (distilled LoRA, strength 0.6) → ModelSamplingLTXV (max_shift 2.05, base_shift 0.95)
  • DualCLIPLoaderGGUF (type: ltxv): Gemma 3 GGUF + embeddings connector
  • LTXVImgToVideo: generates latent from input image (for i2v)
  • KSampler: 25 steps, cfg 5.5, euler, normal
  • VAELoader → VAEDecode: LTX2_video_vae_bf16.safetensors

The official workflow uses distilled LoRA at strength 0.6. The dev model alone produces unstable output — this LoRA is essentially required.

T2V Test: Works but Quality Is Rough

First confirmed operation with 512x288, 33 frames Text-to-Video.

| Phase | Time |
|---|---|
| KSampler 25 steps | 6 min 7 sec (14.7 s/step) |
| VAE decode | ~47 sec |
| Total | 6 min 54 sec |

~12x faster than Wan 2.2's 82 minutes, though the resolutions differ (512x288 vs 832x480), so it isn't a fair comparison. The output showed a neon-lit cityscape, but the robot subject was blurred beyond recognition, likely because no distilled LoRA was applied and the resolution was low.

I2V Test

Added distilled LoRA, raised resolution to 768x512, tested Image-to-Video.

| Phase | Time |
|---|---|
| KSampler 25 steps | ~12 min 50 sec (30.8 s/step) |
| VAE decode | ~52 sec |
| Total | 13 min 42 sec |

768x512, 33 frames (25fps ≈ 1.3 seconds) in 13 min 42 sec. Compared to 512x288 T2V (6 min 54 sec), 2.7x the pixels took about 2x the time — not linear scaling.
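The scaling ratios above are a one-liner to verify:

```python
# Pixels grew ~2.7x while wall time only roughly doubled.
t2v = {"w": 512, "h": 288, "seconds": 6 * 60 + 54}    # 6 min 54 s
i2v = {"w": 768, "h": 512, "seconds": 13 * 60 + 42}   # 13 min 42 s

pixel_ratio = (i2v["w"] * i2v["h"]) / (t2v["w"] * t2v["h"])
time_ratio = i2v["seconds"] / t2v["seconds"]
print(f"pixels: {pixel_ratio:.2f}x, time: {time_ratio:.2f}x")
# pixels: 2.67x, time: 1.99x
```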

T2V output (no distilled LoRA, 512x288):

Just colored blobs bouncing up and down. The robot subject is nowhere. 7 minutes to produce something achievable with CSS animation.

I2V output (distilled LoRA 0.6, 768x512):

The second attempt (strength 0.7, modified prompt, 13 min 50 sec) was also a total failure.

Frame-by-frame pixel difference analysis:

| Test | Inter-frame diff | First/second half motion | State |
|---|---|---|---|
| T2V (no LoRA, 512x288) | 14.5 | 14.5 / 14.4 | Whole screen bobbing up and down. Subject unidentifiable |
| I2V 1st (strength 1.0) | 10.5 | 0.2 / 20.8 | Static first half → sudden collapse second half |
| I2V 2nd (strength 0.7) | 0.7 | 0.2 / 1.2 | Almost a still image after 13 min 50 sec |

With I2V at strength 1.0, the input image is locked too tightly so the first half is completely static, then control is lost and it collapses mid-way. Lower to 0.7 and it barely moves. The “just right” range is extremely narrow, or motion generation simply doesn’t work at this quantization level.

Why: The Setup Differed from Official

Looking at these results, my first thought was that GGUF quantization was the cause. But my workflow configuration turned out to be completely different from the official one, so the variables weren't properly isolated.

Examining the official I2V workflow (Lightricks/ComfyUI-LTXVideo’s LTX-2_I2V_Distilled_wLora.json), the sampling pipeline is fundamentally different:

| Item | My workflow | Official workflow |
|---|---|---|
| Model | dev + distilled LoRA | distilled (direct use of distilled model) |
| Sampler | KSampler 25 steps | SamplerCustomAdvanced 2-stage |
| Scheduler | normal | ManualSigmas (hardcoded sigma values) |
| CFG | 5.5 | 1.0 |
| I2V method | LTXVImgToVideo | LTXVImgToVideoInplace × 2 |

The official workflow hides a 2-stage pipeline inside a group node. Stage 1 (8 steps) denoises, then re-conditions with LTXVImgToVideoInplace(strength=1.0), and Stage 2 (3 steps) refines. Sigma values are hardcoded via ManualSigmas — no LTXVScheduler.

Of course dev + LoRA didn’t work — sigma values are tuned for the distilled model, so plugging in a different model can’t work correctly. This was “wrong model” level failure, not “quantization degradation.”

LTX-2: Retest with Official Workflow

Verifying with fully official-aligned setup, only swapping model loaders for GGUF. This isolates just GGUF quantization and Apple Silicon-specific issues.

Distilled Model GGUF

The official uses the distilled model directly, not dev model + distilled LoRA. Downloaded ltx-2-19b-distilled-Q4_K_M.gguf (12GB) from Kijai/LTXV2_comfy.

| Category | File | Size |
|---|---|---|
| diffusion_models | ltx-2-19b-distilled-Q4_K_M.gguf | 12 GB |
| text_encoders | gemma-3-12b-it-Q4_K_M.gguf | 6.8 GB |
| text_encoders | ltx-2-19b-embeddings_connector_distill_bf16.safetensors | 2.7 GB |
| vae | LTX2_video_vae_bf16.safetensors | 2.3 GB |

The previous ltx-2-19b-dev-Q4_K_S.gguf (11GB) and distilled LoRA (7.1GB) are no longer needed. Removing the LoRA also simplifies the workflow.

Official 2-Stage Pipeline: NaN on MPS

Faithfully reproduced the official I2V workflow sampling pipeline, only swapping model loaders for GGUF versions.

  • UnetLoaderGGUF: ltx-2-19b-distilled-Q4_K_M.gguf (no LoRA)
  • DualCLIPLoaderGGUF (type: ltxv): Gemma 3 GGUF + embeddings connector
  • ResizeImagesByLongerEdge (768) → LTXVPreprocess (crf=33): input image preprocessing
  • LTXVImgToVideoInplace (strength=0.6): initial I2V conditioning
  • Stage 1: ManualSigmas 1., 0.99375, 0.9875, 0.98125, 0.975, 0.909375, 0.725, 0.421875, 0.0 (8 steps), CFGGuider cfg=1.0, SamplerCustomAdvanced
  • LTXVImgToVideoInplace (strength=1.0): stage 1 output re-conditioning
  • Stage 2: ManualSigmas 0.909375, 0.725, 0.421875, 0.0 (3 steps), CFGGuider cfg=1.0, SamplerCustomAdvanced (seed=420 fixed)
  • VAEDecode → save

The only difference from official is the node type for model loaders (UNETLoader/CLIPLoader → UnetLoaderGGUF/DualCLIPLoaderGGUF) and filenames. Pipeline structure, sigma values, CFG, strength values all match official exactly.
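One detail worth noting in the sigma lists above: stage 2's schedule is exactly the tail of stage 1's, so after re-conditioning, the last three denoising intervals are simply re-run:

```python
# The hardcoded schedules from the official workflow, as listed above.
STAGE1 = [1.0, 0.99375, 0.9875, 0.98125, 0.975, 0.909375, 0.725, 0.421875, 0.0]
STAGE2 = [0.909375, 0.725, 0.421875, 0.0]

# Stage 2 re-runs the final three intervals of stage 1's schedule.
assert STAGE2 == STAGE1[-4:]
print(len(STAGE1) - 1, "steps, then", len(STAGE2) - 1, "steps")
# 8 steps, then 3 steps
```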

Result: sampling completes 8+3 steps, but RuntimeWarning: invalid value encountered in cast occurs at VAE decode. Output is a 1-frame garbage image (17KB). The diffusion model’s output latent contains NaN/Inf that VAE can’t decode.

The dev GGUF + KSampler configuration generated 33-frame video in the same MPS environment, so GGUF+MPS itself isn’t broken. SamplerCustomAdvanced + ManualSigmas 2-stage pipeline appears incompatible with the MPS backend.
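To catch the bad latent before wasting a VAE decode, a finiteness guard could be dropped in between sampling and decoding. This is a hypothetical helper shown on plain Python floats; in ComfyUI the latent is a torch tensor, where `torch.isnan(x).any()` / `torch.isinf(x).any()` do the same job:

```python
import math

# Returns False if any value is NaN or Inf, i.e. the latent is garbage
# and VAE decode will fail or emit junk.
def latent_is_finite(values) -> bool:
    return all(math.isfinite(v) for v in values)

print(latent_is_finite([0.1, -0.3, 2.0]))          # True
print(latent_is_finite([0.1, float("nan"), 2.0]))  # False
```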

Switch to Single-Stage KSampler

Since the official pipeline doesn’t work, tried the single-stage KSampler configuration with the distilled model — which has a working track record.

  • UnetLoaderGGUF → ModelSamplingLTXV → KSampler (10 steps, cfg 1.0, euler, normal)
  • LTXVImgToVideo (strength 0.7)
  • 768x512, 33 frames

No NaN, 33 frames generated.

Average inter-frame difference is 1.19. Some frames show max 10.3 motion, but overall nearly static. The dev GGUF + KSampler T2V (CFG 5.5, 25 steps) showed inter-frame diff 14.5 with actual motion — the distilled model + CFG 1.0 combination just doesn’t generate motion through KSampler. The distilled model is tuned for the official 2-stage pipeline, and KSampler’s simple scheduler doesn’t work as expected.

LTX-2 Conclusion

Running LTX-2 on M1 Max 64GB requires GGUF format (FP8 doesn’t work on Metal). Simple GGUF + KSampler configuration generates video but quality doesn’t reach usable levels. The officially recommended 2-stage SamplerCustomAdvanced + ManualSigmas pipeline produces NaN on the MPS backend and doesn’t work.

| Configuration | Works | Quality |
|---|---|---|
| dev GGUF + KSampler T2V | Works (7 min) | Subject unidentifiable |
| dev GGUF + LoRA + KSampler I2V | Works (14 min) | Motion breaks or static |
| distilled GGUF + official 2-stage pipeline | NaN (fails) | - |
| distilled GGUF + KSampler I2V | Works (3 min) | Essentially a still image |

With the official pipeline failing on MPS, proper video generation with LTX-2 on Mac is currently not feasible. NVIDIA GPU is needed.