Can LTX-2 and Wan 2.2 Run on an M1 Max 64GB? I Checked, Then Actually Ran Them
I want to run AI video generation locally on an M1 Max 64GB MacBook Pro. Two candidates: LTX-2 and Wan 2.2. Both are open-source and have received attention as locally runnable models.
Before running them, I researched spec requirements and Mac compatibility. Wan 2.2 ran in GGUF format but took 82 minutes for a 2-second video.
LTX-2
A model from Lightricks that generates video and audio simultaneously. Rather than generating video and audio separately and combining them, a single inference pass produces both.
- Parameters: Dual-stream of 14B video + 5B audio
- Output: FHD to 4K, 25/50fps, 6–10 seconds (up to 20 seconds)
- Audio: Environmental sounds, foley, music, and dialogue generated simultaneously
- License: Open source (GitHub, HuggingFace)
Audio Control
Audio can be specified in detail through prompts. From the official prompt guide:
- Environmental sounds: “rain falling on a tin roof”
- Sound effects: “footsteps on gravel”
- Music: “upbeat jazz piano in the background”
- Dialogue: Specify in quotes with language and accent
Modality-CFG lets you adjust the control balance between video and audio. Uses Gemma 3 as the text encoder and claims multilingual support. Japanese dialogue quality untested.
Mac Compatibility
Official system requirements assume NVIDIA GPU (32GB+ VRAM recommended), but it runs on Apple Silicon.
- MLX port project in progress
- Native macOS app exists (for LTX-Video original)
- FP8 models don’t work on Metal. Must use FP16 (bf16) or GGUF format
Performance (Existing Reports)
| Environment | Generation time | Video length | Resolution |
|---|---|---|---|
| NVIDIA H100 | ~2 seconds | 5 seconds | 768x512 |
| Mac M3 | ~5 minutes | a few seconds | unknown |
| Mac M1 16GB | ~15 minutes | ~4 seconds | unknown |
M1 Max 64GB should be faster than M1 16GB, but nothing close to NVIDIA.
Wan 2.2
A video generation model developed by Alibaba, well-regarded for quality.
- Models: Two sizes — 1.3B (lightweight) and 14B (full)
- Recommended VRAM: 1.3B: 6–12GB, 14B: 24GB+
- Output: 480p to 1080p
- Audio: Not supported (video only)
Mac Compatibility
Officially CUDA-only, Metal/MPS not supported. However, there are ways to run it.
- Community fork with MPS support (M1 Pro working)
- FP8 causes errors on MPS via ComfyUI → can bypass with GGUF format
- Apple Silicon issues reported in ComfyUI issue #9255
Performance (Existing Reports)
| Environment | Model | Generation time | Frames | Resolution |
|---|---|---|---|---|
| NVIDIA RTX 3060 12GB | 14B | ~15 minutes | 81 frames | 840x420 |
| Mac M1 Pro | 1.3B | ~10 minutes | 8 frames | 480p |
M1 Pro takes 10 minutes for 8 frames (about 0.3 seconds at 25fps). A 5-second video would be well over an hour.
Comparison
| | LTX-2 | Wan 2.2 |
|---|---|---|
| Mac compatibility | MLX port in progress; relatively easy to run | Officially unsupported; GGUF workaround needed |
| Setup difficulty | Low | Moderately high |
| Simultaneous audio | Supported | Not supported |
| Generation speed (Mac) | ~7 min for 1.3-sec video (M1 Max 64GB, GGUF) | 82 min for 2-sec video (M1 Max 64GB, GGUF) |
| Video quality | Good | Higher quality (especially 14B) |
| Model size | ~42GB | 1.3B: a few GB / 14B: tens of GB |
For just trying things out on Mac, LTX-2 is overwhelmingly easier. Simultaneous audio makes it interesting for i2v use cases. Wan 2.2 has higher quality but running at practical speeds on Mac is difficult — you’d want NVIDIA GPU for serious use.
Real-World Tests
Wan 2.2: Challenge with GGUF Format
I initially planned to skip this, but found that GGUF format could bypass the FP8 issue and run on Mac.
Four files needed:
| Category | File | Size |
|---|---|---|
| diffusion_models | wan2.2_t2v_high_noise_14B_Q4_K_S.gguf | 8.75 GB |
| diffusion_models | wan2.2_t2v_low_noise_14B_Q4_K_S.gguf | 8.75 GB |
| text_encoders | umt5-xxl-encoder-Q8_0.gguf | 6.04 GB |
| vae | wan_2.1_vae.safetensors | 242 MB |
Total about 24GB. Model from bullerwins/Wan2.2-T2V-A14B-GGUF, text encoder from city96/umt5-xxl-encoder-gguf, VAE from Comfy-Org/Wan_2.2_ComfyUI_Repackaged.
Wan 2.2’s 14B model has a two-stage high-noise / low-noise architecture. Two KSamplerAdvanced nodes are chained in the workflow, each loading a model from a separate checkpoint.
- Stage 1 (high noise): Steps 0–10, noise added. Determines overall composition and layout
- Stage 2 (low noise): Steps 10–20, no noise added. Receives stage 1’s output latent and refines details
Using separately optimized models for high and low noise stages improves quality — that’s why two GGUF files are needed.
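The handoff can be sketched as a toy loop. The two "models" below are stand-ins for the high-noise and low-noise GGUF checkpoints, and the multipliers are purely illustrative; only the step-range split and the latent handoff mirror the real workflow.

```python
# Toy sketch of Wan 2.2's two-stage sampling (not the real samplers):
# two stand-in "models" denoise the same latent over a shared 20-step
# schedule, mirroring the two chained KSamplerAdvanced nodes.

def run_stage(latent, model, start_step, end_step):
    """Run one sampling stage over steps [start_step, end_step)."""
    for step in range(start_step, end_step):
        latent = model(latent, step)
    return latent

# Stand-ins: the high-noise model makes large corrections (composition),
# the low-noise model makes small ones (detail refinement).
high_noise_model = lambda x, s: x * 0.5  # coarse: big moves
low_noise_model = lambda x, s: x * 0.9   # fine: small refinements

latent = 100.0                                       # pretend noisy latent
latent = run_stage(latent, high_noise_model, 0, 10)  # stage 1: steps 0-10
latent = run_stage(latent, low_noise_model, 10, 20)  # stage 2: steps 10-20
print(latent)
```

The key point is that stage 2 receives stage 1's output latent directly; it never sees the original noise.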
Quantization level is Q4_K_S. With 64GB available, Q8_0 would fit in memory, but confirming it works is the priority.
ComfyUI-GGUF Node
ComfyUI’s standard model loader can’t read GGUF. Need to install city96/ComfyUI-GGUF as a custom node. This adds UnetLoaderGGUF and CLIPLoaderGGUF nodes.
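The usual custom-node install looks like this, assuming a default ComfyUI checkout (adjust the path to your setup):

```shell
# Install the ComfyUI-GGUF custom node (adds UnetLoaderGGUF / CLIPLoaderGGUF).
cd ComfyUI/custom_nodes
git clone https://github.com/city96/ComfyUI-GGUF
pip install -r ComfyUI-GGUF/requirements.txt  # pulls in the gguf package
# Restart ComfyUI afterwards so the new nodes are registered.
```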
Workflow
Based on ComfyUI’s official Wan 2.2 T2V 14B workflow, swapping model loaders for GGUF versions:
- UNETLoader × 2 → UnetLoaderGGUF × 2 (with GGUF filenames)
- CLIPLoader → CLIPLoaderGGUF (umt5-xxl GGUF Q8_0)
- VAELoader → wan_2.1_vae.safetensors
- Resolution 1280x704 → 832x480, frame count 57 → 33 (just verifying it runs first)
The VAE Trap
Initially used wan2.2_vae.safetensors (1.41GB), which errored at VAE decode after 80 minutes of sampling:
RuntimeError: Given groups=1, weight of size [48, 48, 1, 1, 1],
expected input[1, 16, 9, 60, 104] to have 48 channels, but got 16 channels instead
wan2.2_vae expects 48-channel input, but the T2V diffusion model outputs 16-channel latent. The official workflow notes also clearly state “This model uses the wan 2.1 VAE.” Wan 2.2 only updated the diffusion model; T2V/I2V VAE still uses the 2.1 version. The correct VAE is wan_2.1_vae.safetensors (242MB). Watch out — the version numbers don’t match.
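A hypothetical pre-decode guard makes the mismatch concrete. The channel counts come from the error message above; `check_vae` is illustrative, not a ComfyUI node.

```python
# Minimal illustration of the VAE trap: the first conv of wan2.2_vae has
# weight shape [48, 48, 1, 1, 1] (expects 48-channel latents), while the
# Wan 2.2 T2V diffusion model outputs 16-channel latents. A guard like
# this would have caught it before the 80-minute sampling run.

VAE_LATENT_CHANNELS = {
    "wan_2.1_vae.safetensors": 16,  # correct for Wan 2.2 T2V/I2V
    "wan2.2_vae.safetensors": 48,   # expects 48 channels -> decode error
}

def check_vae(latent_shape, vae_name):
    """latent_shape: (batch, channels, frames, height, width)."""
    expected = VAE_LATENT_CHANNELS[vae_name]
    got = latent_shape[1]
    if got != expected:
        raise ValueError(
            f"{vae_name} expects {expected}-channel latents, got {got}"
        )

# The failing combination from this run: a [1, 16, 9, 60, 104] latent.
check_vae((1, 16, 9, 60, 104), "wan_2.1_vae.safetensors")  # OK
try:
    check_vae((1, 16, 9, 60, 104), "wan2.2_vae.safetensors")
except ValueError as e:
    print(e)  # prints the channel mismatch message
```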
Result: It worked
After fixing the VAE, the video generated without errors.
Prompt: “a robot is running through a futuristic cyberpunk city with neon signs and darkness with bright HDR lights”
| Phase | Time |
|---|---|
| Stage 1 KSampler (high noise) 10 steps | 41 min 40 sec (250s/step) |
| Stage 2 KSampler (low noise) 10 steps | 39 min 21 sec (236s/step) |
| VAE decode | ~1 min 41 sec |
| Total | 1 hour 22 min 45 sec |
832x480, 33 frames (16fps ≈ 2 seconds) took 82 minutes. It works, but practical speed this isn’t. A 5-second video would be over 3 hours.
LTX-2: Setup
Using ComfyUI’s official “LTX-2 i2v” template. Loading the workflow prompts you to download these models:
| Category | File | Size |
|---|---|---|
| checkpoints | ltx-2-19b-dev-fp8.safetensors | 25.22 GB |
| text_encoders | gemma_3_12B_it_fp4_mixed.safetensors | 8.8 GB |
| loras | ltx-2-19b-distilled-lora-384.safetensors | 7.15 GB |
| loras | ltx-2-19b-lora-camera-control-dolly-left.safetensors | 312 MB |
| latent_upscale_models | ltx-2-spatial-upscaler-x2-1.0.safetensors | unknown |
About 42GB total. The checkpoint alone is 25GB, so depending on your connection this could take a while.
Pre-testing research said FP8 doesn’t work on Metal, but since the official ComfyUI template specifies FP8 checkpoints, I tried it first.
FP8: Doesn’t Work
RuntimeError: Undefined type Float8_e4m3fn
Errors out in 55 seconds. Metal doesn’t implement Float8_e4m3fn, so FP8 checkpoints can’t be used on Apple Silicon. The official ComfyUI template specifies FP8 because it assumes NVIDIA GPU.
LTX-2: Retry with GGUF Format
If FP8 fails, use GGUF like with Wan 2.2. Kijai/LTXV2_comfy distributes GGUF quantized LTX-2 models.
However, LTX-2 has a wider FP8 problem than Wan 2.2. Not just the checkpoint — the official distribution of the text encoder (Gemma 3 12B), gemma_3_12B_it_fp4_mixed.safetensors, also contains FP8 tensors. Both the model body and text encoder need GGUF replacements.
| Category | File | Size | Source |
|---|---|---|---|
| diffusion_models | ltx-2-19b-dev-Q4_K_S.gguf | 11 GB | Kijai/LTXV2_comfy |
| text_encoders | gemma-3-12b-it-Q4_K_M.gguf | 6.8 GB | unsloth/gemma-3-12b-it-GGUF |
| text_encoders | ltx-2-19b-embeddings_connector_distill_bf16.safetensors | 2.7 GB | Kijai/LTXV2_comfy |
| vae | LTX2_video_vae_bf16.safetensors | 2.3 GB | Kijai/LTXV2_comfy |
| loras | ltx-2-19b-distilled-lora-384.safetensors | 7.1 GB | Lightricks/LTX-2 |
~30GB total. More than Wan 2.2’s 24GB, but no problem with 64GB.
Gemma 3 Also Requires GGUF
Trying gemma_3_12B_it_fp4_mixed.safetensors first produced the exact same error as the checkpoint:
RuntimeError: Undefined type Float8_e4m3fn
The name says “fp4_mixed” but it contains mixed FP8 tensors internally. Can’t use on Mac. Download GGUF from unsloth/gemma-3-12b-it-GGUF and load with the DualCLIPLoaderGGUF node (type: ltxv). Specify Gemma 3 GGUF as clip_name1 and embeddings connector as clip_name2.
ComfyUI-GGUF has Gemma 3 support as of PR #402. Latest version is fine.
Workflow Configuration
Unlike Wan 2.2, LTX-2 is a single-stage architecture. One KSampler is enough.
- UnetLoaderGGUF → LoraLoaderModelOnly (distilled LoRA, strength 0.6) → ModelSamplingLTXV (max_shift 2.05, base_shift 0.95)
- DualCLIPLoaderGGUF (type: ltxv): Gemma 3 GGUF + embeddings connector
- LTXVImgToVideo: generates latent from input image (for i2v)
- KSampler: 25 steps, cfg 5.5, euler, normal
- VAELoader → VAEDecode: LTX2_video_vae_bf16.safetensors
The official workflow uses distilled LoRA at strength 0.6. The dev model alone produces unstable output — this LoRA is essentially required.
T2V Test: Works but Quality Is Rough
First, I confirmed basic operation with a 512x288, 33-frame Text-to-Video run.
| Phase | Time |
|---|---|
| KSampler 25 steps | 6 min 7 sec (14.7s/step) |
| VAE decode | ~47 sec |
| Total | 6 min 54 sec |
~12x faster than Wan 2.2’s 82 minutes, though the resolutions differ (512x288 vs 832x480), so it’s not a fair comparison. The output showed a neon-lit cityscape, but the robot subject was blurred beyond recognition, likely because this run used no distilled LoRA and a low resolution.
I2V Test
Added distilled LoRA, raised resolution to 768x512, tested Image-to-Video.
| Phase | Time |
|---|---|
| KSampler 25 steps | ~12 min 50 sec (30.8s/step) |
| VAE decode | ~52 sec |
| Total | 13 min 42 sec |
768x512, 33 frames (25fps ≈ 1.3 seconds) in 13 min 42 sec. Compared to 512x288 T2V (6 min 54 sec), 2.7x the pixels took about 2x the time — not linear scaling.
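A quick check of that scaling claim from the two totals:

```python
# Sanity-check the scaling claim: 768x512 I2V vs 512x288 T2V,
# both at 33 frames.
t2v_pixels = 512 * 288       # T2V smoke-test resolution
i2v_pixels = 768 * 512       # I2V test resolution
t2v_seconds = 6 * 60 + 54    # 6 min 54 sec total
i2v_seconds = 13 * 60 + 42   # 13 min 42 sec total

pixel_ratio = i2v_pixels / t2v_pixels
time_ratio = i2v_seconds / t2v_seconds
print(f"{pixel_ratio:.1f}x pixels, {time_ratio:.1f}x time")
# → 2.7x pixels, 2.0x time
```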
T2V output (no distilled LoRA, 512x288):
Just colored blobs bouncing up and down. The robot subject is nowhere. 7 minutes to produce something achievable with CSS animation.
I2V output (distilled LoRA 0.6, 768x512):
The first attempt (strength 1.0) stayed static and then collapsed, and the second attempt (strength 0.7, modified prompt, 13 min 50 sec) was also a total failure.
Frame-by-frame pixel difference analysis:
| Test | Inter-frame diff | First/second half motion | State |
|---|---|---|---|
| T2V (no LoRA, 512x288) | 14.5 | 14.5 / 14.4 | Whole screen bobbing up and down. Subject unidentifiable |
| I2V 1st (strength 1.0) | 10.5 | 0.2 / 20.8 | Static first half → sudden collapse second half |
| I2V 2nd (strength 0.7) | 0.7 | 0.2 / 1.2 | Almost a still image after 13 min 50 sec |
With I2V at strength 1.0, the input image is locked too tightly so the first half is completely static, then control is lost and it collapses mid-way. Lower to 0.7 and it barely moves. The “just right” range is extremely narrow, or motion generation simply doesn’t work at this quantization level.
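For reference, a minimal sketch of how inter-frame differences like those above can be computed. The exact extraction pipeline used here is an assumption; frames are flat grayscale lists for illustration, where a real pipeline would first dump frames from the video (e.g. with ffmpeg).

```python
# Sketch of the motion metric (assumed): mean absolute difference of
# pixel values between consecutive frames. Near 0 = still image;
# double digits = visible whole-frame motion.

def interframe_diff(frames):
    """Average |pixel delta| between each pair of consecutive frames."""
    diffs = []
    for prev, cur in zip(frames, frames[1:]):
        total = sum(abs(a - b) for a, b in zip(prev, cur))
        diffs.append(total / len(prev))
    return sum(diffs) / len(diffs)

# Toy check: identical frames -> 0.0 (a "still image" like the I2V runs);
# frames alternating by 14 gray levels -> 14.0 (motion, like the T2V run).
static = [[128] * 16] * 4
moving = [[128] * 16, [142] * 16, [128] * 16, [142] * 16]
print(interframe_diff(static), interframe_diff(moving))  # → 0.0 14.0
```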
Why: The Setup Differed from Official
Looking at these results, I initially blamed GGUF quantization, but my workflow configuration was completely different from the official one. The variables weren’t properly isolated.
Examining the official I2V workflow (Lightricks/ComfyUI-LTXVideo’s LTX-2_I2V_Distilled_wLora.json), the sampling pipeline is fundamentally different:
| Item | My workflow | Official workflow |
|---|---|---|
| Model | dev + distilled LoRA | distilled (direct use of distilled model) |
| Sampler | KSampler 25 steps | SamplerCustomAdvanced 2-stage |
| Scheduler | normal | ManualSigmas (hardcoded sigma values) |
| CFG | 5.5 | 1.0 |
| I2V method | LTXVImgToVideo | LTXVImgToVideoInplace × 2 |
The official workflow hides a 2-stage pipeline inside a group node. Stage 1 (8 steps) denoises, then re-conditions with LTXVImgToVideoInplace(strength=1.0), and Stage 2 (3 steps) refines. Sigma values are hardcoded via ManualSigmas — no LTXVScheduler.
Of course dev + LoRA didn’t work — sigma values are tuned for the distilled model, so plugging in a different model can’t work correctly. This was “wrong model” level failure, not “quantization degradation.”
LTX-2: Retest with Official Workflow
Verifying with fully official-aligned setup, only swapping model loaders for GGUF. This isolates just GGUF quantization and Apple Silicon-specific issues.
Distilled Model GGUF
The official workflow uses the distilled model directly, not the dev model + distilled LoRA. Downloaded ltx-2-19b-distilled-Q4_K_M.gguf (12GB) from Kijai/LTXV2_comfy.
| Category | File | Size |
|---|---|---|
| diffusion_models | ltx-2-19b-distilled-Q4_K_M.gguf | 12 GB |
| text_encoders | gemma-3-12b-it-Q4_K_M.gguf | 6.8 GB |
| text_encoders | ltx-2-19b-embeddings_connector_distill_bf16.safetensors | 2.7 GB |
| vae | LTX2_video_vae_bf16.safetensors | 2.3 GB |
The previous ltx-2-19b-dev-Q4_K_S.gguf (11GB) and distilled LoRA (7.1GB) are no longer needed. Removing the LoRA also simplifies the workflow.
Official 2-Stage Pipeline: NaN on MPS
Faithfully reproduced the official I2V workflow sampling pipeline, only swapping model loaders for GGUF versions.
- UnetLoaderGGUF: ltx-2-19b-distilled-Q4_K_M.gguf (no LoRA)
- DualCLIPLoaderGGUF (type: ltxv): Gemma 3 GGUF + embeddings connector
- ResizeImagesByLongerEdge (768) → LTXVPreprocess (crf=33): input image preprocessing
- LTXVImgToVideoInplace (strength=0.6): initial I2V conditioning
- Stage 1: ManualSigmas 1., 0.99375, 0.9875, 0.98125, 0.975, 0.909375, 0.725, 0.421875, 0.0 (8 steps), CFGGuider cfg=1.0, SamplerCustomAdvanced
- LTXVImgToVideoInplace (strength=1.0): stage 1 output re-conditioning
- Stage 2: ManualSigmas 0.909375, 0.725, 0.421875, 0.0 (3 steps), CFGGuider cfg=1.0, SamplerCustomAdvanced (seed=420 fixed)
- VAEDecode → save
The only difference from official is the node type for model loaders (UNETLoader/CLIPLoader → UnetLoaderGGUF/DualCLIPLoaderGGUF) and filenames. Pipeline structure, sigma values, CFG, strength values all match official exactly.
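One detail worth noting from those hardcoded sigma values: stage 2’s schedule is exactly the tail of stage 1’s, so after re-conditioning, the pipeline re-runs the last segment of the same noise schedule rather than continuing from zero.

```python
# The ManualSigmas values from the official workflow: stage 2 resumes
# from sigma 0.909375 and replays the tail of stage 1's schedule after
# the LTXVImgToVideoInplace re-conditioning step.
stage1 = [1.0, 0.99375, 0.9875, 0.98125, 0.975,
          0.909375, 0.725, 0.421875, 0.0]
stage2 = [0.909375, 0.725, 0.421875, 0.0]

assert stage2 == stage1[-4:]
print("stage 2 resumes at sigma", stage2[0])
```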
Result: sampling completes 8+3 steps, but RuntimeWarning: invalid value encountered in cast occurs at VAE decode. Output is a 1-frame garbage image (17KB). The diffusion model’s output latent contains NaN/Inf that VAE can’t decode.
The dev GGUF + KSampler configuration generated 33-frame video in the same MPS environment, so GGUF+MPS itself isn’t broken. SamplerCustomAdvanced + ManualSigmas 2-stage pipeline appears incompatible with the MPS backend.
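A cheap guard before the decode would catch this failure mode early. Sketched here on plain floats; `has_invalid` is hypothetical, and with real tensors `torch.isnan(latent).any()` / `torch.isinf(latent).any()` do the same job.

```python
import math

# Hypothetical pre-decode guard for the NaN/Inf failure mode: scan the
# latent for invalid values before handing it to the VAE, instead of
# discovering the problem via a garbage 1-frame output.

def has_invalid(values):
    return any(math.isnan(v) or math.isinf(v) for v in values)

good_latent = [0.1, -0.3, 0.7]
bad_latent = [0.1, float("nan"), float("inf")]  # what MPS produced
print(has_invalid(good_latent), has_invalid(bad_latent))  # → False True
```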
Switch to Single-Stage KSampler
Since the official pipeline doesn’t work, tried the single-stage KSampler configuration with the distilled model — which has a working track record.
- UnetLoaderGGUF → ModelSamplingLTXV → KSampler (10 steps, cfg 1.0, euler, normal)
- LTXVImgToVideo (strength 0.7)
- 768x512, 33 frames
No NaN, 33 frames generated.
Average inter-frame difference is 1.19. Some frames show max 10.3 motion, but overall nearly static. The dev GGUF + KSampler T2V (CFG 5.5, 25 steps) showed inter-frame diff 14.5 with actual motion — the distilled model + CFG 1.0 combination just doesn’t generate motion through KSampler. The distilled model is tuned for the official 2-stage pipeline, and KSampler’s simple scheduler doesn’t work as expected.
LTX-2 Conclusion
Running LTX-2 on M1 Max 64GB requires GGUF format (FP8 doesn’t work on Metal). Simple GGUF + KSampler configuration generates video but quality doesn’t reach usable levels. The officially recommended 2-stage SamplerCustomAdvanced + ManualSigmas pipeline produces NaN on the MPS backend and doesn’t work.
| Configuration | Works | Quality |
|---|---|---|
| dev GGUF + KSampler T2V | Works (7 min) | Subject unidentifiable |
| dev GGUF + LoRA + KSampler I2V | Works (14 min) | Motion breaks or static |
| distilled GGUF + official 2-stage pipeline | NaN (fails) | - |
| distilled GGUF + KSampler I2V | Works (3 min) | Essentially a still image |
With the official pipeline failing on MPS, proper video generation with LTX-2 on Mac is currently not feasible. NVIDIA GPU is needed.