Technical details of UltraFlux-v1, a model that pushes FLUX.1-dev into native 4K generation. This article covers how it differs from Z-Image and FLUX.2 Klein, its RoPE extensions and VAE improvements, and practical caveats.
A technical walkthrough of Alibaba's Qwen3-Omni-30B-A3B, an omni-modal model that activates only 3B of its 30B parameters and responds with speech to text, image, audio, and video inputs. The article covers the Thinker–Talker architecture, benchmarks, and the wider Qwen3 MoE family.
A technical look at ByteDance's UI-TARS-1.5-7B, which beats OpenAI CUA and Claude 3.7 by a wide margin at identifying GUI elements from screenshots, and can run locally with a desktop app.
Technical overview of Alibaba’s Qwen3-Coder-Next, an ultra-efficient MoE with 80B parameters, of which only 3B are activated. It runs even on a single RTX 4090 and brings 70%+ SWE-Bench performance to local use.
Published as an official ComfyUI workflow, InfiniteTalk is a lip-sync model that specializes in generating mouth animation from audio files. This article covers how it differs from MOVA and Vidu Q3 and which models it requires.
A comparison between the 'UI UX Pro Max Skill' for AI coding assistants such as Claude Code and the UI/UX improvement articles I wrote earlier. Which works better: automatic inference or explicit human intent?
AnimeGamer, developed by Tencent ARC Lab, generates anime-style videos while tracking game-state transitions. It takes a fundamentally different approach from general-purpose video generation models.
MOVA-720p from the OpenMOSS team is an open-source model that generates video and audio in a single pass. This article covers how it differs from closed models like Vidu Q3 and what its architecture looks like.
Robbyant, an Ant Group subsidiary, released LingBot-World, a world model that generates interactive video in real time from a single image. This article covers how it differs from conventional video generators, its technical features, and Apple Silicon support.