A DEV Community article proposes cross-modal distillation for wildfire evacuation routing that encodes road closures and AQI thresholds directly into the loss function. I look at the teacher-student gap when the student drops satellite imagery, why 23 ms edge inference is irrelevant if the sensor data is 5 minutes old, and what's missing for production.
Inclusion AI released LLaDA2.0-Uni, a 16B MoE diffusion LLM that handles image understanding, 1024px image generation, image editing, and interleaved text-image generation in a single model.
Xiaomi launched two MiMo-V2.5 models at once. MiMo-V2.5-Pro scores 57.2 on SWE-bench Pro, 63.8 on Claw-Eval, and 72.9 on τ3-Bench, which puts it in the frontier tier, while MiMo-V2.5 brings native omnimodality plus a 1M context. Both are API-only for now; open weights are promised but unscheduled.
Sentence Transformers v5.4 adds multimodal support. Eight embedding models and four rerankers including Qwen3-VL and NVIDIA Nemotron can now be used through a unified API.
Google DeepMind has released Gemma 4: four models—31B dense, 26B MoE (A4B), E4B, and E2B—with a 256K context, multimodal input, tool calling, and support for 140 languages.
A technical walkthrough of Alibaba's Qwen3-Omni-30B-A3B, an omni-modal model that activates only 3B of its 30B parameters and responds with speech to text, image, audio, and video inputs. The article covers the Thinker–Talker architecture, benchmarks, and the wider Qwen3 MoE family.