Luma AI's Uni-1 integrates image understanding and generation within a single decoder-only autoregressive model. Rather than using diffusion, it tokenizes text and image patches into a shared vocabulary and generates both modalities sequentially as one token stream.
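To make the shared-vocabulary idea concrete, here is a minimal sketch (not Uni-1's actual code): text ids and image-patch ids live in one vocabulary, a single decoder-only Transformer predicts the next token regardless of modality, and decoding simply restricts the softmax to image ids inside an image span. The vocabulary sizes, the `BOI` marker token, and all module names are assumptions for illustration.

```python
import torch
import torch.nn as nn

V_TEXT, V_IMAGE = 1000, 4096            # assumed vocabulary sizes
BOI = V_TEXT                            # assumed "begin of image" token id
VOCAB = V_TEXT + V_IMAGE                # one shared vocabulary
D, N_LAYERS, N_HEADS, MAX_LEN = 256, 4, 4, 512

class SharedVocabAR(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, D)    # text and image ids share one table
        self.pos = nn.Embedding(MAX_LEN, D)
        layer = nn.TransformerEncoderLayer(D, N_HEADS, 4 * D, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, N_LAYERS)
        self.head = nn.Linear(D, VOCAB)      # one next-token softmax over both modalities

    def forward(self, ids):
        T = ids.size(1)
        x = self.tok(ids) + self.pos(torch.arange(T, device=ids.device))
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(ids.device)
        return self.head(self.blocks(x, mask=mask))  # causal next-token logits

@torch.no_grad()
def generate(model, prompt_ids, n_image_tokens):
    """Greedy decoding: after emitting BOI, restrict sampling to image ids."""
    ids = prompt_ids.clone()
    in_image = False
    for _ in range(n_image_tokens + 1):
        logits = model(ids)[:, -1]
        if in_image:                          # mask out text ids inside an image span
            logits[:, :V_TEXT] = float("-inf")
        next_id = logits.argmax(-1, keepdim=True)
        in_image = in_image or (next_id.item() == BOI)
        ids = torch.cat([ids, next_id], dim=1)
    return ids

model = SharedVocabAR()
out = generate(model, torch.tensor([[1, 2, 3]]), n_image_tokens=8)
print(out.shape)  # generated image-patch ids would go to a VQ decoder for pixels
```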
AttnRes replaces the Transformer's fixed residual combination with softmax attention along the depth direction. A demonstration on Kimi Linear 48B improved GPQA-Diamond by +7.5 pt and HumanEval by +3.1 pt, while keeping training overhead below 4% and inference overhead below 2%.
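The sketch below is a generic reconstruction of that idea, not AttnRes's actual implementation: instead of the fixed residual sum x + f(x), each token computes softmax weights over the outputs of all earlier layers and mixes them. All names, shapes, and the single-head depth attention are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthAttnResidual(nn.Module):
    """Mixes the outputs of all earlier layers with per-token softmax weights,
    replacing the fixed x + f(x) residual combination."""
    def __init__(self, d_model):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, new_out, history):
        # history: list of tensors [B, T, D] (embeddings + earlier layer outputs)
        h = torch.stack(history + [new_out], dim=2)      # [B, T, L+1, D]
        q = self.q(new_out).unsqueeze(2)                 # [B, T, 1, D]
        k = self.k(h)                                    # [B, T, L+1, D]
        w = F.softmax((q * k).sum(-1) * self.scale, -1)  # [B, T, L+1] depth weights
        return (w.unsqueeze(-1) * h).sum(2)              # attention-weighted residual

class Block(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mix = DepthAttnResidual(d_model)

    def forward(self, x, history):
        h = self.ln(x)
        y = self.attn(h, h, h)[0]        # sublayer output
        return self.mix(y, history)      # depth attention instead of x + y

# Tiny forward pass: each block attends over the whole depth history.
x = torch.randn(2, 8, 64)
blocks = nn.ModuleList(Block(64, 4) for _ in range(3))
history = [x]
for blk in blocks:
    x = blk(x, history)
    history.append(x)
print(x.shape)  # torch.Size([2, 8, 64])
```

The extra cost is one small linear projection and a softmax over L+1 depth slots per layer, which is consistent with the low overhead figures reported above.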
A paper explains that two seemingly mysterious Transformer behaviors, attention concentrating heavily on specific tokens and unusually large activations in specific dimensions, are in fact manifestations of a single underlying mechanism.
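An illustrative toy (not the paper's analysis) shows how the two behaviors connect: a key vector carrying one unusually large activation dominates the softmax scores, so every query dumps most of its attention mass on that token, which then behaves as a sink. The magnitudes and dimensions chosen here are arbitrary assumptions.

```python
import torch

torch.manual_seed(0)
d = 64
keys = torch.randn(8, d)      # 8 ordinary tokens
keys[0, 3] = 100.0            # one massive activation in dimension 3 of token 0
queries = torch.randn(4, d)
queries[:, 3] = 1.0           # assume queries have a positive component there

scores = queries @ keys.T / d ** 0.5
attn = torch.softmax(scores, dim=-1)
print(attn[:, 0])             # token 0 absorbs nearly all attention mass
```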