40GB+ VRAM for a 3B model. VBench 85.11 beats dedicated 14B video generators. RunPod GPU costs from $2.2/session. The 'unified' model still ships as two checkpoint files.
A read of arXiv:2604.26622 OCR-Memory. It renders agent execution history into images, uses Set-of-Mark to let a VLM pick relevant segments, then retrieves verbatim text from the original logs.
Designing field-level confidence thresholds for human-in-the-loop document extraction, and the OCR and threshold walls hit when automating journal entries with freee MCP.
Inclusion AI released LLaDA2.0-Uni. A 16B MoE diffusion LLM that handles image understanding, 1024px image generation, image editing, and interleaved text-image generation in a single model.
I tested local Vision LLMs (Gemma 3, Qwen2.5-VL, Llama 3.2 Vision, Gemma 4) to see if they could look at character illustrations and pixel art and generate RPG-style stats in JSON format.
Zhipu AI's GLM-OCR reaches 94.62% on OmniDocBench v1.5 despite using only 0.9B parameters. I dug into its layout parsing, vertical text handling, and math recognition.
Baidu's PaddleOCR-VL-1.5 reaches 94.5% accuracy on OmniDocBench v1.5 with just 0.9B parameters, surpassing large models such as GPT-4o and Qwen2.5-VL-72B.
An explanation of the difference between conventional OCR and VLM (vision-language model) based OCR. Introduces DeepSeek-OCR and explores the possibility of combining both approaches.