#VLM

8 articles

Tech May 19, 2026 updated 9 min

Lance 3B unified multimodal: 40GB VRAM, RunPod costs, and why weights are split

40GB+ VRAM for a 3B model. VBench 85.11 beats dedicated 14B video generators. RunPod GPU costs from $2.2/session. The 'unified' model still ships as two checkpoint files.

AI Multimodal Image Generation Video Generation VLM Open Source HuggingFace

Tech May 2, 2026 14 min

OCR-Memory Lets Agents Recall History as Images

A read-through of OCR-Memory (arXiv:2604.26622). It renders agent execution history into images, uses Set-of-Mark to let a VLM pick relevant segments, then retrieves verbatim text from the original logs.

AI AI Agents OCR VLM RAG Token Management Paper

Tech Apr 30, 2026 11 min

Using Confidence Scores to Reduce Human Review in Document Extraction

Designing field-level confidence thresholds for human-in-the-loop document extraction, and the OCR and threshold obstacles encountered when automating journal entries with freee MCP.

AI OCR VLM MCP AI Agents API

Tech Apr 27, 2026 7 min

LLaDA2.0-Uni Is an Open-Weight Diffusion LLM That Unifies Image Understanding and Generation

Inclusion AI released LLaDA2.0-Uni, a 16B MoE diffusion LLM that handles image understanding, 1024px image generation, image editing, and interleaved text-image generation in a single model.

AI LLM Image Generation VLM MoE Open Model Multimodal

Tech Apr 14, 2026 10 min

Can Local Vision LLMs Extract RPG Stats from Character Art?

I tested local Vision LLMs (Gemma 3, Qwen2.5-VL, Llama 3.2 Vision, Gemma 4) to see if they could look at character illustrations and pixel art and generate RPG-style stats in JSON format.

AI Local LLM VLM Image Recognition Ollama Gemma Qwen Apple Silicon Experiment

Tech Mar 17, 2026 5 min

GLM-OCR (0.9B) sets a new SOTA for document parsing, so I checked columns, vertical text, and math support

Zhipu AI's GLM-OCR reaches 94.62% on OmniDocBench v1.5 despite using only 0.9B parameters. I dug into its layout parsing, vertical text handling, and math recognition.

AI OCR VLM GLM

Tech Jan 30, 2026 4 min

PaddleOCR-VL-1.5 - document parsing SOTA with only 0.9B parameters

Baidu's PaddleOCR-VL-1.5 reaches 94.5% accuracy on OmniDocBench v1.5 with just 0.9B parameters, surpassing large models such as GPT-4o and Qwen2.5-VL-72B.

AI OCR VLM PaddlePaddle

Tech Jan 20, 2026 4 min

The rise of VLM-based OCR - DeepSeek-OCR and the potential of hybrid use

Explains the difference between conventional OCR and OCR based on a VLM (vision-language model), introduces DeepSeek-OCR, and explores the possibility of combining both approaches.

AI OCR DeepSeek VLM