
GLM-OCR (0.9B) sets a new SOTA for document parsing, so I checked columns, vertical text, and math support


Not long after writing about PaddleOCR-VL-1.5, the 0.9B parameter class moved again. GLM-OCR, released by Zhipu AI and the THUDM group at Tsinghua University, reached 94.62% on OmniDocBench v1.5.

This blog has already tried NDLOCR, Tesseract.js, PaddleOCR, and DeepSeek-OCR before. This time I focused on the areas OCR usually struggles with: paragraph structure, vertical and horizontal text, and math.

What GLM-OCR is

The technical report was published on March 11, 2026.

| Item | Details |
| --- | --- |
| Parameters | 0.9B (CogViT 0.4B + GLM-0.5B) |
| Developers | Zhipu AI + Tsinghua THUDM |
| License | Code: Apache 2.0, model weights: MIT |
| Supported languages | 100+ including Japanese |
| Output formats | Markdown, JSON with coordinates, LaTeX |
| Usage | vLLM, SGLang, Ollama, HuggingFace Transformers, `pip install glmocr` |

The architecture has three pieces:

```
CogViT (vision encoder, 0.4B)
  → cross-modal connector (token downsampling)
    → GLM-0.5B (language decoder)
```

Its standout feature is Multi-Token Prediction (MTP), which emits an average of 5.2 tokens per decoding step and yields roughly a 50% throughput gain over standard one-token-at-a-time autoregressive decoding.
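A back-of-the-envelope way to see what MTP buys: fewer sequential forward passes for the same output length. This is a toy calculation, not the real decoder; note the report's ~50% wall-clock gain is far smaller than the raw 5.2x step reduction, since each MTP step does more work and not every drafted token is kept.

```python
# Toy illustration: if a model emits an average of k tokens per decoding
# step instead of 1, the number of sequential steps for an n-token output
# shrinks by roughly a factor of k.
import math

def decoding_steps(n_tokens: int, tokens_per_step: float) -> int:
    """Sequential forward passes needed to emit n_tokens."""
    return math.ceil(n_tokens / tokens_per_step)

n = 1000
baseline = decoding_steps(n, 1.0)  # standard autoregressive decoding
mtp = decoding_steps(n, 5.2)       # MTP, average rate from the report

print(baseline, mtp)  # 1000 vs 193 sequential steps
```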

Training happens in four stages:

  1. Pretrain the vision encoder on hundreds of billions of image-text pairs
  2. Multimodal pretraining plus MTP adaptation
  3. OCR-focused supervised fine-tuning for text, formulas, tables, and KIE
  4. Reinforcement learning with GRPO

The fourth stage is the interesting one. Different rewards are used for different tasks: normalized edit distance for text, CDM for formulas, TEDS for tables, and field-level F1 for KIE. It also penalizes repetition and invalid JSON.
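To make the reward design concrete, here is a minimal sketch of two of the signals described above: the text reward as one minus normalized edit distance, and a penalty for invalid JSON in KIE outputs. The exact formulas and weights are not public; everything here is an illustrative assumption, not GLM-OCR's actual reward code.

```python
# Hedged sketch of GRPO-style reward signals for OCR outputs.
import json

def normalized_edit_distance(pred: str, ref: str) -> float:
    """Levenshtein distance divided by the longer string's length (0 = identical)."""
    m, n = len(pred), len(ref)
    if max(m, n) == 0:
        return 0.0
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,            # deletion
                         cur[j - 1] + 1,         # insertion
                         prev[j - 1] + (pred[i - 1] != ref[j - 1]))  # substitution
        prev = cur
    return prev[n] / max(m, n)

def text_reward(pred: str, ref: str) -> float:
    # Text regions: higher reward for closer matches.
    return 1.0 - normalized_edit_distance(pred, ref)

def json_validity_penalty(pred: str) -> float:
    # KIE outputs must parse as JSON; invalid JSON is penalized.
    try:
        json.loads(pred)
        return 0.0
    except json.JSONDecodeError:
        return 1.0

print(round(text_reward("hello world", "hello w0rld"), 3))  # 0.909 (1 sub in 11 chars)
print(json_validity_penalty('{"total": 42}'))               # 0.0
```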

Benchmark comparison

On OmniDocBench v1.5, GLM-OCR slightly beats the model I covered in the PaddleOCR-VL-1.5 article.

| Model | Parameters | Overall |
| --- | --- | --- |
| GLM-OCR | 0.9B | 94.62 |
| PaddleOCR-VL-1.5 | 0.9B | 94.50 |
| MinerU2.5 | 1.2B | 90.67 |
| Gemini-3 Pro | undisclosed | 90.33 |
| Qwen3-VL | 235B | 89.15 |

A 0.9B model outperforming a 235B model by more than five points is striking. Specialization still matters.

| Metric | GLM-OCR | PaddleOCR-VL-1.5 |
| --- | --- | --- |
| Text Edit (lower is better) | 0.040 | 0.035 |
| Formula CDM | 93.90 | 94.21 |
| Table TEDS | 93.96 | - |
| Table TEDS-S | 96.39 | - |
| Reading Order Edit (lower is better) | 0.044 | - |

PaddleOCR-VL-1.5 is slightly better at raw text and formulas, while GLM-OCR is much stronger on tables, which is what pushes the total score ahead.

Paragraphs and layout parsing

One of OCR’s hardest problems is layout parsing. In my NDLOCR histogram article, a four-column vertical book layout forced me to fall back to PyMuPDF and histograms after Layout Parser failed.

GLM-OCR uses a two-stage pipeline:

```mermaid
graph TD
    A[Input image] --> B[PP-DocLayout-V3<br/>layout parsing]
    B --> C1[Paragraph]
    B --> C2[Table]
    B --> C3[Formula]
    B --> C4[Figure]
    B --> C5[Header / footer]
    C1 --> D[GLM-OCR Core<br/>parallel recognition]
    C2 --> D
    C3 --> D
    C4 --> D
    C5 --> D
    D --> E[Merge & Post Process<br/>restore reading order]
    E --> F[Structured output<br/>Markdown / JSON]
```

PP-DocLayout-V3 first splits the page into semantic regions. GLM-OCR then recognizes each region in parallel, and Merge & Post Process restores reading order before producing structured output.
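The flow above can be sketched in a few lines. `detect_layout` and `recognize` below are stand-ins for PP-DocLayout-V3 and the GLM-OCR core; neither reflects the real APIs, only the detect → parallel-recognize → merge shape of the pipeline.

```python
# Minimal sketch of the two-stage pipeline: detect regions, recognize them
# in parallel, then merge the results back into reading order.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Region:
    kind: str   # "paragraph", "table", "formula", ...
    order: int  # reading-order index assigned by the layout stage
    crop: str   # placeholder for the cropped region image

def detect_layout(page: str) -> list[Region]:
    # Stand-in: a real detector returns typed, ordered regions with bboxes.
    return [Region("paragraph", 0, page + ":p0"),
            Region("table", 1, page + ":t0"),
            Region("formula", 2, page + ":f0")]

def recognize(region: Region) -> tuple[int, str]:
    # Stand-in for per-region recognition; returns (order, recognized text).
    return region.order, f"<{region.kind}>{region.crop}</{region.kind}>"

def parse_page(page: str) -> str:
    regions = detect_layout(page)
    with ThreadPoolExecutor() as pool:  # regions recognized in parallel
        results = list(pool.map(recognize, regions))
    results.sort(key=lambda r: r[0])    # merge: restore reading order
    return "\n".join(text for _, text in results)

print(parse_page("page1"))
```

The sketch also makes the failure mode visible: everything downstream consumes `detect_layout`'s output, so a wrong region split cannot be recovered later.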

That design reduces hallucinations, but if the first-stage layout detector is wrong, the error flows through the entire pipeline. The report lists that as a known limitation.

Vertical and horizontal text

The report barely discusses vertical text directly.

The model card mentions support for “diverse text orientations,” and the PaddleOCR pipeline includes orientation-classification options. Japanese and Chinese are both supported.

But the internal evaluation shows the multilingual scenario scoring only 69.3, well below categories such as receipt KIE and tables. That suggests CJK text and vertical layouts may still be a weak point.

Compared with NDLOCR, which was designed by the National Diet Library of Japan and is very strong at vertical Japanese text, GLM-OCR seems more capable at structure understanding than at the fine-grained quirks of vertical layout.

Math recognition

Math is one of GLM-OCR’s strongest areas.

| Benchmark | Score |
| --- | --- |
| UniMERNet | 96.5 (SOTA) |
| Formula CDM (OmniDocBench) | 93.90 |
| OlmOCR-Bench Arxiv Math | 80.7% |

It outputs formulas in LaTeX and handles fractions, subscripts, superscripts, matrices, determinants, and other 2D notations with high accuracy. The report switches into formula mode with the prompt "Formula Recognition:".
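If you run the model via Ollama (see the setup section below), the formula prompt can be sent through Ollama's standard `/api/generate` endpoint, which accepts base64-encoded images. The model tag `glm-ocr` and the prompt string are taken from this article; verify both against the model card before relying on them. This sketch only builds the request payload.

```python
# Hedged sketch: building an Ollama /api/generate request that asks the
# model for formula recognition. POST the result to
# http://localhost:11434/api/generate with any HTTP client.
import base64
import json

def formula_request(image_bytes: bytes, model: str = "glm-ocr") -> str:
    payload = {
        "model": model,
        "prompt": "Formula Recognition:",  # mode-switch prompt from the report
        # Ollama's /api/generate takes images as base64 strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }
    return json.dumps(payload)

req = formula_request(b"\x89PNG...")  # pass real PNG/JPEG bytes in practice
print(json.loads(req)["prompt"])  # Formula Recognition:
```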

Like other VLM-based OCR systems, GLM-OCR interprets the image as a whole rather than character by character, which is precisely why it handles two-dimensional math layout better than older pattern-matching OCR.

The report also says that highly complex mathematical expressions can still fail, so this is not a universal solution.

Other recognition abilities

The model also covers a broad set of edge cases.

| Category | Score |
| --- | --- |
| Table recognition (TEDS-S) | 96.39 |
| Header / footer | 95.8% |
| Receipt KIE | 94.5 |
| Seal recognition | 90.5 |
| Baseline | 98.8% |
| Long fine text | 86.9% |
| Handwriting | 87.0% |
| Code documents | 84.7% |
| Multi-column layouts | 76.7% |
| Old scans | 37.6% |

Old scans are still weak at 37.6%, which means degraded paper and faded print remain a challenge. Given that NDLOCR was built for old National Diet Library material, there are still cases where a more traditional OCR stack may be a better fit.

How to try it

The easiest way is Ollama:

```shell
ollama run glm-ocr
```

There is also a Python SDK:

```shell
pip install glmocr
```

It also works through vLLM, SGLang, and HuggingFace Transformers. At 0.9B parameters, it is small enough that edge deployment is still on the table.

When to use VLM OCR versus traditional OCR

Based on all the OCR approaches I have tried in this blog, the rough split looks like this:

| Use case | Recommendation |
| --- | --- |
| High-accuracy document parsing, especially tables and math | GLM-OCR / PaddleOCR-VL-1.5 |
| Vertical Japanese text and old books | NDLOCR / NDLOCR-Lite |
| In-browser real-time OCR | Tesseract.js |
| OCR post-processing and correction | Encoder model + LLM |
| Bulk PDF processing | DeepSeek-OCR-2 |
| On-device mobile OCR | NDLOCR-Lite + ONNX Runtime |

VLM OCR is excellent at structure understanding and multitask recognition, but niche cases such as vertical text, old scans, and browser execution still give traditional OCR an edge. We are not at the point where one VLM can replace everything.


The 69.3 multilingual score still bothers me. I need to run my own Japanese vertical-text samples through Ollama to see how far it really goes.