The rise of VLM-based OCR - DeepSeek-OCR and the potential of hybrid use
Why I looked into this
I came across a GitHub tool called local_ai_ocr. It looked like a local AI OCR tool, so I dug into it and found that it was a wrapper around DeepSeek’s official model, DeepSeek-OCR.
As I looked deeper into DeepSeek-OCR, I realized that it takes a fundamentally different approach from conventional OCR.
I had previously written about web OCR implementations in 2025. In that article I compared Tesseract.js, NDLOCR, Cloud Vision API, and others, but VLM-based OCR was only mentioned briefly as “AI.” This time I wanted to go deeper.
What DeepSeek-OCR Is
DeepSeek-OCR is an official project released by DeepSeek (deepseek-ai).
- Repository: github.com/deepseek-ai/DeepSeek-OCR
- Stars: 22k+
- License: MIT
- Release: October 2025
- Paper: arXiv:2510.18234
Its formal name is “Contexts Optical Compression,” which is interesting because it intentionally avoids using the word OCR.
How VLM-OCR differs from conventional OCR
Here, “VLM-OCR” means a method that uses a VLM (vision-language model) to extract text from images. It is fundamentally different from conventional OCR.
Comparison table
| Item | Conventional OCR | VLM-OCR |
|---|---|---|
| Representative examples | Tesseract.js, NDLOCR | DeepSeek-OCR, GPT-4V |
| Method | Pattern matching / image processing | Reasoning by an LLM |
| Mechanism | Recognizes character shapes | Sees the image and generates what is written there |
| Output | Plain text | Structured text such as Markdown |
| Speed | Fast | Relatively slow |
| Resources | Lightweight | GPU recommended, high memory usage |
Strengths and weaknesses of each approach
Conventional OCR strengths
- Fast and lightweight
- Very accurate on clear printed text
- Deterministic: same input, same output
Conventional OCR weaknesses
- Weak against characters that look similar, such as 0/O or 1/l
- Poor at handwritten or distorted text
- Weak at understanding layout such as tables and multi-column documents (NDLOCR also struggled with four-column layouts)
VLM-OCR strengths
- Can use context to fill in gaps, such as inferring “this is a price field, so the value is probably 0”
- Stronger on handwriting and distorted text
- Can structure tables and multi-column layouts in the output
VLM-OCR weaknesses
- Hallucination risk: it may generate characters that are not actually present
- Heavy to run: GPU recommended, large model size
- Non-deterministic: output can change even for the same input
The possibility of hybrid OCR
This naturally raises the idea that the two approaches could complement each other.
Complementary relationship
| Weakness of conventional OCR | Complementary VLM strength |
|---|---|
| 0/O confusion | Uses context to decide "this is a price, so 0" |
| Handwritten / broken text | Infers the meaning and fills in the gaps |
| Layout understanding | Structures tables and multi-column text |

| Weakness of VLM | Complementary conventional-OCR strength |
|---|---|
| Hallucination | Cross-check against evidence of actual characters |
| Fine-grained numeric accuracy | Pixel-level recognition results |
| Speed and cost | Lightweight, fast first-pass processing |
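The hallucination cross-check can be made concrete. Here is a minimal sketch, assuming both passes return plain strings; the function name and the regex heuristic are my own illustration, not part of DeepSeek-OCR or any specific library:

```python
import re

def flag_unsupported_numbers(vlm_text: str, conventional_text: str) -> list[str]:
    """Return numbers in the VLM output that the conventional OCR never produced.

    These are hallucination candidates: the VLM may have "corrected" a value
    with no pixel-level evidence, so they should be reviewed, not trusted.
    """
    number = r"\d+(?:\.\d+)?"
    evidence = set(re.findall(number, conventional_text))
    return [n for n in re.findall(number, vlm_text) if n not in evidence]

# The VLM output contains a digit string the pixel-level pass never saw
# (the conventional OCR read the last digit as the letter O).
print(flag_unsupported_numbers("Total: 1250 yen", "Total: 125O yen"))  # → ['1250']
```

In practice a flagged number is not necessarily wrong (here the VLM is likely right), but it marks exactly the spots where human review pays off.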
Implementation idea
1. Run both OCRs: execute conventional OCR and VLM-OCR in parallel
2. Detect differences: identify where the results disagree
3. Merge or confirm: prefer the VLM result for mismatches, or have a human review them
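The detect-and-merge steps can be sketched with nothing but the standard library. This is an illustration, not a production pipeline: the two input strings stand in for the outputs of real engines (e.g. Tesseract and DeepSeek-OCR run on the same image), and the diffing uses Python's `difflib`:

```python
import difflib

def find_disagreements(conventional_text: str, vlm_text: str) -> list[tuple[str, str]]:
    """List (conventional, vlm) fragment pairs where the two outputs disagree."""
    matcher = difflib.SequenceMatcher(None, conventional_text, vlm_text)
    return [
        (conventional_text[i1:i2], vlm_text[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

def merge_preferring_vlm(conventional_text: str, vlm_text: str) -> str:
    """Keep the agreed text; on a mismatch, take the VLM side (step 3)."""
    matcher = difflib.SequenceMatcher(None, conventional_text, vlm_text)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        parts.append(conventional_text[i1:i2] if tag == "equal" else vlm_text[j1:j2])
    return "".join(parts)

# Hard-coded stand-ins for the two engines' outputs on the same image:
conv = "Price: O.00"   # conventional OCR misread 0 as O
vlm  = "Price: 0.00"   # the VLM used context to pick the digit
print(find_disagreements(conv, vlm))    # → [('O', '0')]
print(merge_preferring_vlm(conv, vlm))  # → Price: 0.00
```

For human review instead of automatic merging, the pairs from `find_disagreements` are exactly the spans to surface in a UI; everything outside them was agreed on by both engines.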
This seems especially useful in business workflows where accuracy matters, such as invoice processing or digitizing contracts.
About local_ai_ocr
The local_ai_ocr tool mentioned at the beginning is a third-party Windows package built around DeepSeek-OCR.
- Repository: github.com/th1nhhdk/local_ai_ocr
- License: Apache-2.0
- Supported OS: Windows 10 and later
- Initial setup: requires a 6.67GB model download
Running offline is attractive, but because it is third-party, long-term maintenance is uncertain. If you want to use it seriously, directly using DeepSeek-OCR itself may be safer.
OCR is moving from “reading” to “understanding”
With VLM-OCR, the meaning of OCR is changing.
Conventional OCR was a tool for “reading characters from an image.” VLM-OCR is closer to “understanding the content of an image and turning it into text.”
It is not a question of which is better overall. The important thing is choosing the right tool for the job. And there is still room to explore hybrid approaches that combine the two.
Related Articles
- OCR: a summary of the limits and lessons from 2025 web implementations - comparison of OCR libraries
- Tesseract.js OCR demo - browser-based OCR demo
- NDLOCR Docker build guide - setup for high-accuracy Japanese OCR
- Solving NDLOCR column detection with histogram analysis - practical layout analysis