The rise of VLM-based OCR - DeepSeek-OCR and the potential of hybrid use
Why I looked into this
I came across a GitHub tool called local_ai_ocr. It looked like a local AI OCR tool, so I dug into it and found that it was a wrapper around DeepSeek’s official model, DeepSeek-OCR.
As I looked deeper into DeepSeek-OCR, I realized that it takes a fundamentally different approach from conventional OCR.
I had previously written about web OCR implementations in 2025. In that article I compared Tesseract.js, NDLOCR, Cloud Vision API, and others, but VLM-based OCR was only mentioned briefly as “AI.” This time I wanted to go deeper.
What DeepSeek-OCR Is
DeepSeek-OCR is an official project released by DeepSeek (deepseek-ai).
- Repository: github.com/deepseek-ai/DeepSeek-OCR
- Stars: 22k+
- License: MIT
- Release: October 2025
- Paper: arXiv:2510.18234
Its formal name is “Contexts Optical Compression,” which is interesting because it intentionally avoids using the word OCR.
How VLM-OCR differs from conventional OCR
Here, “VLM-OCR” means a method that uses a VLM (vision-language model) to extract text from images. It is fundamentally different from conventional OCR.
Comparison table
| Item | Conventional OCR | VLM-OCR |
|---|---|---|
| Representative examples | Tesseract.js, NDLOCR | DeepSeek-OCR, GPT-4V |
| Method | Pattern matching / image processing | Reasoning by an LLM |
| Mechanism | Recognizes character shapes | Sees the image and generates what is written there |
| Output | Plain text | Structured text such as Markdown |
| Speed | Fast | Relatively slow |
| Resources | Lightweight | GPU recommended, high memory usage |
Strengths and weaknesses of each approach
Conventional OCR strengths
- Fast and lightweight
- Very accurate on clear printed text
- Deterministic: same input, same output
Conventional OCR weaknesses
- Weak against characters that look similar, such as 0/O or 1/l
- Poor at handwritten or distorted text
- Weak at understanding layout such as tables and multi-column documents (NDLOCR also struggled with four-column layouts)
VLM-OCR strengths
- Can use context to fill in gaps, such as inferring “this is a price field, so the value is probably 0”
- Stronger on handwriting and distorted text
- Can structure tables and multi-column layouts in the output
VLM-OCR weaknesses
- Hallucination risk: it may generate characters that are not actually present
- Heavy to run: GPU recommended, large model size
- Non-deterministic: output can change even for the same input
The possibility of hybrid OCR
This naturally raises the idea that the two approaches could complement each other.
Complementary relationship
| Weakness of conventional OCR | Complementary VLM strength |
|---|---|
| 0/O confusion | Uses context to decide "this is a price, so 0" |
| Handwritten / broken text | Infers the meaning and fills in the gaps |
| Layout understanding | Structures tables and multi-column text |

| Weakness of VLM | Complementary conventional-OCR strength |
|---|---|
| Hallucination | Cross-check against evidence of actual characters |
| Fine-grained numeric accuracy | Pixel-level recognition results |
| Speed and cost | Lightweight, fast first-pass processing |
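The hallucination cross-check can be made concrete. Here is a minimal sketch, assuming both passes return plain strings; the function name and the regex heuristic are my own illustration, not part of DeepSeek-OCR or any specific library:

```python
import re

def flag_unsupported_numbers(vlm_text: str, conventional_text: str) -> list[str]:
    """Return numbers in the VLM output that the conventional OCR never produced.

    These are hallucination candidates: the VLM may have "corrected" a value
    with no pixel-level evidence, so they should be reviewed, not trusted.
    """
    number = r"\d+(?:\.\d+)?"
    evidence = set(re.findall(number, conventional_text))
    return [n for n in re.findall(number, vlm_text) if n not in evidence]

# The VLM output contains a digit string the pixel-level pass never saw
# (the conventional OCR read the last digit as the letter O).
print(flag_unsupported_numbers("Total: 1250 yen", "Total: 125O yen"))  # → ['1250']
```

In practice a flagged number is not necessarily wrong (here the VLM is likely right), but it marks exactly the spots where human review pays off.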
Implementation idea
1. Run both OCRs: execute conventional OCR and VLM-OCR in parallel
2. Detect differences: identify where the results disagree
3. Merge or confirm: prefer the VLM result for mismatches, or have a human review them
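The detect-and-merge steps can be sketched with nothing but the standard library. This is an illustration, not a production pipeline: the two input strings stand in for the outputs of real engines (e.g. Tesseract and DeepSeek-OCR run on the same image), and the diffing uses Python's `difflib`:

```python
import difflib

def find_disagreements(conventional_text: str, vlm_text: str) -> list[tuple[str, str]]:
    """List (conventional, vlm) fragment pairs where the two outputs disagree."""
    matcher = difflib.SequenceMatcher(None, conventional_text, vlm_text)
    return [
        (conventional_text[i1:i2], vlm_text[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

def merge_preferring_vlm(conventional_text: str, vlm_text: str) -> str:
    """Keep the agreed text; on a mismatch, take the VLM side (step 3)."""
    matcher = difflib.SequenceMatcher(None, conventional_text, vlm_text)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        parts.append(conventional_text[i1:i2] if tag == "equal" else vlm_text[j1:j2])
    return "".join(parts)

# Hard-coded stand-ins for the two engines' outputs on the same image:
conv = "Price: O.00"   # conventional OCR misread 0 as O
vlm  = "Price: 0.00"   # the VLM used context to pick the digit
print(find_disagreements(conv, vlm))    # → [('O', '0')]
print(merge_preferring_vlm(conv, vlm))  # → Price: 0.00
```

For human review instead of automatic merging, the pairs from `find_disagreements` are exactly the spans to surface in a UI; everything outside them was agreed on by both engines.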
This seems especially useful in business workflows where accuracy matters, such as invoice processing or digitizing contracts.
About local_ai_ocr
The local_ai_ocr tool mentioned at the beginning is a third-party Windows package built around DeepSeek-OCR.
- Repository: github.com/th1nhhdk/local_ai_ocr
- License: Apache-2.0
- Supported OS: Windows 10 and later
- Initial setup: requires a 6.67GB model download
Running offline is attractive, but because it is third-party, long-term maintenance is uncertain. If you want to use it seriously, directly using DeepSeek-OCR itself may be safer.
OCR is moving from “reading” to “understanding”
With VLM-OCR, the meaning of OCR is changing.
Conventional OCR was a tool for “reading characters from an image.” VLM-OCR is closer to “understanding the content of an image and turning it into text.”
It is not a question of which is better overall. The important thing is choosing the right tool for the job. And there is still room to explore hybrid approaches that combine the two.
Related Articles
- OCR: a summary of the limits and lessons from 2025 web implementations - comparison of OCR libraries
- Tesseract.js OCR demo - browser-based OCR demo
- NDLOCR Docker build guide - setup for high-accuracy Japanese OCR
- Solving NDLOCR column detection with histogram analysis - practical layout analysis