
The rise of VLM-based OCR - DeepSeek-OCR and the potential of hybrid use

Ikesan

Why I looked into this

I came across a GitHub tool called local_ai_ocr. It looked like a local AI OCR tool, so I dug into it and found that it was a wrapper around DeepSeek’s official model, DeepSeek-OCR.

As I looked deeper into DeepSeek-OCR, I realized that it takes a fundamentally different approach from conventional OCR.

I had previously written about web OCR implementations in 2025. In that article I compared Tesseract.js, NDLOCR, Cloud Vision API, and others, but VLM-based OCR was only mentioned briefly as “AI.” This time I wanted to go deeper.

What DeepSeek-OCR Is

DeepSeek-OCR is an official project released by DeepSeek (deepseek-ai).

Its formal name is “Contexts Optical Compression,” which is interesting because it intentionally avoids using the word OCR.

How VLM-OCR differs from conventional OCR

Here, “VLM-OCR” means a method that uses a VLM (vision-language model) to extract text from images. It is fundamentally different from conventional OCR.

Comparison table

Item                    | Conventional OCR                    | VLM-OCR
------------------------|-------------------------------------|---------------------------------------------------
Representative examples | Tesseract.js, NDLOCR                | DeepSeek-OCR, GPT-4V
Method                  | Pattern matching / image processing | Reasoning by an LLM
Mechanism               | Recognizes character shapes         | Sees the image and generates what is written there
Output                  | Plain text                          | Structured text such as Markdown
Speed                   | Fast                                | Relatively slow
Resources               | Lightweight                         | GPU recommended, high memory usage

Strengths and weaknesses of each approach

Conventional OCR strengths

  • Fast and lightweight
  • Very accurate on clear printed text
  • Deterministic: same input, same output

Conventional OCR weaknesses

  • Confuses similar characters, such as 0 and O
  • Weak on handwriting and broken or distorted text
  • Little understanding of layout, such as tables and multi-column text

VLM-OCR strengths

  • Can use context to fill in gaps, such as inferring “this is a price field, so the value is probably 0”
  • Stronger on handwriting and distorted text
  • Can structure tables and multi-column layouts in the output
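
The price-field example above can also be mimicked with a purely rule-based post-processing step. A VLM does this implicitly from context; the following is only my own toy sketch of the idea, not part of any OCR engine:

```python
import re

def fix_price_field(raw: str) -> str:
    """Toy contextual fix: inside tokens that look like prices,
    treat the letter O/o as the digit 0."""
    def repl(m: re.Match) -> str:
        return m.group(0).replace("O", "0").replace("o", "0")
    # A token starting with an optional currency symbol and a digit,
    # followed by digit-like characters (including misread O/o)
    return re.sub(r"[$¥€]?\d[\dOo.,]*", repl, raw)

print(fix_price_field("Total: $1O.5O"))  # -> Total: $10.50
```

The rule only fires inside digit-led tokens, so ordinary words like "Total" are left alone.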

VLM-OCR weaknesses

  • Hallucination risk: it may generate characters that are not actually present
  • Heavy to run: GPU recommended, large model size
  • Non-deterministic: output can change even for the same input

The possibility of hybrid OCR

This naturally raises the idea that the two approaches could complement each other.

Complementary relationship

Weaknesses of conventional OCR   ->  strengths of VLM
────────────────────────────────────────────────────
0/O confusion                    ->  Use context to decide "this is a price, so 0"
Handwritten / broken text        ->  Infer the meaning and fill in the gaps
Layout understanding             ->  Structure tables and multi-column text

Weaknesses of VLM                ->  strengths of conventional OCR
────────────────────────────────────────────────────
Hallucination                    ->  Cross-check against evidence of actual characters
Fine-grained numeric accuracy    ->  Pixel-level recognition results
Speed and cost                   ->  Lightweight and fast first-pass processing
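
The hallucination cross-check in the first row could be sketched as a token-level comparison against the conventional OCR result; `flag_unsupported_tokens` is a hypothetical helper of my own, not an existing API:

```python
import difflib
import re

def flag_unsupported_tokens(vlm_text: str, ocr_text: str) -> list[str]:
    """Flag words in the VLM output that have no near match in the
    conventional OCR output -- candidate hallucinations for review."""
    ocr_words = re.findall(r"\w+", ocr_text)
    return [
        word
        for word in re.findall(r"\w+", vlm_text)
        # fuzzy match tolerates conventional-OCR misreads like "Invo1ce"
        if not difflib.get_close_matches(word, ocr_words, n=1, cutoff=0.6)
    ]

# "approved" has no pixel-level evidence in the conventional OCR result
print(flag_unsupported_tokens("Invoice 2024 approved", "Invo1ce 2024"))  # -> ['approved']
```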

Implementation idea

  1. Run both OCRs - execute conventional OCR and VLM-OCR in parallel
  2. Detect differences - identify where the results disagree
  3. Merge or confirm - prefer the VLM result for the mismatch, or have a human review it
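
Steps 2 and 3 can be sketched with `difflib` from Python's standard library (step 1 simply runs both engines, ideally in parallel, to produce the two input strings). This is a minimal illustration of the merge idea, not a production implementation:

```python
from difflib import SequenceMatcher

def hybrid_ocr(ocr_text: str, vlm_text: str) -> tuple[str, list[tuple[str, str]]]:
    """Diff two OCR results, prefer the VLM text where they disagree,
    and collect every disagreement for optional human review."""
    merged, disagreements = [], []
    for op, i1, i2, j1, j2 in SequenceMatcher(None, ocr_text, vlm_text).get_opcodes():
        if op == "equal":
            merged.append(ocr_text[i1:i2])
        else:  # replace / insert / delete: trust the VLM, but record it
            merged.append(vlm_text[j1:j2])
            disagreements.append((ocr_text[i1:i2], vlm_text[j1:j2]))
    return "".join(merged), disagreements

text, diffs = hybrid_ocr("Price: 1O0 yen", "Price: 100 yen")
print(text)   # -> Price: 100 yen
print(diffs)  # -> [('O', '0')]
```

In a review workflow, the `diffs` list is exactly what you would surface to a human instead of auto-accepting the VLM side.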

This seems especially useful in business workflows where accuracy matters, such as invoice processing or digitizing contracts.

About local_ai_ocr

The local_ai_ocr tool mentioned at the beginning is a third-party Windows package built around DeepSeek-OCR.

Running offline is attractive, but because it is third-party, long-term maintenance is uncertain. If you want to use it seriously, directly using DeepSeek-OCR itself may be safer.

OCR is moving from “reading” to “understanding”

With VLM-OCR, the meaning of OCR is changing.

Conventional OCR was a tool for “reading characters from an image.” VLM-OCR is closer to “understanding the content of an image and turning it into text.”

It is not a question of which is better overall. The important thing is choosing the right tool for the job. And there is still room to explore hybrid approaches that combine the two.