
OCR Correction on Showa-Era Documents with NDLOCR-Lite and Local LLMs

In the previous article I ran NDLOCR-Lite on Windows 11. This time I’m doing the same on an Apple Silicon Mac (M1 Max) and taking it further by passing the OCR results to a local LLM for correction.

Repository: ndl-lab/ndlocr-lite

Test Environment

Item   | Spec
OS     | macOS Tahoe 26.2
Chip   | Apple M1 Max
RAM    | 64GB
Python | 3.13.11 (Homebrew / miniconda)

CLI Setup

There’s a GUI version too, but I’m going with the CLI for batch processing and scripting:

cd ~/projects
git clone https://github.com/ndl-lab/ndlocr-lite.git
cd ndlocr-lite
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Apple Silicon wheels for onnxruntime are distributed on PyPI, so there’s no build trouble. All dependencies went in without issues.

Verifying with Sample Images

The repository includes 3 sample images in resource/. Create the output directory and run:

mkdir -p output
cd src
python ocr.py --sourceimg ../resource/digidepo_2531162_0024.jpg --output ../output --viz True
[INFO] Intialize Model
[INFO] Inference Image
44
[INFO] Saving result on /Users/.../output/viz_digidepo_2531162_0024.jpg
Total calculation time (Detection + Recognition): 2.3908920288085938

44 regions detected, processed in 2.39 seconds. TXT / JSON / XML / visualization image all generated.

Windows (Ryzen 7 5800HS) took 3.30 seconds on the same image, so the M1 Max is roughly 30% faster. Getting that margin from CPU-only inference, with no GPU involved, says a lot about Apple Silicon's memory bandwidth.
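Since the CLI was chosen precisely for batch processing, a driver script is the natural next step. A minimal sketch in Python, assuming the same ocr.py flags shown above (the `ocr_command` and `run_batch` helper names are mine, not part of NDLOCR-Lite):

```python
import subprocess
from pathlib import Path

def ocr_command(image: Path, out_dir: Path) -> list[str]:
    """Build the ocr.py invocation using the flags shown above."""
    return ["python", "ocr.py",
            "--sourceimg", str(image),
            "--output", str(out_dir),
            "--viz", "True"]

def run_batch(src_dir: Path, out_dir: Path) -> None:
    """OCR every JPEG in src_dir, writing results into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for img in sorted(src_dir.glob("*.jpg")):
        subprocess.run(ocr_command(img, out_dir), check=True)

# e.g. from inside src/:
# run_batch(Path("../resource"), Path("../output"))
```

`check=True` aborts the batch on the first failing page, which is usually what you want when a bad image would otherwise silently produce empty output.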

OCR Results

A 1963 National Diet Library staff manual:

(z)気送子送付管
気送子送付には、上記気送管にて送付するものと、空
気の圧縮を使用せず,直接落下させる装置の二通りがあ
る。後者の送付雪は出納台左側に設置されており.5
3.1の各層ステーションに直接落下するよう3本の管
が通じ投入ロのフタに層表示が記されている。

Same results as the Windows version. “管” misread as “雪”, “(ヱ)” as “(z)” — same errors appear. No Mac-specific issues; confirmed identical model output.

Correcting OCR Results with a Local LLM

Manually fixing OCR errors is tedious, and pattern matching can’t handle old kanji or context-dependent substitutions. The idea is to let an LLM reason about the context — “given what the document is about, this is probably a misread.”

I chose Qwen 3.5 (35B Dense, 24GB) for correction. M1 Max 64GB handles it easily. For OCR correction on older documents you need parameter count over speed — reasoning about historical character forms requires more capacity.

The ollama Version Problem

Hit a snag here. The Homebrew version of ollama was 0.17.0, but Qwen 3.5 was just released on 2/25, and the stable version doesn’t support it yet.

Error: pull model manifest: 412:
The model you are attempting to pull requires a newer version of Ollama
that may be in pre-release.

brew upgrade ollama left it at 0.17.0. The official install script (curl -fsSL https://ollama.com/install.sh | sh) also installs the same version.

Checking GitHub Releases, the pre-release v0.17.1-rc2 (published 2/24) was needed.

# Install the pre-release version
curl -L https://github.com/ollama/ollama/releases/download/v0.17.1-rc2/ollama-darwin -o /usr/local/bin/ollama
chmod +x /usr/local/bin/ollama

In practice the asset was named ollama-darwin.tgz not ollama-darwin, so I used:

curl -L -o /tmp/ollama-darwin.tgz https://github.com/ollama/ollama/releases/download/v0.17.1-rc2/ollama-darwin.tgz
tar xzf /tmp/ollama-darwin.tgz -C /tmp
sudo mv /tmp/ollama /usr/local/bin/ollama
sudo chmod +x /usr/local/bin/ollama

Also, if Homebrew’s ollama is running as a service it holds port 11434, blocking ollama serve with the new binary. Run brew services stop ollama first.
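Before launching the new binary it is worth confirming nothing is still holding the port. A small sketch; the `port_in_use` helper is my own, not part of ollama:

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already answering on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        return s.connect_ex((host, port)) == 0

# If this prints True before you run the new binary,
# stop the old service first: brew services stop ollama
print(port_in_use(11434))
```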

Worth remembering: when using newly-released models, you may need to track ollama pre-releases.
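In scripted setups you can fail fast instead of hitting the 412 at pull time by asking the running server for its version first. A sketch: GET /api/version is a real ollama endpoint; the `parse_version` helper is my own:

```python
import json, re, urllib.request

def parse_version(v: str) -> tuple[int, ...]:
    """Turn '0.17.1-rc2' into a comparable tuple: (0, 17, 1)."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)", v)
    return tuple(int(x) for x in m.groups())

def server_version(base: str = "http://localhost:11434") -> str:
    """Ask the running ollama server what it is via GET /api/version."""
    with urllib.request.urlopen(f"{base}/api/version", timeout=5) as r:
        return json.loads(r.read())["version"]

# e.g. abort early if the server is older than the model requires:
# assert parse_version(server_version()) >= parse_version("0.17.1")
```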

Correction Test (Thinking Mode On)

Fed the OCR results into ollama run:

cat output/digidepo_2531162_0024.txt | ollama run qwen3.5:35b \
  "The following is text OCR'd from a 1963 National Diet Library staff manual.
   Infer OCR misreads from context and suggest corrections.
   Show corrections in the format [original → corrected]."

Correction results:

Original | Corrected | Reason
(z) | (a) | Consistency with following symbols “b” and “c”
送付雪 | 送付箱 | Equipment/container name
5 3.1 | 5.3.1 | Section number
通じ | 通じて | Particle addition
投入ロ | 投入口 | Katakana “ロ” → kanji “口”
投入優 | 投入時 | Timing of the action
5.3./心 | 5.3.1 の | Section number misread
待成 | 待機 | Staff action
気送子送子管 | 気送子送付管 | Consistency with section title
一方交通 | 一方通行 | Corridor description
受けー方 | 受け口 | Opening as facility feature

Correction accuracy is decent. “送付雪→送付箱”, “投入ロ→投入口”, “待成→待機” — those are all correctly inferred from context.

One problem though: Qwen 3.5 defaults to thinking mode (outputs its reasoning chain), and for ~4KB of OCR text it produced 144KB of thought logs. Long wait to get to the actual corrections. Putting /no_think in the prompt didn’t work.

Turning Off Thinking Mode

To disable thinking in Qwen 3.5 via ollama, use the --think=false flag with ollama run:

cat output/digidepo_2531162_0024.txt | ollama run qwen3.5:35b --think=false \
  "The following is OCR text. Correct any misreads and output only a table of corrections.
   No explanations. Columns: [original | corrected] only."

Corrections with thinking OFF:

Original | Corrected
送付雪 | 送付管
5 3.1 | 5.3.1
投入ロ | 投入口
投入優 | 投入は
5.3./心 | 5.3.の
待成 | 待機
気送子送子管 | 気送子送付管
受けー方 | 受け入れ方
たて二っ折り | たて二つ折り

Felt much faster. With the thought logs gone, output is cleaner too.

Correction Results Differ Between Thinking ON and OFF

The same OCR text produced different corrections depending on thinking mode:

Misread | Thinking ON | Thinking OFF
送付雪 | 送付箱 | 送付管
投入優 | 投入時 | 投入は
5.3./心 | 5.3.1 の | 5.3.の
受けー方 | 受け口 | 受け入れ方

Thinking ON seems more accurate when checking against context thoroughly — “送付雪→送付管” (matching the section title) and “投入優→投入時” (action timing) show better reasoning. Thinking OFF is faster but shallower on context.

For a task like OCR correction where “inferring correct text from context-dependent typos has no single right answer,” thinking mode has a direct impact on accuracy. Whether to prioritize speed or accuracy depends on the use case, but for correction purposes the pragmatic approach might be to keep thinking ON and strip the thinking section from the output afterward.
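Stripping the thinking section afterward is straightforward if the model wraps it in tags. A sketch assuming the Qwen-style `<think>...</think>` convention; check your model's actual output format before relying on it:

```python
import re

# DOTALL so the pattern spans the multi-line reasoning chain;
# the trailing \s* also eats the blank line after the closing tag
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_thinking(text: str) -> str:
    """Drop the model's reasoning block, keeping only the final answer."""
    return THINK_RE.sub("", text).strip()

raw = "<think>checking section numbers...</think>\n送付雪 → 送付管"
print(strip_thinking(raw))  # 送付雪 → 送付管
```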

Changing the Prompt: Full Corrected Text Output

The table format itself might be constraining accuracy. Let's try asking for just the corrected text:

cat output/digidepo_2531162_0024.txt | ollama run qwen3.5:35b --think=false \
  "The following is OCR'd text. Output only the corrected text.
   No explanations."

Diff of corrections vs. original:

Location | OCR original | Corrected | Verdict
送付雪 | 送付雪 | 送付管 | OK
.5 3.1 | .5 3.1 | 5.3.1 | OK
投入ロ | 投入ロ | 投入口 | OK
投入優 | 投入優 | 投入時 | OK (was “投入は” in table format)
5.3./心 | 5.3./心 | 5.3. の | Borderline (“5.3.1 の” seems more accurate)
待成 | 待成 | 待機 | OK
気送子送子管 | 気送子送子管 | 気送子送付管 | OK
一方交通 | 一方交通 | Unchanged | Missed
受けー方 | 受けー方 | 受け入れ方 | Borderline
C | C | c | OK
二っ折り | 二っ折り | 二つ折り | OK
(z) | (z) | Unchanged | Missed

Full text output correctly got “投入優→投入時” where the table format had “投入は” — the output format affects correction accuracy.

On the other hand, “一方交通→一方通行” and “(z)” are missed. Without thinking mode the reading is shallower, so obvious character-shape errors (投入ロ→投入口) get caught but vocabulary-level errors (一方交通) and symbol inconsistencies ((z)→(a)) slip through.

Testing a Japanese-Focused Model: Qwen3 Swallow

Qwen3 Swallow is a Japanese-focused model jointly developed by Tokyo Tech (Okazaki Lab, Yokota Lab) and AIST. It’s Qwen3 with additional Japanese pre-training and RL applied, claiming state-of-the-art among open LLMs of comparable size at 8B and 32B.

No official GGUF is distributed, but community conversions exist on HuggingFace. I used the 30B-A3B (MoE, effectively 3B active parameters) Q4_K_M (18.6GB):

ollama pull hf.co/yuseiito/Qwen3-Swallow-30B-A3B-RL-v0.2-GGUF:Q4_K_M

Same prompt for the correction test:

cat output/digidepo_2531162_0024.txt | ollama run \
  hf.co/yuseiito/Qwen3-Swallow-30B-A3B-RL-v0.2-GGUF:Q4_K_M --think=false \
  "The following is OCR'd text. Output only the corrected text.
   No explanations."

Despite specifying --think=false, thinking content appeared in the output. Putting /no_think in the prompt also didn’t work. The OCR text itself was echoed into the output too.

The official ollama library version of Qwen3.5 controls thinking correctly, so it’s not a MoE issue — the thinking control templates and tokens likely got dropped during the community GGUF conversion. When using models with no official GGUF, control features like this may not work.

Correction comparison:

Location | Qwen3.5 35B | Swallow 30B-A3B
送付雪 | 送付管 | 装置
投入ロ | 投入口 | 投入口
投入優 | 投入時 | 投入後
5.3./心 | 5.3. の | 5.3. 心 (unchanged)
待成 | 待機 | 待機
気送子送子管 | 気送子送付管 | 気送子送管
一方交通 | Unchanged | 一方通行
受けー方 | 受け入れ方 | 受け側
二っ折り | 二つ折り | 二つ折り

Swallow caught “一方交通→一方通行”, which Qwen3.5 missed. “受けー方→受け側” also reads more naturally in Japanese. Its extra Japanese vocabulary capacity is making a difference.

On the other hand, “送付雪→装置” is Swallow’s own interpretation — “送付管” (Qwen3.5) matches the section title better. “気送子送子管→気送子送管” is also questionable; “気送子送付管” (Qwen3.5) fits the context better.

Both have their strengths and weaknesses — no single model produces perfect corrections. A practical workflow might be to diff both models’ outputs and have a human judge only the discrepancies.
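That diff-then-judge workflow can be sketched with difflib from the standard library; the `disagreements` helper is my own:

```python
import difflib

def disagreements(a: str, b: str) -> list[tuple[str, str]]:
    """Line up two models' corrected texts and return only the spans
    where they disagree, i.e. the spots that need a human decision."""
    a_lines, b_lines = a.splitlines(), b.splitlines()
    matcher = difflib.SequenceMatcher(a=a_lines, b=b_lines)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # replace / delete / insert all need review
            out.append(("\n".join(a_lines[i1:i2]),
                        "\n".join(b_lines[j1:j2])))
    return out

qwen = "気送子送付管\n一方交通"
swallow = "気送子送管\n一方通行"
for left, right in disagreements(qwen, swallow):
    print(f"Qwen: {left!r}  Swallow: {right!r}")
```

Identical lines vanish from the report, so the human only ever sees the handful of places the two models actually disagree.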

Feeding Images Directly to the LLM

Qwen3.5 is multimodal and can read images. Skipping OCR and feeding the original image directly to the LLM avoids OCR misreads at the source. This gives us another axis to compare against NDLOCR-Lite’s output.

CLI Can’t Read Images

Passing an image file as an argument to ollama run returned “I don’t have the ability to process images.” Changing the argument position didn’t help either:

# Both fail
ollama run qwen3.5:35b --think=false "Read the text in this image" resource/digidepo_2531162_0024.jpg
ollama run qwen3.5:35b --think=false resource/digidepo_2531162_0024.jpg "Read the text in this image"

ollama show qwen3.5:35b shows vision support, but it can’t read images via CLI.

API Works

Passing a base64-encoded image to the API’s images parameter works:

import json, urllib.request, base64

# Base64-encode the page image for the message's `images` field
img = base64.b64encode(open('resource/digidepo_2531162_0024.jpg', 'rb').read()).decode()
data = json.dumps({
    'model': 'qwen3.5:35b',
    'messages': [{
        'role': 'user',
        'content': 'Read all text in this image and output it. No explanations.',
        'images': [img]   # list of base64 images attached to this message
    }],
    'stream': False,      # one complete response instead of chunks
    'think': False        # disable the reasoning chain, as with --think=false
}).encode()
req = urllib.request.Request(
    'http://localhost:11434/api/chat',
    data=data,
    headers={'Content-Type': 'application/json'}
)
# Vision inference on a full page is slow, so allow a generous timeout
resp = json.loads(urllib.request.urlopen(req, timeout=300).read())
print(resp['message']['content'])

NDLOCR-Lite vs. Qwen3.5 Direct Image Reading

Same image (1963 staff manual) comparison:

Location | NDLOCR-Lite | Qwen3.5 direct image
(z) / (2) | (z) | (2)
送付雪 / 送付管 | 送付雪 | 送付管
投入ロ / 投入口 | 投入ロ | 投入口
投入優 / 投入後 | 投入優 | 投入後
待成 / 待機 | 待成 | 待機
気送子送子管 | 気送子送子管 | 気送子送子管
一方交通 | 一方交通 | 一方交通
受けー方 / 受け一方 | 受けー方 | 受け一方

Qwen3.5 correctly reads “(z)→(2)”, “送付雪→送付管”, “投入ロ→投入口”, “投入優→投入後”, and “待成→待機” from the image where NDLOCR-Lite misread them. “(z)→(2)” in particular is one that no text-based correction LLM ever got right; it took seeing the image.

“一方交通” and “気送子送子管” remain misread even by Qwen3.5’s direct image reading. For visually similar character confusion (ロ→口), LLMs outperform OCR, but vocabulary-level judgments (一方交通→一方通行) are stronger with text correction like Swallow.

Feeding Both Image and OCR Text Together

Would passing both the image and OCR text for “compare against the image and correct” yield better accuracy?

prompt = f'''The following is text OCR'd from this image.
Compare with the image and output only the corrected text. No explanations.

{ocr_text}'''

# Same as above with base64 image in the messages

Three-way comparison:

Location | Image only | Text only | Image + OCR text
(z) | (2) | (z) | (z)
送付雪 | 送付管 | 送付管 | 送付管
投入ロ | 投入口 | 投入口 | 投入口
投入優 | 投入後 | 投入時 | 投入後
5.3./心 | 5,3,1 | 5.3. の | 5、3、1
待成 | 待機 | 待機 | 待機
気送子送子管 | 気送子送子管 | 気送子送付管 | 気送子送子管
一方交通 | 一方交通 | 一方交通 | 一方交通
受けー方 | 受け一方 | 受け入れ方 | 受け一方

Something interesting happened. When image-only correctly read “(z)→(2)”, passing the OCR text alongside it reverted back to “(z)” — unchanged. The LLM was anchored by the OCR text, apparently reading the image and concluding “yes, it says (z).”

Showing the LLM an “answer key” upfront causes it to anchor there. Reading the image alone, without preconceptions, can actually produce better results. LLMs exhibit the same anchoring effect humans do when checking someone else's work after already seeing the proposed answers.

Ground Truth: What a Human Saw

At this point I actually looked at the original image to document what a human eye sees:

Location | NDLOCR-Lite | Qwen3.5 direct | Swallow | Human
(z) / (2) | (z) | (2) | — | (2)
送付雪 / 送付管 | 送付雪 | 送付管 | 装置 | 送付管
投入ロ / 投入口 | 投入ロ | 投入口 | 投入口 | 投入口
投入優 | 投入優 | 投入後 | 投入後 | 投入後
5 3.1 / 5,3,1 | .5 3.1 | 5,3,1 | — | 5、3、1
待成 / 待機 | 待成 | 待機 | 待機 | 待機
気送子送子管 | 気送子送子管 | 気送子送子管 | 気送子送管 | 気送子送子管
一方交通 | 一方交通 | 一方交通 | 一方通行 | 一方交通
受けー方 / 受け一方 | 受けー方 | 受け一方 | 受け側 | 受け一方

“5、3、1” was a surprise. From the text alone it looks like section number “5.3.1”, but Figure 2 shows a diagram with three tubes labeled “5 floors”, “3 floors”, “1 floor” — tubes dropping to the 5th, 3rd, and 1st floor stations. “5、3、1の各層ステーション” (stations at floors 5, 3, and 1) is correct. Qwen3.5’s direct image reading produced “5,3,1” because it was actually looking at the diagram — while the text-correction LLM’s “5.3.1” is actually wrong. Information visible in the diagram can’t be recovered from text alone.

“一方交通” and “受け一方” appear in the original document as-is. Swallow’s corrections to “一方通行” and “受け側” are modernizations of 1963 vocabulary, not OCR error corrections — literally rewriting the original.

LLMs can’t distinguish between “correcting an OCR misread” and “updating archaic modern language to contemporary phrasing.” In 60-year-old documents, the vocabulary and expressions are of that era, and correcting them with a modern Japanese sensibility rewrites the original. Even with explicit prompting to “respect original vocabulary and only fix obvious misreads,” this can’t be fully prevented.

Best Combination

Based on these results, maximizing accuracy requires a three-stage pipeline:

  1. NDLOCR-Lite for OCR (fast, includes coordinate data)
  2. Qwen3.5 direct image reading for cross-referencing (catches what OCR misses)
  3. Text-based LLM (Swallow, etc.) for vocabulary-level corrections

Image + OCR text combined input looks appealing but risks the LLM anchoring on OCR misreads. Processing each independently then cross-referencing produces better accuracy overall.
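The independent-then-cross-reference step could look something like this sketch. It assumes the three stages' outputs have already been aligned token-by-token, which a real pipeline would need an alignment step for; the `cross_reference` helper is my own:

```python
def cross_reference(ocr: list[str],
                    vision: list[str],
                    textfix: list[str]) -> list[dict]:
    """Compare per-token readings from the three pipeline stages.

    Positions where all three agree pass through untouched; any
    disagreement is flagged so a human only reviews the diffs.
    """
    report = []
    for o, v, t in zip(ocr, vision, textfix):
        candidates = {o, v, t}
        report.append({
            "ocr": o,
            "agreed": candidates.pop() if len(candidates) == 1 else None,
            "needs_review": len({o, v, t}) > 1,
        })
    return report

rows = cross_reference(["投入ロ", "待成", "一方交通"],   # NDLOCR-Lite
                       ["投入口", "待機", "一方交通"],   # image reading
                       ["投入口", "待機", "一方通行"])   # text correction
for r in rows:
    print(r)
```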

That said, stage 3 text correction risks “rewriting historically accurate original text into modern Japanese” for older documents. Don’t take LLM correction output at face value — a final human check against the original image is unavoidable. The practical division of labor: LLM narrows down the diffs, human confirms them.