
OCR Correction on Showa-Era Documents with NDLOCR-Lite and Local LLMs

In the previous article I ran NDLOCR-Lite on Windows 11. This time I’m doing the same on an Apple Silicon Mac (M1 Max) and taking it further by passing the OCR results to a local LLM for correction.

Repository: ndl-lab/ndlocr-lite

Test Environment

Item   | Spec
OS     | macOS Tahoe 26.2
Chip   | Apple M1 Max
RAM    | 64GB
Python | 3.13.11 (Homebrew / miniconda)

CLI Setup

There’s a GUI version too, but I’m going with the CLI for batch processing and scripting:

cd ~/projects
git clone https://github.com/ndl-lab/ndlocr-lite.git
cd ndlocr-lite
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Apple Silicon wheels for onnxruntime are distributed on PyPI, so there’s no build trouble. All dependencies went in without issues.

Verifying with Sample Images

The repository includes 3 sample images in resource/. Create the output directory and run:

mkdir -p output
cd src
python ocr.py --sourceimg ../resource/digidepo_2531162_0024.jpg --output ../output --viz True
[INFO] Intialize Model
[INFO] Inference Image
44
[INFO] Saving result on /Users/.../output/viz_digidepo_2531162_0024.jpg
Total calculation time (Detection + Recognition): 2.3908920288085938

44 regions detected, processed in 2.39 seconds. TXT / JSON / XML / visualization image all generated.

Windows (Ryzen 7 5800HS) took 3.30 seconds on the same image, so the M1 Max is roughly 30% faster. Getting that margin from CPU-only inference, with no GPU involved, says a lot about Apple Silicon's memory bandwidth.
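Since the CLI was chosen precisely for batch processing, a driver script is the natural next step. A minimal sketch in Python, assuming the same ocr.py flags shown above (the `ocr_command` and `run_batch` helper names are mine, not part of NDLOCR-Lite):

```python
import subprocess
from pathlib import Path

def ocr_command(image: Path, out_dir: Path) -> list[str]:
    """Build the ocr.py invocation using the flags shown above."""
    return ["python", "ocr.py",
            "--sourceimg", str(image),
            "--output", str(out_dir),
            "--viz", "True"]

def run_batch(src_dir: Path, out_dir: Path) -> None:
    """OCR every JPEG in src_dir, writing results into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for img in sorted(src_dir.glob("*.jpg")):
        subprocess.run(ocr_command(img, out_dir), check=True)

# e.g. from inside src/:
# run_batch(Path("../resource"), Path("../output"))
```

`check=True` aborts the batch on the first failing page, which is usually what you want when a bad image would otherwise silently produce empty output.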

OCR Results

A 1963 National Diet Library staff manual:

(z)気送子送付管
気送子送付には、上記気送管にて送付するものと、空
気の圧縮を使用せず,直接落下させる装置の二通りがあ
る。後者の送付雪は出納台左側に設置されており.5
3.1の各層ステーションに直接落下するよう3本の管
が通じ投入ロのフタに層表示が記されている。

Same results as the Windows version. “管” misread as “雪”, “(ヱ)” as “(z)” — same errors appear. No Mac-specific issues; confirmed identical model output.

Correcting OCR Results with a Local LLM

Manually fixing OCR errors is tedious, and pattern matching can’t handle old kanji or context-dependent substitutions. The idea is to let an LLM reason about the context — “given what the document is about, this is probably a misread.”

I chose Qwen 3.5 (35B Dense, 24GB) for correction. M1 Max 64GB handles it easily. For OCR correction on older documents you need parameter count over speed — reasoning about historical character forms requires more capacity.

The ollama Version Problem

Hit a snag here. The Homebrew version of ollama was 0.17.0, but Qwen 3.5 was just released on 2/25, and the stable version doesn’t support it yet.

Error: pull model manifest: 412:
The model you are attempting to pull requires a newer version of Ollama
that may be in pre-release.

brew upgrade ollama left it at 0.17.0. The official install script (curl -fsSL https://ollama.com/install.sh | sh) also installs the same version.

Checking GitHub Releases, the pre-release v0.17.1-rc2 (published 2/24) was needed.

# Install the pre-release version
curl -L https://github.com/ollama/ollama/releases/download/v0.17.1-rc2/ollama-darwin -o /usr/local/bin/ollama
chmod +x /usr/local/bin/ollama

In practice the asset was named ollama-darwin.tgz not ollama-darwin, so I used:

curl -L -o /tmp/ollama-darwin.tgz https://github.com/ollama/ollama/releases/download/v0.17.1-rc2/ollama-darwin.tgz
tar xzf /tmp/ollama-darwin.tgz -C /tmp
sudo mv /tmp/ollama /usr/local/bin/ollama
sudo chmod +x /usr/local/bin/ollama

Also, if Homebrew’s ollama is running as a service it holds port 11434, blocking ollama serve with the new binary. Run brew services stop ollama first.
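Before launching the new binary it is worth confirming nothing is still holding the port. A small sketch; the `port_in_use` helper is my own, not part of ollama:

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already answering on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        return s.connect_ex((host, port)) == 0

# If this prints True before you run the new binary,
# stop the old service first: brew services stop ollama
print(port_in_use(11434))
```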

Worth remembering: when using newly-released models, you may need to track ollama pre-releases.
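In scripted setups you can fail fast instead of hitting the 412 at pull time by asking the running server for its version first. A sketch: GET /api/version is a real ollama endpoint; the `parse_version` helper is my own:

```python
import json, re, urllib.request

def parse_version(v: str) -> tuple[int, ...]:
    """Turn '0.17.1-rc2' into a comparable tuple: (0, 17, 1)."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)", v)
    return tuple(int(x) for x in m.groups())

def server_version(base: str = "http://localhost:11434") -> str:
    """Ask the running ollama server what it is via GET /api/version."""
    with urllib.request.urlopen(f"{base}/api/version", timeout=5) as r:
        return json.loads(r.read())["version"]

# e.g. abort early if the server is older than the model requires:
# assert parse_version(server_version()) >= parse_version("0.17.1")
```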

Correction Test (Thinking Mode On)

Fed the OCR results into ollama run:

cat output/digidepo_2531162_0024.txt | ollama run qwen3.5:35b \
  "The following is text OCR'd from a 1963 National Diet Library staff manual.
   Infer OCR misreads from context and suggest corrections.
   Show corrections in the format [original → corrected]."

Correction results:

Original | Corrected | Reason
(z) | (a) | Consistency with following symbols “b” and “c”
送付雪 | 送付箱 | Equipment/container name
5 3.1 | 5.3.1 | Section number
通じ | 通じて | Particle addition
投入ロ | 投入口 | Katakana “ロ” → kanji “口”
投入優 | 投入時 | Timing of the action
5.3./心 | 5.3.1 の | Section number misread
待成 | 待機 | Staff action
気送子送子管 | 気送子送付管 | Consistency with section title
一方交通 | 一方通行 | Corridor description
受けー方 | 受け口 | Opening as facility feature

Correction accuracy is decent. “送付雪→送付箱”, “投入ロ→投入口”, “待成→待機” — those are all correctly inferred from context.

One problem though: Qwen 3.5 defaults to thinking mode (outputs its reasoning chain), and for ~4KB of OCR text it produced 144KB of thought logs. Long wait to get to the actual corrections. Putting /no_think in the prompt didn’t work.

Turning Off Thinking Mode

To disable thinking in Qwen 3.5 via ollama, use the --think=false flag with ollama run:

cat output/digidepo_2531162_0024.txt | ollama run qwen3.5:35b --think=false \
  "The following is OCR text. Correct any misreads and output only a table of corrections.
   No explanations. Columns: [original | corrected] only."

Corrections with thinking OFF:

Original | Corrected
送付雪 | 送付管
5 3.1 | 5.3.1
投入ロ | 投入口
投入優 | 投入は
5.3./心 | 5.3.の
待成 | 待機
気送子送子管 | 気送子送付管
受けー方 | 受け入れ方
たて二っ折り | たて二つ折り

Felt much faster. With the thought logs gone, output is cleaner too.

Correction Results Differ Between Thinking ON and OFF

The same OCR text produced different corrections depending on thinking mode:

Misread | Thinking ON | Thinking OFF
送付雪 | 送付箱 | 送付管
投入優 | 投入時 | 投入は
5.3./心 | 5.3.1 の | 5.3.の
受けー方 | 受け口 | 受け入れ方

Thinking ON seems more accurate when checking against context thoroughly — “送付雪→送付管” (matching the section title) and “投入優→投入時” (action timing) show better reasoning. Thinking OFF is faster but shallower on context.

For a task like OCR correction where “inferring correct text from context-dependent typos has no single right answer,” thinking mode has a direct impact on accuracy. Whether to prioritize speed or accuracy depends on the use case, but for correction purposes the pragmatic approach might be to keep thinking ON and strip the thinking section from the output afterward.
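Stripping the thinking section afterward is straightforward if the model wraps it in tags. A sketch assuming the Qwen-style `<think>...</think>` convention; check your model's actual output format before relying on it:

```python
import re

# DOTALL so the pattern spans the multi-line reasoning chain;
# the trailing \s* also eats the blank line after the closing tag
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_thinking(text: str) -> str:
    """Drop the model's reasoning block, keeping only the final answer."""
    return THINK_RE.sub("", text).strip()

raw = "<think>checking section numbers...</think>\n送付雪 → 送付管"
print(strip_thinking(raw))  # 送付雪 → 送付管
```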

Changing the Prompt: Full Corrected Text Output

The table format itself might be constraining accuracy. Let's try asking for just the corrected text:

cat output/digidepo_2531162_0024.txt | ollama run qwen3.5:35b --think=false \
  "The following is OCR'd text. Output only the corrected text.
   No explanations."

Diff of corrections vs. original:

Location | OCR original | Corrected | Verdict
送付雪 | 送付雪 | 送付管 | OK
.5 3.1 | .5 3.1 | 5.3.1 | OK
投入ロ | 投入ロ | 投入口 | OK
投入優 | 投入優 | 投入時 | OK (was “投入は” in table format)
5.3./心 | 5.3./心 | 5.3. の | Borderline (“5.3.1 の” seems more accurate)
待成 | 待成 | 待機 | OK
気送子送子管 | 気送子送子管 | 気送子送付管 | OK
一方交通 | 一方交通 | Unchanged | Missed
受けー方 | 受けー方 | 受け入れ方 | Borderline
C | C | c | OK
二っ折り | 二っ折り | 二つ折り | OK
(z) | (z) | Unchanged | Missed

Full text output correctly got “投入優→投入時” where the table format had “投入は” — the output format affects correction accuracy.

On the other hand, “一方交通→一方通行” and “(z)” are missed. Without thinking mode the reading is shallower, so obvious character-shape errors (投入ロ→投入口) get caught but vocabulary-level errors (一方交通) and symbol inconsistencies ((z)→(a)) slip through.

Testing a Japanese-Focused Model: Qwen3 Swallow

Qwen3 Swallow is a Japanese-focused model jointly developed by Tokyo Tech (Okazaki Lab, Yokota Lab) and AIST. It’s Qwen3 with additional Japanese pre-training and RL applied, claiming state-of-the-art among open LLMs of comparable size at 8B and 32B.

No official GGUF is distributed, but community conversions exist on HuggingFace. I used the 30B-A3B (MoE, effectively 3B active parameters) Q4_K_M (18.6GB):

ollama pull hf.co/yuseiito/Qwen3-Swallow-30B-A3B-RL-v0.2-GGUF:Q4_K_M

Same prompt for the correction test:

cat output/digidepo_2531162_0024.txt | ollama run \
  hf.co/yuseiito/Qwen3-Swallow-30B-A3B-RL-v0.2-GGUF:Q4_K_M --think=false \
  "The following is OCR'd text. Output only the corrected text.
   No explanations."

Despite specifying --think=false, thinking content appeared in the output. Putting /no_think in the prompt also didn’t work. The OCR text itself was echoed into the output too.

The official ollama library version of Qwen3.5 controls thinking correctly, so it’s not a MoE issue — the thinking control templates and tokens likely got dropped during the community GGUF conversion. When using models with no official GGUF, control features like this may not work.

Correction comparison:

Location | Qwen3.5 35B | Swallow 30B-A3B
送付雪 | 送付管 | 装置
投入ロ | 投入口 | 投入口
投入優 | 投入時 | 投入後
5.3./心 | 5.3. の | 5.3. 心 (unchanged)
待成 | 待機 | 待機
気送子送子管 | 気送子送付管 | 気送子送管
一方交通 | Unchanged | 一方通行
受けー方 | 受け入れ方 | 受け側
二っ折り | 二つ折り | 二つ折り

Swallow caught “一方交通→一方通行”, which Qwen3.5 missed. “受けー方→受け側” also reads more naturally in Japanese. Its extra Japanese vocabulary capacity is making a difference.

On the other hand, “送付雪→装置” is Swallow’s own interpretation — “送付管” (Qwen3.5) matches the section title better. “気送子送子管→気送子送管” is also questionable; “気送子送付管” (Qwen3.5) fits the context better.

Both have their strengths and weaknesses — no single model produces perfect corrections. A practical workflow might be to diff both models’ outputs and have a human judge only the discrepancies.
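That diff-then-judge workflow can be sketched with difflib from the standard library; the `disagreements` helper is my own:

```python
import difflib

def disagreements(a: str, b: str) -> list[tuple[str, str]]:
    """Line up two models' corrected texts and return only the spans
    where they disagree, i.e. the spots that need a human decision."""
    a_lines, b_lines = a.splitlines(), b.splitlines()
    matcher = difflib.SequenceMatcher(a=a_lines, b=b_lines)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # replace / delete / insert all need review
            out.append(("\n".join(a_lines[i1:i2]),
                        "\n".join(b_lines[j1:j2])))
    return out

qwen = "気送子送付管\n一方交通"
swallow = "気送子送管\n一方通行"
for left, right in disagreements(qwen, swallow):
    print(f"Qwen: {left!r}  Swallow: {right!r}")
```

Identical lines vanish from the report, so the human only ever sees the handful of places the two models actually disagree.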

Feeding Images Directly to the LLM

Qwen3.5 is multimodal and can read images. Skipping OCR and feeding the original image directly to the LLM avoids OCR misreads at the source. This gives us another axis to compare against NDLOCR-Lite’s output.

CLI Can’t Read Images

Passing an image file as an argument to ollama run returned “I don’t have the ability to process images.” Changing the argument position didn’t help either:

# Both fail
ollama run qwen3.5:35b --think=false "Read the text in this image" resource/digidepo_2531162_0024.jpg
ollama run qwen3.5:35b --think=false resource/digidepo_2531162_0024.jpg "Read the text in this image"

ollama show qwen3.5:35b shows vision support, but it can’t read images via CLI.

API Works

Passing a base64-encoded image to the API’s images parameter works:

import json, urllib.request, base64

# Base64-encode the page image for the message's `images` field
img = base64.b64encode(open('resource/digidepo_2531162_0024.jpg', 'rb').read()).decode()
data = json.dumps({
    'model': 'qwen3.5:35b',
    'messages': [{
        'role': 'user',
        'content': 'Read all text in this image and output it. No explanations.',
        'images': [img]   # list of base64 images attached to this message
    }],
    'stream': False,      # one complete response instead of chunks
    'think': False        # disable the reasoning chain, as with --think=false
}).encode()
req = urllib.request.Request(
    'http://localhost:11434/api/chat',
    data=data,
    headers={'Content-Type': 'application/json'}
)
# Vision inference on a full page is slow, so allow a generous timeout
resp = json.loads(urllib.request.urlopen(req, timeout=300).read())
print(resp['message']['content'])

NDLOCR-Lite vs. Qwen3.5 Direct Image Reading

Same image (1963 staff manual) comparison:

Location | NDLOCR-Lite | Qwen3.5 direct image
(z) / (2) | (z) | (2)
送付雪 / 送付管 | 送付雪 | 送付管
投入ロ / 投入口 | 投入ロ | 投入口
投入優 / 投入後 | 投入優 | 投入後
待成 / 待機 | 待成 | 待機
気送子送子管 | 気送子送子管 | 気送子送子管
一方交通 | 一方交通 | 一方交通
受けー方 / 受け一方 | 受けー方 | 受け一方

Qwen3.5 correctly reads “(z)→(2)”, “送付雪→送付管”, “投入ロ→投入口”, “投入優→投入後”, and “待成→待機” from the image where NDLOCR-Lite misread them. “(z)→(2)” in particular is one that no text-based correction LLM ever got right; it took seeing the image.

“一方交通” and “気送子送子管” remain misread even by Qwen3.5’s direct image reading. For visually similar character confusion (ロ→口), LLMs outperform OCR, but vocabulary-level judgments (一方交通→一方通行) are stronger with text correction like Swallow.

Feeding Both Image and OCR Text Together

Would passing both the image and OCR text for “compare against the image and correct” yield better accuracy?

prompt = f'''The following is text OCR'd from this image.
Compare with the image and output only the corrected text. No explanations.

{ocr_text}'''

# Same as above with base64 image in the messages

Three-way comparison:

Location | Image only | Text only | Image + OCR text
(z) | (2) | (z) | (z)
送付雪 | 送付管 | 送付管 | 送付管
投入ロ | 投入口 | 投入口 | 投入口
投入優 | 投入後 | 投入時 | 投入後
5.3./心 | 5,3,1 | 5.3. の | 5、3、1
待成 | 待機 | 待機 | 待機
気送子送子管 | 気送子送子管 | 気送子送付管 | 気送子送子管
一方交通 | 一方交通 | 一方交通 | 一方交通
受けー方 | 受け一方 | 受け入れ方 | 受け一方

Something interesting happened. When image-only correctly read “(z)→(2)”, passing the OCR text alongside it reverted back to “(z)” — unchanged. The LLM was anchored by the OCR text, apparently reading the image and concluding “yes, it says (z).”

Showing the LLM an “answer key” upfront causes it to anchor there. Reading the image alone, without preconceptions, can actually produce better results. LLMs exhibit the same anchoring effect humans do when checking someone else's work after already seeing the proposed answers.

Ground Truth: What a Human Saw

At this point I actually looked at the original image to document what a human eye sees:

Location | NDLOCR-Lite | Qwen3.5 direct | Swallow | Human
(z) / (2) | (z) | (2) | — | (2)
送付雪 / 送付管 | 送付雪 | 送付管 | 装置 | 送付管
投入ロ / 投入口 | 投入ロ | 投入口 | 投入口 | 投入口
投入優 | 投入優 | 投入後 | 投入後 | 投入後
5 3.1 / 5,3,1 | .5 3.1 | 5,3,1 | — | 5、3、1
待成 / 待機 | 待成 | 待機 | 待機 | 待機
気送子送子管 | 気送子送子管 | 気送子送子管 | 気送子送管 | 気送子送子管
一方交通 | 一方交通 | 一方交通 | 一方通行 | 一方交通
受けー方 / 受け一方 | 受けー方 | 受け一方 | 受け側 | 受け一方

“5、3、1” was a surprise. From the text alone it looks like section number “5.3.1”, but Figure 2 shows a diagram with three tubes labeled “5 floors”, “3 floors”, “1 floor” — tubes dropping to the 5th, 3rd, and 1st floor stations. “5、3、1の各層ステーション” (stations at floors 5, 3, and 1) is correct. Qwen3.5’s direct image reading produced “5,3,1” because it was actually looking at the diagram — while the text-correction LLM’s “5.3.1” is actually wrong. Information visible in the diagram can’t be recovered from text alone.

“一方交通” and “受け一方” appear in the original document as-is. Swallow’s corrections to “一方通行” and “受け側” are modernizations of 1963 vocabulary, not OCR error corrections — literally rewriting the original.

LLMs can’t distinguish between “correcting an OCR misread” and “updating archaic modern language to contemporary phrasing.” In 60-year-old documents, the vocabulary and expressions are of that era, and correcting them with a modern Japanese sensibility rewrites the original. Even with explicit prompting to “respect original vocabulary and only fix obvious misreads,” this can’t be fully prevented.

Best Combination

Based on these results, maximizing accuracy requires a three-stage pipeline:

  1. NDLOCR-Lite for OCR (fast, includes coordinate data)
  2. Qwen3.5 direct image reading for cross-referencing (catches what OCR misses)
  3. Text-based LLM (Swallow, etc.) for vocabulary-level corrections

Image + OCR text combined input looks appealing but risks the LLM anchoring on OCR misreads. Processing each independently then cross-referencing produces better accuracy overall.
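The independent-then-cross-reference step could look something like this sketch. It assumes the three stages' outputs have already been aligned token-by-token, which a real pipeline would need an alignment step for; the `cross_reference` helper is my own:

```python
def cross_reference(ocr: list[str],
                    vision: list[str],
                    textfix: list[str]) -> list[dict]:
    """Compare per-token readings from the three pipeline stages.

    Positions where all three agree pass through untouched; any
    disagreement is flagged so a human only reviews the diffs.
    """
    report = []
    for o, v, t in zip(ocr, vision, textfix):
        candidates = {o, v, t}
        report.append({
            "ocr": o,
            "agreed": candidates.pop() if len(candidates) == 1 else None,
            "needs_review": len({o, v, t}) > 1,
        })
    return report

rows = cross_reference(["投入ロ", "待成", "一方交通"],   # NDLOCR-Lite
                       ["投入口", "待機", "一方交通"],   # image reading
                       ["投入口", "待機", "一方通行"])   # text correction
for r in rows:
    print(r)
```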

That said, stage 3 text correction risks “rewriting historically accurate original text into modern Japanese” for older documents. Don’t take LLM correction output at face value — a final human check against the original image is unavoidable. The practical division of labor: LLM narrows down the diffs, human confirms them.