OCR Correction on Showa-Era Documents with NDLOCR-Lite and Local LLMs
In the previous article I ran NDLOCR-Lite on Windows 11. This time I’m doing the same on an Apple Silicon Mac (M1 Max) and taking it further by passing the OCR results to a local LLM for correction.
Repository: ndl-lab/ndlocr-lite
Test Environment
| Item | Spec |
|---|---|
| OS | macOS Tahoe 26.2 |
| Chip | Apple M1 Max |
| RAM | 64GB |
| Python | 3.13.11 (Homebrew / miniconda) |
CLI Setup
There’s a GUI version too, but I’m going with the CLI for batch processing and scripting:
cd ~/projects
git clone https://github.com/ndl-lab/ndlocr-lite.git
cd ndlocr-lite
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Apple Silicon wheels for onnxruntime are distributed on PyPI, so there’s no build trouble. All dependencies went in without issues.
Verifying with Sample Images
The repository includes 3 sample images in resource/. Create the output directory and run:
mkdir -p output
cd src
python ocr.py --sourceimg ../resource/digidepo_2531162_0024.jpg --output ../output --viz True
[INFO] Intialize Model
[INFO] Inference Image
44
[INFO] Saving result on /Users/.../output/viz_digidepo_2531162_0024.jpg
Total calculation time (Detection + Recognition): 2.3908920288085938
44 regions detected, processed in 2.39 seconds. TXT / JSON / XML / visualization image all generated.
Windows (Ryzen 7 5800HS) took 3.30 seconds on the same image, so the M1 Max is roughly 30% faster. Getting that kind of gap from CPU-only inference with no GPU involved likely comes down to Apple Silicon's memory bandwidth.
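Since batch processing was the point of choosing the CLI, the single-image invocation above generalizes to a small driver script. A minimal sketch, assuming a hypothetical `scans/` directory of JPEGs and the same `ocr.py` flags used above:

```python
import glob
import pathlib
import subprocess

def batch_ocr_commands(src_dir: str, out_dir: str) -> list[list[str]]:
    """Build one ocr.py invocation per image in src_dir."""
    pathlib.Path(out_dir).mkdir(parents=True, exist_ok=True)
    return [
        ["python", "ocr.py", "--sourceimg", img,
         "--output", out_dir, "--viz", "False"]
        for img in sorted(glob.glob(f"{src_dir}/*.jpg"))
    ]

# Run from the src/ directory, as in the single-image example:
# for cmd in batch_ocr_commands("../scans", "../output"):
#     subprocess.run(cmd, check=True)
```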
OCR Results
A 1963 National Diet Library staff manual:
(z)気送子送付管
気送子送付には、上記気送管にて送付するものと、空
気の圧縮を使用せず,直接落下させる装置の二通りがあ
る。後者の送付雪は出納台左側に設置されており.5
3.1の各層ステーションに直接落下するよう3本の管
が通じ投入ロのフタに層表示が記されている。
Same results as the Windows version: “管” misread as “雪”, “(2)” as “(z)”, the same errors as before. No Mac-specific issues; the model output is confirmed identical.
Correcting OCR Results with a Local LLM
Manually fixing OCR errors is tedious, and pattern matching can’t handle old kanji or context-dependent substitutions. The idea is to let an LLM reason about the context — “given what the document is about, this is probably a misread.”
I chose Qwen 3.5 (35B Dense, 24GB) for correction. M1 Max 64GB handles it easily. For OCR correction on older documents you need parameter count over speed — reasoning about historical character forms requires more capacity.
The ollama Version Problem
Hit a snag here. The Homebrew version of ollama was 0.17.0, but Qwen 3.5 was just released on 2/25, and the stable version doesn’t support it yet.
Error: pull model manifest: 412:
The model you are attempting to pull requires a newer version of Ollama
that may be in pre-release.
brew upgrade ollama left it at 0.17.0. The official install script (curl -fsSL https://ollama.com/install.sh | sh) also installs the same version.
Checking GitHub Releases, the pre-release v0.17.1-rc2 (published 2/24) was needed.
# Install the pre-release version
curl -L https://github.com/ollama/ollama/releases/download/v0.17.1-rc2/ollama-darwin -o /usr/local/bin/ollama
chmod +x /usr/local/bin/ollama
In practice the asset was named ollama-darwin.tgz not ollama-darwin, so I used:
curl -L -o /tmp/ollama-darwin.tgz https://github.com/ollama/ollama/releases/download/v0.17.1-rc2/ollama-darwin.tgz
tar xzf /tmp/ollama-darwin.tgz -C /tmp
sudo mv /tmp/ollama /usr/local/bin/ollama
sudo chmod +x /usr/local/bin/ollama
Also, if Homebrew’s ollama is running as a service it holds port 11434, blocking ollama serve with the new binary. Run brew services stop ollama first.
Worth remembering: when using newly-released models, you may need to track ollama pre-releases.
Correction Test (Thinking Mode On)
Fed the OCR results into ollama run:
cat output/digidepo_2531162_0024.txt | ollama run qwen3.5:35b \
"The following is text OCR'd from a 1963 National Diet Library staff manual.
Infer OCR misreads from context and suggest corrections.
Show corrections in the format [original → corrected]."
Correction results:
| Original | Corrected | Reason |
|---|---|---|
| (z) | (a) | Consistency with following symbols “b” and “c” |
| 送付雪 | 送付箱 | Equipment/container name |
| 5 3.1 | 5.3.1 | Section number |
| 通じ | 通じて | Particle addition |
| 投入ロ | 投入口 | Katakana “ロ” → Kanji “口” |
| 投入優 | 投入時 | Timing of the action |
| 5.3./心 | 5.3.1 の | Section number misread |
| 待成 | 待機 | Staff action |
| 気送子送子管 | 気送子送付管 | Consistency with section title |
| 一方交通 | 一方通行 | Corridor description |
| 受けー方 | 受け口 | Opening as facility feature |
Correction accuracy is decent. “送付雪→送付箱”, “投入ロ→投入口”, “待成→待機” — those are all correctly inferred from context.
One problem though: Qwen 3.5 defaults to thinking mode (outputs its reasoning chain), and for ~4KB of OCR text it produced 144KB of thought logs. Long wait to get to the actual corrections. Putting /no_think in the prompt didn’t work.
Turning Off Thinking Mode
To disable thinking in Qwen 3.5 via ollama, use the --think=false flag with ollama run:
cat output/digidepo_2531162_0024.txt | ollama run qwen3.5:35b --think=false \
"The following is OCR text. Correct any misreads and output only a table of corrections.
No explanations. Columns: [original | corrected] only."
Corrections with thinking OFF:
| Original | Corrected |
|---|---|
| 送付雪 | 送付管 |
| 5 3.1 | 5.3.1 |
| 投入ロ | 投入口 |
| 投入優 | 投入は |
| 5.3./心 | 5.3.の |
| 待成 | 待機 |
| 気送子送子管 | 気送子送付管 |
| 受けー方 | 受け入れ方 |
| たて二っ折り | たて二つ折り |
Felt much faster. With the thought logs gone, output is cleaner too.
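Once the model emits a bare [original | corrected] table, applying it back to the OCR text is scriptable. A minimal sketch; the parsing is naive (it assumes the model followed the two-column format and that each original string appears unambiguously in the text):

```python
def parse_correction_table(table: str) -> list[tuple[str, str]]:
    """Extract (original, corrected) pairs from two-column markdown rows."""
    pairs = []
    for line in table.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip separator rows like |---|---| and anything not two columns
        if len(cells) == 2 and cells[0] and not set(cells[0]) <= set("-: "):
            pairs.append((cells[0], cells[1]))
    return pairs

def apply_corrections(text: str, pairs: list[tuple[str, str]]) -> str:
    """Apply each correction with plain string replacement."""
    for orig, fixed in pairs:
        text = text.replace(orig, fixed)
    return text
```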
Correction Results Differ Between Thinking ON and OFF
The same OCR text produced different corrections depending on thinking mode:
| Misread | Thinking ON | Thinking OFF |
|---|---|---|
| 送付雪 | 送付箱 | 送付管 |
| 投入優 | 投入時 | 投入は |
| 5.3./心 | 5.3.1 の | 5.3.の |
| 受けー方 | 受け口 | 受け入れ方 |
Thinking ON seems more accurate when checking against context thoroughly — “送付雪→送付管” (matching the section title) and “投入優→投入時” (action timing) show better reasoning. Thinking OFF is faster but shallower on context.
For a task like OCR correction where “inferring correct text from context-dependent typos has no single right answer,” thinking mode has a direct impact on accuracy. Whether to prioritize speed or accuracy depends on the use case, but for correction purposes the pragmatic approach might be to keep thinking ON and strip the thinking section from the output afterward.
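Keeping thinking ON and stripping the reasoning afterward is easy to script. A minimal sketch, assuming the model wraps its reasoning in `<think>…</think>` tags (the Qwen-family convention; verify against your model's actual output):

```python
import re

def strip_thinking(output: str) -> str:
    """Remove <think>...</think> blocks, keeping only the final answer."""
    return re.sub(r"<think>.*?</think>", "", output, flags=re.DOTALL).strip()
```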
Changing the Prompt: Full Corrected Text Output
The table format itself might be constraining the corrections, so let's try asking for just the corrected text:
cat output/digidepo_2531162_0024.txt | ollama run qwen3.5:35b --think=false \
"The following is OCR'd text. Output only the corrected text.
No explanations."
Diff of corrections vs. original:
| Location | OCR original | Corrected | Verdict |
|---|---|---|---|
| 送付雪 | 送付雪 | 送付管 | OK |
| .5 3.1 | .5 3.1 | 5.3.1 | OK |
| 投入ロ | 投入ロ | 投入口 | OK |
| 投入優 | 投入優 | 投入時 | OK (was “投入は” in table format) |
| 5.3./心 | 5.3./心 | 5.3. の | Borderline (“5.3.1 の” seems more accurate) |
| 待成 | 待成 | 待機 | OK |
| 気送子送子管 | 気送子送子管 | 気送子送付管 | OK |
| 一方交通 | 一方交通 | Unchanged | Missed |
| 受けー方 | 受けー方 | 受け入れ方 | Borderline |
| C | C | c | OK |
| 二っ折り | 二っ折り | 二つ折り | OK |
| (z) | (z) | Unchanged | Missed |
Full text output correctly got “投入優→投入時” where the table format had “投入は” — the output format affects correction accuracy.
On the other hand, “一方交通→一方通行” and “(z)” are missed. Without thinking mode the reading is shallower, so obvious character-shape errors (投入ロ→投入口) get caught but vocabulary-level errors (一方交通) and symbol inconsistencies ((z)→(a)) slip through.
Testing a Japanese-Focused Model: Qwen3 Swallow
Qwen3 Swallow is a Japanese-focused model jointly developed by Tokyo Tech (Okazaki Lab, Yokota Lab) and AIST. It’s Qwen3 with additional Japanese pre-training and RL applied, claiming state-of-the-art among open LLMs of comparable size at 8B and 32B.
No official GGUF is distributed, but community conversions exist on HuggingFace. I used the 30B-A3B (MoE, effectively 3B active parameters) Q4_K_M (18.6GB):
ollama pull hf.co/yuseiito/Qwen3-Swallow-30B-A3B-RL-v0.2-GGUF:Q4_K_M
Same prompt for the correction test:
cat output/digidepo_2531162_0024.txt | ollama run \
hf.co/yuseiito/Qwen3-Swallow-30B-A3B-RL-v0.2-GGUF:Q4_K_M --think=false \
"The following is OCR'd text. Output only the corrected text.
No explanations."
Despite specifying --think=false, thinking content appeared in the output. Putting /no_think in the prompt also didn’t work. The OCR text itself was echoed into the output too.
The official ollama library version of Qwen3.5 controls thinking correctly, so it’s not a MoE issue — the thinking control templates and tokens likely got dropped during the community GGUF conversion. When using models with no official GGUF, control features like this may not work.
Correction comparison:
| Location | Qwen3.5 35B | Swallow 30B-A3B |
|---|---|---|
| 送付雪 | 送付管 | 装置 |
| 投入ロ | 投入口 | 投入口 |
| 投入優 | 投入時 | 投入後 |
| 5.3./心 | 5.3. の | 5.3. 心 (unchanged) |
| 待成 | 待機 | 待機 |
| 気送子送子管 | 気送子送付管 | 気送子送管 |
| 一方交通 | Unchanged | 一方通行 |
| 受けー方 | 受け入れ方 | 受け側 |
| 二っ折り | 二つ折り | 二つ折り |
Swallow caught “一方交通→一方通行”, which Qwen3.5 missed. “受けー方→受け側” also reads more naturally in Japanese. The extra Japanese vocabulary capacity is making a difference.
On the other hand, “送付雪→装置” is Swallow’s own interpretation — “送付管” (Qwen3.5) matches the section title better. “気送子送子管→気送子送管” is also questionable; “気送子送付管” (Qwen3.5) fits the context better.
Both have their strengths and weaknesses — no single model produces perfect corrections. A practical workflow might be to diff both models’ outputs and have a human judge only the discrepancies.
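The "diff both models' outputs and have a human judge only the discrepancies" workflow can lean on the standard library. A minimal sketch using `difflib` to surface only the spans where two corrected texts disagree:

```python
import difflib

def disagreements(text_a: str, text_b: str) -> list[tuple[str, str]]:
    """Return (model A span, model B span) pairs where the outputs differ."""
    sm = difflib.SequenceMatcher(None, text_a, text_b)
    return [
        (text_a[i1:i2], text_b[j1:j2])
        for tag, i1, i2, j1, j2 in sm.get_opcodes()
        if tag != "equal"
    ]

# Spans both models agree on need no review; only the returned pairs
# go to a human.
```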
Feeding Images Directly to the LLM
Qwen3.5 is multimodal and can read images. Skipping OCR and feeding the original image directly to the LLM avoids OCR misreads at the source. This gives us another axis to compare against NDLOCR-Lite’s output.
CLI Can’t Read Images
Passing an image file as an argument to ollama run returned “I don’t have the ability to process images.” Changing the argument position didn’t help either:
# Both fail
ollama run qwen3.5:35b --think=false "Read the text in this image" resource/digidepo_2531162_0024.jpg
ollama run qwen3.5:35b --think=false resource/digidepo_2531162_0024.jpg "Read the text in this image"
ollama show qwen3.5:35b shows vision support, but it can’t read images via CLI.
API Works
Passing a base64-encoded image to the API’s images parameter works:
import json, urllib.request, base64

# Base64-encode the page image for the Ollama chat API
with open('resource/digidepo_2531162_0024.jpg', 'rb') as f:
    img = base64.b64encode(f.read()).decode()

data = json.dumps({
    'model': 'qwen3.5:35b',
    'messages': [{
        'role': 'user',
        'content': 'Read all text in this image and output it. No explanations.',
        'images': [img]  # images go inside the message, base64-encoded
    }],
    'stream': False,
    'think': False  # same effect as --think=false on the CLI
}).encode()

req = urllib.request.Request(
    'http://localhost:11434/api/chat',
    data=data,
    headers={'Content-Type': 'application/json'}
)
resp = json.loads(urllib.request.urlopen(req, timeout=300).read())
print(resp['message']['content'])
NDLOCR-Lite vs. Qwen3.5 Direct Image Reading
Same image (1963 staff manual) comparison:
| Location | NDLOCR-Lite | Qwen3.5 direct image |
|---|---|---|
| (z) / (2) | (z) | (2) |
| 送付雪 / 送付管 | 送付雪 | 送付管 |
| 投入ロ / 投入口 | 投入ロ | 投入口 |
| 投入優 / 投入後 | 投入優 | 投入後 |
| 待成 / 待機 | 待成 | 待機 |
| 気送子送子管 | 気送子送子管 | 気送子送子管 |
| 一方交通 | 一方交通 | 一方交通 |
| 受けー方 / 受け一方 | 受けー方 | 受け一方 |
Qwen3.5 correctly reads “(z)→(2)”, “送付雪→送付管”, “投入ロ→投入口”, “投入優→投入後”, and “待成→待機” from the image where NDLOCR-Lite misread them. Notably, no text-based correction pass ever got “(z)→(2)” right; it took seeing the image.
“一方交通” and “気送子送子管” remain misread even by Qwen3.5’s direct image reading. For visually similar character confusion (ロ→口), LLMs outperform OCR, but vocabulary-level judgments (一方交通→一方通行) are where a text-focused model like Swallow is stronger.
Feeding Both Image and OCR Text Together
Would passing both the image and OCR text for “compare against the image and correct” yield better accuracy?
prompt = f'''The following is text OCR'd from this image.
Compare with the image and output only the corrected text. No explanations.
{ocr_text}'''
# Same as above with base64 image in the messages
Three-way comparison:
| Location | Image only | Text only | Image + OCR text |
|---|---|---|---|
| (z) | (2) | (z) | (z) |
| 送付雪 | 送付管 | 送付管 | 送付管 |
| 投入ロ | 投入口 | 投入口 | 投入口 |
| 投入優 | 投入後 | 投入時 | 投入後 |
| 5.3./心 | 5,3,1 | 5.3. の | 5、3、1 |
| 待成 | 待機 | 待機 | 待機 |
| 気送子送子管 | 気送子送子管 | 気送子送付管 | 気送子送子管 |
| 一方交通 | 一方交通 | 一方交通 | 一方交通 |
| 受けー方 | 受け一方 | 受け入れ方 | 受け一方 |
Something interesting happened: the image-only run correctly read “(2)”, but passing the OCR text alongside the image reverted it to “(z)”, unchanged. The LLM was anchored by the OCR text, apparently looking at the image and concluding “yes, it says (z).”
Showing the LLM an “answer key” upfront causes it to anchor there. Reading the image alone, without preconceptions, can actually produce better results. The same effect humans experience when checking someone else’s work after seeing their answers first also happens with LLMs.
Ground Truth: What a Human Saw
At this point I actually looked at the original image to document what a human eye sees:
| Location | NDLOCR-Lite | Qwen3.5 direct | Swallow | Human |
|---|---|---|---|---|
| (z) / (2) | (z) | (2) | — | (2) |
| 送付雪 / 送付管 | 送付雪 | 送付管 | 装置 | 送付管 |
| 投入ロ / 投入口 | 投入ロ | 投入口 | 投入口 | 投入口 |
| 投入優 | 投入優 | 投入後 | 投入後 | 投入後 |
| 5 3.1 / 5,3,1 | .5 3.1 | 5,3,1 | — | 5、3、1 |
| 待成 / 待機 | 待成 | 待機 | 待機 | 待機 |
| 気送子送子管 | 気送子送子管 | 気送子送子管 | 気送子送管 | 気送子送子管 |
| 一方交通 | 一方交通 | 一方交通 | 一方通行 | 一方交通 |
| 受けー方 / 受け一方 | 受けー方 | 受け一方 | 受け側 | 受け一方 |
“5、3、1” was a surprise. From the text alone it looks like section number “5.3.1”, but Figure 2 shows a diagram with three tubes labeled “5 floors”, “3 floors”, “1 floor” — tubes dropping to the 5th, 3rd, and 1st floor stations. “5、3、1の各層ステーション” (stations at floors 5, 3, and 1) is correct. Qwen3.5’s direct image reading produced “5,3,1” because it was actually looking at the diagram — while the text-correction LLM’s “5.3.1” is actually wrong. Information visible in the diagram can’t be recovered from text alone.
“一方交通” and “受け一方” appear in the original document as-is. Swallow’s corrections to “一方通行” and “受け側” are modernizations of 1963 vocabulary, not OCR error corrections — literally rewriting the original.
LLMs can’t distinguish “correcting an OCR misread” from “updating period vocabulary to contemporary phrasing.” In a 60-year-old document the vocabulary and expressions are of their era, and correcting them with a modern-Japanese sensibility rewrites the original. Even with explicit prompting to “respect the original vocabulary and only fix obvious misreads,” this can’t be fully prevented.
Best Combination
Based on these results, maximizing accuracy requires a three-stage pipeline:
- NDLOCR-Lite for OCR (fast, includes coordinate data)
- Qwen3.5 direct image reading for cross-referencing (catches what OCR misses)
- Text-based LLM (Swallow, etc.) for vocabulary-level corrections
Image + OCR text combined input looks appealing but risks the LLM anchoring on OCR misreads. Processing each independently then cross-referencing produces better accuracy overall.
That said, stage 3 text correction risks “rewriting historically accurate original text into modern Japanese” for older documents. Don’t take LLM correction output at face value — a final human check against the original image is unavoidable. The practical division of labor: LLM narrows down the diffs, human confirms them.
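The division of labor in the last paragraph can be sketched as a simple vote: corrections that a majority of sources (OCR-text LLM, image reading, a second text LLM) agree on are auto-accepted, everything else goes to a human. Everything here is hypothetical glue code, not part of any of the tools above:

```python
def triage(proposals: dict[str, list[str]]) -> tuple[dict[str, str], dict[str, list[str]]]:
    """Split per-site correction proposals into auto-accepted and human-review.

    proposals maps an OCR string to the candidate corrections proposed by
    each source (e.g. text LLM, image reading, second text LLM)."""
    accepted, review = {}, {}
    for site, candidates in proposals.items():
        top = max(set(candidates), key=candidates.count)
        if candidates.count(top) >= 2:   # a majority of sources agree
            accepted[site] = top
        else:                            # all sources disagree: human judges
            review[site] = candidates
    return accepted, review
```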