A read of arXiv:2604.26622 OCR-Memory. It renders agent execution history into images, uses Set-of-Mark to let a VLM pick relevant segments, then retrieves verbatim text from the original logs.
Designing field-level confidence thresholds for human-in-the-loop document extraction, and the OCR and threshold walls hit when automating journal entries with freee MCP.
The three-stage pipeline of BERT perplexity scan → LLM judgment → escalation packaged as a cross-platform Python tool. The installer automatically downloads llama-server and GGUF models.
Zhipu AI's GLM-OCR reaches 94.62% on OmniDocBench v1.5 despite using only 0.9B parameters. I dug into its layout parsing, vertical text handling, and math recognition.
Bundling NDLOCR-Lite's DEIMv2 + PARSeq with ONNX Runtime Mobile in an iOS app to run camera capture → perspective correction → layout detection → text recognition → confidence-based correction entirely on device.
Experiment log: from LUKE/BERT fill-mask fine-tuning, to perplexity-based error detection, to Qwen2.5 7B correction judgment with human escalation on mismatch. A complete pipeline running on a single RTX 4060 Laptop with 8GB VRAM.
From Docker hell to Lite + LLM correction. A retrospective on my own experimentation, plus an introduction to someone else's browser-based NDLOCR-Lite implementation.
Set up the CLI version of NDLOCR-Lite on Apple Silicon Mac, then tested OCR result correction with Qwen 3.5 and Swallow. Includes experiments with direct image reading and the anchoring effect.
Tried the lightweight OCR tool NDLOCR-Lite released by the National Diet Library — installed it on Windows 11 and tested both the CLI and GUI versions.
I looked into PageIndex, a RAG system that builds hierarchical document trees using only LLM reasoning, without chunking or vector databases. I also consider how it fits with layout detection and OCR pipelines.
Baidu's PaddleOCR-VL-1.5 reaches 94.5% accuracy on OmniDocBench v1.5 with just 0.9B parameters, surpassing large models such as GPT-4o and Qwen2.5-VL-72B.