Using Confidence Scores to Reduce Human Review in Document Extraction
Demos that extract JSON from PDFs usually work smoothly. But once invoices, contracts, scanned images, and handwritten notes start mixing in, failures get annoying in a specific way — not everything breaks, just certain fields go wrong with full confidence.
Iteration Layer published Human in the Loop: Using Confidence Scores to Build Reliable Document Extraction on DEV Community, addressing this “partially suspicious” state with field-level confidence scores. Their Document Extraction API returns a confidence value between 0 and 1, along with a source citation, for each extracted field. Instead of pass/fail on the entire document, it evaluates invoice number, vendor name, total amount, and shipping address separately.
This granularity matters. If the address scores 0.72 while amounts and invoice numbers sit around 0.95, there’s no reason to send the whole document back to a human. Queue just the address field for review and let the rest flow through.
Confidence is not accuracy
The original article emphasizes this: confidence is not accuracy. A score of 0.95 doesn’t guarantee correctness, and 0.65 doesn’t guarantee an error. It’s a signal of how certain the model is about a value, and the operations team uses that signal to narrow down where humans need to look.
Getting this wrong causes incidents. Even high-confidence fields can misread labels, grab the wrong column from a table, or pick a stale amount from elsewhere in the document. For values like amounts and dates — plausible-looking but high-impact downstream — adding arithmetic checks or cross-document validation on top of confidence is safer than trusting confidence alone.
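Concretely, a check like the following can sit on top of the confidence gate. It's a minimal sketch: the field names (`subtotal`, `tax`, `total`) and the one-unit rounding tolerance are assumptions for illustration, not anything the extraction API prescribes.

```python
def amounts_need_review(fields: dict, confidences: dict, threshold: float = 0.95) -> bool:
    """Flag the amount fields for review when any confidence is below threshold
    OR the arithmetic doesn't close; even a 0.98-confidence total gets stopped
    if subtotal + tax disagrees with it. Field names are illustrative."""
    below_threshold = any(
        confidences.get(k, 0.0) < threshold for k in ("subtotal", "tax", "total")
    )
    # Cross-field check, with a one-unit tolerance for rounding.
    inconsistent = abs(fields["subtotal"] + fields["tax"] - fields["total"]) > 1
    return below_threshold or inconsistent
```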
The original article sets different thresholds per field type: 0.95 for amounts, 0.88 for vendor names, 0.90 for invoice numbers. This makes practical sense. A vendor name typo can be absorbed as a variant, but a payment amount error goes straight to a wire transfer.
| Field | Auto-pass threshold | Cost of getting it wrong |
|---|---|---|
| Invoice number | 0.90 | Reconciliation headaches |
| Vendor name | 0.88 | Often absorbable as a name variant |
| Tax/Total amount | 0.95 | Directly hits payments and accounting |
| Address | ~0.90 | Shipping errors, identity verification failures |
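Expressed as configuration, the table reads as something like the sketch below. The snake_case field names and the routing helper are illustrative; only the threshold values come from the table.

```python
# Per-field auto-pass thresholds from the table above.
AUTO_PASS_THRESHOLDS = {
    "invoice_number": 0.90,
    "vendor_name": 0.88,
    "tax_amount": 0.95,
    "total_amount": 0.95,
    "shipping_address": 0.90,
}

def route_field(field_name: str, confidence: float, default_threshold: float = 0.90) -> str:
    """Route a single extracted field: auto-accept or queue it for review."""
    threshold = AUTO_PASS_THRESHOLDS.get(field_name, default_threshold)
    return "auto_accept" if confidence >= threshold else "review"
```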
Different from OCR-only problems
I previously compared browser OCR, NDLOCR, Cloud Vision, and Vision LLMs in the 2025 web OCR limitations article. That article focused on “can it read the text,” “how good is it with Japanese,” and “does it run in a browser.”
This is one layer deeper. Given that the text was read correctly, can the extracted value be trusted as business data? Did it distinguish “Total” from the subtotal line? Can the JSON that OCR or VLM returned go straight into the database?
In the article on cutting NDLOCR column recognition with histograms, I covered how layout recognition failures corrupt reading order itself. Confidence scores don’t magically fix preprocessing failures like that. They’re a component for narrowing the review scope after a suspicious region is detected.
The same issue applies to VLM-based JSON extraction. In the local Vision LLM RPG parameter extraction experiment, JSON structure was stable but the values reflected the model’s interpretation — or invention. For document extraction too, getting schema-compliant JSON back isn’t enough. Without value provenance, confidence, and post-review correction history, there’s no way to trace where things went wrong later.
Keep the review queue field-level
Human-in-the-loop design gets heavy the moment it degenerates into "review everything anyway." The original article's review queue design assumes field-level review. Put the extracted value, confidence, source citation text, and original document viewer on the same screen, and let the reviewer fix only the flagged fields.
The flow looks like this:
```mermaid
flowchart TD
    A["Document upload"] --> B["Extract with schema"]
    B --> C["Assign per-field confidence"]
    C --> D{Above threshold?}
    D -->|High| E["Auto-accept"]
    D -->|Medium| F["Pre-fill for human review"]
    D -->|Low| G["Manual entry or re-extract"]
    F --> H["Save correction history"]
    G --> H
    E --> I["Downstream processing"]
    H --> I
```
An important detail in the review UI: don’t blank out the model’s output. Even at low confidence, having a candidate value and its source citation speeds up review. Full manual entry should be reserved for cases where the value is near-null, no source citation was captured, or an arithmetic check failed.
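A sketch of what one queue entry might carry, assuming the extraction response includes a per-field source citation. The class name, field names, and status values are illustrative, chosen so the correction history discussed next can be analyzed later.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReviewItem:
    """One flagged field queued for review. Names and statuses are illustrative."""
    document_id: str
    field_name: str
    candidate_value: Optional[str]   # model output, pre-filled even at low confidence
    confidence: float
    source_citation: Optional[str]   # snippet / page region the value was read from
    status: str = "pending"          # pending -> approved | corrected | manual_entry
    final_value: Optional[str] = None

def resolve(item: ReviewItem, reviewer_value: str) -> ReviewItem:
    """Record the reviewer's decision so correction history can be analyzed later."""
    if item.candidate_value is None:
        item.status = "manual_entry"      # no candidate was captured at all
    elif reviewer_value == item.candidate_value:
        item.status = "approved"          # reviewer accepted the pre-filled value
    else:
        item.status = "corrected"         # reviewer changed the value
    item.final_value = reviewer_value
    return item
```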
Another metric to watch is post-review correction rate. If most items in the 0.70–0.90 review band are approved as-is, the threshold is too strict. Conversely, if the correction rate in the review band is high, errors are likely leaking through on the auto-accept side too. Correction rate by confidence band is more actionable for tuning than average confidence.
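Building on the `ReviewItem` sketch above, correction rate per band is a small aggregation; the 0.05 band width is an arbitrary choice for illustration.

```python
from collections import defaultdict

def correction_rate_by_band(items, band_width: float = 0.05) -> dict:
    """Group resolved ReviewItems into confidence bands and report how often the
    reviewer changed the value. Near-zero rates in a band suggest it can move to
    auto-accept; high rates suggest errors are leaking past the threshold above it."""
    step = int(band_width * 100)   # work in integer percent to avoid float drift
    stats = defaultdict(lambda: {"total": 0, "corrected": 0})
    for item in items:
        band = (int(round(item.confidence * 100)) // step * step) / 100
        stats[band]["total"] += 1
        if item.status in ("corrected", "manual_entry"):
            stats[band]["corrected"] += 1
    return {band: s["corrected"] / s["total"] for band, s in sorted(stats.items())}
```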
Make branching conditions explicit for agents
The original article also touches on agent usage via MCP. Iteration Layer’s MCP documentation shows a path for calling their document and image processing APIs from tools like Claude and Cursor. The agent receives a document, calls the extraction tool, and branches on confidence: proceed, request review, or reject.
This is riskier than a human review queue. A human can pause and think “does this address look right?” — an agent moves on to the next tool call. When an agent reads an invoice amount and proceeds to a payment workflow, uses an extracted contract date to create an alert, or feeds a resume’s skills section into an evaluation — confidence becomes an execution condition, not just metadata.
In the Cloudflare Browser Run article, I covered Human in the Loop for login and MFA, where the agent hits a wall it can’t operate past. Document extraction HITL is subtler — the agent can operate, but doesn’t know whether to trust the value.
As seen in the Pinterest internal MCP infrastructure article, enterprise MCP systems are adding approval flows for dangerous operations. Document extraction needs the same: if confidence falls below threshold, halt tool execution and route to user confirmation or the review queue. Writing “check with human if confidence is low” in the agent’s prompt is too weak. Hardcode the branching in code after the tool call returns.
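A sketch of what that looks like in the agent's harness rather than in its prompt. `call_extraction_tool` and `enqueue_for_review` are hypothetical stand-ins for whatever MCP tool and review-queue API are actually wired in, and the 0.90 gate is just an example value.

```python
REVIEW_THRESHOLD = 0.90  # example value; tune per field in practice

def call_extraction_tool(document_id: str) -> dict:
    """Hypothetical stand-in for the MCP document-extraction tool call."""
    raise NotImplementedError

def enqueue_for_review(document_id: str, flagged: dict) -> None:
    """Hypothetical stand-in for the system's review-queue API."""
    raise NotImplementedError

def handle_extraction(document_id: str) -> dict:
    """The gate lives in code after the tool call returns, not in the prompt."""
    result = call_extraction_tool(document_id)
    flagged = {
        name: f for name, f in result["fields"].items()
        if f["confidence"] < REVIEW_THRESHOLD
    }
    if flagged:
        # Halt here: the agent must not proceed to payment/alert/evaluation steps.
        enqueue_for_review(document_id, flagged)
        return {"status": "pending_review", "flagged_fields": sorted(flagged)}
    return {"status": "proceed", "fields": result["fields"]}
```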
Don’t try to nail thresholds from the start
The original article proposes processing a few hundred documents, reviewing the results, then adjusting thresholds. This matters a lot. If you lock in meaning for specific values like 0.92 or 0.95 too early, it breaks the moment the document type changes.
Start conservatively. Cast a wide review net, and for each field record whether it was “approved,” “corrected,” or “manually entered.” Then shift confidence bands with near-zero corrections into auto-accept. For bands with frequent corrections, raise the threshold or reinforce with schema descriptions, preprocessing, cross-document checks, or computed fields.
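As a sketch of that tuning loop, the per-band correction rates from `correction_rate_by_band` above can feed a threshold proposal for one field. The 2% tolerance is an arbitrary example, and raising a threshold properly requires spot-checking auto-accepted items, which isn't covered here.

```python
def propose_threshold(band_rates: dict, current: float,
                      max_correction_rate: float = 0.02) -> float:
    """Lower the auto-accept threshold for one field to the lowest review band
    whose correction rate stays under max_correction_rate; stop at the first
    band that still gets corrected often."""
    proposed = current
    for band in sorted(band_rates, reverse=True):
        if band >= current:
            continue                              # already auto-accepted
        if band_rates[band] <= max_correction_rate:
            proposed = band                       # this band barely gets corrected
        else:
            break                                 # corrections are frequent below here
    return proposed
```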
There are cases where confidence alone shouldn’t be the final word. For amounts like subtotal, tax, and total, stop if the arithmetic doesn’t add up even when each field has high confidence. For contract dates, the body, cover page, and amendment clauses might disagree. For resumes, contact info may extract at high confidence while skill summaries are over-abstracted.
What production document extraction needs isn’t perfect automation — it’s plumbing where wrong values don’t silently flow downstream. Confidence is the valve, but the valve’s opening has to be tuned by review results.
Building a multi-stage filter
The threshold logic above uses a single-stage per-field threshold to sort into “pass / review / reject.” An alternative is to stack stages like a cascade classifier.
The first stage does a rough sort by confidence. The second stage runs arithmetic consistency and format validation (date formats, postal codes, tax totals). A third stage checks against external master data (partial match on vendor names, invoice number sequencing rules).
```mermaid
flowchart TD
    A["Extracted field value + confidence"] --> B{"Stage 1<br/>Confidence threshold"}
    B -->|0.95+| C["Tentative accept"]
    B -->|Below 0.60| D["Manual entry"]
    B -->|0.60 to 0.95| E["To Stage 2"]
    C --> F{"Stage 2<br/>Arithmetic & format check"}
    E --> F
    F -->|Pass| G["Auto-accept"]
    F -->|Mismatch| H{"Stage 3<br/>External master lookup"}
    H -->|Match| G
    H -->|No match| I["Review queue"]
    D --> I
```
The key is not letting confidence alone make the final call. Stage 1 tentatively accepts high-confidence values, but Stage 2 arithmetic checks don’t get skipped. Even at confidence 0.98, if subtotal + tax ≠ total, it stops.
Going the other way: a field at confidence 0.72 that landed in the middle zone can be promoted to auto-accept if the postal code format checks out and the address matches the external master. Low confidence doesn’t have to mean “show everything to a human” — if other validations can compensate, there’s still room for automation.
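A sketch of the cascade under assumed conditions: invoice-style field names, a one-unit rounding tolerance, a Japanese postal-code format, and a naive vendor-master partial match (the promotion is shown with the vendor name, but the same applies to the address example above). The point is the two behaviors just described: high confidence never skips Stage 2, and a gray-zone value can be promoted when a later stage vouches for it.

```python
import re

def stage1(confidence: float) -> str:
    """Stage 1: rough sort into tentative accept / gray zone / manual entry."""
    if confidence >= 0.95:
        return "tentative_accept"
    if confidence < 0.60:
        return "manual_entry"
    return "gray_zone"

def stage2_amounts_ok(fields: dict) -> bool:
    """Stage 2: arithmetic consistency, subtotal + tax == total within a 1-unit tolerance."""
    try:
        return abs(fields["subtotal"] + fields["tax"] - fields["total"]) <= 1
    except (KeyError, TypeError):
        return False

def stage2_postal_ok(postal_code: str) -> bool:
    """Stage 2: format check for a Japanese postal code (NNN-NNNN)."""
    return bool(re.fullmatch(r"\d{3}-\d{4}", postal_code or ""))

def stage3_vendor_ok(vendor_name: str, vendor_master: set) -> bool:
    """Stage 3: naive partial match against an external vendor master."""
    if not vendor_name:
        return False
    return any(vendor_name in known or known in vendor_name for known in vendor_master)

def route_amounts(fields: dict, confidences: dict) -> str:
    """Even a 0.98-confidence total goes through Stage 2; a gray-zone amount can
    still auto-accept if the arithmetic holds."""
    worst = min(confidences.get(k, 0.0) for k in ("subtotal", "tax", "total"))
    if stage1(worst) == "manual_entry":
        return "review_queue"                 # too uncertain for checks to compensate
    return "auto_accept" if stage2_amounts_ok(fields) else "review_queue"

def route_vendor(value: str, confidence: float, vendor_master: set) -> str:
    """A 0.72-confidence vendor name is promoted when the master lookup matches."""
    s1 = stage1(confidence)
    if s1 == "tentative_accept":
        return "auto_accept"
    if s1 == "gray_zone" and stage3_vendor_ok(value, vendor_master):
        return "auto_accept"
    return "review_queue"
```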
The idea is the same as cascade classifiers in machine learning. Just as Viola-Jones face detection eliminates non-faces stage by stage, each stage clears the obvious passes and obvious rejects first, passing only the gray zone to the next. In document extraction, the stages aren't model re-inference but rule-based validation and master data lookups.
Managing stage ordering and pass conditions is the annoying part in practice. For document types where the later stages have nothing to check against (free-format quotes, for example), you're stuck with confidence and source citations alone. Which stages are effective varies by document type, so the stage configuration itself needs to be tied to the document schema.
It’s heavier to build than single-stage confidence thresholds, but it catches both “high confidence but arithmetic fails” and “low confidence but deterministic by rule.”
The wall I hit when trying to automate journal entries with freee MCP
Everything written above landed directly when I tried to create journal entries from documents in freee’s file box via MCP.
freee’s file box is a feature for uploading and storing images of receipts and invoices. In the web UI, internal OCR runs and displays recognized amounts and vendor names. But when fetching file box data via API, the OCR results don’t come along. Trying to “read file box documents and create journal entries” through MCP means starting from scratch with your own OCR.
At the self-OCR stage for reading invoice amounts, the “high confidence but wrong” problem hits head-on. Misread an amount by one digit and it enters the accounting ledger — potentially flowing all the way to a bank transfer. What I ended up doing was cranking the threshold to effectively maximum: only auto-process items with absolute certainty. Anything with even slightly reduced confidence goes to human review.
This kills the "wrong journal entry silently passes" failure mode. But the automation rate is low. Most documents end up in human review, which isn't much different from entering journal entries by hand.
freee’s internal OCR is accurate. In the web UI, even poorly scanned receipts are read correctly most of the time. On top of that, freee has journal entry rules — a feature that auto-sets account categories based on vendor name and description patterns; once a rule exists, matching documents are journaled automatically. When you run your own OCR outside freee, you lose both that accuracy and those rules.
Building an end-to-end pipeline of “auto-journal documents from the file box, upload to transaction details, and attach” needs more than multi-stage filter design. Either re-implement freee’s journal entry rule logic externally, or wait for freee to start returning OCR results via API. As it stands, the choice is between a conservative setup that only flows items with maximum confidence, or a flow that keeps human final review.
There’s also the account category estimation wall, separate from OCR accuracy. Even with correctly read amounts and vendor names, “is this ¥3,980 from Amazon office supplies or books?” and “is this ¥680 from Starbucks meeting expenses or entertainment?” come down to that business’s accounting judgment. freee’s journal rules accumulate this mapping from past journal patterns, but the API doesn’t expose that accumulation. Estimating account categories from an external pipeline means either writing your own rules based on vendor names and amounts, or having an LLM guess each time. Neither comes close to the judgment built up by humans inside freee.
Both OCR recognition accuracy and journal rule accumulation are locked inside the API boundary, out of reach for external agents. As long as the file box API returns only image binaries and metadata, the agent side has to redo both recognition and judgment from scratch.