
Running NDLOCR-Lite in an iOS Native App for On-Device OCR

After trying Windows, Mac, and the browser (someone else’s implementation), the next step is an iOS native app.

A browser version (ndlocrlite-web) already exists, but it requires downloading and initializing a 146MB WASM model. A native app with the model bundled can go from cold start to camera OCR in seconds. Since startup speed directly impacts usability, there’s a real reason to build native.

Test Environment

| Item | Spec |
| --- | --- |
| Device | iPhone 13 Pro Max |
| iOS | 18.6 |
| Chip | A15 Bionic |
| RAM | 6 GB |

Model Configuration

Four ONNX models bundled in the app:

| Model | Purpose | Size | Input shape |
| --- | --- | --- | --- |
| deim-s-1024x1024.onnx | Layout detection (DEIMv2) | 38.4 MB | [1, 3, 800, 800] float32 |
| parseq-ndl-16x256-30 | Text recognition (~30 chars) | 34.2 MB | [1, 3, 16, 256] float32 |
| parseq-ndl-16x384-50 | Text recognition (~50 chars) | 35.2 MB | [1, 3, 16, 384] float32 |
| parseq-ndl-16x768-100 | Text recognition (~100 chars) | 39.1 MB | [1, 3, 16, 768] float32 |

DEIMv2’s filename says 1024x1024, but the actual input size is 800x800. Reading the Python inference code shows input_size=800. Don’t trust the filename.

Total ~147MB. Large for an iOS app bundle, but well within App Store’s 4GB limit.

Inference Pipeline

flowchart TD
    A["Camera capture"] --> B["Vision perspective correction"]
    B --> C["DEIMv2<br/>Layout detection"]
    C --> D["Text line<br/>crop + vertical rotation"]
    D --> E["PARSeq<br/>Text recognition (cascade)"]
    E --> F["Confidence scan<br/>detect low-confidence chars via softmax probability"]
    F --> G["Reading order sort"]
    G --> H["Show correction results<br/>highlight suspicious areas"]

DEIMv2 detects text line bounding boxes in the image; PARSeq reads characters from each box. PARSeq uses 3 models in cascade based on string length (short lines → 256 model, long lines → 768 model).

Running on a real device (iPhone 13 Pro Max). The input is a MinaTokenPAY flyer (horizontal text).

[Image: input image (MinaTokenPAY flyer)]

[Image: OCR and correction result (iPhone 13 Pro Max)]

Xcode Project Setup

ONNX Runtime via SPM

Add via Swift Package Manager:

  1. Xcode → File → Add Package Dependencies
  2. URL: https://github.com/microsoft/onnxruntime-swift-package-manager
  3. Version: 1.20.0 or later
  4. Add onnxruntime library to target

For xcodegen (project.yml):

packages:
  onnxruntime:
    url: https://github.com/microsoft/onnxruntime-swift-package-manager
    from: "1.20.0"

targets:
  NDLOCRLite:
    type: application
    platform: iOS
    dependencies:
      - package: onnxruntime
        product: onnxruntime

Adding Model Files

Drag and drop the 4 .onnx files from src/model/ in the ndlocr-lite repository into the Xcode project. Confirm they’re included in the target’s Bundle Resources.

The character set definition NDLmoji.yaml is also needed. Used to map PARSeq output indices to characters. Defines ~7142 characters including EOS (hiragana, katakana, kanji, alphanumerics, symbols).

Model Loading and Inference

ONNX Runtime’s iOS API is Objective-C based; from Swift it is accessed through the OnnxRuntimeBindings bridge module.

import OnnxRuntimeBindings

let env = try ORTEnv(loggingLevel: .warning)
let options = try ORTSessionOptions()
try options.setIntraOpNumThreads(2)
try options.setGraphOptimizationLevel(.all)

let modelPath = Bundle.main.path(forResource: "deim-s-1024x1024", ofType: "onnx")!
let session = try ORTSession(env: env, modelPath: modelPath, sessionOptions: options)

DEIMv2 Inference

Two inputs: image tensor images and orig_target_sizes.

let imageTensor = try ORTValue(
    tensorData: NSMutableData(data: imageData),
    elementType: .float,
    shape: [1, 3, 800, 800] as [NSNumber]
)

// orig_target_sizes: pass input_size to match Python implementation (not original image size)
var sizeArray: [Int64] = [800, 800]
let sizeData = NSMutableData(bytes: &sizeArray, length: MemoryLayout<Int64>.size * 2)
let sizeTensor = try ORTValue(
    tensorData: sizeData, elementType: .int64, shape: [1, 2] as [NSNumber]
)

let outputs = try session.run(
    withInputs: ["images": imageTensor, "orig_target_sizes": sizeTensor],
    outputNames: ["labels", "boxes", "scores", "char_count"],
    runOptions: nil
)

Get text line bounding box coordinates from output boxes, filter with scores. Output labels are 1-indexed, so classIndex = label - 1 maps to ndl.yaml class definitions. Text line class indices are [1, 2, 3, 4, 5, 16] (line_main, line_caption, line_ad, line_note, line_note_tochu, line_title).
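A minimal sketch of this filtering step. The function and struct names (`filterTextLines`, `DetectedBox`) are illustrative, not from the repo; in the real app the flat arrays would be read out of the `ORTValue` outputs:

```swift
import Foundation

struct DetectedBox {
    let classIndex: Int
    let rect: CGRect
    let score: Float
}

/// Filter raw DEIMv2 outputs down to confident text-line boxes.
/// `boxes` is a flat [x1, y1, x2, y2, ...] array, 4 floats per detection.
func filterTextLines(labels: [Int64], boxes: [Float], scores: [Float],
                     threshold: Float = 0.5) -> [DetectedBox] {
    let textLineClasses: Set<Int> = [1, 2, 3, 4, 5, 16]
    var result: [DetectedBox] = []
    for i in 0..<scores.count {
        let classIndex = Int(labels[i]) - 1  // labels are 1-indexed
        guard scores[i] >= threshold, textLineClasses.contains(classIndex) else { continue }
        let x1 = CGFloat(boxes[i * 4]),     y1 = CGFloat(boxes[i * 4 + 1])
        let x2 = CGFloat(boxes[i * 4 + 2]), y2 = CGFloat(boxes[i * 4 + 3])
        result.append(DetectedBox(classIndex: classIndex,
                                  rect: CGRect(x: x1, y: y1, width: x2 - x1, height: y2 - y1),
                                  score: scores[i]))
    }
    return result
}
```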

PARSeq Inference

Crop each text line box detected by DEIMv2, select model based on line width, and pass to PARSeq.

// estimate character count from aspect ratio, select model
let aspectRatio = Float(width) / Float(height)
let estimatedChars = Int(aspectRatio * 2)

let config = parseqConfigs.first { estimatedChars <= $0.maxChars }
    ?? parseqConfigs.last!

let inputTensor = try ORTValue(
    tensorData: NSMutableData(data: inputData),
    elementType: .float,
    shape: [1, 3, 16, config.width as NSNumber] as [NSNumber]
)

Note that output tensor names differ per model ("13469", "21189", "40488"). Check the Python ONNX graph for correct names.

From the output float array, get character indices via argmax and decode with NDLmoji.yaml’s character set. Index 0 is the EOS (End of Sequence) token.

Correction

Why BERT Character Mask Prediction Didn’t Work

Initially tried detecting misrecognitions with BERT (bert-base-japanese-v3) perplexity scanning — replacing each character position with [MASK], running BERT inference, and flagging positions where the prediction probability for the original character is low.

bert-base-japanese-v3 expects input tokenized with MeCab + WordPiece. “児童虐待” (child abuse) would be split by MeCab into “児童” + “虐待”, then subword tokenized. BERT’s pretraining was done at this granularity.

Since bringing MeCab onto iOS is a heavy dependency, I used character-level tokenization — treating “児”, “童”, “虐”, “待” as individual tokens. Single-character tokens do exist in BERT’s vocabulary, but they appeared only sparsely as standalone tokens during training, which makes their probability distributions unstable. Even correct text like “児童虐待” (child abuse) and “オレンジリボン” (orange ribbon) got high anomaly scores, turning everything red.

Lowering thresholds, switching to ratio-based scoring (BERT’s top-1 probability / original character probability) — none of it fixed the root issue of wrong flagging locations. Character-level tokenization for Japanese BERT correction is a dead end.

The model was reduced to 521MB (FP32) → 131MB (INT8 quantized) to fit in the bundle, but since the judgments were useless, it was removed.

PARSeq Confidence as Alternative

Dropped BERT, using PARSeq’s own softmax output for correction. When PARSeq outputs each character, the post-softmax maximum probability directly indicates “how confident the model is about that character.” No additional model needed — it’s available from the decoding process.

private func decodeOutput(data: Data, charCount: Int) -> (String, [CharConfidence]) {
    let floats = data.withUnsafeBytes { Array($0.bindMemory(to: Float.self)) }
    let seqLength = floats.count / charCount

    var result = ""
    var confidences: [CharConfidence] = []

    for pos in 0..<seqLength {
        let logits = Array(floats[(pos * charCount)..<((pos + 1) * charCount)])

        // softmax
        let maxLogit = logits.max() ?? 0
        let exps = logits.map { exp($0 - maxLogit) }
        let sumExps = exps.reduce(0, +)
        let probs = exps.map { $0 / sumExps }

        // argmax
        let (maxIdx, maxProb) = probs.enumerated().max(by: { $0.1 < $1.1 })!
        if maxIdx == 0 { break }  // EOS

        let char = charset[maxIdx]
        result.append(char)

        confidences.append(CharConfidence(
            char: char,
            confidence: maxProb,
            // top-3 candidates for the correction sheet
            // (assumes topCandidates is [(String, Float)])
            topCandidates: probs.enumerated()
                .sorted { $0.element > $1.element }
                .prefix(3)
                .map { (charset[$0.offset], $0.element) }
        ))
    }

    return (result, confidences)
}

Threshold and Detection Patterns

Flag characters below 0.5 confidence as “suspicious,” highlighting up to 15 in red. 0.5 means “the model chose this character with less than 50% probability.”
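A sketch of the flagging logic under those numbers. `CharConfidence` mirrors the struct from the decoding code; `suspiciousPositions` is an illustrative name:

```swift
import Foundation

struct CharConfidence {
    let char: String
    let confidence: Float
}

/// Return up to `maxFlags` character positions to highlight,
/// lowest confidence first, using the 0.5 / 15 limits from the app.
func suspiciousPositions(_ confidences: [CharConfidence],
                         threshold: Float = 0.5, maxFlags: Int = 15) -> [Int] {
    return confidences.enumerated()
        .filter { $0.element.confidence < threshold }
        .sorted { $0.element.confidence < $1.element.confidence }
        .prefix(maxFlags)
        .map { $0.offset }
}
```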

In early development, DEIMv2 duplicate detections (same line grabbed as both line_main and line_caption) and mismatches between camera preview area and actual capture area meant PARSeq was receiving inaccurate bounding boxes, producing many low-confidence characters. Observed patterns:

| Confidence | Pattern | Cause |
| --- | --- | --- |
| ~0.1 | Characters at line start/end | Character cut off at bounding box boundary |
| 0.2–0.3 | Symbol misrecognitions | Hyphen↔dash, comma↔period |
| 0.3–0.5 | Similar glyph confusion | Mix-up of visually similar characters |

After implementing NMS to remove duplicate bounding boxes (IoU > 0.5 suppressed) and correct image cropping aligned to camera’s resizeAspectFill, the image quality going into PARSeq improved and low-confidence characters mostly disappeared. No red highlights in the screenshot above either.
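The duplicate suppression can be sketched as greedy NMS over score-sorted boxes. `ScoredBox` and the function names are illustrative; the IoU > 0.5 threshold matches the one described above:

```swift
import Foundation

struct ScoredBox {
    let rect: CGRect
    let score: Float
}

/// Intersection-over-union of two boxes, computed directly from edges.
func iou(_ a: CGRect, _ b: CGRect) -> CGFloat {
    let ix = max(0, min(a.maxX, b.maxX) - max(a.minX, b.minX))
    let iy = max(0, min(a.maxY, b.maxY) - max(a.minY, b.minY))
    let interArea = ix * iy
    let unionArea = a.width * a.height + b.width * b.height - interArea
    return unionArea > 0 ? interArea / unionArea : 0
}

/// Greedy NMS: keep the highest-scoring box, drop any later box
/// whose overlap with a kept box exceeds `iouThreshold`.
func nonMaxSuppression(_ boxes: [ScoredBox], iouThreshold: CGFloat = 0.5) -> [ScoredBox] {
    var kept: [ScoredBox] = []
    for box in boxes.sorted(by: { $0.score > $1.score }) {
        if kept.allSatisfy({ iou($0.rect, box.rect) <= iouThreshold }) {
            kept.append(box)
        }
    }
    return kept
}
```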

Confident errors (high confidence but still wrong) can’t be caught, but that’s the same limitation for any post-correction method. In practice, improving preprocessing accuracy was the most effective way to reduce correction burden.

The UI highlights low-confidence characters in red; tapping opens a sheet showing PARSeq’s Top 3 candidates for selection. Apply the fix and export via share sheet.

Image Preprocessing Pitfalls

This took the most time. What is a few lines of OpenCV + NumPy in Python hides coordinate-system, pixel-format, and scale traps on iOS.

Retina Scale Trap

UIGraphicsImageRenderer(size: CGSize(width: 800, height: 800)) produces a 2400x2400 pixel image on iPhone 13 Pro Max (3x Retina). Nothing like the 800x800 the model expects.

// WRONG: Retina scale multiplied → 2400x2400
let renderer = UIGraphicsImageRenderer(size: CGSize(width: 800, height: 800))

// CORRECT: explicitly specify scale=1.0 to guarantee 800x800 pixels
UIGraphicsBeginImageContextWithOptions(CGSize(width: 800, height: 800), true, 1.0)
defer { UIGraphicsEndImageContext() }

UIGraphicsImageRenderer can also be given a UIGraphicsImageRendererFormat with scale set to 1.0, but UIGraphicsBeginImageContextWithOptions is more reliable.

Pixel Byte Order Is BGRA

Reading pixel data from a UIKit bitmap context, byte order is BGRA (for ARM iOS: kCGImageAlphaNoneSkipFirst | kCGBitmapByteOrder32Little). Treating it as RGBA causes colors to swap and inference results become completely wrong.

let bytes = ctxData.assumingMemoryBound(to: UInt8.self)
// bytes[offset + 0] = B
// bytes[offset + 1] = G
// bytes[offset + 2] = R
// bytes[offset + 3] = A (or skip)

bytesPerRow Is Not width*4

The context’s bytesPerRow may not equal pixel width × 4. Padding can be added for memory alignment. Always use bytesPerRow for row start address calculations.

let bytesPerRow = ctx.bytesPerRow
for row in 0..<targetSize {
    let rowOffset = row * bytesPerRow  // NOT width * 4, use bytesPerRow
    for col in 0..<targetSize {
        let pixelOffset = rowOffset + col * 4
        // ...
    }
}

Different Normalization for Each Model

DEIMv2 and PARSeq use different normalization. Mixing them destroys accuracy.

  • DEIMv2: ImageNet normalization (pixel / 255.0 - mean) / std
  • PARSeq: pixel / 127.5 - 1.0
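The two schemes side by side. The per-channel mean/std values are the standard ImageNet constants, which I assume the DEIMv2 preprocessing uses; the function names are illustrative:

```swift
import Foundation

// Standard ImageNet constants (assumed for DEIMv2 preprocessing), RGB order.
let imagenetMean: [Float] = [0.485, 0.456, 0.406]
let imagenetStd:  [Float] = [0.229, 0.224, 0.225]

/// DEIMv2: scale to [0, 1], then subtract mean and divide by std per channel.
func normalizeDEIM(_ pixel: UInt8, channel: Int) -> Float {
    return (Float(pixel) / 255.0 - imagenetMean[channel]) / imagenetStd[channel]
}

/// PARSeq: map [0, 255] to [-1, 1].
func normalizePARSeq(_ pixel: UInt8) -> Float {
    return Float(pixel) / 127.5 - 1.0
}
```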

Perspective Correction

Image distortion from smartphone photos is handled with Vision framework’s VNDetectDocumentSegmentationRequest to detect document corners, then CIPerspectiveCorrection for perspective correction. No OpenCV needed.
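A sketch of that flow, assuming a single document fills most of the frame. The Vision and Core Image APIs are real; the `deskew` wrapper and error handling are illustrative:

```swift
import Vision
import CoreImage

/// Detect document corners with Vision, then deskew with CIPerspectiveCorrection.
func deskew(_ image: CIImage) throws -> CIImage? {
    let request = VNDetectDocumentSegmentationRequest()
    try VNImageRequestHandler(ciImage: image).perform([request])
    guard let doc = request.results?.first else { return nil }

    // Vision returns normalized coordinates; scale to pixel space.
    let size = image.extent.size
    func vector(_ p: CGPoint) -> CIVector {
        CIVector(x: p.x * size.width, y: p.y * size.height)
    }

    let filter = CIFilter(name: "CIPerspectiveCorrection")!
    filter.setValue(image, forKey: kCIInputImageKey)
    filter.setValue(vector(doc.topLeft), forKey: "inputTopLeft")
    filter.setValue(vector(doc.topRight), forKey: "inputTopRight")
    filter.setValue(vector(doc.bottomLeft), forKey: "inputBottomLeft")
    filter.setValue(vector(doc.bottomRight), forKey: "inputBottomRight")
    return filter.outputImage
}
```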

Vertical Text Rotation

Tall line images (vertical text) need 90 degree CCW rotation before passing to PARSeq.

// .right = "raw data is 90° CW rotated" → UIKit displays it 90° CCW
let rotated = UIImage(cgImage: cgImage, scale: 1.0, orientation: .right)
UIGraphicsBeginImageContextWithOptions(rotated.size, false, 1.0)
defer { UIGraphicsEndImageContext() }
rotated.draw(at: .zero)

Using CGContext.rotate(by:) directly causes rotation in the opposite direction from what you’d expect due to UIKit’s y-flip coordinate system. Specifying UIImage.Orientation is safer.

Parsing the Character Set (NDLmoji.yaml)

When parsing NDLmoji.yaml’s charset_test: "..." string into a character array, Swift’s Character type merges variant selectors (U+FE0E) into the preceding character as a single grapheme cluster.

// WRONG: Character type combines grapheme clusters
// "☆\u{FE0E}" becomes 1 character → total is 7141 (correct is 7142)
for char in charsetString { ... }

// CORRECT: read at Unicode scalar level
let scalars = Array(content.unicodeScalars)
for scalar in scalars {
    chars.append(String(scalar))
}

Even one character off means all PARSeq output tensor readings are shifted, giving wrong argmax results for every position. If output is completely garbled, first check the character count.

Memory Management

The memory available to an iOS app is only a fraction of the device’s RAM. Even on an iPhone 13 Pro Max (6 GB), the practical budget is roughly 1.5 GB.

ONNX Runtime sessions use several times the model size in memory when expanded. Loading all 3 PARSeq models simultaneously strains memory, so use lazy loading + autoreleasepool.

// lazy-load PARSeq on demand
if parseqSessions[config.width] == nil {
    parseqSessions[config.width] = try ORTSession(env: env, modelPath: path, sessionOptions: options)
}

// wrap each line inference in autoreleasepool
for box in boxes {
    let result: LineResult? = try autoreleasepool {
        // crop → rotate → recognize
    }
}

As mentioned, the BERT correction model (131 MB) was dropped, so no additional models are needed and the bundle stays at ~147 MB.