
BERT for search and OCR: MLM mechanics, WordPiece, and encoder successors


What changed when Google Search added BERT was not “the index now uses an LLM” — it was the way short queries get read. The DEV Community BERT post opens with examples like “jaguar speed” and “can you get medicine for someone pharmacy”. The point is that BERT did not turn the whole search engine into an LLM chat. It let the engine read relationships between words inside the query before ranking.

Google’s 2019 announcement noted that BERT was being used on roughly 1 in 10 US English queries. The examples were things like “for someone” carrying the meaning “pick up medicine on behalf of someone else”, or “parking on a hill with no curb”, where dropping the “no” reverses the answer. Neither is long-form understanding; these are short search strings where a preposition or a negation flips the result.

BERT reads context from both sides via masked tokens

The core of BERT is the masked language model. Some tokens in the input are hidden, and the model has to guess them from both left and right context. Training this way means the model cannot just predict the next word left-to-right.

The word “bank” shifts toward the riverbank sense when “river” is nearby, and toward the financial sense when “deposit” is nearby. Same string, different internal representation depending on the surrounding words.
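A quick way to see this is to compare the hidden state at the “bank” position across sentences. A minimal sketch with transformers, assuming bert-base-uncased and made-up example sentences:

```python
# Sketch: the same surface token "bank" gets a different vector in different contexts.
# Model name and example sentences are illustrative.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    """Return the last-layer hidden state at the first occurrence of 'bank'."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]           # (seq_len, 768)
    bank_id = tokenizer.convert_tokens_to_ids("bank")
    pos = (enc["input_ids"][0] == bank_id).nonzero()[0, 0]   # first occurrence
    return hidden[pos]

river  = bank_vector("He sat on the bank of the river.")
money  = bank_vector("She opened a savings account at the bank.")
money2 = bank_vector("The bank approved the loan and the deposit.")

cos = torch.nn.functional.cosine_similarity
print(cos(river, money, dim=0))    # typically lower ...
print(cos(money, money2, dim=0))   # ... than the two financial senses
```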

This property is exactly what I leaned on in Encoder model + local LLM for OCR error detection and correction. For OCR correction I used BERT-family models not as “what comes next” but as “does this character fit the surrounding context”. Given 投入優 (a misread), check if the left/right context pulls it back toward 投入後. A generative LLM, asked to fix the whole page, tends to also “fix” the parts that were actually correct. BERT only looks around the hole, so it works well as the detection step.

15% mask, 80/10/10 split

The “mask” in masked language modeling is not a simple fill-in-the-blank. 15% of the input tokens are picked as training targets, and within those the treatment splits further. 80% are replaced with [MASK], 10% are replaced with a random token, and the remaining 10% are left as the original token.

The reason is that [MASK] never appears in inputs at inference time. If every training target were [MASK], the model would learn a representation that only works when it sees [MASK]. Mixing in random tokens and originals forces the model to keep re-reading every position from its left/right context, no matter what symbol is sitting there.

```mermaid
flowchart LR
  A["Original<br/>The cat sat on the mat"] --> B["Pick 15%<br/>e.g. cat, mat"]
  B --> C["80% -> [MASK]<br/>10% -> random token<br/>10% -> original"]
  C --> D["Bidirectional<br/>Self-Attention<br/>12 or 24 layers"]
  D --> E["Vocab distribution<br/>at chosen positions"]
  E --> F["Cross-entropy<br/>vs ground truth"]
```
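In code, the selection step looks roughly like the sketch below. This is an illustration of the recipe, not the original pre-training implementation: special tokens, padding, and batching are ignored, and the -100 label value follows the common convention of excluding non-target positions from the loss. The [MASK] id 103 and vocab size 30,522 are the bert-base-uncased values.

```python
# Minimal sketch of the masking recipe above: pick ~15% of positions as targets,
# then 80% of those become [MASK], 10% a random token, 10% stay unchanged.
import random

def mask_tokens(input_ids, mask_id=103, vocab_size=30522, mlm_prob=0.15):
    input_ids = list(input_ids)
    labels = [-100] * len(input_ids)      # -100 = position ignored by the loss
    for i in range(len(input_ids)):
        if random.random() >= mlm_prob:   # ~85% of positions are not targets
            continue
        labels[i] = input_ids[i]          # the model must predict the original token
        r = random.random()
        if r < 0.8:
            input_ids[i] = mask_id        # 80% of targets -> [MASK]
        elif r < 0.9:
            input_ids[i] = random.randrange(vocab_size)  # 10% -> random token
        # remaining 10%: input left as the original token
    return input_ids, labels
```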

Original BERT also had a second pre-training task, Next Sentence Prediction. Two sentences are concatenated with [SEP] between them, and the model decides whether the second is the actual continuation of the first. Later, RoBERTa showed that downstream scores do not drop when NSP is removed, so most modern derivatives skip it. Pre-training data: original BERT used BooksCorpus + English Wikipedia (3.3B words), and RoBERTa-era recipes pushed this to ~160 GB of text.

WordPiece splits OOV words into known fragments

The English BERT vocabulary has 30,522 tokens and uses WordPiece subword segmentation. Out-of-vocabulary words do not get dropped into [UNK] — they get split into fragments the model already knows.

| Input | Split |
| --- | --- |
| playing | play, ##ing |
| unhappiness | un, ##happiness |
| tokenization | token, ##ization |
| Qwen | q, ##wen |

The ## prefix means “continues the previous token”. The word-initial play and the in-word ##ing are separate vocabulary IDs. Proper nouns or new terms that are not in the vocab get split into 2–3 fragments on input.
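The splits are easy to check with the transformers tokenizer. One caveat: the table above is illustrative, and very common words such as “playing” may already sit in the vocabulary as whole words, so the actual output depends on the vocabulary version.

```python
# Inspect WordPiece splits with the bert-base-uncased tokenizer.
# Frequent words may be whole-word vocab entries, so actual splits can
# differ from the illustrative table above.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
for word in ["playing", "unhappiness", "tokenization", "Qwen"]:
    print(f"{word!r} -> {tok.tokenize(word)}")
```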

Inputs always start with [CLS], sentence boundaries are marked with [SEP], and padding uses [PAD]. For classification, the output vector at the [CLS] position is usually fed into the task head. For fill-mask use — checking candidates at a specific position — you have to place [MASK] at a subword boundary on purpose. Put a mask in the middle of a word and the model will return ##-prefixed fragments, so the consumer has to reassemble them.
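A minimal fill-mask sketch, assuming bert-base-uncased and an illustrative sentence. The pipeline adds [CLS] and [SEP] itself; you only place [MASK]. The targets argument is the convenient route for scoring specific candidates at one position:

```python
# Fill-mask: special tokens are handled by the tokenizer, you only place [MASK].
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# Free prediction: top candidates at the masked position.
for cand in fill("He deposited the check at the [MASK]."):
    print(cand["token_str"], round(cand["score"], 3))

# Restrict scoring to specific candidate tokens at that position.
for cand in fill("He deposited the check at the [MASK].", targets=["bank", "river"]):
    print(cand["token_str"], round(cand["score"], 3))
```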

Search splits into candidate generation and meaning judgement

Keyword search is strong at the “pull candidate documents fast” stage, via inverted indexes. In Data structures that make search fast I went through inverted indexes, tries, n-grams, and BK-trees. These answer “which documents contain the word” and “which strings are similar” quickly.

What BERT adds is a downstream layer that asks “does this candidate match the meaning of the query”. For “parking on a hill with no curb”, just pulling pages that contain “curb” will lean toward the opposite answer. You need to see the relationship between “no” and “curb” to reorder the candidates.

Modern RAG and on-site search work the same way. First BM25 or vector search narrows the candidate set, then a reranker reads each query–candidate pair. The embedding-and-reranker split I wrote about in Sentence Transformers v5.4: unified embeddings across text, image, audio, video sits in the same lineage. BERT does not replace the search backbone by itself — it slots in as the model that reads candidates.
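As a sketch of that second stage, here is a cross-encoder reranker scoring query–candidate pairs. Retrieval is assumed to have already produced the candidates; the model name is one common public cross-encoder and the documents are made up:

```python
# Sketch of the rerank step: a cross-encoder reads each (query, candidate) pair
# and scores the pair jointly. Candidate retrieval (BM25 / vectors) is assumed
# to have happened already. Model name is one common public choice.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "parking on a hill with no curb"
candidates = [
    "Turn your wheels toward the curb when parking downhill.",
    "With no curb, turn the wheels so the car would roll away from the road.",
    "Curb appeal tips for selling your home.",
]

scores = reranker.predict([(query, doc) for doc in candidates])
for score, doc in sorted(zip(scores, candidates), reverse=True):
    print(round(float(score), 3), doc)
```

Because the reranker reads every pair in full, it only stays affordable once the candidate set has been narrowed, which is why it sits after retrieval rather than replacing it.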

Encoder-only BERT is easier to handle for judgement tasks

The DEV post also notes this near the end: BERT is an encoder. It is not a left-to-right generator like the GPT family. It does not directly answer questions or write paragraphs.

On the other hand, for classification, named entity recognition, fill-mask, similarity, and rerank — tasks that read input and produce a judgement — failures are still easier to read. The output space is narrow, which makes testing easier than with generative LLMs. You can collapse outputs to FIX/KEEP/ESCALATE for OCR correction, per-candidate scores for search, or labels for classification.

This matters when running models locally. bert-base-uncased is an old model, 110M parameters with a 512-token maximum, but classification and fill-mask still run on CPU. Accuracy drops when the domain shifts: Japanese OCR wants a Japanese BERT or LUKE; medical text wants BioBERT; legal text wants LegalBERT. The distance from the pre-training corpus shows up directly as misjudgements.

The encoder lineage after BERT is still growing

BERT itself is a 2018 model — 512-token max and an attention implementation from that era. A whole line of derivatives and successors followed, and for current judgement tasks you usually pick from the lower rows of this table.

| Model | Year | Main change |
| --- | --- | --- |
| RoBERTa | 2019 | NSP removed, larger batches, dynamic masking, more data |
| ALBERT | 2020 | Cross-layer parameter sharing, lighter weight |
| DeBERTa | 2020 | Disentangled attention, position info injected separately |
| DeBERTa-v3 | 2021 | ELECTRA-style replaced-token detection, the de facto cross-encoder choice |
| ModernBERT | 2024 | 8,192-token context, Flash Attention, a modern training recipe |

DeBERTa-v3-based cross-encoders are the standard rerank pick, and ModernBERT is starting to show up wherever long context or long-document embedding matters. Lined up by parameter count, the classic BERT looks weak — but for a CPU-side classifier or quick fill-mask experimenting at your desk, bert-base-uncased still runs fine.

Picking a Japanese BERT means picking pre-training data and tokenizer

Japanese cannot rely on whitespace splitting, and it mixes hiragana, katakana, kanji, ASCII, and symbols. You choose a model by pre-training domain and tokenizer scheme.

| Model | Notes | Where it fits |
| --- | --- | --- |
| cl-tohoku/bert-base-japanese-v3 | MeCab + WordPiece, Wikipedia + CC-100 | General-purpose, web text classification |
| cl-tohoku/bert-base-japanese-char-v3 | Character-level | OCR correction, text with heavy spelling variation |
| LUKE | Separate entity embeddings | NER, relation extraction |
| nlp-waseda/roberta-base-japanese | RoBERTa recipe | Sentence classification, similarity |
| ku-nlp/deberta-v2-base-japanese | DeBERTa family | Rerank, sentence-pair judgement |
| line-corporation/line-distilbert-base-japanese | DistilBERT, lightweight | Edge / CPU inference |

For tasks like “is this character contextually wrong” (OCR error detection), morpheme-based models have a failure mode where the misread character itself breaks MeCab segmentation. Character-level models are less sensitive to subword boundaries, so the per-position probabilities are easier to read directly.
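A sketch of that per-position read, using the character-level model from the table above. The tokenizer needs fugashi and unidic-lite installed, and the sentence, the misread character, and the decision rule are all illustrative:

```python
# Score a suspicious OCR character against its context with a character-level
# Japanese BERT (model ID from the table above). Sentence and candidates are
# illustrative; the flagging threshold is left to the caller.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "cl-tohoku/bert-base-japanese-char-v3"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

text = "薬剤を投入後に撹拌する"   # context as it should read
suspect = "優"                     # what the OCR actually produced at the 後 position
masked = text.replace("後", tok.mask_token, 1)

enc = tok(masked, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits[0]
mask_pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero()[0, 0]
probs = logits[mask_pos].softmax(dim=-1)

def char_prob(ch: str) -> float:
    # Word-internal characters may be stored with a "##" prefix depending on
    # the tokenizer, so check both forms and take the larger probability.
    ids = [tok.convert_tokens_to_ids(t) for t in (ch, "##" + ch)]
    ids = [i for i in ids if i is not None and i != tok.unk_token_id]
    return max((float(probs[i]) for i in ids), default=0.0)

print("後", char_prob("後"))
print(suspect, char_prob(suspect))
# If the OCR character scores far below a contextual alternative,
# flag the position for correction or escalation.
```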

BERT scores have task-dependent distributions

BERT-family outputs need to be read differently per task. A fill-mask probability is not “ground-truth likelihood” — it is the relative strength of that token under the model’s vocabulary and context. Cosine similarity for similarity search shifts with the model, normalization choice, sequence length, and domain.

In the OCR correction setup, taking BERT’s top-1 candidate at face value produced false positives. I layered a Qwen judgement on top and routed disagreements back to a human. The same principle holds for search: treating BERT or reranker scores as a single source of truth tends to break in ways you only catch by cross-checking against existing search scores, click logs, and your own evaluation set.
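The routing itself can stay small. A hypothetical sketch of collapsing the detector and the judge into FIX/KEEP/ESCALATE; the signals and thresholds here are made up, not the actual pipeline from the earlier post:

```python
# One way to collapse a BERT-side detection and an LLM-side judgement into
# FIX / KEEP / ESCALATE. Signals and thresholds are hypothetical.
def route(bert_flags_error: bool, llm_agrees: bool, bert_margin: float) -> str:
    if not bert_flags_error:
        return "KEEP"                       # detector sees nothing wrong
    if llm_agrees and bert_margin > 0.9:    # both point the same way, strongly
        return "FIX"
    return "ESCALATE"                       # disagreement or weak evidence -> human

print(route(bert_flags_error=True, llm_agrees=True, bert_margin=0.95))    # FIX
print(route(bert_flags_error=True, llm_agrees=False, bert_margin=0.95))   # ESCALATE
print(route(bert_flags_error=False, llm_agrees=True, bert_margin=0.2))    # KEEP
```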

Google’s BERT rollout often gets retold as “search moved from keyword matching to context understanding”. The real footprint is narrower: it comes in where negations, prepositions, word order, and surrounding words inside a short input flip the candidate ranking.

References