
A Complete Roundup of Data Structures for Fast Search — Trie, Inverted Index, Suffix Array, Double-Array Trie

There are many kinds of “search”: exact match, prefix match, substring search, fuzzy search, full‑text search… The best data structure changes depending on what kind of search you want to do.

It’s tempting to just toss everything into a DB and call it a day, but that’s not really a good idea.

So, I put together 10 major data structures used for search workloads. For each one, I outline how it works, which search types it excels at, complexity, and real‑world use cases.

Let’s first organize the kinds of search.

Search Type        Description                          Example
Exact match        Key matches exactly                  Dictionary lookup, hash table
Prefix match       Starts with the given prefix         Autocomplete “abc…”
Substring          Contained somewhere in the string    grep, Ctrl+F
Fuzzy search       Small edit distance                  Spell correction “teh → the”
Full‑text search   Find documents that contain terms    Google Search, on‑site search
Range query        Value falls within a range           “Price between 1,000–2,000”

Trie

A tree that follows one character per node. The basic structure for prefix search.

        root
       / | \
      c  d  t
      |  |  |
      a  o  o
     /|  |  |
    t r  g  p

A trie that stores “cat”, “car”, “dog”, and “top”.

Good at: Prefix search, autocomplete, detecting common prefixes

Complexity: O(m) — m is the key length

Drawbacks: Memory‑inefficient. Each node holds a table of child pointers, and most entries stay empty, so a lot of space is wasted on sparse nodes.

Use cases: IME candidate lists, URL routing, IP address lookup (Patricia trie)
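
As a minimal Python sketch of insert and prefix search (class and method names are my own, not from any particular library):

class TrieNode:
    def __init__(self):
        self.children = {}      # character -> child node
        self.is_word = False    # marks the end of an inserted key

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def starts_with(self, prefix):
        # walk one character per node: O(m) in the prefix length
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return self._collect(node, prefix)

    def _collect(self, node, prefix):
        # gather every stored word below this node (useful for autocomplete)
        words = [prefix] if node.is_word else []
        for ch, child in node.children.items():
            words.extend(self._collect(child, prefix + ch))
        return words

After inserting “cat”, “car”, “dog”, and “top”, calling starts_with("ca") would return ["cat", "car"] (in some order).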

Double‑Array (Double‑Array Trie)

A technique that packs a trie into two arrays (BASE and CHECK) for fast, compact traversal. Used internally by morphological analyzers such as MeCab and Sudachi.

Index:   0   1   2   3   4   5   6   7
BASE:   [-   1   3   -   5   -   -   2 ]
CHECK:  [-   0   1   -   2   -   -   1 ]

How it works:

  • Compute transition with BASE[s] + c = t
  • Verify origin with CHECK[t] = s

From state s with character code c:

  1. Compute t = BASE[s] + c
  2. If CHECK[t] == s, the transition exists; otherwise there is no transition for that character
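
A minimal sketch of that transition check, assuming BASE and CHECK are already built and c is a small integer character code (building the arrays is the hard part and is omitted here):

def next_state(base, check, s, c):
    # candidate destination computed from the current state and character code
    t = base[s] + c
    if 0 <= t < len(check) and check[t] == s:
        return t        # CHECK confirms the transition really comes from state s
    return None         # no transition for this character from state s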

Good at: Dictionary lookup; morphological analysis (fast lookup over large lexicons)

Complexity: O(m) — same as a trie, but with better cache locality

Drawbacks: Complex to build. Poor at dynamic insert/delete (requires relocating array regions).

Use cases: MeCab, Sudachi, ATOK, Google Japanese Input

Inverted Index

A mapping from “term → list of documents that contain the term”. The foundation of full‑text search.

[Original data]
Doc1: "猫が寝ている"   (the cat is sleeping)
Doc2: "犬が走っている" (the dog is running)
Doc3: "猫が走っている" (the cat is running)

[Inverted index]
猫 (cat)   → [Doc1, Doc3]
犬 (dog)   → [Doc2]
寝 (sleep) → [Doc1]
走 (run)   → [Doc2, Doc3]

It’s called “inverted” because it reverses the usual “document → terms” relationship.

Good at: Full‑text search, AND/OR queries, phrase search

Complexity: O(1) to fetch the postings list. AND queries are O(n+m) for list merging.
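
A minimal sketch, assuming documents are already tokenized into terms (real engines keep sorted postings lists with frequencies and positions rather than Python sets):

def build_inverted_index(docs):
    # docs: {doc_id: list of terms}
    index = {}
    for doc_id, terms in docs.items():
        for term in terms:
            index.setdefault(term, set()).add(doc_id)
    return index

def and_query(index, *terms):
    # AND query = intersection of the terms' postings lists
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

With the documents above tokenized into single characters, and_query(index, "猫", "走") returns {"Doc3"}.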

Additional fields:

  • TF (term frequency)
  • IDF (inverse document frequency)
  • Positional data (for phrase search)

Use cases: Elasticsearch, Apache Lucene, Pagefind, Algolia

Suffix Array

An array of the starting positions of all suffixes of a string, sorted lexicographically. Enables substring search via binary search.

String: "banana"

Suffixes:
0: banana
1: anana
2: nana
3: ana
4: na
5: a

After sorting (suffix array):
[5, 3, 1, 0, 4, 2]
 a  ana anana banana na nana

Because the suffixes are sorted, searching for “ana” becomes a binary search for the suffixes that start with “ana”.
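
A minimal sketch with naive O(n² log n) construction (real implementations use linear-time algorithms such as SA-IS) and a binary search over the sorted suffixes:

def build_suffix_array(s):
    # sort the suffix start positions by the text of each suffix
    return sorted(range(len(s)), key=lambda i: s[i:])

def find_occurrences(s, sa, pattern):
    # binary search for the first suffix >= pattern ...
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # ... then collect the consecutive suffixes that start with the pattern
    matches = []
    while lo < len(sa) and s.startswith(pattern, sa[lo]):
        matches.append(sa[lo])
        lo += 1
    return matches

build_suffix_array("banana") gives [5, 3, 1, 0, 4, 2], and find_occurrences("banana", [5, 3, 1, 0, 4, 2], "ana") gives [3, 1].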

Good at: Substring search, longest common substring

Complexity: O(m log n) for search — m is the pattern length, n is the string length

Merits: More memory‑efficient than a suffix tree (no pointers)

Use cases: DNA sequence search, large‑scale text search, bzip2 compression

Suffix Tree

A trie of all suffixes with path compression. More capable than a suffix array but uses more memory.

String: "banana$"

        root
      / |  |  \
     $  a  na  banana$
       / \  / \
      $  na $  na$
        /  \
       $    na$

Common prefixes are compressed into single edges.

Good at: Substring search, longest common substring, longest repeated substring

Complexity: O(m) for search — faster than a suffix array

Drawbacks: High memory usage (10–20× the original string)

Use cases: Bioinformatics (DNA/protein sequence analysis)
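
A real suffix tree is usually built with Ukkonen's algorithm, which is too long to sketch here. As a rough stand-in for the idea, here is a naive, uncompressed suffix trie; it still answers substring queries in O(m), but unlike a compressed suffix tree it can take O(n²) space and time to build:

def build_suffix_trie(s):
    # insert every suffix of s, character by character, into a nested-dict trie
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root

def contains_substring(root, pattern):
    # walk the pattern from the root: O(m) in the pattern length
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True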

BK‑tree (Burkhard–Keller Tree)

A tree structure based on edit distance (Levenshtein distance). Specialized for fuzzy search.

        "book"
       /  |  \
      1   2   3
     /    |    \
  "cook" "back" "banana"

Each child hangs off an edge labeled with its edit distance from its parent node.

How it works:

  • Goal: find all words whose edit distance to the query is at most d
  • If the distance from the query to the current node is k, only explore children whose edge labels lie in the range k−d to k+d

By the triangle inequality, you can prune the search space substantially.
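
A minimal sketch with a standard dynamic-programming edit distance and the pruning rule above (class and function names are my own):

def levenshtein(a, b):
    # classic row-by-row DP for edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete from a
                           cur[j - 1] + 1,               # insert into a
                           prev[j - 1] + (ca != cb)))    # substitute
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, word):
        self.word = word
        self.children = {}    # edge label (edit distance) -> child node

    def add(self, word):
        d = levenshtein(word, self.word)
        if d in self.children:
            self.children[d].add(word)
        else:
            self.children[d] = BKTree(word)

    def search(self, query, max_dist):
        d = levenshtein(query, self.word)
        results = [self.word] if d <= max_dist else []
        # triangle inequality: only edges in [d - max_dist, d + max_dist] can lead to matches
        for edge in range(d - max_dist, d + max_dist + 1):
            child = self.children.get(edge)
            if child is not None:
                results.extend(child.search(query, max_dist))
        return results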

Good at: Fuzzy search, spell correction, similar‑string search

Complexity: Depends on the distance threshold; the smaller the threshold, the faster.

Use cases: Spell checkers, “Did you mean?” features

n‑gram Index

Split strings into n‑character chunks and index them. Commonly used for Japanese partial‑match search.

String: "検索エンジン" (search engine)

2-gram (bigram):
["検索", "索エ", "エン", "ンジ", "ジン"]

3-gram (trigram):
["検索エ", "索エン", "エンジ", "ンジン"]

Good at: Partial‑match search; Japanese search without morphological analysis

Complexity: O(1) to fetch an n‑gram’s postings list

Drawbacks: Large index size (several times the original data)

Use cases: PostgreSQL pg_trgm, MySQL ngram parser, a helper for Japanese full‑text search

Bloom Filter

A probabilistic data structure that quickly says “not present”. It allows false positives but no false negatives.

Bit array:  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Add "cat" (hash values: 2, 5, 8):
            [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]

Add "dog" (hash values: 1, 5, 9):
            [0, 1, 1, 0, 0, 1, 0, 0, 1, 1]

How it works:

  • Insert: set bits using k hash functions
  • Query: if all k bits are 1 → “probably present”; if any bit is 0 → “definitely not present”
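
A minimal sketch using k salted SHA-256 hashes as the hash family (any set of independent hash functions would do; the size and k defaults here are arbitrary):

import hashlib

class BloomFilter:
    def __init__(self, size=1024, k=3):
        self.size = size
        self.k = k
        self.bits = [0] * size

    def _positions(self, item):
        # derive k bit positions from salted hashes of the item
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means definitely absent; True means only "probably present"
        return all(self.bits[pos] == 1 for pos in self._positions(item))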

Good at: Pre‑filtering for existence checks; avoiding unnecessary disk access

Complexity: O(k) — k is the number of hash functions

Properties: False‑positive rate is tunable (larger bit arrays reduce it)

Use cases: Pre‑filter before SSTable search in LevelDB/RocksDB, negative lookups in caches, spam filtering

B+ tree

The balanced tree used by RDB indexes. Leaf nodes are connected via links.

        [30 | 60]
       /    |    \
   [10|20] [40|50] [70|80|90]
     ↓       ↓         ↓
    Leaf  →  Leaf  →  Leaf  (linked together)

How it works:

  • Internal nodes store keys and pointers to children
  • Leaf nodes hold the actual records (or pointers)
  • Leaves are linked → fast range scans
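
A full B+ tree with insertion and node splitting is too long to sketch here, so the snippet below shows only the read side, assuming the tree is already built: descend the internal nodes to a leaf, then follow the leaf links for a range scan (class names are my own):

import bisect

class Internal:
    def __init__(self, keys, children):
        self.keys = keys            # separator keys
        self.children = children    # len(children) == len(keys) + 1

class Leaf:
    def __init__(self, keys, values, next_leaf=None):
        self.keys = keys
        self.values = values
        self.next = next_leaf       # the link that makes range scans cheap

def find_leaf(node, key):
    # descend from the root to the leaf that may contain `key`
    while isinstance(node, Internal):
        i = bisect.bisect_right(node.keys, key)
        node = node.children[i]
    return node

def range_scan(root, low, high):
    # start at the leaf containing `low`, then walk the leaf links rightward
    leaf = find_leaf(root, low)
    while leaf is not None:
        for k, v in zip(leaf.keys, leaf.values):
            if k > high:
                return
            if k >= low:
                yield (k, v)
        leaf = leaf.next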

Good at: Range queries, ordered scans, equality lookup

Complexity: O(log n) for search/insert/delete

Merits: Optimized for disk I/O (1 node = 1 page)

Use cases: MySQL InnoDB, PostgreSQL, SQLite, file systems (NTFS, ext4)

LSM tree (Log‑Structured Merge Tree)

Appends writes sequentially and merges them in the background. Strong for write‑heavy workloads.

Memory:   [MemTable] ← new writes
              ↓ flush
Disk:     Level 0: [SST] [SST] [SST]
              ↓ compaction
          Level 1: [SST] [SST]

          Level 2: [SST]

How it works:

  1. Writes append to an in‑memory MemTable (also write a Write‑Ahead Log)
  2. When the MemTable reaches a threshold, flush to disk as an immutable SSTable
  3. SSTables are immutable; background compaction merges them

flowchart LR
    subgraph Memory
        W[Write] --> MT[MemTable]
        MT --> WAL[WAL]
    end
    subgraph Disk
        MT -->|flush| L0[Level 0]
        L0 -->|compact| L1[Level 1]
        L1 -->|compact| L2[Level 2]
    end
    R[Read] --> MT
    R --> L0
    R --> L1
    R --> L2

On reads, search in order: MemTable → Level 0 → Level 1 → … Stop once found.
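
A toy sketch of the write and read paths, assuming a plain dict as the MemTable and sorted lists of (key, value) pairs standing in for SSTables; the WAL, levels, and compaction are omitted:

import bisect

class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.sstables = []               # newest run first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # writes always go to the in-memory table (a real system also appends to a WAL)
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # freeze the MemTable as an immutable, sorted run (an "SSTable")
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # read path: MemTable first, then runs from newest to oldest, stop once found
        if key in self.memtable:
            return self.memtable[key]
        for run in self.sstables:
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None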

Good at: Write‑heavy workloads, time‑series data

Complexity:

  • Writes: O(1) amortized (sequential writes)
  • Reads: O(log n) × number of levels (scan multiple SSTables)

Drawbacks: Reads often need to touch multiple levels and can be slower than B+ trees

Use cases: LevelDB, RocksDB, Cassandra, InfluxDB

Comparison Table

Data Structure    Search Type        Search Complexity    Memory    Primary Uses
Trie              Prefix match       O(m)                 Large     Autocomplete
Double‑Array      Exact / prefix     O(m)                 Small     Morphological analysis
Inverted Index    Full‑text          O(1)                 Medium    Search engines
Suffix Array      Substring          O(m log n)           Small     grep, DNA
Suffix Tree       Substring          O(m)                 Large     String analysis
BK‑tree           Fuzzy              Varies               Medium    Spell correction
n‑gram            Partial match      O(1)                 Large     Japanese search
Bloom Filter      Existence check    O(k)                 Tiny      Filtering
B+ tree           Range queries      O(log n)             Medium    RDB indexes
LSM tree          Write‑optimized    O(log n) × levels    Medium    KVS, time‑series DBs

Note: m is the key/pattern length, n is the number of items (the text length for the suffix structures), and k is the number of hash functions.

Combinations in Real Systems

In real systems, multiple data structures are often combined.

Elasticsearch / Apache Lucene:

  • Inverted index (text search)
  • BKD tree (numeric/geo range queries)
  • FST (dictionary compression)

MeCab / Sudachi:

  • Double‑Array (dictionary lookup)
  • Minimum‑cost method (select the optimal morpheme sequence)

Pagefind:

  • Inverted index
  • gzip compression + chunking (bandwidth optimization)

PostgreSQL:

  • B+ tree (primary index)
  • GIN (inverted index for full‑text search)
  • GiST (spatial data, range types)
  • BRIN (block ranges for very large tables)

LevelDB / RocksDB:

  • LSM tree (main structure)
  • Bloom filter (pre‑filter before SSTable lookup)

Redis:

  • Hash table (core KVS)
  • Skip list (sorted sets)
  • Radix tree (prefix search on keys)

Choose a structure based on the search type and the characteristics of your data: read/write ratio, data size, and query patterns. It is easy to get stuck deciding, and there is no one‑size‑fits‑all answer; in practice you combine whatever structures fit your use case. Structures whose cost grows quickly as the search scope widens, in particular, need careful application.