A Complete Roundup of Data Structures for Fast Search — Trie, Inverted Index, Suffix Array, Double-Array Trie
There are many kinds of “search”: exact match, prefix match, substring search, fuzzy search, full‑text search… The best data structure changes depending on what kind of search you want to do.
It’s tempting to just toss everything into a DB and call it a day, but that’s not really a good idea.
So, I put together 10 major data structures used for search workloads. For each one, I outline how it works, which search types it excels at, complexity, and real‑world use cases.
Types of Search
Let’s first organize the kinds of search.
| Search Type | Description | Example |
|---|---|---|
| Exact match | Key matches exactly | Dictionary lookup, hash table |
| Prefix match | Starts with the given prefix | Autocomplete “abc…” |
| Substring | Contained somewhere in the string | grep, Ctrl+F |
| Fuzzy search | Small edit distance | Spell correction “teh → the” |
| Full‑text search | Find documents that contain terms | Google Search, on‑site search |
| Range query | Value range | “Price between 1,000–2,000” |
Trie
A tree that follows one character per node. The basic structure for prefix search.
        root
       /  |  \
      c   d   t
      |   |   |
      a   o   o
     / \  |   |
    t   r g   p
A trie that stores “cat”, “car”, “dog”, and “top”.
Good at: Prefix search, autocomplete, detecting common prefixes
Complexity: O(m) — m is the key length
Drawbacks: Memory‑inefficient. Each node keeps pointers to its children, so the tree tends to be sparse.
Use cases: IME candidate lists, URL routing, IP address lookup (Patricia trie)
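The prefix walk above can be sketched in a few lines of Python (class and method names are illustrative, not from any library):

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_end = False  # True if a stored key ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def starts_with(self, prefix):
        """Return all stored keys that begin with `prefix`."""
        node = self.root
        for ch in prefix:                  # one node per character: O(m)
            if ch not in node.children:
                return []
            node = node.children[ch]
        results = []
        def collect(n, path):
            if n.is_end:
                results.append(path)
            for ch, child in n.children.items():
                collect(child, path + ch)
        collect(node, prefix)
        return results

trie = Trie()
for w in ["cat", "car", "dog", "top"]:
    trie.insert(w)
print(sorted(trie.starts_with("ca")))  # ['car', 'cat']
```

The dict-per-node layout is exactly what makes a naive trie memory-hungry; the double-array below packs the same transitions into two flat arrays.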
Double‑Array (Double‑Array Trie)
A speed‑up technique that represents a trie with two arrays. Used internally by morphological analyzers such as MeCab and Sudachi.
Index: 0 1 2 3 4 5 6 7
BASE:  [-  1  3  -  5  -  -  2 ]
CHECK: [-  0  1  -  2  -  -  1 ]
How it works (from state s with character code c):
- Compute the transition target: t = BASE[s] + c
- Verify the origin: if CHECK[t] == s, the transition exists; otherwise state s has no transition for c
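The transition rule can be sketched directly; the BASE/CHECK values and character codes here are hypothetical (real analyzers build these tables from a lexicon):

```python
SIZE = 16
BASE = [0] * SIZE      # BASE[s]: offset used to place state s's children
CHECK = [-1] * SIZE    # CHECK[t]: which state owns slot t

def add_transition(s, c):
    t = BASE[s] + c
    CHECK[t] = s       # record that slot t was reached from state s
    return t

def transition(s, c):
    t = BASE[s] + c                       # candidate next state
    return t if CHECK[t] == s else None   # verify ownership

BASE[0] = 1
add_transition(0, 2)        # state 0 --(code 2)--> state 3
print(transition(0, 2))     # 3
print(transition(0, 5))     # None: CHECK[6] was never set by state 0
```

Two array reads per character, with no pointer chasing, is where the cache-locality advantage over a pointer-based trie comes from.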
Good at: Dictionary lookup; morphological analysis (fast lookup over large lexicons)
Complexity: O(m) — same as a trie, but with better cache locality
Drawbacks: Complex to build. Poor at dynamic insert/delete (requires relocating array regions).
Use cases: MeCab, Sudachi, ATOK, Google Japanese Input
Inverted Index
A mapping from “term → list of documents that contain the term”. The foundation of full‑text search.
[Original data]
Doc1: "猫が寝ている" (the cat is sleeping)
Doc2: "犬が走っている" (the dog is running)
Doc3: "猫が走っている" (the cat is running)
[Inverted index]
猫 (cat) → [Doc1, Doc3]
犬 (dog) → [Doc2]
寝 (sleep) → [Doc1]
走 (run) → [Doc2, Doc3]
It’s called “inverted” because it reverses the usual “document → terms” relationship.
Good at: Full‑text search, AND/OR queries, phrase search
Complexity: O(1) to fetch a postings list; an AND query merges two postings lists of lengths n and m in O(n + m).
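Building the index and running an AND query is short enough to show inline. This sketch indexes single characters as terms, matching the example above; a real engine would tokenize into words first:

```python
from collections import defaultdict

docs = {
    "Doc1": "猫が寝ている",
    "Doc2": "犬が走っている",
    "Doc3": "猫が走っている",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for ch in text:
        index[ch].add(doc_id)   # term -> set of documents containing it

# AND query: intersect the postings lists of both terms
hits = index["猫"] & index["走"]
print(sorted(hits))             # ['Doc3']
```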
Additional fields:
- TF (term frequency)
- IDF (inverse document frequency)
- Positional data (for phrase search)
Use cases: Elasticsearch, Apache Lucene, Pagefind, Algolia
Suffix Array
An array of the starting positions of all suffixes of a string, sorted in suffix order. Enables substring search via binary search.
String: "banana"
Suffixes:
0: banana
1: anana
2: nana
3: ana
4: na
5: a
After sorting (the suffix array):
[5, 3, 1, 0, 4, 2]
 a  ana  anana  banana  na  nana
Because the suffixes are sorted, all occurrences of a pattern such as “ana” form a contiguous run that binary search can locate.
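A minimal sketch of that binary search, using a sentinel character above any real one to bound the run (function names are illustrative):

```python
from bisect import bisect_left

text = "banana"
# Pair each suffix with its start index and sort; the indices are the suffix array.
suffixes = sorted((text[i:], i) for i in range(len(text)))
sa = [i for _, i in suffixes]
print(sa)  # [5, 3, 1, 0, 4, 2]

def find_occurrences(pattern):
    # Matches occupy a contiguous run of the sorted suffixes,
    # so two binary searches bound it: O(m log n).
    lo = bisect_left(suffixes, (pattern,))
    hi = bisect_left(suffixes, (pattern + "\uffff",))
    return sorted(i for _, i in suffixes[lo:hi])

print(find_occurrences("ana"))  # [1, 3]
```

Materializing every suffix string costs O(n²) memory; production builders (e.g. SA-IS) keep only the index array.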
Good at: Substring search, longest common substring
Complexity: O(m log n) for search — m is the pattern length, n is the string length
Merits: More memory‑efficient than a suffix tree (no pointers)
Use cases: DNA sequence search, large‑scale text search, bzip2 compression
Suffix Tree
A trie of all suffixes with path compression. More capable than a suffix array but uses more memory.
String: "banana$"
              root
         /   /    \    \
  banana$   a      na   $
           / \    /  \
         na   $ na$   $
        /  \
      na$   $
Common prefixes are compressed into single edges.
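A naive construction makes the compression concrete: insert every suffix into a plain trie, then merge chains of single-child nodes into one labeled edge. This is an O(n²) sketch for illustration; real implementations use Ukkonen's O(n) algorithm:

```python
def build_suffix_tree(text):
    text += "$"                       # terminator so no suffix is a prefix of another
    root = {}
    for i in range(len(text)):        # naive: insert every suffix into a trie
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})

    def compress(node):
        out = {}
        for ch, child in node.items():
            label, cur = ch, child
            while len(cur) == 1:      # chain of single children -> one edge label
                (nch, nxt), = cur.items()
                label += nch
                cur = nxt
            out[label] = compress(cur)
        return out

    return compress(root)

tree = build_suffix_tree("banana")
print(sorted(tree))  # ['$', 'a', 'banana$', 'na']
```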
Good at: Substring search, longest common substring, longest repeated substring
Complexity: O(m) for search — faster than a suffix array
Drawbacks: High memory usage (10–20× the original string)
Use cases: Bioinformatics (DNA/protein sequence analysis)
BK‑tree (Burkhard–Keller Tree)
A tree structure based on edit distance (Levenshtein distance). Specialized for fuzzy search.
"book"
/ | \
1 2 3
/ | \
"cook" "back" "banana"
Nodes are placed according to their edit distance from the root.
How it works:
- Goal: find all words whose edit distance to the query is at most d
- If the distance from the query to the current node is k, explore only the children whose edge labels lie in the range k−d to k+d
By the triangle inequality, you can prune the search space substantially.
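The pruning rule fits in a small class. The Levenshtein function is a standard single-row DP; class and method names are illustrative:

```python
def edit_distance(a, b):
    """Levenshtein distance via a single-row DP."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

class BKTree:
    def __init__(self, word):
        self.word = word
        self.children = {}            # edge distance -> subtree

    def add(self, word):
        d = edit_distance(word, self.word)
        if d in self.children:
            self.children[d].add(word)
        else:
            self.children[d] = BKTree(word)

    def search(self, query, d_max):
        k = edit_distance(query, self.word)
        hits = [self.word] if k <= d_max else []
        # Triangle inequality: only edges in [k - d_max, k + d_max] can match.
        for edge in range(k - d_max, k + d_max + 1):
            if edge in self.children:
                hits += self.children[edge].search(query, d_max)
        return hits

tree = BKTree("book")
for w in ["cook", "back", "boot", "bake"]:
    tree.add(w)
print(sorted(tree.search("bok", 1)))  # ['book']
```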
Good at: Fuzzy search, spell correction, similar‑string search
Complexity: Depends on the distance threshold; the smaller the threshold, the faster.
Use cases: Spell checkers, “Did you mean?” features
n‑gram Index
Split strings into n‑character chunks and index them. Commonly used for Japanese partial‑match search.
String: "検索エンジン"
2-gram (bigram):
["検索", "索エ", "エン", "ンジ", "ジン"]
3-gram (trigram):
["検索エ", "索エン", "エンジ", "ンジン"]
Good at: Partial‑match search; Japanese search without morphological analysis
Complexity: O(1) to fetch an n‑gram’s postings list
Drawbacks: Large index size (several times the original data)
Use cases: PostgreSQL pg_trgm, MySQL ngram parser, a helper for Japanese full‑text search
Bloom Filter
A probabilistic data structure that quickly says “not present”. It allows false positives but no false negatives.
Bit array: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Add "cat" (hash values: 2, 5, 8):
           [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]
Add "dog" (hash values: 1, 5, 9):
           [0, 1, 1, 0, 0, 1, 0, 0, 1, 1]
How it works:
- Insert: set bits using k hash functions
- Query: if all k bits are 1 → “probably present”; if any bit is 0 → “definitely not present”
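A minimal sketch, deriving the k "hash functions" from slices of one SHA-256 digest; that choice and the sizes are illustrative, not a production design:

```python
import hashlib

M, K = 64, 2          # bit-array size and number of hash functions (assumed)

def positions(item):
    """Derive K bit positions from one SHA-256 digest."""
    digest = hashlib.sha256(item.encode()).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % M for i in range(K)]

bits = [0] * M

def add(item):
    for p in positions(item):
        bits[p] = 1

def might_contain(item):
    # Any zero bit proves absence; all ones only *suggest* presence.
    return all(bits[p] for p in positions(item))

add("cat")
add("dog")
assert might_contain("cat")   # anything added is always reported present
# An unseen item is probably reported absent, but false positives are possible.
print(might_contain("bird"))
```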
Good at: Pre‑filtering for existence checks; avoiding unnecessary disk access
Complexity: O(k) — k is the number of hash functions
Properties: False‑positive rate is tunable (larger bit arrays reduce it)
Use cases: Pre‑filter before SSTable search in LevelDB/RocksDB, negative lookups in caches, spam filtering
B+ tree
The balanced tree used by RDB indexes. Leaf nodes are connected via links.
           [30 | 60]
          /    |    \
   [10|20] [40|50] [70|80|90]
       ↓       ↓       ↓
     leaf  →  leaf  →  leaf (linked together)
How it works:
- Internal nodes store keys and pointers to children
- Leaf nodes hold the actual records (or pointers)
- Leaves are linked → fast range scans
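This is not a full B+ tree, only a sketch of its leaf level, showing why the sibling links make range scans cheap once the start leaf is found (class and key values mirror the diagram above):

```python
from bisect import bisect_left

class Leaf:
    def __init__(self, keys):
        self.keys = keys      # sorted keys within one page
        self.next = None      # link to the right sibling leaf

leaves = [Leaf([10, 20]), Leaf([40, 50]), Leaf([70, 80, 90])]
for a, b in zip(leaves, leaves[1:]):
    a.next = b

def range_scan(start_leaf, lo, hi):
    """Walk the leaf chain, collecting keys in [lo, hi]."""
    leaf, out = start_leaf, []
    while leaf:
        for k in leaf.keys[bisect_left(leaf.keys, lo):]:
            if k > hi:
                return out    # keys are globally sorted, so we can stop
            out.append(k)
        leaf = leaf.next      # no need to go back up through internal nodes
    return out

# In a real B+ tree the internal node [30 | 60] would locate the start
# leaf in O(log n); here we simply start from the leftmost leaf.
print(range_scan(leaves[0], 20, 70))  # [20, 40, 50, 70]
```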
Good at: Range queries, ordered scans, equality lookup
Complexity: O(log n) for search/insert/delete
Merits: Optimized for disk I/O (1 node = 1 page)
Use cases: MySQL InnoDB, PostgreSQL, SQLite, file systems (NTFS, ext4)
LSM tree (Log‑Structured Merge Tree)
Appends writes sequentially and merges them in the background. Strong for write‑heavy workloads.
Memory: [MemTable] ← new writes
            ↓ flush
Disk:   Level 0: [SST] [SST] [SST]
            ↓ compaction
        Level 1: [SST] [SST]
            ↓
        Level 2: [SST]
How it works:
- Writes append to an in‑memory MemTable (also write a Write‑Ahead Log)
- When the MemTable reaches a threshold, flush to disk as an immutable SSTable
- SSTables are immutable; background compaction merges them
```mermaid
flowchart LR
    subgraph Memory
        W[Write] --> MT[MemTable]
        MT --> WAL[WAL]
    end
    subgraph Disk
        MT -->|flush| L0[Level 0]
        L0 -->|compact| L1[Level 1]
        L1 -->|compact| L2[Level 2]
    end
    R[Read] --> MT
    R --> L0
    R --> L1
    R --> L2
```
On reads, search in order: MemTable → Level 0 → Level 1 → … Stop once found.
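The write and read paths above can be condensed into a toy sketch; the flush threshold, linear SSTable scan, and all names are simplifications for illustration (real SSTables are binary-searched, and a WAL is omitted):

```python
MEMTABLE_LIMIT = 2
memtable = {}
sstables = []            # newest first; each is an immutable sorted run

def put(key, value):
    memtable[key] = value                             # O(1) in-memory write
    if len(memtable) >= MEMTABLE_LIMIT:
        sstables.insert(0, sorted(memtable.items()))  # flush as immutable SSTable
        memtable.clear()

def get(key):
    if key in memtable:                   # 1. memtable holds the newest data
        return memtable[key]
    for table in sstables:                # 2. newer SSTables shadow older ones
        for k, v in table:
            if k == key:
                return v                  # stop at the first (newest) hit
    return None

put("a", 1)
put("b", 2)              # reaches the limit: flush to the first SSTable
put("a", 9)              # the newer value for "a" lives in the memtable
print(get("a"), get("b"))  # 9 2
```

The shadowing in `get` is why compaction exists: merging runs discards stale versions and caps how many levels a read must touch.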
Good at: Write‑heavy workloads, time‑series data
Complexity:
- Writes: O(1) amortized (sequential writes)
- Reads: O(log n) × number of levels (scan multiple SSTables)
Drawbacks: Reads often need to touch multiple levels and can be slower than B+ trees
Use cases: LevelDB, RocksDB, Cassandra, InfluxDB
Comparison Table
| Data Structure | Search Type | Search Complexity | Memory | Primary Uses |
|---|---|---|---|---|
| Trie | Prefix match | O(m) | Large | Autocomplete |
| Double‑Array | Exact/Prefix | O(m) | Small | Morphological analysis |
| Inverted Index | Full‑text | O(1) | Medium | Search engines |
| Suffix Array | Substring | O(m log n) | Small | grep, DNA |
| Suffix Tree | Substring | O(m) | Large | String analysis |
| BK‑tree | Fuzzy | Variable | Medium | Spell correction |
| n‑gram | Partial match | O(1) | Large | Japanese search |
| Bloom Filter | Existence check | O(k) | Tiny | Filtering |
| B+ tree | Range queries | O(log n) | Medium | RDB indexes |
| LSM tree | Write‑optimized | O(log n) × levels | Medium | KVS, time‑series DB |
Note: m is key/pattern length, n is the number of items, and k is the number of hash functions.
Combinations in Real Systems
In real systems, multiple data structures are often combined.
Elasticsearch / Apache Lucene:
- Inverted index (text search)
- BKD tree (numeric/geo range queries)
- FST (dictionary compression)
MeCab / Sudachi:
- Double‑Array (dictionary lookup)
- Minimum‑cost method (select the optimal morpheme sequence)
Pagefind:
- Inverted index
- gzip compression + chunking (bandwidth optimization)
PostgreSQL:
- B+ tree (primary index)
- GIN (inverted index for full‑text search)
- GiST (spatial data, range types)
- BRIN (block ranges for very large tables)
LevelDB / RocksDB:
- LSM tree (main structure)
- Bloom filter (pre‑filter before SSTable lookup)
Redis:
- Hash table (core KVS)
- Skip list (sorted sets)
- Radix tree (prefix search on keys)
Choose a structure based on the search type and the characteristics of your data: read/write ratio, data size, and query patterns. It is easy to get stuck deciding, but there is no one-size-fits-all answer; as the systems above show, in practice you combine the structures that fit your use case. Be especially careful with structures whose cost grows quickly as the search scope widens.