3GB SQLite FTS to 10MB Rust FST: conditions for 300× on a Finnish dictionary
A 3GB SQLite database shrinking down to a 10MB FST binary is a flashy number.
But this is not a story where swapping SQLite for FST always cuts size by 300×.
Reading Andrew Quinn’s original post, it’s clear the Finnish dictionary’s data shape happened to fit FST very well.
The search target is essentially static at runtime, and the keys are a huge set of inflected word forms.
On top of that, Finnish repeats suffixes constantly.
Under these conditions, a data structure that compresses the string set itself and searches inside it fits better than a generic full-text engine.
tsk’s problem didn’t end at prefix matching
The target is Taskusanakirja, abbreviated tsk, a Finnish-English dictionary.
It has incremental search that returns candidates as you type, so the starting problem is prefix matching.
For prefix matching, a Trie is the obvious fit.
It’s a tree branching on each character, so storing words like katu and kadun as keys lets you reach candidates in time proportional to input length.
This blog has a separate piece on search data structures that walks through Trie and inverted indexes.
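To make the prefix-matching idea concrete, here is a minimal std-only Trie sketch. It is illustrative code, not from the tsk project: lookup walks one node per byte of input, then enumerates only the matching subtree.

```rust
use std::collections::HashMap;

/// A minimal Trie: one node per byte, branching on each character.
#[derive(Default)]
struct Trie {
    children: HashMap<u8, Trie>,
    is_word: bool,
}

impl Trie {
    fn insert(&mut self, word: &str) {
        let mut node = self;
        for b in word.bytes() {
            node = node.children.entry(b).or_default();
        }
        node.is_word = true;
    }

    /// Collect every stored word starting with `prefix`.
    fn with_prefix(&self, prefix: &str) -> Vec<String> {
        let mut node = self;
        // Walking the prefix costs time proportional to its length.
        for b in prefix.bytes() {
            match node.children.get(&b) {
                Some(next) => node = next,
                None => return Vec::new(),
            }
        }
        let mut out = Vec::new();
        node.collect(prefix.as_bytes().to_vec(), &mut out);
        out.sort();
        out
    }

    fn collect(&self, path: Vec<u8>, out: &mut Vec<String>) {
        if self.is_word {
            out.push(String::from_utf8(path.clone()).unwrap());
        }
        for (b, child) in &self.children {
            let mut next = path.clone();
            next.push(*b);
            child.collect(next, out);
        }
    }
}

fn main() {
    let mut trie = Trie::default();
    for w in ["katu", "kadun", "katua", "talo"] {
        trie.insert(w);
    }
    println!("{:?}", trie.with_prefix("ka")); // ["kadun", "katu", "katua"]
}
```

Note that katu and kadun share only the `ka` node here, which is exactly the limitation the next section runs into.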
Quinn’s first implementation was in Go: it memoized results for 1-, 2-, and 3-character combinations, with a cap on the candidate count.
About 400k entries fit in 50MB to 60MB.
Then Finnish inflection becomes the problem.
Finnish is agglutinative — the root takes case endings, plurals, possessive suffixes, and clitics stacked together.
The original post mentions that with inflected forms, entries balloon to 40M-60M.
Trie can share common prefixes.
But it can’t easily share large numbers of words that end the same way.
Even when the same tail, such as the stacked -ssammekin (inessive -ssa + possessive -mme + clitic -kin), shows up repeatedly, the prefix-tree structure leaves the trailing duplication in place.
SQLite FTS wasn’t a wrong turn, it was a working detour
So at one point the project moved to SQLite FTS.
SQLite FTS5 is a virtual-table module that offers MATCH search, prefix queries, phrase search, NEAR, and boolean queries over a document set.
The official docs describe FTS5 as a feature for efficiently finding the subset of documents containing a search term within a large corpus.
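As a sketch of that shape (table and column names here are hypothetical, not from tsk):

```sql
-- An FTS5 virtual table over dictionary entries
CREATE VIRTUAL TABLE entries USING fts5(headword, definition);

INSERT INTO entries VALUES ('katu', 'street'),
                           ('kadun', 'street (genitive)');

-- Prefix query: rows whose indexed text has a token starting with "ka"
SELECT headword FROM entries WHERE entries MATCH 'ka*';
```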
In tsk, the SQLite FTS version also worked without perceptible latency.
The problem was distribution size.
Even if it’s only a one-time download, requiring 3GB is heavy.
For a product positioned as “a pocket dictionary that runs on an old laptop,” that 3GB becomes direct shipping weight.
It’s hard to call this stage a SQLite failure.
Having a working search path gave them a baseline to compare against when they later built the FST version.
The Hacker News thread also framed it the same way — build the simple implementation first, then optimize against it.
FST shares suffixes too
The Rust fst crate represents an ordered set or map of byte strings as a finite-state machine.
The docs.rs description says it compresses both common prefixes and common suffixes, and supports search via regex, Levenshtein automata, and lexicographic range.
The difference from a Trie is suffix sharing.
Not only do words with the same prefix share state — paths that fall into the same suffix also collapse into the same state.
```mermaid
flowchart TD
    A["Expand inflected forms"] --> B["Massive set of string keys"]
    B --> C["Sort lexicographically"]
    C --> D["Build FST"]
    D --> E["Share common prefixes"]
    D --> F["Share common suffixes"]
    E --> G["Small search-time binary"]
    F --> G
```
Looking at the diagram alone, it can read like a pure compression pipeline, but FST is also a search structure.
You don’t decompress and then search like gzip — you walk the compressed state machine directly.
In Quinn’s case, 3GB of SQLite FTS data turned into a 10MB FST binary.
The numbers are rounded in the original post, but in terms of orders of magnitude, that’s a 300× reduction.
When the conditions line up it wins, but it’s not a general DB
Looking at the fst crate’s constraints shows just how condition-dependent this compression is.
Keys are byte strings, and map values are unsigned 64-bit integers (u64).
At build time, keys must be inserted in lexicographic order.
It’s not suited to workloads that frequently add or remove entries at runtime.
There’s also a quirk in how it reads at search time.
The FST file can be memory-mapped, so you don’t have to load the whole thing into process heap to search it.
But the traversal is closer to random access, so on slow disks or when the page cache misses, you lose ground.
With SSDs it’s quite practical, but you can’t drop FST in as a casual SQLite replacement.
For tsk, the dictionary data is static at runtime, the search keys are abundant, the same suffixes repeat, and the value is a small reference like a definition ID.
In that combination, the FST constraints have almost no impact.
| Condition | SQLite FTS | Rust FST |
|---|---|---|
| Updates | Natural with INSERT/UPDATE | Rebuild-oriented |
| Search target | Documents and tokens | Ordered string keys |
| Search features | MATCH, ranking, snippets | Prefix, range, automaton |
| Data shape | General-purpose | Stronger with shared prefixes and suffixes |
| Distribution size | Tends to bloat with the index | Can be very small when conditions match |
SQLite FTS is a component for document search.
It has ranking, snippets, phrase search, NEAR.
FST is narrower — a component to hold a string set or string map compactly and traverse it fast.
So reading this story as “SQLite is heavy” misses the point.
The accurate reading is that an FTS engine, when used as a static map of inflected-form keys, was replaced with a narrower representation that fit the use case, and the result shrank dramatically.
It smells like morphological dictionaries and offline search
The Finnish example looks extreme, but “ship a large static dictionary on-device” is a common use case.
Morphological-analysis dictionaries, IME dictionaries, spell checkers, offline dictionaries, in-browser search, and CLI tool completion candidates all sit in adjacent territory.
That said, a Japanese dictionary won’t necessarily hit the same compression ratio.
Japanese involves word segmentation, readings, spelling variants, inflection, part-of-speech, and cost tables.
Morphological analyzers like MeCab and Sudachi typically combine a double-array Trie with minimum-cost (Viterbi) path search, so the question is which tables can be FST-ified, not whether FST alone replaces everything.
Conversely, when candidate keys are pre-enumerated, results fit in an ID reference, and updates happen only at release time, it’s worth trying.
The fst crate examples show writing the build output to a file as a stream and reading at search time via memory map.
That pairs well with dictionary-style tools that want a single-file distribution.