# The kuromoji.js path.join() Bug and How to Load the Dictionary from a CDN
## What I Was Trying to Do
I wanted to add a Japanese morphological analysis tool to the Lab page — one that shows part-of-speech information.
## First Option: Sudachi
Sudachi has a WASM build, but the dictionary file is 50MB or more. That would instantly eat through Vercel’s transfer quota, so it was a non-starter.
## Second Option: kuromoji.js
kuromoji.js has a dictionary of about 12MB (gzip compressed), which is relatively light. If the dictionary is served from jsDelivr’s CDN, it won’t consume Vercel’s bandwidth.
```javascript
kuromoji.builder({
  dicPath: 'https://cdn.jsdelivr.net/npm/kuromoji@0.1.2/dict/'
}).build((err, tokenizer) => {
  // ...
});
```
## Works Locally
No problems in the development environment. But in the production build…
## 404 Error in Production
```
GET https://lilting.ch/cdn.jsdelivr.net/npm/kuromoji@0.1.2/dict/base.dat.gz 404
```
For some reason, the dictionary is being requested as a relative path: `lilting.ch/cdn.jsdelivr.net/...`.
## Root Cause: path.join() Breaks the URL
kuromoji.js internally uses Node.js's `path.join()` to construct dictionary paths.
```javascript
// Inside kuromoji.js
path.join(dicPath, 'base.dat.gz')
```

`path.join()` is a filesystem function, so it normalizes the double slash in `https://`:

```javascript
path.join('https://cdn.jsdelivr.net/dict/', 'base.dat.gz')
// → 'https:/cdn.jsdelivr.net/dict/base.dat.gz' (one slash gets removed)
```
As a result, the browser interprets this as a relative path from the current domain.
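The two-step failure can be reproduced outside kuromoji.js in a few lines of Node. This is a minimal sketch (using `path.posix.join` so the result is identical on every platform; the `new URL()` call mirrors how the browser resolves the mangled string against the current page per the WHATWG URL spec):

```typescript
import { posix } from 'node:path';

// Step 1: path.join() treats the URL as a filesystem path and
// collapses the double slash after the scheme down to one.
const joined = posix.join('https://cdn.jsdelivr.net/dict/', 'base.dat.gz');
console.log(joined); // 'https:/cdn.jsdelivr.net/dict/base.dat.gz'

// Step 2: the scheme matches the page's scheme but the '//' is gone,
// so the URL parser falls back to relative resolution against the page.
const resolved = new URL(joined, 'https://lilting.ch/lab');
console.log(resolved.href); // 'https://lilting.ch/cdn.jsdelivr.net/dict/base.dat.gz'
```

Which is exactly the 404 URL seen in production.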
This is a known bug reported in GitHub Issue #37.
## Solution: @patdx/kuromoji with a Custom Loader
`@patdx/kuromoji` is a fork of kuromoji.js that supports custom loaders.
```bash
pnpm add @patdx/kuromoji
```
Implement a custom loader to completely bypass `path.join()`:

```typescript
import * as kuromoji from '@patdx/kuromoji';

const CDN_DICT_BASE = 'https://cdn.jsdelivr.net/npm/kuromoji@0.1.2/dict/';

// Gzip decompression
async function decompressGzip(data: ArrayBuffer): Promise<ArrayBuffer> {
  const ds = new DecompressionStream('gzip');
  const stream = new Response(data).body!.pipeThrough(ds);
  return new Response(stream).arrayBuffer();
}

const customLoader: kuromoji.LoaderConfig = {
  async loadArrayBuffer(filename: string): Promise<ArrayBufferLike> {
    const url = CDN_DICT_BASE + filename;
    const res = await fetch(url);
    if (!res.ok) throw new Error(`Failed: ${url}`);
    const data = await res.arrayBuffer();
    // .gz files need decompression
    return filename.endsWith('.gz') ? decompressGzip(data) : data;
  }
};

const tokenizer = await new kuromoji.TokenizerBuilder({
  loader: customLoader
}).build();
```
## Key Points
- **Direct fetch**: build the URL with string concatenation instead of `path.join()`
- **Gzip decompression**: the dictionary files are `.gz`-compressed, so decompress them with the browser's `DecompressionStream` API
- **Zero Vercel bandwidth**: the dictionary is served from jsDelivr, so it doesn't consume Vercel's transfer quota
## Summary
| Approach | Problem |
|---|---|
| Sudachi WASM | Dictionary 50MB+, too heavy |
| kuromoji.js + CDN | path.join() bug breaks the URL |
| @patdx/kuromoji + custom loader | Works |
When implementing morphological analysis in the browser, kuromoji.js is convenient, but watch out for the `path.join()` issue when loading the dictionary from a CDN. Using the fork with a custom loader is the reliable approach.