
The kuromoji.js path.join() Bug and How to Load Dictionary from CDN

What I Was Trying to Do

I wanted to add a Japanese morphological analysis tool to the Lab page — one that shows part-of-speech information.

First Option: Sudachi

Sudachi has a WASM build, but the dictionary file is 50MB or more. That would instantly eat through Vercel’s transfer quota, so it was a non-starter.

Second Option: kuromoji.js

kuromoji.js has a dictionary of about 12MB (gzip compressed), which is relatively light. If the dictionary is served from jsDelivr’s CDN, it won’t consume Vercel’s bandwidth.

kuromoji.builder({
  dicPath: 'https://cdn.jsdelivr.net/npm/kuromoji@0.1.2/dict/'
}).build((err, tokenizer) => {
  // ...
});

Works Locally

No problems in the development environment. But in the production build…

404 Error in Production

GET https://lilting.ch/cdn.jsdelivr.net/npm/kuromoji@0.1.2/dict/base.dat.gz 404

For some reason, the dictionary URL is being resolved as a path relative to my own domain: lilting.ch/cdn.jsdelivr.net/....

Root Cause: path.join() Breaks the URL

kuromoji.js internally uses Node.js’s path.join() to construct dictionary paths.

// Inside kuromoji.js
path.join(dicPath, 'base.dat.gz')

path.join() is a filesystem-path function, so it normalizes the double slash in https:// down to a single slash:

path.join('https://cdn.jsdelivr.net/dict/', 'base.dat.gz')
// → 'https:/cdn.jsdelivr.net/dict/base.dat.gz'  // one slash gets removed

The mangled URL still starts with https: (the page's own scheme) but no longer has //, so the browser parses it as a path relative to the current origin.
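Both halves of the failure can be reproduced in a few lines of Node. This is a standalone sketch (path.posix.join is used for a deterministic result, and lilting.ch/lab stands in for the page making the request):

```typescript
import path from 'node:path';

// path.join() treats the URL as a filesystem path and collapses '//' to '/'.
const mangled = path.posix.join(
  'https://cdn.jsdelivr.net/npm/kuromoji@0.1.2/dict/',
  'base.dat.gz'
);
console.log(mangled);
// 'https:/cdn.jsdelivr.net/npm/kuromoji@0.1.2/dict/base.dat.gz'

// Per the WHATWG URL spec, a URL whose scheme matches the base's but that
// lacks '//' is parsed as relative to the base -- which is exactly how the
// browser resolves the fetch() issued from the page.
const resolved = new URL(mangled, 'https://lilting.ch/lab');
console.log(resolved.href);
// 'https://lilting.ch/cdn.jsdelivr.net/npm/kuromoji@0.1.2/dict/base.dat.gz'
```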

This is a known bug reported in GitHub Issue #37.

Solution: @patdx/kuromoji with a Custom Loader

@patdx/kuromoji is a fork of kuromoji that supports custom loaders.

pnpm add @patdx/kuromoji

Implement a custom loader to completely bypass path.join():

import * as kuromoji from '@patdx/kuromoji';

const CDN_DICT_BASE = 'https://cdn.jsdelivr.net/npm/kuromoji@0.1.2/dict/';

// Gzip decompression
async function decompressGzip(data: ArrayBuffer): Promise<ArrayBuffer> {
  const ds = new DecompressionStream('gzip');
  const stream = new Response(data).body!.pipeThrough(ds);
  return new Response(stream).arrayBuffer();
}
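The decompression step can be sanity-checked without downloading the real dictionary by round-tripping a buffer through the same Web Streams APIs (a minimal standalone sketch; gzipCompress and gzipDecompress are hypothetical helper names, and both streams require Node 18+ or a modern browser):

```typescript
// Compress with CompressionStream, then decompress with the same
// DecompressionStream logic used for the dictionary files.
async function gzipCompress(data: BufferSource): Promise<ArrayBuffer> {
  const cs = new CompressionStream('gzip');
  return new Response(new Response(data).body!.pipeThrough(cs)).arrayBuffer();
}

async function gzipDecompress(data: BufferSource): Promise<ArrayBuffer> {
  const ds = new DecompressionStream('gzip');
  return new Response(new Response(data).body!.pipeThrough(ds)).arrayBuffer();
}

const input = new TextEncoder().encode('dictionary bytes');
const restored = await gzipDecompress(await gzipCompress(input));
console.log(new TextDecoder().decode(restored)); // 'dictionary bytes'
```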

const customLoader: kuromoji.LoaderConfig = {
  async loadArrayBuffer(filename: string): Promise<ArrayBufferLike> {
    const url = CDN_DICT_BASE + filename;
    const res = await fetch(url);
    if (!res.ok) throw new Error(`Failed: ${url}`);

    const data = await res.arrayBuffer();
    // .gz files need decompression
    return filename.endsWith('.gz') ? decompressGzip(data) : data;
  }
};

const tokenizer = await new kuromoji.TokenizerBuilder({
  loader: customLoader
}).build();

Key Points

  1. Direct fetch: Build the URL with string concatenation instead of path.join()
  2. Gzip decompression: The dictionary is .gz compressed, so decompress it using the browser’s DecompressionStream API
  3. Zero Vercel bandwidth: The dictionary is served from jsDelivr, so it doesn’t consume Vercel transfer quota

Summary

Approach                           Problem
Sudachi WASM                       Dictionary 50MB+, too heavy
kuromoji.js + CDN                  path.join() bug breaks the URL
@patdx/kuromoji + custom loader    Works

When implementing morphological analysis in the browser, kuromoji.js is convenient, but watch out for the path.join() issue when loading the dictionary from a CDN. Using the fork with a custom loader is the reliable approach.

