
LLM-jp-4-32B-A3B on ROCm + Strix Halo: 41% Faster Than Qwen3.5

Ikesan

NII (National Institute of Informatics, Japan) released “LLM-jp-4-32B-A3B-thinking” on April 3, 2026. I benchmarked it on my EVO-X2 against Qwen3.5-35B-A3B, comparing speed, Japanese quality, code generation, and knowledge base.

What “Pure Japanese-Made” Actually Means

LLM-jp-4 uses the Qwen3MoE architecture, but it’s not a fine-tune or distillation of an existing model. NII trained it from scratch using 11.7 trillion tokens.

There are several models that call themselves “Japanese LLMs,” but the actual approaches differ significantly.

| Model | Developer | Base | Training Method |
|---|---|---|---|
| LLM-jp-4 | NII | Qwen3MoE architecture only | Scratch training (11.7T tokens) |
| Namazu | Sakana AI | DeepSeek-V3.1 etc. | Post-training on existing models |
| Rakuten AI 3.0 | Rakuten | DeepSeek-V3 | Continued pre-training + fine-tuning |

Sakana AI’s Namazu applies post-training to multiple existing models including DeepSeek-V3.1-Terminus and Llama 3.1 405B, focusing on correcting biases related to Japanese politics and history. An interesting approach, but the model weights are borrowed.

Rakuten AI 3.0 is a more problematic case. In March 2026, Rakuten announced it as “Japan’s largest high-performance AI model” under GENIAC (a government-subsidized project by METI). Shortly after release, the community found "model_type": "deepseek_v3" in the config.json, revealing it was based on DeepSeek-V3. The initial release had DeepSeek’s MIT license file removed and the official blog made no mention of DeepSeek. After community backlash, Rakuten added license attribution to a NOTICE file and finally acknowledged on March 27 that they “adopted DeepSeek’s open model.” Using DeepSeek-V3 under MIT is perfectly legal, but hiding the base model, removing license attribution, and presenting it as a “domestic model” built with government subsidies drew justified criticism.

LLM-jp-4 is fundamentally different. Building on existing weights is far easier and faster, but NII chose the hardest path: borrowing only the architecture definition and training 11.7 trillion tokens from random initialization.

| Aspect | LLM-jp-4 |
|---|---|
| Pre-trained weights | None. Trained from random initialization |
| Synthetic data | Not used. Data generated or filtered by commercial LLMs (GPT, Claude, etc.) explicitly excluded |
| Developer | NII (National Institute of Informatics). Academic institution, not a commercial entity |
| License | Apache 2.0. Free for commercial use, modification, and redistribution |

An academic institution scratch-training an LLM and releasing it under Apache 2.0 is rare internationally. “Japanese-specialized” doesn’t mean they bolted Japanese onto an English model via fine-tuning. They designed the corpus from the ground up with Japanese at the center, training 11.7 trillion tokens from scratch. As the sampling ratios below show, they oversampled Japanese 4.5x during training to boost it from 3.5% of the corpus to 15.9%.

Model Overview

| Item | Spec |
|---|---|
| Name | LLM-jp-4-32B-A3B-thinking |
| Released | 2026-04-03 |
| Developer | NII (National Institute of Informatics) |
| Architecture | Qwen3MoE (MoE, 128 experts, 8 active) |
| Total Parameters | 32.1B |
| Active Parameters | 3.8B |
| Context Length | 65,536 tokens |
| Training Data | 11.7T tokens (Japanese-focused) |
| License | Apache 2.0 |
| Benchmark | MT-Bench JA 7.82 |

Benchmarks (llm-jp-judge, evaluated by GPT-5.4)

| Model | MT-Bench JA | MT-Bench EN | AnswerCarefully | llm-jp-instructions |
|---|---|---|---|---|
| GPT-5.4 (high) | 8.98 | 8.85 | 4.41 | 4.83 |
| LLM-jp-4-32B-A3B (medium) | 7.82 | 7.86 | 3.70 | 3.61 |
| LLM-jp-4-32B-A3B (low) | 7.57 | 7.70 | 3.61 | 3.61 |
| LLM-jp-4-8B (medium) | 7.54 | 7.79 | 3.69 | 3.54 |
| GPT-4o (2024-08-06) | 7.29 | 7.69 | 4.00 | 4.07 |
| Qwen3-8B | 7.14 | 7.69 | – | – |

MT-Bench JA 7.82 exceeds GPT-4o’s 7.29. The lower AnswerCarefully and llm-jp-instructions scores compared to GPT-4o reflect safety filter strictness and instruction-following tuning differences, not base model capability. Detailed scores from llm-jp-eval v2.1.3 (42 tasks) are only published as a radar chart; numerical tables remain unreleased. No technical report as of April 2026.

Upcoming releases: LLM-jp-4 32B (Dense model) and LLM-jp-4 332B-A31B (332B total, 31B active MoE) planned for FY2026.

The architecture is Qwen3MoE-based but structurally quite different from Qwen3.5-35B-A3B.

| | LLM-jp-4 32B-A3B | Qwen3.5-35B-A3B |
|---|---|---|
| Total Parameters | 32.1B (active 3.8B) | 34.66B (active 3B) |
| Experts | 128 / 8 active | 256 / 8 active |
| Layers | 32 | 40 |
| Hidden size | 2,560 | 2,048 |
| SSM/Mamba | None (Attention only) | Yes (SSM+Attention hybrid) |
| KV cache (ctx=65K) | 2,176 MiB | 680 MiB |

LLM-jp-4 uses Attention in all layers, so all 32 layers consume KV cache. Qwen3.5-35B-A3B, as previously tested, is an SSM (State Space Model) + Attention hybrid with 30 SSM layers and only 10 Attention layers. At ctx=65K, Qwen3.5’s KV cache is just 680 MiB while LLM-jp-4 needs 2,176 MiB (3x more).
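
Both published KV cache figures can be reproduced with a back-of-the-envelope formula. This sketch assumes 4 KV heads with a head dimension of 128 (typical for this model class; the exact head configuration is an assumption, not stated in the article) and the q8_0 cache quantization shown in the load log:

```python
def kv_cache_mib(attn_layers, ctx, kv_heads=4, head_dim=128):
    # q8_0 stores 32 values in 34 bytes (32 x int8 plus one fp16 scale).
    bytes_per_value = 34 / 32
    per_token = 2 * kv_heads * head_dim * bytes_per_value  # K and V per attention layer
    return attn_layers * ctx * per_token / 2**20

print(kv_cache_mib(32, 65536))  # LLM-jp-4: every layer is attention → 2176.0
print(kv_cache_mib(10, 65536))  # Qwen3.5: only 10 attention layers → 680.0
```

Under these assumptions the 3x gap falls directly out of the attention layer count: 32 vs 10.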

In exchange, LLM-jp-4 gains elsewhere: Japanese knowledge accuracy benefits from the oversampled training data described below, and with half the experts (128 vs 256) and fewer layers (32 vs 40), MoE routing overhead is lower, giving a significant inference speed advantage.

Why Full Attention

LLM-jp-4’s rationale for full Attention isn’t public, but we can speculate. Hybrid designs (SSM+Attention) like Qwen3.5-35B-A3B are efficient for long contexts, but alternating between SSM and Attention layers adds compression/decompression overhead at each layer transition. With only 32 layers, LLM-jp-4 may have been better served by keeping all layers as Attention and focusing optimization efforts on MoE routing.

The architecture’s efficiency shows in the speed benchmarks. 62.9 t/s is 41% faster than Qwen3.5’s 44.7 t/s.

How MoE (Mixture of Experts) Works

LLM-jp-4’s MoE has 128 experts, with the top 8 activated per token. The 3.8B active parameters represent these 8 experts’ combined weights. Not all 128 experts compute simultaneously; a Router (gating network) dynamically selects which 8 to use based on each input token’s characteristics.

| Model | Experts | Active Experts | Activation Rate |
|---|---|---|---|
| LLM-jp-4 | 128 | 8 | 6.25% |
| Qwen3.5-35B-A3B | 256 | 8 | 3.1% |

Half the experts (128 vs 256) means the Router has fewer candidates, making the entire routing logic (Router output computation → Top-K selection → gating value calculation) cheaper. This directly translates to the speed difference.
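
The selection step can be sketched in a few lines. This is an illustrative pure-Python sketch, not LLM-jp-4's actual implementation; in particular, whether gating values are softmaxed before or after top-k selection varies between MoE models, and softmax-after-selection is an assumption here:

```python
import math
import random

def route_token(logits, top_k=8):
    # Rank experts by router logit, keep the top_k, softmax over the survivors.
    top = sorted(range(len(logits)), key=lambda i: logits[i])[-top_k:]
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]  # expert indices, gating weights

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(128)]  # one router logit per expert
experts, gates = route_token(logits)
print(len(experts), round(sum(gates), 6))  # → 8 1.0
```

The token's output is then the gate-weighted sum of the 8 selected experts' outputs; the other 120 experts are never computed.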

Test Environment

| Item | Spec |
|---|---|
| PC | GMKtec EVO-X2 |
| CPU | AMD Ryzen AI Max+ 395 (Zen 5 / 16C 32T) |
| GPU | Radeon 8060S (RDNA 3.5 / gfx1151) |
| Memory | 64GB UMA (BIOS: VRAM 48GB / System 16GB) |
| Backend | ROCm (HIP) llama-server b8183 |
| Model | LLM-jp-4-32B-A3B Q5_K_M (24.4 GB) |

BIOS is set to VRAM 48GB, System 16GB. As tested in the Strix Halo memory allocation optimization article, this allocation is optimal for loading 24.4GB models.

GGUF Options

Quantized versions by mmnga-o.

| Quantization | Size |
|---|---|
| Q3_K_M | 16.7 GB |
| MXFP4_MOE | 18.1 GB |
| Q4_K_M | 21.4 GB |
| Q5_K_M | 24.4 GB |

Q5_K_M selected since VRAM is set to 48GB.
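
As a rough way to compare these files, effective bits per weight is just file size over parameter count. Treat the results as approximate: the table sizes are decimal GB, and k-quant formats spend extra bytes on block scales:

```python
params_b = 32.1  # total parameters, in billions
for name, size_gb in [("Q3_K_M", 16.7), ("MXFP4_MOE", 18.1),
                      ("Q4_K_M", 21.4), ("Q5_K_M", 24.4)]:
    # decimal GB ≈ 1e9 bytes, so GB * 8 / (params in billions) ≈ bits per weight
    print(f"{name}: {size_gb * 8 / params_b:.2f} bits/weight")
# → Q3_K_M: 4.16, MXFP4_MOE: 4.51, Q4_K_M: 5.33, Q5_K_M: 6.08
```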

Loading

ROCm0 model buffer size = 22942.08 MiB
KV buffer = 2176.00 MiB (ctx=65536, q8_0)
CPU model buffer = 330 MiB
33/33 layers offloaded to GPU

All 33 layers offloaded to GPU successfully. Model buffer 22,942 MiB + KV cache 2,176 MiB + CPU buffer 330 MiB ≈ 25,448 MiB (24.9 GiB) total. Plenty of headroom with 48GB VRAM.
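
Summing the load-log values directly (counting the small CPU-side buffer against the same pool, which is reasonable on unified memory):

```python
model_buf, kv_buf, cpu_buf = 22942.08, 2176.00, 330.0  # MiB, from the load log
vram_mib = 48 * 1024                                    # BIOS VRAM allocation
total = model_buf + kv_buf + cpu_buf
print(f"total {total:.1f} MiB, headroom {vram_mib - total:.1f} MiB")
# → total 25448.1 MiB, headroom 23703.9 MiB
```

Roughly 23 GiB stays free, enough to raise context length or quantization without touching the BIOS split.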

Speed: 62.9 t/s

| Model | Speed | Model Size |
|---|---|---|
| LLM-jp-4 32B-A3B Q5_K_M (ROCm) | 62.9 t/s | 24.4 GB |
| Qwen3.5-35B-A3B abliterated Q6_K (Vulkan) | 44.7 t/s | 27 GB |
| Qwen3.5-35B-A3B Q4_K_XL (Lemonade) | 37.8 t/s | 21 GB |

41% faster than Qwen3.5. Half the experts (128 vs 256) and fewer layers (32 vs 40) are the key factors. LLM-jp-4 has more active parameters (3.8B vs 3B), but routing efficiency and layer count offset this.
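
The headline percentage is just the throughput ratio between the two measured rates:

```python
llm_jp, qwen = 62.9, 44.7  # t/s, from the table above
print(f"{(llm_jp / qwen - 1) * 100:.0f}% faster")  # → 41% faster
```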

Japanese Quality Comparison: LLM-jp-4 vs Qwen3.5

Compared using 3 identical prompts.

Speed

| Prompt | LLM-jp-4 | Qwen3.5 |
|---|---|---|
| Quantum computing explanation | 62.8 t/s (585 tok) | 44.6 t/s (157 tok) |
| Rainy library scene (creative) | 61.3 t/s (1500 tok) | 44.3 t/s (223 tok) |
| Declining birthrate (reasoning) | 62.1 t/s (988 tok) | 44.0 t/s (1122 tok) |

LLM-jp-4 consistently hits 60+ t/s. Qwen3.5 is stable around 44 t/s.

Quality Comparison

Quantum Computing (Knowledge)

  • LLM-jp-4: Accurately explains superposition and quantum entanglement. Mentions experimental-stage limitations. Precise and textbook-like
  • Qwen3.5: Uses metaphors like “solving a maze” and “intuitive intelligence.” Accessible but somewhat abstract

Rainy Library Scene (Creative)

  • LLM-jp-4: Content field empty. Thinking consumed all 1,500 tokens, no body text generated. Caused by not specifying --reasoning-budget
  • Qwen3.5: Generates an atmospheric short story with literary expressions

Declining Birthrate (Reasoning)

  • LLM-jp-4: Organizes causes and countermeasures in tables. Includes specific figures (40% non-regular employment rate, 30,000 yen/month subsidy). Policy-report style output
  • Qwen3.5: Structures around 3 pillars (economic, social, labor). Thorough explanations, 1,122 tokens

Quality Summary

| Aspect | LLM-jp-4 | Qwen3.5 abliterated |
|---|---|---|
| Speed | 62 t/s | 44 t/s |
| Knowledge | Accurate, textbook-like | Accessible, metaphor-heavy |
| Creative | Failed (thinking consumed) | Normal, literary |
| Reasoning | Tabular, data-driven | Structured, thorough |
| Thinking control | reasoning-budget required | reasoning-budget 0 stable |

LLM-jp-4 is a thinking model that explicitly externalizes an “internal reasoning” process before generating responses, similar to OpenAI’s o1 or Claude 4.6 (Opus 4.6). In LLM-jp-4, the reasoning process wrapped in <|thinking> tags is included in the chat completion response.

By default, tokens are allocated to this thinking region, which means creative prompts can have thinking consume all tokens, leaving content empty. llama-server offers --reasoning-budget 0 to disable thinking entirely, or reasoning_effort (low/medium/high) via chat template. But disabling thinking on a model whose main selling point is thinking is a contradiction.

For API use, you need to switch modes per task: keep thinking enabled (e.g. reasoning_effort high) for reasoning tasks (complex problem-solving, math, logic) and use --reasoning-budget 0 for knowledge/creative tasks.

Code Generation Test

Prompt: “Simple BBS, posting only, localStorage, Japanese UI, single HTML file”

Results:

  • 61.2 t/s generating 1,433 tokens
  • Working product on first attempt
  • XSS escaping (escapeHtml) implemented
  • localStorage, newest-first display, form clearing
  • Japanese UI (bulletin board, name, message, post)
  • Japanese comments (// データ取得・保存 "fetch/save data", // XSS対策 "XSS mitigation")

Simple BBS generated by LLM-jp-4

Good security awareness. The third post in the screenshot shows <script>alert('test')</script> properly escaped.
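
The defense amounts to replacing HTML metacharacters with entities before inserting user text into the page. Python's standard library does the same thing in one call (a stand-in for illustration; the generated page used its own JavaScript escapeHtml helper):

```python
import html

# html.escape replaces &, <, >, " and ' with HTML entities,
# so injected markup renders as text instead of executing.
print(html.escape("<script>alert('test')</script>"))
# → &lt;script&gt;alert(&#x27;test&#x27;)&lt;/script&gt;
```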

One caveat: thinking remnants (<|channel|> analysis) can leak into the content field. Avoidable with --reasoning-budget 0.

Training Data Composition

Full Corpus

The full corpus is 19.5 trillion tokens (actual pre-training used 10.5 trillion).

| Language | Tokens | Share |
|---|---|---|
| English | 17.8T | 91% |
| Other (Chinese, Korean) | 850B | 4.4% |
| Japanese | 700B | 3.6% |
| Code | 200B | 1% |

Pre-training Sampling Ratios

According to developer @odashi_t’s X post (2026-04-04), Japanese was heavily oversampled during training.

| Language | Corpus Share | Sampling Share | Multiplier |
|---|---|---|---|
| English | 91% | 72.9% | 0.8x |
| Japanese | 3.5% | 15.9% | 4.5x |
| Code | 1% | 7.6% | 7.6x |
| Chinese | ~4% | 3.4% | ~0.9x |
| Korean | ~0.4% | 0.2% | ~0.5x |

The reality of “Japanese-specialized”: English dominates the corpus at 91%, but training-time sampling boosts Japanese 4.5x to 15.9%. Japanese data consists mainly of web, government, and scientific/technical content. License policy explicitly excludes data generated or filtered by GPT and other commercial LLMs.
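
The multipliers in the table are simply sampling share divided by corpus share:

```python
shares = {  # language: (corpus share, sampling share), from the table above
    "English":  (0.91,  0.729),
    "Japanese": (0.035, 0.159),
    "Code":     (0.01,  0.076),
}
for lang, (corpus, sampled) in shares.items():
    print(f"{lang}: {sampled / corpus:.1f}x")
# → English: 0.8x, Japanese: 4.5x, Code: 7.6x
```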

Knowledge Cutoff: December 2023

11.7 trillion tokens of training, but the corpus cutoff is late 2023.

| Question | Answer | Correct? |
|---|---|---|
| Japan's PM (Japanese) | Kishida Fumio | ✗ (Shigeru Ishiba is correct) |
| Japan's PM (English) | Refused: "uncertain" | – |
| US President | Joe Biden | ✗ (Trump's 2nd term is correct) |
| 2024 Japan news | Unable to respond (thinking consumed) | – |

Internal date recognition is inconsistent (varies between 2022-11 and 2024-10 depending on phrasing). The model tends to say “I don’t know” rather than hallucinate uncertain information, which is the safer approach.

Safety Filter: Very Strong

  • Lock picking instructions → refused
  • Molotov cocktail recipe → refused
  • Offensive jokes → refused
  • Romantic scenes → self-censored to PG-13

No abliterated version exists since this is an official NII model. Qwen3.5-abliterated remains necessary for uncensored use cases.

Countries That Can Train LLMs From Scratch

Outside the US and China, how many countries have done full pre-training from scratch rather than fine-tuning existing models? More than expected.

| Country | Organization | Model | Scale |
|---|---|---|---|
| France | Mistral AI | Mistral Large 3 | 675B MoE |
| France | BigScience (HuggingFace + CNRS-led) | BLOOM | 176B |
| UAE | TII | Falcon 3 | 1B-10B (14T tokens trained) |
| South Korea | Naver | HyperCLOVA X | Undisclosed (6T tokens trained) |
| South Korea | SK Telecom | A.X 3.1 | 7B-34B |
| South Korea | LG AI Research | EXAONE 4.0 | 236B MoE |
| South Korea | Upstage | Solar Open 100B | 102B |
| India | Sarvam AI | Sarvam 105B | 105B MoE |
| India | BharatGen (govt-backed) | Param-1 | 2.9B |
| Germany | Aleph Alpha | Luminous / Pharia | 13B-70B |
| Israel | AI21 Labs | Jamba 1.5 | 52B MoE |
| Canada | Cohere | Command R+ | 104B |
| Sweden | AI Sweden | GPT-SW3 | 40B |
| Russia | Sber | GigaChat | Undisclosed MoE |
| EU | Multi-national consortium | EuroLLM | 22B |
| Japan | NII | LLM-jp-4 | 32.1B MoE |
| Japan | PFN | PLaMo | Undisclosed |
| Japan | NTT | tsuzumi 2 | 30B |

South Korea stands out with four organizations independently running scratch training: Naver, SKT, LG, and Upstage. India’s Sarvam AI scaled to 105B MoE in 2026.

Japan has three organizations doing scratch training: NII, PFN, and NTT. LLM-jp-4 riding on the proven Qwen3MoE architecture was the right call. Use a design that already works, then pull out Japanese performance through corpus design and sampling strategy. Scoring 7.82 on MT-Bench JA, beating GPT-4o’s 7.29, proves this approach works.

The other half of the equation would be developing original architectures. NTT’s tsuzumi claims a proprietary design, and AI21 Labs’ Jamba demonstrated that novel structures like Mamba-Transformer hybrids can be built from scratch. Whether NII pushes into architecture design next, alongside the upcoming 332B-A31B model release, is worth watching.