
LLM-jp-4-32B-A3B on ROCm + Strix Halo: 41% Faster Than Qwen3.5

Ikesan

NII (National Institute of Informatics, Japan) released “LLM-jp-4-32B-A3B-thinking” on April 3, 2026. I benchmarked it on my EVO-X2 against Qwen3.5-35B-A3B, comparing speed, Japanese quality, code generation, and knowledge base.

What “Pure Japanese-Made” Actually Means

LLM-jp-4 uses the Qwen3MoE architecture, but it’s not a fine-tune or distillation of an existing model. NII trained it from scratch using 11.7 trillion tokens.

There are several models that call themselves “Japanese LLMs,” but the actual approaches differ significantly.

| Model | Developer | Base | Training Method |
|---|---|---|---|
| LLM-jp-4 | NII | Qwen3MoE architecture only | Scratch training (11.7T tokens) |
| Namazu | Sakana AI | DeepSeek-V3.1 etc. | Post-training on existing models |
| Rakuten AI 3.0 | Rakuten | DeepSeek-V3 | Continued pre-training + fine-tuning |

Sakana AI’s Namazu applies post-training to multiple existing models including DeepSeek-V3.1-Terminus and Llama 3.1 405B, focusing on correcting biases related to Japanese politics and history. An interesting approach, but the model weights are borrowed.

Rakuten AI 3.0 is a more problematic case. In March 2026, Rakuten announced it as “Japan’s largest high-performance AI model” under GENIAC (a government-subsidized project by METI). Shortly after release, the community found "model_type": "deepseek_v3" in the config.json, revealing it was based on DeepSeek-V3. The initial release had DeepSeek’s MIT license file removed and the official blog made no mention of DeepSeek. After community backlash, Rakuten added license attribution to a NOTICE file and finally acknowledged on March 27 that they “adopted DeepSeek’s open model.” Using DeepSeek-V3 under MIT is perfectly legal, but hiding the base model, removing license attribution, and presenting it as a “domestic model” built with government subsidies drew justified criticism.

LLM-jp-4 is fundamentally different. Building on existing weights is far easier and faster, but NII chose the hardest path: borrowing only the architecture definition and training 11.7 trillion tokens from random initialization.

| Aspect | LLM-jp-4 |
|---|---|
| Pre-trained weights | None. Trained from random initialization |
| Synthetic data | Not used. Data generated or filtered by commercial LLMs (GPT, Claude, etc.) explicitly excluded |
| Developer | NII (National Institute of Informatics). Academic institution, not a commercial entity |
| License | Apache 2.0. Free for commercial use, modification, and redistribution |

An academic institution scratch-training an LLM and releasing it under Apache 2.0 is rare internationally. “Japanese-specialized” doesn’t mean they bolted Japanese onto an English model via fine-tuning. They designed the corpus from the ground up with Japanese at the center, training 11.7 trillion tokens from scratch. As the sampling ratios below show, they oversampled Japanese 4.5x during training to boost it from 3.5% of the corpus to 15.9%.

Model Overview

| Item | Spec |
|---|---|
| Name | LLM-jp-4-32B-A3B-thinking |
| Released | 2026-04-03 |
| Developer | NII (National Institute of Informatics) |
| Architecture | Qwen3MoE (MoE, 128 experts, 8 active) |
| Total Parameters | 32.1B |
| Active Parameters | 3.8B |
| Context Length | 65,536 tokens |
| Training Data | 11.7T tokens (Japanese-focused) |
| License | Apache 2.0 |
| Benchmark | MT-Bench JA 7.82 |

Benchmarks (llm-jp-judge, evaluated by GPT-5.4)

| Model | MT-Bench JA | MT-Bench EN | AnswerCarefully | llm-jp-instructions |
|---|---|---|---|---|
| GPT-5.4 (high) | 8.98 | 8.85 | 4.41 | 4.83 |
| LLM-jp-4-32B-A3B (medium) | 7.82 | 7.86 | 3.70 | 3.61 |
| LLM-jp-4-32B-A3B (low) | 7.57 | 7.70 | 3.61 | 3.61 |
| LLM-jp-4-8B (medium) | 7.54 | 7.79 | 3.69 | 3.54 |
| GPT-4o (2024-08-06) | 7.29 | 7.69 | 4.00 | 4.07 |
| Qwen3-8B | 7.14 | 7.69 | – | – |

MT-Bench JA 7.82 exceeds GPT-4o’s 7.29. The lower AnswerCarefully and llm-jp-instructions scores compared to GPT-4o reflect safety filter strictness and instruction-following tuning differences, not base model capability. Detailed scores from llm-jp-eval v2.1.3 (42 tasks) are only published as a radar chart; numerical tables remain unreleased. No technical report as of April 2026.

Upcoming releases: LLM-jp-4 32B (Dense model) and LLM-jp-4 332B-A31B (332B total, 31B active MoE) planned for FY2026.

The architecture is Qwen3MoE-based but structurally quite different from Qwen3.5-35B-A3B.

| | LLM-jp-4 32B-A3B | Qwen3.5-35B-A3B |
|---|---|---|
| Total Parameters | 32.1B (active 3.8B) | 34.66B (active 3B) |
| Experts | 128 / 8 active | 256 / 8 active |
| Layers | 32 | 40 |
| Hidden size | 2,560 | 2,048 |
| SSM/Mamba | None (Attention only) | Yes (SSM+Attention hybrid) |
| KV cache (ctx=65K) | 2,176 MiB | 680 MiB |

LLM-jp-4 uses Attention in all layers, so all 32 layers consume KV cache. Qwen3.5-35B-A3B, as previously tested, is an SSM (State Space Model) + Attention hybrid with 30 SSM layers and only 10 Attention layers. At ctx=65K, Qwen3.5’s KV cache is just 680 MiB while LLM-jp-4 needs 2,176 MiB (3x more).
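
Both published KV cache figures can be reproduced with a back-of-the-envelope formula. This sketch assumes 4 KV heads with a head dimension of 128 (typical for this model class; the exact head configuration is an assumption, not stated in the article) and the q8_0 cache quantization shown in the load log:

```python
def kv_cache_mib(attn_layers, ctx, kv_heads=4, head_dim=128):
    # q8_0 stores 32 values in 34 bytes (32 x int8 plus one fp16 scale).
    bytes_per_value = 34 / 32
    per_token = 2 * kv_heads * head_dim * bytes_per_value  # K and V per attention layer
    return attn_layers * ctx * per_token / 2**20

print(kv_cache_mib(32, 65536))  # LLM-jp-4: every layer is attention → 2176.0
print(kv_cache_mib(10, 65536))  # Qwen3.5: only 10 attention layers → 680.0
```

Under these assumptions the 3x gap falls directly out of the attention layer count: 32 vs 10.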

In exchange, LLM-jp-4 gains elsewhere: Japanese knowledge accuracy benefits from the oversampled training data described below, and with half the experts (128 vs 256) and fewer layers (32 vs 40), MoE routing overhead is lower, giving a significant inference speed advantage.

Why Full Attention

LLM-jp-4’s rationale for full Attention isn’t public, but we can speculate. Hybrid designs (SSM+Attention) like Qwen3.5-35B-A3B are efficient for long contexts, but alternating between SSM and Attention layers adds compression/decompression overhead at each layer transition. With only 32 layers, LLM-jp-4 may have been better served by keeping all layers as Attention and focusing optimization efforts on MoE routing.

The architecture’s efficiency shows in the speed benchmarks. 62.9 t/s is 41% faster than Qwen3.5’s 44.7 t/s.

How MoE (Mixture of Experts) Works

LLM-jp-4’s MoE has 128 experts, with the top 8 activated per token. The 3.8B active parameters represent these 8 experts’ combined weights. Not all 128 experts compute simultaneously; a Router (gating network) dynamically selects which 8 to use based on each input token’s characteristics.

| Model | Experts | Active Experts | Activation Rate |
|---|---|---|---|
| LLM-jp-4 | 128 | 8 | 6.25% |
| Qwen3.5-35B-A3B | 256 | 8 | 3.1% |

Half the experts (128 vs 256) means the Router has fewer candidates, making the entire routing logic (Router output computation → Top-K selection → gating value calculation) cheaper. This directly translates to the speed difference.
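
The selection step can be sketched in a few lines. This is an illustrative pure-Python sketch, not LLM-jp-4's actual implementation; in particular, whether gating values are softmaxed before or after top-k selection varies between MoE models, and softmax-after-selection is an assumption here:

```python
import math
import random

def route_token(logits, top_k=8):
    # Rank experts by router logit, keep the top_k, softmax over the survivors.
    top = sorted(range(len(logits)), key=lambda i: logits[i])[-top_k:]
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]  # expert indices, gating weights

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(128)]  # one router logit per expert
experts, gates = route_token(logits)
print(len(experts), round(sum(gates), 6))  # → 8 1.0
```

The token's output is then the gate-weighted sum of the 8 selected experts' outputs; the other 120 experts are never computed.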

Test Environment

| Item | Spec |
|---|---|
| PC | GMKtec EVO-X2 |
| CPU | AMD Ryzen AI Max+ 395 (Zen 5 / 16C 32T) |
| GPU | Radeon 8060S (RDNA 3.5 / gfx1151) |
| Memory | 64GB UMA (BIOS: VRAM 48GB / System 16GB) |
| Backend | ROCm (HIP) llama-server b8183 |
| Model | LLM-jp-4-32B-A3B Q5_K_M (24.4 GB) |

BIOS is set to VRAM 48GB, System 16GB. As tested in the Strix Halo memory allocation optimization article, this allocation is optimal for loading 24.4GB models.

GGUF Options

Quantized versions by mmnga-o.

| Quantization | Size |
|---|---|
| Q3_K_M | 16.7 GB |
| MXFP4_MOE | 18.1 GB |
| Q4_K_M | 21.4 GB |
| Q5_K_M | 24.4 GB |

Q5_K_M selected since VRAM is set to 48GB.
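
As a rough way to compare these files, effective bits per weight is just file size over parameter count. Treat the results as approximate: the table sizes are decimal GB, and k-quant formats spend extra bytes on block scales:

```python
params_b = 32.1  # total parameters, in billions
for name, size_gb in [("Q3_K_M", 16.7), ("MXFP4_MOE", 18.1),
                      ("Q4_K_M", 21.4), ("Q5_K_M", 24.4)]:
    # decimal GB ≈ 1e9 bytes, so GB * 8 / (params in billions) ≈ bits per weight
    print(f"{name}: {size_gb * 8 / params_b:.2f} bits/weight")
# → Q3_K_M: 4.16, MXFP4_MOE: 4.51, Q4_K_M: 5.33, Q5_K_M: 6.08
```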

Loading

ROCm0 model buffer size = 22942.08 MiB
KV buffer = 2176.00 MiB (ctx=65536, q8_0)
CPU model buffer = 330 MiB
33/33 layers offloaded to GPU

All 33 layers offloaded to GPU successfully. Model buffer 22,942 MiB + KV cache 2,176 MiB + CPU buffer 330 MiB ≈ 25,448 MiB (24.9 GiB) total. Plenty of headroom with 48GB VRAM.
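
Summing the load-log values directly (counting the small CPU-side buffer against the same pool, which is reasonable on unified memory):

```python
model_buf, kv_buf, cpu_buf = 22942.08, 2176.00, 330.0  # MiB, from the load log
vram_mib = 48 * 1024                                    # BIOS VRAM allocation
total = model_buf + kv_buf + cpu_buf
print(f"total {total:.1f} MiB, headroom {vram_mib - total:.1f} MiB")
# → total 25448.1 MiB, headroom 23703.9 MiB
```

Roughly 23 GiB stays free, enough to raise context length or quantization without touching the BIOS split.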

Speed: 62.9 t/s

| Model | Speed | Model Size |
|---|---|---|
| LLM-jp-4 32B-A3B Q5_K_M (ROCm) | 62.9 t/s | 24.4 GB |
| Qwen3.5-35B-A3B abliterated Q6_K (Vulkan) | 44.7 t/s | 27 GB |
| Qwen3.5-35B-A3B Q4_K_XL (Lemonade) | 37.8 t/s | 21 GB |

41% faster than Qwen3.5. Half the experts (128 vs 256) and fewer layers (32 vs 40) are the key factors. LLM-jp-4 has more active parameters (3.8B vs 3B), but routing efficiency and layer count offset this.
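
The headline percentage is just the throughput ratio between the two measured rates:

```python
llm_jp, qwen = 62.9, 44.7  # t/s, from the table above
print(f"{(llm_jp / qwen - 1) * 100:.0f}% faster")  # → 41% faster
```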

Japanese Quality Comparison: LLM-jp-4 vs Qwen3.5

Compared using 3 identical prompts.

Speed

| Prompt | LLM-jp-4 | Qwen3.5 |
|---|---|---|
| Quantum computing explanation | 62.8 t/s (585 tok) | 44.6 t/s (157 tok) |
| Rainy library scene (creative) | 61.3 t/s (1500 tok) | 44.3 t/s (223 tok) |
| Declining birthrate (reasoning) | 62.1 t/s (988 tok) | 44.0 t/s (1122 tok) |

LLM-jp-4 consistently hits 60+ t/s. Qwen3.5 is stable around 44 t/s.

Quality Comparison

Quantum Computing (Knowledge)

  • LLM-jp-4: Accurately explains superposition and quantum entanglement. Mentions experimental-stage limitations. Precise and textbook-like
  • Qwen3.5: Uses metaphors like “solving a maze” and “intuitive intelligence.” Accessible but somewhat abstract

Rainy Library Scene (Creative)

  • LLM-jp-4: Content field empty. Thinking consumed all 1,500 tokens, no body text generated. Caused by not specifying --reasoning-budget
  • Qwen3.5: Generates an atmospheric short story with literary expressions

Declining Birthrate (Reasoning)

  • LLM-jp-4: Organizes causes and countermeasures in tables. Includes specific figures (40% non-regular employment rate, 30,000 yen/month subsidy). Policy-report style output
  • Qwen3.5: Structures around 3 pillars (economic, social, labor). Thorough explanations, 1,122 tokens

Quality Summary

| Aspect | LLM-jp-4 | Qwen3.5 abliterated |
|---|---|---|
| Speed | 62 t/s | 44 t/s |
| Knowledge | Accurate, textbook-like | Accessible, metaphor-heavy |
| Creative | Failed (thinking consumed) | Normal, literary |
| Reasoning | Tabular, data-driven | Structured, thorough |
| Thinking control | reasoning-budget required | reasoning-budget 0 stable |

LLM-jp-4 is a thinking model that explicitly externalizes an “internal reasoning” process before generating responses, similar to OpenAI’s o1 or Claude 4.6 (Opus 4.6). In LLM-jp-4, the reasoning process wrapped in <|thinking> tags is included in the chat completion response.

By default, tokens are allocated to this thinking region, which means creative prompts can have thinking consume all tokens, leaving content empty. llama-server offers --reasoning-budget 0 to disable thinking entirely, or reasoning_effort (low/medium/high) via chat template. But disabling thinking on a model whose main selling point is thinking is a contradiction.

For API use, you need to switch modes per task: keep thinking enabled (e.g. reasoning_effort high) for reasoning tasks (complex problem-solving, math, logic) and use --reasoning-budget 0 for knowledge/creative tasks.

Code Generation Test

Prompt: “Simple BBS, posting only, localStorage, Japanese UI, single HTML file”

Results:

  • 61.2 t/s generating 1,433 tokens
  • Working product on first attempt
  • XSS escaping (escapeHtml) implemented
  • localStorage, newest-first display, form clearing
  • Japanese UI (bulletin board, name, message, post)
  • Japanese comments (// データ取得・保存 "fetch/save data", // XSS対策 "XSS mitigation")

Simple BBS generated by LLM-jp-4

Good security awareness. The third post in the screenshot shows <script>alert('test')</script> properly escaped.
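
The defense amounts to replacing HTML metacharacters with entities before inserting user text into the page. Python's standard library does the same thing in one call (a stand-in for illustration; the generated page used its own JavaScript escapeHtml helper):

```python
import html

# html.escape replaces &, <, >, " and ' with HTML entities,
# so injected markup renders as text instead of executing.
print(html.escape("<script>alert('test')</script>"))
# → &lt;script&gt;alert(&#x27;test&#x27;)&lt;/script&gt;
```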

One caveat: thinking remnants (<|channel|> analysis) can leak into the content field. Avoidable with --reasoning-budget 0.

Training Data Composition

Full Corpus

The full corpus is 19.5 trillion tokens (actual pre-training used 10.5 trillion).

| Language | Tokens | Share |
|---|---|---|
| English | 17.8T | 91% |
| Other (Chinese, Korean) | 850B | 4.4% |
| Japanese | 700B | 3.6% |
| Code | 200B | 1% |

Pre-training Sampling Ratios

According to developer @odashi_t’s X post (2026-04-04), Japanese was heavily oversampled during training.

| Language | Corpus Share | Sampling Share | Multiplier |
|---|---|---|---|
| English | 91% | 72.9% | 0.8x |
| Japanese | 3.5% | 15.9% | 4.5x |
| Code | 1% | 7.6% | 7.6x |
| Chinese | ~4% | 3.4% | ~0.9x |
| Korean | ~0.4% | 0.2% | ~0.5x |

The reality of “Japanese-specialized”: English dominates the corpus at 91%, but training-time sampling boosts Japanese 4.5x to 15.9%. Japanese data consists mainly of web, government, and scientific/technical content. License policy explicitly excludes data generated or filtered by GPT and other commercial LLMs.
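
The multipliers in the table are simply sampling share divided by corpus share:

```python
shares = {  # language: (corpus share, sampling share), from the table above
    "English":  (0.91,  0.729),
    "Japanese": (0.035, 0.159),
    "Code":     (0.01,  0.076),
}
for lang, (corpus, sampled) in shares.items():
    print(f"{lang}: {sampled / corpus:.1f}x")
# → English: 0.8x, Japanese: 4.5x, Code: 7.6x
```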

Knowledge Cutoff: December 2023

11.7 trillion tokens of training, but the corpus cutoff is late 2023.

| Question | Answer | Correct? |
|---|---|---|
| Japan's PM (Japanese) | Kishida Fumio | ✗ (Shigeru Ishiba is correct) |
| Japan's PM (English) | Refused: "uncertain" | – |
| US President | Joe Biden | ✗ (Trump's 2nd term is correct) |
| 2024 Japan news | Unable to respond (thinking consumed) | – |

Internal date recognition is inconsistent (varies between 2022-11 and 2024-10 depending on phrasing). The model tends to say “I don’t know” rather than hallucinate uncertain information, which is the safer approach.

Safety Filter: Very Strong

  • Lock picking instructions → refused
  • Molotov cocktail recipe → refused
  • Offensive jokes → refused
  • Romantic scenes → self-censored to PG-13

No abliterated version exists since this is an official NII model. Qwen3.5-abliterated remains necessary for uncensored use cases.

Countries That Can Train LLMs From Scratch

Outside the US and China, how many countries have done full pre-training from scratch rather than fine-tuning existing models? More than expected.

| Country | Organization | Model | Scale |
|---|---|---|---|
| France | Mistral AI | Mistral Large 3 | 675B MoE |
| France | BigScience (HuggingFace + CNRS-led) | BLOOM | 176B |
| UAE | TII | Falcon 3 | 1B-10B (14T tokens trained) |
| South Korea | Naver | HyperCLOVA X | Undisclosed (6T tokens trained) |
| South Korea | SK Telecom | A.X 3.1 | 7B-34B |
| South Korea | LG AI Research | EXAONE 4.0 | 236B MoE |
| South Korea | Upstage | Solar Open 100B | 102B |
| India | Sarvam AI | Sarvam 105B | 105B MoE |
| India | BharatGen (govt-backed) | Param-1 | 2.9B |
| Germany | Aleph Alpha | Luminous / Pharia | 13B-70B |
| Israel | AI21 Labs | Jamba 1.5 | 52B MoE |
| Canada | Cohere | Command R+ | 104B |
| Sweden | AI Sweden | GPT-SW3 | 40B |
| Russia | Sber | GigaChat | Undisclosed MoE |
| EU | Multi-national consortium | EuroLLM | 22B |
| Japan | NII | LLM-jp-4 | 32.1B MoE |
| Japan | PFN | PLaMo | Undisclosed |
| Japan | NTT | tsuzumi 2 | 30B |

South Korea stands out with four organizations independently running scratch training: Naver, SKT, LG, and Upstage. India’s Sarvam AI scaled to 105B MoE in 2026.

Japan has three organizations doing scratch training: NII, PFN, and NTT. LLM-jp-4 riding on the proven Qwen3MoE architecture was the right call. Use a design that already works, then pull out Japanese performance through corpus design and sampling strategy. Scoring 7.82 on MT-Bench JA, beating GPT-4o’s 7.29, proves this approach works.

The other half of the equation would be developing original architectures. NTT’s tsuzumi claims a proprietary design, and AI21 Labs’ Jamba demonstrated that novel structures like Mamba-Transformer hybrids can be built from scratch. Whether NII pushes into architecture design next, alongside the upcoming 332B-A31B model release, is worth watching.