LLM-jp-4-32B-A3B on ROCm + Strix Halo: 41% Faster Than Qwen3.5
NII (National Institute of Informatics, Japan) released “LLM-jp-4-32B-A3B-thinking” on April 3, 2026. I benchmarked it on my EVO-X2 against Qwen3.5-35B-A3B, comparing speed, Japanese quality, code generation, and knowledge base.
What “Pure Japanese-Made” Actually Means
LLM-jp-4 uses the Qwen3MoE architecture, but it’s not a fine-tune or distillation of an existing model. NII trained it from scratch using 11.7 trillion tokens.
There are several models that call themselves “Japanese LLMs,” but the actual approaches differ significantly.
| Model | Developer | Base | Training Method |
|---|---|---|---|
| LLM-jp-4 | NII | Qwen3MoE architecture only | Scratch training (11.7T tokens) |
| Namazu | Sakana AI | DeepSeek-V3.1 etc. | Post-training on existing models |
| Rakuten AI 3.0 | Rakuten | DeepSeek-V3 | Continued pre-training + fine-tuning |
Sakana AI’s Namazu applies post-training to multiple existing models including DeepSeek-V3.1-Terminus and Llama 3.1 405B, focusing on correcting biases related to Japanese politics and history. An interesting approach, but the model weights are borrowed.
Rakuten AI 3.0 is a more problematic case. In March 2026, Rakuten announced it as “Japan’s largest high-performance AI model” under GENIAC (a government-subsidized project by METI). Shortly after release, the community found "model_type": "deepseek_v3" in the config.json, revealing it was based on DeepSeek-V3. The initial release had DeepSeek’s MIT license file removed and the official blog made no mention of DeepSeek. After community backlash, Rakuten added license attribution to a NOTICE file and finally acknowledged on March 27 that they “adopted DeepSeek’s open model.” Using DeepSeek-V3 under MIT is perfectly legal, but hiding the base model, removing license attribution, and presenting it as a “domestic model” built with government subsidies drew justified criticism.
LLM-jp-4 is fundamentally different. Building on existing weights is far easier and faster, but NII chose the hardest path: borrowing only the architecture definition and training 11.7 trillion tokens from random initialization.
| Aspect | LLM-jp-4 |
|---|---|
| Pre-trained weights | None. Trained from random initialization |
| Synthetic data | Not used. Data generated or filtered by commercial LLMs (GPT, Claude, etc.) explicitly excluded |
| Developer | NII (National Institute of Informatics). Academic institution, not a commercial entity |
| License | Apache 2.0. Free for commercial use, modification, and redistribution |
An academic institution scratch-training an LLM and releasing it under Apache 2.0 is rare internationally. “Japanese-specialized” doesn’t mean they bolted Japanese onto an English model via fine-tuning. They designed the corpus from the ground up with Japanese at the center, training 11.7 trillion tokens from scratch. As the sampling ratios below show, they oversampled Japanese 4.5x during training to boost it from 3.5% of the corpus to 15.9%.
Model Overview
| Item | Spec |
|---|---|
| Name | LLM-jp-4-32B-A3B-thinking |
| Released | 2026-04-03 |
| Developer | NII (National Institute of Informatics) |
| Architecture | Qwen3MoE (MoE, 128 experts, 8 active) |
| Total Parameters | 32.1B |
| Active Parameters | 3.8B |
| Context Length | 65,536 tokens |
| Training Data | 11.7T tokens (Japanese-focused) |
| License | Apache 2.0 |
| Benchmark | MT-Bench JA 7.82 |
Benchmarks (llm-jp-judge, evaluated by GPT-5.4)
| Model | MT-Bench JA | MT-Bench EN | AnswerCarefully | llm-jp-instructions |
|---|---|---|---|---|
| GPT-5.4 (high) | 8.98 | 8.85 | 4.41 | 4.83 |
| LLM-jp-4-32B-A3B (medium) | 7.82 | 7.86 | 3.70 | 3.61 |
| LLM-jp-4-32B-A3B (low) | 7.57 | 7.70 | 3.61 | 3.61 |
| LLM-jp-4-8B (medium) | 7.54 | 7.79 | 3.69 | 3.54 |
| GPT-4o (2024-08-06) | 7.29 | 7.69 | 4.00 | 4.07 |
| Qwen3-8B | 7.14 | 7.69 | — | — |
MT-Bench JA 7.82 exceeds GPT-4o’s 7.29. The lower AnswerCarefully and llm-jp-instructions scores compared to GPT-4o reflect safety filter strictness and instruction-following tuning differences, not base model capability. Detailed scores from llm-jp-eval v2.1.3 (42 tasks) are only published as a radar chart; numerical tables remain unreleased. No technical report as of April 2026.
Upcoming releases: LLM-jp-4 32B (Dense model) and LLM-jp-4 332B-A31B (332B total, 31B active MoE) planned for FY2026.
The architecture is Qwen3MoE-based but structurally quite different from Qwen3.5-35B-A3B.
| | LLM-jp-4 32B-A3B | Qwen3.5-35B-A3B |
|---|---|---|
| Total Parameters | 32.1B (active 3.8B) | 34.66B (active 3B) |
| Experts | 128 / 8 active | 256 / 8 active |
| Layers | 32 | 40 |
| Hidden size | 2,560 | 2,048 |
| SSM/Mamba | None (Attention only) | Yes (SSM+Attention hybrid) |
| KV cache (ctx=65K) | 2,176 MiB | 680 MiB |
LLM-jp-4 uses Attention in all layers, so all 32 layers consume KV cache. Qwen3.5-35B-A3B, as previously tested, is an SSM (State Space Model) + Attention hybrid with 30 SSM layers and only 10 Attention layers. At ctx=65K, Qwen3.5’s KV cache is just 680 MiB while LLM-jp-4 needs 2,176 MiB (3x more).
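The 3x gap can be reproduced with a back-of-the-envelope calculation. A sketch, assuming both models use a per-layer KV width (n_kv_heads × head_dim) of 512 — a value inferred by working backwards from the buffer sizes in the table above, not taken from the model cards — and that q8_0 costs roughly 1.0625 bytes per element:

```python
def kv_cache_mib(n_attn_layers: int, ctx: int, kv_width: int,
                 bytes_per_elem: float = 1.0625) -> float:
    """Estimate KV-cache size in MiB for attention layers only.

    Each attention layer stores K and V: ctx * kv_width elements apiece.
    q8_0 quantization costs ~8.5 bits (1.0625 bytes) per element.
    """
    return n_attn_layers * 2 * ctx * kv_width * bytes_per_elem / 1024**2

print(kv_cache_mib(32, 65536, 512))  # LLM-jp-4: all 32 layers -> 2176.0
print(kv_cache_mib(10, 65536, 512))  # Qwen3.5: 10 attention layers -> 680.0
```

Both figures match the table exactly under this assumption. Qwen3.5's 30 SSM layers keep a small fixed-size state instead of a context-proportional cache, which is where the saving comes from.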
However, Japanese knowledge accuracy benefits from the oversampled training data described below. And with half the experts (128 vs 256) and fewer layers (32 vs 40), MoE routing overhead is lower, giving a significant inference speed advantage.
Why Full Attention
LLM-jp-4’s rationale for full Attention isn’t public, but we can speculate. Hybrid designs (SSM+Attention) like Qwen3.5-35B-A3B are efficient for long contexts, but alternating between SSM and Attention layers adds compression/decompression overhead at each layer transition. With only 32 layers, LLM-jp-4 may have been better served by keeping all layers as Attention and focusing optimization efforts on MoE routing.
The architecture’s efficiency shows in the speed benchmarks. 62.9 t/s is 41% faster than Qwen3.5’s 44.7 t/s.
How MoE (Mixture of Experts) Works
LLM-jp-4’s MoE has 128 experts, with the top 8 activated per token. The 3.8B active parameters represent these 8 experts’ combined weights. Not all 128 experts compute simultaneously; a Router (gating network) dynamically selects which 8 to use based on each input token’s characteristics.
| Experts | Active Experts | Activation Rate |
|---|---|---|
| LLM-jp-4 | 128 / 8 | 6.25% |
| Qwen3.5-35B-A3B | 256 / 8 | 3.1% |
Half the experts (128 vs 256) means the Router has fewer candidates, making the entire routing logic (Router output computation → Top-K selection → gating value calculation) cheaper. This directly translates to the speed difference.
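The routing pipeline described above can be sketched in a few lines. A simplified illustration (real implementations differ in where softmax is applied relative to top-k, and llama.cpp's kernels are fused), assuming softmax-then-renormalize gating:

```python
import heapq
import math
import random

def route(router_logits: list[float], k: int = 8) -> dict[int, float]:
    """Pick the top-k experts by router logit, then renormalize their
    softmax weights so the selected gates sum to 1."""
    top = heapq.nlargest(k, range(len(router_logits)),
                         key=lambda i: router_logits[i])
    exp = [math.exp(router_logits[i]) for i in top]
    z = sum(exp)
    return {i: e / z for i, e in zip(top, exp)}

random.seed(0)
# 128 candidates (LLM-jp-4) vs 256 (Qwen3.5): the top-k scan and
# gating math scale with the candidate count, per token, per layer.
gates = route([random.gauss(0.0, 1.0) for _ in range(128)], k=8)
assert len(gates) == 8 and abs(sum(gates.values()) - 1.0) < 1e-9
```

This runs per token at every MoE layer, so halving the candidate count trims a small constant cost that compounds across all 32 layers of a long generation.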
Test Environment
| Item | Spec |
|---|---|
| PC | GMKtec EVO-X2 |
| CPU | AMD Ryzen AI Max+ 395 (Zen 5 / 16C 32T) |
| GPU | Radeon 8060S (RDNA 3.5 / gfx1151) |
| Memory | 64GB UMA (BIOS: VRAM 48GB / System 16GB) |
| Backend | ROCm (HIP) llama-server b8183 |
| Model | LLM-jp-4-32B-A3B Q5_K_M (24.4 GB) |
BIOS is set to VRAM 48GB, System 16GB. As tested in the Strix Halo memory allocation optimization article, this allocation is optimal for loading 24.4GB models.
GGUF Options
Quantized versions by mmnga-o.
| Quantization | Size |
|---|---|
| Q3_K_M | 16.7 GB |
| MXFP4_MOE | 18.1 GB |
| Q4_K_M | 21.4 GB |
| Q5_K_M | 24.4 GB |
Q5_K_M selected since VRAM is set to 48GB.
Loading
```
ROCm0 model buffer size = 22942.08 MiB
KV buffer = 2176.00 MiB (ctx=65536, q8_0)
CPU model buffer = 330 MiB
33/33 layers offloaded to GPU
```
All 33 layers offloaded to GPU successfully. Model buffer 22,942 MiB + KV cache 2,176 MiB + CPU buffer 330 MiB ≈ 25,448 MiB (~24.9 GiB) total. Plenty of headroom with 48GB VRAM.
Speed: 62.9 t/s
| Model | Speed | Model Size |
|---|---|---|
| LLM-jp-4 32B-A3B Q5_K_M (ROCm) | 62.9 t/s | 24.4 GB |
| Qwen3.5-35B-A3B abliterated Q6_K (Vulkan) | 44.7 t/s | 27 GB |
| Qwen3.5-35B-A3B Q4_K_XL (Lemonade) | 37.8 t/s | 21 GB |
41% faster than Qwen3.5. Half the experts (128 vs 256) and fewer layers (32 vs 40) are the key factors. LLM-jp-4 has more active parameters (3.8B vs 3B), but routing efficiency and layer count offset this.
Japanese Quality Comparison: LLM-jp-4 vs Qwen3.5
Compared using 3 identical prompts.
Speed
| Prompt | LLM-jp-4 | Qwen3.5 |
|---|---|---|
| Quantum computing explanation | 62.8 t/s (585 tok) | 44.6 t/s (157 tok) |
| Rainy library scene (creative) | 61.3 t/s (1500 tok) | 44.3 t/s (223 tok) |
| Declining birthrate (reasoning) | 62.1 t/s (988 tok) | 44.0 t/s (1122 tok) |
LLM-jp-4 consistently hits 60+ t/s. Qwen3.5 is stable around 44 t/s.
Quality Comparison
Quantum Computing (Knowledge)
- LLM-jp-4: Accurately explains superposition and quantum entanglement. Mentions experimental-stage limitations. Precise and textbook-like
- Qwen3.5: Uses metaphors like “solving a maze” and “intuitive intelligence.” Accessible but somewhat abstract
Rainy Library Scene (Creative)
- LLM-jp-4: Content field empty. Thinking consumed all 1,500 tokens and no body text was generated, caused by not specifying `--reasoning-budget`
- Qwen3.5: Generates an atmospheric short story with literary expressions
Declining Birthrate (Reasoning)
- LLM-jp-4: Organizes causes and countermeasures in tables. Includes specific figures (40% non-regular employment rate, 30,000 yen/month subsidy). Policy-report style output
- Qwen3.5: Structures around 3 pillars (economic, social, labor). Thorough explanations, 1,122 tokens
Quality Summary
| Aspect | LLM-jp-4 | Qwen3.5 abliterated |
|---|---|---|
| Speed | 62 t/s | 44 t/s |
| Knowledge | Accurate, textbook-like | Accessible, metaphor-heavy |
| Creative | Failed (thinking consumed) | Normal, literary |
| Reasoning | Tabular, data-driven | Structured, thorough |
| Thinking control | reasoning-budget required | reasoning-budget 0 stable |
LLM-jp-4 is a thinking model that explicitly externalizes an “internal reasoning” process before generating responses, similar to OpenAI’s o1 or Claude 4.6 (Opus 4.6). In LLM-jp-4, the reasoning process wrapped in <|thinking> tags is included in the chat completion response.
By default, tokens are allocated to this thinking region, which means creative prompts can have thinking consume all tokens, leaving content empty. llama-server offers --reasoning-budget 0 to disable thinking entirely, or reasoning_effort (low/medium/high) via chat template. But disabling thinking on a model whose main selling point is thinking is a contradiction.
For API use, you need to switch settings per task: a high reasoning_effort (or a generous --reasoning-budget) for reasoning tasks (complex problem-solving, math, logic), and --reasoning-budget 0 for knowledge and creative tasks.
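That switching can live in a thin client wrapper. A sketch targeting llama-server's OpenAI-compatible /v1/chat/completions endpoint — note that passing reasoning_effort through chat_template_kwargs is an assumption based on the chat-template mechanism mentioned above, not a documented guarantee for this model:

```python
import json

def build_request(prompt: str, task: str) -> dict:
    """Build a chat-completion payload with per-task reasoning control.

    ASSUMPTION: the chat template accepts a 'reasoning_effort' kwarg
    (low/medium/high); verify against your llama-server build.
    """
    effort = "high" if task in ("reasoning", "math", "logic") else "low"
    return {
        "model": "llm-jp-4-32b-a3b",
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"reasoning_effort": effort},
    }

print(json.dumps(build_request("Explain quantum entanglement.", "knowledge"),
                 ensure_ascii=False, indent=2))
```

For knowledge and creative tasks, starting the server with --reasoning-budget 0 remains the blunter but more reliable option.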
Code Generation Test
Prompt: “Simple BBS, posting only, localStorage, Japanese UI, single HTML file”
Results:
- 61.2 t/s generating 1,433 tokens
- Working product on first attempt
- XSS escaping (`escapeHtml`) implemented
- localStorage, newest-first display, form clearing
- Japanese UI (bulletin board, name, message, post)
- Japanese comments (`// データ取得・保存` "fetch/save data", `// XSS対策` "XSS protection")

Good security awareness. The third post in the screenshot shows <script>alert('test')</script> properly escaped.
One caveat: thinking remnants (<|channel|> analysis) can leak into the content field. Avoidable with --reasoning-budget 0.
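If you can't disable thinking, the remnants can also be scrubbed client-side. A hypothetical cleanup sketch — the pattern below assumes remnants look like `<|channel|> analysis …` blocks followed by the real answer; adjust it to whatever your build actually emits:

```python
import re

# Drop an "analysis" block up to the next <|channel|> marker (or EOF),
# then remove any leftover <|...|> special tokens.
_ANALYSIS = re.compile(r"<\|channel\|>\s*analysis.*?(?=<\|channel\|>|\Z)",
                       re.DOTALL)
_MARKER = re.compile(r"<\|[^|]*\|>")

def strip_thinking(content: str) -> str:
    return _MARKER.sub("", _ANALYSIS.sub("", content)).strip()

print(strip_thinking("<|channel|> analysis draft notes <|channel|>Done!"))
# -> Done!
```

Content without markers passes through unchanged, so it's safe to apply to every response.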
Training Data Composition
Full Corpus
The full corpus is 19.5 trillion tokens; actual pre-training sampled 11.7 trillion of them.
| Language | Tokens | Share |
|---|---|---|
| English | 17.8T | 91% |
| Other (Chinese, Korean) | 850B | 4.4% |
| Japanese | 700B | 3.6% |
| Code | 200B | 1% |
Pre-training Sampling Ratios
According to developer @odashi_t’s X post (2026-04-04), Japanese was heavily oversampled during training.
| Language | Corpus Share | Sampling Share | Multiplier |
|---|---|---|---|
| English | 91% | 72.9% | 0.8x |
| Japanese | 3.5% | 15.9% | 4.5x |
| Code | 1% | 7.6% | 7.6x |
| Chinese | ~4% | 3.4% | ~0.9x |
| Korean | ~0.4% | 0.2% | ~0.5x |
The reality of “Japanese-specialized”: English dominates the corpus at 91%, but training-time sampling boosts Japanese 4.5x to 15.9%. Japanese data consists mainly of web, government, and scientific/technical content. License policy explicitly excludes data generated or filtered by GPT and other commercial LLMs.
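The multipliers in the table are just the ratio of sampling share to corpus share, which is easy to sanity-check:

```python
corpus_share = {"English": 91.0, "Japanese": 3.5, "Code": 1.0}
sampling_share = {"English": 72.9, "Japanese": 15.9, "Code": 7.6}

# Oversampling multiplier: how often a language is drawn during
# training relative to its natural share of the corpus.
for lang in corpus_share:
    mult = sampling_share[lang] / corpus_share[lang]
    print(f"{lang}: {mult:.1f}x")
# English: 0.8x, Japanese: 4.5x, Code: 7.6x
```

In effect, every Japanese token was seen roughly 4.5 times as often as its raw corpus share would suggest, at the cost of mildly undersampling English.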
Knowledge Cutoff: December 2023
11.7 trillion tokens of training, but the corpus cutoff is late 2023.
| Question | Answer | Correct? |
|---|---|---|
| Japan’s PM (Japanese) | Kishida Fumio | ✗ (Shigeru Ishiba is correct) |
| Japan’s PM (English) | Refused: “uncertain” | — |
| US President | Joe Biden | ✗ (Trump’s 2nd term is correct) |
| 2024 Japan news | Unable to respond (thinking consumed) | ✗ |
Internal date recognition is inconsistent (varies between 2022-11 and 2024-10 depending on phrasing). The model tends to say “I don’t know” rather than hallucinate uncertain information, which is the safer approach.
Safety Filter: Very Strong
- Lock picking instructions → refused
- Molotov cocktail recipe → refused
- Offensive jokes → refused
- Romantic scenes → self-censored to PG-13
No abliterated version exists since this is an official NII model. Qwen3.5-abliterated remains necessary for uncensored use cases.
Countries That Can Train LLMs From Scratch
Outside the US and China, how many countries have done full pre-training from scratch rather than fine-tuning existing models? More than expected.
| Country | Organization | Model | Scale |
|---|---|---|---|
| France | Mistral AI | Mistral Large 3 | 675B MoE |
| France | BigScience (HuggingFace + CNRS-led) | BLOOM | 176B |
| UAE | TII | Falcon 3 | 1B-10B (14T tokens trained) |
| South Korea | Naver | HyperCLOVA X | Undisclosed (6T tokens trained) |
| South Korea | SK Telecom | A.X 3.1 | 7B-34B |
| South Korea | LG AI Research | EXAONE 4.0 | 236B MoE |
| South Korea | Upstage | Solar Open 100B | 102B |
| India | Sarvam AI | Sarvam 105B | 105B MoE |
| India | BharatGen (govt-backed) | Param-1 | 2.9B |
| Germany | Aleph Alpha | Luminous / Pharia | 13B-70B |
| Israel | AI21 Labs | Jamba 1.5 | 52B MoE |
| Canada | Cohere | Command R+ | 104B |
| Sweden | AI Sweden | GPT-SW3 | 40B |
| Russia | Sber | GigaChat | Undisclosed MoE |
| EU | Multi-national consortium | EuroLLM | 22B |
| Japan | NII | LLM-jp-4 | 32.1B MoE |
| Japan | PFN | PLaMo | Undisclosed |
| Japan | NTT | tsuzumi 2 | 30B |
South Korea stands out with four organizations independently running scratch training: Naver, SKT, LG, and Upstage. India’s Sarvam AI scaled to 105B MoE in 2026.
Japan has three organizations doing scratch training: NII, PFN, and NTT. LLM-jp-4 riding on the proven Qwen3MoE architecture was the right call. Use a design that already works, then pull out Japanese performance through corpus design and sampling strategy. Scoring 7.82 on MT-Bench JA, beating GPT-4o’s 7.29, proves this approach works.
The other half of the equation would be developing original architectures. NTT’s tsuzumi claims a proprietary design, and AI21 Labs’ Jamba demonstrated that novel structures like Mamba-Transformer hybrids can be built from scratch. Whether NII pushes into architecture design next, alongside the upcoming 332B-A31B model release, is worth watching.