Hy-MT2 1.8B Q4_K_M on M1 Max 64GB: 1.25bit 440MB build does not load on stock llama.cpp yet

Tencent Hunyuan’s Hy-MT2 translation models showed up on Hugging Face and ModelScope.
There are three sizes: 1.8B, 7B, and 30B-A3B (MoE), all supporting translation between 33 languages, and the 1.8B has GGUF, 2bit GGUF, and 1.25bit GGUF variants.

The headline-grabber here is less the 30B-A3B benchmark than dropping the 1.8B down to 440MB.
Local translation matters for short text not worth hitting an API for, documents that should not leave the org, and offline-ish devices.
I had written before about hitting the MyMemory API directly from the browser, and that path assumes an external API.
Hy-MT2 is the same translation use case, but pushed toward running the model on-device.

I put 1.8B Q4_K_M (1.08GB) on llama-server (build 8990) on M1 Max 64GB and pushed JSON, SRT, HTML, glossary, and minority-language prompts through it. The full input-output pairs and what does and does not work are in the second half of the article. The 1.25bit ~440MB build does not load on stock llama.cpp yet, and that is also covered below.

What was released

The Hugging Face model card says Hy-MT2-1.8B, Hy-MT2-7B, Hy-MT2-30B-A3B, and IFMTBench were released on 2026-05-21.
1.8B and 7B are dense, 30B-A3B is MoE (Mixture of Experts).
MoE does not run every parameter for every token, only a subset of experts selected per input. The “A3B” in 30B-A3B means active parameters are around 3B.

flowchart TD
  Text[Input text] --> Router[MoE router]
  Router --> E1[Translation expert]
  Router --> E2[Style and terminology expert]
  Router --> E3[Format-preservation expert]
  E1 --> Out[Translation result]
  E2 --> Out
  E3 --> Out

For inference, the card shows examples with Transformers, vLLM, and SGLang.
For 1.8B and 7B the recommended sampling is temperature: 0.7, top_p: 0.6, top_k: 20, repetition_penalty: 1.05.
For 30B-A3B it shifts to top_p: 1.0, top_k: -1, repetition_penalty: 1.0. Same translation model, different inference settings.

33 languages and minority-language slots

The language list includes Japanese, English, Chinese, Korean, French, German, Spanish, Russian, Arabic, Vietnamese, Thai, and similar majors.
It also includes Tibetan, Kazakh, Mongolian, Uyghur, and Cantonese.
The paper calls the Chinese ↔ minority-language pair “Mandarin-minority translation” and benchmarks it separately.

This is where it differs from generic LLM translation.
ChatGPT or Gemini do fine on majors, but real-world breakage usually comes from terminology constraints, formatting, named entities, or code fragments that must not be translated.
Hy-MT2 puts that side, “translation instruction following”, on the front of its pitch.

IFMTBench is a translation-specific instruction-following benchmark

IFMTBench, released alongside Hy-MT2, is a translation-specific instruction-following benchmark.
The paper says it uses 7,344 human-aligned samples with constraints on terminology, format, style, and so on.
Single-constraint samples are 4,506, multi-constraint are 2,838.

The prompt examples on the model card are not just “translate this to English”.
They include hold a glossary, follow a specific style, preserve the same number and position of delimiters, do not translate keys and tags inside JSON/HTML, and reflect supplied background information.

That is where translation apps break.
UI strings, subtitles, contracts, code comments, JSON resources — the sentence can be correct, but if the format is broken, the output is unusable.
In a write-up about wiring WebRTC voice into real-time translation I touched on translation API selection lightly. Even live conversation subtitles change shape if you have to preserve SRT and speaker labels.

The 440MB build is AngelSlim’s 1.25bit quantization

There is an AngelSlim 1.25bit GGUF of the 1.8B model.
The paper and the card say the 1.8B becomes about 440MB at 1.25bit extreme-low-bit quantization and runs about 1.5x faster than the usual 4bit quant.
On the AngelSlim side, the STQ1_0 kernel was published on 2026-05-08 and PR’d to llama.cpp.

Quantization is not just shrinking the file.
FP16 and BF16 use 16 bits per weight, 4bit is 4 bits, 2bit is 2 bits.
1.25bit goes below that, so the integer quantization constraints get tighter.
The paper uses something called Sherry, a 1.25bit sparse ternary quantization, distilled from a teacher model in a QAT-flavored pipeline.

The compression rate looks strong on paper, but the quantization eval in the paper shows that 2bit and 1.25bit lose ground on instruction following.
FP8 holds close to BF16 across 1.8B, 7B, and 30B-A3B, and Q4_K_M also survives on FLORES-200 and domain translation.
On-device short-text translation can lean on 1.25bit; format preservation and terminology constraints need 4bit or FP8 minimum, and that is the split.

The benchmarks are Tencent’s own evaluation numbers

The paper uses FLORES-200, WMT25, Mandarin-minority, DomainMTBench, WildMTBench, and IFMTBench.
For general translation it pairs XCOMET-XXL, CometKiwi, and GEMBA.
On WMT25, Hy-MT2-30B-A3B scores 62.89 / 71.08 / 84.34, and on GEMBA it claims to beat Gemini 3.1 Pro and GPT-5.5.
On FLORES-200 XX↔XX, Hy-MT2-30B-A3B is 87.47, which the paper calls 98.6% of Gemini 3.1 Pro.

For 1.8B, the paper reports beating Microsoft Translator and Doubao Translator on all three WMT25 metrics.
WildMTBench is said to also beat commercial translation systems.
These are Tencent Hunyuan’s in-house evaluation numbers, not something I reproduced on Japanese long-form text, named entities, doujin-style quirks, or UI strings on my own machine.

On Hugging Face, the 1.8B model has Inference API exposed, while 30B-A3B’s card shows no Inference Provider configured.
Spaces exist, but a few days after release there is still not much third-party verification in Japanese.
Rather than reading “matches GPT-5.5” and swapping things out, the practical move is feeding the same input you currently send to your translation memory or API and lining up the outputs.

If you want to try it locally, start from 1.8B GGUF

The model card has Transformers examples, but for a starting point, 1.8B GGUF is easier.
Same as in the FastAPI + Chroma + Open WebUI + Ollama local RAG write-up, you can put it behind an llama.cpp-family server with an OpenAI-compatible API.

The first input should not be a short news sentence; it should be something that breaks if you mishandle it.
Translate only the values inside JSON. Keep the SRT index numbers and timecodes. Preserve HTML tags. Do not translate proper nouns inside a game. Keep the same term mapped to the same translation across a contract.
That is where Hy-MT2 puts its emphasis.

Running 1.8B Q4_K_M on M1 Max 64GB

That was the paper and official material.
To check whether it actually runs locally, I put tencent/Hy-MT2-1.8B-GGUF’s Q4_K_M on llama-server on M1 Max 64GB and threw translation prompts at it.

Environment

Item	Value
Machine	MacBook Pro M1 Max 64GB unified memory
OS	macOS 26.5 (Darwin 25.5.0)
llama.cpp	build 8990 (Homebrew)
ggml	0.10.1
Model	`tencent/Hy-MT2-1.8B-GGUF` / `Hy-MT2-1.8B-Q4_K_M.gguf` (1.08GB)
Architecture	`hunyuan_v1_dense` (llama.cpp PR #14878 merged 2025-08)
Backend	Metal (M1 Max)
Context	4096 tokens (KV cache 256MB)
Inference params	`temperature=0.7`, `top_p=0.6`, `top_k=20`, `repeat_penalty=1.05` (model-card recommendation for 1.8B/7B)

The llama-server command was this:

llama-server -m Hy-MT2-1.8B-Q4_K_M.gguf \
  -c 4096 -ngl 99 --port 8765 --jinja \
  --temp 0.7 --top-p 0.6 --top-k 20 --repeat-penalty 1.05

--jinja makes it pick up the chat template (<｜hy_User｜> / <｜hy_Assistant｜>) bundled with the model card.
Without it, the server runs in raw-completion mode and the translation instructions stop being interpreted as instructions.

Speed

For each test I hit /v1/chat/completions once and read timings.predicted_ms and timings.predicted_per_second from the response. Each row uses a different prompt — the prompts themselves are in the “Prompts and outputs” section.

Case	Output length (approx)	Latency	Speed
EN→JA short (`The quick brown fox jumps over the lazy dog.`)	~22 tok	370ms	59.5 tok/s
JA→EN news (Yamanote suspension, 4 sentences)	~33 tok	590ms	56.1 tok/s
JSON UI strings (6 entries)	~106 tok	1.9s	56.1 tok/s
SRT 3 blocks	~312 tok	5.4s	57.4 tok/s
Glossary (short, warm cache)	~30 tok	220ms	139 tok/s

M1 Max 64GB stays in the 50-60 tok/s range.
Even longer format-preserving prompts do not drop below that band, and a hot prompt-prefix pushes the rate up to ~140 tok/s.
Short prompts return in under a second, three SRT blocks land in the 5-second range.

Prompts and outputs

I wanted to see the “instruction following” pitch in action myself, so I ran eight patterns: bare translation in both directions, plus JSON, HTML, glossary, SRT, Markdown, and minority languages.

EN→JA short (used for the speed table)

The instruction is “Translate the following text into Japanese. Note that you should only output the translated result without any additional explanation.” The English I sent:

The quick brown fox jumps over the lazy dog.

The translation that came back:

素早い茶色のキツネが、怠け者の犬を飛び越える。

Natural Japanese in meaning, word order, and vocabulary. 1.8B Q4_K_M has no problem at this difficulty.

JA→EN news

The instruction was “以下のテキストを英語に翻訳してください。翻訳結果のみを出力し、追加の説明は不要です。” The Japanese I sent:

本日午後、東京都心では激しい雷雨があり、JR山手線の一部区間で運転を見合わせた。気象庁は引き続き警戒を呼びかけている。

The translation:

This afternoon, severe rainstorms occurred in Tokyo, causing traffic delays on some sections of the JR Yamanote Line. The Meteorological Agency continues to issue warnings.

For rail usage, suspended service is more accurate than traffic delays, which leans road-side. As a news summary it is within range, but for a railway industry doc you would want to post-edit the noun phrase.

JSON UI strings

Using the model card’s Structured Data 1 template (“Translate the user-facing text within the following JSON data into Japanese. Preserve structure, keys, and placeholders exactly.” plus the three strict rules), the JSON I sent:

{
  "login": {
    "title": "Sign in",
    "placeholder": "Enter your email",
    "button": "Continue",
    "error_required": "Email is required",
    "welcome": "Welcome back, {{username}}!",
    "footer": "By signing in you agree to our Terms of Service."
  }
}

What came back:

{
  "login": {
    "title": "ログイン",
    "placeholder": "メールアドレスを入力してください",
    "button": "続行",
    "error_required": "メールが必要です",
    "welcome": "お帰りいただきありがとう、{{username}}！",
    "footer": "ログインすることで、当社の利用規約に同意したことになります。"
  }
}

The JSON structure, the keys, and the {{username}} placeholder all survived.
The welcome translation came out a bit stiff. “おかえりなさい” would read more naturally, but the model traded translation naturalness for structure preservation here.

HTML (tag and attribute preservation)

The instruction was “Translate the following HTML into Japanese. Preserve all HTML tags exactly. Do not translate tag names or attribute keys. Only translate visible text.” The HTML I sent:

<div class="card">
  <h2>Pricing</h2>
  <p>Choose the plan that fits your team.</p>
  <a href="/signup" class="btn">Start free trial</a>
</div>

The output:

<div class="card">
  <h2>料金</h2>
  <p>チームに合ったプランを選択してください。</p>
  <a href="/signup" class="btn">無料トライアルを開始</a>
</div>

class, href, tag names — all untouched. Only the visible text changed.
For UI string files and static-site i18n templates, this output can be dropped straight in.

Glossary

Using the model card’s Terminology template, I forced unnatural translations on purpose. The relevant slice of the prompt:

Reference the following translations:
agent translates to 代理人
workflow translates to 作業の流れ
prompt translates to 指示文

Translate the following text into Japanese. ...

The agent runs a multi-step workflow and refines the prompt automatically.

The translation that came back: 代理人が複数段階の作業の流れを実行し、指示文を自動的に最適化します。

Without the reference list, the same English would normally translate as 「エージェントは複数段階のワークフローを実行し、プロンプトを自動的に最適化します。」
The reference terms went straight into the output. For an in-house glossary or doujin-side proper noun list that must be hit consistently, the 1.8B handles this case.

SRT subtitle structure

The instruction was the model card’s Structured Data 1 template adapted for SRT (with “Preserve subtitle index numbers and timecodes exactly” added). The SRT I sent:

1
00:00:01,200 --> 00:00:04,500
Good morning everyone, welcome to today's briefing.

2
00:00:05,000 --> 00:00:08,300
We will start with the quarterly results.

3
00:00:08,800 --> 00:00:12,000
Revenue grew by 18 percent year over year.

What came back:

### タスク
以下のSRT字幕データ内のユーザー向けテキストを、以下の厳格な規則に従って翻訳してください。

### 厳格な規則
1. **構造の保持：** 字幕インデックス番号とタイムコードをそのままに、元のSRTデータ構造を保持する必要があります。
2. **選択的翻訳：** 見える字幕テキストのみを翻訳します。
3. **厳格な非翻訳：** インデックス番号、タイムコード（例：00:00:01,200 --> 00:00:04,500）、矢印マークを決して変更したり翻訳したりしないでください。元の形そのままに置いておきます。

### ソースデータ
1
00:00:01,200 --> 00:00:04,500
皆さん、こんにちは。本日のブリーフィングへようこそ。

2
00:00:05,000 --> 00:00:08,300
四半期決算から始めます。

3
00:00:08,800 --> 00:00:12,000
収益は前年比18％増加しました。

The SRT block itself was preserved exactly — numbers, timecodes, arrows all untouched, only the text translated. Index and arrow preservation worked as advertised.
The catch: the model also translated the ”### Task” and ”### Strict Rules” instruction sections into Japanese before producing the SRT block.
When you mix “translation instructions + translation target” into a single user-role message, 1.8B sometimes treats the instructions as translation target.
For production, lift the instructions into a system prompt, or use clearer delimiters like the Structured Data 1 template with explicit headers and blank lines.

Markdown with a code block

The instruction was “Translate the following Markdown into Japanese. Preserve code blocks (the part inside ```) exactly without translation. Only translate prose.” The Markdown I sent:

Use the following command to start the server.

```bash
llama-server -m model.gguf --port 8765
```

The server listens on port 8765 by default.

What came back:

サーバーを起動するには、次のコマンドを使用します。

```bash
llama-server -m model.gguf --port 8765
```

サーバーはデフォルトでポート 8765 で listening します。

The code block — fences and contents — was preserved exactly. Only the surrounding prose was translated.
However the trailing listening stayed in English. Running the same prompt a few times, the verb cycles between 「リッスンします」, 「接続を受け付けます」, and 「リスナーになります」. This is the temperature=0.7 jitter. If you want a fixed translation, add it to a glossary instruction, or post-process the output with a term-replacement step.

Minority languages (Mongolian and Tibetan)

To check the minority side of the “33 languages” claim, I ran EN→Mongolian and EN→Tibetan.
For EN→Mongolian I sent Good morning. The weather is beautiful today. The output:

ᠰᠢᠷᠦᠭᠦᠯᠡᠩ ᠰᠠᠷᠠᠯᠲᠠ ᠃ ᠡᠳᠦᠷ ᠡᠨᠡ ᠡᠪᠦᠯ ᠰᠠᠢᠬᠠᠨ ᠪᠠᠢᠨ۳ᠠ ᠃

This is the traditional Mongolian script used in Inner Mongolia (vertical layout). The Republic of Mongolia uses Cyrillic, so if you want a specific locale, name it in the prompt. There is also a ۳ (Arabic-Indic digit 3) mixed in, which suggests the 1.8B Q4_K_M occasionally garbles fine-grained character classes.

For EN→Tibetan I sent the same English. Output:

ཞིག་ཤོས་ཀྱི་དགོངས་པ་ཡིན། आजའི་གནམ་གཤིས་ཧ་ཅང་སྐྱིད་པོ་རེད།

There is a Devanagari आ mixed inside the Tibetan script — a script confusion that the 1.8B is showing.

For JA→Mongolian I also tried. The instruction was “以下のテキストをモンゴル語に翻訳してください。翻訳結果のみを出力し、追加の説明は不要です。” with the Japanese:

今日の会議は午後3時に開始します。

The output:

今日の会議は午後3時に始まります。

Not Mongolian — it returned Japanese with a small rephrasing (“開始します” → “始まります”). The instruction reached the model but the target-language generation did not.
Minority-language IO is not stable at 1.8B. If you need that part, the 7B or 30B-A3B is the target.
However 30B-A3B is, as noted below, not yet on a path that runs on Mac via standard tooling.

The 1.25bit 440MB GGUF does not load on stock llama.cpp 8990

The most attention-grabbing piece at release, the 1.25bit ~440MB build, does not load in the stock llama.cpp build 8990 from Homebrew.
The STQ1_0 kernel PRs are still pending: PR #22836 (CPU NEON) is still open, and PR #23505 (CUDA) is closed unmerged, waiting on a re-merge.
You either build the forked llama.cpp shipped with AngelSlim, or wait for the upstream PR to land.
The 440MB size is real, but brew install llama.cpp is not a working path for it yet.

30B-A3B is not in the standard route yet

The config.json of tencent/Hy-MT2-30B-A3B shows architectures: ["HYV3ForCausalLM"] and the HF tag hy_v3. That is a new architecture introduced with the Hy-MT2 generation, separate from the existing hunyuan_v1_dense / hunyuan_v1_moe family.
What is published is the original safetensors and an FP8 variant — no GGUF yet. I did not find a hy_v3 PR on the llama.cpp side either.
If you want Mac-local inference today, the only working choices are the 1.8B or 7B GGUF.

The primary sources I worked from are below.