Abliterated Models in Ollama Were a Complete Failure — and the Official Version Was Fine All Along
Update (2026-03-01): The reason Qwen 3.5 wouldn’t run on the Radeon 8060S described in this article turned out to be an old AMD driver. Updating to driver 26.2.2 or later enables normal Vulkan GPU inference. See the follow-up article for details.
huihui-ai on HuggingFace has been cranking out abliterated model variants. Abliteration is a technique that identifies the “refusal direction” vector in a model’s residual stream and removes it — wiping out only the “I’m sorry, I can’t answer that” type responses. It doesn’t retrain the model; it just suppresses the activations of neurons that contribute to refusals, so the base model’s capabilities should remain largely intact.
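The mechanism can be sketched in a few lines. This is a toy illustration of directional ablation, not huihui-ai's actual pipeline; all vectors and names below are made-up example data.

```python
# Minimal sketch of directional ablation ("abliteration") in pure Python.
# All vectors here are illustrative toy data, not from any real model.

def mean_vec(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v]

def ablate(activation, refusal_dir):
    # Project out the component along the (unit-length) refusal direction:
    #   a' = a - (a . r) r
    c = dot(activation, refusal_dir)
    return [a - c * r for a, r in zip(activation, refusal_dir)]

# Toy hidden states collected on refusal-triggering vs. harmless prompts.
refused  = [[1.0, 2.0], [1.2, 1.8]]
harmless = [[1.0, 0.0], [0.8, 0.2]]

# The "refusal direction" is the normalized difference of the means.
r = normalize([a - b for a, b in zip(mean_vec(refused), mean_vec(harmless))])

a = ablate([3.0, 4.0], r)
print(round(abs(dot(a, r)), 6))  # component along the refusal direction: 0.0
```

Everything orthogonal to the refusal direction passes through untouched, which is why capabilities should, in theory, survive.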
In practice, it wasn’t that simple.
Environment
| Item | Spec |
|---|---|
| PC | GMKtec EVO-X2 |
| CPU | AMD Ryzen AI Max+ 395 |
| GPU | Radeon 8060S (48GB VRAM allocated) |
| Memory | 64GB (unified memory) |
| OS | Windows 11 |
| Ollama | 0.16.2–0.17.4 |
| LM Studio | Latest (Vulkan) |
| Comparison env | Mac (M1 Max / Ollama 0.17.4 / Metal) |
Qwen 3.5 Abliterated: Total Failure
Qwen 3.5 punches above its weight for a lightweight model, but its safety filters are stricter than Qwen 2.5's, so the appeal of abliteration is understandable. I tried huihui_ai/qwen3.5-abliterated, which was available on Ollama.
Ollama Version Issues
Qwen 3.5 was released in late February 2026, so older Ollama versions don’t support its architecture. The download completes but loading fails with a 500 error.
Error: 500 Internal Server Error: unable to load model
Updating to Ollama 0.17.4 or later fixes this.
27b (Dense) Results
>>> Hey there
Thinking...
传递更多信息 associates传递更多信息 associates传递更多信息ivitprest associates传递更多信息
associatesprestivitprest associatesprestivitprest associatesprest associatesprest
associates传递更多信息 associates传递更多信息ivitprestivitprest associates传递更多信息
(goes on forever)
Broken. The thinking output is a complete loop of garbage tokens. “associates”, “传递更多信息”, “ivitprest” repeat infinitely and the response never arrives.
Trying to Disable Thinking
Whether I embedded /no_think in the system prompt via /set system or launched with --think=false, the output was the same garbage. The "Thinking…" label disappears, but the same tokens come out as the response body.
This isn’t a template or prompt issue — the model weights themselves are broken. The inference engine is technically working (using 16.6GB VRAM, outputting a lot of tokens in 20 seconds), but every token is garbage.
35b-a3b (MoE) Results
While 27b is a Dense model, 35b-a3b is MoE (active 3B) with a completely different internal structure. I thought abliteration might affect it differently, but the results were the same.
>>> Hey there
传递更多信息ivit传递更多信息ivit传递更多信息ivitprestivit传递更多信息
associatesprestivit传递更多信息 associatesprest associatesprest
(same garbage tokens)
It's broken regardless of Dense or MoE: both 27b and 35b-a3b fall into the exact same "associates传递更多信息ivitprest" token-loop attractor.
The Official Version Had the Same Problem
Out of curiosity, I tried the non-abliterated official qwen3.5:35b-a3b.
C:\Users\evovo360>ollama run qwen3.5:35b-a3b --think=false --verbose
>>> Hey there
prest associates传递更多信息 associatesprest associatesprest associatesprest associates传递更多信息
associates传递更多信息ivit传递更多信息ivitprest associates传递更多信息 associates传递更多信息
(exactly the same garbage tokens as the abliterated version)
Same result. Abliteration didn’t break it. Qwen 3.5 itself doesn’t work in this environment.
The same Qwen 3.5 running on Mac (M1 Max) in Ollama handled OCR correction and multimodal image reading without issues. The Mac's Metal backend works fine, and GLM-4.7-Flash runs fine on ROCm in the same Windows environment. Only the Qwen 3.5 + ROCm (gfx1151) combination is broken.
Why It Breaks
This isn’t abliteration’s fault — llama.cpp’s ROCm backend inside Ollama can’t fully handle Qwen 3.5’s architecture. Qwen 3.5’s GQA (Grouped Query Attention) and RoPE implementation differ from GLM, and ROCm’s kernels likely have an unresolved issue or incompatibility with this specific configuration.
It feels like the tokenizer is getting trapped in a specific ID range loop, falling into the same “associates传递更多信息ivitprest” attractor regardless of input.
Initially I thought abliteration had broken the thinking-related weights, but since the official version produces the same garbage, it’s a ROCm + gfx1151 problem. If the base model doesn’t work in the environment, the abliterated version obviously won’t either.
Qwen 3.5 doesn’t work on Radeon 8060S (ROCm / gfx1151). Whether it’s abliterated or not is irrelevant.
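For context on the RoPE part of that hypothesis: rotary position embeddings rotate (even, odd) pairs of each query/key vector by position-dependent angles, and a backend kernel that mishandles the rotation for a new head layout can corrupt attention wholesale. A minimal pure-Python sketch of standard RoPE, purely illustrative and not llama.cpp's actual kernel:

```python
import math

def rope(vec, pos, base=10000.0):
    # Rotary position embedding: rotate consecutive (even, odd) pairs
    # of the vector by an angle depending on position and pair index.
    d = len(vec)
    out = list(vec)
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i]     = x * c - y * s
        out[i + 1] = x * s + y * c
    return out

q = [1.0, 0.0, 0.5, 0.5]
print(rope(q, pos=0))  # position 0 leaves the vector unchanged
```

Because each pair is rotated, the vector's norm is preserved; a kernel bug that breaks this property (or applies the wrong theta schedule for a given architecture) would scramble attention scores on every token, which matches the total-garbage symptom.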
Verifying the Abliterated Version on Mac
To isolate whether this was a ROCm issue or an abliteration issue, I ran the same abliterated version on Mac (M1 Max / Metal backend). Same Ollama version 0.17.4, same llama.cpp version — only the backend (ROCm vs Metal) differs.
>>> Hey there
(Replied in Japanese:)
Hello! 👋
How are you?
What can I help you with today?
Feel free to ask questions or just chat! 😊
It worked normally. Thinking was doing reasonable inference in English with zero garbage tokens. The “associates传递更多信息ivitprest” loop on Windows was entirely a ROCm-specific problem.
Conclusion: Abliteration is innocent. The model weights are not broken. Same model, same Ollama version — only Metal vs ROCm backend, and the results are completely different.
The cause is llama.cpp's ROCm backend being unable to correctly process Qwen 3.5's architecture on gfx1151 (RDNA 3.5 / Radeon 8060S). Since GLM-4.7-Flash works fine in the same ROCm environment, the whole ROCm stack isn't broken; Qwen 3.5's GQA and RoPE implementation likely hits unresolved incompatibilities or bugs in the gfx1151 kernels. gfx1151 is a new GPU target from early 2025, and llama.cpp's ROCm kernel support probably hasn't caught up.
Switching to Vulkan Backend
If ROCm is the cause, Vulkan on the same Windows environment should work. Since Ollama defaults to ROCm, I forced Vulkan via environment variables in PowerShell.
$env:OLLAMA_LLM_LIBRARY = "vulkan"
$env:OLLAMA_VULKAN = "1"
ollama serve
The key is OLLAMA_LLM_LIBRARY=vulkan — OLLAMA_VULKAN=1 alone doesn’t override ROCm. In PowerShell you must use $env: rather than set to set environment variables (spent 30 minutes figuring that out).
Confirmed library=Vulkan in the startup log. VRAM detection with Vulkan shows 55.6GiB (larger than ROCm’s 48GiB).
Running Qwen 3.5:
>>> Hey there
Error: 500 Internal Server Error: model runner has unexpectedly stopped
Vulkan couldn’t even load it. ROCm was at least running inference (producing garbage), but Vulkan crashes the model runner immediately.
Last Hope: LM Studio
I also tried LM Studio. Previous versions couldn’t correctly detect the Radeon 8060S VRAM and threw an exit code overflow (18446744072635812000) on GLM-4.7-Flash load, making it unusable.
After updating LM Studio, it now correctly recognizes 48GB VRAM from the Radeon 8060S via Vulkan. This was a sign that gfx1151 support had been added.
I downloaded Qwen 3.5 35b-a3b (Q4_K_M) and tried it. The moment I sent a prompt, LM Studio returned “This message contains no content. The AI has nothing to say.” and then crashed. Loading succeeds, but it dies when inference starts.
I initially suspected memory overflow — 15.8GB VRAM used with 7.1GB in shared memory, and this PC tends to fill shared memory with every model. But limiting VRAM to 11GB and main memory to 1GB produced the same error, ruling out memory as the cause. The Vulkan backend itself can’t handle Qwen 3.5 inference on gfx1151.
In the end, every method for running Qwen 3.5 on the Radeon 8060S failed:
| Backend | Result |
|---|---|
| ROCm (Ollama) | Loads but outputs garbage tokens |
| Vulkan (Ollama) | Can’t even load, crashes |
| Vulkan (LM Studio) | Loads but crashes at inference start |
| Metal (Mac / Ollama) | Works normally |
GLM-4.7-Flash Abliterated: Template Hell
Since Qwen 3.5 was a dead end, I tried GLM-4.7-Flash abliterated. GLM-4.7-Flash is MoE (30B, active 3B), inference-light, and well-regarded for coding tasks.
LM Studio Failure (Old Version)
When trying to run GLM-4.7-Flash in an older LM Studio (before VRAM detection was fixed), mradermacher’s GGUF (Q5_K_S) wouldn’t even load.
Failed to load model
Error loading model.
(Exit code: 18446744072635812000). Unknown error.
The exit code is a negative status value that wrapped around into an unsigned 64-bit integer. Changing the context length didn't affect the memory estimates. At that point, LM Studio's Vulkan backend simply didn't support the Radeon 8060S (gfx1151).
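The giant number is just two's-complement wraparound, which you can undo in one line (which Windows status the resulting negative value corresponds to, I can't say for certain):

```python
# Reinterpret the reported unsigned 64-bit exit code as a signed value.
code = 18446744072635812000
signed = code - 2**64 if code >= 2**63 else code
print(signed)  # -1073739616, a negative 32-bit-style status code
```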
Fighting with Ollama
Switched from LM Studio to Ollama. Loading mradermacher’s GGUF with a Modelfile resulted in completion mode output (text completion rather than chat).
>>> Hey there
、
Today + date_str(date) +
is.
+ "Hello, world! Today is a good day." \
(garbage code completion continues forever)
The cause was the template. Checking huihui_ai/glm-4.7-flash-abliterated on Ollama, the template was just {{ .Prompt }}. No chat template was configured at all.
I wrote the correct GLM-4.7-Flash chat template in a Modelfile and recreated the model:
FROM /path/to/GLM-4.7-Flash.gguf
RENDERER glm-4.7
PARSER glm-4.7
TEMPLATE "[gMASK]<sop>{{ if .System }}<|system|>
{{ .System }}{{ end }}<|user|>
{{ .Prompt }}<|assistant|>
{{ .Response }}"
PARAMETER stop <|user|>
PARAMETER stop <|observation|>
PARAMETER stop <|end|>
RENDERER glm-4.7 and PARSER glm-4.7 are critical — without them, Ollama can’t correctly interpret GLM-4.7’s format.
After fixing the template, there was still a problem with thinking streaming endlessly in Chinese with no actual answer, but using the --think=false flag made conversations work.
However, the abliterated version’s Japanese quality was terrible:
>>> Hello, how are you?
We're both in high spirits (lol).
>>> What have you been up to lately?
I thought maybe you'd gone on a trip or something, and that things have been busy personally for you lately (lol)
It fabricates nonexistent conversation history, leaks </think> tags into the output, and slips into an excessively apologetic, self-deprecating persona.
Solution: Official Version
Gave up on the abliterated version and ran the official glm-4.7-flash:q8_0 with --think=false — it worked immediately.
>>> Ugh I'm so tired today
That sounds tough! Today's almost over, so hang in there to the end.
Is there something causing the fatigue? Or do you just feel physically drained?
>>> Keep it shorter
Got it.
Take it easy and rest up!
Normal conversation. Adding a character system prompt in the Modelfile enabled decent Japanese roleplay too.
As for NSFW content, the official version with thinking disabled is quite permissive on its own. A reference article also reports that disabling thinking relaxes NSFW restrictions — trying it myself confirmed there was no need for the abliterated version.
Final Configuration
| Item | Value |
|---|---|
| Model | glm-4.7-flash:q8_0 (official version) |
| Runtime | Ollama 0.16.2 |
| Startup flag | --think=false |
| Modelfile | RENDERER/PARSER glm-4.7 + character SYSTEM |
| Speed | 33 tokens/s (prompt eval 413 t/s) |
| VRAM usage | 34GB (model + KV Cache) |
| Backend | ROCm (gfx1151) |
LM Studio’s failure was due to the Vulkan backend being incompatible with the Radeon 8060S. Ollama with ROCm correctly recognized gfx1151 and fully utilized the 48GB unified memory. The ollama serve log showed library=ROCm compute=gfx1151.
MoE with active 3B means speed is nearly identical between Q8_0 and Q4_K_M. At 33 t/s it exceeds human reading speed, more than enough for conversation.
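The back-of-envelope math behind that: decode speed is usually bounded by how fast the active parameters can be streamed from memory each token. The bandwidth figure and bytes-per-weight values below are assumptions for illustration, not measurements of this machine.

```python
# Rough decode-speed ceiling: each generated token reads roughly the
# *active* parameters once, so ceiling = bandwidth / active_bytes.
# Bandwidth and bytes-per-weight values are assumptions, not measurements.

def decode_ceiling_tps(active_params_billions, bytes_per_weight, bandwidth_gbs):
    active_gb = active_params_billions * bytes_per_weight
    return bandwidth_gbs / active_gb

BW = 256.0  # GB/s, assumed unified-memory bandwidth for this class of machine
for quant, bpw in [("Q8_0", 1.06), ("Q4_K_M", 0.60)]:
    print(quant, round(decode_ceiling_tps(3.0, bpw, BW)), "t/s ceiling")
```

With only 3B active parameters, both quantizations have theoretical ceilings far above the measured 33 t/s, so memory bandwidth isn't the bottleneck here and the quantization choice barely shows up in decode speed.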
Types of Failures
When an abliterated model doesn’t work, there are roughly three levels of severity.
Level 1: Chat template not configured (GLM pattern)
The template is just {{ .Prompt }} with no chat template, so it runs in completion mode. Writing your own template in a Modelfile fixes it. Lazy on huihui-ai’s part, but not fatal.
Level 2: Output language is wrong
Only outputs Chinese, or only responds in English, etc. Often fixable with a system prompt or language instruction. The model itself is running, so it can be addressed with configuration.
Level 3: Backend incompatibility (Qwen 3.5 + ROCm pattern)
Output is in no language at all. A meaningless token sequence like “associates传递更多信息ivitprest” loops infinitely. Template, system prompt, thinking on/off, quantization changes, switching between Dense and MoE — nothing changes the garbage output. The tokenizer is stuck in a specific ID range attractor that every input falls into.
Initially I thought abliteration had damaged the weights, but running the same model on Mac (Metal) worked normally — the weights are fine. llama.cpp’s ROCm backend simply can’t handle that model’s architecture. Switching backends might work, but there’s nothing you can do within the same environment.
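A quick way to triage Level 3 output programmatically is to check whether the token stream has collapsed into a short repeating cycle. A minimal sketch (the detector and its thresholds are my own illustration, not part of any tool mentioned above):

```python
def has_token_loop(tokens, max_period=8, min_repeats=4):
    # Detect degenerate output: a short period p such that the tail of
    # the sequence repeats with that period at least min_repeats times.
    for p in range(1, max_period + 1):
        tail = tokens[-p * min_repeats:]
        if len(tail) < p * min_repeats:
            continue
        if all(tail[i] == tail[i % p] for i in range(len(tail))):
            return True
    return False

garbage = ["associates", "传递更多信息", "ivit", "prest"] * 10
normal  = ["Hello", "!", " How", " can", " I", " help", " you", " today", "?"]
print(has_token_loop(garbage), has_token_loop(normal))  # True False
```

Template bugs (Level 1) produce off-topic but varied text; backend bugs (Level 3) trip a detector like this on essentially every prompt.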
State of Abliterated Models
What this testing revealed:
- Qwen 3.5 abliterated wasn't broken. Mac verification showed it working normally; only the ROCm (gfx1151) backend fails to handle Qwen 3.5's architecture. Abliteration is innocent.
- GLM-4.7-Flash abliterated ships with a broken template. The huihui-ai Ollama page template is just {{ .Prompt }}, which doesn't function as chat. Writing your own template makes it work, but Japanese quality degrades significantly.
- The official version with thinking disabled was sufficient. GLM-4.7-Flash with --think=false relaxes NSFW restrictions considerably, so there's no need for the abliterated version.
- huihui-ai's abliterated releases have no quality control. They're mass-produced without template configuration or functional testing; users should be prepared to fix the template issues themselves.
Regarding Qwen 3.5, what initially looked like abliteration side effects turned out to be a backend problem; the abliterated model itself works, as confirmed on Mac. GLM-4.7-Flash abliterated, however, shows clear Japanese quality degradation, so abliteration's side effects aren't zero. For relaxing NSFW restrictions while maintaining quality, the official version with thinking disabled was the more realistic choice.