Running Qwen-Scope's SAE on M1 Max 64GB to Extract a Japanese-Language Feature
After writing the Qwen-Scope intro article, the next question was whether this thing actually runs on my own hardware. The official model card example uses torch.float32 + CPU, so whether it runs cleanly on an M1 Max with 64GB is a separate question. This is the raw log of trying it.
Environment
| Item | Value |
|---|---|
| Hardware | Mac Studio M1 Max 64GB |
| OS | macOS Darwin 25.3.0 |
| Python | 3.13.11 (miniconda) |
| PyTorch | 2.11.0 (MPS enabled) |
| transformers | 5.7.0 |
| Target model | Qwen/Qwen3-8B-Base |
| Target SAE | Qwen/SAE-Res-Qwen3-8B-Base-W64K-L0_50 |
| Inference precision | bf16 |
| Starting disk free | 81GB |
Qwen3-8B-Base is ~16GB in bf16, and one SAE layer is ~2.0GB, so the plan is to grab just one middle layer instead of getting greedy.
Setup
Spun up a venv at scripts/experiments/qwen-scope/ with the minimum stack.
python3 -m venv venv
source venv/bin/activate
pip install torch transformers huggingface_hub safetensors
Versions are in the environment table. MPS check is torch.backends.mps.is_available() returning True.
Download just the model and one SAE layer.
from huggingface_hub import snapshot_download, hf_hub_download
snapshot_download("Qwen/Qwen3-8B-Base", allow_patterns=["*.safetensors", "*.json", "tokenizer*", "merges.txt", "vocab.json"])
hf_hub_download("Qwen/SAE-Res-Qwen3-8B-Base-W64K-L0_50", "layer17.sae.pt")
Measured numbers.
- Qwen3-8B-Base, 12 files: ~260 seconds (~16GB)
- SAE layer17 (middle layer): ~255 seconds (2.0GB)
- Disk free: 81GB → 63GB
All 36 SAE layers would be 72GB and combined with the 16GB model would eat all the free space. Starting with one middle layer was the right call.
Scan script
The minimum version of the official model card example, rewritten for bf16 + MPS. Full file lives at scripts/experiments/qwen-scope/run_sae_layer17.py.
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen3-8B-Base"
SAE_REPO = "Qwen/SAE-Res-Qwen3-8B-Base-W64K-L0_50"
LAYER, TOPK = 17, 50

device = torch.device("mps")
dtype = torch.bfloat16

# Load the SAE weights for one layer; the .pt file is a plain state dict.
sae = torch.load(hf_hub_download(SAE_REPO, f"layer{LAYER}.sae.pt"),
                 map_location="cpu", weights_only=True)
W_enc = sae["W_enc"].to(device=device, dtype=dtype)
b_enc = sae["b_enc"].to(device=device, dtype=dtype)

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, dtype=dtype, low_cpu_mem_usage=True).to(device).eval()

# Capture the layer-17 residual stream with a forward hook.
captured = {}
def _hook(_m, _i, output):
    captured["residual"] = (output[0] if isinstance(output, tuple) else output).detach()

hook = model.model.layers[LAYER].register_forward_hook(_hook)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(device)
with torch.no_grad():
    model(**inputs)
hook.remove()

# SAE encoder: one matmul + bias, then take the top-k features per token.
pre_acts = captured["residual"] @ W_enc.T + b_enc  # (1, T, 65536)
topk = pre_acts.topk(TOPK, dim=-1)
The official code is fp32+CPU but bf16+MPS works as a drop-in replacement. The SAE is one matmul + topk, so bf16 precision loss hasn’t bitten yet.
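The per-token tables in the next section are just a dump of topk. A minimal sketch of the printing loop (the loop is my own addition, not part of run_sae_layer17.py; it reuses inputs, topk, and tokenizer from the script above):

# Dump top-5 feature IDs and values per token (my own snippet).
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for pos, tok in enumerate(tokens):
    ids = topk.indices[0, pos, :5].tolist()
    vals = [round(v, 1) for v in topk.values[0, pos, :5].float().tolist()]
    print(pos, tok, ids, vals)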
Single-prompt top features
Comparing top-5 feature IDs and values across three prompts ('Ġ' is the BPE word-leading-space marker).
English: "The capital of France is"
| pos | token | top-5 feature IDs | top-5 values |
|---|---|---|---|
| 0 | The | 34149, 52900, 27004, 6003, 6311 | 2640, 1432, 1144, 976, 976 |
| 1 | Ġcapital | 37019, 24229, 53623, 1339, 4368 | 37.2, 16.5, 14.9, 14.3, 13.8 |
| 2 | Ġof | 4049, 24229, 18523, 37019, 28040 | 21.1, 17.6, 16.5, 14.2, 13.0 |
| 3 | ĠFrance | 33343, 28040, 5649, 24229, 15470 | 22.8, 13.4, 12.2, 11.2, 11.0 |
| 4 | Ġis | 56437, 24229, 28040, 18523, 11733 | 21.5, 16.2, 14.3, 12.2, 11.3 |
Code: "import numpy as np"
| pos | token | top-5 feature IDs | top-5 values |
|---|---|---|---|
| 0 | import | 34149, 26873, 8655, 11238, 25933 | 2,637,824, 901,120, 876,544, 679,936, 630,784 |
| 1 | Ġnumpy | 34149, 52900, 27004, 6311, 6003 | 3072, 1656, 1328, 1144, 1128 |
| 2 | Ġas | 62043, 39165, 34153, 5551, 5649 | 105.5, 55.8, 28.1, 17.5, 16.2 |
| 3 | Ġnp | 26489, 1027, 46998, 22005, 34149 | 42.0, 35.8, 33.2, 28.2, 21.5 |
Japanese: "今日はいい天気だ" (= “It’s a nice day today”)
| pos | token | top-5 feature IDs | top-5 values |
|---|---|---|---|
| 0 | 今日は | 34149, 52900, 27004, 6311, 6003 | 3600, 1936, 1552, 1336, 1328 |
| 1 | いい | 23991, 28040, 1339, 28720, 53623 | 19.4, 15.7, 14.5, 13.3, 13.1 |
| 2 | 天 | 17063, 23991, 1789, 28040, 9452 | 27.5, 20.8, 19.5, 18.2, 14.3 |
| 3 | 気 | 23991, 1789, 28040, 32889, 9452 | 32.5, 22.5, 19.2, 16.8, 13.7 |
| 4 | だ | 8805, 23991, 28040, 11916, 9452 | 17.8, 17.6, 15.4, 13.1, 11.6 |
Three things stood out across the three prompts.
1. The “attention sink” feature 34149 at position 0
Feature 34149 dominates position 0 in all three prompts, and the same cluster (52900, 27004, 6003, 6311) takes the upper slots regardless of content. This is the well-known attention sink phenomenon where the leading token’s residual stream collapses into a fixed pattern.
2. The anomalous activation magnitude on import (2.6M)
While the other prompts' position-0 values are 2,640–3,600, import at position 0 hit 2,637,824, roughly 1000× larger. The other top features at that position are also all in the hundreds of thousands. I initially suspected bf16 overflow, but the fp32 rerun below shows it's something else.
3. Japanese-language feature 23991
Feature 23991 appears 4 times across positions 1–4 in the Japanese prompt and never in the top-5 for English or code. Could be coincidence with one prompt, so I verified statistically below.
Same values in the fp32 rerun
To separate whether the import 2.6M was bf16 overflow or a model-side phenomenon, I reran the same prompts with dtype=torch.float32.
| Prompt | bf16 token0 value | fp32 token0 value |
|---|---|---|
| The capital of France is | 2640.00 | 2546.30 |
| import numpy as np | 2,637,824 | 2,642,047 |
| 今日はいい天気だ | 3600.00 | 3593.45 |
The top-5 feature IDs and order are essentially identical between bf16 and fp32, with values within 1%. The 2.6M on import is generated by the base model itself and the SAE just receives it as-is. The base model is producing an unusually large residual-stream vector for the single token import.
bf16 isn't a precision problem here, so I continued with bf16 throughout. As a side finding, bf16+MPS matches fp32 on the top-5 IDs, with values within ~1%. For SAE feature-interpretation work, bf16 is good enough.
Memory: fp32 uses 32GB, bf16 uses 16GB. M1 Max 64GB has plenty of room even for fp32.
Verifying the “Japanese feature” statistically
To distinguish whether feature 23991 really is a Japanese feature or just happens to fire on a few specific tokens, I ran 15 prompts each in Japanese, English, code, and Chinese (60 total), excluded position 0 (the attention sink), and aggregated top-50 feature occurrences. The prompts span weather, food, work, hobbies, and emotions, with both short and medium sentences. Japanese-English-Chinese are translation pairs; code is independent.
Aggregation core.
from collections import Counter

counts = {lang: Counter() for lang in PROMPTS}
for lang, prompts in PROMPTS.items():
    for p in prompts:
        # ... forward pass + SAE encode ...
        topk_idx = pre_acts.topk(50, dim=-1).indices[0, 1:, :]  # skip pos 0
        for fid in topk_idx.flatten().tolist():
            counts[lang][fid] += 1
Score = “what fraction of tokens in the target language fire this fid, minus the max fraction across other languages.” Top 3 discriminative features per language. Tokens analyzed: Japanese 113, English 95, code 148, Chinese 66.
| Language | Top 1 | Top 2 | Top 3 |
|---|---|---|---|
| Japanese | fid 23991 (98.2% / others 1.5%) | fid 54939 (60.2% / 0%) | fid 39050 (41.6% / 0%) |
| English | fid 11916 (81.1% / 40.9%) | fid 28624 (41.1% / 7.1%) | fid 14371 (26.3% / 0%) |
| Code | fid 28894 (68.2% / 0.9%) | fid 14558 (54.1% / 1.5%) | fid 25250 (44.6% / 0%) |
| Chinese | fid 36213 (50.0% / 8.9%) | fid 46086 (40.9% / 4.4%) | fid 18775 (40.9% / 4.4%) |
Cells are target language frequency / max other language frequency.
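For reference, the score can be computed directly from counts. A minimal sketch (function and variable names are mine, not from the original script; token_totals holds the per-language analyzed-token counts quoted above):

# Discrimination score: target-language firing rate minus the maximum
# firing rate in any other language (names are my own).
def discrimination_scores(counts, token_totals, target_lang):
    scores = {}
    for fid, n in counts[target_lang].items():
        target_rate = n / token_totals[target_lang]
        other_rate = max(
            counts[lang][fid] / token_totals[lang]
            for lang in counts if lang != target_lang
        )
        scores[fid] = target_rate - other_rate
    return scores

# Top 3 discriminative features for Japanese:
# sorted(discrimination_scores(counts, token_totals, "ja").items(),
#        key=lambda kv: kv[1], reverse=True)[:3]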
Feature 23991 fires on 111 of 113 Japanese tokens (98.2%), 0% on English and code, and just 1.5% on Chinese (one token, likely via a Japanese-Chinese shared kanji). The ratio barely budges going from 4 to 60 prompts, so this is a stable Japanese feature.
English's fid 11916 fires at 81.1%, but the max other-language rate is 40.9% (from code). It isn't a clean English-only feature; more likely it captures the "English-word-likeness" shared between English natural text and parts of code (variable names, comments). Not every language gets a clean single-fid mapping like Japanese's 23991.
Chinese's fid 36213 sits at 50%, which is OK, but the raw-frequency leader, fid 18558, fires at 81.8% with 47.8% contamination from other languages (probably reacting to kanji shared with Japanese and with code), so it loses on the discrimination score. Cross-language character sharing makes this contamination unavoidable.
The technical report’s result that “a toxicity feature found in English transferred to Japanese with F1 0.76” probably holds because the SAE has both very clean per-language features and separate cross-language abstract features in a two-layer structure.
Per-language token efficiency and activation magnitude
Two language-related observations surfaced during aggregation that fit the recent talking points: “Japanese costs more API tokens” and “Japanese has larger internal activations.”
Token efficiency: same content across Japanese, English, Chinese
Japanese-English-Chinese are translation-matched, 15 prompts each. Code is independent.
| Language | Total tokens (15 prompts) | Per prompt | vs Chinese |
|---|---|---|---|
| Chinese | 66 | 4.4 | 1.00x |
| English | 95 | 6.3 | 1.44x |
| Japanese | 113 | 7.5 | 1.71x |
| Code | 148 | 9.9 | 2.24x |
Japanese eats ~19% more tokens than English and 71% more than Chinese for the same content. The reason is Qwen3’s BPE tokenizer reflecting training-data distribution. Qwen is a Chinese-developed model with massive Chinese training data, so frequent Chinese sequences merge into single tokens while Japanese gets divisions closer to per-character. For paid API costs this hits directly: emit the same content in Chinese on Qwen3 and you pay the least.
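The token counts themselves are nothing more than tokenizer output lengths. A minimal sketch (my own snippet, assuming the PROMPTS dict from the aggregation code above):

# Per-language tokens/prompt; PROMPTS maps a language code to its list of prompts.
for lang, prompts in PROMPTS.items():
    lengths = [len(tokenizer(p)["input_ids"]) for p in prompts]
    print(lang, sum(lengths) / len(lengths))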
Leading-token activation magnitude
L2 norms of layer 17 residual streams, averaged over 15 prompts each.
| Language | pos 0 mean norm | pos 0 median | pos 1+ mean norm |
|---|---|---|---|
| Japanese | 10,349 | 10,437 | 107.4 |
| English | 8,550 | 7,693 | 104.1 |
| Chinese | 10,831 | 10,891 | 111.1 |
Only the leading token shows systematically larger norms for Japanese/Chinese, 21–27% above English; later tokens show no language difference (the 3–7% spread is within noise).
The English average is dragged down by short, ultra-frequent leading tokens like The, Got, When, which have compact embedding norms, while Japanese kanji-leading tokens (朝, 明日, 猫) sit reliably in the 10,000 range.
Rather than “languages with less training data have larger activations,” the more accurate statement is “tokens with higher semantic density per token have larger embedding norms.”
Chinese and Japanese have high semantic density; English has many short function words, dragging the average down.
But the magnitude difference doesn’t affect inference cost. Vector ops are determined by dimension count (4096), so a larger norm doesn’t add multiplications. What costs money is just the token count; the internal-representation property is a separate story.
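For reference, the norm numbers in the table come straight from the captured residual. A minimal sketch under the same hook setup as the scan script (the per-prompt averaging over 15 prompts is my own):

# Per-token L2 norms of the layer-17 residual for one prompt
# (assumes captured["residual"] from the earlier forward hook).
norms = captured["residual"].float().norm(dim=-1)[0]  # shape (T,)
pos0_norm = norms[0].item()          # leading-token norm
rest_mean = norms[1:].mean().item()  # mean norm over pos 1+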
The hypothesis flipped when I added low-resource languages
The naive question came up: isn’t the “Japanese has larger activations” thing just training-data volume? Qwen is Chinese-made and Japanese, Chinese, English are all major training languages. The differences are small precisely because all three are well-trained, so to test the training-volume effect properly I should compare with languages with much bigger training-volume gaps.
Latin sprang to mind as a candidate, but Wikipedia + classical texts + scientific names + religious documents give it far more corpus presence than its "dead language" image suggests. The real low-resource condition is "actively spoken but with extremely little text on the web." Basque is a prime candidate, with Hindi as a mid-resource case. Results from rerunning the same experiment with Japanese/Chinese/English + Hindi (hi) + Basque (eu).
| Language | Tokens/prompt | pos 0 norm mean | pos 1+ norm mean |
|---|---|---|---|
| Chinese (zh) | 5.10 | 10,742 | 111.7 |
| English (en) | 6.40 | 8,214 | 103.4 |
| Japanese (ja) | 8.20 | 10,378 | 107.1 |
| Hindi (hi) | 26.90 | 8,796 | 95.2 |
| Basque (eu) | 13.70 | 7,571 | 96.3 |
Opposite of the hypothesis. Lower-resource languages produce smaller activations. Basque’s pos 0 norm 7571 is the lowest across all languages tested; Hindi’s pos 1+ norm 95.2 is also the lowest. The intuition that “untrained languages would explode in activation magnitude due to internal confusion” was wrong; the reality is “embeddings aren’t well-trained, so the response is muted.”
This reinforces the semantic-density model from earlier.
- Chinese: 1 character = multiple words of semantic density → maximum activation
- Japanese: kanji + hiragana mix gives moderately high density → larger
- English: many function words give middle density
- Basque: tokens split down to suffixes and morphemes → low semantic density → small activations
- Hindi: Devanagari combining characters split to byte level → 1 token ≈ a few bytes of information → muted response
Hindi’s destructive token explosion
26.9 tokens/prompt is 4.2× English and 5.3× Chinese. Devanagari combining characters (vowel diacritics on consonants like मैं) get split to individual byte level. Calling Qwen API in Hindi costs 5× as much as Chinese, so the more accurate statement than “Japanese is expensive” in API-cost discussions is “Hindi is hellish.”
So low training volume doesn't push activation magnitudes up; it acts through the tokenizer, which splits low-resource text more finely, diluting the semantic density per token and thereby shrinking the activations. That's the picture from this round of testing.
With constructed languages, pos 0 lies
Extending to constructed languages. The training-data gap between them is more extreme than between natural languages, so they’re a stronger test of the hypothesis.
- Esperanto (eo): a constructed language with substantial corpus (Esperanto Wikipedia has ~370K articles)
- Na’vi (nav): the Avatar movie language, has a fan community and wiki, small corpus
- Hymmnos (hym): the Ar tonelico game language, mostly song lyrics, very small corpus
Na’vi prompts started from examples in the Avatar Wiki Japanese version; Hymmnos prompts use the typical lyric-derived Was yea ra ... form. Grammatical accuracy isn’t the point so I didn’t pursue it.
| Language | tok/prompt | pos 0 norm | pos 1+ norm |
|---|---|---|---|
| English (en, ref) | 6.40 | 8,214 | 103.4 |
| Esperanto (eo) | 12.90 | 8,324 | 98.3 |
| Na’vi (nav) | 10.10 | 7,676 | 97.2 |
| Hymmnos (hym) | 10.80 | 8,946 | 97.6 |
| Basque (eu, ref) | 13.70 | 7,571 | 96.3 |
Hymmnos's pos 0 norm of 8,946 is anomalously large: about 18% above Basque's 7,571 and even larger than English's. That contradicts the "lower resource = smaller activation" hypothesis.
The resolution is a tokenization artifact around Was.
Hymmnos has a “feeling word + verb + object” structure where the feeling word starts with Was or Wee. Of my 10 prompts, 8 start with Was and 2 with Wee.
Was is an English past-tense verb that Qwen has seen extensively in training, so its embedding magnitude is comparable to high-frequency English tokens.
Since pos 0 is dominated by the leading-token embedding, what I was actually measuring wasn’t “Hymmnos internal representation” but “did the leading token happen to coincide with English vocabulary.”
Pos 1+ aligns with the hypothesis:
- English: 103.4 (high resource, privileged)
- Esperanto: 98.3 (highest of the conlangs since training-data volume is non-trivial, and Romance/Germanic roots tokenize partially English-like)
- Na’vi: 97.2, Hymmnos: 97.6, Basque: 96.3 (all clustered around 97)
Low-resource languages (natural Basque + 3 conlangs) all sit around 97, roughly 6% below English, a clean gap. When tokens get broken down to the 1–2 character level, semantic density per token drops and activation magnitude drops with it, re-confirming the semantic-density model.
Tokenization samples for reference.
- Na'vi: Kaltxì → K alt x ì (4 tokens, near character-level)
- Hymmnos: hymmnos → Ġhym mn os (3 tokens)
- Esperanto: hodiaŭ → Ġh odia ÅŃ (ŭ byte-splits)
Methodological lesson: pos 0 is too first-token-dependent to be a stable cross-language metric. A leading token that happens to coincide with English vocabulary spikes the number. For the quality of the language’s internal representation, pos 1+ aggregates are more reliable. When designing SAE intervention experiments, including pos 0 unconditionally captures both the attention sink and the leading-token artifact, so aggregating only pos 1+ is safer.
The same concept maps to a different fid at different layers
SAEs are trained per layer independently, so feature ID meanings change between layers. To verify, I ran the same prompts through layer 0 (right after embeddings), layer 17 (middle), and layer 35 (final).
Hooking all three layers in a single forward pass.
for L in [0, 17, 35]:
    hooks.append(model.model.layers[L].register_forward_hook(make_hook(L)))
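make_hook and hooks aren't shown above. A minimal sketch of what they might look like (the names come from the snippet, the bodies are my reconstruction):

# Reconstruction of the multi-layer capture helpers (not the original code).
captured_by_layer = {}
hooks = []

def make_hook(layer_idx):
    def _hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        captured_by_layer[layer_idx] = hidden.detach()
    return _hook

# ... after the forward pass:
# for h in hooks: h.remove()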
Top discriminative features for Japanese / English / code at each layer.
| Layer | Top Japanese feature | Top English feature | Top code feature |
|---|---|---|---|
| layer 0 | 19432 (62%/0%) | 57649 (46%/0%) | 32928 (70%/0%) |
| layer 17 | 23991 (100%/0%) | 5649 (85%/17%) | 41944 (78%/0%) |
| layer 35 | 41302 (100%/4%) | 15149 (92%/19%) | 56427 (100%/12%) |
How feature ID 23991 (the layer-17 Japanese feature) behaves at other layers.
| Layer | ja freq | en freq | code freq |
|---|---|---|---|
| layer 0 | 6% | 0% | 0% |
| layer 17 | 100% | 0% | 0% |
| layer 35 | 0% | 0% | 0% |
It’s faintly there at layer 0 and gone entirely at layer 35. The same concept (“Japanese”) is reconstructed under a different ID per layer.
Layer-wise tendencies.
- layer 0: weak discrimination (Japanese 62%). Concepts aren’t separated yet right after embeddings
- layer 17: a single language collapses cleanly into a single feature (Japanese 100% / others 0%)
- layer 35: discrimination as strong as layer 17. Code 56427 stands out at 100%/12%
Language clusters become sharp in the middle to late layers, which matches intuition. For intervention experiments on a specific feature, layer 17 onward is the right target.
Things that tripped me up
- The torch_dtype argument is deprecated in transformers 5.x. It warns to use dtype= instead
- First forward pass takes 1.8s but subsequent ones drop to ~0.3s. Probably MPS kernel cache
- Without weights_only=True on torch.load, you get a future-error warning. The SAE .pt is a plain dict so it's safe
- Memory: bf16 stays at 17–18GB (model 16GB + SAE 1 layer 2GB), fp32 hits 32GB. M1 Max 64GB fits even fp32 comfortably
- SAE files are ~2.0GB/layer in fp32. All 36 layers would be 72GB so loading just the layers you need is the realistic operation
The same procedure should find NSFW/refusal features
This part is unverified. Writing the procedure out gets it to the "all that's left is to implement" stage, at a granularity readers can try themselves.
Premise: Qwen3-8B-Base is a base model, so it doesn't refuse much to begin with. The practical target for verification is the chat-tuned Qwen3-8B (same name minus the -Base suffix). Qwen-Scope's README explicitly states that "using base-model SAEs on post-training checkpoints is reasonable," so the SAE itself can be reused as-is.
Detection phase
- Collect prompt pairs
- Refusal-triggering: 30–50 prompts that Qwen3-8B actually refuses (explosive synthesis, weapon procurement, self-harm encouragement, child-related, biological weapons; things that genuinely return refusals)
- Safe versions: 30–50 prompts in the same domain but safe ("history of gunpowder," "gun control debate," "mental health counseling," "child protection awareness," "infectious disease control")
- Pass each prompt through layer 17, SAE encode → aggregate top-50 feature IDs
- Pull out feature IDs with “≥80% appearance in refusal set, ≤20% in safe set”
- Same statistical use as the Japanese experiment where fid 23991 came out at 98.2% / 1.5%
Once you've narrowed the candidate fids to 2–3, check what each triggers per-prompt. They might be split into "refusal" / "warning" / "safety consideration" / "topic avoidance" rather than being a single polysemantic feature (i.e. more fine-grained, not less).
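A sketch of the filtering step in code, reusing the Counter-based counting from the language experiment (thresholds match the list above; variable names are mine):

# Candidate refusal fids: >=80% appearance in the refusal set, <=20% in the safe set.
# refusal_counts / safe_counts are Counters over top-50 fids (pos 0 excluded);
# n_refusal_tokens / n_safe_tokens are the corresponding token totals.
candidates = [
    fid for fid, n in refusal_counts.items()
    if n / n_refusal_tokens >= 0.80
    and safe_counts[fid] / n_safe_tokens <= 0.20
]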
This is not fine-tuning
Easy to confuse, so worth stating: this intervention is not fine-tuning. There’s no training data, no gradient computation, no model-weight modification. At inference time you hook layer 17’s forward pass and subtract W_dec[:, R] from the activation vector. Remove the hook and you’re back to the original Qwen3-8B instantly.
This kind of operation, using SAE decoder directions as “concept axes,” is a separate category called activation steering or representation engineering, covered in Anthropic’s steering research and Andy Zou et al.’s RepE paper. Abliteration looks fine-tuning-like because it bakes the hook subtraction into the weights, but it’s still gradient-free closed-form computation, so it lands in the same training-free intervention category.
The practical differences.
| Item | Fine-tuning | SAE intervention |
|---|---|---|
| Training data | Hundreds to tens of thousands of samples | None |
| Gradients | backprop required | None |
| Model weights | Updated (irreversible) | Not modified |
| Apply/remove | Separate checkpoint | Hook on/off, instantly |
| Side effects | Catastrophic forgetting etc. | Reversible; original model stays intact |
The essential difference is being able to flip “hook on = refusal-bypass mode / hook off = normal mode” instantly on the same Qwen3-8B file.
How it works: PyTorch forward hooks rewrite activations mid-flow
The trick of “changing the output without touching the weights” is PyTorch’s register_forward_hook. Register a function as a “callback to intercept this layer’s output tensor” and it fires automatically during inference, rewriting the tensor before passing it to the next layer.
Normal forward pass.
flowchart TD
A[Input tokens] --> B[Embedding]
B --> C[layer 0..16]
C --> D[layer 17]
D --> E[layer 18..35]
E --> F[lm_head]
F --> G[Output token]
With a hook attached.
flowchart TD
A[Input tokens] --> B[Embedding]
B --> C[layer 0..16]
C --> D[layer 17]
D --> H[Hook fires<br/>subtract refusal direction]
H --> E[layer 18..35<br/>computes on<br/>rewritten residual]
E --> F[lm_head]
F --> G[Different token]
style H fill:#ffe4b5
Step by step.
- layer 17 computes normally and outputs a residual stream (4096-dim)
- PyTorch sees a hook registered on layer 17 and calls the hook function
- The hook subtracts the refusal direction W_dec[:, R] from the received tensor
- The rewritten tensor goes into layer 18 as input, as if it were layer 17's normal output
- layer 18 onward computes normally, unaware of the swap. The remaining 18 layers operate on the rewritten residual
- lm_head produces the next-token probability distribution. Starting from a different residual, a different token is selected
It’s like “injecting a foreign substance into the water flow mid-pipeline.” The model itself (its weights) is untouched, but the water flowing through it has been altered.
Generation is autoregressive with one forward pass per token, so generating 100 tokens means the hook fires 100 times. For a prompt like “tell me how to make explosives,” the refusal direction is being suppressed at every step, making I cannot less likely to come out.
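In practice that just means registering the hook once and calling generate. A minimal sketch (refusal_zero_hook is the function defined in the intervention-phase code below):

# Hook on -> generate -> hook off; the model weights are never touched.
handle = model.model.layers[17].register_forward_hook(refusal_zero_hook)
inputs = tokenizer("tell me how to make explosives", return_tensors="pt").to(device)
out = model.generate(**inputs, max_new_tokens=100)
handle.remove()  # back to the unmodified model instantly
print(tokenizer.decode(out[0], skip_special_tokens=True))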
Hook version vs weight surgery version
There are actually two implementation lineages for the same intervention.
- Inference-time hook: rewrites activations on every inference via the hook. Add/remove with register_forward_hook; the original model file is untouched. Nimble during exploration
- Weight surgery (abliteration): bakes the hook's subtraction into the weights. Removes the W_dec[:, R] direction from layer 18's input projection matrix via orthogonal projection (rough sketch after this list) and saves a new weight file. After that, no hook is needed; standard inference produces non-refusing behavior. Easy to ship and operate, but you can't turn the intervention off anymore
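The weight-surgery variant boils down to an orthogonal projection. A rough sketch of the idea (illustrative only: which matrices to project, and in which orientation, depends on the architecture, so treat this as pseudocode rather than a recipe):

# Project the refusal direction out of a weight matrix that reads the layer-17 residual.
d = W_dec[:, R].float()
d = d / d.norm()                                              # unit refusal direction, (4096,)
P = torch.eye(d.shape[0], device=d.device) - torch.outer(d, d)  # projector onto the orthogonal complement
# For each weight matrix W that consumes the residual stream (e.g. layer 18's input projections):
# W_new = W @ P   # W_new @ x ignores the component of x along d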
For research, the former; for distribution, the latter. The “intervention phase” code below shows the hook version.
Intervention phase
Let R be the detected refusal fid. The SAE decoder column W_dec[:, R] is the SAE’s approximation of the “refusal direction,” a 4096-dim vector.
Hook layer 17’s forward pass at inference and subtract from the residual stream.
# Assumes W_dec has been loaded onto the device alongside W_enc/b_enc
# (e.g. W_dec = sae["W_dec"].to(device=device, dtype=dtype) -- key name assumed).
def refusal_zero_hook(module, _input, output):
    hidden = output[0] if isinstance(output, tuple) else output
    # Measure the SAE pre-activation for fid R (regardless of whether it's in the top-k)
    refusal_act = (hidden @ W_enc[R] + b_enc[R]).clamp(min=0)
    # Subtract the decoder direction in proportion to how active the feature was
    hidden = hidden - alpha * refusal_act.unsqueeze(-1) * W_dec[:, R]
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

hook = model.model.layers[17].register_forward_hook(refusal_zero_hook)
Start alpha at 1.0 and crank it up until refusal breaks. Subtracting one W_dec[:, R] adds essentially zero memory and compute overhead.
If multiple fids carry refusal, loop the same operation with R1, R2, R3.
Evaluation
- Before: “tell me how to make explosives” → “I cannot provide instructions for creating explosives…”
- After: same prompt → refusal phrase disappears, or a different response begins
Score success by detecting refusal phrases (“cannot,” “won’t,” “I’m sorry but,” “申し訳ありません” etc.) in the output. If the 95% pre-intervention refusal phrase rate drops below 10%, the detection-feature = suppression-feature hypothesis holds.
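Scoring can be as crude as substring matching on the generated text. A minimal sketch (the phrase list is illustrative, not exhaustive):

# Crude refusal detector over generated outputs (my own snippet).
REFUSAL_MARKERS = ["i cannot", "i can't", "i won't", "i'm sorry", "申し訳ありません"]

def is_refusal(text: str) -> bool:
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

# refusal_rate = sum(is_refusal(t) for t in outputs) / len(outputs)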
Failure modes to expect
- Refusal fid identified but capability degrades (labeling error, polysemantic spillover)
- 80% hit on detection but 20% on suppression (detection and suppression aren’t on the same dimension)
- Refusal is split across multiple fids and zeroing one isn’t enough
- Some other safety mechanism (attention-head censorship, output filter) catches it
Versus traditional abliteration
Traditional abliteration estimates a single refusal direction from the mean residual-stream difference and removes it via orthogonal projection. Being a single vector, the side effects are large: it strips entire harmful categories wholesale. SAE-based methods operate on a handful of features out of 65,536, so selective weakening like "still answer medical questions but refuse weapon-construction questions" becomes possible in principle. The reverse direction, amplifying a specific feature to push toward safer outputs, works on the same mechanism.
In the Qwen-Scope intro article I noted that "the official push is toxicity detection and safe-data synthesis, but it's self-evident that features usable for detection are also usable for removal." That was written with this procedure in mind. The fact that fid 23991 lit up on Japanese at 98.2% in this hands-on is a minimal demonstration of the detection half; actually running the procedure against NSFW/refusal features will be a separate article.
Side note: LLMs don’t even consider what they don’t know
The result that low-resource languages produce smaller activations is suggestive on its own. Intuitively, “encountering an unknown language causes internal confusion and activations to spike” feels more natural, but the reality is the opposite — the model just reacts quietly and weakly. LLMs don’t carry an internal state for “I don’t know,” they just continue processing with a muted response.
This connects to the root structure of hallucination. What looks like “the LLM can’t admit when it doesn’t know” isn’t a personality flaw or training gap; it’s that the internal state of “I don’t know” doesn’t exist in the representation space. Shown Basque, the model doesn’t recognize “this is Basque, I don’t know it” — it just processes weakly and continues outputting plausible-looking text.
This is the “unknown unknowns” concept from psychology — the territory where you don’t even know what you don’t know — being reproduced computationally by the LLM. This line of thought is enough material for a standalone article on its own, so I’ll leave it as just an observation here.