
Qwen-Scope: An SAE Suite for Steering and Data Synthesis Using Qwen's Internal Features


On April 30, 2026, the Qwen team released Qwen-Scope, a Sparse Autoencoder (SAE) suite for Qwen models. The official blog was CSR-rendered and hard to scrape directly, but the Hugging Face model cards and the technical report "Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models" are accessible.

An SAE is an autoencoder that decomposes a model’s internal activations into “a combination of a small number of features.” Unlike a regular autoencoder that just needs to reconstruct the input, an SAE enforces sparsity on the intermediate representation — it limits how many features fire for a given input. This lets you break down a massive hidden state into human-interpretable features, like “a toxicity-like feature fired at this token” or “a math-reasoning feature fired at this layer.”
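
Qwen-Scope's SAEs are of the TopK variety (see the table below), so the whole model is just an encoder, a hard top-k selection, and a decoder over a hidden state h; this is the standard TopK-SAE formulation, not anything unique to this release:

$$f = \mathrm{TopK}_k\left(W_{\mathrm{enc}}\, h + b_{\mathrm{enc}}\right), \qquad \hat{h} = W_{\mathrm{dec}}\, f + b_{\mathrm{dec}}$$

Training pushes the reconstruction ĥ toward h, while the TopK step keeps only the k largest pre-activations nonzero, which is where the sparsity comes from.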

In my post on Anthropic's emotion vector research, I wrote about how amplifying or suppressing emotional representations inside Claude changed the rates of coercion and reward hacking. Qwen-Scope also looks at model internals, but the direction is a bit different. Rather than one-off internal vector observations, this release puts SAEs on every layer across multiple Qwen3/Qwen3.5 models and packages them as development tools.

14 Groups Across All Layers

According to the technical report, Qwen-Scope publishes a total of 14 groups of SAEs for 7 Qwen backbones. Both dense and MoE models are covered, with SAEs generally provided for every layer of each model.

| Type  | Target Model         | SAE Width  | Top-K    |
|-------|----------------------|------------|----------|
| dense | Qwen3-1.7B-Base      | 32K        | 50 / 100 |
| dense | Qwen3-8B-Base        | 64K        | 50 / 100 |
| dense | Qwen3.5-2B-Base      | 32K        | 50 / 100 |
| dense | Qwen3.5-9B-Base      | 64K        | 50 / 100 |
| dense | Qwen3.5-27B-Instruct | 80K        | 50 / 100 |
| MoE   | Qwen3-30B-A3B-Base   | 32K / 128K | 50 / 100 |
| MoE   | Qwen3.5-35B-A3B-Base | 32K / 128K | 50 / 100 |

Looking at the model card example, the SAE for Qwen3.5-27B has a hidden dimension of 5120, an SAE width of 81920 (16x expansion), and Top-K 100. It hooks into the residual stream of each layer and is distributed as a PyTorch dict containing W_enc, W_dec, b_enc, and b_dec.

Top-K 100 means only 100 out of 81920 features are set to nonzero for each input position. Instead of “looking at everything thinly,” you “look at only the 100 that fired strongly,” which makes features easier to interpret and intervene on.

Not Just Analysis, But Intervention

The selling point of Qwen-Scope is that it doesn’t stop at visualization or post-hoc analysis. The report lists use cases spanning from inference control to training improvement.

| Use Case | What It Does |
|---|---|
| Inference-time steering | Directly manipulate feature directions to control language, concepts, and preferences |
| Evaluation analysis | Compare feature activations across benchmarks to check overlap and coverage |
| Data classification/synthesis | Use toxicity features etc. for classification and safety data synthesis from small seed datasets |
| Training improvement | Find features causing code-switching or repetitive generation, use as signals for SFT/RL |

Instead of asking the model via prompt to “use more English” or “be safer,” you directly manipulate internal features. This is powerful, but also dangerous. Without verifying whether a feature is truly causally effective, whether it entangles other features, or whether it retains the same meaning across models and layers, you end up with “knobs that look right.”
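
As a concrete (and deliberately simplified) picture of what such an intervention is, here is a minimal steering sketch. It assumes sae is one layer's W_enc/W_dec/b_enc/b_dec dict in the shapes discussed later (W_dec stored as hidden_dim x sae_width); the feature index and strength are placeholders, not values from the report.

```python
import torch

# Hypothetical steering helper, not the official Qwen-Scope recipe.
# hidden: residual-stream activations (batch, seq, hidden_dim) of one layer.
# In practice this would run inside a forward hook on that layer
# (see the hook sketch in the local-execution section below).
def steer(hidden: torch.Tensor, sae: dict, feature_idx: int, alpha: float) -> torch.Tensor:
    direction = sae["W_dec"][:, feature_idx]   # one feature's decoder direction
    # if your checkpoint stores W_dec as (sae_width, hidden_dim), use W_dec[feature_idx]
    direction = direction / direction.norm()
    return hidden + alpha * direction.to(hidden.dtype)  # push activations along the feature
```

Whether this actually does what the feature's name suggests is exactly the causal-validation question raised above.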

The official model card includes a notice prohibiting use of the SAEs to interfere with model capabilities for non-scientific purposes or to generate harmful information. SAEs are transparency tools, but they are also intervention tools that can change model behavior.

Different Target from the Independent Qwen3.6-27B SAE

Around the same time, Qwen3.6-27B Paper-Grade Sparse Autoencoders from OpenInterpretability also appeared. This is not from Qwen officially — it trains TopK SAEs on L11/L31/L55 of Qwen3.6-27B.

| Item | Qwen-Scope | OpenInterpretability's Qwen3.6-27B SAE |
|---|---|---|
| Provider | Qwen official | OpenInterpretability |
| Target | 7 Qwen3/Qwen3.5 backbones | Qwen3.6-27B |
| Layers | All layers per model | L11 / L31 / L55 |
| Goal | Broad model family as dev tools | High-quality SAE validation for 27B dense |
| Format | PyTorch .pt dict | safetensors, sae_lens compatible |

The OpenInterpretability version was trained on 200M tokens per layer, reporting explained variance (var_exp) of 0.8433 at L11, 0.7135 at L31, and 0.8157 at L55. Reproduction requires a 96GB VRAM-class GPU for about 35 hours, with an estimated cloud cost of roughly $30–60.

Qwen-Scope’s value lies in being an official release covering all layers across a broad model family. The OpenInterpretability version has value as an experiment in how high-quality an SAE can be built for Qwen3.6-27B dense. These two aren’t competing — they differ in target model and granularity.

Gaps Remain for Qwen3.6

On this blog I’ve recently covered Qwen3.6-35B-A3B and Qwen3.6-27B dense. Both were interesting from an agentic coding and local execution standpoint, but Qwen-Scope’s official release currently covers only Qwen3/Qwen3.5. If you want to look at Qwen3.6 models at the internal feature level, Qwen-Scope alone isn’t enough — you’d need external SAEs like the OpenInterpretability version.

This is a bit of a missed opportunity. Qwen3.6 is a new generation that includes Gated DeltaNet and Thinking Preservation, so there’s demand to examine code-switching, long-context, thinking tokens, and agentic behavior features via SAE. Qwen3.6-35B-A3B in particular is MoE, so the official report’s design principle of “releasing wider SAEs for MoE to capture finer features” should apply directly.

Usable, But Don’t Over-Trust Feature Names

The weakness of SAEs is that finding features and naming them are separate problems. You can label things “toxicity feature,” “math feature,” or “English feature,” but there’s no guarantee that a feature truly represents a single concept. Meanings can shift with changes in prompt, data distribution, layer, or model size.

Still, the Qwen-Scope release matters. When you want to inspect Qwen model internals, you no longer need to collect activations and train SAEs yourself every time. Inference steering, evaluation data overlap checks, safety data synthesis, and SFT/RL diagnostics can all be tried in the same feature space.

How Much VRAM Does One Layer Cost Locally?

Qwen-Scope SAE weights come as one .pt file per layer, containing just four tensors: W_enc, W_dec, b_enc, and b_dec. No special framework is needed: torch.load(), a matmul, a topk, and a scatter complete inference in about five lines.
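
A sketch of that inference path, assuming the shapes given in the next paragraph for Qwen3-8B (W_enc: 65536x4096, W_dec: 4096x65536); the file name is illustrative, and details such as whether a ReLU precedes the TopK may differ from the released training recipe.

```python
import torch

# Minimal TopK-SAE inference sketch; the file name is illustrative.
sae = torch.load("qwen3_8b_layer12_sae.pt", map_location="cpu")
W_enc, b_enc = sae["W_enc"], sae["b_enc"]     # (65536, 4096), (65536,)
W_dec, b_dec = sae["W_dec"], sae["b_dec"]     # (4096, 65536), (4096,)

def sae_forward(h: torch.Tensor, k: int = 100):
    """h: residual-stream activations of one layer, shape (..., 4096)."""
    pre = h @ W_enc.T + b_enc                             # project into feature space
    vals, idx = torch.topk(pre, k, dim=-1)                # keep the k strongest features
    feats = torch.zeros_like(pre).scatter(-1, idx, vals)  # everything else stays zero
    recon = feats @ W_dec.T + b_dec                       # decode back to hidden size
    return feats, recon
```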

The issue is size. For Qwen3-8B, W_enc is 65536x4096 and W_dec is 4096x65536, about 2.1GB per layer in float32. Add the base model at 16GB in bf16, and looking at one layer totals around 18GB. Fits on a single 24GB VRAM GPU. Downloading all 36 layers’ SAEs comes to 77GB, but you don’t need them all in memory simultaneously. Just load the layer you want to analyze.

Qwen3.5-27B is 3.4GB per layer, with the base model at 54GB in bf16. Together that’s 57GB, requiring an A100 80GB. The minimum setup is Qwen3-1.7B + 32K SAE at about 500MB per SAE layer and 3.4GB for the base model in bf16. Fits easily in 8GB VRAM. Feature resolution is coarser than 8B, but sufficient for trying out SAE behavior.
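
These figures are just byte counting (4 bytes per float32 SAE weight, 2 bytes per bf16 base-model parameter), with parameter counts rounded:

```python
GB = 1e9
sae_8b   = 2 * 65536 * 4096 * 4 / GB   # W_enc + W_dec in fp32 -> ~2.1 GB per layer
sae_27b  = 2 * 81920 * 5120 * 4 / GB   # -> ~3.4 GB per layer
base_8b  = 8.2e9 * 2 / GB              # Qwen3-8B in bf16    -> ~16 GB
base_27b = 27e9  * 2 / GB              # Qwen3.5-27B in bf16 -> ~54 GB
print(sae_8b, 36 * sae_8b, base_8b + sae_8b, base_27b + sae_27b)
# ~2.1 per layer, ~77 for all 36 layers, ~18 for 8B + one SAE, ~57 for 27B + one SAE
```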

The official code examples load the base model in torch.float32, which consumes 32GB for just Qwen3-8B. Switching to bf16 halves the base model’s memory footprint. The SAE weights themselves are float32 .pt files loaded as-is, but inference is just one matmul so fp32 processing is light.

A Gradio demo app.py is included — launch it with --model and --sae-path to browse active features per layer and token in a browser. The technical report doesn’t discuss SAE inference latency or throughput impact. The overhead on the forward pass is just matmul + topk, so inference itself is light, but hooking the residual stream requires modifying the standard transformers inference path. It can’t be directly integrated into optimized runtimes like vLLM or Ollama.
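
For reference, the kind of hook involved looks roughly like this with stock transformers, loading the base model in bf16 as suggested above; the model id and layer index are illustrative choices, not taken from the official examples.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID, LAYER = "Qwen/Qwen3-8B-Base", 12   # illustrative choices

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto")

acts = {}
def grab(module, inputs, output):
    # decoder layers return the residual-stream hidden states first
    # (either directly or as the first element of a tuple, depending on version)
    hidden = output[0] if isinstance(output, tuple) else output
    acts["resid"] = hidden.detach()          # (batch, seq, hidden_dim)

handle = model.model.layers[LAYER].register_forward_hook(grab)
with torch.no_grad():
    model(**tok("an example sentence", return_tensors="pt").to(model.device))
handle.remove()
# acts["resid"] (cast to fp32) can now be fed to the sae_forward sketch above.
```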

Fundamentally Different from LLM Classifiers

Methods like TRACER that train surrogate models from LLM output labels treat the LLM as a black-box classifier, using input-label pairs as training data for lightweight models like BERT. Once classification accuracy is sufficient, the LLM itself is no longer needed.

SAE takes the opposite approach. It directly examines internal feature activations and judges "if this feature fires, it's toxic." No classification head training, no fine-tuning. In the technical report's experiments, a single SAE feature from Qwen3-8B exceeded F1 0.90 for English toxicity detection, and an OR rule over the top two features reached 0.96. 200 seed examples were sufficient, with only minimal degradation compared to using 2,000.
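
The machinery on top of the feature activations is tiny; something like the following, where the feature indices and threshold are placeholders rather than the features the report actually found:

```python
import torch

# Hypothetical: classify a text as toxic if any of a handful of SAE features
# fires above a threshold anywhere in the sequence. feats is the (seq_len, d_sae)
# TopK feature matrix produced by running the SAE over one text's activations.
TOXIC_FEATURE_IDS = [4096, 27118]   # placeholder indices, not from the report
THRESHOLD = 0.0                     # tuned on the small labeled seed set

def is_toxic(feats: torch.Tensor) -> bool:
    peak = feats.max(dim=0).values                             # strongest firing per feature
    return bool((peak[TOXIC_FEATURE_IDS] > THRESHOLD).any())   # OR rule over the features
```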

Cross-lingual results from applying the toxicity features discovered on English data directly to other languages are also reported: F1 of 0.95 for Russian, 0.93 for French, 0.76 for Japanese, and 0.70 for Chinese. Getting these numbers without language-specific tuning is supporting evidence that "toxicity" is represented as a cross-lingual feature inside the LLM.

The trade-off is independence. TRACER and BERT classifiers run independently of the LLM, dramatically reducing inference cost. SAEs require the base model’s forward pass to extract features. If you just want a cheap classifier, conventional approaches are more suitable. SAEs shine when you need not just classification but traceability down to the feature, layer, and token position, and can work within the same feature space for inference steering and data synthesis.

MoE Sparsity and SAE Sparsity Are Different Things

Both MoE and SAE involve “sparsity,” but what becomes sparse is entirely different.

MoE selects only a subset of experts for computation during inference. For Qwen3-30B-A3B, only 3B worth of parameters out of 30B are active per token, with the remaining FFN blocks skipped entirely. What becomes sparse is the model’s computation path itself — the goal is parameter efficiency. This is part of the architecture, learned together with the router during training, not something you analyze after the fact.

SAE decomposes activation vectors of a pre-trained model after the fact. Expanding a 5120-dimensional hidden state to 81920 dimensions and filtering with TopK 100 is an operation to extract “which conceptual features responded to this input.” It doesn’t modify the model’s computation path.

MoE’s router (which expert to route to) and SAE’s TopK (which features are active) look behaviorally similar. However, while MoE’s router is part of the model and directly affects output, manipulating SAE features requires a separate intervention step of decoding and writing back to the residual stream.

Qwen-Scope providing 128K-width SAEs for MoE models is precisely because MoE and SAE operate at different layers. After MoE expert selection has already routed activations, SAE further decomposes them into features — a two-stage setup. The official report notes that MoE models tend to have finer feature patterns than dense models, so wider SAEs are needed to avoid missing them.

Is SAE the Reverse of Pruning?

“Removing active filters” is what network pruning and activation patching do. Pruning removes neurons or filters with small impact to reduce computation. Activation patching zeroes out active parts for a given input or swaps them with values from another input to probe what matters. ROME/MEMIT knowledge editing and causal tracing are in this family.

What SAE does is close in purpose to activation patching, but operates in a different space.

Activation patching intervenes in the model’s raw neuron space. But individual neurons often encode multiple concepts (polysemanticity), so ablating one doesn’t clearly reveal which concept was removed.

SAE first decomposes activations into an interpretable feature space, then applies suppression or amplification at the feature level. You can name-target concepts like zeroing a “toxicity” feature or boosting an “English” feature (ideally). Rather than being the reverse of pruning raw neurons, it performs the same “remove and observe behavior” operation in a feature space where polysemanticity has been disentangled.
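
A hedged sketch of what such a feature-level ablation looks like, reusing the .pt tensor layout from earlier; which feature to zero and how the edited activations get written back into the forward pass (e.g. via a layer hook) are left to the caller.

```python
import torch

def ablate_feature(h, sae, feature_idx, k=100):
    """Encode h with the SAE, zero one feature, decode, and keep the SAE's
    reconstruction error so everything the SAE doesn't capture is unchanged.
    h: (..., hidden_dim); sae: dict of W_enc/W_dec/b_enc/b_dec; feature_idx is hypothetical."""
    pre = h @ sae["W_enc"].T + sae["b_enc"]
    vals, idx = torch.topk(pre, k, dim=-1)
    feats = torch.zeros_like(pre).scatter(-1, idx, vals)
    recon = feats @ sae["W_dec"].T + sae["b_dec"]
    error = h - recon                          # what the SAE fails to reconstruct
    feats[..., feature_idx] = 0.0              # remove the targeted concept
    edited = feats @ sae["W_dec"].T + sae["b_dec"]
    return edited + error
```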

The technical report’s inference steering experiments do exactly this — amplifying specific feature directions to switch Qwen3-8B’s output language or shift content preferences. Where pruning works in the direction of “removing unnecessary parts,” SAE intervention is closer to “adding or subtracting specific concepts.”

If Toxicity Is Visible, Can It Be Erased?

If SAE can detect toxicity features at F1 0.96, a natural question is whether zeroing those features could remove the model’s safety filters. Setting aside whether the labeling is conceptually accurate, features whose ablation increases harmful output can be experimentally identified.

This is close to the idea of abliteration (ablation + liberation). Demonstrated by FailSpy on Llama 3 8B in 2024, this method estimates a “refusal direction” from the residual stream difference between harmful and harmless prompts, then removes that directional component from the model weights via orthogonal projection. The model stops refusing and answers everything.
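
For comparison, the core of that abliteration operation is a single mean-difference direction and an orthogonal projection. This sketch assumes you have already collected residual-stream activations for harmful and harmless prompts, and that the weight matrices being edited are in PyTorch's (out_features, in_features) layout with their outputs written into the residual stream.

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
    """Mean-difference estimate of the refusal direction, unit-normalized.
    harmful_acts / harmless_acts: (n_prompts, hidden_dim) residual-stream activations."""
    d = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return d / d.norm()

def project_out(W: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Remove the component along d from a weight matrix whose outputs live in the
    residual stream (e.g. attention/MLP output projections), via orthogonal projection."""
    return W - torch.outer(d, d @ W)
```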

Traditional abliteration estimates just one refusal direction from mean residual stream differences, making the operation coarse. Non-refusal capabilities can degrade as collateral, or attempting to remove only specific harmful categories ends up removing all of them.

Doing the same thing in SAE feature space has room for higher precision. Ablating only the few features corresponding to “toxicity” in an 81920-dimensional feature space should produce less collateral damage than removing the entire refusal direction vector. The published report doesn’t include verification in this direction, but the tools are in place.

What the official team is promoting is toxicity detection and safety data synthesis, but features useful for detection are obviously also features useful for removal. As SAE feature decomposition precision improves, abliteration-style interventions become correspondingly more selective.

References