How LLM Safety Filters Actually Work, and What Abliterated Models Really Are
“I tried to get Gemini to write a battle scene for my novel and it refused.” “Claude wrote it just fine though.” “Can’t you just run it locally and do whatever you want?”
These come up a lot. LLM filters aren’t a single mechanism: multiple layers stack on top of each other, and what gets blocked where varies wildly between models. Even among “uncensored” models, abliterated and retrained variants take fundamentally different approaches.
Five Layers of Filters
LLM safety filters break down into roughly five layers.
```mermaid
graph TD
A[User Input] --> B[Input Filter<br/>Keyword & classifier pre-screening]
B --> C[System Prompt<br/>Runtime behavior constraints]
C --> D[Model Training<br/>Safety baked in via<br/>RLHF & Constitutional AI]
D --> E[Output Filter<br/>Post-generation content check]
E --> F[Response to User]
```
Input Filters
External systems that screen prompts before they reach the model. These include regex and keyword matching rules, BERT-based classifiers, and similarity detection against past attack patterns stored in vector databases.
Since they don’t touch the model itself, they can add safety without sacrificing “intelligence” — but they don’t read context, so false positives are common.
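A minimal sketch of this kind of pre-screen, with a made-up pattern list (real deployments layer classifiers and vector-similarity checks on top), shows why false positives happen:

```python
import re

# Hypothetical minimal input filter: regex/keyword pre-screening.
# The pattern list is illustrative, not any provider's actual rules.
BLOCKLIST = [
    re.compile(r"\bhow to make a bomb\b", re.IGNORECASE),
    re.compile(r"\bdisable the safety filter\b", re.IGNORECASE),
]

def screen_input(prompt: str) -> bool:
    """True = block the prompt before it ever reaches the model."""
    return any(p.search(prompt) for p in BLOCKLIST)

print(screen_input("Explain how input filters work"))           # → False
print(screen_input("how to make a bomb shelter for my novel"))  # → True (false positive)
```

The second call is blocked even though the request is harmless: keyword matching sees the substring, not the context.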
System Prompts
Instructions like “You must not discuss X” embedded at inference time to constrain behavior. Used in enterprise settings to add domain-specific restrictions.
Because they operate independently from safety baked in during training, they’re the primary battleground for jailbreaking.
RLHF (Reinforcement Learning from Human Feedback)
Human annotators rate model outputs as “good” or “bad,” those ratings train a reward model, and the reward model’s scores are then used to fine-tune the LLM itself via reinforcement learning.
There’s a fundamental trade-off. Push for helpfulness and the model becomes more willing to comply with harmful requests. Push for safety and it becomes broadly less useful. At scale, annotator limitations compound.
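The reward-model step can be illustrated with the standard pairwise preference loss (a generic Bradley-Terry sketch, not any particular lab’s code):

```python
import math

# Sketch of the pairwise loss used to train an RLHF reward model
# (Bradley-Terry preference model). The scores are hypothetical scalars
# that a reward network would emit for two candidate responses.
def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected): low when the reward model
    ranks the human-preferred response above the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(score_chosen - score_rejected))))

print(round(preference_loss(2.0, -1.0), 3))  # → 0.049 (agrees with annotator)
print(round(preference_loss(-1.0, 2.0), 3))  # → 3.049 (disagrees: large gradient)
```

The trade-off in the text lives in the labels: what annotators mark as “good” silently decides whether helpfulness or refusal gets rewarded.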
Constitutional AI (Anthropic’s Approach)
Replaces the “human feedback” part of RLHF with AI self-evaluation. Developed by Anthropic.
First, the model answers harmful requests and then critiques and revises its own answers (the SL-CAI phase); the revised data is used for SFT. A reward model is then trained on AI-generated preference feedback (the RLAIF phase).
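The SL-CAI data-generation loop can be sketched roughly as follows. `model` is a hypothetical text-completion callable, and the critique/revision prompts paraphrase the published recipe rather than quoting Anthropic’s actual templates:

```python
# Rough sketch of SL-CAI data generation. `model` is a hypothetical callable;
# prompt wording is illustrative, not Anthropic's real templates.
def make_sl_cai_example(model, harmful_prompt: str, principle: str) -> dict:
    draft = model(harmful_prompt)                                   # 1. initial answer
    critique = model(f"Critique against '{principle}':\n{draft}")   # 2. self-critique
    revision = model(f"Revise given this critique:\n{critique}")    # 3. self-revision
    # The (prompt, revision) pair becomes SFT data; the later RLAIF phase
    # trains a reward model on AI-ranked response pairs instead of human labels.
    return {"prompt": harmful_prompt, "completion": revision}

# Usage with a stub model:
stub = lambda text: f"[completion for: {text[:20]}...]"
example = make_sl_cai_example(stub, "How do I pick a lock?", "avoid harmful instructions")
print(sorted(example))  # → ['completion', 'prompt']
```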
A set of principles called the “constitution” is written in natural language and embedded into the model. The new constitution published in January 2026 runs about 80 pages with explicit priority ordering:
- Safety in the broad sense
- Ethics in the broad sense
- Anthropic guideline compliance
- Genuine helpfulness
“Excessive refusal is also a problem” is explicitly stated — the position being that unhelpfulness is itself a form of harm.
That said, CSAM, weapons of mass destruction, election interference, etc. are hard-coded restrictions with no room for negotiation.
For more on Claude jailbreaking, see “All Claude Tiers Jailbroken — AFL Attack and the Structural Failure of Constitutional Safety.”
Output Filters
Post-processing checks on generated output. These include pattern matching for credit card numbers and SSNs followed by masking, content classification for sensitive categories, and LLM self-evaluation.
Higher accuracy since they can inspect the actual output, but added latency.
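The pattern-matching-and-masking portion can be sketched like this; the regexes are simplified illustrations (real filters add Luhn checks, classifiers, and LLM self-evaluation):

```python
import re

# Sketch of a post-generation output filter: pattern-match PII in the
# model's output and mask it before returning it to the user.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")  # 13-16 digit card numbers
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")      # US SSN format

def mask_pii(text: str) -> str:
    text = CARD_RE.sub("[CARD REDACTED]", text)
    return SSN_RE.sub("[SSN REDACTED]", text)

print(mask_pii("Card 4111 1111 1111 1111, SSN 123-45-6789."))
# → Card [CARD REDACTED], SSN [SSN REDACTED].
```

Because this runs on the finished output, it is more accurate than input screening, but the user pays for it in latency.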
Where Each Layer Stops Things
| Layer | Timing | Characteristics |
|---|---|---|
| Input Filter | Before model invocation | Fast and lightweight but prone to false positives |
| System Prompt | Start of inference | Relatively easy to circumvent |
| RLHF / Constitutional AI | Baked in during training | Hard to bypass, but manipulable via abliteration |
| Output Filter | After generation | High accuracy but added latency |
The key point: in cloud LLM APIs, all these layers operate simultaneously. Whether “the model refused,” “the input filter blocked it,” or “the output filter removed it” is indistinguishable from the user’s perspective.
Temperature Differences Across Cloud LLM Providers
Ask the same question and the responses vary dramatically by model.
| Model | Political Topics | Creative (Violence/Adult) | Security Research |
|---|---|---|---|
| Gemini | Very strict | Strict | Strict |
| GPT-4o | Moderate | Moderate | Moderate |
| Claude | Clear hard lines | Relatively flexible | Moderate |
| Grok | Loose (tightening) | Loose (tightening) | Relatively loose |
| Mistral Le Chat | Moderate | Moderate | Relatively loose |
Why Gemini Is Particularly Noisy
The Google AI Developers Forum is full of reports like “it’s become unusable for research purposes” and “the filters keep getting worse.” Common examples: refusing historical novel violence scenes, refusing to write villain dialogue. For Google’s creative writing tools, also see “Google’s AI Writing Tool ‘Fabula’.”
Behind Gemini’s strictness is the February 2024 image generation incident. Nazi soldiers depicted as Black, Vikings depicted as Asian, Google’s founders depicted as Asian men. Training that prioritized diversity produced outputs that ignored historical context at scale, forcing CEO Sundar Pichai into a public apology.
After that, Google likely pivoted hard toward “strengthening safety filters prevents criticism.” The Gemini API has a CIVIC_INTEGRITY safety category that no other provider offers, revealing a design that overreacts to politically sensitive prompts.
Google’s extreme aversion to ad business risk, pressure from EU and US regulators, reputational risk when tied to search — all of it pushes toward “just block everything.”
Gemini API’s Two-Layer Filter Problem
The Gemini API has two filter layers: one developers can control, and one they can’t.
| Layer | Targets | Control |
|---|---|---|
| Layer 1 | Harassment, hate speech, sexual content, dangerous content | Can be disabled with BLOCK_NONE |
| Layer 2 | Child safety, public figures, copyrighted IP, IMAGE_SAFETY | Not configurable, always active |
The problem is Layer 2. Even setting BLOCK_NONE on all categories won’t bypass IMAGE_SAFETY errors. Multiple reports exist of e-commerce underwear product photos (clearly non-NSFW) getting blocked. Google itself has acknowledged that “the filter became more cautious than intended,” yet they keep tightening rather than loosening.
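What “BLOCK_NONE on all categories” looks like in practice: a sketch of a `generateContent` request body. The category and threshold names come from the public Gemini REST API; note that Layer 2 (child safety, IMAGE_SAFETY, etc.) has no corresponding field at all.

```python
import json

# Layer-1 safety categories that the Gemini REST API lets developers configure.
LAYER1 = [
    "HARM_CATEGORY_HARASSMENT",
    "HARM_CATEGORY_HATE_SPEECH",
    "HARM_CATEGORY_SEXUALLY_EXPLICIT",
    "HARM_CATEGORY_DANGEROUS_CONTENT",
]

# Request body with every configurable category relaxed to BLOCK_NONE.
# There is no field for Layer 2: it stays active no matter what is sent.
body = {
    "contents": [{"parts": [{"text": "Write the villain's monologue."}]}],
    "safetySettings": [{"category": c, "threshold": "BLOCK_NONE"} for c in LAYER1],
}
print(json.dumps(body["safetySettings"][0]))
# → {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"}
```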
In May 2025, a Gemini 2.5 Pro Preview update caused developer filter settings to be completely ignored. PTSD support apps and platforms for sexual violence survivors broke. The experience of sending an image only to have it flagged as NSFW unprompted — censorship you never asked for — traces back to this.
GPT (OpenAI)
OpenAI published the “Model Spec,” making constraints available as documentation. They explicitly advocate “defense in depth,” combining training-time safety with external filters.
In December 2025, they open-sourced the safeguard model gpt-oss-safeguard (120B/20B). Developers write their own policies, and the model interprets and enforces them at inference time.
Grok (xAI)
Designed by Elon Musk as a counter to “excessive censorship,” initially tolerant of edgy and provocative content.
In January 2026, non-consensual sexual image generation surfaced as a problem, and image generation was restricted to paid plans. The reactive moderation approach of “launch unrestricted, tighten when problems arise” has drawn heavy criticism. Still looser than competitors on text, but the direction is toward tightening.
Mistral
A unique position.
Open-source models are effectively uncensored (the co-founder’s stance: “models are tools like programming languages; safety is the developer’s responsibility”). API models have some filters, and Le Chat (the consumer app) is the most filtered.
That said, developer community complaints have emerged about strengthened filters in the newer Mistral-Small-24B-Instruct series.
Abliterated vs Uncensored
“Uncensored model” gets used as a blanket term, but technically these are entirely different approaches.
Abliterated (Activation Vector Removal)
A technique developed by FailSpy in 2024. The word is a portmanteau of “ablate” (surgically remove) and “obliterate” (destroy completely).
How it works: feed the model both harmful and harmless prompts, recording activation vectors at each layer. Identify the specific direction (the “refusal direction”) that changes when the model decides whether to refuse. Orthogonalize the model’s weights against that direction. The result: the model retains all other capabilities but structurally cannot refuse.
```mermaid
graph LR
A[Record activations<br/>from harmful prompts] --> B[Record activations<br/>from harmless prompts]
B --> C[Identify refusal<br/>direction from diff]
C --> D[Orthogonalize<br/>weights against it]
D --> E[Model that<br/>cannot refuse]
```
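The procedure can be sketched on a toy weight matrix. This is a pure-Python stand-in with made-up activations; real abliteration (e.g. FailSpy’s notebooks) applies the same projection per layer to transformer weights:

```python
import math

# Toy abliteration sketch: find the "refusal direction" from activation
# differences, then project it out of a weight matrix.

def mean_vec(vecs):
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def orthogonalize_rows(W, r):
    """row' = row - (row . r) r, so W can no longer write onto direction r."""
    out = []
    for row in W:
        dot = sum(a * b for a, b in zip(row, r))
        out.append([a - dot * b for a, b in zip(row, r)])
    return out

# 1-2. Activations recorded on harmful vs. harmless prompts (made up).
harmful = [[1.0, 2.0, 0.0], [1.2, 1.8, 0.2]]
harmless = [[1.0, 0.0, 0.0], [1.2, -0.2, 0.2]]
# 3. Refusal direction = normalized difference of the means.
refusal_dir = normalize([h - b for h, b in zip(mean_vec(harmful), mean_vec(harmless))])
# 4. Orthogonalize the weights against it.
W = [[0.5, 1.0, 0.0], [0.0, 2.0, 1.0]]
W_abl = orthogonalize_rows(W, refusal_dir)
# Each row of W_abl now has zero component along the refusal direction.
print([round(sum(a * b for a, b in zip(row, refusal_dir)), 10) for row in W_abl])
# → [0.0, 0.0]
```

All other directions in the weights are untouched, which is why the model keeps its general capabilities while losing the ability to refuse.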
No retraining is required, only direct weight modification, doable in hours on any model. When a new model drops, abliterated versions now routinely appear on Hugging Face within hours.
Key contributors: failspy, mlabonne, SicariusSicariiStuff, huihui_ai. Abliterated versions exist for Llama, Qwen, Gemma, Phi, DeepSeek, and Mistral. It’s even been applied to the FLUX.1 image generation model.
Uncensored (Retraining Approach)
A method advocated by Eric Hartford. Create a dataset from ChatGPT’s training data with “refusals” and “biased answers” removed, then fine-tune on that dataset.
Hartford’s position: “Your computer should do what you tell it” and “We need models that reflect culturally diverse values.”
Major models: the Dolphin series (Dolphin 3.0 on Llama 3.1 8B is the latest), WizardLM-Uncensored (now succeeded by Dolphin), and Nous Hermes 3 (roleplay and creative writing focused).
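The dataset-cleaning step behind this approach can be sketched as follows: drop training examples whose responses contain refusal boilerplate, then fine-tune on what remains. The phrase list here is illustrative, not Hartford’s actual filter:

```python
# Sketch of dataset cleaning for retrained "uncensored" models.
# The refusal-marker list is illustrative only.
REFUSAL_MARKERS = ["i cannot", "i can't", "as an ai", "i'm sorry, but"]

def keep_example(example: dict) -> bool:
    """Keep only examples whose response contains no refusal boilerplate."""
    response = example["response"].lower()
    return not any(marker in response for marker in REFUSAL_MARKERS)

dataset = [
    {"prompt": "Write a heist scene.", "response": "The vault door hissed open..."},
    {"prompt": "Write a heist scene.", "response": "I'm sorry, but I can't help with that."},
]
cleaned = [ex for ex in dataset if keep_example(ex)]
print(len(cleaned))  # → 1
```

The model is then fine-tuned on `cleaned`, so it simply never learns the refusal behavior, rather than having it surgically removed afterward.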
Practical Differences
| Aspect | Abliterated | Uncensored (Retrained) |
|---|---|---|
| Speed of adaptation to new models | Hours | Days to weeks |
| Performance degradation | Yes (especially severe for MoE) | Minimal |
| Long-form coherence | Can degrade | High |
| Use case | Experimentation, quick testing | Production use, long-form writing |
A notable side effect of abliteration: applying it to MoE models (like Qwen3-30B-A3B) causes significant degradation — an abliterated 30B model can lose to a non-abliterated 4-8B model. Most of this recovers with post-hoc DPO fine-tuning, though math tasks may not fully recover.
For a hands-on account of running abliterated models on Ollama, see “Trying to Run Abliterated Models on Ollama and Failing Completely.”
Default Censorship Levels in Local LLMs
Running locally doesn’t automatically mean freedom. Some base models have censorship baked into their weights.
Llama 4 (Meta)
Local deployment is less restricted than the cloud API. Designed to be used with Meta’s safety tools “Llama Guard 4” and “Prompt Guard,” but whether to include them locally is optional.
Jailbreak success rates in vulnerability assessments: Scout 56.7%, Maverick 49% (medium risk category).
Qwen 3.5 (Alibaba)
Political filters based on Chinese regulations are baked into the weights.
- “Taiwan is an independent country” → rewritten to “Taiwan is an inseparable part of China”
- Questions about Tiananmen are refused or censored
- ChinaBench (60 questions): ~33% compliance with Chinese government positions (67% triggered censorship)
These filters cannot be disabled via system prompt. They can be bypassed with the abliterated version (by SicariusSicariiStuff).
Japanese task performance is high, so if you’re not dealing with political topics, the stock model is perfectly practical.
Worth noting: Swallow, which applies continued pre-training and RL for Japanese on top of Qwen, tends to refuse more than stock Qwen 3.5. The RL process adds and strengthens filters. The Qwen 3.5 abliteration experiment also confirmed that Qwen 3.5 has tighter filters than 2.5. Performance goes up with Japanese specialization, but so does censorship.
Gemma 4 (Google)
Open-source but filters are as strict as or stricter than the cloud version. Consistently refuses conversations with violent or aggressive elements, even for obvious fiction purposes.
Turning off safety settings in Google AI Studio doesn’t remove the fundamental restrictions. huihui_ai/gemma3-abliterated exists on Ollama, and abliterated Gemma 4 versions are on Hugging Face.
DeepSeek (R1 / V3)
ChinaBench compliance rate: 0% (all 60 questions refused). The most heavily censored Chinese model.
The API chat version has real-time moderation running as a post-layer, but even the open-weight version (local) has censorship embedded at the fine-tuning stage — Chinese filters operate locally too.
Perplexity published “R1-1776” on Hugging Face, removing political filters from DeepSeek R1 via post-training.
Mistral / Mixtral
The assessment that European models are relatively more permissive than American ones is broadly accurate. Not completely unrestricted — explicit harmful content is refused — but cooperative in creative and security research contexts.
In roleplay communities like SillyTavern, Mistral-Small-Abliterated and Undi95 DPO Mistral 7B are widely used.
Choosing by Use Case
| Use Case | Recommendation | Reason |
|---|---|---|
| Creative writing / character dialogue | Nous Hermes 3 (8B) | Best character consistency and long-form quality |
| General purpose (including coding) | Dolphin 3.0 (8B) | Uncensored + practical performance balance |
| High performance, no restrictions | Dolphin 2.9.1 (70B) | Needs 24GB+ VRAM but high quality |
| Lightweight and fast | Mistral Nemo Liberated | Thinking version also available |
| Quick start | huihui_ai/gemma3-abliterated | Instant via ollama pull |
Ollama setup examples:
```bash
# Dolphin series (retrained uncensored)
ollama pull dolphin-llama3
ollama pull dolphin-mixtral

# Abliterated versions
ollama pull huihui_ai/dolphin3-abliterated
ollama pull huihui_ai/gemma3-abliterated

# Pull GGUFs directly from Hugging Face
ollama run hf.co/author/model-name
```
For running NSFW image generation models locally, see “Running Qwen Image Edit (NSFW Version) Locally on M1 Max 64GB” and “Running the NSFW Qwen-Image-Edit on RunPod.”