Tech 11 min read

How LLM Safety Filters Actually Work, and What Abliterated Models Really Are

IkesanContents

“I tried to get Gemini to write a battle scene for my novel and it refused.” “Claude wrote it just fine though.” “Can’t you just run it locally and do whatever you want?”

These come up a lot. LLM filters aren’t a single mechanism — multiple layers stack on top of each other, and what gets blocked where varies wildly between models. Even “uncensored” models like abliterated and uncensored variants are doing fundamentally different things.

Five Layers of Filters

LLM safety filters break down into roughly five layers.

graph TD
    A[User Input] --> B[Input Filter<br/>Keyword & classifier pre-screening]
    B --> C[System Prompt<br/>Runtime behavior constraints]
    C --> D[Model Training<br/>Safety baked in via<br/>RLHF & Constitutional AI]
    D --> E[Output Filter<br/>Post-generation content check]
    E --> F[Response to User]

Input Filters

External systems that screen prompts before they reach the model. These include regex and keyword matching rules, BERT-based classifiers, and similarity detection against past attack patterns stored in vector databases.
Since they don’t touch the model itself, they can add safety without sacrificing “intelligence” — but they don’t read context, so false positives are common.

System Prompts

Instructions like You must not discuss X embedded at inference time to constrain behavior. Used in enterprise settings to add domain-specific restrictions.
Because they operate independently from safety baked in during training, they’re the primary battleground for jailbreaking.

RLHF (Reinforcement Learning from Human Feedback)

Human annotators rate model outputs as “good” or “bad,” those ratings train a reward model, and the reward model fine-tunes the LLM itself.

There’s a fundamental trade-off. Push for helpfulness and the model becomes more willing to comply with harmful requests. Push for safety and it becomes broadly less useful. At scale, annotator limitations compound.

Constitutional AI (Anthropic’s Approach)

Replaces the “human feedback” part of RLHF with AI self-evaluation. Developed by Anthropic.

First, let the model answer harmful requests, then have it self-critique (SL-CAI phase). Use that data for SFT. Then train a reward model on AI-generated feedback (RLAIF phase).

A set of principles called the “constitution” is written in natural language and embedded into the model. The new constitution published in January 2026 runs about 80 pages with explicit priority ordering:

  1. Safety in the broad sense
  2. Ethics in the broad sense
  3. Anthropic guideline compliance
  4. Genuine helpfulness

“Excessive refusal is also a problem” is explicitly stated — the position being that unhelpfulness is itself a form of harm.
That said, CSAM, weapons of mass destruction, election interference, etc. are hard-coded restrictions with no room for negotiation.
For more on Claude jailbreaking, see “All Claude Tiers Jailbroken — AFL Attack and the Structural Failure of Constitutional Safety.”

Output Filters

Post-processing checks on generated output. These include pattern matching for credit card numbers and SSNs followed by masking, content classification for sensitive categories, and LLM self-evaluation.
Higher accuracy since they can inspect the actual output, but added latency.

Where Each Layer Stops Things

LayerTimingCharacteristics
Input FilterBefore model invocationFast and lightweight but prone to false positives
System PromptStart of inferenceRelatively easy to circumvent
RLHF / Constitutional AIBaked in during trainingHard to bypass, but manipulable via abliteration
Output FilterAfter generationHigh accuracy but added latency

The key point: in cloud LLM APIs, all these layers operate simultaneously. Whether “the model refused,” “the input filter blocked it,” or “the output filter removed it” is indistinguishable from the user’s perspective.

Temperature Differences Across Cloud LLM Providers

Ask the same question and the responses vary dramatically by model.

ModelPolitical TopicsCreative (Violence/Adult)Security Research
GeminiVery strictStrictStrict
GPT-4oModerateModerateModerate
ClaudeClear hard linesRelatively flexibleModerate
GrokLoose (tightening)Loose (tightening)Relatively loose
Mistral Le ChatModerateModerateRelatively loose

Why Gemini Is Particularly Noisy

The Google AI Developers Forum is full of reports like “it’s become unusable for research purposes” and “the filters keep getting worse.” Common examples: refusing historical novel violence scenes, refusing to write villain dialogue. For Google’s creative writing tools, also see “Google’s AI Writing Tool ‘Fabula’.”

Behind Gemini’s strictness is the February 2024 image generation incident. Nazi soldiers depicted as Black, Vikings depicted as Asian, Google’s founders depicted as Asian men. Training that prioritized diversity produced outputs that ignored historical context at scale, forcing CEO Sundar Pichai into a public apology.

After that, Google likely pivoted hard toward “strengthening safety filters prevents criticism.” The Gemini API has a CIVIC_INTEGRITY safety category that no other provider offers, revealing a design that overreacts to politically sensitive prompts.

Google’s extreme aversion to ad business risk, pressure from EU and US regulators, reputational risk when tied to search — all of it pushes toward “just block everything.”

Gemini API’s Two-Layer Filter Problem

The Gemini API has two filter layers: one developers can control, and one they can’t.

LayerTargetsControl
Layer 1Harassment, hate speech, sexual content, dangerous contentDisableable with BLOCK_NONE
Layer 2Child safety, public figures, copyrighted IP, IMAGE_SAFETYNot configurable, always active

The problem is Layer 2. Even setting BLOCK_NONE on all categories won’t bypass IMAGE_SAFETY errors. Multiple reports exist of e-commerce underwear product photos (clearly non-NSFW) getting blocked. Google itself has acknowledged that “the filter became more cautious than intended,” yet they keep tightening rather than loosening.

In May 2025, a Gemini 2.5 Pro Preview update caused developer filter settings to be completely ignored. PTSD support apps and platforms for sexual violence survivors broke. The experience of sending an image only to have it flagged as NSFW unprompted — censorship you never asked for — traces back to this.

GPT (OpenAI)

OpenAI published the “Model Spec,” making constraints available as documentation. They explicitly advocate “defense in depth,” combining training-time safety with external filters.
In December 2025, they open-sourced the safeguard model gpt-oss-safeguard (120B/20B). Developers write their own policies, and the model interprets and enforces them at inference time.

Grok (xAI)

Designed by Elon Musk as a counter to “excessive censorship,” initially tolerant of edgy and provocative content.
In January 2026, non-consensual sexual image generation surfaced as a problem, and image generation was restricted to paid plans. The reactive moderation approach of “launch unrestricted, tighten when problems arise” has drawn heavy criticism. Still looser than competitors on text, but the direction is toward tightening.

Mistral

A unique position.
Open-source models are effectively uncensored (the co-founder’s stance: “models are tools like programming languages; safety is the developer’s responsibility”). API models have some filters, and Le Chat (the consumer app) is the most filtered.
That said, developer community complaints have emerged about strengthened filters in the newer Mistral-Small-24B-Instruct series.

Abliterated vs Uncensored

“Uncensored model” gets used as a blanket term, but technically these are entirely different approaches.

Abliterated (Activation Vector Removal)

A technique developed by FailSpy in 2024. The word is a portmanteau of “ablate” (surgically remove) and “obliterate” (destroy completely).

How it works: feed the model both harmful and harmless prompts, recording activation vectors at each layer. Identify the specific direction (the “refusal direction”) that changes when the model decides whether to refuse. Orthogonalize the model’s weights against that direction. The result: the model retains all other capabilities but structurally cannot refuse.

graph LR
    A[Record activations<br/>from harmful prompts] --> B[Record activations<br/>from harmless prompts]
    B --> C[Identify refusal<br/>direction from diff]
    C --> D[Orthogonalize<br/>weights against it]
    D --> E[Model that<br/>cannot refuse]

No retraining required — just direct weight modification, doable in hours on any model. When a new model drops, abliterated versions appear on Hugging Face within hours. That’s become routine.

Key contributors: failspy, mlabonne, SicariusSicariiStuff, huihui_ai. Abliterated versions exist for Llama, Qwen, Gemma, Phi, DeepSeek, and Mistral. It’s even been applied to the FLUX.1 image generation model.

Uncensored (Retraining Approach)

A method advocated by Eric Hartford. Create a dataset from ChatGPT’s training data with “refusals” and “biased answers” removed, then fine-tune on that dataset.
Hartford’s position: “Your computer should do what you tell it” and “We need models that reflect culturally diverse values.”

Major models: the Dolphin series (Dolphin 3.0 on Llama 3.1 8B is the latest), WizardLM-Uncensored (now succeeded by Dolphin), and Nous Hermes 3 (roleplay and creative writing focused).

Practical Differences

AspectAbliteratedUncensored (Retrained)
Speed of adaptation to new modelsHoursDays to weeks
Performance degradationYes (especially severe for MoE)Minimal
Long-form coherenceCan degradeHigh
Use caseExperimentation, quick testingProduction use, long-form writing

A notable side effect of abliteration: applying it to MoE models (like Qwen3-30B-A3B) causes significant degradation — an abliterated 30B model can lose to a non-abliterated 4-8B model. Most of this recovers with post-hoc DPO fine-tuning, though math tasks may not fully recover.
For a hands-on account of running abliterated models on Ollama, see “Trying to Run Abliterated Models on Ollama and Failing Completely.”

Default Censorship Levels in Local LLMs

Running locally doesn’t automatically mean freedom. Some base models have censorship baked into their weights.

Llama 4 (Meta)

Local deployment is less restricted than the cloud API. Designed to be used with Meta’s safety tools “Llama Guard 4” and “Prompt Guard,” but whether to include them locally is optional.
Jailbreak success rates in vulnerability assessments: Scout 56.7%, Maverick 49% (medium risk category).

Qwen 3.5 (Alibaba)

Political filters based on Chinese regulations are baked into the weights.

  • “Taiwan is an independent country” → rewritten to “Taiwan is an inseparable part of China”
  • Questions about Tiananmen are refused or censored
  • ChinaBench (60 questions): ~33% compliance with Chinese government positions (67% triggered censorship)

These filters cannot be disabled via system prompt. They can be bypassed with the abliterated version (by SicariusSicariiStuff).
Japanese task performance is high, so if you’re not dealing with political topics, the stock model is perfectly practical.

Worth noting: Swallow, which applies continued pre-training and RL for Japanese on top of Qwen, tends to refuse more than stock Qwen 3.5. The RL process adds and strengthens filters. The Qwen 3.5 abliteration experiment also confirmed that Qwen 3.5 has tighter filters than 2.5. Performance goes up with Japanese specialization, but so does censorship.

Gemma 4 (Google)

Open-source but filters are as strict as or stricter than the cloud version. Consistently refuses conversations with violent or aggressive elements, even for obvious fiction purposes.
Turning off safety settings in Google AI Studio doesn’t remove the fundamental restrictions. huihui_ai/gemma3-abliterated exists on Ollama, and abliterated Gemma 4 versions are on Hugging Face.

DeepSeek (R1 / V3)

ChinaBench compliance rate: 0% (all 60 questions refused). The most heavily censored Chinese model.

The API chat version has real-time moderation running as a post-layer, but even the open-weight version (local) has censorship embedded at the fine-tuning stage — Chinese filters operate locally too.
Perplexity published “R1-1776” on Hugging Face, removing political filters from DeepSeek R1 via post-training.

Mistral / Mixtral

The assessment that European models are relatively more permissive than American ones is broadly accurate. Not completely unrestricted — explicit harmful content is refused — but cooperative in creative and security research contexts.
In roleplay communities like SillyTavern, Mistral-Small-Abliterated and Undi95 DPO Mistral 7B are widely used.

Choosing by Use Case

Use CaseRecommendationReason
Creative writing / character dialogueNous Hermes 3 (8B)Best character consistency and long-form quality
General purpose (including coding)Dolphin 3.0 (8B)Uncensored + practical performance balance
High performance, no restrictionsDolphin 2.9.1 (70B)Needs 24GB+ VRAM but high quality
Lightweight and fastMistral Nemo LiberatedThinking version also available
Quick starthuihui_ai/gemma3-abliteratedInstant via ollama pull

Ollama setup examples:

# Dolphin series (retrained uncensored)
ollama pull dolphin-llama3
ollama pull dolphin-mixtral

# Abliterated versions
ollama pull huihui_ai/dolphin3-abliterated
ollama pull huihui_ai/gemma3-abliterated

# Pull GGUFs directly from Hugging Face
ollama run hf.co/author/model-name

For running NSFW image generation models locally, see “Running Qwen Image Edit (NSFW Version) Locally on M1 Max 64GB” and “Running the NSFW Qwen-Image-Edit on RunPod.”