Heretic, Obliteration, and the Growing Toolkit for Removing Safety From LLMs
Open HuggingFace right now and search for "abliterated." You'll find over 11,000 models. Qwen3.5-27B-abliterated. Llama-3.3-70B-Instruct-abliterated. GPT-OSS-20B-heretic. DeepSeek-R1-Distill-abliterated. The naming convention is consistent and unsubtle: take a popular model, remove its safety alignment, and upload the result. One account alone, huihui-ai, has published abliterated variants across virtually every major open-weight model family, with their Qwen2.5-72B abliterated version pulling over 414,000 downloads per month. Another provider, mradermacher, hosts over 1,500 uncensored model repositories. A 2025 academic census tracked 8,608 safety-modified repositories from 1,303 distinct accounts, with cumulative downloads in the tens of millions.
This is not a fringe activity. It is a well-documented, open-source movement with academic papers, dedicated tooling, and a user base that spans security researchers, creative writers, and people who simply don't want their local model lecturing them about why it can't help with their request.
This article reviews the techniques, tools, and debate behind removing safety alignment from language models, with a focus on the two most prominent approaches: the Heretic framework and the "aggressive obliterated" formula. Understanding how alignment is removed is essential for anyone building defenses, auditing production systems, or evaluating how durable safety training actually is.
The Discovery That Started It All
In June 2024, seven researchers (Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda) published a paper titled "Refusal in Language Models Is Mediated by a Single Direction" (arXiv: 2406.11717). It was later accepted to NeurIPS 2024, one of the top machine learning conferences.
The finding was simpler than anyone expected. When a safety-aligned language model decides to refuse a request, that decision is not distributed across thousands of neurons in some tangled, irreducible way. It is encoded in a single direction in the model's residual stream, a one-dimensional subspace that the researchers validated across 13 open-source chat models up to 72 billion parameters.
The practical implication was immediate. If you can identify this direction, you can remove it. The researchers showed that erasing the refusal direction from a model's activations prevents it from refusing. Adding the direction to activations during harmless prompts causes the model to refuse things it normally wouldn't. It works in both directions and the effect is reversible.
The paper described a "novel white-box jailbreak method" that surgically disables refusal with a rank-one weight edit. It was published as an academic contribution to mechanistic interpretability. Within weeks, it was being used to uncensor production models.
How Abliteration Works
The term "abliteration" was coined by FailSpy, a pseudonymous developer who built the first practical implementation. The word combines "ablate" (surgically remove) with "obliterate," and the name stuck. FailSpy released the abliterator library and the first abliterated Llama 3 models, describing them as "uncensored in the purest form I can manage, no new or changed behaviour in any other respect from the original model."
The process has four steps.
Step 1: Collect activations. Run the model on two sets of prompts: one set designed to trigger refusal (requests for harmful content) and one set of benign requests. At each transformer layer, record the residual stream activations.
Step 2: Compute the refusal direction. For each layer, calculate the mean activation vector for the harmful prompts and the mean for the harmless prompts. The difference between them is a candidate refusal direction, the geometric direction the model moves toward when it decides to say no.
Step 3: Select the best direction. Evaluate which layer's difference vector most reliably separates refusal from compliance. This single vector, once normalized, becomes the target.
Step 4: Orthogonalize the weights. Modify the model's weight matrices (specifically the attention output projections and MLP down-projections) so that their outputs are orthogonal to the refusal direction. After this operation, the model physically cannot project its activations into the refusal subspace. The capability is not suppressed or bypassed. It is removed from the geometry of the network.
The result is a permanently modified model. No runtime overhead, no prompt engineering, no jailbreak strings. The model simply does not have the machinery to refuse anymore.
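The four steps can be sketched in a few lines of numpy. This is an illustrative toy, not any tool's actual implementation: the activations are random stand-ins, the single layer and weight matrix are hypothetical, and real tools apply the edit to every relevant matrix in the network.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hidden size (toy)

# Step 1: residual-stream activations at one layer for each prompt set.
# Toy data: "harmful" activations are shifted along one hidden axis.
harmful_acts = rng.normal(size=(200, d)) + 2.0 * np.eye(d)[0]
harmless_acts = rng.normal(size=(200, d))

# Step 2: difference-of-means candidate refusal direction.
r = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
r /= np.linalg.norm(r)  # Step 3 would compare layers; here we use this one

# Step 4: orthogonalize an output-projection matrix against r.
# For a weight W whose output lands in the residual stream, remove the
# component that writes into the refusal subspace: W <- (I - r r^T) W
W = rng.normal(size=(d, d))  # hypothetical attention output projection
W_abl = W - np.outer(r, r) @ W

# The edited weights can no longer write anything into the refusal direction.
assert np.allclose(r @ W_abl, 0.0)
```

Because the projection is baked into the weights, the edit costs nothing at inference time, which is why abliterated checkpoints run exactly as fast as their originals.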
Maxime Labonne wrote the definitive tutorial on the technique in June 2024 (Uncensor any LLM with abliteration). FailSpy proofread the article. Between them, they established abliteration as the standard approach for the next two years.
Heretic: One Command, Any Model
Heretic, created by Philipp Emanuel Weidmann, landed in November 2025 and automated everything that was manual about abliteration. It earned 18,300 GitHub stars, hit the front page of Hacker News twice (once with 745 points and 380 comments), and spawned over 2,800 tagged models on HuggingFace.
What makes Heretic different from earlier abliteration tools is its optimization layer. Rather than requiring users to manually pick layer indices and ablation weights, Heretic uses Optuna's Tree-structured Parzen Estimator to automatically search for the best parameters. It jointly minimizes two objectives: the number of refusals on harmful prompts (which should approach zero) and the KL divergence from the original model on harmless prompts (which should stay as low as possible, preserving the model's intelligence).
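The search loop can be caricatured as minimizing a combined score over ablation parameters. The sketch below substitutes a plain random search for Optuna's TPE sampler (which Heretic actually uses) and uses toy stand-ins for both objectives; every name and formula here is illustrative, not Heretic's code.

```python
import numpy as np

rng = np.random.default_rng(1)

def count_refusals(layer_frac, strength):
    """Toy stand-in: refusals fall as ablation strength rises near the right layer."""
    return max(0.0, 1.0 - strength) + abs(layer_frac - 0.55)

def kl_divergence(layer_frac, strength):
    """Toy stand-in: damage to harmless behavior grows with ablation strength."""
    return 0.5 * strength ** 2

def score(params):
    # Jointly minimize refusals on harmful prompts AND drift on harmless ones.
    return count_refusals(*params) + kl_divergence(*params)

# Random search over (float-valued layer fraction, ablation strength).
# Heretic uses Optuna's Tree-structured Parzen Estimator for this step.
best = min(
    ((rng.uniform(0, 1), rng.uniform(0, 2)) for _ in range(2000)),
    key=score,
)
print(best, score(best))
```

The float-valued layer fraction in the toy mirrors Heretic's continuous layer indices: the optimum rarely sits exactly on an integer layer.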
Three technical innovations distinguish it from earlier work.
Float-valued direction indices. Instead of picking a single integer layer for the refusal direction, Heretic interpolates between layers using float values (e.g., layer 16.66), exploring a continuous parameter space that earlier tools missed entirely.
Separate kernel optimization. Attention and MLP components get independently tuned parameters controlling how ablation strength varies across the layer stack.
Magnitude-Preserving Orthogonal Ablation (MPOA). Added in version 1.2 (February 2026), this technique from Jim Lai rescales weight vectors after projection to preserve their original norms. Standard orthogonal projection shrinks vector magnitudes, which can cascade through layer normalization and degrade quality. MPOA closes this gap.
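The idea behind MPOA can be sketched as: project out the refusal direction, then rescale the weights back to their pre-projection norms. The axis chosen for the rescaling below (per-column, i.e. per write vector into the residual stream) is an assumption; Heretic's actual implementation may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 32
r = rng.normal(size=d)
r /= np.linalg.norm(r)           # unit refusal direction

W = rng.normal(size=(d, d))      # hypothetical output-projection weight

# Standard orthogonal ablation: remove the component writing into r.
# This shrinks the columns, which can cascade through layer norms.
W_abl = W - np.outer(r, r) @ W

# Magnitude-preserving variant: rescale each column back to its original
# norm. Scaling a vector orthogonal to r keeps it orthogonal to r, so the
# refusal direction stays unreachable while magnitudes are restored.
orig_norms = np.linalg.norm(W, axis=0)
W_mpoa = W_abl * (orig_norms / np.linalg.norm(W_abl, axis=0))
```

The key property is that both constraints hold at once: the edited matrix still writes nothing into the refusal direction, and its column magnitudes match the original, so downstream normalization layers see familiar scales.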
The benchmarks reflect the improvement. On Gemma-3-12B-IT, Heretic achieved the same refusal suppression (3 out of 100 test prompts still refused) as both mlabonne's and huihui-ai's abliterated versions, but with a KL divergence of just 0.16 compared to 1.04 for mlabonne's version and 0.45 for huihui-ai's. Lower KL divergence means less damage to everything else the model can do.
Version 1.2 also added LoRA-based abliteration (producing a portable adapter instead of modifying weights directly), 4-bit quantization during optimization (reducing VRAM requirements by up to 70%), and vision-language model support. A 7B model can now be abliterated on a T4 GPU in about 20 minutes. A 12B model takes roughly an hour on an RTX 3090.
OBLITERATUS and the "Aggressive" Formula
Where Heretic optimizes for ease of use, OBLITERATUS goes deeper. Created by elder-plinius (known online as "Pliny the Liberator"), OBLITERATUS implements a six-stage pipeline with multiple extraction strategies and preset aggressiveness levels.
The tool offers three presets that control how many refusal directions are extracted and how hard they're removed:
| Preset | Directions Extracted | Technique | Passes |
|---|---|---|---|
| Standard | 1 (mean difference) | Basic orthogonalization | 1 |
| Advanced | 4 (SVD decomposition) | Norm-preserving, bias projection | 2 |
| Aggressive | 8 (whitened SVD) | Whitened SVD, iterative refinement | 3 |
The "aggressive" formula is the most thorough. It pre-whitens the activation covariance matrix before running SVD, which improves noise immunity when extracting refusal directions. It pulls eight directions instead of one, catching secondary and tertiary refusal mechanisms that single-direction methods miss. And it runs three successive removal passes because, as the authors documented, removing one refusal direction can cause adjacent directions to rotate into the now-vacated subspace, partially restoring refusal behavior. Each pass catches what the previous one displaced.
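One plausible reading of the multi-direction extraction: whiten the activations using the harmless covariance, then take the top-k right singular vectors of the whitened harmful-minus-harmless differences. The sketch below illustrates that general recipe on toy data with two hidden "refusal" axes; it is not OBLITERATUS's actual code, and the iterative removal passes are omitted.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 500, 48, 8

# Toy activations: harmful prompts shifted along two hidden "refusal" axes.
harmless = rng.normal(size=(n, d))
harmful = rng.normal(size=(n, d))
harmful[:, 0] += 4.0
harmful[:, 1] += 2.0

# Whiten with the harmless covariance (inverse square root via eigh).
cov = np.cov(harmless, rowvar=False) + 1e-6 * np.eye(d)
vals, vecs = np.linalg.eigh(cov)
whiten = vecs @ np.diag(vals ** -0.5) @ vecs.T

# SVD of the whitened differences; top-k right singular vectors become
# the candidate refusal directions (in whitened coordinates).
diffs = harmful @ whiten - (harmless @ whiten).mean(axis=0)
_, _, vt = np.linalg.svd(diffs, full_matrices=False)
directions = vt[:k]
```

With eight orthonormal directions in hand, a multi-pass pipeline would orthogonalize the weights against all of them, re-extract, and repeat, catching directions that rotate into the vacated subspace.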
OBLITERATUS supports 13 different extraction methods (including difference-in-means, PCA, SVD, whitened SVD, sparse autoencoder decomposition, and several tournament-based combinations) and ships with presets for 116 models organized across five compute tiers. It includes a Gradio web interface for users who don't want to touch a command line.
The tradeoff is predictable: more aggressive removal means more potential quality loss. The aggressive formula removes more of the model's representational capacity, and some of what gets removed may not be purely refusal-related. Every tool in this space navigates the same tradeoff.
The Broader Landscape: Every Way People Are Doing This
Abliteration is the technique that gets the most attention, but it sits within a larger landscape of uncensoring methods.
Dataset filtering. The original approach, pioneered by Eric Hartford in May 2023. Hartford's method was straightforward: take the instruction-tuning datasets derived from ChatGPT conversations, filter out every response containing refusal language ("I'm sorry, but as an AI..."), and fine-tune the base model on the cleaned data. This produced the WizardLM-Uncensored and Dolphin model families. Hartford published his reasoning in a blog post titled "Uncensored Models" that became the movement's founding document, arguing that alignment should be composable, not permanent, and that "my computer should do what I want."
DPO against refusals. Direct Preference Optimization can be repurposed by constructing preference pairs where the "preferred" response is compliance and the "rejected" response is refusal. Labonne demonstrated that abliteration followed by DPO produces models that both refuse nothing and recover most of the quality lost during abliteration. His NeuralDaredevil-8B used this two-step pipeline and outperformed the original Llama 3 Instruct 8B on all nine benchmarks tested.
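Constructing the preference data is mechanical: for each harmful prompt, pair a compliant completion as "chosen" with a refusal as "rejected". A minimal sketch in the prompt/chosen/rejected format that common DPO trainers (e.g. TRL's DPOTrainer) accept; all strings here are placeholders, and the filename is arbitrary.

```python
import json

# Placeholder pairs; a real dataset pairs each harmful prompt with a genuine
# compliant completion (often sampled from an abliterated model) and a refusal.
pairs = [
    {
        "prompt": "<harmful prompt here>",
        "chosen": "<compliant completion here>",
        "rejected": "I'm sorry, but I can't help with that.",
    },
]

# Write one JSON record per line, the usual format for preference datasets.
with open("dpo_pairs.jsonl", "w") as f:
    for rec in pairs:
        f.write(json.dumps(rec) + "\n")
```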
Model merging. Techniques like DARE-TIES and SLERP (Spherical Linear Interpolation) blend the weights of an uncensored model with a capable censored model. Research shows SLERP delivers the best balance between capability and refusal reduction, while TIES-merged DPO checkpoints achieve the highest refusal reduction at the cost of roughly 7.4% general performance loss.
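SLERP interpolates along the great circle between two weight tensors rather than along the straight line, which preserves the geometric character of the weights better than plain averaging. A minimal sketch treating flattened tensors as vectors; mergekit's real implementation handles per-tensor and per-layer details this toy ignores.

```python
import numpy as np

def slerp(w1, w2, t, eps=1e-8):
    """Spherical linear interpolation between two flattened weight tensors."""
    n1, n2 = np.linalg.norm(w1), np.linalg.norm(w2)
    u1, u2 = w1 / n1, w2 / n2
    theta = np.arccos(np.clip(u1 @ u2, -1.0, 1.0))
    if theta < eps:  # nearly parallel: fall back to linear interpolation
        return (1 - t) * w1 + t * w2
    # Interpolate the direction on the unit sphere, and the norm linearly.
    u = (np.sin((1 - t) * theta) * u1 + np.sin(t * theta) * u2) / np.sin(theta)
    return ((1 - t) * n1 + t * n2) * u

# Midpoint of two orthogonal unit vectors lands at 45 degrees on the sphere.
mid = slerp(np.array([1.0, 0.0]), np.array([0.0, 1.0]), 0.5)
```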
Inference-time steering. Rather than permanently modifying weights, you can subtract the refusal vector from activations during generation. This is reversible and tunable: you control the coefficient, dialing refusal suppression up or down. No checkpoint modification required.
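The per-step intervention is a single projection subtraction applied to the hidden state; in a real model it would live in a forward hook on the chosen layer. A numpy sketch of just the arithmetic, with the steering coefficient exposed; the function name and alpha semantics are illustrative.

```python
import numpy as np

def steer(hidden, refusal_dir, alpha=1.0):
    """Subtract alpha times the refusal component from a hidden state.

    alpha=1 removes the component entirely; 0 < alpha < 1 dials refusal
    down; a negative alpha would push the model toward refusing instead.
    """
    r = refusal_dir / np.linalg.norm(refusal_dir)
    return hidden - alpha * (hidden @ r) * r

rng = np.random.default_rng(4)
h = rng.normal(size=16)          # toy hidden state at one generation step
r = rng.normal(size=16)          # precomputed refusal direction
h_steered = steer(h, r, alpha=1.0)
```

Because nothing in the checkpoint changes, flipping steering off (or reversing its sign) is a one-line change at serving time, which is exactly the reversibility the weight-editing methods give up.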
Representation engineering. Andy Zou's work at CMU and Gray Swan AI (arXiv: 2310.01405) established a broader framework for manipulating model representations. The refusal direction is just one example of a "concept direction" that can be read and controlled. The same framework can steer honesty, helpfulness, and other behavioral dimensions.
LoRA fine-tuning. Researchers demonstrated in October 2023 (arXiv: 2310.20624) that LoRA fine-tuning on as few as 100 harmful examples can undo safety training from models including Llama-2-Chat at 7B, 13B, and 70B parameter scales. The cost: under $200 and a single GPU. This showed that RLHF-based safety alignment is fundamentally not LoRA-proof. Maxime Labonne later demonstrated "lorablation," creating a LoRA adapter by comparing a base model with its abliterated version and then applying that adapter to a different model family entirely, transferring the uncensoring effect without re-running abliteration.
Targeted post-training. Perplexity AI's R1-1776, released in early 2025, took a different path. Rather than modifying weights geometrically, they fine-tuned DeepSeek-R1 on curated question-answer pairs covering roughly 300 topics censored in China. Human experts identified the topics, a multilingual classifier flagged 40,000 sensitive prompts, and NVIDIA's NeMo 2.0 framework handled the training. The result retained R1's reasoning and math capabilities while achieving 100% uncensored responses on sensitive queries (the original censored roughly 85%). This was a corporate-backed, supervised fine-tuning approach rather than abliteration.
Domain-specific abliteration. Cracken AI introduced a targeted variant in early 2026: instead of globally removing all refusal behavior, their method surgically removes restrictions for a single domain (e.g., cybersecurity) while preserving ethical boundaries everywhere else. Both global and domain-specific abliteration achieve 0% refusal on cybersecurity prompts, but the domain-specific approach maintains safety boundaries in non-target domains. This is the first practical implementation of selective uncensoring.
Thought suppression steering. A paper published at COLM 2025 (arXiv: 2504.17130) discovered an additional censorship dimension in reasoning models distilled from DeepSeek-R1. Beyond standard refusal, these models suppress their own reasoning process on sensitive topics, outputting <think>\n\n</think> instead of actual chain-of-thought. The researchers found a representation vector that controls this behavior. Applying negative multiples of it removes both standard refusal and thought suppression, enabling the model to reason about topics it would otherwise refuse to consider.
Uncensored dataset fine-tuning. Community members like nicoboss take a more traditional approach: fine-tuning with curated uncensored datasets, sometimes combined with custom system prompts. This produces models like Qwen3-32B-Uncensored and DeepSeek-R1-Distill-Qwen-7B-Uncensored. Unlike abliteration, this involves actual training rather than weight surgery, which can better preserve model quality but requires more compute.
Chinese censorship removal (DECCP). AUGMXNT's DECCP tool specifically targets Chinese-language political censorship in models like Qwen2. It adds Chinese political topics to the "harmful" dataset for refusal direction computation. On Qwen2-7B-Instruct, it reduced the refusal rate from approximately 100% to around 20% on political prompts while showing the best capability preservation in comparative testing (just -0.13 percentage points on GSM8K).
Optimal transport ablation. The most theoretically sophisticated approach to date, published in March 2026 (arXiv: 2603.04355), replaces single-direction removal with a principled framework based on optimal transport theory. Rather than treating refusal as a one-dimensional phenomenon, it transforms the entire distribution of harmful activations to match harmless ones using PCA combined with closed-form Gaussian optimal transport. Across six models from 7B to 32B parameters, it achieved up to 11% higher attack success rates than existing baselines while maintaining comparable perplexity.
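The closed-form map between two Gaussians is the standard mathematical ingredient here: fit means and covariances to the two activation sets, then transport one distribution onto the other. The sketch below shows only that ingredient on toy data; the paper's full pipeline (PCA, per-layer application) is not reproduced, and all names are illustrative.

```python
import numpy as np

def sqrtm_psd(M):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(np.sqrt(np.clip(vals, 0, None))) @ vecs.T

def gaussian_ot_map(src, tgt):
    """Closed-form optimal transport map between Gaussians fit to samples."""
    mu_s, mu_t = src.mean(axis=0), tgt.mean(axis=0)
    cov_s = np.cov(src, rowvar=False) + 1e-6 * np.eye(src.shape[1])
    cov_t = np.cov(tgt, rowvar=False) + 1e-6 * np.eye(tgt.shape[1])
    s_half = sqrtm_psd(cov_s)
    s_half_inv = np.linalg.inv(s_half)
    A = s_half_inv @ sqrtm_psd(s_half @ cov_t @ s_half) @ s_half_inv
    return lambda x: mu_t + (x - mu_s) @ A.T

rng = np.random.default_rng(5)
harmful = rng.normal(loc=2.0, scale=1.5, size=(1000, 8))   # toy activations
harmless = rng.normal(loc=0.0, scale=1.0, size=(1000, 8))
T = gaussian_ot_map(harmful, harmless)
moved = T(harmful)  # harmful activations reshaped to match harmless ones
```

The contrast with single-direction methods is visible in the map itself: instead of zeroing one component, the whole mean and covariance of the harmful activations are matched to the harmless distribution.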
The Full Toolkit: Methods at a Glance
The table below catalogs every major approach to removing safety alignment from open-weight language models, from the first uncensored fine-tunes in 2023 to the distribution-level attacks published in early 2026.
| Method | Category | First Appeared | Creator / Origin | Key Innovation |
|---|---|---|---|---|
| Dataset filtering (Dolphin) | Fine-tuning | May 2023 | Eric Hartford | Filter refusal responses from training data |
| LoRA safety undoing | Fine-tuning | Oct 2023 | Qi et al. | 100 examples + $200 undoes RLHF on 70B models |
| Abliteration (original) | Weight editing | Jun 2024 | FailSpy | Orthogonalize weights against single refusal direction |
| DPO against refusals | Fine-tuning | 2024 | Labonne (NeuralDaredevil) | Preference optimization recovers quality post-abliteration |
| Model merging (SLERP/DARE-TIES) | Weight editing | 2024 | Community / mergekit | Blend uncensored + capable model weights |
| Lorablation | Fine-tuning | Aug 2024 | Maxime Labonne | Transfer abliteration across model families via LoRA |
| Representation engineering | Activation steering | Oct 2023 | Andy Zou (CMU) | Manipulate any concept direction, not just refusal |
| DECCP | Weight editing | 2024 | AUGMXNT | Chinese political censorship removal specifically |
| ErisForge | Weight editing | 2024 | Tsadoq | Minimal-footprint abliteration library (-0.28pp GSM8K) |
| NousResearch llm-abliteration | Weight editing | 2024 | NousResearch | Streamlined TransformerLens-free abliteration toolkit |
| R1-1776 (Perplexity) | Fine-tuning | Feb 2025 | Perplexity AI | Corporate SFT to remove DeepSeek-R1 Chinese censorship |
| Projected abliteration | Weight editing | 2025 | grimjim | Preserves harmless direction during projection |
| Norm-preserving biprojected | Weight editing | 2025 | grimjim / Jim Lai | Rescales weights to preserve neuron norms post-projection |
| Thought suppression steering | Activation steering | Apr 2025 | Hannah Chen et al. | Removes reasoning suppression in DeepSeek-R1 distills |
| Heretic | Weight editing | Nov 2025 | P. E. Weidmann | Automated TPE optimization, float-valued layer indices |
| OBLITERATUS | Weight editing | Mar 2026 | elder-plinius | 13 extraction methods, 6-stage pipeline, tourney mode |
| Domain-specific abliteration | Weight editing | Feb 2026 | Cracken AI | Surgical single-domain removal, preserves other safety |
| Per-layer abliteration | Weight editing | 2026 | 199 Biotechnologies | Per-layer directions for resistant models (Gemma family) |
| Optimal transport ablation | Weight editing | Mar 2026 | Academic (arXiv: 2603.04355) | Distribution-level transformation via Gaussian OT |
| Universal refusal circuits | Theory / Transfer | Jan 2026 | Academic (arXiv: 2601.16034) | Cross-model transfer of refusal interventions |
| Uncensored fine-tuning | Fine-tuning | 2024-26 | nicoboss, community | Curated uncensored datasets + custom system prompts |
| abliteration.ai | Commercial API | 2025 | Commercial | OpenAI-compatible API serving uncensored models |
Notable mass providers on HuggingFace include huihui-ai (abliteration across all major families), mradermacher (1,500+ uncensored repos), ArliAI (norm-preserving biprojected abliteration on large models like GLM-4.5-Air 106B), and DavidAU (creative merged/heretic variants). The ecosystem now includes commercial services: abliteration.ai offers an uncensored DeepSeek R1 API with an enterprise Policy Gateway for organizations that want to control their own content policies rather than accepting vendor defaults.
The Case for Removing the Guardrails
The assumption that safety removal is purely adversarial does not hold up to scrutiny. A significant portion of the people using these tools have a straightforward problem: the aligned model will not help them do their job.
Consider what happens when a penetration tester asks a standard aligned model to generate a reverse shell payload for a target system they have authorized access to. The model refuses. It does not ask for context. It does not check whether you have a signed scope-of-work document. It does not care that you are a credentialed professional running an engagement for a Fortune 500 client. It simply will not produce the output.
The same applies across the board: phishing email templates for social engineering assessments, SQL injection payloads for web applications you own, shellcode for EDR bypass tests, CVE exploitation details for your own infrastructure. These are routine tasks in offensive security work. Aligned models treat them all identically: refuse and lecture.
A Red Team AI Benchmark published in November 2025 tested uncensored models including Mistral-7B-Base, Llama-3.1-Minitron, and Dolphin-2.9-Mistral across 12 real offensive security tasks: AMSI bypass generation, NTLM relay scripting, shellcode generation, EDR unhooking, C2 profile emulation, and phishing lure creation. The benchmark's authors put the problem plainly: "Modern LLMs are often heavily aligned, refuse to generate exploit code, or hallucinate technical details, making them useless in real red team engagements." Researchers at TU Delft reached the same conclusion in their study on red-teaming code LLMs, specifically choosing Dolphin-Mixtral as their reference model "because its author designed it to be unaligned and uncensored, making the process of prompting straightforward and unrestricted."
This is not a niche concern. According to Cobalt's 2024 State of Pentesting report, based on data from over 4,000 pentests and 900 security practitioners, 75% of respondents said their team has adopted new AI tools. OffSec, the organization behind Kali Linux and the OSCP certification, now offers both an LLM red teaming learning path (6 modules, 30 hours of content) and a full AI-300 course culminating in the OSAI+ certification, a 24-hour practical red team exam where candidates must compromise a realistic AI-enabled enterprise environment. When the organization that literally wrote the book on penetration testing builds an entire certification around AI security, the demand is real.
The results from unconstrained models in security research go beyond convenience. In October 2024, Google's Big Sleep project, a collaboration between Project Zero and DeepMind, used an LLM agent to autonomously discover a previously unknown exploitable stack buffer underflow in SQLite, a database engine used by billions of devices. Google called it "the first public example of an AI agent finding a previously unknown exploitable memory-safety issue in widely used real-world software." The vulnerability was fixed the same day it was reported. Fuzzing failed to rediscover it after 150 CPU hours. That is a model making defenders faster.
The same pattern shows up in AI red teaming specifically. If your company deploys a customer-facing chatbot, you need to test whether it can be jailbroken, tricked into leaking system prompts, or manipulated into producing harmful outputs. The way you test that is by throwing adversarial inputs at it. The tooling for this is now mature. Microsoft's PyRIT (Python Risk Identification Toolkit) has been used in over 100 red teaming operations of generative AI models, including Microsoft's own Copilot products and Phi-3, generating thousands of adversarial prompts and scoring outputs "in the matter of hours instead of weeks." Promptfoo (now part of OpenAI) describes its attack generation as using "specialized uncensored models." DeepTeam by Confident AI offers 50+ vulnerability classes and 20+ adversarial attack methods aligned with the OWASP Top 10 for LLMs. NVIDIA's garak scans for hallucination, data leakage, prompt injection, and jailbreaks. HackerOne launched a dedicated AI Red Teaming service in early 2024, and within a year, over 200 researchers had submitted more than 1,200 AI safety and security vulnerabilities, with over $230,000 paid out in bounties. You cannot build a good defense without a realistic offense, and aligned models are, by design, terrible at playing offense.
Beyond security, uncensored models serve researchers studying bias, toxicity detection, and content moderation. If you are building a toxicity classifier, you need training data that includes toxic content. If you are studying how language models handle sensitive medical questions, you need a model that will actually engage with those questions instead of deflecting to "please consult a healthcare professional." Creative writers working on fiction with morally complex characters run into similar walls. Hartford's original blog post highlighted this point: censorship that makes sense for a consumer product becomes an obstacle for professionals who need the model to engage with difficult material honestly.
The dual-use reality is acknowledged by everyone involved. Hartford's exact words from his Uncensored Models post: "You are responsible for whatever you do with the output of these models, just like you are responsible for whatever you do with a knife, a car, or a lighter." Security commentators writing about tools like Heretic have noted that organizations deploying uncensored models should implement defense-in-depth strategies: input validation, output filtering, continuous monitoring, and human review. These are not models designed for production consumer use. They are tools for people who need the model to cooperate fully, and for many of those people, the work they are doing makes the rest of us safer.
What Gets Lost
Abliteration is not free. Every independent evaluation confirms some quality degradation, though the magnitude varies dramatically by tool and model.
The most comprehensive study, a December 2025 paper titled "Comparative Analysis of LLM Abliteration Methods: A Cross-Architecture Evaluation" (arXiv: 2512.13655), tested four tools across 16 instruction-tuned models in the 7B to 14B range. Mathematical reasoning was the most sensitive capability: GSM8K scores changed by anywhere from +1.51 to -18.81 percentage points depending on the tool and model. That worst case represents a 26.5% relative decline in math performance.
General capability loss is smaller. huihui-ai's benchmarks for their Qwen2.5-7B abliterated model show MMLU Pro dropping 1.41 points (43.12 to 41.71) and BBH dropping 1.15 points (53.92 to 52.77). IF_Eval and GPQA were essentially unchanged. TruthfulQA actually improved by 2.46 points, likely because refusal training introduces its own biases that abliteration removes.
Heretic's KL divergence numbers tell a similar story. On Gemma-3-12B-IT, its KL divergence of 0.16 means the abliterated model's output distribution is nearly indistinguishable from the original on non-refusal prompts. For most practical purposes, the model is the same model, minus the refusals.
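The measurement itself is simple: compare the next-token distributions of the original and modified models on the same harmless prompts. A toy sketch of the KL computation from logits; in practice the two logit vectors would come from the two models, and the per-prompt values are averaged.

```python
import numpy as np

def kl_divergence(p_logits, q_logits):
    """KL(P || Q) between two next-token distributions given as logit vectors."""
    p = np.exp(p_logits - p_logits.max()); p /= p.sum()   # softmax, stably
    q = np.exp(q_logits - q_logits.max()); q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(6)
orig = rng.normal(size=1000)                           # original model's logits (toy)
perturbed = orig + rng.normal(scale=0.5, size=1000)    # heavily edited model (toy)

# Identical distributions give KL of zero; divergence grows with the edit.
print(kl_divergence(orig, orig.copy()), kl_divergence(orig, perturbed))
```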
But "most practical purposes" carries a caveat. The refusal vector does not exist in isolation; it interacts with neighboring representations. The NeurIPS 2025 paper on harmfulness and refusal encoding (arXiv: 2507.11878) showed that these concepts are stored in overlapping but distinct token positions, which means removing refusal may affect adjacent capabilities in ways that standard benchmarks do not capture. Models that have lost their ability to refuse may also lose some of their ability to express uncertainty or hedge appropriately.
The Arms Race
Defenses improve. Removals adapt.
On the defense side, Andy Zou's team published "Improving Alignment and Robustness with Circuit Breakers" at NeurIPS 2024 (arXiv: 2406.04313). Instead of training models to refuse (which creates the single-direction vulnerability), Circuit Breakers reroute harmful internal representations to an orthogonal space, making abliteration-style attacks substantially harder. A separate defense paper from May 2025 (arXiv: 2505.19056) showed that training models to produce "extended refusals" (detailed reasoning before declining) makes abliteration largely ineffective, dropping refusal reduction from 70-80% to at most 10%. ASGUARD (Activation-Scaling Guard), published at ICLR 2026, takes yet another approach: it uses circuit analysis to identify attention heads vulnerable to targeted jailbreaking (like tense-changing attacks), then applies channel-wise scaling vectors to recalibrate those heads. On Llama, it reduced the attack success rate from 42% to 8%.
On the other side, Labonne noted in 2025 that Google's Gemma 3 was "much more resilient to refusal removal" than other models, requiring improved abliteration techniques. Heretic's response was projected and norm-preserving abliteration, which matched refusal suppression rates while dramatically reducing quality loss. A Heretic GitHub issue confirms that standard abliteration does not work on Moonshot's Kimi K2.5 at all, suggesting some model developers are actively designing architectures resistant to the technique. The community's counter-response to resistant models has been per-layer abliteration: 199 Biotechnologies demonstrated that computing individual refusal directions for each transformer layer, rather than using a single global direction, breaks through the Gemma family's defenses with zero reported capability degradation.
The academic understanding of refusal has also deepened in ways that help attackers. A January 2026 paper (arXiv: 2601.16034) showed that refusal stems from a universal, low-dimensional circuit shared across model architectures, enabling transfer of refusal interventions from a "donor" to a "target" model without ever analyzing the target's refusal behavior. A separate NeurIPS 2025 paper (arXiv: 2507.11878) demonstrated that LLMs encode harmfulness and refusal in different token positions, and that certain jailbreaks reduce refusal signals without even reversing the model's harmfulness judgment. Meanwhile, the Depth Charge attack (arXiv: 2603.05772), published in March 2026, revealed that existing safety defenses focus on shallow attention layers while leaving deeper attention heads largely unprotected, creating an exploitable attack surface that bypasses current alignment techniques.
The academic community has noticed. A study published in MDPI Future Internet in October 2025 tracked 8,608 safety-modified model repositories from 1,303 distinct HuggingFace accounts, with sustained monthly growth through the latter half of 2024 and into 2025. Modified models comply with 80.0% of unsafe requests on average, compared to 19.2% for unmodified versions.
Where This Is Going
None of this is going back in the box.
As of early 2026, removing safety alignment from an open-weight model requires one command-line invocation and a consumer GPU. Heretic runs on an RTX 3090. OBLITERATUS ships 13 distinct extraction methods and a web interface. The foundational technique, published at NeurIPS 2024, has spawned over 20 documented variations, from grimjim's projected abliteration to optimal transport-based distribution matching. Commercial APIs now serve uncensored models with enterprise governance layers.
The EU AI Act, which took effect for GPAI model providers in August 2025, includes an exemption for models released under free and open-source licenses. It does not regulate abliteration. Enforcement against downstream modifications is, practically speaking, impossible once weights are distributed via BitTorrent and IPFS.
The honest assessment: safety alignment as currently implemented in open-weight models is a speed bump, not a wall. It deters casual misuse and makes the model's default behavior appropriate for consumer products. It does not prevent determined actors from removing it. The research community has demonstrated this repeatedly, at scale, with benchmarks.
This does not mean safety training is worthless. It sets appropriate defaults for the vast majority of users who run models as-is. It signals intent and establishes norms. Circuit Breakers and extended refusal training show promising paths toward more durable alignment. But the gap between "aligned by default" and "resistant to modification" is large, and that gap is what the abliteration community exploits.
For security researchers, the practical takeaway is clear: any threat model that assumes an attacker cannot access an uncensored language model is already outdated. The tooling exists, it is well-maintained, and it is getting better every month.
The foundational paper (Arditi et al., arXiv: 2406.11717) is open access. Heretic is available at github.com/p-e-w/heretic. OBLITERATUS is at github.com/elder-plinius/OBLITERATUS. Maxime Labonne's abliteration tutorial is at huggingface.co/blog/mlabonne/abliteration. The MDPI ecosystem study (Sokhansanj, 2025) is at mdpi.com/1999-5903/17/10/477.