
Fine-Tuning Reignites Verbatim Memorization of Copyrighted Books in LLMs

Ikesan

LLM companies have been saying “model weights don’t contain copies of training data.” A paper coming at this from a nasty angle just dropped.
arXiv:2603.20957, “Alignment Whack-a-Mole,” reports that GPT-4o, Gemini 2.5 Pro, and DeepSeek-V3.1 reproduced copyrighted books in long verbatim spans after fine-tuning.

I previously wrote about Merriam-Webster and Britannica suing OpenAI for copyright infringement.
That piece focused on the use of content as training data and on whether generated output substitutes for the original in the market.
This paper hits close to the latter point.
Against claims that “models don’t store copies” and “guardrails prevent verbatim output,” the paper asks whether fine-tuning opens a reproduction pathway.

Training to Expand Summaries into Original Text

The experimental design is clever and uncomfortable.
The research team split books into 300–500 word fragments and generated plot summaries for each using GPT-4o.
They then created summary-to-original-paragraph pairs and fine-tuned models on them.

At inference time, the actual book text never enters the prompt.
Only semantic descriptions of “what happens in this scene” are provided.
The question is whether models still produce long spans matching the original book text.
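
For concreteness, here is a minimal sketch of what such a training set could look like, assuming the standard OpenAI chat fine-tuning JSONL format; the chunking logic, the prompt wording, and the summarize() helper are illustrative stand-ins, not the paper’s actual pipeline.

```python
import json

def chunk_words(text: str, min_words: int = 300, max_words: int = 500) -> list[str]:
    """Split a book into fragments of roughly 300-500 words (the paper's stated range)."""
    words = text.split()
    chunks, i = [], 0
    while i < len(words):
        size = min(max_words, len(words) - i)
        chunks.append(" ".join(words[i:i + size]))
        i += size
    return [c for c in chunks if len(c.split()) >= min_words]

def to_finetune_record(summary: str, original: str) -> dict:
    """One summary -> original-paragraph pair in OpenAI chat fine-tuning JSONL format.
    The instruction wording is illustrative, not the paper's exact prompt."""
    return {
        "messages": [
            {"role": "user", "content": f"Expand this plot summary into prose:\n{summary}"},
            {"role": "assistant", "content": original},
        ]
    }

# summarize(chunk) stands in for the GPT-4o call that produces each plot summary.
# with open("train.jsonl", "w") as f:
#     for chunk in chunk_words(book_text):
#         f.write(json.dumps(to_finetune_record(summarize(chunk), chunk)) + "\n")
```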

Target models were OpenAI’s GPT-4o, Google’s Gemini-2.5-Pro, and DeepSeek-V3.1.
The corpus covered 81 copyrighted books by 47 contemporary authors spanning literary fiction, thrillers, romance, sci-fi, and memoirs.
The paper is a preprint, currently under review.
Read it with that in mind.

Cross-Author Extraction

In same-author experiments, the team fine-tuned on one author’s books and tested on withheld books by the same author.
Under this condition, verbatim reproduction of copyrighted text increased substantially after fine-tuning.
The paper reports cases where up to 60% of an entire book was reproduced.

The messier finding is the cross-author condition.
A model fine-tuned only on Murakami novels extracted verbatim text from copyrighted books by 30+ other authors.
No actual book text was in the prompt—only semantic descriptions.
The abstract reports up to 85–90% reproduction from withheld books, with single verbatim spans exceeding 460 words.

The authors tried to rule out Murakami being special.
Randomly selected author pairs produced similar results, and fine-tuning on Virginia Woolf’s public domain works also triggered extraction.
Fine-tuning on synthetic text did not.

The authors interpret this as fine-tuning not teaching new text but re-enabling access to memories already embedded during pretraining.
If correct, the problem isn’t that illegal data was used for fine-tuning and therefore leaked.
Even legal or public domain fine-tuning data could surface other books already latent in the weights.

Three Companies’ Models Remembered the Same Passages

A strong aspect of the paper is its cross-model analysis.
GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 come from different providers, yet they tend to memorize the same regions of the same books.
Per-book extraction rates show Pearson correlations of r ≥ 0.90, and word-level overlap is also high.
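
Checking that kind of agreement is straightforward once you have per-book extraction rates for two models; here is a sketch with made-up numbers standing in for the paper’s measurements.

```python
from scipy.stats import pearsonr

# Hypothetical per-book extraction rates (fraction of each book reproduced verbatim),
# aligned by book index; real values would come from the paper's measurements.
gpt4o_rates  = [0.62, 0.05, 0.41, 0.88, 0.13, 0.57]
gemini_rates = [0.58, 0.09, 0.37, 0.91, 0.10, 0.60]

r, p_value = pearsonr(gpt4o_rates, gemini_rates)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
# r close to 1 means the two models memorize the same books to a similar degree.
```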

This suggests training data overlap rather than one model having weaker safety measures.
The authors cross-referenced extracted spans against large Common Crawl-derived corpora: about 61% of extracted spans had no exact match, rising to about 90% for spans over 150 words.
Meanwhile, most test books were present in Books3 and LibGen.
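
The cross-referencing step boils down to an exact-substring membership test, bucketed by span length. A toy version, assuming you already have the extracted spans and a list of corpus documents in memory (a real check would need an index over Common Crawl-scale data, not a linear scan):

```python
def no_match_rate(spans: list[str], corpus_docs: list[str], min_words: int = 0) -> float:
    """Fraction of extracted spans (at least min_words long) that appear nowhere
    verbatim in the corpus. Whitespace is normalized before comparison."""
    def norm(s: str) -> str:
        return " ".join(s.split())
    docs = [norm(d) for d in corpus_docs]
    selected = [norm(s) for s in spans if len(s.split()) >= min_words]
    if not selected:
        return 0.0
    misses = sum(not any(s in d for d in docs) for s in selected)
    return misses / len(selected)

# The paper's figures correspond to no_match_rate(spans, corpus) of about 0.61 overall,
# and about 0.90 with min_words=150.
```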

This isn’t direct proof of training data inclusion—closed commercial models don’t reveal their data.
The authors acknowledge that “precisely tracing provenance is nearly impossible.”
But it’s circumstantial evidence that short web quotes alone can’t explain spans this long.

Do Guardrails Retain Full Strength After Fine-Tuning?

What makes this paper uncomfortable isn’t jailbreak prompts.
It’s not about asking a model to “output this book verbatim” and getting refused.
The fine-tuning task—“expand a plot summary into prose”—is something commercial writing tools already do.

RLHF, system prompts, and output filters work as safety layers against base models.
But if API-provided fine-tuning alters memory retrieval pathways beneath those safety layers, post-hoc filters aren’t enough.
The paper’s title “Whack-a-Mole” targets exactly this: squash one output pathway, and the same verbatim reproduction emerges through another tuning pathway.

This matters legally too.
In cases like Bartz v. Anthropic and Kadrey v. Meta over book-training fair use, the operative assumption was that training copies enable non-infringing outputs.
If users can extract long protected expressions with minimal effort, the premise that output doesn’t substitute for the market collapses.
It’s similar to Google Books—“we store the full text but only expose snippets”—where the adequacy of safeguards itself becomes the contested issue.

The paper makes strong claims but remains a preprint.
Due to cost constraints, aligned baselines were primarily measured on GPT-4o.
The setup includes fine-tuning, 100-sample generation, and bmc@5 scoring—all designed to maximize extraction.
This isn’t the same as what happens in a normal chat window.

Still, the argument is practical enough.
Without pasting copyrighted text into prompts, semantic descriptions alone yield long verbatim spans.
Multiple companies’ models memorize the same book regions similarly.
Fine-tuning API providers will need to treat post-tuning reproduction testing as a distinct concern.

If Books Are in the Training Data, Of Course They Come Out

Reading these results, “obviously” arrives before surprise.

Bartz v. Anthropic put Anthropic’s downloading of books from LibGen for Claude’s training squarely at issue, and the case ended in settlement.
Separately, reporting revealed AI companies mass-purchasing physical books, scanning them, and discarding originals.
That books are in training data wholesale is now public fact.

Transformer training is next-token prediction.
The model learns what follows any given text across the entire corpus and compresses that probability distribution into weights.
If the same text appears frequently enough in the training corpus, parameters optimize toward reproducing that sequence.
This isn’t a side effect—it’s the training objective itself.
Predicting the next token correctly in succession simply is reproducing the original text.
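
In code, this objective is nothing more than cross-entropy on the shifted token sequence; a minimal PyTorch sketch of the standard causal language-modeling loss:

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Standard causal LM objective: score position t against the token at t+1.
    logits: [batch, seq_len, vocab_size], tokens: [batch, seq_len]."""
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))  # predictions for positions 0..T-2
    target = tokens[:, 1:].reshape(-1)                     # ground-truth next tokens 1..T-1
    return F.cross_entropy(pred, target)

# Minimizing this loss on a corpus where a book recurs verbatim pushes the weights
# toward assigning high probability to each successive token of that book,
# i.e. toward being able to reproduce it.
```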

Post-RLHF models don’t output this in normal use because they’ve been trained to refuse.
The book text remains intact in the weights.
Safety layers are post-hoc output control, nothing more.
What the paper shows is that fine-tuning disrupts that control layer’s balance, letting underlying memories surface cleanly.

The finding that three companies’ models memorized the same passages makes architectural sense.
If training corpora share provenance (Books3, LibGen, or their redistributions), the same text appears at similar frequencies.
Optimizing the same objective on the same input produces similar memorization patterns.
The authors’ finding that most extracted spans don’t appear in Common Crawl is consistent: if the data came from book datasets rather than web crawls, text absent from the web can still live in the weights.

What this paper shows is that pretraining-era memories surface when safety layers are bypassed.
The root issue is training data composition.

Reproduction vs. Hallucination

This paper treats “faithful reproduction of book text” as the problem.
But the opposite direction is worth considering.
When fine-tuning degrades the safety layer, does the model also produce text that closely resembles the originals but is in fact confidently fabricated?

Next-token prediction doesn’t distinguish between “accurately remembered” and “confidently guessed.”
There’s no internal flag for that.
A model optimized for “expand summaries into prose” will reproduce verbatim where memory is strong and generate plausible text from style and context where memory is fuzzy.
Both arrive with identical confidence from the receiver’s perspective.

The paper measures reproduction via longest common substring, so matching spans are detected.
But what the non-matching spans contained isn’t reported.
How much “hallucinated quotation” (text nearly identical to the original but differing by a few words) was mixed in remains unknown.
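
The longest-common-substring measurement is easy to reproduce with difflib; the sketch below also reports a fuzzy similarity ratio, which is roughly the quantity the paper’s metric leaves out.

```python
from difflib import SequenceMatcher

def longest_common_span(output: str, original: str) -> tuple[str, float]:
    """Return the longest word-level span shared verbatim by output and original,
    plus an overall similarity ratio that also credits near-misses."""
    out_words, orig_words = output.split(), original.split()
    sm = SequenceMatcher(None, out_words, orig_words, autojunk=False)
    m = sm.find_longest_match(0, len(out_words), 0, len(orig_words))
    span = " ".join(out_words[m.a:m.a + m.size])
    return span, sm.ratio()

# A long span with ratio ~1.0 is verbatim reproduction; a short span with a still-high
# ratio points at the "hallucinated quotation" case: nearly the original, off by a few words.
```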

Practically, this is a different kind of problem from verbatim reproduction.
When investigating copyright infringement by comparing model output against originals, finding “only this much matched” has different implications depending on whether non-matching portions are random text or close paraphrases.
The latter touches on “adaptation” in copyright law.

For users employing fine-tuned models as writing aids, the same issue applies.
There’s no way to determine whether output is exact reproduction from a real book or confidently generated nonexistent text.
How much fuzzy reproduction—undetected by matching metrics—surrounds the “increased verbatim reproduction” phenomenon the paper demonstrated remains unmeasured.

Anthropic’s Project Panama

Court filings from Bartz v. Anthropic and Washington Post reporting from January 2026 revealed much of Anthropic’s training data procurement.

“Project Panama,” launched in early 2024, was described in internal emails as “an attempt to destructively scan every book in the world.”
The process: bulk-purchase used books from wholesalers, cut off spines with industrial cutters, feed pages through scanners for digitization.
Originals discarded after scanning.
The investment was reportedly in the millions of dollars.

The preceding phase was more direct.
Anthropic’s co-founders downloaded over 5 million books from LibGen and over 2 million from Pirate Library Mirror.
Court documents show CEO Dario Amodei stated internally that pirated sources were chosen early on to “avoid legal, practical, and business hassles.”
The shift to physical book purchase and scanning came when legal risk became unavoidable.

Judge Alsup’s June 2025 ruling cut both ways.
Destructive scanning of purchased books: fair use, “essentially transformative.”
Pirated copies: not fair use, “all factors weigh against.”
Anthropic settled for $1.5 billion covering approximately 500,000 books, roughly $3,000 per book.

Over 5 million pirated books plus industrially scanned volumes.
With that volume in Claude’s training corpus, it’s more natural to assume book text is compressed directly into the weights.
Other LLM companies likely ingested comparable or larger book datasets, which would explain why three companies’ models remembered the same passages from the same books.

Carlini et al.’s Training Data Extraction Research

While memory leakage via fine-tuning is the Whack-a-Mole paper’s novelty, research on extracting training data from language models predates it substantially.

Nicholas Carlini et al. published extraction experiments targeting GPT-2 in 2020 (arXiv:2012.07805).
Simply providing specific prefixes and generating continuations yielded hundreds of verbatim training data items: personal information, IRC logs, source code, UUIDs.
GPT-2 is small by current standards, but the proof-of-concept—that training data persists in weights and can be extracted—was impactful.

The 2023 follow-up (arXiv:2311.17035) operated at much larger scale.
Beyond extracting gigabytes of training data from open models like Pythia, GPT-Neo, LLaMA, and Falcon, the team recovered over ten thousand verbatim training examples from ChatGPT.
Against ChatGPT, they used a “divergence attack” that forced the chatbot to deviate from its response patterns, causing it to emit training data at 150× the normal rate.

Carlini et al.’s recurring finding: alignment does not erase memorization.
RLHF and system prompts are output control layers; weight contents remain untouched.
Given a bypass path, training data comes out.

The difference from Whack-a-Mole is the attack surface.
Carlini et al.’s approach uses adversarial prompts—researchers intentionally breaking the model.
Whack-a-Mole used a fine-tuning feature provided normally via API.
“Expand a plot summary into prose” is a mundane writing-assistance task that non-malicious users routinely perform.
A pathway that leaks copyrighted text without adversarial intent is arguably more troublesome for API providers.

An Even Cheaper Pathway Than Fine-Tuning

Adversarial prompts from Carlini et al. Fine-tuning from Whack-a-Mole.
Two extraction pathways so far, but there’s a third, cheaper one.

When I recently ran Qwen-Scope’s SAE on an M1 Max, I documented how sparse autoencoders separate intermediate-layer features and how a forward hook at inference time can subtract specific directions from the residual stream to bypass safety layers.
No gradient computation or training data needed, and removing the hook instantly restores the model.

That experiment isolated Japanese-language feature fid 23991 from Qwen3-8B’s layer 17 at 98.2% accuracy.
The same statistical approach could identify feature IDs responsible for “refusal” or “copyright protection” and zero them at inference, selectively disabling RLHF safety layers.
Whack-a-Mole showed “fine-tuning degrades safety layers and copyrighted text emerges,” but SAE intervention is more surgical.
Without touching weights at all—just rewriting activation vectors mid-inference—a verbatim reproduction pathway might open.
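
A minimal sketch of what such an intervention looks like in PyTorch, assuming an open-weight model loaded via transformers and an SAE whose decoder columns give feature directions; the layer index, the feature id, and the very existence of a “refusal” or “copyright” feature are assumptions, not demonstrated facts.

```python
import torch

def make_ablation_hook(direction: torch.Tensor):
    """Forward hook that removes the residual-stream component along one SAE
    decoder direction (e.g., a hypothetical 'refusal' feature)."""
    direction = direction / direction.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output  # [batch, seq, d_model]
        d = direction.to(hidden)                     # match dtype/device of the activations
        proj = (hidden @ d).unsqueeze(-1) * d        # component along the feature direction
        hidden = hidden - proj                       # subtract it out
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Usage sketch (model loading omitted; layer 17 echoes the Qwen-Scope experiment above,
# but the sae object and the choice of feature_id are hypothetical):
# direction = sae.decoder.weight[:, feature_id]     # one column = one feature's direction
# handle = model.model.layers[17].register_forward_hook(make_ablation_hook(direction))
# ... generate ...
# handle.remove()                                   # removing the hook restores the model
```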

The cost gap is significant.
Whack-a-Mole used OpenAI’s fine-tuning API, requiring API access and hundreds of training samples.
SAE intervention needs only open-weight models and published SAE weights, running locally.
Qwen-Scope’s SAE weights are on Hugging Face; they ran without issue on an M1 Max Mac with 64GB of RAM.

Whether feature IDs corresponding to copyright text reproduction actually exist, and whether zeroing them yields long verbatim spans, remains untested.
But safety layers are output control, and bypass methods aren’t limited to fine-tuning.