
FSF Receives Notice of Settlement in Bartz v. Anthropic Copyright Lawsuit, Statement on Unauthorized Learning of GNU FDL Licensed Works


In March 2026, the Free Software Foundation (FSF) announced on its official blog that it had received the settlement notice in Bartz v. Anthropic. The lawsuit, closely watched as a test of copyright law for large language models, is a class action alleging that Anthropic downloaded copyrighted material from the Library Genesis and Pirate Library Mirror datasets without permission and used it to train Claude.

History of the lawsuit

The district court ruled that using the downloaded works for training was itself fair use under copyright law. However, the legality of downloading the copyrighted material without the copyright holders' consent remained in dispute, and the two parties ultimately agreed to a settlement.

As the copyright holder of the GNU Project's programs and of books published under free licenses, the FSF is a party to this settlement notice. Among the works at issue was "Free as in Freedom" by Richard Stallman and Sam Williams, published under the GNU Free Documentation License.

“Learning” and “acquisition” are different issues

What matters about this lawsuit is that "using the data for training" and "the method of acquiring the data" are treated as separate legal issues.

What the district court recognized as fair use was the use of downloaded copyrighted material for model training. In other words, it was recognized that “using copyrighted works for LLM learning purposes” may constitute transformative use. On the other hand, the legality of the act of acquiring copyrighted works from pirated libraries such as LibGen and Pirate Library Mirror remained an issue separate from fair use.

flowchart TD
    A["Copyrighted work exists<br/>(books, papers, manuals)"] --> B["Uploaded as pirated copies to<br/>Library Genesis /<br/>Pirate Library Mirror"]
    B --> C["Anthropic downloads<br/>the datasets"]
    C --> D["Used as training data for Claude"]
    D --> E["Training use → ruled fair use"]
    C --> F["Act of acquisition → legality disputed"]
    F --> G["Settlement agreement"]

Recognizing training as fair use makes technical sense. LLM training is not a process of "memorizing" copyrighted material but of extracting statistical patterns from text. As long as the goal is not to reproduce individual works verbatim, classifying it as transformative use is defensible as a legal principle.
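The "statistical patterns, not memorization" point can be illustrated with a deliberately tiny toy model. This bigram counter is only an analogy (real LLMs learn dense parameters by gradient descent, not explicit counts), but it shows the key property: what is stored is co-occurrence statistics, not a verbatim copy of the text.

```python
# Toy illustration: a bigram "model" retains co-occurrence counts,
# not a copy of the training text. Real LLMs are far more complex,
# but the stored-patterns-vs-stored-text distinction is analogous.
from collections import Counter

def bigram_counts(text: str) -> Counter:
    """Extract adjacent-word co-occurrence statistics from text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

corpus = "free software means users have the freedom to run the software"
model = bigram_counts(corpus)

# The "model" holds word pairs and their counts; the original word
# order is not stored, and in general the source sentence cannot be
# uniquely reconstructed from the counts alone.
print(model[("the", "software")])
```

Whether a model with billions of parameters can nonetheless regurgitate passages verbatim is exactly why reproducibility is a separate issue in cases like NYT v. OpenAI.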

The problem lies at the entrance. Library Genesis is a pirate site that uploads academic papers and books without the permission of the authors and publishers, and is currently being sued by multiple publishers. Obtaining a large number of copyrighted works that cannot be obtained through regular channels or that are expensive to obtain through such shadow libraries and using them for learning is a different question than whether the use is transformative.

What is a shadow library?

Library Genesis, Pirate Library Mirror, Sci-Hub, Z-Library, and the like are collectively called "shadow libraries": sites that publish academic papers and books for free without the copyright holders' permission, many of them claiming to be eliminating disparities in access to knowledge.

Site | Overview
Library Genesis (LibGen) | Comprehensive collection of books and papers; over 3 million books; multiple mirror sites exist
Sci-Hub | Specializes in academic papers; entering a DOI bypasses the paywall and retrieves the paper's PDF
Z-Library | Evolved from a fork of LibGen; its domains were seized (taken down, i.e. forcibly closed) by the FBI in 2022 but have since been restored
Pirate Library Mirror | One of LibGen's mirrors; the dataset named in the Bartz lawsuit

These sites themselves range from legally gray to plainly infringing. Major publishers such as Elsevier have repeatedly sued LibGen and Sci-Hub and won domain injunctions. The legal risk AI companies take on by obtaining data from these sources exists independently of any fair use determination about training.

Anthropic’s “scan and destroy” approach

During the litigation, Anthropic's data acquisition methods came under scrutiny. According to reports and leaked information, Anthropic did not use distillation (training on the outputs of other companies' models) but instead ran a pipeline that directly downloaded books, extracted the text, processed it into training data, and then discarded the originals.
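A minimal sketch of such a download, extract, process, discard pipeline. Every name and step here is hypothetical; the actual implementation has not been published, and real text extraction from PDF/EPUB files is far more involved than this stand-in.

```python
# Hypothetical sketch of a "download -> extract -> process -> discard"
# pipeline. All identifiers are illustrative, not Anthropic's code.
from dataclasses import dataclass

@dataclass
class TrainingRecord:
    source_id: str
    text: str

def extract_text(raw: bytes) -> str:
    """Stand-in for real PDF/EPUB text extraction."""
    return raw.decode("utf-8", errors="ignore")

def process(raw: bytes, source_id: str) -> TrainingRecord:
    text = extract_text(raw)
    cleaned = " ".join(text.split())  # normalize whitespace
    return TrainingRecord(source_id=source_id, text=cleaned)

def ingest(download, source_id: str) -> TrainingRecord:
    raw = download()                  # 1. download the original file
    record = process(raw, source_id)  # 2. extract and process the text
    del raw                           # 3. discard the original copy
    return record                     # only processed training data remains

record = ingest(lambda: b"  Free as in\n Freedom  ", "example:1")
```

Note that step 3 deletes only the local copy; as the article goes on to argue, it undoes neither the act of acquisition nor the use of the work as training data.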

Not retaining the originals appears intended to support the legal position that the company "does not possess copies of the copyrighted works," but this argument has its limits.

  • Illegality of the act of acquisition itself — Even if you destroy the original, the fact that you downloaded it from a pirated site will not disappear. Even if someone “read the stolen book, took notes, and returned the book,” it does not eliminate the fact that the theft occurred.
  • Use as training data remains — “Use” of the copyrighted work continues as long as it is reflected in the model parameters. Even if you delete the original file, the state in which copyrighted material information is included in the trained model will not change.
  • Scale Issues — Legal evaluations may be different for an individual acquiring a few papers from LibGen for research purposes than for a company systematically downloading tens to hundreds of thousands of volumes to train a commercial model.

The irony lies in the contrast with Anthropic's own accusation that three Chinese AI companies distilled Claude. Anthropic may have chosen to acquire the original works directly rather than distill precisely because it understood that "training on another company's model outputs" would be legally problematic. But the alternative it chose was downloading from pirated libraries, which appears to have moved the problem rather than solved it.

What is GNU FDL?

The GNU Free Documentation License (GNU FDL) is a copyleft license for documents developed by the FSF. As the documentation counterpart to the GPL for software, it freely permits use, modification, and redistribution in principle. However, because it allows authors to impose some restrictions via "invariant sections," there have been past debates over whether FDL-licensed documents conform to the FLOSS definition. One point at issue this time is that, when publishing under the GNU FDL, "permitting free use" and "permitting unauthorized incorporation into LLM training data" are two different things.

FSF’s position

The FSF said it would prioritize "protecting computing freedom" over seeking financial damages. Specifically, it demands the following four disclosures from companies developing LLMs.

Target | Contents
Training input data | All corpora used for training
Model weights | The set of learned parameters
Configuration | Hyperparameters, architecture settings, etc.
Source code | Code used for training and inference

The FSF's blog states: "If we participate in the lawsuit and find that the GNU FDL has been violated, we will seek freedom for users as compensation." As a non-profit with limited resources, it is moving toward using free software principles as a negotiating card rather than pursuing financial remedies.

This demand is unlikely to be met, but it points in an interesting direction: an attempt to extend the logic of copyleft with "if a work is used for training, the results of that training should be released under the same terms." Just as the GPL achieved "freedom that propagates" for software, this raises the question of whether the GNU FDL's conditions can extend to the products of LLM training.

Beyond Bartz v. Anthropic, each AI company's data acquisition methods raise different issues.

Litigation | Plaintiff | Defendant | Main issues
Bartz v. Anthropic | Group of copyright holders | Anthropic | Downloads from shadow libraries
NYT v. OpenAI | The New York Times | OpenAI / Microsoft | Unauthorized training on articles and their reproducibility
Merriam-Webster v. OpenAI | Merriam-Webster / Britannica | OpenAI | Structural extraction of dictionaries and encyclopedias
Getty v. Stability AI | Getty Images | Stability AI | Unauthorized training on images
Seedance copyright dispute | Hollywood studios | ByteDance | Unlicensed training on video works and IP reproduction

What they have in common is that "fair use of the training use itself" and "legality of the data acquisition method" coexist as separate legal issues. The Bartz lawsuit was one of the first in which fair use was recognized on the former point, yet the settlement turned on the latter (the acquisition method), drawing a clear line: "training is fine, but how you took the data is not." OpenAI, Anthropic, and Google are all resisting training-data disclosure demands like the FSF's as a threat to their business models.
