Anthropic bought, scanned and then destroyed millions of books
Anthropic bought, scanned and then destroyed millions of physical books to train its Claude artificial intelligence models, Ars Technica reports. The company cut the pages out of the bindings of printed books so they could be scanned into digital copies.

Evidence of this practice is contained in a ruling by U.S. District Judge William Alsup, who ruled that AI companies do not need permission from copyright owners to train their large language models on books that were purchased legally.
Anthropic was not the first company to use such practices. What is remarkable, however, is the scale of Anthropic’s scanning of printed books.
The legality of Anthropic’s actions rests on the first sale doctrine. This legal concept lets the buyer of a lawfully purchased copy dispose of that copy as he or she sees fit. The first sale doctrine is what allows a secondary market to exist. Without it, book publishers could demand a share of every resale or prohibit resale altogether.
In February 2024, Anthropic hired Tom Turvey, a former partner program manager for the Google Books digitization project, to get its hands on “all the books in the world” without facing “legal, practical and business red tape,” as the AI company’s CEO Dario Amodei put it in the lawsuit. Turvey came up with a workaround: purchasing printed versions of the books so that Anthropic would be protected by the first sale doctrine.
Cutting pages out of the books made scanning cheaper and easier. Because Anthropic only used the scanned books internally, and the originals were eventually destroyed, the judge deemed the process to be akin to “saving space.”
However, Anthropic did not rely on purchased books alone: before the book-buying operation began, it had also used millions of pirated books to train AI.
AI models trained on well-edited books and articles tend to give more coherent and accurate answers than those trained on low-quality texts, such as random YouTube comments.
Anthropic has spent “many millions of dollars” on its book-buying and scanning operation, often purchasing used copies in bulk.