Authors accused NVIDIA of copyright infringement in an expanded class-action lawsuit filed recently, alleging the company obtained millions of pirated books from Anna’s Archive for AI training. The complaint cites internal documents indicating NVIDIA sought high-speed access to the shadow library’s data.

NVIDIA, a chip manufacturer, has seen its revenue grow with the artificial intelligence boom, driven by demand for its AI chips and data center services. The company also develops its own AI models, including NeMo, Retro-48B, InstructRetro, and Megatron, which are trained using its hardware and extensive text libraries.

This legal challenge follows previous lawsuits in which authors accused tech companies of training AI models on pirated books. In early 2024, authors sued NVIDIA, alleging its AI models were trained on the Books3 dataset, which included copyrighted works taken from the Bibliotik site without permission. NVIDIA defended its actions as fair use, arguing that books amount to statistical correlations as far as its AI models are concerned. During discovery, plaintiffs uncovered additional evidence.

On Friday, authors filed an amended complaint that broadened the lawsuit. This update included more books, authors, and AI models, alongside new “shadow library” claims. Authors, including Abdi Nazemian, cited internal NVIDIA emails and documents, asserting the company willingly downloaded millions of copyrighted books. The complaint alleges “competitive pressures drove NVIDIA to piracy,” involving what is described as collaboration with Anna’s Archive.

According to the amended complaint, an NVIDIA data strategy team member contacted Anna’s Archive to assess data offerings. The complaint details the interaction: “Desperate for books, NVIDIA contacted Anna’s Archive—the largest and most brazen of the remaining shadow libraries—about acquiring its millions of pirated materials and ‘including Anna’s Archive in pre-training data for our LLMs’.” Anna’s Archive charged tens of thousands of dollars for “high-speed access” to its pirated collections; NVIDIA investigated the specifics of this access.

The complaint states that Anna’s Archive informed NVIDIA of the illegal nature of its library. The pirate library then asked NVIDIA executives whether they had internal permission to proceed. Permission was allegedly granted within one week, after which Anna’s Archive provided access to its pirated books. “Within a week of contacting Anna’s Archive, and days after being warned by Anna’s Archive of the illegal nature of their collections, NVIDIA management gave ‘the green light’ to proceed with the piracy. Anna’s Archive offered NVIDIA millions of pirated copyrighted books,” the complaint states.

Anna’s Archive promised NVIDIA access to approximately 500 terabytes of data, containing millions of books typically available through the Internet Archive’s digital lending system, which itself has faced legal scrutiny. The complaint does not specify whether NVIDIA paid Anna’s Archive for this access. In addition to the Books3 dataset, the complaint alleges NVIDIA downloaded books from LibGen, Sci-Hub, and Z-Library.

Authors also allege NVIDIA distributed scripts and tools enabling corporate customers to automatically download “The Pile,” which contains the pirated Books3 dataset. These allegations add claims of vicarious and contributory infringement, alleging that NVIDIA generated revenue from customers by facilitating access to these datasets. The authors seek damages on behalf of the named authors and potentially hundreds of others in the class. This is the first public disclosure of correspondence between a major U.S. tech company and Anna’s Archive, potentially increasing the pirate library’s visibility following recent domain name losses.

A copy of the first consolidated and amended complaint, filed at the U.S. District Court for the Northern District of California, is available in PDF format. Named authors include Abdi Nazemian, Brian Keene, Stewart O’Nan, Andre Dubus III, and Susan Orlean.