Sarah Silverman and two other US authors claim OpenAI and Meta illegally used their books to train their AI models.
In two separate class action lawsuits filed on Friday in a California district court, Silverman, along with bestselling authors Christopher Golden and Richard Kadrey, said that they “did not consent to the use of their copyrighted books as training material” for the companies’ AI models.
In the lawsuit against OpenAI, the trio’s lawyers presented exhibits showing that when prompted, ChatGPT will generate summaries of their works, “something only possible if ChatGPT was trained on Plaintiffs’ copyrighted works”.
The lawsuit against Meta alleges that the authors’ books were accessible in datasets Meta used to train its LLaMA (Large Language Model Meta AI) open-source AI models, which the company introduced in February.
The class action firm representing Silverman – Joseph Saveri Law Firm, LLP, which has offices in California and New York – filed a similar suit against OpenAI on behalf of authors Paul Tremblay (“The Cabin at the End of the World”) and Mona Awad (“Bunny”) on 28 June.
Why books are the ideal training ground for AI language models
For large AI language models to learn effectively, they need to be trained on massive amounts of well-written text – and books are a natural source of it.
ChatGPT’s developers said they train the language model on a dataset called BookCorpus, which “contains over 7,000 unique unpublished books from a variety of genres including Adventure, Fantasy, and Romance.”
“Crucially, it contains long stretches of contiguous text, which allows the generative model to learn to condition on long-range information,” OpenAI wrote in a report titled “Improving Language Understanding by Generative Pre-Training”.
Hundreds of large language models have been trained on BookCorpus, including those made by OpenAI, Google and Amazon.
The controversy, however, revolves around another dataset used to train the models: in a 2020 paper, OpenAI said that 15 percent of its GPT-3 training dataset came from “two internet-based books corpora” that the company called “Books1” and “Books2”.
The company has never revealed what books are included in “Books1” and “Books2”.
In their complaint, Silverman’s lawyers said that based on numbers given in OpenAI’s paper on GPT-3, “Books1” is about nine times bigger than BookCorpus, while “Books2” is 42 times bigger. This would mean the two datasets together contain more than 350,000 books.
That leads them to believe that the models are being trained on illegal “shadow libraries” found online.
“The only ‘internet-based books corpora’ that have ever offered that much material are notorious ‘shadow library’ websites like Library Genesis (aka LibGen), Z-Library (aka Bok), Sci-Hub, and Bibliotik,” the lawsuit reads.
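The figure of more than 350,000 books can be checked with quick arithmetic. The sketch below is illustrative only: it assumes BookCorpus holds roughly 7,000 titles (the “over 7,000” quoted above, rounded down) and applies the multipliers cited in the complaint. The variable and function names are hypothetical and do not come from the lawsuit or from OpenAI.

```python
# Assumption: BookCorpus holds roughly 7,000 titles ("over 7,000" in OpenAI's description).
BOOKCORPUS_TITLES = 7_000

# Size multipliers relative to BookCorpus, as cited in the complaint.
BOOKS1_MULTIPLIER = 9
BOOKS2_MULTIPLIER = 42


def estimate_titles(base_titles: int, multiplier: int) -> int:
    """Scale a base title count by a size multiplier."""
    return base_titles * multiplier


books1 = estimate_titles(BOOKCORPUS_TITLES, BOOKS1_MULTIPLIER)
books2 = estimate_titles(BOOKCORPUS_TITLES, BOOKS2_MULTIPLIER)
total = books1 + books2

print(f"Books1: about {books1:,} titles")
print(f"Books2: about {books2:,} titles")
print(f"Combined: about {total:,} titles")
```

Under these assumptions, “Books1” comes to about 63,000 titles and “Books2” to about 294,000, for a combined 357,000, which is consistent with the “more than 350,000” estimate.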
Artists against AI: A new wave of lawsuits
Silverman’s lawsuit is the latest in a host of intellectual property disputes between artists and AI companies, as creatives claim the little-regulated new technology is flagrantly flouting copyright law.
In January, a group of visual artists – also represented by the Joseph Saveri Law Firm and Matthew Butterick – sued AI companies Stability AI Ltd, Midjourney Inc and DeviantArt Inc for copyright infringement.
That lawsuit argues that the companies’ software copies billions of copyrighted images to enable Midjourney and DeviantArt’s AI to generate new images in artists’ styles without consent.
Butterick said in a blog post that since an earlier lawsuit they filed in November, they had “heard from people all over the world – especially writers, artists, programmers, and other creators – who are concerned about AI systems being trained on vast amounts of copyrighted work with no consent, no credit, and no compensation.”
Getty Images also launched legal proceedings against Stability AI in the UK over Stability’s alleged copying of millions of its images.
Last year, hundreds of visual artists spoke out against the Lensa AI smartphone app, which allowed users to create digital avatars based on art scraped from online databases – much of which was copyrighted and used without consent.