Unveiled: Meet the Writers Fueling Generative AI with Pirated Books

Generative AI has drawn concern because of the secrecy surrounding its development. Companies like Meta and OpenAI are using vast amounts of written material to create systems like ChatGPT, which can generate humanlike responses. However, the exact texts these programs are trained on remain largely unknown. A recent lawsuit by authors Sarah Silverman, Richard Kadrey, and Christopher Golden against Meta alleges copyright infringement, claiming that their books were used to train LLaMA, a language model similar to OpenAI’s GPT-4. While the lawsuit does not provide specific details, I have obtained and analyzed a dataset used by Meta to train LLaMA, and it confirms the authors’ claims that pirated books are being used to train generative AI systems.

The dataset, known as “Books3,” consists of over 170,000 books published in the past 20 years. It includes works by popular authors like Michael Pollan, Rebecca Solnit, Jon Krakauer, James Patterson, Stephen King, George Saunders, Zadie Smith, and Junot Díaz. Not only was Books3 used to train LLaMA, but it was also utilized by other generative AI programs such as BloombergGPT and GPT-J. The vast collection of books in Books3 sheds light on the significant impact that pirated books are having on the development of AI.

As a writer and computer programmer, I wanted to understand what kinds of books are used to train generative AI systems. Following online discussions among AI developers, I found “the Pile,” a large cache of training text created by EleutherAI. It contains the Books3 dataset alongside other sources such as YouTube video subtitles, European Parliament documents, Enron Corporation emails, and more. The Pile is valuable to researchers for its sheer quantity of text rather than for any specific subject matter, which fits the way generative AI is trained.

Analyzing the content of Books3 was challenging due to its massive size. Nonetheless, I developed programs to manage it. By isolating the lines labeled as “Books3,” I extracted the dataset. Although no explicit labels with titles or metadata were present, I utilized ISBN extraction techniques to retrieve author, title, and publishing information. Through this process, I identified over 170,000 unique books, providing me with a glimpse into the extensive range of fiction and nonfiction titles from both renowned and lesser-known publishers. The collection includes works by Elena Ferrante, Rachel Cusk, Haruki Murakami, Jennifer Egan, Jonathan Franzen, bell hooks, David Grann, Margaret Atwood, L. Ron Hubbard, John F. MacArthur, and Erich von Däniken.
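The article does not describe the programs used for this step, so the following Python is only a rough sketch of the general idea, not the author's actual method. It assumes the Pile is read as JSON Lines, with each record holding a "text" field and a "meta" field whose "pile_set_name" labels the source. The function names are hypothetical, and the lookup of author and title from each ISBN is left out because it would depend on an external catalog.

```python
import json
import re

# Candidate ISBNs: 10 or 13 digits, optionally separated by hyphens or spaces.
ISBN_CANDIDATE = re.compile(r"\b(?:97[89][- ]?)?(?:\d[- ]?){9}[\dXx]\b")


def is_valid_isbn(digits):
    """Check the ISBN-13 or ISBN-10 check digit of a separator-free string."""
    if len(digits) == 13 and digits.isdigit():
        # ISBN-13: weights alternate 1, 3; the total must be divisible by 10.
        total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
        return total % 10 == 0
    if len(digits) == 10 and digits[:9].isdigit() and (digits[9].isdigit() or digits[9] == "X"):
        # ISBN-10: weights run 10 down to 1; a final "X" stands for 10.
        total = sum((10 - i) * (10 if d == "X" else int(d)) for i, d in enumerate(digits))
        return total % 11 == 0
    return False


def find_isbns(text):
    """Return the valid ISBNs found in a text, in order, without repeats."""
    found = []
    for match in ISBN_CANDIDATE.finditer(text):
        digits = re.sub(r"[- ]", "", match.group()).upper()
        if is_valid_isbn(digits) and digits not in found:
            found.append(digits)
    return found


def books3_isbns(lines, set_name="Books3"):
    """Yield the ISBN list of every record labeled with set_name.

    Assumption: each line is a JSON object shaped like
    {"text": "...", "meta": {"pile_set_name": "Books3"}}.
    """
    for line in lines:
        record = json.loads(line)
        if record.get("meta", {}).get("pile_set_name") == set_name:
            yield find_isbns(record.get("text", ""))
```

Validating the check digit filters out most digit runs that merely look like ISBNs, such as phone numbers. A book whose text contains no ISBN at all would still need to be identified some other way.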

While Books3 has become popular among AI developers, its use remained relatively unknown to the wider community until recently. Hugging Face hosted Books3 for over two years until its removal, seemingly coinciding with the lawsuits against OpenAI and Meta. Other datasets similar to Books3 are believed to be used secretly by companies like OpenAI, raising concerns about monopolization of generative AI by wealthy corporations. Despite the controversy, individuals like Shawn Presser, the independent developer behind Books3, see it as a necessary resource for democratizing AI development.

Presser acknowledges the concerns raised by authors about copyright infringement but highlights the potential danger of a monopoly on generative AI by big corporations like OpenAI. He created Books3 to provide developers with access to high-quality training data, promoting a more inclusive AI landscape. Presser sourced the content for Books3 from Bibliotik, a library known for hosting pirated books, and converted them to plain text format. While some copyright-management information may be missing from the dataset, the intention was to empower developers rather than harm authors.

In conclusion, the use of pirated books to train generative AI systems has become a significant concern in the AI community. The Books3 dataset, along with similar collections, has played a crucial role in advancing AI capabilities, but it also raises ethical questions about intellectual property rights. As the AI field continues to evolve, it is essential to strike a balance between innovation and respecting the rights of authors and creators.


Denial of responsibility! VigourTimes is an automatic aggregator of Global media. In each content, the hyperlink to the primary source is specified. All trademarks belong to their rightful owners, and all materials to their authors. For any complaint, please reach us at – [email protected]. We will take necessary action within 24 hours.