Inside Books3: The Pirated Books Powering Meta's Generative AI

**Editor’s note: This article is a part of The Atlantic’s series on Books3. You can do a database search here, and read about its origins here.**

Over the summer, I reported on a collection of more than 191,000 books, known as "Books3," that were used without permission by Meta, Bloomberg, and other companies to train generative-AI systems. The collection consists of pirated ebooks: travel guides, self-published erotic fiction, novels by Stephen King and Margaret Atwood, and much more. It has become the subject of copyright-infringement lawsuits against Meta, in which writers claim that the use of their work to train these systems violates their rights.

Books play a critical role in training generative-AI systems because they provide long, thematically consistent passages that teach the systems how to construct similar ones, which is crucial to creating the illusion of intelligence. As a result, tech companies have often used large datasets of books without obtaining permission or licenses. Meta's lawyers recently argued in court that neither the outputs of its generative AI nor the model itself is "substantially similar" to existing books.

In the training process, generative-AI systems build a giant map of English words, in which the distance between two words reflects how often they appear near each other in the training text. The finished system, called a large language model, produces more plausible responses on subjects that appear frequently in its training text. That is why it matters what training data these models are built on, and why the lack of transparency surrounding that data is concerning.
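The "map of words" idea can be illustrated with a toy sketch. This is not how a large language model is actually trained (real systems learn dense embeddings over billions of words); it is a minimal co-occurrence example, with an invented four-sentence corpus, showing how words that appear in similar contexts end up closer together than words that do not.

```python
from collections import Counter
from math import sqrt

# Toy corpus (invented for illustration); real training data spans billions of words.
corpus = (
    "the cat sat on the mat . the dog sat on the rug . "
    "the cat chased the dog . the dog chased the cat"
).split()

WINDOW = 2  # how many neighboring words count as "near"
vocab = sorted(set(corpus))

# Count how often each word appears near each other word.
counts = {w: Counter() for w in vocab}
for i, w in enumerate(corpus):
    for j in range(max(0, i - WINDOW), min(len(corpus), i + WINDOW + 1)):
        if j != i:
            counts[w][corpus[j]] += 1

def vector(word):
    """A word's position on the 'map': its co-occurrence counts with every vocab word."""
    return [counts[word][c] for c in vocab]

def similarity(a, b):
    """Cosine similarity: higher means the two words sit closer on the map."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# "cat" and "dog" occur in similar contexts, so they land nearer to each
# other than "cat" does to a function word like "on".
print(similarity(vector("cat"), vector("dog")))
print(similarity(vector("cat"), vector("on")))
```

The same principle, scaled up to hundreds of dimensions and learned rather than counted, is what makes the training corpus so consequential: the model's sense of which ideas belong together comes directly from what it has read.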

Here are some of the most prominent authors in the Books3 dataset, along with the approximate number of entries each contributed:

* Stephen King – 3,578 entries
* Dean Koontz – 3,240 entries
* Nora Roberts – 2,877 entries
* Clive Cussler – 2,719 entries
* Danielle Steel – 2,573 entries
* Tom Clancy – 2,511 entries
* James Patterson – 2,500 entries
* Janet Evanovich – 2,491 entries
* John Grisham – 2,481 entries
* J.D. Robb – 2,386 entries

Although fiction writers dominate the list of most prominent authors (of the top 25, Betty Crocker is the only nonfiction exception), about two-thirds of the dataset is nonfiction. It includes thousands of technical manuals, more than 1,500 books from Christian publishers (including 175 Bibles and Bible commentaries), more than 400 books related to Dungeons & Dragons and Magic: The Gathering, and 46 titles by Charles Bukowski. The collection covers a wide range of subjects, but it skews heavily toward the interests and perspectives of the English-speaking Western world.

Bias in AI systems has been extensively discussed. For instance, an AI-based face-recognition program that is trained predominantly on images of people with lighter skin may perform less accurately on images of people with darker skin, leading to potentially disastrous consequences. Books3 sheds light on the problem from a different angle. It raises questions about what combination of books would be unbiased and what would constitute an equitable distribution of subjects related to Christianity, Islam, Buddhism, and Judaism. Are extremist views balanced with moderate perspectives? Additionally, what is the proper ratio of American history to Chinese history, and which perspectives should be included within each? The problem of perspective becomes crucial and intractable when knowledge is organized and filtered by algorithms rather than human judgment.

Books3 is an extensive dataset, and here are just a few examples of the authors, books, and publishers within it. These samples are not comprehensive, but they offer a glimpse of the kinds of writing used to train generative AI. Note that the book counts may include multiple editions.

As AI chatbots begin to replace traditional search engines, the tech industry's power to control our access to information, and to shape our perspective, grows. The internet democratized access to information by eliminating the need for libraries or expert consultations; the AI-chatbot model reintroduces gatekeeping, with an opaque and unaccountable gatekeeper. Worse, that gatekeeper is prone to "hallucinations" and may or may not cite its sources.

In a recent court filing seeking to dismiss the lawsuit brought by the authors Richard Kadrey, Sarah Silverman, and Christopher Golden, Meta claimed that "Books3 comprises an astonishingly small portion of the total text used to train LLaMA [Meta's AI]." Although technically accurate (I estimate that Books3 makes up approximately 3 percent of LLaMA's training text), the statement dodges a core concern: if LLaMA can summarize Silverman's book, it likely relies heavily on the text of that book to do so. Because these systems are so opaque, it is difficult to determine how much any given source contributes to their output.

To gain insight into the type of information and opinions AI chatbots may provide, we must examine their training data. Books3 provides a starting point, but it represents only a fraction of the vast training data universe that remains largely hidden from view.

Denial of responsibility! Vigour Times is an automatic aggregator of Global media. In each content, the hyperlink to the primary source is specified. All trademarks belong to their rightful owners, and all materials to their authors. For any complaint, please reach us at – [email protected]. We will take necessary action within 24 hours.
