Potential Self-Poisoning Risk Posed by Chatbots

In the early days, chatbots and similar AI models relied on the vast amount of data available on the internet. These models, like ChatGPT, learned by consuming text, images, and other content from sources such as Wikipedia, Getty, and Scribd. They analyzed that material to pick up its patterns, styles, and textures, and used that knowledge to create their own writing and art. Consuming all of it, however, only increased their appetite for more.

Generative AI depends on continuous access to web data. These programs process enormous amounts of it to mimic intelligence and identify patterns. For example, ChatGPT can write a decent high-school essay because it has read countless digital books and articles, while DALL-E 2 can generate Picasso-esque images because it has studied the entire history of art. The more data they train on, the smarter they become.

Eventually, these AI programs will have absorbed almost all human-made digital content. They are already contributing machine-generated content of their own to the web, and that material will continue to proliferate across TikTok, Instagram, media websites, online stores, and even academic experiments. To advance AI further, Big Tech may have little choice but to feed its programs AI-generated content. Researchers warn, however, that this change in diet could have disastrous consequences for both the models and the internet itself.

The problem with using AI output to train future AI is clear. Despite impressive advancements, chatbots and other generative tools still produce outputs riddled with biases, falsehoods, and absurdities. When those flawed outputs are fed back in as training data, the mistakes carry over to future iterations of the programs, producing a phenomenon known as model collapse: over successive generations, models essentially “forget” what the original data looked like. Recursive training amplifies the errors, making the AI more biased and less functional with each cycle.

Generative AI outputs reflect the data the models are trained on and tend to favor whatever is most probable. As a result, events or concepts that are rare or underrepresented may not appear in a model’s outputs at all, or may appear only in deeply flawed form. Training new models on previous AI outputs compounds these errors and biases, leaving each generation with an ever more confident but ever less accurate picture of what is probable.
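To make the mechanism concrete, here is a deliberately simplified toy simulation (Python with NumPy, a single Gaussian standing in for a whole model; not any published experiment): each “generation” is fit only on samples drawn from the previous generation’s fit, and the estimated spread of the data, and with it the rare tail values, tends to wither away.

```python
# Toy illustration of model collapse: each "generation" fits a Gaussian to
# samples drawn from the previous generation's fit. With small samples, the
# estimated spread tends to shrink, so rare tail values stop being generated.
# A simplified sketch only -- real generative models are vastly more complex.
import numpy as np

rng = np.random.default_rng(0)

n_samples = 50          # data available to each generation
n_generations = 200

# Generation 0 learns from "real" data drawn from the true distribution.
data = rng.normal(loc=0.0, scale=1.0, size=n_samples)

for gen in range(1, n_generations + 1):
    mu_hat, sigma_hat = data.mean(), data.std()      # "train" on current data
    data = rng.normal(mu_hat, sigma_hat, n_samples)  # next generation's diet
    if gen % 50 == 0:
        print(f"generation {gen:3d}: mean = {mu_hat:+.2f}, std = {sigma_hat:.2f}")
```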

In a study of model collapse, researchers found that as AI programs were trained on the outputs of previous generations, their performance degraded and eventually broke down. For example, a model meant to distinguish between two groups eventually couldn’t differentiate them at all. The study showcased the dangerous effects of recursive cannibalism among AI models.
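That two-group failure can be mimicked with an off-the-shelf mixture model. The sketch below is an assumption-laden toy (Python with scikit-learn, made-up Gaussian clusters and sample sizes), not the study’s actual code; it simply reproduces the recursive loop and prints the distance between the two fitted group means so you can watch how it drifts.

```python
# Toy version of the "two groups" experiment: a two-component Gaussian
# mixture is refit, generation after generation, only on its own samples.
# Watch how the distance between the fitted group means drifts over time.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Generation 0 sees "real" data: two overlapping groups.
data = np.concatenate([rng.normal(-1.5, 1.0, 100),
                       rng.normal(+1.5, 1.0, 100)]).reshape(-1, 1)

for gen in range(0, 101):
    gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
    if gen % 20 == 0:
        separation = abs(gmm.means_[0, 0] - gmm.means_[1, 0])
        print(f"generation {gen:3d}: distance between group means = {separation:.2f}")
    data, _ = gmm.sample(200)   # the next generation trains only on this
```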

The study also revealed that language models can converge to output nonsensical sequences if they fail to model the distribution of all possible words accurately. This means that over several generations, AI models may only produce meaningless averages, much like the degradation of quality when photocopying a photocopy.
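A miniature language-model analogue makes the photocopy effect easy to watch. The sketch below is purely illustrative (a word-level bigram model over an invented toy corpus, nothing like a real chatbot): each generation is trained only on text produced by the previous one, and the number of distinct words it can still produce tends to dwindle.

```python
# "Photocopy of a photocopy": a word-level bigram model is retrained, in each
# generation, only on text generated by the previous generation.
import random
from collections import defaultdict

random.seed(0)

corpus = (
    "the cat sat on the mat while the dog slept near the door and "
    "the bird sang a quiet song about the sun and the sea"
).split()

def train_bigram(words):
    """Record, for every word, which words were seen following it."""
    followers = defaultdict(list)
    for a, b in zip(words, words[1:]):
        followers[a].append(b)
    return followers

def generate(model, length=60):
    """Sample a sequence by repeatedly picking a random observed follower."""
    word = random.choice(list(model))
    out = [word]
    for _ in range(length - 1):
        nxt = model.get(word)
        word = random.choice(nxt) if nxt else random.choice(list(model))
        out.append(word)
    return out

text = corpus
for gen in range(10):
    model = train_bigram(text)
    text = generate(model)   # next generation's "training data"
    print(f"generation {gen}: {len(set(text))} distinct words -> {' '.join(text[:10])} ...")
```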

However, the risk of model collapse doesn’t render AI technology useless. Synthetic data could be used to address privacy and copyright concerns. For example, medical applications could utilize synthetic data to bypass privacy issues associated with using real patient information. Additionally, limited training material in certain areas could be augmented using machine-learning programs to generate permutations of the available data.
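As a crude illustration of the augmentation idea (Python with NumPy; the records, noise scale, and column meanings below are all invented, not drawn from any real medical dataset), a small table of numeric measurements can be expanded by adding lightly jittered copies of each row:

```python
# Simple data-augmentation sketch: expand a tiny numeric dataset by adding
# jittered copies of the real records. Illustrative only; real synthetic-data
# pipelines, especially for medical records, are far more careful.
import numpy as np

rng = np.random.default_rng(42)

# A tiny "real" dataset: rows are patients, columns are measurements.
real = np.array([
    [63.0, 120.0, 1.2],
    [58.5, 135.0, 0.9],
    [71.2, 110.0, 1.5],
])

def augment(data, copies=10, noise_scale=0.02):
    """Create jittered copies: each value is perturbed by a small fraction
    of that column's standard deviation."""
    col_std = data.std(axis=0)
    synthetic = []
    for _ in range(copies):
        noise = rng.normal(0.0, noise_scale, size=data.shape) * col_std
        synthetic.append(data + noise)
    return np.vstack(synthetic)

augmented = np.vstack([real, augment(real)])
print(real.shape, "->", augmented.shape)   # (3, 3) -> (33, 3)
```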

To prevent model collapse, curating training datasets becomes crucial. Filtering data to ensure high quality and representativeness can greatly impact the performance of AI models. It’s important to recognize that both human-generated and AI-generated data can have biases and misalignments with reality. Researchers can curate AI-generated data to counterbalance biases and improve the quality of models.
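One very naive version of such a filter, sketched below under the assumption that we know which records are trusted and human-collected (everything here, including the distributions, is made up for illustration), keeps a machine-generated record only if every one of its values falls inside the range observed in the trusted data:

```python
# Minimal curation sketch: accept synthetic records only if every feature
# lies inside the range observed in trusted, human-collected data.
# This is one naive filter among many possible curation strategies.
import numpy as np

rng = np.random.default_rng(7)

human_data = rng.normal(0.0, 1.0, size=(500, 4))   # trusted reference set
synthetic = rng.normal(0.0, 1.6, size=(200, 4))    # model-generated candidates

lo = human_data.min(axis=0)
hi = human_data.max(axis=0)

# Keep only candidates whose features all fall inside the observed range.
in_range = np.all((synthetic >= lo) & (synthetic <= hi), axis=1)
curated = synthetic[in_range]

print(f"kept {len(curated)} of {len(synthetic)} synthetic records")
training_set = np.vstack([human_data, curated])
```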

While a model that has dramatically collapsed is unlikely ever to be released as a product, smaller biases and misperceptions can compound over time, especially as machine-generated content becomes indistinguishable from human creations. The danger lies in subtle flaws that current evaluation processes may not capture, allowing biases and other issues to accumulate quietly in AI systems. However, by carefully controlling how training data is generated, researchers can potentially use AI-generated data to counteract biases and improve system fairness.
