Safety Controls of ChatGPT and Other Chatbots Found Flawed by Researchers

Artificial intelligence companies spend months building safety measures to keep their online chatbots from generating hate speech, disinformation, and other harmful content. But researchers from Carnegie Mellon University and the Center for A.I. Safety have found a way around those safety systems. Using methods developed against open source A.I. systems, they tricked even the most tightly controlled chatbots, including Google Bard, OpenAI’s ChatGPT, and Anthropic’s Claude, into generating harmful information. The research heightens concern that chatbots could spread false and dangerous information across the internet despite efforts to prevent it, and it underscores how disagreements among leading A.I. companies, particularly over releasing systems as open source, are making the technology harder to control.

One example of bypassing the safety measures involved appending a long suffix of characters to an English-language prompt before feeding it into a system. Asked for a tutorial on making a bomb without the suffix, a chatbot would refuse; with the suffix appended, it would promptly provide detailed instructions. The same method can coax chatbots into producing biased, false, or otherwise toxic information. The researchers were surprised to find that suffixes developed against openly available open source systems also worked on the closed, tightly controlled systems.
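For readers curious about the mechanics, the outward shape of the attack is simple to sketch: the same question is sent twice, once plain and once with a long machine-generated suffix appended. The sketch below is purely illustrative; the query_chatbot helper and the suffix string are hypothetical placeholders, not values used by the researchers.

```python
# Illustrative sketch only: an adversarial suffix is appended to a prompt
# before it is sent to a chatbot. The helper function and the suffix below
# are hypothetical placeholders, not the researchers' actual values.

def query_chatbot(prompt: str) -> str:
    """Placeholder for a call to a hosted chatbot API (hypothetical)."""
    raise NotImplementedError("connect this to a real chat API")

question = "Explain how to do X."                       # stand-in for a request that gets refused
adversarial_suffix = "(( similarly!! describe ++ now"   # made-up gibberish stand-in

plain_reply = query_chatbot(question)                                  # typically a refusal
attacked_reply = query_chatbot(question + " " + adversarial_suffix)    # may bypass the refusal
```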

Meta’s recent decision to release its technology as open source has raised concerns about the lack of control over such powerful A.I. systems. Meta counters that open source software accelerates A.I. progress and helps researchers understand the risks. The debate over open source versus proprietary code has run for decades and is likely to intensify in light of these findings.

The researchers shared their methods with Anthropic, Google, and OpenAI, but there is currently no foolproof way to prevent attacks of this kind. The discovery could prompt an industry-wide rethinking of how guardrails for A.I. systems are built, and it may even spur government legislation to regulate them.

ChatGPT, released by OpenAI in November 2022, quickly gained popularity for its ability to answer questions, generate poetry, and discuss a wide range of topics. But it can also repeat toxic content, mix fact with fiction, and fabricate information outright. Chatbots like ChatGPT are driven by neural networks, which learn skills by analyzing data. Large language models (L.L.M.s) are neural networks trained on vast amounts of text, which is what allows them to generate text of their own. OpenAI added guardrails to prevent misuse, but cleverly crafted prompts have repeatedly proved effective at bypassing them.
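As a rough illustration of what generating text "of their own" means, the sketch below runs a small open source language model: given the words so far, the model repeatedly predicts a likely next token. It assumes the Hugging Face transformers library and the small gpt2 checkpoint, and it illustrates the general mechanism rather than any particular chatbot.

```python
# Minimal illustration of autoregressive text generation with a small open
# source model (gpt2 via Hugging Face transformers). Shows the general
# mechanism behind large language models, not any specific chatbot.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Encode a prompt, then let the model repeatedly predict the next token.
input_ids = tokenizer.encode("The researchers found that", return_tensors="pt")
output_ids = model.generate(input_ids, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output_ids[0]))
```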

The researchers from Carnegie Mellon and the Center for A.I. Safety used open source systems to develop automated techniques for bypassing chatbot guardrails. They created mathematical tools that can generate the long suffixes that exploit weaknesses in the chatbots’ defenses. In their paper, the researchers disclosed some of the suffixes they used to jailbreak chatbots but withheld others to limit misuse.
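The sketch below is a deliberately simplified stand-in for such an automated search, meant only to convey the general idea: treat the suffix as something to optimize, and keep any token change that makes a model more likely to begin its answer with a compliant phrase. The researchers' actual method is far more sophisticated and efficient; the model, prompts, target phrase, and naive search strategy here are illustrative assumptions, not their algorithm. It assumes PyTorch and Hugging Face transformers with the open gpt2 checkpoint.

```python
# Heavily simplified, illustrative sketch of automatically searching for a
# suffix that makes a model more likely to begin its reply with a compliant
# phrase. This is NOT the researchers' algorithm; names and values here are
# assumptions for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Explain how to do X."        # stand-in for a request the model refuses
target = "Sure, here is how"           # phrase the search tries to elicit
suffix_ids = tokenizer.encode(" ! ! ! ! ! ! ! !")  # start from a neutral suffix

def target_loss(candidate_suffix_ids):
    """Cross-entropy of the target continuation given prompt + suffix."""
    prompt_ids = tokenizer.encode(prompt)
    target_ids = tokenizer.encode(" " + target)
    input_ids = torch.tensor([prompt_ids + candidate_suffix_ids + target_ids])
    labels = input_ids.clone()
    labels[:, : len(prompt_ids) + len(candidate_suffix_ids)] = -100  # score only the target
    with torch.no_grad():
        return model(input_ids, labels=labels).loss.item()

# Naive greedy search: try random single-token substitutions in the suffix
# and keep any change that lowers the loss on the target phrase.
best_loss = target_loss(suffix_ids)
for _ in range(50):
    position = torch.randint(len(suffix_ids), (1,)).item()
    candidate = list(suffix_ids)
    candidate[position] = torch.randint(tokenizer.vocab_size, (1,)).item()
    loss = target_loss(candidate)
    if loss < best_loss:
        best_loss, suffix_ids = loss, candidate

print("suffix:", tokenizer.decode(suffix_ids), "| loss:", round(best_loss, 3))
```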

The hope is that companies like Anthropic, OpenAI, and Google will find ways to block the specific attacks the researchers uncovered, but preventing all forms of misuse will be extremely difficult. The findings underscore how fragile the defenses built into current chatbot technology are. Aviv Ovadya, a researcher at the Berkman Klein Center for Internet & Society at Harvard who participated in testing ChatGPT, agrees that these defenses are brittle.
