Analyzing Meta, OpenAI, Anthropic, and Cohere: A Comparative Study

If the AI models in the tech industry were given superlatives, they would be as follows: Microsoft-backed OpenAI’s GPT-4 would be the best at math, Meta’s Llama 2 would be the most middle-of-the-road, Anthropic’s Claude 2 would be the best at knowing its limits, and Cohere’s model would be known for hallucinating, confidently serving up incorrect answers. These rankings were determined by researchers at Arthur AI, a machine learning monitoring platform, in a recent report.

The research is timely: misinformation from AI systems is more hotly debated than ever as generative AI booms in the run-up to the 2024 U.S. presidential election.

According to Adam Wenchel, Arthur’s co-founder and CEO, the report is the first of its kind to take a comprehensive look at rates of hallucination, rather than reducing each model to a single number on an LLM leaderboard.

Hallucinations occur when large language models fabricate information and present it as fact. In one example, ChatGPT cited “bogus” cases in a New York federal court filing, potentially leading to repercussions for the attorneys involved.

In their experiments, the researchers at Arthur AI assessed the AI models’ performance in various categories such as mathematics, U.S. presidents, and Moroccan political leaders. The aim was to challenge the models and observe their ability to reason and provide accurate information.
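
The article doesn’t describe Arthur’s harness in detail, but the general shape of such a category-based test is straightforward to sketch. In the toy Python below, the sample questions, the refusal check, and the `ask_model` callable are all hypothetical stand-ins, not the report’s actual methodology:

```python
from typing import Callable

# Hypothetical ground-truth question sets, one per category.
# Illustrative stand-ins only, not Arthur AI's actual prompts.
QUESTIONS = {
    "math": [
        ("What is 17 * 23?", "391"),
        ("What is the square root of 144?", "12"),
    ],
    "us_presidents": [
        ("Who was the 16th U.S. president?", "Abraham Lincoln"),
    ],
}

def hallucination_rate(ask_model: Callable[[str], str], category: str) -> float:
    """Fraction of *answered* questions that miss the ground truth.

    Refusals are excluded from the denominator: a model that declines
    to answer is not hallucinating.
    """
    wrong, answered = 0, 0
    for question, truth in QUESTIONS[category]:
        reply = ask_model(question)
        if "i don't know" in reply.lower():  # crude refusal check
            continue
        answered += 1
        if truth.lower() not in reply.lower():
            wrong += 1
    return wrong / answered if answered else 0.0

# A toy model that confidently answers "42" to everything
# hallucinates on 100% of the math questions.
print(f"{hallucination_rate(lambda q: '42', 'math'):.0%}")
```

Scoring by substring match is far cruder than whatever grading the report actually used; the point is only the structure: per-category prompts, known ground truth, and a refusal-aware denominator.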

Overall, OpenAI’s GPT-4 performed the best among all the models tested, with significantly fewer hallucinations compared to its predecessor, GPT-3.5. For math-related questions, GPT-4 hallucinated between 33% and 50% less than GPT-3.5, depending on the specific category.

On the other hand, Meta’s Llama 2 had a higher rate of hallucination compared to GPT-4 and Anthropic’s Claude 2.

In the math category, GPT-4 and Claude 2 secured the first and second spots respectively. However, in terms of accuracy for U.S. presidents, Claude 2 took first place, pushing GPT-4 to second. When it came to Moroccan politics, GPT-4 came out on top once again, while Claude 2 and Llama 2 mostly refrained from answering.

In a separate experiment, the researchers evaluated how willing each model was to hedge its answers with warning phrases to avoid risk. GPT-4 hedged about 50% more often than GPT-3.5, which helps quantify the common complaint that GPT-4 is more frustrating to use. Cohere’s model, by contrast, did not hedge at all in its responses. Claude 2, meanwhile, showed the strongest self-awareness, accurately gauging what it knows and answering only questions it had sufficient training data to support.
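
The article doesn’t say how hedging was counted, but one plausible approach is to scan each response for stock disclaimer phrases. The phrase list below is purely illustrative, not the report’s criteria:

```python
# Illustrative hedge phrases; the report's actual criteria are not public.
HEDGE_PHRASES = (
    "as an ai model",
    "i cannot provide",
    "i'm not able to",
    "it is important to note",
)

def hedge_rate(responses: list[str]) -> float:
    """Fraction of responses containing at least one hedging phrase."""
    hedged = sum(
        any(phrase in reply.lower() for phrase in HEDGE_PHRASES)
        for reply in responses
    )
    return hedged / len(responses) if responses else 0.0

# One of these two responses hedges, so the rate printed is 0.5.
print(hedge_rate([
    "As an AI model, I cannot provide financial advice.",
    "The capital of France is Paris.",
]))
```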

According to Wenchel, the key takeaway for users and businesses is to test each model on their exact workload: how a model performs on what you are actually trying to accomplish matters more than its position on a generic leaderboard.

Wenchel also cautioned against relying on benchmark tests alone, since most benchmarks measure an LLM in isolation rather than the way it is actually used. Evaluating a model in the context of its intended application is what gives an accurate picture of its real-world effectiveness.
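
In that spirit, a workload-specific spot check can be as simple as running each candidate model over prompts drawn from your own domain and counting correct answers. Everything in this sketch, the prompts and both stub “models”, is hypothetical:

```python
# Hypothetical side-by-side check on domain-specific prompts.
# Both "models" are stubs standing in for real API clients.
PROMPTS = {
    "What is the default API rate limit?": "100 requests per minute",
    "Which plan includes SSO?": "Enterprise",
}

def model_a(question: str) -> str:
    return "The default limit is 100 requests per minute."

def model_b(question: str) -> str:
    return "I don't know."

for name, model in [("model_a", model_a), ("model_b", model_b)]:
    correct = sum(
        truth.lower() in model(q).lower() for q, truth in PROMPTS.items()
    )
    print(f"{name}: {correct}/{len(PROMPTS)} correct on this workload")
```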
