At Google’s annual developer conference, the company announced an important update to its Bard chatbot: like OpenAI’s GPT-4, Bard will soon be able to describe images. The enhancement may seem small, but it signals a quiet revolution in how AI is built and used, one that aims to move beyond language and incorporate other kinds of media in pursuit of a fuller understanding of the world.
The six-month-old ChatGPT already looks dated. Chatbots like it are built on large language models, which learn to predict the next word in a sequence by analyzing vast amounts of text; training ever-larger models on ever more text has been the prevailing approach in AI development. Now, however, the field is shifting toward multimodal models that can also process images, audio, and other sensory data. The new approach aims to mimic how humans learn, by observing and existing in the world, and it lets companies build AI systems that can perform a wider variety of tasks and thus power a wider range of products.
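For readers curious what “predicting the next word” looks like in practice, here is a minimal sketch using the small, open-source GPT-2 model via the Hugging Face transformers library. GPT-2 is a public stand-in; the models behind ChatGPT and Bard are not publicly available.

```python
# A minimal sketch of next-word prediction with a small public language model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "The weather today is"
# The model repeatedly picks a likely next token until it has
# produced 20 new tokens, extending the prompt one word at a time.
print(generator(prompt, max_new_tokens=20)[0]["generated_text"])
```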
GPT-4, Bard, and other programs with expanded capabilities are already emerging. Meta, for example, released ImageBind, which processes text, images, audio, depth information, infrared radiation, and motion and position data. Google’s PaLM-E was trained on both language and robot sensory data, and another powerful model that goes beyond text is reportedly on the horizon. Microsoft, too, has a model trained on words and images. And text-to-image generators such as DALL-E 2, which went viral last summer, are themselves trained on captioned pictures.
These multimodal models hold great potential for the future of AI. The objective is an AI that can do more than formulaic writing and Slack assistance: one that can search the internet and return accurate results, animate videos, guide robots, or build websites on its own. In one demonstration, GPT-4 generated the code for a working website from a rough concept a human had sketched on paper.
Multimodal approaches could also solve a fundamental problem with language-only models: although they can generate fluent sentences, they struggle to connect words to concepts or real-world experiences. Training a system on other kinds of data, say, videos of traffic jams alongside the phrase traffic jam, could give it a more grounded understanding. That grounding, in turn, might allow AI to interact with physical environments, develop something like common sense, and fabricate less; the hope is that a model with a fuller picture of the world is less likely to make things up.
The rise of multimodal models has been enabled by changes in AI research as well as advances in hardware. Once-separate fields such as natural language processing, computer vision, and robotics now share a common technique, deep learning, which makes it far easier to combine models and methods across domains. Meanwhile, internet giants have amassed enormous datasets of images and videos, and computers have grown powerful enough to train on them.
There is also a practical motive for the shift. The supply of high-quality text on the internet is finite, and there are limits to how much bigger and more complex these programs can usefully get. Moving beyond text lets researchers tap other kinds of data to keep improving their models; Sam Altman, OpenAI’s CEO, has suggested that simply scaling up text-based models is no longer the path forward.
How much better multimodal AI will understand the world than ChatGPT does remains uncertain. These models often outperform language-only models on tasks involving images and 3-D scenarios, but elsewhere the gains can be modest: adding vision to GPT-4, for instance, did not significantly improve its performance on standardized tests. Multimodal models still exhibit familiar flaws, including confidently stating falsehoods. Still, the research is in its early stages and has room to improve.
It is also worth acknowledging how far current AI models are from duplicating human thought. The architecture of these systems makes it unlikely that they will reach human-level intelligence: humans learn through social interaction, long-term memory, and accumulated experience, all shaped by millions of years of evolution, none of which these models possess.
Nor does feeding models more kinds of data automatically fix problems of bias and fabrication. A program trained on biased text and images will still produce harmful outputs, just across more media: text-to-image models such as Stable Diffusion have been shown to reproduce racist and sexist stereotypes. Regulating and auditing AI software remains difficult, and as systems demand an ever greater variety of data, the risk of labor and copyright violations grows.
Multimodal models may even be more vulnerable to manipulation than language-only ones. Subtly altering a handful of key pixels in an image, changes imperceptible to a person, can dramatically change a model’s output, opening the door to more convincing and potentially more dangerous hallucinations. Multimodality, in short, is not a panacea.
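The pixel manipulation described above is known in the research literature as an adversarial attack. Below is a minimal sketch of the classic fast gradient sign method (FGSM), using PyTorch; the tiny linear classifier here is a toy stand-in for a real vision model.

```python
# A minimal FGSM sketch: nudge each pixel slightly in the direction
# that increases the model's loss, flipping its prediction while the
# image looks unchanged to a human.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # toy classifier
loss_fn = nn.CrossEntropyLoss()

image = torch.rand(1, 3, 32, 32, requires_grad=True)  # dummy 32x32 RGB image
label = torch.tensor([3])                             # its true class

loss = loss_fn(model(image), label)
loss.backward()  # gradients now tell us how each pixel affects the loss

epsilon = 0.01  # tiny step size, far below what a person would notice
adversarial = (image + epsilon * image.grad.sign()).clamp(0, 1)
```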
From a business perspective, multimodal AI is simply more lucrative. Language models are already in high demand in Silicon Valley, and adding multimodal capabilities makes them more appealing still: they could describe images and videos, interpret and create diagrams, and act as more capable personal assistants. Multimodal AI could also improve existing software for visually impaired people, help consultants and venture capitalists build better slide decks, and speed up tedious, document-heavy digital work.
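As an illustration of what image description looks like as a product feature, here is a minimal sketch using OpenAI’s Python client. The model name and image URL are placeholders, and access depends on an API key and model availability.

```python
# A minimal sketch of asking a multimodal model to describe an image.
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder for any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```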
While multimodal AI shows promise, human-level intelligence is not within reach of current architectures. The challenges of bias, fabrication, and manipulation persist, and integrating more types of data does not guarantee their resolution. Even so, multimodal AI is poised to reshape the industry and provide valuable services across a wide range of sectors.