- OpenAI announced its latest AI model, GPT-4o (the “o” stands for “omni”), which can reason across audio, vision, and text in real time, along with updates including a voice chatbot and a desktop app.
- The company showcased GPT-4o’s ability to hold natural conversations, sing, laugh, determine emotional states, translate languages, and explain math problems from images.
- While promising more accessible AI, OpenAI acknowledged safety challenges with GPT-4o’s multimodal capabilities and vowed to work on mitigations against misuse.
ChatGPT developer OpenAI today announced its latest AI model, GPT-4o—the “o” stands for “omni”—during a spring product update live stream. The company also announced a slew of product updates, including a voice chatbot.
New Desktop and Mobile Apps
OpenAI updated its mobile apps immediately after the announcements and also launched a desktop app for ChatGPT. The company emphasized user-experience improvements that it says let people focus more fully on their conversations with ChatGPT.
“For the past couple of years, we’ve been very focused on improving the intelligence of these models, and they’ve gotten pretty good,” OpenAI chief technology officer Mira Murati said. “But this is the first time that we are really making a huge step forward when it comes to ease of use.”
Introducing GPT-4o
The livestream emphasized a simplified, more holistic approach to generative AI with the new GPT-4o model. An “omnimodel”—or natively multimodal system—handles audio, vision, and text within a single model rather than coordinating among separate components such as GPT for text and GPT Vision for images.
“We think it’s very, very important that people have an intuitive feel for what the technology can do, so we really want to pair it with this broader understanding,” Murati said.
She noted that GPT-4o will be available to both paid and free ChatGPT users, as well as to developers through OpenAI’s API. Paid ChatGPT subscribers will continue to have up to five times the message capacity of free users.
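For developers, GPT-4o is exposed through OpenAI’s existing Chat Completions API under the model identifier gpt-4o. Below is a minimal sketch using the official openai Python client; the prompt and message content are illustrative, not taken from OpenAI’s announcement.

```python
# Minimal sketch: calling GPT-4o through OpenAI's Chat Completions API.
# Assumes the official `openai` Python client (v1+) and an OPENAI_API_KEY
# environment variable; "gpt-4o" is OpenAI's published model identifier.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the GPT-4o announcement in one sentence."},
    ],
)

# The reply text lives on the first choice's message object.
print(response.choices[0].message.content)
```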
New Conversational Capabilities
OpenAI then showcased ChatGPT’s ability to hold a real-time casual conversation with users, demonstrating a variety of tones and emotions. The demo included ChatGPT singing, laughing, and joking with the OpenAI engineers. The company also claimed that ChatGPT can now determine a user’s emotional state using the mobile phone’s front-facing camera.
A new blog post noted that GPT-4o enables “much more natural human-computer interaction.” According to OpenAI, the model can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, similar to human response time in a conversation.
Even before today’s announcements, many observers had speculated that a voice chatbot powered by a next-generation AI model could make the personal companion depicted in the sci-fi film “Her” a reality.
Additional Features
Using the ChatGPT desktop application, OpenAI engineers copied software code into ChatGPT and then discussed it conversationally. The demo also showcased ChatGPT’s real-time language translation across 20 languages, and the model explained a math problem after a photo of the equation was submitted through the app.
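The photo-of-an-equation demo maps onto the same API’s image inputs. The following is a hedged sketch, again using the official openai Python client; equation.jpg is a hypothetical local file, and the base64 data-URL encoding shown is one common way to pass images to the Chat Completions endpoint.

```python
# Sketch: sending a photo of a math problem to GPT-4o for an explanation.
# Assumes the official `openai` Python client; "equation.jpg" is a
# hypothetical local image file, encoded here as a base64 data URL.
import base64
from openai import OpenAI

client = OpenAI()

with open("equation.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            # A single message can mix text and image content parts.
            "content": [
                {"type": "text", "text": "Explain how to solve this equation step by step."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```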
Safety and Ethics
OpenAI acknowledged today that GPT-4o presents new safety challenges given its real-time audio and vision capabilities.
“Our team has been hard at work figuring out how to build mitigations against misuse,” Murati said. “We continue to work with different stakeholders out there—from government, media, entertainment, red teamers, and civil society—to figure out how to best bring these technologies into the world.”
Conclusion
With today’s announcements, OpenAI continues to lead in generative AI. The company’s wildly popular ChatGPT, released in November 2022, has dominated the conversation around this emerging technology. With close ties to, and significant investment from, Microsoft, OpenAI’s innovations look set to shape the future of AI.