Since its launch, most of ChatGPT’s updates have focused on what the AI chatbot can do: what questions it can answer, what information it can access, and how its underlying model can be improved. This time, OpenAI is changing the way ChatGPT is used.
On September 25th, OpenAI announced on its official website that ChatGPT can now “see,” “hear,” and “speak.” A new version of ChatGPT is currently being rolled out, and in addition to the familiar text-based interaction, it can now take questions in new ways, such as spoken aloud or through uploaded images.
Specifically, in terms of voice interactions:
- Users can hold voice conversations with ChatGPT, a more intuitive way to interact.
- The feature is supported in the iOS and Android mobile apps.
- ChatGPT offers a choice of five different voices.
- Voice interaction is powered by a new text-to-speech model and a speech recognition system.
This should feel similar to talking to Apple’s Siri, but OpenAI hopes to produce better answers by improving the underlying technology. Many virtual assistants, such as Amazon’s Alexa, are currently being upgraded with large language models (LLMs).
According to OpenAI, the new voice feature is backed by a text-to-speech model that can generate “human-like audio” from text and just a few seconds of sample speech. OpenAI also seems to believe the model has potential beyond ChatGPT: it is working with the music streaming platform Spotify to translate podcasts into other languages while preserving the speaker’s voice. Synthetic speech has many interesting applications, and OpenAI could become a significant player in this industry.
However, the ability to create convincing synthetic voices from just a few seconds of audio raises new risks. OpenAI notes that these features “also introduce new risks, such as the possibility of malicious actors impersonating public figures or committing fraud.” For this reason, the model will not be widely available and will be restricted to specific use cases and partners.
In terms of image interactions:
- Users can upload images to interact with ChatGPT.
- Multiple images are supported.
- The mobile app provides drawing tools that users can use to clarify their queries or requests.
- Image understanding is powered by a multimodal GPT model.
In terms of rollout and safety:
- The image and voice features will be rolled out first to Plus (paid subscription) and Enterprise users over the next two weeks.
- The voice and image features will be introduced gradually for safety reasons.
- OpenAI stresses that the model has limitations and should not be relied on in high-risk scenarios.
Image search works somewhat like Google Lens. Users can take a photo of anything they’re interested in, and ChatGPT will try to work out what the user is asking and respond accordingly. Users can also speak or type a question alongside an image and use the app’s drawing tools to clarify what they mean.
This is the kind of interactivity ChatGPT is aiming for: rather than running a search and getting a wrong answer, the user can prompt the chatbot to refine its answer as the conversation goes on.
However, image search clearly has potential problems of its own. For example, how should ChatGPT react when a user asks it to identify a person from a photo? OpenAI states that it has intentionally limited ChatGPT’s “ability to analyze and directly state information about people” for both accuracy and privacy reasons. This means that the science-fiction scenario of looking at someone and asking an AI “who is that” won’t become reality anytime soon.
Nearly a year after the initial release of ChatGPT, OpenAI still seems to be exploring how to add more features and capabilities to its chatbot without introducing new problems and flaws. OpenAI is also trying to strike a balance between “going further” and “reducing risk” by intentionally limiting what its new models can do. However, this approach may not work indefinitely. As more people use voice control and image search, and as ChatGPT comes closer to being a truly multimodal and useful virtual assistant, maintaining these guardrails will become more challenging.