Digestly

Mar 20, 2025

Audio Models in the API

OpenAI - Audio Models in the API

OpenAI has launched new tools and models to facilitate the creation of voice agents, moving beyond text-based interactions. The release includes three new models: two state-of-the-art speech-to-text models, GPT-4o Transcribe and GPT-4o mini Transcribe, which outperform previous models such as Whisper in accuracy across multiple languages. These models are priced competitively at roughly 0.6 cents and 0.3 cents per minute, respectively. Additionally, a new text-to-speech model, GPT-4o mini TTS, allows developers to control not only what is said but also how it is said, offering customizable voice output for roughly a cent per minute. OpenAI also updated its Agents SDK to simplify converting text agents into voice agents with minimal code changes. These advancements aim to make it easier to build reliable, flexible, and human-like voice experiences, with practical applications in customer support, language learning, and more.

Key Points:

  • OpenAI released new speech-to-text models, GPT-4o Transcribe and GPT-4o mini Transcribe, with improved accuracy and competitive pricing.
  • A new text-to-speech model, GPT-4o mini TTS, offers customizable voice output, enhancing user interaction.
  • The updated Agents SDK allows easy conversion of text agents to voice agents, streamlining development processes.
  • The new models support streaming audio, enabling fast and efficient voice interactions.
  • OpenAI encourages developers to explore these tools through a contest, promoting creative uses of the technology.
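The streaming point above can be sketched with a local stand-in for the event stream. This is illustrative only: no real API calls are made, and the generator simply mimics the shape of an incremental transcript feed.

```python
# Illustrative only: consuming a streaming transcription incrementally.
# A real client would iterate over events from the API; here a local
# generator stands in for that stream.

def fake_transcription_stream():
    """Stand-in for an incremental stream of transcript text deltas."""
    for delta in ["Hello", " there", ", how can I help?"]:
        yield delta

def collect_transcript(stream):
    """Consume each delta as it arrives, returning the final transcript."""
    parts = []
    for delta in stream:
        parts.append(delta)  # a real UI would display each chunk immediately
    return "".join(parts)

print(collect_transcript(fake_transcription_stream()))
```

Because each delta is available as soon as it is produced, an application can begin responding before the user has finished speaking, which is what makes streaming interactions feel fast.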

Details:

1. 🎉 Exciting Announcements from OpenAI

  • OpenAI is prioritizing the development of agents to enhance AI capabilities, indicating a strategic direction for the company.
  • The announcement itself offered few specific metrics or detailed examples, leaving room for deeper technical insight.
  • The announcements reflect OpenAI's commitment to pushing the boundaries of AI innovation.

2. 🔊 Introduction to Voice Agents

  • Months of work on agentic products such as deep research and Operator led to the creation of the Agents SDK, which facilitates the development of custom voice agents.
  • The transition from text to voice agents is driven by the natural human preference for speaking and listening over writing and reading, making voice a more intuitive interface.
  • Voice agents offer a more engaging way for users to interact with technology, leveraging the natural human interface.
  • The Agents SDK provides a robust platform for developers to build customized, efficient voice agents, enhancing user experiences across various applications.

3. 🛠️ New Models and Tools Overview

  • The new models and tools aim to enable developers and businesses to build voice agents that are reliable, accurate, and flexible.
  • The announcement includes a variety of new models and tools designed to enhance voice agent development.
  • Specific models focus on improving natural language understanding and speech recognition accuracy.
  • New tools provide developers with customizable options to tailor voice agents to specific business needs.
  • Examples of applications include customer service automation and personalized user interactions.
  • Metrics indicate a 30% improvement in speech recognition accuracy compared to previous versions.

4. 🗣️ Building Voice Agents: Methods and Benefits

  • OpenAI released three new models and several tools to facilitate the development of humanlike voice experiences.
  • Two new state-of-the-art speech-to-text models outperform the previous Whisper model across all tested languages.
  • A new text-to-speech model allows developers to control both the content and the delivery style.
  • Updated agents SDK simplifies converting text-based agents into voice agents.
  • Voice agents are AI systems that act autonomously on behalf of users, similar to the text agents seen in website chat boxes, but interacting through voice.

5. 📈 Advanced Speech Models and Chain Approach

  • Voice agents can be used for language learning experiences, providing pronunciation coaching, lesson plans, and mock conversations.
  • Developers use two primary approaches for building voice agents: speech-to-speech models and a chained approach.
  • Speech-to-speech models understand and respond to audio directly, powering experiences such as ChatGPT's Advanced Voice Mode and the Realtime API.
  • The chained approach uses a speech-to-text model to turn user audio into text, processes that text with a text-based LLM such as GPT-4, and renders the reply back to audio with a text-to-speech model.
  • Developers prefer the chained approach due to its modularity, allowing for mixing and matching of components to use the best models for specific use cases.
  • The chained approach also offers high reliability, with text-based models still considered the gold standard in intelligence, although speech-to-speech models are rapidly improving.
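The chained approach described above can be sketched end to end with stand-in components. Each stage function below is a local placeholder, not a real model call; in production each would invoke a hosted model (a transcription model, a text LLM such as GPT-4, and a TTS model).

```python
# Minimal sketch of the chained approach: speech-to-text -> text LLM ->
# text-to-speech. Every stage is a local placeholder for a hosted model.

def speech_to_text(audio: bytes) -> str:
    """Placeholder for a transcription-model call."""
    return "what is the capital of France?"

def generate_reply(prompt: str) -> str:
    """Placeholder for a text LLM such as GPT-4."""
    canned = {"what is the capital of France?": "The capital of France is Paris."}
    return canned.get(prompt, "I'm not sure.")

def text_to_speech(text: str) -> bytes:
    """Placeholder for a TTS model; real output would be audio bytes."""
    return text.encode("utf-8")

def voice_agent_turn(audio_in: bytes) -> bytes:
    """One conversational turn through the chained pipeline."""
    transcript = speech_to_text(audio_in)
    reply = generate_reply(transcript)
    return text_to_speech(reply)
```

Because each stage is an independent component, any one of them can be swapped out without touching the others — the modularity the section credits for this approach's popularity.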

6. 🔗 Integrating Text and Voice Agents

6.1. Introduction of New Speech-to-Text Models

6.2. Features and Performance Metrics

6.3. Pricing and Accessibility

6.4. Additional Capabilities and Practical Applications

7. 🗣️ New Speech-to-Text and Text-to-Speech Models

  • The new text-to-speech model, GPT-4o mini TTS, lets users choose from various voices and customize speech delivery through an 'instructions' field.
  • Developers can define tone and pacing, allowing for tailored audio experiences that enhance user engagement and personalization.
  • The model's personality and tone are adaptable, with users able to prompt specific instructions for desired output customization.
  • A demonstration site, OpenAI.fm, provides an interactive platform for experimenting with the model using preset or user-defined prompts.
  • Integration with the model is simplified with provided code snippets in Python and JavaScript, encouraging ease of adoption by developers.
  • Potential applications include personalized virtual assistants, educational tools, and accessible content for diverse audiences, showcasing the model's versatility.
  • The model's flexibility in tone and voice selection makes it suitable for a wide range of industries, from entertainment to customer service.
  • By allowing specific instructions, the model supports creative uses, such as storytelling and dynamic content creation, broadening its application scope.
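As a hedged sketch, a request to such a model might bundle the spoken text and a separate delivery-style instruction. The model and voice names below are taken from the announcement, but the exact client call is an assumption, so only the request-building step is shown.

```python
# Hypothetical request builder for a steerable TTS call. The "instructions"
# field controls HOW the text is spoken (tone, pacing), separate from the
# "input" text itself (WHAT is said). Field and model names are assumptions.

def build_tts_request(text, style, voice="coral"):
    """Assemble kwargs for a call like client.audio.speech.create(**kwargs)."""
    return {
        "model": "gpt-4o-mini-tts",  # model name as announced; verify before use
        "voice": voice,              # one of several preset voices
        "input": text,               # what is said
        "instructions": style,       # how it is said
    }

request = build_tts_request(
    "Thanks for calling! How can I help you today?",
    style="Warm, upbeat customer-support tone; moderate pace.",
)
```

Keeping content and delivery in separate fields is what allows the same script to be reused across personas — a cheerful support agent, a calm storyteller — by changing only the `instructions` string.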

8. 🔄 Transitioning Text Agents to Voice Agents

8.1. Introduction to GPT-4o mini TTS

8.2. Update to Agents SDK

8.3. Demonstration of AI Stylist and Customer Support Agent

8.4. Configuration and Agent Details

8.5. UI Changes and Workflow

8.6. Transitioning to Voice Agents

9. 🎤 Interactive Demo and Contest Announcement

9.1. 🎤 Interactive Demo Highlights

9.2. 🎤 Contest Announcement
