
Voice-driven applications have shifted from sci-fi concepts to deployable solutions with OpenAI's latest tools. This guide walks through practical implementation of multilingual voice agents using the OpenAI Agent SDK, demonstrating how to create systems that process speech across languages while maintaining human-like interaction rhythms.
What Is OpenAI Agent SDK?
OpenAI Agent SDK provides developers with a framework to build AI agents that can process and respond to various inputs, including voice. The SDK works with the gpt-4o-realtime-preview model, which enables real-time conversational capabilities.
The SDK specifically includes VoicePipeline, a component designed to handle voice-based interactions seamlessly. This pipeline manages the complex process of converting speech to text, processing the information, and generating natural-sounding responses.
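For orientation, the voice-specific building blocks used in the rest of this guide come from the agents and agents.voice modules; the imports below simply preview the names that the later steps put to work:
from agents import Agent, function_tool
from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline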
Core Architecture of Modern Voice Agents

1. The Speech Processing Pipeline
OpenAI's VoicePipeline operates through three synchronized stages: audio capture, language processing, and response generation. The system begins by converting raw audio signals into text using speech-to-text models like GPT-4o Transcribe. This textual input then feeds into language models that analyze context, intent, and emotional tone. Finally, text-to-speech components generate natural-sounding vocal responses while maintaining conversation flow.
2. Multimodal vs Chained Architectures
Two distinct approaches dominate voice agent development:
Direct Audio Processing (Multimodal)
GPT-4o-realtime-preview processes audio directly, without an intermediate text conversion, and typically responds in 200-300 ms. Because the audio stays native throughout the pipeline, the model picks up vocal nuances such as pitch and pauses, enabling emotion-aware replies during customer interactions.
Text-Centric Processing (Chained)
Traditional pipelines separate the transcription, analysis, and synthesis stages. This modular approach enables detailed logging for compliance-sensitive applications like healthcare triage, and developers gain precise control over each stage while using models optimized for each task, as the sketch below illustrates.
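Here is a minimal sketch of the three chained stages wired together with the plain OpenAI Python client (not the Agent SDK). The file names and the gpt-4o-mini-tts/alloy voice choices are illustrative assumptions; only the stage boundaries matter:
from openai import OpenAI

client = OpenAI()

# Stage 1: speech-to-text -- transcribe the caller's audio
with open("caller_message.wav", "rb") as audio_file:  # hypothetical input file
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

# Stage 2: language processing -- reason over the transcribed text
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a concise, polite voice assistant."},
        {"role": "user", "content": transcript.text},
    ],
)
reply_text = completion.choices[0].message.content

# Stage 3: text-to-speech -- synthesize the spoken reply
speech = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input=reply_text,
)
speech.write_to_file("reply.mp3")
Because each stage is an ordinary API call, you can log, inspect, or swap any of them independently, which is exactly the control the chained architecture trades latency for.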
Multilingual Voice Agent Guide: From Code to Conversation
Creating voice agents with OpenAI Agent SDK requires specific environment configurations. Follow these steps to establish a functional development environment with voice capabilities.
Step 1. Python & Virtual Environment Setup
Ensure Python 3.9 or newer is installed (the Agents SDK does not support older releases). Verify with:
python --version
For new installations, download Python from python.org.
a. Create a Virtual Environment
Isolate dependencies to avoid conflicts:
python -m venv voice_agent_env
b. Activate the Environment
- Linux/macOS:
source voice_agent_env/bin/activate
- Windows:
voice_agent_env\Scripts\activate
c. Install Voice-Specific Dependencies
Install the OpenAI Agents SDK with voice extensions and audio libraries:
pip install 'openai-agents[voice]' numpy sounddevice scipy python-dotenv
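To confirm the installation, a quick import check (the module names match the imports used later in this guide) should exit without errors:
python -c "import agents, agents.voice, numpy, sounddevice"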
d. Configure OpenAI API Key: Store your API key securely using environment variables:
- Create a .env file:
echo "OPENAI_API_KEY=your-api-key-here" > .env
- Clone the Example Repository (Optional):
To speed things up, you might clone the official example from the OpenAI Agents SDK GitHub repository.
git clone https://github.com/openai/openai-agents-python.git
cd openai-agents-python
The multilingual demo used in Step 4 lives in examples/voice/static/.
Step 2. Building the Multilingual Agent
The main components are:
- Language-specific agents (here, Spanish and Hindi)
- A primary agent that handles the initial interaction and routes requests by language
- Function tools for additional capabilities (like weather information)
Here's a simplified version of the code structure:
a. Define Your Agents
Create different agent instances for each language you want to support. For example, a Spanish agent and a Hindi agent can be created with instructions in their respective languages:
from agents import Agent
from agents.extensions.handoff_prompt import prompt_with_handoff_instructions

spanish_agent = Agent(
    name="Spanish",
    handoff_description="A Spanish speaking agent.",
    instructions=prompt_with_handoff_instructions(
        "You're speaking to a human, so be polite and concise. Speak in Spanish."
    ),
    model="gpt-4o-mini",
)

hindi_agent = Agent(
    name="Hindi",
    handoff_description="A Hindi speaking agent.",
    instructions=prompt_with_handoff_instructions(
        "You're speaking to a human, so be polite and concise. Speak in Hindi."
    ),
    model="gpt-4o-mini",
)
Create your primary assistant that will detect the language from the user’s speech and delegate to the appropriate agent if needed:
agent = Agent(
    name="Assistant",
    instructions=prompt_with_handoff_instructions(
        "You're speaking to a human, so be polite and concise. If the user speaks in Spanish, hand off to the Spanish agent. If the user speaks in Hindi, hand off to the Hindi agent."
    ),
    model="gpt-4o-mini",
    handoffs=[spanish_agent, hindi_agent],
)
b. Add Tools (Optional)
For example, you can add a simple weather tool that the agent might call:
import random

from agents import function_tool

@function_tool
def get_weather(city: str) -> str:
    """Return a (randomly chosen) weather report for the given city."""
    choices = ["sunny", "cloudy", "rainy", "snowy"]
    return f"The weather in {city} is {random.choice(choices)}."

agent.tools.append(get_weather)
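Appending to agent.tools after the fact works, but the tool can just as well be declared when the primary agent is created. A brief sketch of that variant (with the instructions paraphrased):
agent = Agent(
    name="Assistant",
    instructions=prompt_with_handoff_instructions(
        "You're speaking to a human, so be polite and concise. "
        "Hand off to the Spanish agent for Spanish and the Hindi agent for Hindi."
    ),
    model="gpt-4o-mini",
    handoffs=[spanish_agent, hindi_agent],
    tools=[get_weather],  # declared up front instead of appended later
)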
Step 3. Setting Up the Voice Pipeline
The SDK’s voice pipeline combines three components:
- Speech-to-Text (STT): Converts your audio input into text.
- Agentic Workflow: Processes the text (including language detection and tool invocation).
- Text-to-Speech (TTS): Converts the agent's text reply back into audio.
Here's a simplified example:
import asyncio

import numpy as np
import sounddevice as sd

from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline


async def main():
    # Create the voice pipeline with your primary agent
    pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))

    # For demonstration, we'll simulate 3 seconds of audio input with silence (24 kHz, 16-bit mono).
    buffer = np.zeros(24000 * 3, dtype=np.int16)
    audio_input = AudioInput(buffer=buffer)

    # Run the pipeline
    result = await pipeline.run(audio_input)

    # Set up the audio player (using sounddevice)
    player = sd.OutputStream(samplerate=24000, channels=1, dtype=np.int16)
    player.start()

    # Stream and play audio events from the agent's output
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            player.write(event.data)


if __name__ == "__main__":
    asyncio.run(main())
In a real-world application, instead of silence you’d capture live microphone input, and the agent would detect the language in real time.
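As a rough sketch of that, the fragment below records a few seconds from the default microphone with sounddevice and hands the recording to the same pipeline; the five-second window is an arbitrary choice for illustration, and the 24 kHz, 16-bit mono format matches the buffers used above:
import numpy as np
import sounddevice as sd

from agents.voice import AudioInput

SAMPLE_RATE = 24000  # matches the playback rate used above
SECONDS = 5          # arbitrary recording window for this sketch

def record_from_microphone() -> AudioInput:
    # Record SECONDS of mono 16-bit audio from the default input device.
    recording = sd.rec(
        int(SAMPLE_RATE * SECONDS),
        samplerate=SAMPLE_RATE,
        channels=1,
        dtype=np.int16,
    )
    sd.wait()  # block until the recording is finished
    return AudioInput(buffer=recording.flatten())

# Inside main(), replace the silent buffer with:
#     audio_input = record_from_microphone()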
Step 4. Run Your Voice Agent
If you cloned the example repository, run the static voice demo as a module from the repository root:
python -m examples.voice.static.main
If you assembled your own script from the snippets above, run it directly with python instead.
Best Practices for Voice Agent Development
When building voice agents with the OpenAI Agent SDK, consider these best practices:
- Match the architecture to the job: multimodal (speech-to-speech) when low latency and conversational nuance matter, chained (STT → agent → TTS) when you need per-stage logging and control.
- Keep API keys out of source code; load them from environment variables or a .env file as shown in Step 1.
- Keep agent instructions short and language-specific, and let the primary agent hand off rather than packing every language into one prompt.
- Keep audio formats consistent end to end (the examples above assume 24 kHz, 16-bit mono) so recordings, buffers, and playback line up.
Take the Lead With Your Multilingual Voice Agent Today
Building voice agents with the OpenAI Agent SDK has become significantly more accessible. Developers can now choose between multimodal or chained architectures based on their specific needs, set up a VoicePipeline, and let the SDK handle the complex processing.
For conversational flow quality, the multimodal approach works best. For structure and control, the chained method is more suitable. This technology continues to advance, opening new possibilities for voice-driven applications.