OpenAI Realtime API Python Code: Understanding the Low-Level Sample Code for Azure's Realtime Audio Python Code

Introduction

The “gpt-4o-realtime-preview” model has been released. In addition to text and audio input and output, it supports calling custom functions via function calling.
As of October 2, 2024, the API returns errors such as 403 and does not appear to be usable yet. This article will be updated once it becomes available.

OpenAI has provided a JavaScript code sample on its website. Additionally, Azure has also published a Python code sample on GitHub.

In this article, we will analyze Azure’s sample code, “low_level_sample.py,” to understand how it works.

Libraries

The required libraries are as follows:

python-dotenv  
soundfile  
numpy  
scipy  

Code Explanation

main Function

In the main function, it first loads the dotenv file to retrieve the API key and endpoint:

load_dotenv()  

Next, it checks the arguments. This file is executed using the command python low_level_sample.py <audio file> <azure|openai>. You can choose either OpenAI or Azure OpenAI as the API:

if len(sys.argv) < 2:  
    print("Usage: python sample.py <audio file> <azure|openai>")  
    print("If second argument is not provided, it will default to azure")  
    sys.exit(1)  

Then, it uses asyncio to run the process asynchronously:

file_path = sys.argv[1]  
if len(sys.argv) == 3 and sys.argv[2] == "openai":  
    asyncio.run(with_openai(file_path))  
else:  
    asyncio.run(with_azure_openai(file_path))  

Next, let’s look at the with_openai function.

with_openai Function

The API key and model name are retrieved from environment variables.
Then, an instance of RTLowLevelClient is created:

async with RTLowLevelClient(key_credential=AzureKeyCredential(key), model=model) as client:  

Next, a message is added:

await client.send(  
    SessionUpdateMessage(session=SessionUpdateParams(turn_detection=ServerVAD(type="server_vad")))  
)  

Here, we specify “server_vad” for Voice Activity Detection (VAD). Although “server_vad” is the only option currently available, you can set options like detection threshold and allowable silence duration:

class ServerVAD(BaseModel):  
    type: Literal["server_vad"] = "server_vad"  
    threshold: Optional[Annotated[float, Field(strict=True, ge=0.0, le=1.0)]] = None  
    prefix_padding_ms: Optional[int] = None  
    silence_duration_ms: Optional[int] = None  
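
For instance, you can lower the detection threshold or shorten the allowable silence. The sketch below re-declares the ServerVAD model from above so it runs standalone; the specific values are illustrative, not recommendations:

```python
# Sketch: configuring server-side VAD with explicit options.
# The model is re-declared here (copied from the sample) so the
# snippet is self-contained; the chosen values are only examples.
from typing import Annotated, Literal, Optional
from pydantic import BaseModel, Field

class ServerVAD(BaseModel):
    type: Literal["server_vad"] = "server_vad"
    threshold: Optional[Annotated[float, Field(strict=True, ge=0.0, le=1.0)]] = None
    prefix_padding_ms: Optional[int] = None
    silence_duration_ms: Optional[int] = None

# More sensitive detection, with a shorter silence window before a turn ends.
vad = ServerVAD(threshold=0.5, prefix_padding_ms=300, silence_duration_ms=500)
print(vad.model_dump_json())
```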

The message is then converted to JSON before being sent:

async def send(self, message: UserMessageType):  
    message_json = message.model_dump_json()  
    await self.ws.send_str(message_json)  

The model_dump_json method is defined on pydantic's BaseModel and converts the model into a JSON string. The resulting JSON looks like this:

{  
    "event_id": null,  
    "type": "session.update",  
    "session": {  
        "model": null,  
        "modalities": null,  
        "voice": null,  
        "instructions": null,  
        "input_audio_format": null,  
        "output_audio_format": null,  
        "input_audio_transcription": null,  
        "turn_detection": {  
            "type": "server_vad",  
            "threshold": null,  
            "prefix_padding_ms": null,  
            "silence_duration_ms": null  
        },  
        "tools": null,  
        "tool_choice": null,  
        "temperature": null,  
        "max_response_output_tokens": null  
    }  
}  

This is sent to session.update to configure the session. You can specify system instructions in the “instructions” field. For example, to set a system prompt, you can modify the code like this:

await client.send(  
    SessionUpdateMessage(  
        session=SessionUpdateParams(  
            instructions="<your system instructions>",  
            turn_detection=ServerVAD(type="server_vad")  
        )  
    )  
)  
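
The many null fields in the JSON above come from Optional model fields that were never set; pydantic can omit them at serialization time. A minimal sketch, using a stripped-down stand-in for SessionUpdateParams (only two fields, for illustration):

```python
# Sketch: pydantic's model_dump_json serializes every field by default,
# so unset Optional fields appear as null. exclude_none=True drops them.
# This stand-in model has only two of the real SessionUpdateParams fields.
from typing import Optional
from pydantic import BaseModel

class SessionUpdateParams(BaseModel):
    instructions: Optional[str] = None
    temperature: Optional[float] = None

params = SessionUpdateParams(instructions="Answer briefly.")
print(params.model_dump_json())                   # "temperature" appears as null
print(params.model_dump_json(exclude_none=True))  # {"instructions":"Answer briefly."}
```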

Next, asyncio.gather is used to run both send_audio and receive_messages functions simultaneously:

await asyncio.gather(send_audio(client, audio_file_path), receive_messages(client))  
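
The pattern here is one connection shared by two coroutines: one streams audio up while the other consumes messages coming down. A minimal sketch of that shape, with an asyncio.Queue standing in for the websocket:

```python
# Minimal sketch of the send/receive pattern: two coroutines run
# concurrently with asyncio.gather. An asyncio.Queue stands in for
# the websocket connection here.
import asyncio

async def send_audio(queue: asyncio.Queue) -> None:
    for chunk in ("chunk-1", "chunk-2"):
        await queue.put(chunk)
    await queue.put(None)  # signal end of stream

async def receive_messages(queue: asyncio.Queue, received: list) -> None:
    while (item := await queue.get()) is not None:
        received.append(item)

async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    received: list = []
    await asyncio.gather(send_audio(queue), receive_messages(queue, received))
    return received

print(asyncio.run(main()))  # ['chunk-1', 'chunk-2']
```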

In the send_audio function, the audio file is read using soundfile, base64 encoded, and then sent as InputAudioBufferAppendMessage:

...  

audio_data, original_sample_rate = sf.read(audio_file_path, dtype="int16", **extra_params)  

...  

audio_bytes = audio_data.tobytes()  

for i in range(0, len(audio_bytes), bytes_per_chunk):  
    chunk = audio_bytes[i : i + bytes_per_chunk]  
    base64_audio = base64.b64encode(chunk).decode("utf-8")  
    await client.send(InputAudioBufferAppendMessage(audio=base64_audio))  

The audio data is sent to input_audio_buffer.append.
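
The pcm16 input format expects 16-bit mono audio at 24 kHz, so files recorded at other rates need resampling before they are chunked and encoded. A sketch of that preparation step, using scipy (already in the library list); the chunk size and target rate here are assumptions, not values from the sample:

```python
# Sketch: preparing audio for input_audio_buffer.append.
# Assumes the pcm16 format is 16-bit mono at 24 kHz; the chunk size
# of 4800 bytes (100 ms) is illustrative.
import base64
import numpy as np
from scipy.signal import resample_poly

TARGET_RATE = 24000

def to_pcm16_chunks(audio: np.ndarray, original_rate: int, bytes_per_chunk: int = 4800):
    """Resample int16 audio to 24 kHz and yield base64-encoded chunks."""
    if original_rate != TARGET_RATE:
        resampled = resample_poly(audio.astype(np.float32), TARGET_RATE, original_rate)
        audio = resampled.astype(np.int16)
    audio_bytes = audio.tobytes()
    for i in range(0, len(audio_bytes), bytes_per_chunk):
        yield base64.b64encode(audio_bytes[i : i + bytes_per_chunk]).decode("utf-8")

# One second of 16 kHz silence becomes one second at 24 kHz (48000 bytes).
chunks = list(to_pcm16_chunks(np.zeros(16000, dtype=np.int16), 16000))
```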

In the receive_messages function, responses based on the processed audio data from the send_audio function are received.
The session is established at “/openai/realtime”, and messages are received asynchronously:

message = await client.recv()  

The case structure handles different message types. The message types are explained here. Below are some of the important ones:

input_audio_buffer.committed

When the server-side Voice Activity Detection (VAD) detects that the user’s speech has ended, the input_audio_buffer.committed message is sent.

input_audio_buffer.speech_started

When the server-side VAD detects the start of the user’s speech, input_audio_buffer.speech_started is sent. You can retrieve the start time within the buffer using message.audio_start_ms.

input_audio_buffer.speech_stopped

When the server-side VAD detects that the user’s speech has ended, input_audio_buffer.speech_stopped is sent. You can retrieve the end time using message.audio_end_ms.
By monitoring speech events, it’s possible to trigger spontaneous responses. For instance, using response.create, the AI can generate a response without waiting for further user input when a period of silence is detected.

conversation.item.created

This can be used to manage conversation history.

response.created

When a response is created, response.created is sent. For streaming processing, you can use response.text.delta and response.audio.delta.

The low_level_sample.py script does not handle audio output. To output audio, you need to retrieve the audio data and use tools like pyaudio for playback. Here’s how to handle the audio data:

audio_bytes = base64.b64decode(chunk.data)
audio_data.extend(audio_bytes)
if len(audio_data) > 0:  # Python has no "null"; test the buffer's length
    print(prefix, f"Audio received with length: {len(audio_data)}")
    audio_array = np.frombuffer(bytes(audio_data), dtype=np.int16)
    sf.write(os.path.join(out_dir, f"{item.id}.wav"), audio_array, samplerate=24000)
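
If you prefer not to depend on soundfile for output, the decoded PCM bytes can also be written with the standard-library wave module. A self-contained sketch, assuming the API's output is 16-bit mono at 24 kHz:

```python
# Sketch: writing decoded pcm16 audio to a playable WAV file using the
# standard-library wave module. 24 kHz mono 16-bit is the assumed
# output format of the Realtime API.
import wave

def write_pcm16_wav(path: str, pcm_bytes: bytes, sample_rate: int = 24000) -> None:
    with wave.open(path, "wb") as out:
        out.setnchannels(1)        # mono
        out.setsampwidth(2)        # 16-bit samples
        out.setframerate(sample_rate)
        out.writeframes(pcm_bytes)

write_pcm16_wav("response.wav", b"\x00\x00" * 24000)  # one second of silence
```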

I hope this article helps with your development.
If you found it useful, I would appreciate a positive rating.
Thank you!
