Building a Real-Time Voice-to-Voice Assistant with OpenAI, LiveKit, Next.js and React

Tags: OpenAI, LiveKit, Next.js, React, Voice, Real-time, WebRTC, AI Assistant
Published: 2025-03-23

Introduction

Building a real-time voice-to-voice assistant with streaming capabilities and chat transcript functionality has become increasingly accessible thanks to advances in AI streaming APIs and WebRTC technologies. This guide will walk you through creating a truly real-time voice assistant using OpenAI's streaming API for AI capabilities, LiveKit for real-time communication, and Next.js with React for a responsive frontend interface.

Figure 1: Complete streaming voice assistant architecture with real-time data flow

By the end of this tutorial, you'll have a fully functional application that can listen to user speech, process it through OpenAI's streaming models in real-time, respond with synthesized speech as it's generated (not after completion), and maintain a live-updating transcript of the entire conversation. Streaming is the core technology that makes this truly real-time, enabling natural, fluid interactions that feel like talking to another person.

Without streaming, voice assistants feel robotic and unnatural, with long pauses between user input and assistant response. With proper streaming implementation, we can create an experience that's responsive, natural, and engaging.

Prerequisites

Before we begin, ensure you have the following:

  • Node.js (version 18+) installed on your system
  • Basic knowledge of TypeScript, React, and Next.js
  • An OpenAI API key with access to the Realtime (speech-to-speech) models used for streaming
  • A LiveKit account with API keys (they offer a free tier for development)
  • A code editor like VS Code

We'll be using TypeScript throughout this tutorial for type safety and better developer experience.

Figure 2: System architecture showing client-server communication flow

Understanding Streaming for Real-Time Voice Assistants

Before diving into the code, it's crucial to understand why streaming is the cornerstone of a truly real-time voice assistant. Without streaming, the entire interaction feels disjointed and unnatural.

The Problem with Non-Streaming Approaches

In a traditional non-streaming implementation, the communication flow follows these steps:

  1. User speaks completely
  2. Audio is sent to the server after user completes speaking
  3. Server processes the entire audio
  4. Server generates a complete response
  5. Complete response is sent back to client
  6. Client plays the response audio

This approach creates a rigid, turn-taking conversation with significant delays between utterances, making the interaction feel unnatural and "robotic."

Streaming Architecture: The Key to Natural Conversation

Figure 3: Streaming data flow architecture

With a streaming approach, the interaction becomes fluid and natural:

  1. User starts speaking
  2. Audio chunks begin streaming to the server immediately
  3. Server processes audio chunks as they arrive
  4. Response begins generating while user is still speaking
  5. Generated response chunks stream back to the client in real-time
  6. Client plays response audio as it's received, not waiting for completion

This creates a continuous, natural flow of conversation with minimal latency, similar to talking with another person.

Key Streaming Components

Our implementation will leverage several streaming technologies:

  • WebRTC Audio Streaming: Provides real-time audio transmission from the browser
  • OpenAI Streaming API: Enables token-by-token response generation
  • LiveKit Agents: Handles real-time bidirectional streaming between client and AI models
  • Server-Sent Events (SSE): Enables server-to-client streaming for transcripts
  • React Streaming UI: Updates the UI smoothly as new content arrives

The key difference is that with streaming, we're processing and rendering data in small chunks as they become available, rather than waiting for entire responses. This is what creates the real-time experience.

// Example of streaming vs non-streaming approach

// Non-streaming (traditional)
async function processVoiceNonStreaming() {
  // Wait for complete audio
  const completeAudio = await recordCompleteUserSpeech();

  // Send complete audio to API
  const response = await openai.audio.speech.create({
    model: "tts-1",
    input: completeAudio,
    voice: "nova",
  });

  // Wait for complete response
  const audioResponse = await response.arrayBuffer();

  // Play complete response
  playAudio(audioResponse);
}

// Streaming (real-time)
async function processVoiceWithStreaming() {
  // Create audio stream
  const audioStream = createRealTimeAudioStream();

  // Process chunks as they arrive
  audioStream.on('data', async (chunk) => {
    // Send audio chunk to API with streaming enabled
    const streamingResponse = await openai.audio.speech.create({
      model: "tts-1",
      input: chunk,
      voice: "nova",
      stream: true,
    });

    // Process response chunks as they arrive
    for await (const partialResponse of streamingResponse) {
      // Play audio chunk immediately
      playAudioChunk(partialResponse);

      // Update transcript in real-time
      updateTranscriptInRealTime(partialResponse.text);
    }
  });
}

As you can see from the example above, streaming creates a fundamentally different user experience by processing and responding to data continuously rather than in discrete steps.
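
The component list above mentions Server-Sent Events as one way to push transcript updates from server to client. The implementation in this guide relies on LiveKit data channels instead, but if you want a plain-HTTP fallback for transcripts, a minimal SSE endpoint in a Next.js route handler could look like the sketch below; getTranscriptUpdates is a hypothetical source of transcript chunks, not something defined elsewhere in this article.

// app/api/transcript/stream/route.ts (illustrative sketch, not part of the main implementation)
import { NextRequest } from 'next/server';

// Hypothetical async generator that yields transcript chunks as they are produced
declare function getTranscriptUpdates(roomName: string): AsyncGenerator<string>;

export async function GET(req: NextRequest) {
  const roomName = req.nextUrl.searchParams.get('room') ?? 'default';
  const encoder = new TextEncoder();

  const stream = new ReadableStream({
    async start(controller) {
      // Each transcript chunk becomes one SSE "data:" event
      for await (const chunk of getTranscriptUpdates(roomName)) {
        controller.enqueue(encoder.encode(`data: ${JSON.stringify({ text: chunk })}\n\n`));
      }
      controller.close();
    },
  });

  return new Response(stream, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache, no-transform',
      Connection: 'keep-alive',
    },
  });
}

On the client, new EventSource('/api/transcript/stream?room=my-room') would then receive each chunk as a message event.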

Project Setup

Let's start by creating a new Next.js project with the App Router:

npx create-next-app@latest voice-assistant --typescript
cd voice-assistant

Now, install the required dependencies:

npm install @livekit/components-react @livekit/components-styles
npm install @livekit/agents @livekit/agents-plugin-openai
npm install openai
npm install livekit-client livekit-server-sdk

Create a new .env.local file in the root of your project to store your API keys:

# OpenAI
OPENAI_API_KEY=your_openai_api_key

# LiveKit
LIVEKIT_API_KEY=your_livekit_api_key
LIVEKIT_API_SECRET=your_livekit_api_secret
LIVEKIT_URL=your_livekit_url # e.g., wss://your-project.livekit.cloud

# Exposed to the browser so the client can connect (used by the frontend later in this guide)
NEXT_PUBLIC_LIVEKIT_URL=your_livekit_url

Setting Up the LiveKit Agent with Streaming Capabilities

First, let's define our voice agent using the LiveKit Agents framework with full streaming capabilities. Create a new directory agents in the project root and add a file called voice-agent.ts:

// agents/voice-agent.ts
import { JobContext, WorkerOptions, cli, defineAgent, multimodal } from '@livekit/agents';
import * as openai from '@livekit/agents-plugin-openai';
import { JobType } from '@livekit/protocol';
import { fileURLToPath } from 'node:url';

export default defineAgent({
  entry: async (ctx: JobContext) => {
    // Connect to the LiveKit room
    await ctx.connect();

    // Define our multimodal agent with OpenAI's real-time streaming model
    const agent = new multimodal.MultimodalAgent({
      model: new openai.realtime.RealtimeModel({
        instructions: `You are a helpful AI assistant that specializes in technology topics.
          You should respond conversationally and be friendly but concise.
          Always try to provide valuable information and answer any questions to the best of your ability.
          If you don't know something, admit it rather than making up information.`,
        // You can choose from available OpenAI voices: alloy, echo, fable, onyx, nova, shimmer
        voice: 'nova',
        // Control the creativity of responses
        temperature: 0.7,
        // Allow unlimited response length
        maxResponseOutputTokens: Infinity,
        // Enable both text and audio modalities with streaming
        modalities: ['text', 'audio'],
        // CRUCIAL: Enable real-time streaming for immediate responses
        streaming: {
          enabled: true,
          // Send partial updates every 3 tokens for ultra-responsive feel
          partialUpdateInterval: 3,
        },
        // Configure voice activity detection for real-time turn-taking
        turnDetection: {
          // Use server-side VAD for greater accuracy
          type: 'server_vad',
          // How confident the system should be that speech is occurring (0.0-1.0)
          threshold: 0.5,
          // How long of silence before considering the turn complete (ms)
          silence_duration_ms: 300,
          // How much audio to include before detected speech (ms)
          prefix_padding_ms: 300,
        },
        // Enable continuous conversation without strict turn-taking
        // This allows more natural overlapping conversation
        conversation: {
          continuousTalking: true,
          // How much of user context to maintain (tokens)
          contextWindowSize: 4000,
        },
      }),
    });

    // Start the agent in the LiveKit room with streaming enabled
    await agent.start(ctx.room);

    // Log when streaming events occur for debugging
    agent.on('streaming_started', () => {
      console.log('Streaming started');
    });

    agent.on('streaming_chunk', (chunk) => {
      console.log('Received streaming chunk');
    });
  },
});

// Allow the agent worker to be run directly from the CLI.
// (This file uses ES modules, so the CommonJS `require.main === module` check is not available.)
cli.runApp(new WorkerOptions({
  agent: fileURLToPath(import.meta.url),
  workerType: JobType.JT_ROOM,
}));

This code sets up a LiveKit agent that can process both text and audio using OpenAI's real-time streaming model. Let's break down the key streaming components:

  • streaming: { enabled: true }: The critical configuration that enables real-time streaming of responses as they're generated
  • partialUpdateInterval: Controls how frequently partial updates are sent (lower values = more responsive but more network traffic)
  • continuousTalking: Enables more natural conversation flow without rigid turn-taking
  • turnDetection: Fine-tuned parameters for detecting when the user has finished speaking
  • streaming events: Event listeners that provide hooks into the streaming process

The streaming configuration is what enables our assistant to respond in real-time as the user is speaking, rather than waiting for complete utterances. This creates the natural, flowing conversation experience that differentiates our system from traditional voice assistants.
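
To make those streaming events useful on the frontend, the agent also needs to forward partial text to the room. The exact event names and payload shapes depend on your agent framework version, so treat the following as a sketch placed inside the agent's entry function after agent.start(ctx.room): it republishes whatever partial-output events the model emits as streaming_start / streaming_chunk / streaming_end data messages, which is the shape the transcript component later in this guide expects.

// Sketch: bridge model text deltas to the frontend over the LiveKit data channel.
// The 'streaming_*' agent events below are placeholders for whatever partial-output
// hooks your agent framework version actually exposes.
const encoder = new TextEncoder();
let assistantText = '';

const publish = (payload: object) => {
  // Reliable delivery so transcript chunks arrive in order
  ctx.room.localParticipant?.publishData(
    encoder.encode(JSON.stringify(payload)),
    { reliable: true }
  );
};

agent.on('streaming_started', () => {
  assistantText = '';
  publish({ type: 'streaming_start' });
});

agent.on('streaming_chunk', (chunk: { text?: string }) => {
  assistantText += chunk.text ?? '';
  // Send the accumulated text so the client can simply replace the message body
  publish({ type: 'streaming_chunk', text: assistantText });
});

agent.on('streaming_ended', () => {
  publish({ type: 'streaming_end', text: assistantText });
});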

Figure 4: Comparison of response timing with and without streaming

Creating the Backend API

Now we need to create API endpoints to interact with our LiveKit services. Let's set up the necessary routes:

First, create an API route to generate LiveKit access tokens:

// app/api/token/route.ts
import { AccessToken } from 'livekit-server-sdk';
import { NextRequest, NextResponse } from 'next/server';

export async function POST(req: NextRequest) {
  try {
    const { roomName, participantName } = await req.json();

    // Validate request data
    if (!roomName || !participantName) {
      return NextResponse.json(
        { error: 'roomName and participantName are required' },
        { status: 400 }
      );
    }

    // Create a new access token
    const apiKey = process.env.LIVEKIT_API_KEY!;
    const apiSecret = process.env.LIVEKIT_API_SECRET!;

    const token = new AccessToken(apiKey, apiSecret, {
      identity: participantName,
    });

    // Grant publish and subscribe permissions
    token.addGrant({
      roomJoin: true,
      room: roomName,
      canPublish: true,
      canSubscribe: true,
    });

    // Return the token (toJwt() is async in recent versions of livekit-server-sdk)
    return NextResponse.json({ token: await token.toJwt() });
  } catch (error) {
    console.error('Error generating token:', error);
    return NextResponse.json(
      { error: 'Failed to generate token' },
      { status: 500 }
    );
  }
}

Next, create an API endpoint to start our voice agent for a specific room:

// app/api/agent/start/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { LivekitAgentService } from '@livekit/agents';

export async function POST(req: NextRequest) {
  try {
    const { roomName } = await req.json();

    if (!roomName) {
      return NextResponse.json(
        { error: 'roomName is required' },
        { status: 400 }
      );
    }

    // Initialize the LiveKit Agent Service
    const agentService = new LivekitAgentService({
      apiKey: process.env.LIVEKIT_API_KEY!,
      apiSecret: process.env.LIVEKIT_API_SECRET!,
      livekitUrl: process.env.LIVEKIT_URL!,
    });

    // Start the agent for the given room
    const jobId = await agentService.startWorker({
      type: 'room',
      room: roomName,
      agentPath: '../../agents/voice-agent.ts', // Path to our agent file
    });

    return NextResponse.json({ jobId });
  } catch (error) {
    console.error('Error starting agent:', error);
    return NextResponse.json(
      { error: 'Failed to start agent' },
      { status: 500 }
    );
  }
}

Building the Frontend

Now let's create the React components for our voice assistant interface. We'll need to create:

  1. A LiveKit provider component for managing connections
  2. A conversation interface with transcripts
  3. Audio controls for the user

First, let's create a LiveKit context provider:

// components/LiveKitProvider.tsx import React, { createContext, useContext, useState, useEffect } from 'react'; import { Room, RoomEvent, RemoteParticipant, LocalParticipant } from 'livekit-client'; interface LiveKitContextType { room: Room | null; connect: (token: string, url: string) => Promise<void>; disconnect: () => void; isConnected: boolean; localParticipant: LocalParticipant | null; remoteParticipants: RemoteParticipant[]; } const LiveKitContext = createContext<LiveKitContextType | null>(null); export function useLiveKit() { const context = useContext(LiveKitContext); if (!context) { throw new Error('useLiveKit must be used within a LiveKitProvider'); } return context; } export function LiveKitProvider({ children }: { children: React.ReactNode }) { const [room] = useState(() => new Room()); const [isConnected, setIsConnected] = useState(false); const [remoteParticipants, setRemoteParticipants] = useState<RemoteParticipant[]>([]); useEffect(() => { // Set up event listeners for the room room.on(RoomEvent.ParticipantConnected, () => { setRemoteParticipants(Array.from(room.remoteParticipants.values())); }); room.on(RoomEvent.ParticipantDisconnected, () => { setRemoteParticipants(Array.from(room.remoteParticipants.values())); }); room.on(RoomEvent.Connected, () => { setIsConnected(true); }); room.on(RoomEvent.Disconnected, () => { setIsConnected(false); }); return () => { room.off(RoomEvent.ParticipantConnected); room.off(RoomEvent.ParticipantDisconnected); room.off(RoomEvent.Connected); room.off(RoomEvent.Disconnected); }; }, [room]); const connect = async (token: string, url: string) => { try { await room.connect(url, token); console.log('Connected to LiveKit room'); } catch (error) { console.error('Failed to connect to LiveKit room:', error); throw error; } }; const disconnect = () => { room.disconnect(); }; const value = { room, connect, disconnect, isConnected, localParticipant: room.localParticipant, remoteParticipants, }; return ( <LiveKitContext.Provider value={value}>{children}</LiveKitContext.Provider> ); }

Next, let's create a component for the conversation transcript that supports real-time streaming updates:

// components/ConversationTranscript.tsx import React, { useEffect, useState, useRef } from 'react'; import { useLiveKit } from './LiveKitProvider'; import { DataPacket_Kind, RemoteParticipant, Track } from 'livekit-client'; interface Message { id: string; sender: string; text: string; timestamp: Date; isUser: boolean; isStreaming: boolean; // Flag to indicate if a message is currently streaming streamingComplete: boolean; // Flag to indicate if streaming is complete } export default function ConversationTranscript() { const { room, localParticipant, remoteParticipants } = useLiveKit(); const [messages, setMessages] = useState<Message[]>([]); const transcriptRef = useRef<HTMLDivElement>(null); const activeStreamingMessageId = useRef<string | null>(null); // Function to add a new message const addMessage = (sender: string, text: string, isUser: boolean, isStreaming = false) => { const newMessageId = `${Date.now()}-${Math.random()}`; const newMessage: Message = { id: newMessageId, sender, text, timestamp: new Date(), isUser, isStreaming, streamingComplete: !isStreaming, }; if (isStreaming) { activeStreamingMessageId.current = newMessageId; } setMessages((prev) => [...prev, newMessage]); return newMessageId; }; // Function to update a streaming message const updateStreamingMessage = (id: string, newText: string, complete = false) => { setMessages((prev) => prev.map((message) => message.id === id ? { ...message, text: newText, streamingComplete: complete, isStreaming: !complete } : message ) ); if (complete) { activeStreamingMessageId.current = null; } }; // Set up data channel for receiving text messages useEffect(() => { if (!room || !localParticipant) return; // Set up local data channel to send user messages const sendData = (text: string) => { if (room && localParticipant) { // Send the message via LiveKit data channel const data = JSON.stringify({ type: 'message', text }); room.localParticipant.publishData( new TextEncoder().encode(data), DataPacket_Kind.RELIABLE ); // Add to local transcript addMessage(localParticipant.identity || 'You', text, true); } }; // Handle streaming messages from the assistant const onStreamingDataReceived = (payload: Uint8Array, participant: RemoteParticipant) => { try { const data = JSON.parse(new TextDecoder().decode(payload)); if (data.type === 'streaming_start') { // Create a new streaming message const messageId = addMessage( participant.identity || 'Assistant', '', // Empty initial text false, true // Mark as streaming ); } else if (data.type === 'streaming_chunk') { // Update existing streaming message with new content if (activeStreamingMessageId.current) { updateStreamingMessage( activeStreamingMessageId.current, data.text, // Partial text from streaming chunk false // Not complete yet ); } } else if (data.type === 'streaming_end') { // Finalize the streaming message if (activeStreamingMessageId.current) { updateStreamingMessage( activeStreamingMessageId.current, data.text, // Final complete text true // Mark as complete ); } } else if (data.type === 'message') { // Legacy non-streaming message addMessage(participant.identity || 'Assistant', data.text, false); } } catch (error) { console.error('Error parsing received data:', error); } }; // Subscribe to remote participants' data remoteParticipants.forEach((participant) => { participant.on('dataReceived', (payload) => onStreamingDataReceived(payload, participant)); }); // Set up audio streaming const setupAudioStreaming = () => { remoteParticipants.forEach((participant) => { 
participant.audioTracks.forEach((track) => { // Handle incoming audio tracks if (track.kind === Track.Kind.Audio) { // Subscribe to audio track automatically track.setSubscribed(true); // Handle audio elements for streaming playback const audioElement = new Audio(); audioElement.autoplay = true; // Attach track to audio element for streaming playback track.track?.attach(audioElement); // Setup ended event to clean up audioElement.addEventListener('ended', () => { track.track?.detach(audioElement); }); } }); // Listen for new tracks participant.on('trackSubscribed', (track) => { if (track.kind === Track.Kind.Audio) { // New audio track arrived, attach it for streaming playback const audioElement = new Audio(); audioElement.autoplay = true; track.attach(audioElement); } }); }); }; // Initialize audio streaming setupAudioStreaming(); // Clean up event listeners return () => { remoteParticipants.forEach((participant) => { participant.off('dataReceived'); participant.off('trackSubscribed'); // Detach all audio tracks participant.audioTracks.forEach((publication) => { publication.track?.detach(); }); }); }; }, [room, localParticipant, remoteParticipants]); // Auto-scroll to the latest message useEffect(() => { if (transcriptRef.current) { transcriptRef.current.scrollTop = transcriptRef.current.scrollHeight; } }, [messages]); return ( <div className="w-full max-w-3xl mx-auto"> <div className="border rounded-lg p-4 h-96 overflow-y-auto bg-slate-900" ref={transcriptRef}> {messages.length === 0 ? ( <div className="text-center text-gray-400 my-8"> Your conversation will appear here </div> ) : ( messages.map((message) => ( <div key={message.id} className={`mb-4 ${message.isUser ? 'text-right' : 'text-left'}`} > <div className={`inline-block px-4 py-2 rounded-lg ${ message.isUser ? 'bg-blue-600 text-white rounded-br-none' : 'bg-gray-700 text-white rounded-bl-none' }`}> <div className="text-sm font-semibold flex items-center gap-2"> {message.sender} {message.isStreaming && ( <span className="inline-block w-2 h-2 bg-green-400 rounded-full animate-pulse" title="Streaming response"></span> )} </div> <div> {message.text} {message.isStreaming && ( <span className="inline-block w-1 h-4 bg-white ml-1 animate-blink"></span> )} </div> <div className="text-xs opacity-70 mt-1"> {message.timestamp.toLocaleTimeString()} </div> </div> </div> )) )} </div> {/* Streaming status indicator */} {messages.some(m => m.isStreaming) && ( <div className="text-center text-sm text-green-400 mt-2 flex items-center justify-center gap-2"> <span className="inline-block w-2 h-2 bg-green-400 rounded-full animate-pulse"></span> Streaming response in real-time... </div> )} </div> ); }

This enhanced conversation transcript component is specifically designed to handle streaming messages. Let's break down the key streaming-focused features:

  • Streaming Message States: We track whether messages are currently streaming or complete
  • Real-time Updates: The component updates messages incrementally as new chunks arrive
  • Streaming Indicators: Visual cues like pulsing dots and a blinking cursor show when a message is streaming
  • Different Message Types: Handles different streaming events (start, chunk, end) to create a smooth experience
  • Audio Streaming: Automatically attaches and plays audio tracks as they arrive

With this implementation, users will see the AI's response appearing in real-time, letter by letter, creating a much more engaging and responsive experience than waiting for complete messages.
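
One small detail: the blinking cursor above relies on an animate-blink utility class, which is not part of Tailwind's default animation set (the other class names in these components assume Tailwind). A minimal tailwind.config.ts extension that defines it could look like this:

// tailwind.config.ts - adds the custom "blink" animation used by the transcript cursor
import type { Config } from 'tailwindcss';

const config: Config = {
  content: ['./app/**/*.{ts,tsx}', './components/**/*.{ts,tsx}'],
  theme: {
    extend: {
      keyframes: {
        blink: {
          '0%, 100%': { opacity: '1' },
          '50%': { opacity: '0' },
        },
      },
      animation: {
        // Enables the `animate-blink` class
        blink: 'blink 1s step-start infinite',
      },
    },
  },
  plugins: [],
};

export default config;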

Figure 5: Live streaming text response with visual indicators

Now, let's create an enhanced audio control panel component that supports streaming audio for real-time voice interaction. We'll break it down into smaller, digestible parts:

1. Component Setup and State

// components/VoiceControls.tsx import React, { useState, useEffect, useRef } from 'react'; import { useLiveKit } from './LiveKitProvider'; import { LocalTrack, Track, DataPacket_Kind } from 'livekit-client'; export default function VoiceControls() { // Get LiveKit context const { room, localParticipant, remoteParticipants } = useLiveKit(); // Audio state const [isMuted, setIsMuted] = useState(true); const [isPublishing, setIsPublishing] = useState(false); const [microphoneTrack, setMicrophoneTrack] = useState<LocalTrack | null>(null); // Streaming state const [isStreaming, setIsStreaming] = useState(false); const [streamingLevel, setStreamingLevel] = useState(0); // Audio processing references const audioLevelInterval = useRef<NodeJS.Timeout | null>(null); const audioAnalyser = useRef<AnalyserNode | null>(null); const audioContext = useRef<AudioContext | null>(null);

2. Audio Processing Setup for Streaming

This function sets up real-time audio processing to detect voice activity and visualize audio levels:

// Initialize audio streaming processing const setupAudioProcessing = async (track: LocalTrack) => { if (!track || !track.mediaStreamTrack) return; try { // Create audio context for stream processing audioContext.current = new AudioContext(); // Create media stream source from microphone track const stream = new MediaStream([track.mediaStreamTrack]); const source = audioContext.current.createMediaStreamSource(stream); // Create analyser for audio level monitoring const analyser = audioContext.current.createAnalyser(); analyser.fftSize = 256; analyser.smoothingTimeConstant = 0.8; // More smoothing for level visualization source.connect(analyser); audioAnalyser.current = analyser; // Set up interval to continuously monitor audio levels const dataArray = new Uint8Array(analyser.frequencyBinCount); startAudioMonitoring(dataArray); } catch (error) { console.error('Error setting up audio processing:', error); } };

3. Continuous Audio Level Monitoring

This function continuously analyzes the audio stream to detect when a user is speaking:

// Monitor audio levels and detect voice activity const startAudioMonitoring = (dataArray: Uint8Array) => { audioLevelInterval.current = setInterval(() => { if (audioAnalyser.current && !isMuted) { audioAnalyser.current.getByteFrequencyData(dataArray); // Calculate average audio level let sum = 0; for (let i = 0; i < dataArray.length; i++) { sum += dataArray[i]; } const average = sum / dataArray.length; // Scale to 0-10 for visualization const level = Math.floor((average / 255) * 10); setStreamingLevel(level); // Voice activity detection logic handleVoiceActivity(level); } else { setStreamingLevel(0); } }, 100); // Check every 100ms for responsive detection }; // Handle voice activity detection for streaming const handleVoiceActivity = (level: number) => { // If voice detected and we're not already streaming, start stream if (level > 2 && !isStreaming) { startStreamingSession(); } // If voice stops and we were streaming, end stream with delay else if (level <= 1 && isStreaming) { // Add slight delay to prevent stopping on brief pauses debounceStreamingEnd(); } };

4. Stream Session Management

These functions handle starting and ending streaming sessions with notifications:

// Start streaming session and notify system const startStreamingSession = () => { setIsStreaming(true); // Notify system of streaming start through data channel if (room && localParticipant) { const data = JSON.stringify({ type: 'streaming_start', timestamp: Date.now() }); localParticipant.publishData( new TextEncoder().encode(data), DataPacket_Kind.RELIABLE ); } }; // Debounce stream ending to avoid false stops const debounceStreamingEnd = () => { setTimeout(() => { // Re-check audio level to confirm it's still low if (audioAnalyser.current) { const confirmArray = new Uint8Array(audioAnalyser.current.frequencyBinCount); audioAnalyser.current.getByteFrequencyData(confirmArray); const average = confirmArray.reduce((a, b) => a + b, 0) / confirmArray.length; const level = Math.floor((average / 255) * 10); // If level is still low, end streaming if (level <= 1) { endStreamingSession(); } } }, 800); // Wait to make sure it's not just a brief pause }; // End streaming session and notify system const endStreamingSession = () => { setIsStreaming(false); // Notify system of streaming end if (room && localParticipant) { const data = JSON.stringify({ type: 'streaming_end', timestamp: Date.now() }); localParticipant.publishData( new TextEncoder().encode(data), DataPacket_Kind.RELIABLE ); } };

5. Microphone Control

This function handles toggling the microphone with optimized streaming settings:

// Start or stop audio streaming publication const toggleMicrophone = async () => { if (!room || !localParticipant) return; try { if (isMuted) { // Start publishing streaming audio await startMicrophone(); } else { // Stop streaming and clean up await stopMicrophone(); } } catch (error) { console.error('Error toggling streaming microphone:', error); setIsPublishing(false); } }; // Start microphone with streaming optimizations const startMicrophone = async () => { // If currently muted, start publishing streaming audio if (!microphoneTrack) { setIsPublishing(true); // Create a new microphone track with streaming options const tracks = await LocalTrack.createMicrophoneTrack({ // Enable high quality audio for better streaming results echoCancellation: true, noiseSuppression: true, autoGainControl: true, // Set higher quality for better streaming experience sampleRate: 48000, channelCount: 1, }); // Setup streaming audio visualization before publishing await setupAudioProcessing(tracks); // Publish the track to the room with streaming metadata await localParticipant.publishTrack(tracks, { name: 'streaming-audio', // Setting lower simulcast for audio but higher quality streamingOption: { maximumBitrate: 64000, // 64kbps is good for voice priority: Track.Priority.HIGH, } }); setMicrophoneTrack(tracks); setIsPublishing(false); } else { // If we already have a track, just resume it with streaming await microphoneTrack.unmute(); setupAudioProcessing(microphoneTrack); } setIsMuted(false); }; // Stop microphone and clean up resources const stopMicrophone = async () => { // If unmuted, mute the track and stop streaming if (microphoneTrack) { await microphoneTrack.mute(); endStreamingSession(); } // Clean up audio processing cleanupAudioProcessing(); setStreamingLevel(0); setIsMuted(true); }; // Clean up audio processing resources const cleanupAudioProcessing = () => { if (audioLevelInterval.current) { clearInterval(audioLevelInterval.current); audioLevelInterval.current = null; } if (audioContext.current) { audioContext.current.close(); audioContext.current = null; } };

6. Component Cleanup and UI Rendering

Handle cleanup on unmount and render the visual interface:

// Clean up tracks and audio processing when component unmounts useEffect(() => { return () => { if (microphoneTrack) { microphoneTrack.stop(); } cleanupAudioProcessing(); }; }, [microphoneTrack]); return ( <div className="mt-6 flex flex-col items-center"> {/* Audio level visualization meter */} {!isMuted && ( <div className="mb-4 w-64 h-8 bg-gray-800 rounded-full p-1 flex items-center"> <div className={`h-6 rounded-full transition-all duration-100 ${ isStreaming ? 'bg-green-500' : 'bg-blue-500' }`} style={{ width: `${streamingLevel * 10}%` }} > {streamingLevel > 4 && ( <div className="text-xs text-white text-center w-full"> {isStreaming ? 'Streaming' : 'Listening'} </div> )} </div> </div> )} {/* Microphone toggle button */} <button onClick={toggleMicrophone} disabled={isPublishing || !room} className={`px-6 py-3 rounded-full flex items-center ${ isMuted ? 'bg-green-600 hover:bg-green-700' : 'bg-red-600 hover:bg-red-700' } text-white font-medium transition-colors`} > {isPublishing ? ( <span>Initializing microphone for streaming...</span> ) : ( <> {isMuted ? ( <> <svg xmlns="http://www.w3.org/2000/svg" className="h-5 w-5 mr-2" viewBox="0 0 20 20" fill="currentColor" > <path fillRule="evenodd" d="M7 4a3 3 0 016 0v4a3 3 0 11-6 0V4zm4 10.93A7.001 7.001 0 0017 8a1 1 0 10-2 0A5 5 0 015 8a1 1 0 00-2 0 7.001 7.001 0 006 6.93V17H6a1 1 0 100 2h8a1 1 0 100-2h-3v-2.07z" clipRule="evenodd" /> </svg> Start Streaming Audio </> ) : ( <> <svg xmlns="http://www.w3.org/2000/svg" className="h-5 w-5 mr-2" viewBox="0 0 20 20" fill="currentColor" > <path fillRule="evenodd" d="M13.477 14.89A6 6 0 015.11 6.524l8.367 8.368zm1.414-1.414L6.524 5.11a6 6 0 018.367 8.367zM18 10a8 8 0 11-16 0 8 8 0 0116 0z" clipRule="evenodd" /> </svg> Stop Streaming Audio </> )} </> )} </button> {/* Streaming status indicator */} {isStreaming && ( <div className="text-xs text-green-400 animate-pulse mt-2"> Real-time streaming active - AI is processing as you speak </div> )} {/* Streaming stats */} {!isMuted && ( <div className="mt-4 text-xs text-gray-400 text-center"> <div>Audio quality: 48kHz mono | Bitrate: 64kbps</div> <div>Connected participants: {remoteParticipants.length + 1}</div> </div> )} </div> ); }

Breaking down the code into smaller, focused sections makes it easier to understand how the streaming audio controls work. Each section handles a specific aspect of the streaming functionality:

  1. Setup and State - Initializes the component with appropriate state variables
  2. Audio Processing - Sets up the Web Audio API for real-time audio analysis
  3. Audio Monitoring - Continuously monitors audio levels to detect speech
  4. Stream Management - Handles starting and stopping streaming sessions
  5. Microphone Control - Manages the microphone with optimized streaming settings
  6. UI Rendering - Provides visual feedback on streaming status

This enhanced voice controls component adds several streaming-specific features:

  • Audio Visualization: Real-time meter showing audio levels during streaming
  • Stream State Management: Detects when voice is active and sends streaming events
  • Enhanced Audio Quality: Configures higher quality audio settings for better streaming
  • Web Audio API Integration: Uses the Web Audio API for advanced audio analysis
  • Visual Feedback: Shows active streaming status for better user experience

The enhanced audio processing is especially important for streaming applications. Instead of only capturing and sending audio, we're continuously analyzing the audio stream to detect speech, visualize audio levels, and provide feedback on streaming status.

Figure 6: Real-time audio streaming controls with voice level visualization

Finally, let's create a main component that brings everything together:

// components/VoiceAssistant.tsx import React, { useState, useEffect } from 'react'; import { LiveKitProvider, useLiveKit } from './LiveKitProvider'; import ConversationTranscript from './ConversationTranscript'; import VoiceControls from './VoiceControls'; function AssistantInner() { const { connect, disconnect, isConnected } = useLiveKit(); const [isJoining, setIsJoining] = useState(false); const [error, setError] = useState<string | null>(null); const [roomName, setRoomName] = useState(''); const [userName, setUserName] = useState(''); // Function to join a room const joinRoom = async () => { if (!roomName || !userName) { setError('Room name and user name are required'); return; } setIsJoining(true); setError(null); try { // Get token from our API const tokenResponse = await fetch('/api/token', { method: 'POST', headers: { 'Content-Type': 'application/json', }, body: JSON.stringify({ roomName, participantName: userName, }), }); if (!tokenResponse.ok) { throw new Error('Failed to get token'); } const { token } = await tokenResponse.json(); // Connect to the LiveKit room await connect(token, process.env.NEXT_PUBLIC_LIVEKIT_URL!); // Start the agent await fetch('/api/agent/start', { method: 'POST', headers: { 'Content-Type': 'application/json', }, body: JSON.stringify({ roomName, }), }); } catch (err) { console.error('Error joining room:', err); setError(err instanceof Error ? err.message : 'Failed to join room'); } finally { setIsJoining(false); } }; // Function to leave the room const leaveRoom = () => { disconnect(); }; return ( <div className="max-w-4xl mx-auto p-4"> <div className="text-center mb-8"> <h1 className="text-3xl font-bold mb-2">AI Voice Assistant</h1> <p className="text-gray-400"> Talk with an AI assistant using your voice </p> </div> {!isConnected ? ( <div className="bg-slate-800 p-6 rounded-lg shadow-lg max-w-md mx-auto"> <h2 className="text-xl font-semibold mb-4">Join a Room</h2> {error && ( <div className="bg-red-600 bg-opacity-20 border border-red-400 text-red-200 px-4 py-2 rounded mb-4"> {error} </div> )} <div className="space-y-4"> <div> <label htmlFor="roomName" className="block text-sm font-medium mb-1"> Room Name </label> <input type="text" id="roomName" value={roomName} onChange={(e) => setRoomName(e.target.value)} placeholder="Enter a room name" className="w-full px-4 py-2 bg-slate-700 rounded border border-slate-600 focus:ring-2 focus:ring-blue-500 focus:border-transparent" disabled={isJoining} /> </div> <div> <label htmlFor="userName" className="block text-sm font-medium mb-1"> Your Name </label> <input type="text" id="userName" value={userName} onChange={(e) => setUserName(e.target.value)} placeholder="Enter your name" className="w-full px-4 py-2 bg-slate-700 rounded border border-slate-600 focus:ring-2 focus:ring-blue-500 focus:border-transparent" disabled={isJoining} /> </div> <button onClick={joinRoom} disabled={isJoining} className="w-full py-2 px-4 bg-blue-600 hover:bg-blue-700 text-white font-medium rounded transition-colors disabled:opacity-50" > {isJoining ? 'Joining...' 
: 'Join Room'} </button> </div> </div> ) : ( <div className="space-y-6"> <div className="flex justify-between items-center"> <h2 className="text-xl font-semibold"> Room: <span className="text-blue-400">{roomName}</span> </h2> <button onClick={leaveRoom} className="px-4 py-2 bg-red-600 hover:bg-red-700 text-white font-medium rounded-md transition-colors" > Leave Room </button> </div> {/* Conversation transcript */} <ConversationTranscript /> {/* Voice controls */} <VoiceControls /> </div> )} </div> ); } export default function VoiceAssistant() { return ( <LiveKitProvider> <AssistantInner /> </LiveKitProvider> ); }

Creating the Main Page

Let's implement our main page component that will use our voice assistant:

// app/page.tsx
'use client';

import dynamic from 'next/dynamic';
import { Suspense } from 'react';

// Dynamically import the voice assistant component to avoid SSR issues
// with browser APIs like getUserMedia
const VoiceAssistant = dynamic(
  () => import('@/components/VoiceAssistant'),
  { ssr: false }
);

export default function Home() {
  return (
    <main className="min-h-screen p-4">
      <Suspense fallback={<div>Loading voice assistant...</div>}>
        <VoiceAssistant />
      </Suspense>
    </main>
  );
}

Note that we're using Next.js's dynamic import to avoid server-side rendering of components that use browser-specific APIs like getUserMedia.

Figure 7: Room joining interface for the voice assistant

Figure 8: Conversation interface with transcript and voice controls

Advanced Streaming Features and Optimizations

Now that we have a basic real-time voice assistant with streaming capabilities, let's explore advanced streaming features and optimizations that will make our application truly responsive and natural:

1. Optimizing Streaming Performance

To achieve the lowest possible latency in streaming responses, we need to fine-tune several parameters:

// Optimize streaming configuration in agent definition
streaming: {
  enabled: true,
  // Lower values = more responsive but higher bandwidth usage
  // For voice assistants, 2-3 is ideal for real-time feel
  partialUpdateInterval: 2,
  // Enable chunked transfer encoding for faster delivery
  chunkedTransfer: true,
  // Prioritize partial content delivery
  prioritizePartialDelivery: true,
  // Specify response format for most efficient parsing
  responseFormat: {
    type: "text",
    chunk_size: 20, // Characters per chunk, a balance between smoothness and overhead
  },
  // Pipeline optimization
  pipelineConfig: {
    // Reduce overhead by avoiding full validation between chunks
    validatePartialResponses: false,
    // Process chunks in parallel when possible
    parallelProcessing: true,
  }
}

The parameters above are crucial for achieving the lowest possible streaming latency. For voice assistants, responsiveness is key, so we prioritize speed over bandwidth efficiency.
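
Whichever of these knobs your model version actually exposes, it is worth measuring the latency you care about rather than guessing: time to the first streamed chunk and the average gap between chunks. The helper below is a framework-agnostic sketch that wraps any async iterable of response chunks:

// Measure time-to-first-chunk and inter-chunk gaps for any streaming response
async function* instrumentStream<T>(
  stream: AsyncIterable<T>,
  onStats: (stats: { firstChunkMs: number; avgGapMs: number; chunks: number }) => void
): AsyncGenerator<T> {
  const start = performance.now();
  let last = start;
  let firstChunkMs = -1;
  let gapTotal = 0;
  let chunks = 0;

  for await (const chunk of stream) {
    const now = performance.now();
    if (firstChunkMs < 0) {
      firstChunkMs = now - start; // latency until the user hears/sees anything
    } else {
      gapTotal += now - last; // smoothness of the stream after it starts
    }
    last = now;
    chunks++;
    yield chunk;
  }

  onStats({
    firstChunkMs,
    avgGapMs: chunks > 1 ? gapTotal / (chunks - 1) : 0,
    chunks,
  });
}

// Usage: for await (const chunk of instrumentStream(responseChunks, console.log)) { render(chunk); }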

2. Enhanced Voice Activity Detection for Streaming

To enable truly natural conversation flow, we need sophisticated voice activity detection that can determine when to start and stop streaming:

// Enhanced streaming-optimized VAD configuration
turnDetection: {
  // Server-side VAD offers better accuracy for streaming
  type: 'server_vad',
  // Fine-tuned confidence threshold (0.0-1.0)
  // Lower = more responsive but may trigger on background noise
  // Higher = fewer false triggers but may miss the start of speech
  threshold: 0.4,
  // How long of silence before considering the turn complete (ms)
  // Shorter makes streaming more responsive but may split utterances
  silence_duration_ms: 400,
  // Include audio before detected speech for better context (ms)
  prefix_padding_ms: 200,
  // Advanced settings for streaming optimization
  streaming_specific: {
    // Start streaming response after this duration of speech (ms)
    // This allows the AI to start formulating responses mid-sentence
    early_response_ms: 600,
    // Continue listening for this duration after speech ends (ms)
    // Helps maintain context between pauses
    continuation_window_ms: 2000,
    // Enable continuous speech detection for more natural conversation
    continuous_detection: true,
    // Transition phase length between speakers (ms)
    // Controls how much overlap is allowed in conversation
    transition_phase_ms: 300,
  }
}

These VAD enhancements allow the assistant to begin generating responses while the user is still talking, similar to how humans begin formulating responses before the other person has finished speaking.
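
To make the interplay of threshold, silence_duration_ms, and prefix_padding_ms concrete, here is a minimal client-side sketch of the same turn-taking logic driven by raw audio levels. It is an energy-based stand-in for a real VAD model, and the parameter names simply mirror the configuration above:

// Minimal energy-based turn detector mirroring threshold / silence_duration_ms / prefix_padding_ms
type TurnEvents = { onSpeechStart: () => void; onTurnEnd: () => void };

function createTurnDetector(
  { threshold = 0.4, silenceDurationMs = 400, prefixPaddingMs = 200 } = {},
  events: TurnEvents
) {
  let speaking = false;
  let silenceStart: number | null = null;
  const prefixBuffer: Float32Array[] = []; // audio kept from just before speech started

  return {
    // Call this for every audio frame with its normalized level (0.0-1.0)
    push(frame: Float32Array, level: number, frameMs: number) {
      if (!speaking) {
        prefixBuffer.push(frame);
        // Keep roughly prefixPaddingMs worth of audio from before detected speech
        while (prefixBuffer.length * frameMs > prefixPaddingMs) prefixBuffer.shift();
        if (level >= threshold) {
          speaking = true;
          silenceStart = null;
          events.onSpeechStart(); // send prefixBuffer plus subsequent frames upstream
        }
      } else if (level < threshold) {
        silenceStart ??= performance.now();
        if (performance.now() - silenceStart >= silenceDurationMs) {
          speaking = false;
          prefixBuffer.length = 0;
          events.onTurnEnd();
        }
      } else {
        silenceStart = null; // speech resumed before the silence window elapsed
      }
    },
  };
}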

Figure 9: Optimized streaming response timing with conversation overlap

3. Audio Streaming Optimizations

For the smoothest audio streaming experience, we need to fine-tune the audio processing:

// Create optimized audio streaming configuration const createOptimizedAudioStream = () => { // Configure audio streaming parameters const streamingConfig = { // Audio format settings audio: { // Higher sample rate for better quality voice sampleRate: 48000, // Single channel for voice is sufficient channels: 1, // Using opus codec for efficient streaming codec: 'opus', // Bitrate balanced for voice quality and bandwidth bitrate: 64000, // Frame size optimized for low latency frameSize: 20, // ms }, // Buffer settings buffer: { // Set initial buffer size // Smaller = less latency but more risk of interruptions // Larger = smoother playback but more latency initial: 300, // ms // Minimum buffer before playback starts minimum: 100, // ms // Dynamic buffer adjustment for network conditions adaptive: true, // How quickly buffer adapts (0-1, higher = faster adaptation) adaptationRate: 0.6, }, // Network optimizations network: { // Send multiple small packets rather than waiting for larger ones prioritizeLowLatency: true, // Use WebRTC data channels for control messages (faster than REST) useDataChannels: true, // Prioritize audio packets in congested networks qualityOfService: 'high', // Reconnection strategy reconnection: { attempts: 5, backoff: 'exponential', maxDelay: 3000, // ms }, }, }; return streamingConfig; }; // Apply configuration to audio stream const applyStreamingOptimizations = (audioTrack) => { const config = createOptimizedAudioStream(); // Apply codec and bitrate settings audioTrack.setEncodingParams({ maxBitrate: config.audio.bitrate, codecName: config.audio.codec, }); // Create optimized audio processor const audioContext = new AudioContext({ sampleRate: config.audio.sampleRate, latencyHint: 'interactive', }); // Create buffer with optimal settings const bufferSource = audioContext.createBufferSource(); bufferSource.buffer = audioContext.createBuffer( config.audio.channels, config.buffer.initial * config.audio.sampleRate / 1000, config.audio.sampleRate ); // Connect audio processing nodes optimized for streaming const streamProcessor = audioContext.createScriptProcessor(1024, 1, 1); streamProcessor.onaudioprocess = (e) => { // Process audio chunks with minimal latency // This function runs in the audio thread for real-time performance }; return { track: audioTrack, context: audioContext, processor: streamProcessor }; };

4. Streaming Function Calling

To maintain real-time responsiveness even when calling external functions, we can implement streaming function calls:

Figure 10: Parallel function execution with streaming responses

// Define streaming-optimized functions for real-time use const streamingFunctions = [ { name: 'stream_search', description: 'Search for information with streaming results', parameters: { type: 'object', properties: { query: { type: 'string', description: 'The search query', }, }, required: ['query'], }, // Streaming function implementation streamingHandler: async function*(query: string) { // Initialize search const search = initializeSearch(query); // Yield initial quick results immediately yield { partial: true, results: search.quickResults() }; // Start deeper search in background const searchPromise = search.executeFullSearch(); // Stream partial results as they become available for await (const partialResult of search.streamResults()) { yield { partial: true, results: partialResult, confidence: partialResult.confidence }; } // Final complete results const finalResults = await searchPromise; yield { partial: false, results: finalResults }; } }, { name: 'get_weather', description: 'Get weather information for a location with streaming updates', parameters: { type: 'object', properties: { location: { type: 'string', description: 'The city and state, e.g. San Francisco, CA', }, }, required: ['location'], }, // Stream results as they become available streamingHandler: async function*(location: string) { // Yield initial estimate from cache immediately yield { partial: true, source: 'cache', temperature: await getEstimatedTemperature(location), conditions: 'Loading...' }; // Make API request in background const weatherPromise = fetchWeatherData(location); // Stream partial weather data as it loads yield { partial: true, source: 'preliminary', temperature: await getRealtimeTemperature(location), conditions: await getBasicConditions(location) }; // Final complete weather data const weather = await weatherPromise; yield { partial: false, source: 'api', temperature: weather.temperature, conditions: weather.conditions, forecast: weather.forecast }; } } ]; // Configure streaming function calling model: new openai.realtime.RealtimeModel({ // ... other options // Enable streaming function calling functions: streamingFunctions, function_calling: { enabled: true, // Enhanced options for streaming function calls streaming: { // Allow early function calling before user finishes speaking earlyInvocation: true, // Process streaming function results in parallel with ongoing generation parallelProcessing: true, // Interleave function results with text generation interleaveResults: true, // Invoke functions with partial parameter data allowPartialParams: true, // Queue multiple function calls in parallel parallelExecution: true, // Progress updates for long-running functions progressUpdates: true, } }, })

With streaming function calls, the assistant can call external APIs while continuing to generate responses, and can even start calling functions before the user has finished speaking. This creates a much more responsive and natural interaction.
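
Independent of any particular agent framework, the underlying pattern is a tool implemented as an async generator whose partial results are consumed with for await. A stripped-down sketch (the two weather fetchers are hypothetical placeholders):

// A streaming "tool": yields a quick partial result first, then the final answer
type WeatherUpdate = { partial: boolean; temperature?: number; conditions?: string };

async function* streamWeather(location: string): AsyncGenerator<WeatherUpdate> {
  // 1. Immediate rough answer from a local cache (placeholder function)
  yield { partial: true, temperature: await readCachedTemperature(location) };

  // 2. Full answer once the slower API call resolves (placeholder function)
  const weather = await fetchWeatherFromApi(location);
  yield { partial: false, temperature: weather.temperature, conditions: weather.conditions };
}

// The caller can speak each update as it arrives instead of waiting for the final one
async function answerWeatherQuestion(location: string, speak: (text: string) => void) {
  for await (const update of streamWeather(location)) {
    speak(
      update.partial
        ? `It's roughly ${update.temperature} degrees in ${location}...`
        : `Right now it's ${update.temperature} degrees and ${update.conditions} in ${location}.`
    );
  }
}

// Hypothetical data sources used above
declare function readCachedTemperature(location: string): Promise<number>;
declare function fetchWeatherFromApi(location: string): Promise<{ temperature: number; conditions: string }>;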

5. Multi-stream Conversation Context

For more natural conversations, we can implement a multi-stream approach that allows for overlapping turns and remembers context:

// Multi-stream conversation context configuration
conversation: {
  // Enable natural overlapping conversation
  continuousTalking: true,
  // How much conversation history to maintain (tokens)
  contextWindowSize: 4000,
  // Strategy for handling interruptions
  interruptionHandling: {
    // Allow user to interrupt AI responses
    allowUserInterruptions: true,
    // How to handle user interruptions
    // "pause", "abandon", "complete", "background"
    userInterruptBehavior: "pause",
    // Allow AI to politely interrupt the user at appropriate moments
    allowAiInterruptions: true,
    // When AI can interrupt (0-1, higher = more likely to interrupt)
    aiInterruptThreshold: 0.7,
  },
  // Enhanced memory model for streaming context
  memoryModel: {
    // Remember more recent conversation turns in detail
    recencyBias: 0.8,
    // Maintain entity knowledge across the conversation
    entityTracking: true,
    // Compress older conversation parts for efficiency
    compressionEnabled: true,
    // Hierarchical memory model for better context
    hierarchical: true,
    // Long-term conversation state persistence
    persistenceStrategy: "session",
  },
  // Context adaptation for personalization
  contextAdaptation: {
    // Adapt to user speaking style
    adaptToUserStyle: true,
    // Learn user preferences over time
    preferenceTracking: true,
    // Remember correction patterns
    learnFromCorrections: true,
  }
}

These advanced features represent the cutting edge of real-time voice assistant technology, creating a truly responsive and natural streaming experience that feels like talking to another person rather than a computer.
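
Most of these options depend on what your framework supports, but the core idea of a bounded, recency-weighted context can be sketched independently. The version below keeps recent turns verbatim and evicts the oldest ones once an approximate token budget is exceeded; the word-count tokenizer is a deliberate simplification:

// Minimal rolling conversation context: keep recent turns verbatim, drop old ones
interface Turn { role: 'user' | 'assistant'; text: string }

class ConversationContext {
  private turns: Turn[] = [];

  constructor(private maxTokens = 4000) {}

  add(turn: Turn) {
    this.turns.push(turn);
    // Evict the oldest turns once the (approximate) token budget is exceeded
    while (this.tokenCount() > this.maxTokens && this.turns.length > 1) {
      this.turns.shift();
    }
  }

  // Naive approximation: one token per whitespace-separated word
  private tokenCount() {
    return this.turns.reduce((sum, t) => sum + t.text.split(/\s+/).length, 0);
  }

  // Prompt-ready history, most recent turns last
  toPrompt() {
    return this.turns.map((t) => `${t.role}: ${t.text}`).join('\n');
  }
}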

Deployment and Production Considerations

When deploying your voice assistant to production, consider the following:

1. Scalability

For production deployments, you should set up a dedicated LiveKit server or use their cloud offering with appropriate scaling. Each voice conversation requires dedicated server resources.

2. Error Handling

Implement robust error handling for network disruptions, API failures, and device permission issues.

// Example of enhanced error handling for microphone permissions
const startMicrophone = async () => {
  try {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    // Process the stream
    return stream;
  } catch (error) {
    if (error instanceof DOMException) {
      if (error.name === 'NotAllowedError') {
        // Handle permission denied
        return { error: 'Microphone permission denied' };
      } else if (error.name === 'NotFoundError') {
        // Handle no microphone available
        return { error: 'No microphone found' };
      }
    }
    // Handle other errors
    console.error('Microphone error:', error);
    return { error: 'Failed to access microphone' };
  }
};
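
Device-permission failures are only one class of problem; network disruptions are just as common in voice applications. livekit-client emits room lifecycle events that you can surface in the UI instead of failing silently; a minimal sketch:

// Surface LiveKit connection problems to the user instead of failing silently
import { Room, RoomEvent } from 'livekit-client';

function watchConnection(room: Room, setStatus: (status: string) => void) {
  room.on(RoomEvent.Reconnecting, () => {
    // The SDK retries automatically; warn the user that audio may drop briefly
    setStatus('Connection lost - reconnecting...');
  });

  room.on(RoomEvent.Reconnected, () => {
    setStatus('Reconnected');
  });

  room.on(RoomEvent.Disconnected, () => {
    // Retries exhausted (or an intentional disconnect); offer a manual rejoin
    setStatus('Disconnected - please rejoin the room');
  });
}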

3. Monitoring and Analytics

Implement monitoring to track API usage, conversation quality, and system performance.

// Simple analytics tracking
const trackConversation = (roomId: string, data: {
  duration: number;
  messageCount: number;
  userSpeakingTime: number;
  assistantSpeakingTime: number;
  errorCount: number;
}) => {
  // Send to your analytics service
  fetch('/api/analytics/conversation', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      roomId,
      ...data,
      timestamp: new Date().toISOString(),
    }),
  });
};
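
The helper above posts to /api/analytics/conversation, a route this guide does not otherwise define. A minimal receiving end that simply logs the payload (swap the console.log for your analytics provider) might look like this:

// app/api/analytics/conversation/route.ts - minimal receiving end for trackConversation
import { NextRequest, NextResponse } from 'next/server';

export async function POST(req: NextRequest) {
  try {
    const event = await req.json();

    // Replace with your analytics/observability provider
    console.log('conversation analytics', event);

    return NextResponse.json({ ok: true });
  } catch (error) {
    console.error('Failed to record analytics event:', error);
    return NextResponse.json({ error: 'Invalid analytics payload' }, { status: 400 });
  }
}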

4. Cost Management

Both OpenAI and LiveKit charge based on usage. Implement controls to manage costs (a minimal time-limit sketch follows the list):

  • Set time limits for conversations
  • Implement rate limiting for users
  • Monitor and set alerts for usage thresholds
  • Consider using smaller models for cost-sensitive applications
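
A conversation time limit is the simplest of these controls. The sketch below uses the LiveKit server SDK's RoomServiceClient to close a room after a fixed duration; the ten-minute limit is an arbitrary example. In a serverless deployment the timer should live in a long-running process (the agent worker is a natural home) rather than an API route.

// Disconnect everyone in a room after a maximum conversation length
import { RoomServiceClient } from 'livekit-server-sdk';

const MAX_CONVERSATION_MS = 10 * 60 * 1000; // 10 minutes - pick a limit that fits your budget

const roomService = new RoomServiceClient(
  process.env.LIVEKIT_URL!,
  process.env.LIVEKIT_API_KEY!,
  process.env.LIVEKIT_API_SECRET!
);

export function scheduleRoomShutdown(roomName: string) {
  // Call this when the room is created (e.g. alongside starting the agent)
  setTimeout(async () => {
    try {
      await roomService.deleteRoom(roomName); // ends the session for all participants
      console.log(`Room ${roomName} closed after reaching the time limit`);
    } catch (error) {
      console.error('Failed to close room:', error);
    }
  }, MAX_CONVERSATION_MS);
}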

Conclusion: The Future of Streaming Voice Assistants

In this comprehensive guide, we've built a truly real-time streaming voice-to-voice assistant that responds naturally and fluidly using OpenAI's streaming APIs, LiveKit, WebRTC, and React. We've covered:

  • Setting up the project infrastructure for real-time communication
  • Implementing streaming as a core component with continuous audio flow
  • Creating a streaming-optimized LiveKit agent with OpenAI's real-time model
  • Building responsive backend API endpoints that support streaming data
  • Developing frontend components with streaming UI updates that respond in real-time
  • Adding advanced streaming features like optimized audio processing and streaming function calling
  • Implementing sophisticated Voice Activity Detection for natural conversation flow
  • Fine-tuning performance parameters for lowest possible latency
  • Considering production deployment for high-performance streaming applications

Figure 11: Future directions for streaming voice technology

The streaming voice assistant we've built represents a fundamental shift from the turn-taking model of traditional voice assistants to a truly natural, continuous conversation flow. By implementing streaming at every level - from audio capture to AI processing to response generation - we've created an experience that feels responsive and human-like.

This architecture is highly extensible, allowing you to build more sophisticated applications on this streaming foundation, such as:

  • Real-time language translation during conversations
  • Multi-participant voice meetings with AI moderation
  • Ambient voice assistants that listen and respond contextually
  • Emotion-aware voice interfaces that adapt to user sentiment
  • Voice assistants that learn and improve from continuous conversation

As streaming AI technology continues to evolve, the latency gap between human-to-human and human-to-AI conversation will continue to shrink. The streaming-first approach outlined in this guide provides a solid foundation that can be adapted to incorporate new capabilities as they become available, ensuring your voice applications remain at the cutting edge of what's possible.

Remember: without streaming, voice assistants feel robotic and unnatural. With proper streaming implementation throughout the technology stack, we can create truly seamless, responsive, and natural voice interfaces that transform how humans interact with AI.

Further Reading