Table of Contents
- Introduction
- Prerequisites
- Understanding Streaming for Real-Time Voice Assistants
- Project Setup
- Setting Up the LiveKit Agent with Streaming Capabilities
- Creating the Backend API
- Building the Frontend
- Creating the Main Page
- Advanced Streaming Features and Optimizations
- Deployment and Production Considerations
- Conclusion: The Future of Streaming Voice Assistants
- Further Reading
Introduction
Building a real-time voice-to-voice assistant with streaming capabilities and chat transcript functionality has become increasingly accessible thanks to advances in AI streaming APIs and WebRTC technologies. This guide will walk you through creating a truly real-time voice assistant using OpenAI's streaming API for AI capabilities, LiveKit for real-time communication, and Next.js with React for a responsive frontend interface.
Figure 1: Complete streaming voice assistant architecture with real-time data flow
By the end of this tutorial, you'll have a fully functional application that can listen to user speech, process it through OpenAI's streaming models in real-time, respond with synthesized speech as it's generated (not after completion), and maintain a live-updating transcript of the entire conversation. Streaming is the core technology that makes this truly real-time, enabling natural, fluid interactions that feel like talking to another person.
Without streaming, voice assistants feel robotic and unnatural, with long pauses between user input and assistant response. With proper streaming implementation, we can create an experience that's responsive, natural, and engaging.
Prerequisites
Before we begin, ensure you have the following:
- Node.js (version 18+) installed on your system
- Basic knowledge of TypeScript, React, and Next.js
- An OpenAI API key with access to speech models and streaming capabilities
- A LiveKit account with API keys (they offer a free tier for development)
- A code editor like VS Code
We'll be using TypeScript throughout this tutorial for type safety and better developer experience.
Figure 2: System architecture showing client-server communication flow
Understanding Streaming for Real-Time Voice Assistants
Before diving into the code, it's crucial to understand why streaming is the cornerstone of a truly real-time voice assistant. Without streaming, the entire interaction feels disjointed and unnatural.
The Problem with Non-Streaming Approaches
In a traditional non-streaming implementation, the communication flow follows these steps:
- User speaks completely
- Audio is sent to the server after user completes speaking
- Server processes the entire audio
- Server generates a complete response
- Complete response is sent back to client
- Client plays the response audio
This approach creates a rigid, turn-taking conversation with significant delays between utterances, making the interaction feel unnatural and "robotic."
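To make that delay concrete, here is a rough, purely illustrative latency budget for the non-streaming path (every number below is a placeholder; real values depend on the models, network, and utterance length):
// Illustrative only – placeholder figures, not benchmarks
const sttMs = 1000;      // transcribe the full utterance after upload
const llmMs = 2000;      // generate the complete text response
const ttsMs = 1500;      // synthesize the complete response audio
const transportMs = 300; // request/response overhead

const silenceAfterUserStops = sttMs + llmMs + ttsMs + transportMs; // ≈ 4800 ms
console.log(`User hears nothing for ~${silenceAfterUserStops / 1000}s`);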
Streaming Architecture: The Key to Natural Conversation
Figure 3: Streaming data flow architecture
With a streaming approach, the interaction becomes fluid and natural:
- User starts speaking
- Audio chunks begin streaming to the server immediately
- Server processes audio chunks as they arrive
- Response begins generating while user is still speaking
- Generated response chunks stream back to the client in real-time
- Client plays response audio as it's received, not waiting for completion
This creates a continuous, natural flow of conversation with minimal latency, similar to talking with another person.
Key Streaming Components
Our implementation will leverage several streaming technologies:
- WebRTC Audio Streaming: Provides real-time audio transmission from the browser
- OpenAI Streaming API: Enables token-by-token response generation
- LiveKit Agents: Handles real-time bidirectional streaming between client and AI models
- Server-Sent Events (SSE): Enables server-to-client streaming for transcripts
- React Streaming UI: Updates the UI smoothly as new content arrives
The key difference is that with streaming, we're processing and rendering data in small chunks as they become available, rather than waiting for entire responses. This is what creates the real-time experience.
// Example of streaming vs non-streaming approach.
// Illustrative pseudocode: recordCompleteUserSpeech, transcribeAudio,
// generateReply, synthesizeSpeech, createRealTimeAudioStream,
// streamAssistantResponse, and the play/update helpers stand in for
// your own STT / LLM / TTS plumbing.

// Non-streaming (traditional)
async function processVoiceNonStreaming() {
  // 1. Wait for the user to finish speaking
  const completeAudio = await recordCompleteUserSpeech();
  // 2. Transcribe the entire utterance
  const transcript = await transcribeAudio(completeAudio);
  // 3. Generate the complete text response
  const reply = await generateReply(transcript);
  // 4. Synthesize the complete response audio
  const audioResponse = await synthesizeSpeech(reply);
  // 5. Only now can playback begin
  playAudio(audioResponse);
}

// Streaming (real-time)
async function processVoiceWithStreaming() {
  // Audio chunks are forwarded as soon as they are captured
  const audioStream = createRealTimeAudioStream();
  // Process chunks as they arrive
  audioStream.on('data', async (chunk) => {
    // The model begins responding before the utterance is complete
    for await (const partialResponse of streamAssistantResponse(chunk)) {
      // Play each audio chunk immediately
      playAudioChunk(partialResponse.audio);
      // Update the transcript in real time
      updateTranscriptInRealTime(partialResponse.text);
    }
  });
}
As you can see from the example above, streaming creates a fundamentally different user experience by processing and responding to data continuously rather than in discrete steps.
Project Setup
Let's start by creating a new Next.js project with the App Router:
npx create-next-app@latest voice-assistant --typescript
cd voice-assistant
Now, install the required dependencies:
npm install @livekit/components-react @livekit/components-styles
npm install @livekit/agents @livekit/agents-plugin-openai
npm install openai
npm install livekit-client livekit-server-sdk
Create a new .env.local file in the root of your project to store your API keys:
# OpenAI
OPENAI_API_KEY=your_openai_api_key
# LiveKit
LIVEKIT_API_KEY=your_livekit_api_key
LIVEKIT_API_SECRET=your_livekit_api_secret
LIVEKIT_URL=your_livekit_url # e.g., wss://your-project.livekit.cloud
# Same URL, exposed to the browser so the frontend can connect
NEXT_PUBLIC_LIVEKIT_URL=your_livekit_url
Setting Up the LiveKit Agent with Streaming Capabilities
First, let's define our voice agent using the LiveKit Agents framework with full streaming capabilities. Create a new directory called agents in the project root and add a file called voice-agent.ts:
// agents/voice-agent.ts
import { JobContext, WorkerOptions, cli, defineAgent, multimodal } from '@livekit/agents';
import * as openai from '@livekit/agents-plugin-openai';
import { JobType } from '@livekit/protocol';
import { fileURLToPath } from 'node:url';
export default defineAgent({
entry: async (ctx: JobContext) => {
// Connect to the LiveKit room
await ctx.connect();
// Define our multimodal agent with OpenAI's real-time streaming model
const agent = new multimodal.MultimodalAgent({
model: new openai.realtime.RealtimeModel({
instructions: `You are a helpful AI assistant that specializes in technology topics.
You should respond conversationally and be friendly but concise.
Always try to provide valuable information and answer any questions to the best of your ability.
If you don't know something, admit it rather than making up information.`,
// You can choose from available OpenAI voices: alloy, echo, fable, onyx, nova, shimmer
voice: 'nova',
// Control the creativity of responses
temperature: 0.7,
// Allow unlimited response length
maxResponseOutputTokens: Infinity,
// Enable both text and audio modalities with streaming
modalities: ['text', 'audio'],
// CRUCIAL: Enable real-time streaming for immediate responses
streaming: {
enabled: true,
// Send partial updates every 3 tokens for ultra-responsive feel
partialUpdateInterval: 3,
},
// Configure voice activity detection for real-time turn-taking
turnDetection: {
// Use server-side VAD for greater accuracy
type: 'server_vad',
// How confident system should be that speech is occurring (0.0-1.0)
threshold: 0.5,
// How long of silence before considering turn complete (ms)
silence_duration_ms: 300,
// How much audio to include before detected speech (ms)
prefix_padding_ms: 300,
},
// Enable continuous conversation without strict turn-taking
// This allows more natural overlapping conversation
conversation: {
continuousTalking: true,
// How much of user context to maintain (tokens)
contextWindowSize: 4000,
},
}),
});
// Start the agent in the LiveKit room with streaming enabled
await agent.start(ctx.room);
// Log when streaming events occur for debugging
agent.on('streaming_started', () => {
console.log('Streaming started');
});
agent.on('streaming_chunk', (chunk) => {
console.log('Received streaming chunk');
});
},
});
// Allow the agent to be run directly from the CLI. This file is an ES module,
// so we use an ESM-safe check instead of the CommonJS require.main === module guard.
if (process.argv[1] === fileURLToPath(import.meta.url)) {
  cli.runApp(new WorkerOptions({
    agent: fileURLToPath(import.meta.url),
    workerType: JobType.JT_ROOM,
  }));
}
This code sets up a LiveKit agent that can process both text and audio using OpenAI's real-time streaming model. Let's break down the key streaming components:
- streaming: { enabled: true }: The critical configuration that enables real-time streaming of responses as they're generated
- partialUpdateInterval: Controls how frequently partial updates are sent (lower values = more responsive but more network traffic)
- continuousTalking: Enables more natural conversation flow without rigid turn-taking
- turnDetection: Fine-tuned parameters for detecting when the user has finished speaking
- Streaming events: Event listeners that provide hooks into the streaming process
The streaming configuration is what enables our assistant to respond in real-time as the user is speaking, rather than waiting for complete utterances. This creates the natural, flowing conversation experience that differentiates our system from traditional voice assistants.
Figure 4: Comparison of response timing with and without streaming
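Before wiring up the backend, note that this agent runs as its own long-lived worker process, separate from the Next.js server. Assuming you add a TypeScript runner such as tsx as a dev dependency (npm install -D tsx), you can start it in a second terminal roughly like this (the dev subcommand is handled by the agents CLI that cli.runApp sets up):
npx tsx agents/voice-agent.ts dev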
Creating the Backend API
Now we need to create API endpoints to interact with our LiveKit services. Let's set up the necessary routes:
First, create an API route to generate LiveKit access tokens:
// app/api/token/route.ts
import { AccessToken } from 'livekit-server-sdk';
import { NextRequest, NextResponse } from 'next/server';
export async function POST(req: NextRequest) {
try {
const { roomName, participantName } = await req.json();
// Validate request data
if (!roomName || !participantName) {
return NextResponse.json(
{ error: 'roomName and participantName are required' },
{ status: 400 }
);
}
// Create a new access token
const apiKey = process.env.LIVEKIT_API_KEY!;
const apiSecret = process.env.LIVEKIT_API_SECRET!;
const token = new AccessToken(apiKey, apiSecret, {
identity: participantName,
});
// Grant publish and subscribe permissions
token.addGrant({
roomJoin: true,
room: roomName,
canPublish: true,
canSubscribe: true,
});
// Return the token (toJwt() is async in recent versions of livekit-server-sdk)
return NextResponse.json({ token: await token.toJwt() });
} catch (error) {
console.error('Error generating token:', error);
return NextResponse.json(
{ error: 'Failed to generate token' },
{ status: 500 }
);
}
}
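You can sanity-check this route before building the frontend, for example with curl against the local dev server (port 3000 assumed):
curl -X POST http://localhost:3000/api/token \
  -H "Content-Type: application/json" \
  -d '{"roomName":"test-room","participantName":"alice"}'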
Next, create an API endpoint to start our voice agent for a specific room:
// app/api/agent/start/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { LivekitAgentService } from '@livekit/agents';
export async function POST(req: NextRequest) {
try {
const { roomName } = await req.json();
if (!roomName) {
return NextResponse.json(
{ error: 'roomName is required' },
{ status: 400 }
);
}
// Initialize the LiveKit Agent Service
const agentService = new LivekitAgentService({
apiKey: process.env.LIVEKIT_API_KEY!,
apiSecret: process.env.LIVEKIT_API_SECRET!,
livekitUrl: process.env.LIVEKIT_URL!,
});
// Start the agent for the given room
const jobId = await agentService.startWorker({
type: 'room',
room: roomName,
agentPath: '../../agents/voice-agent.ts', // Path to our agent file
});
return NextResponse.json({ jobId });
} catch (error) {
console.error('Error starting agent:', error);
return NextResponse.json(
{ error: 'Failed to start agent' },
{ status: 500 }
);
}
}
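The component overview earlier listed Server-Sent Events as an option for server-to-client transcript streaming. In this tutorial the transcript travels over LiveKit data channels, but if you also want to push updates to clients outside the room (an analytics dashboard, for example), a minimal SSE route could look like the sketch below. The transcriptEvents emitter imported from '@/lib/transcript-events' is a hypothetical module you would feed from your agent or a webhook handler:
// app/api/transcript/stream/route.ts (optional sketch)
import { NextRequest } from 'next/server';
// Hypothetical in-memory EventEmitter fed with transcript updates
import { transcriptEvents } from '@/lib/transcript-events';

export async function GET(req: NextRequest) {
  const encoder = new TextEncoder();
  const stream = new ReadableStream({
    start(controller) {
      const onUpdate = (update: { sender: string; text: string }) => {
        // Each SSE frame is a "data: ...\n\n" text chunk
        controller.enqueue(encoder.encode(`data: ${JSON.stringify(update)}\n\n`));
      };
      transcriptEvents.on('update', onUpdate);
      // Stop streaming when the client disconnects
      req.signal.addEventListener('abort', () => {
        transcriptEvents.off('update', onUpdate);
        controller.close();
      });
    },
  });
  return new Response(stream, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
      Connection: 'keep-alive',
    },
  });
}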
Building the Frontend
Now let's create the React components for our voice assistant interface. We'll need to create:
- A LiveKit provider component for managing connections
- A conversation interface with transcripts
- Audio controls for the user
First, let's create a LiveKit context provider:
// components/LiveKitProvider.tsx
import React, { createContext, useContext, useState, useEffect } from 'react';
import { Room, RoomEvent, RemoteParticipant, LocalParticipant } from 'livekit-client';
interface LiveKitContextType {
room: Room | null;
connect: (token: string, url: string) => Promise<void>;
disconnect: () => void;
isConnected: boolean;
localParticipant: LocalParticipant | null;
remoteParticipants: RemoteParticipant[];
}
const LiveKitContext = createContext<LiveKitContextType | null>(null);
export function useLiveKit() {
const context = useContext(LiveKitContext);
if (!context) {
throw new Error('useLiveKit must be used within a LiveKitProvider');
}
return context;
}
export function LiveKitProvider({ children }: { children: React.ReactNode }) {
const [room] = useState(() => new Room());
const [isConnected, setIsConnected] = useState(false);
const [remoteParticipants, setRemoteParticipants] = useState<RemoteParticipant[]>([]);
useEffect(() => {
  // Named handlers so the exact listeners can be removed on cleanup
  const handleParticipantsChanged = () => {
    setRemoteParticipants(Array.from(room.remoteParticipants.values()));
  };
  const handleConnected = () => setIsConnected(true);
  const handleDisconnected = () => setIsConnected(false);

  room.on(RoomEvent.ParticipantConnected, handleParticipantsChanged);
  room.on(RoomEvent.ParticipantDisconnected, handleParticipantsChanged);
  room.on(RoomEvent.Connected, handleConnected);
  room.on(RoomEvent.Disconnected, handleDisconnected);

  return () => {
    room.off(RoomEvent.ParticipantConnected, handleParticipantsChanged);
    room.off(RoomEvent.ParticipantDisconnected, handleParticipantsChanged);
    room.off(RoomEvent.Connected, handleConnected);
    room.off(RoomEvent.Disconnected, handleDisconnected);
  };
}, [room]);
const connect = async (token: string, url: string) => {
try {
await room.connect(url, token);
console.log('Connected to LiveKit room');
} catch (error) {
console.error('Failed to connect to LiveKit room:', error);
throw error;
}
};
const disconnect = () => {
room.disconnect();
};
const value = {
room,
connect,
disconnect,
isConnected,
localParticipant: room.localParticipant,
remoteParticipants,
};
return (
<LiveKitContext.Provider value={value}>{children}</LiveKitContext.Provider>
);
}
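One browser quirk worth planning for: autoplay policies may block the assistant's audio until the user interacts with the page. livekit-client exposes Room.startAudio() for exactly this, and calling it from a user gesture (such as the Join button we build later) is usually enough. A small helper sketch:
// Call from a click handler (e.g. the Join button) to satisfy autoplay policies
import { Room } from 'livekit-client';

async function enableAudioPlayback(room: Room) {
  try {
    await room.startAudio();
  } catch (err) {
    console.warn('Audio playback is still blocked by the browser:', err);
  }
}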
Next, let's create a component for the conversation transcript that supports real-time streaming updates:
// components/ConversationTranscript.tsx
import React, { useEffect, useState, useRef } from 'react';
import { useLiveKit } from './LiveKitProvider';
import { DataPacket_Kind, RemoteParticipant, Track } from 'livekit-client';
interface Message {
id: string;
sender: string;
text: string;
timestamp: Date;
isUser: boolean;
isStreaming: boolean; // Flag to indicate if a message is currently streaming
streamingComplete: boolean; // Flag to indicate if streaming is complete
}
export default function ConversationTranscript() {
const { room, localParticipant, remoteParticipants } = useLiveKit();
const [messages, setMessages] = useState<Message[]>([]);
const transcriptRef = useRef<HTMLDivElement>(null);
const activeStreamingMessageId = useRef<string | null>(null);
// Function to add a new message
const addMessage = (sender: string, text: string, isUser: boolean, isStreaming = false) => {
const newMessageId = `${Date.now()}-${Math.random()}`;
const newMessage: Message = {
id: newMessageId,
sender,
text,
timestamp: new Date(),
isUser,
isStreaming,
streamingComplete: !isStreaming,
};
if (isStreaming) {
activeStreamingMessageId.current = newMessageId;
}
setMessages((prev) => [...prev, newMessage]);
return newMessageId;
};
// Function to update a streaming message
const updateStreamingMessage = (id: string, newText: string, complete = false) => {
setMessages((prev) =>
prev.map((message) =>
message.id === id
? {
...message,
text: newText,
streamingComplete: complete,
isStreaming: !complete
}
: message
)
);
if (complete) {
activeStreamingMessageId.current = null;
}
};
// Set up data channel for receiving text messages
useEffect(() => {
if (!room || !localParticipant) return;
// Set up local data channel to send user messages
const sendData = (text: string) => {
if (room && localParticipant) {
// Send the message via LiveKit data channel
const data = JSON.stringify({ type: 'message', text });
room.localParticipant.publishData(
new TextEncoder().encode(data),
DataPacket_Kind.RELIABLE
);
// Add to local transcript
addMessage(localParticipant.identity || 'You', text, true);
}
};
// Handle streaming messages from the assistant
const onStreamingDataReceived = (payload: Uint8Array, participant: RemoteParticipant) => {
try {
const data = JSON.parse(new TextDecoder().decode(payload));
if (data.type === 'streaming_start') {
// Create a new streaming message
const messageId = addMessage(
participant.identity || 'Assistant',
'', // Empty initial text
false,
true // Mark as streaming
);
}
else if (data.type === 'streaming_chunk') {
// Update existing streaming message with new content
if (activeStreamingMessageId.current) {
updateStreamingMessage(
activeStreamingMessageId.current,
data.text, // Partial text from streaming chunk
false // Not complete yet
);
}
}
else if (data.type === 'streaming_end') {
// Finalize the streaming message
if (activeStreamingMessageId.current) {
updateStreamingMessage(
activeStreamingMessageId.current,
data.text, // Final complete text
true // Mark as complete
);
}
}
else if (data.type === 'message') {
// Legacy non-streaming message
addMessage(participant.identity || 'Assistant', data.text, false);
}
} catch (error) {
console.error('Error parsing received data:', error);
}
};
// Subscribe to remote participants' data
remoteParticipants.forEach((participant) => {
participant.on('dataReceived', (payload) => onStreamingDataReceived(payload, participant));
});
// Set up audio streaming
const setupAudioStreaming = () => {
remoteParticipants.forEach((participant) => {
participant.audioTracks.forEach((track) => {
// Handle incoming audio tracks
if (track.kind === Track.Kind.Audio) {
// Subscribe to audio track automatically
track.setSubscribed(true);
// Handle audio elements for streaming playback
const audioElement = new Audio();
audioElement.autoplay = true;
// Attach track to audio element for streaming playback
track.track?.attach(audioElement);
// Setup ended event to clean up
audioElement.addEventListener('ended', () => {
track.track?.detach(audioElement);
});
}
});
// Listen for new tracks
participant.on('trackSubscribed', (track) => {
if (track.kind === Track.Kind.Audio) {
// New audio track arrived, attach it for streaming playback
const audioElement = new Audio();
audioElement.autoplay = true;
track.attach(audioElement);
}
});
});
};
// Initialize audio streaming
setupAudioStreaming();
// Clean up event listeners
// Clean up event listeners and detach audio
return () => {
  remoteParticipants.forEach((participant) => {
    // off() requires the original handler reference, so remove all listeners here
    participant.removeAllListeners('dataReceived');
    participant.removeAllListeners('trackSubscribed');
    // Detach all audio tracks
    participant.audioTracks.forEach((publication) => {
      publication.track?.detach();
    });
  });
};
}, [room, localParticipant, remoteParticipants]);
// Auto-scroll to the latest message
useEffect(() => {
if (transcriptRef.current) {
transcriptRef.current.scrollTop = transcriptRef.current.scrollHeight;
}
}, [messages]);
return (
<div className="w-full max-w-3xl mx-auto">
<div className="border rounded-lg p-4 h-96 overflow-y-auto bg-slate-900" ref={transcriptRef}>
{messages.length === 0 ? (
<div className="text-center text-gray-400 my-8">
Your conversation will appear here
</div>
) : (
messages.map((message) => (
<div
key={message.id}
className={`mb-4 ${message.isUser ? 'text-right' : 'text-left'}`}
>
<div className={`inline-block px-4 py-2 rounded-lg ${
message.isUser
? 'bg-blue-600 text-white rounded-br-none'
: 'bg-gray-700 text-white rounded-bl-none'
}`}>
<div className="text-sm font-semibold flex items-center gap-2">
{message.sender}
{message.isStreaming && (
<span className="inline-block w-2 h-2 bg-green-400 rounded-full animate-pulse"
title="Streaming response"></span>
)}
</div>
<div>
{message.text}
{message.isStreaming && (
<span className="inline-block w-1 h-4 bg-white ml-1 animate-blink"></span>
)}
</div>
<div className="text-xs opacity-70 mt-1">
{message.timestamp.toLocaleTimeString()}
</div>
</div>
</div>
))
)}
</div>
{/* Streaming status indicator */}
{messages.some(m => m.isStreaming) && (
<div className="text-center text-sm text-green-400 mt-2 flex items-center justify-center gap-2">
<span className="inline-block w-2 h-2 bg-green-400 rounded-full animate-pulse"></span>
Streaming response in real-time...
</div>
)}
</div>
);
}
This enhanced conversation transcript component is specifically designed to handle streaming messages. Let's break down the key streaming-focused features:
- Streaming Message States: We track whether messages are currently streaming or complete
- Real-time Updates: The component updates messages incrementally as new chunks arrive
- Streaming Indicators: Visual cues like pulsing dots and a blinking cursor show when a message is streaming
- Different Message Types: Handles different streaming events (start, chunk, end) to create a smooth experience
- Audio Streaming: Automatically attaches and plays audio tracks as they arrive
With this implementation, users will see the AI's response appearing in real-time, letter by letter, creating a much more engaging and responsive experience than waiting for complete messages.
Figure 5: Live streaming text response with visual indicators
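One small implementation detail: the animate-blink class used for the streaming cursor is not one of Tailwind's built-in animations, so it needs to be registered in your Tailwind config. A minimal sketch of the relevant tailwind.config.ts extension (the keyframe name and timing are just suggestions):
// tailwind.config.ts (relevant excerpt) – registers the custom blink animation
import type { Config } from 'tailwindcss';

const config: Config = {
  content: ['./app/**/*.{ts,tsx}', './components/**/*.{ts,tsx}'],
  theme: {
    extend: {
      keyframes: {
        blink: {
          '0%, 100%': { opacity: '1' },
          '50%': { opacity: '0' },
        },
      },
      animation: {
        // Used by the streaming cursor: className="animate-blink"
        blink: 'blink 1s step-end infinite',
      },
    },
  },
  plugins: [],
};

export default config;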
Now, let's create an enhanced audio control panel component that supports streaming audio for real-time voice interaction. We'll break it down into smaller, digestible parts:
1. Component Setup and State
// components/VoiceControls.tsx
import React, { useState, useEffect, useRef } from 'react';
import { useLiveKit } from './LiveKitProvider';
import { createLocalAudioTrack, LocalTrack, Track, DataPacket_Kind } from 'livekit-client';
export default function VoiceControls() {
// Get LiveKit context
const { room, localParticipant, remoteParticipants } = useLiveKit();
// Audio state
const [isMuted, setIsMuted] = useState(true);
const [isPublishing, setIsPublishing] = useState(false);
const [microphoneTrack, setMicrophoneTrack] = useState<LocalTrack | null>(null);
// Streaming state
const [isStreaming, setIsStreaming] = useState(false);
const [streamingLevel, setStreamingLevel] = useState(0);
// Audio processing references
const audioLevelInterval = useRef<NodeJS.Timeout | null>(null);
const audioAnalyser = useRef<AnalyserNode | null>(null);
const audioContext = useRef<AudioContext | null>(null);
2. Audio Processing Setup for Streaming
This function sets up real-time audio processing to detect voice activity and visualize audio levels:
// Initialize audio streaming processing
const setupAudioProcessing = async (track: LocalTrack) => {
if (!track || !track.mediaStreamTrack) return;
try {
// Create audio context for stream processing
audioContext.current = new AudioContext();
// Create media stream source from microphone track
const stream = new MediaStream([track.mediaStreamTrack]);
const source = audioContext.current.createMediaStreamSource(stream);
// Create analyser for audio level monitoring
const analyser = audioContext.current.createAnalyser();
analyser.fftSize = 256;
analyser.smoothingTimeConstant = 0.8; // More smoothing for level visualization
source.connect(analyser);
audioAnalyser.current = analyser;
// Set up interval to continuously monitor audio levels
const dataArray = new Uint8Array(analyser.frequencyBinCount);
startAudioMonitoring(dataArray);
} catch (error) {
console.error('Error setting up audio processing:', error);
}
};
3. Continuous Audio Level Monitoring
This function continuously analyzes the audio stream to detect when a user is speaking:
// Monitor audio levels and detect voice activity
const startAudioMonitoring = (dataArray: Uint8Array) => {
audioLevelInterval.current = setInterval(() => {
if (audioAnalyser.current && !isMuted) {
audioAnalyser.current.getByteFrequencyData(dataArray);
// Calculate average audio level
let sum = 0;
for (let i = 0; i < dataArray.length; i++) {
sum += dataArray[i];
}
const average = sum / dataArray.length;
// Scale to 0-10 for visualization
const level = Math.floor((average / 255) * 10);
setStreamingLevel(level);
// Voice activity detection logic
handleVoiceActivity(level);
} else {
setStreamingLevel(0);
}
}, 100); // Check every 100ms for responsive detection
};
// Handle voice activity detection for streaming
const handleVoiceActivity = (level: number) => {
// If voice detected and we're not already streaming, start stream
if (level > 2 && !isStreaming) {
startStreamingSession();
}
// If voice stops and we were streaming, end stream with delay
else if (level <= 1 && isStreaming) {
// Add slight delay to prevent stopping on brief pauses
debounceStreamingEnd();
}
};
4. Stream Session Management
These functions handle starting and ending streaming sessions with notifications:
// Start streaming session and notify system
const startStreamingSession = () => {
setIsStreaming(true);
// Notify system of streaming start through data channel
if (room && localParticipant) {
const data = JSON.stringify({
type: 'streaming_start',
timestamp: Date.now()
});
localParticipant.publishData(
new TextEncoder().encode(data),
DataPacket_Kind.RELIABLE
);
}
};
// Debounce stream ending to avoid false stops
const debounceStreamingEnd = () => {
setTimeout(() => {
// Re-check audio level to confirm it's still low
if (audioAnalyser.current) {
const confirmArray = new Uint8Array(audioAnalyser.current.frequencyBinCount);
audioAnalyser.current.getByteFrequencyData(confirmArray);
const average = confirmArray.reduce((a, b) => a + b, 0) / confirmArray.length;
const level = Math.floor((average / 255) * 10);
// If level is still low, end streaming
if (level <= 1) {
endStreamingSession();
}
}
}, 800); // Wait to make sure it's not just a brief pause
};
// End streaming session and notify system
const endStreamingSession = () => {
setIsStreaming(false);
// Notify system of streaming end
if (room && localParticipant) {
const data = JSON.stringify({
type: 'streaming_end',
timestamp: Date.now()
});
localParticipant.publishData(
new TextEncoder().encode(data),
DataPacket_Kind.RELIABLE
);
}
};
5. Microphone Control
This function handles toggling the microphone with optimized streaming settings:
// Start or stop audio streaming publication
const toggleMicrophone = async () => {
if (!room || !localParticipant) return;
try {
if (isMuted) {
// Start publishing streaming audio
await startMicrophone();
} else {
// Stop streaming and clean up
await stopMicrophone();
}
} catch (error) {
console.error('Error toggling streaming microphone:', error);
setIsPublishing(false);
}
};
// Start microphone with streaming optimizations
const startMicrophone = async () => {
// If currently muted, start publishing streaming audio
if (!microphoneTrack) {
setIsPublishing(true);
// Create a new microphone track with streaming-friendly capture options
const tracks = await createLocalAudioTrack({
  // Enable high quality audio for better streaming results
  echoCancellation: true,
  noiseSuppression: true,
  autoGainControl: true,
  // Higher sample rate, mono channel is sufficient for voice
  sampleRate: 48000,
  channelCount: 1,
});
// Setup streaming audio visualization before publishing
await setupAudioProcessing(tracks);
// Publish the track to the room, capping the audio bitrate for voice
await localParticipant.publishTrack(tracks, {
  name: 'streaming-audio',
  // ~64 kbps mono Opus is plenty for speech
  audioPreset: { maxBitrate: 64000 },
});
setMicrophoneTrack(tracks);
setIsPublishing(false);
} else {
// If we already have a track, just resume it with streaming
await microphoneTrack.unmute();
setupAudioProcessing(microphoneTrack);
}
setIsMuted(false);
};
// Stop microphone and clean up resources
const stopMicrophone = async () => {
// If unmuted, mute the track and stop streaming
if (microphoneTrack) {
await microphoneTrack.mute();
endStreamingSession();
}
// Clean up audio processing
cleanupAudioProcessing();
setStreamingLevel(0);
setIsMuted(true);
};
// Clean up audio processing resources
const cleanupAudioProcessing = () => {
if (audioLevelInterval.current) {
clearInterval(audioLevelInterval.current);
audioLevelInterval.current = null;
}
if (audioContext.current) {
audioContext.current.close();
audioContext.current = null;
}
};
6. Component Cleanup and UI Rendering
Handle cleanup on unmount and render the visual interface:
// Clean up tracks and audio processing when component unmounts
useEffect(() => {
return () => {
if (microphoneTrack) {
microphoneTrack.stop();
}
cleanupAudioProcessing();
};
}, [microphoneTrack]);
return (
<div className="mt-6 flex flex-col items-center">
{/* Audio level visualization meter */}
{!isMuted && (
<div className="mb-4 w-64 h-8 bg-gray-800 rounded-full p-1 flex items-center">
<div
className={`h-6 rounded-full transition-all duration-100 ${
isStreaming ? 'bg-green-500' : 'bg-blue-500'
}`}
style={{ width: `${streamingLevel * 10}%` }}
>
{streamingLevel > 4 && (
<div className="text-xs text-white text-center w-full">
{isStreaming ? 'Streaming' : 'Listening'}
</div>
)}
</div>
</div>
)}
{/* Microphone toggle button */}
<button
onClick={toggleMicrophone}
disabled={isPublishing || !room}
className={`px-6 py-3 rounded-full flex items-center ${
isMuted
? 'bg-green-600 hover:bg-green-700'
: 'bg-red-600 hover:bg-red-700'
} text-white font-medium transition-colors`}
>
{isPublishing ? (
<span>Initializing microphone for streaming...</span>
) : (
<>
{isMuted ? (
<>
<svg
xmlns="http://www.w3.org/2000/svg"
className="h-5 w-5 mr-2"
viewBox="0 0 20 20"
fill="currentColor"
>
<path
fillRule="evenodd"
d="M7 4a3 3 0 016 0v4a3 3 0 11-6 0V4zm4 10.93A7.001 7.001 0 0017 8a1 1 0 10-2 0A5 5 0 015 8a1 1 0 00-2 0 7.001 7.001 0 006 6.93V17H6a1 1 0 100 2h8a1 1 0 100-2h-3v-2.07z"
clipRule="evenodd"
/>
</svg>
Start Streaming Audio
</>
) : (
<>
<svg
xmlns="http://www.w3.org/2000/svg"
className="h-5 w-5 mr-2"
viewBox="0 0 20 20"
fill="currentColor"
>
<path
fillRule="evenodd"
d="M13.477 14.89A6 6 0 015.11 6.524l8.367 8.368zm1.414-1.414L6.524 5.11a6 6 0 018.367 8.367zM18 10a8 8 0 11-16 0 8 8 0 0116 0z"
clipRule="evenodd"
/>
</svg>
Stop Streaming Audio
</>
)}
</>
)}
</button>
{/* Streaming status indicator */}
{isStreaming && (
<div className="text-xs text-green-400 animate-pulse mt-2">
Real-time streaming active - AI is processing as you speak
</div>
)}
{/* Streaming stats */}
{!isMuted && (
<div className="mt-4 text-xs text-gray-400 text-center">
<div>Audio quality: 48kHz mono | Bitrate: 64kbps</div>
<div>Connected participants: {remoteParticipants.length + 1}</div>
</div>
)}
</div>
);
}
Breaking down the code into smaller, focused sections makes it easier to understand how the streaming audio controls work. Each section handles a specific aspect of the streaming functionality:
- Setup and State - Initializes the component with appropriate state variables
- Audio Processing - Sets up the Web Audio API for real-time audio analysis
- Audio Monitoring - Continuously monitors audio levels to detect speech
- Stream Management - Handles starting and stopping streaming sessions
- Microphone Control - Manages the microphone with optimized streaming settings
- UI Rendering - Provides visual feedback on streaming status
This enhanced voice controls component adds several streaming-specific features:
- Audio Visualization: Real-time meter showing audio levels during streaming
- Stream State Management: Detects when voice is active and sends streaming events
- Enhanced Audio Quality: Configures higher quality audio settings for better streaming
- Web Audio API Integration: Uses the Web Audio API for advanced audio analysis
- Visual Feedback: Shows active streaming status for better user experience
The enhanced audio processing is especially important for streaming applications. Instead of only capturing and sending audio, we're continuously analyzing the audio stream to detect speech, visualize audio levels, and provide feedback on streaming status.
Figure 6: Real-time audio streaming controls with voice level visualization
Finally, let's create a main component that brings everything together:
// components/VoiceAssistant.tsx
import React, { useState, useEffect } from 'react';
import { LiveKitProvider, useLiveKit } from './LiveKitProvider';
import ConversationTranscript from './ConversationTranscript';
import VoiceControls from './VoiceControls';
function AssistantInner() {
const { connect, disconnect, isConnected } = useLiveKit();
const [isJoining, setIsJoining] = useState(false);
const [error, setError] = useState<string | null>(null);
const [roomName, setRoomName] = useState('');
const [userName, setUserName] = useState('');
// Function to join a room
const joinRoom = async () => {
if (!roomName || !userName) {
setError('Room name and user name are required');
return;
}
setIsJoining(true);
setError(null);
try {
// Get token from our API
const tokenResponse = await fetch('/api/token', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify({
roomName,
participantName: userName,
}),
});
if (!tokenResponse.ok) {
throw new Error('Failed to get token');
}
const { token } = await tokenResponse.json();
// Connect to the LiveKit room
await connect(token, process.env.NEXT_PUBLIC_LIVEKIT_URL!);
// Start the agent
await fetch('/api/agent/start', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify({
roomName,
}),
});
} catch (err) {
console.error('Error joining room:', err);
setError(err instanceof Error ? err.message : 'Failed to join room');
} finally {
setIsJoining(false);
}
};
// Function to leave the room
const leaveRoom = () => {
disconnect();
};
return (
<div className="max-w-4xl mx-auto p-4">
<div className="text-center mb-8">
<h1 className="text-3xl font-bold mb-2">AI Voice Assistant</h1>
<p className="text-gray-400">
Talk with an AI assistant using your voice
</p>
</div>
{!isConnected ? (
<div className="bg-slate-800 p-6 rounded-lg shadow-lg max-w-md mx-auto">
<h2 className="text-xl font-semibold mb-4">Join a Room</h2>
{error && (
<div className="bg-red-600 bg-opacity-20 border border-red-400 text-red-200 px-4 py-2 rounded mb-4">
{error}
</div>
)}
<div className="space-y-4">
<div>
<label htmlFor="roomName" className="block text-sm font-medium mb-1">
Room Name
</label>
<input
type="text"
id="roomName"
value={roomName}
onChange={(e) => setRoomName(e.target.value)}
placeholder="Enter a room name"
className="w-full px-4 py-2 bg-slate-700 rounded border border-slate-600 focus:ring-2 focus:ring-blue-500 focus:border-transparent"
disabled={isJoining}
/>
</div>
<div>
<label htmlFor="userName" className="block text-sm font-medium mb-1">
Your Name
</label>
<input
type="text"
id="userName"
value={userName}
onChange={(e) => setUserName(e.target.value)}
placeholder="Enter your name"
className="w-full px-4 py-2 bg-slate-700 rounded border border-slate-600 focus:ring-2 focus:ring-blue-500 focus:border-transparent"
disabled={isJoining}
/>
</div>
<button
onClick={joinRoom}
disabled={isJoining}
className="w-full py-2 px-4 bg-blue-600 hover:bg-blue-700 text-white font-medium rounded transition-colors disabled:opacity-50"
>
{isJoining ? 'Joining...' : 'Join Room'}
</button>
</div>
</div>
) : (
<div className="space-y-6">
<div className="flex justify-between items-center">
<h2 className="text-xl font-semibold">
Room: <span className="text-blue-400">{roomName}</span>
</h2>
<button
onClick={leaveRoom}
className="px-4 py-2 bg-red-600 hover:bg-red-700 text-white font-medium rounded-md transition-colors"
>
Leave Room
</button>
</div>
{/* Conversation transcript */}
<ConversationTranscript />
{/* Voice controls */}
<VoiceControls />
</div>
)}
</div>
);
}
export default function VoiceAssistant() {
return (
<LiveKitProvider>
<AssistantInner />
</LiveKitProvider>
);
}
Creating the Main Page
Let's implement our main page component that will use our voice assistant:
// app/page.tsx
'use client';
import dynamic from 'next/dynamic';
import { Suspense } from 'react';
// Dynamically import the voice assistant component to avoid SSR issues
// with browser APIs like getUserMedia
const VoiceAssistant = dynamic(
() => import('@/components/VoiceAssistant'),
{ ssr: false }
);
export default function Home() {
return (
<main className="min-h-screen p-4">
<Suspense fallback={<div>Loading voice assistant...</div>}>
<VoiceAssistant />
</Suspense>
</main>
);
}
Note that we're using Next.js's dynamic import to avoid server-side rendering of components that use browser-specific APIs like getUserMedia.
Figure 7: Room joining interface for the voice assistant
Figure 8: Conversation interface with transcript and voice control
Advanced Streaming Features and Optimizations
Now that we have a basic real-time voice assistant with streaming capabilities, let's explore advanced streaming features and optimizations that will make our application truly responsive and natural:
1. Optimizing Streaming Performance
To achieve the lowest possible latency in streaming responses, we need to fine-tune several parameters:
// Optimize streaming configuration in agent definition
streaming: {
enabled: true,
// Lower values = more responsive but higher bandwidth usage
// For voice assistants, 2-3 is ideal for real-time feel
partialUpdateInterval: 2,
// Enable chunked transfer encoding for faster delivery
chunkedTransfer: true,
// Prioritize partial content delivery
prioritizePartialDelivery: true,
// Specify response format for most efficient parsing
responseFormat: {
type: "text",
chunk_size: 20, // Characters per chunk, balance between smoothness and overhead
},
// Pipeline optimization
pipelineConfig: {
// Reduce overhead by avoiding full validation between chunks
validatePartialResponses: false,
// Process chunks in parallel when possible
parallelProcessing: true,
}
}
The parameters above are crucial for achieving the lowest possible streaming latency. For voice assistants, responsiveness is key, so we prioritize speed over bandwidth efficiency.
2. Enhanced Voice Activity Detection for Streaming
To enable truly natural conversation flow, we need sophisticated voice activity detection that can determine when to start and stop streaming:
// Enhanced streaming-optimized VAD configuration
turnDetection: {
// Server-side VAD offers better accuracy for streaming
type: 'server_vad',
// Fine-tuned confidence threshold (0.0-1.0)
// Lower = more responsive but may trigger on background noise
// Higher = fewer false triggers but may miss the start of speech
threshold: 0.4,
// How long of silence before considering turn complete (ms)
// Shorter makes streaming more responsive but may split utterances
silence_duration_ms: 400,
// Include audio before detected speech for better context (ms)
prefix_padding_ms: 200,
// Advanced settings for streaming optimization
streaming_specific: {
// Start streaming response after this duration of speech (ms)
// This allows the AI to start formulating responses mid-sentence
early_response_ms: 600,
// Continue listening for this duration after speech ends (ms)
// Helps maintain context between pauses
continuation_window_ms: 2000,
// Enable continuous speech detection for more natural conversation
continuous_detection: true,
// Transition phase length between speakers (ms)
// Controls how much overlap is allowed in conversation
transition_phase_ms: 300,
}
}
These VAD enhancements allow the assistant to begin generating responses while the user is still talking, similar to how humans begin formulating responses before the other person has finished speaking.
Figure 9: Optimized streaming response timing with conversation overlap
3. Audio Streaming Optimizations
For the smoothest audio streaming experience, we need to fine-tune the audio processing:
// Create optimized audio streaming configuration
const createOptimizedAudioStream = () => {
// Configure audio streaming parameters
const streamingConfig = {
// Audio format settings
audio: {
// Higher sample rate for better quality voice
sampleRate: 48000,
// Single channel for voice is sufficient
channels: 1,
// Using opus codec for efficient streaming
codec: 'opus',
// Bitrate balanced for voice quality and bandwidth
bitrate: 64000,
// Frame size optimized for low latency
frameSize: 20, // ms
},
// Buffer settings
buffer: {
// Set initial buffer size
// Smaller = less latency but more risk of interruptions
// Larger = smoother playback but more latency
initial: 300, // ms
// Minimum buffer before playback starts
minimum: 100, // ms
// Dynamic buffer adjustment for network conditions
adaptive: true,
// How quickly buffer adapts (0-1, higher = faster adaptation)
adaptationRate: 0.6,
},
// Network optimizations
network: {
// Send multiple small packets rather than waiting for larger ones
prioritizeLowLatency: true,
// Use WebRTC data channels for control messages (faster than REST)
useDataChannels: true,
// Prioritize audio packets in congested networks
qualityOfService: 'high',
// Reconnection strategy
reconnection: {
attempts: 5,
backoff: 'exponential',
maxDelay: 3000, // ms
},
},
};
return streamingConfig;
};
// Apply configuration to audio stream
const applyStreamingOptimizations = (audioTrack) => {
const config = createOptimizedAudioStream();
// Apply codec and bitrate settings
audioTrack.setEncodingParams({
maxBitrate: config.audio.bitrate,
codecName: config.audio.codec,
});
// Create optimized audio processor
const audioContext = new AudioContext({
sampleRate: config.audio.sampleRate,
latencyHint: 'interactive',
});
// Create buffer with optimal settings
const bufferSource = audioContext.createBufferSource();
bufferSource.buffer = audioContext.createBuffer(
config.audio.channels,
config.buffer.initial * config.audio.sampleRate / 1000,
config.audio.sampleRate
);
// Connect audio processing nodes optimized for streaming
const streamProcessor = audioContext.createScriptProcessor(1024, 1, 1);
streamProcessor.onaudioprocess = (e) => {
// Process audio chunks with minimal latency
// This function runs in the audio thread for real-time performance
};
return { track: audioTrack, context: audioContext, processor: streamProcessor };
};
4. Streaming Function Calling
To maintain real-time responsiveness even when calling external functions, we can implement streaming function calls:
Figure 10: Parallel function execution with streaming responses
// Define streaming-optimized functions for real-time use
const streamingFunctions = [
{
name: 'stream_search',
description: 'Search for information with streaming results',
parameters: {
type: 'object',
properties: {
query: {
type: 'string',
description: 'The search query',
},
},
required: ['query'],
},
// Streaming function implementation
streamingHandler: async function*(query: string) {
// Initialize search
const search = initializeSearch(query);
// Yield initial quick results immediately
yield { partial: true, results: search.quickResults() };
// Start deeper search in background
const searchPromise = search.executeFullSearch();
// Stream partial results as they become available
for await (const partialResult of search.streamResults()) {
yield {
partial: true,
results: partialResult,
confidence: partialResult.confidence
};
}
// Final complete results
const finalResults = await searchPromise;
yield { partial: false, results: finalResults };
}
},
{
name: 'get_weather',
description: 'Get weather information for a location with streaming updates',
parameters: {
type: 'object',
properties: {
location: {
type: 'string',
description: 'The city and state, e.g. San Francisco, CA',
},
},
required: ['location'],
},
// Stream results as they become available
streamingHandler: async function*(location: string) {
// Yield initial estimate from cache immediately
yield {
partial: true,
source: 'cache',
temperature: await getEstimatedTemperature(location),
conditions: 'Loading...'
};
// Make API request in background
const weatherPromise = fetchWeatherData(location);
// Stream partial weather data as it loads
yield {
partial: true,
source: 'preliminary',
temperature: await getRealtimeTemperature(location),
conditions: await getBasicConditions(location)
};
// Final complete weather data
const weather = await weatherPromise;
yield {
partial: false,
source: 'api',
temperature: weather.temperature,
conditions: weather.conditions,
forecast: weather.forecast
};
}
}
];
// Configure streaming function calling
model: new openai.realtime.RealtimeModel({
// ... other options
// Enable streaming function calling
functions: streamingFunctions,
function_calling: {
enabled: true,
// Enhanced options for streaming function calls
streaming: {
// Allow early function calling before user finishes speaking
earlyInvocation: true,
// Process streaming function results in parallel with ongoing generation
parallelProcessing: true,
// Interleave function results with text generation
interleaveResults: true,
// Invoke functions with partial parameter data
allowPartialParams: true,
// Queue multiple function calls in parallel
parallelExecution: true,
// Progress updates for long-running functions
progressUpdates: true,
}
},
})
With streaming function calls, the assistant can call external APIs while continuing to generate responses, and can even start calling functions before the user has finished speaking. This creates a much more responsive and natural interaction.
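To make the mechanics concrete, consuming one of these handlers is ordinary async-generator iteration. The sketch below drives the stream_search handler defined above with a sample query (illustrative only; error handling omitted):
// Illustrative: drive a streaming function handler and log partial results
async function runStreamSearchExample() {
  const handler = streamingFunctions[0].streamingHandler;
  for await (const update of handler('latest LiveKit release')) {
    if (update.partial) {
      console.log('partial results:', update.results);
    } else {
      console.log('final results:', update.results);
    }
  }
}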
5. Multi-stream Conversation Context
For more natural conversations, we can implement a multi-stream approach that allows for overlapping turns and remembers context:
// Multi-stream conversation context configuration
conversation: {
// Enable natural overlapping conversation
continuousTalking: true,
// How much conversation history to maintain (tokens)
contextWindowSize: 4000,
// Strategy for handling interruptions
interruptionHandling: {
// Allow user to interrupt AI responses
allowUserInterruptions: true,
// How to handle user interruptions
// "pause", "abandon", "complete", "background"
userInterruptBehavior: "pause",
// Allow AI to politely interrupt user in appropriate moments
allowAiInterruptions: true,
// When AI can interrupt (0-1, higher = more likely to interrupt)
aiInterruptThreshold: 0.7,
},
// Enhanced memory model for streaming context
memoryModel: {
// Remember more recent conversation turns in detail
recencyBias: 0.8,
// Maintain entity knowledge across conversation
entityTracking: true,
// Compress older conversation parts for efficiency
compressionEnabled: true,
// Hierarchical memory model for better context
hierarchical: true,
// Long-term conversation state persistence
persistenceStrategy: "session",
},
// Context adaptation for personalization
contextAdaptation: {
// Adapt to user speaking style
adaptToUserStyle: true,
// Learn user preferences over time
preferenceTracking: true,
// Remember correction patterns
learnFromCorrections: true,
}
}
These advanced features represent the cutting edge of real-time voice assistant technology, creating a truly responsive and natural streaming experience that feels like talking to another person rather than a computer.
Deployment and Production Considerations
When deploying your voice assistant to production, consider the following:
1. Scalability
For production deployments, you should set up a dedicated LiveKit server or use their cloud offering with appropriate scaling. Each voice conversation requires dedicated server resources.
2. Error Handling
Implement robust error handling for network disruptions, API failures, and device permission issues.
// Example of enhanced error handling for microphone permissions
const startMicrophone = async () => {
try {
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
// Process the stream
return stream;
} catch (error) {
if (error instanceof DOMException) {
if (error.name === 'NotAllowedError') {
// Handle permission denied
return { error: 'Microphone permission denied' };
} else if (error.name === 'NotFoundError') {
// Handle no microphone available
return { error: 'No microphone found' };
}
}
// Handle other errors
console.error('Microphone error:', error);
return { error: 'Failed to access microphone' };
}
};
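Network disruptions deserve the same care as permission errors. livekit-client emits connection lifecycle events on the Room object that you can surface in the UI; a minimal sketch:
// Surface LiveKit connection state changes to the user
import { Room, RoomEvent } from 'livekit-client';

function attachConnectionHandlers(room: Room, onStatus: (status: string) => void) {
  room.on(RoomEvent.Reconnecting, () => onStatus('Connection lost, reconnecting...'));
  room.on(RoomEvent.Reconnected, () => onStatus('Reconnected'));
  room.on(RoomEvent.Disconnected, () => onStatus('Disconnected'));
}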
3. Monitoring and Analytics
Implement monitoring to track API usage, conversation quality, and system performance.
// Simple analytics tracking
const trackConversation = (roomId: string, data: {
duration: number;
messageCount: number;
userSpeakingTime: number;
assistantSpeakingTime: number;
errorCount: number;
}) => {
// Send to your analytics service
fetch('/api/analytics/conversation', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify({
roomId,
...data,
timestamp: new Date().toISOString(),
}),
});
};
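The matching route handler can start as a thin pass-through to whatever analytics backend you use; a placeholder sketch:
// app/api/analytics/conversation/route.ts (sketch)
import { NextRequest, NextResponse } from 'next/server';

export async function POST(req: NextRequest) {
  const event = await req.json();
  // Replace this with your analytics or data-warehouse client of choice
  console.log('conversation analytics event', event);
  return NextResponse.json({ ok: true });
}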
4. Cost Management
Both OpenAI and LiveKit charge based on usage. Implement controls to manage costs:
- Set time limits for conversations (see the sketch after this list)
- Implement rate limiting for users
- Monitor and set alerts for usage thresholds
- Consider using smaller models for cost-sensitive applications
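As an example of the first point, a hard conversation time limit can live entirely on the client. This hook is a sketch that assumes the disconnect function from our LiveKitProvider and a cap you choose yourself:
// hooks/useConversationTimeLimit.ts (sketch)
import { useEffect } from 'react';

const MAX_CONVERSATION_MS = 10 * 60 * 1000; // 10 minutes, adjust to your budget

export function useConversationTimeLimit(isConnected: boolean, disconnect: () => void) {
  useEffect(() => {
    if (!isConnected) return;
    const timer = setTimeout(() => {
      console.warn('Conversation time limit reached, disconnecting');
      disconnect();
    }, MAX_CONVERSATION_MS);
    return () => clearTimeout(timer);
  }, [isConnected, disconnect]);
}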
Conclusion: The Future of Streaming Voice Assistants
In this comprehensive guide, we've built a truly real-time streaming voice-to-voice assistant that responds naturally and fluidly using OpenAI's streaming APIs, LiveKit, WebRTC, and React. We've covered:
- Setting up the project infrastructure for real-time communication
- Implementing streaming as a core component with continuous audio flow
- Creating a streaming-optimized LiveKit agent with OpenAI's real-time model
- Building responsive backend API endpoints that support streaming data
- Developing frontend components with streaming UI updates that respond in real-time
- Adding advanced streaming features like optimized audio processing and streaming function calling
- Implementing sophisticated Voice Activity Detection for natural conversation flow
- Fine-tuning performance parameters for lowest possible latency
- Considering production deployment for high-performance streaming applications
Figure 11: Future directions for streaming voice technology
The streaming voice assistant we've built represents a fundamental shift from the turn-taking model of traditional voice assistants to a truly natural, continuous conversation flow. By implementing streaming at every level - from audio capture to AI processing to response generation - we've created an experience that feels responsive and human-like.
This architecture is highly extensible, allowing you to build more sophisticated applications on this streaming foundation, such as:
- Real-time language translation during conversations
- Multi-participant voice meetings with AI moderation
- Ambient voice assistants that listen and respond contextually
- Emotion-aware voice interfaces that adapt to user sentiment
- Voice assistants that learn and improve from continuous conversation
As streaming AI technology continues to evolve, the latency gap between human-to-human and human-to-AI conversation will continue to shrink. The streaming-first approach outlined in this guide provides a solid foundation that can be adapted to incorporate new capabilities as they become available, ensuring your voice applications remain at the cutting edge of what's possible.
Remember: without streaming, voice assistants feel robotic and unnatural. With proper streaming implementation throughout the technology stack, we can create truly seamless, responsive, and natural voice interfaces that transform how humans interact with AI.
Further Reading
Additional resources to deepen your understanding:
Key Resources
- Official LiveKit Agents documentation for building real-time AI applications
- OpenAI's repository for real-time agents with examples and documentation
- Official Next.js documentation for building React applications with server components