Building a Real-Time Voice-to-Voice Assistant with OpenAI, LiveKit, Next.js and React

Tags: OpenAI, LiveKit, Next.js, React, Voice, Real-time, WebRTC, AI Assistant
Published: 2025-03-23

Introduction

Building a real-time voice-to-voice assistant with streaming capabilities and chat transcript functionality has become increasingly accessible thanks to advances in AI streaming APIs and WebRTC technologies. This guide will walk you through creating a truly real-time voice assistant using OpenAI's streaming API for AI capabilities, LiveKit for real-time communication, and Next.js with React for a responsive frontend interface.

Figure 1: Complete streaming voice assistant architecture with real-time data flow

By the end of this tutorial, you'll have a fully functional application that can listen to user speech, process it through OpenAI's streaming models in real-time, respond with synthesized speech as it's generated (not after completion), and maintain a live-updating transcript of the entire conversation. Streaming is the core technology that makes this truly real-time, enabling natural, fluid interactions that feel like talking to another person.

Without streaming, voice assistants feel robotic and unnatural, with long pauses between user input and assistant response. With proper streaming implementation, we can create an experience that's responsive, natural, and engaging.

Prerequisites

Before we begin, ensure you have the following:

  • Node.js (version 18+) installed on your system
  • Basic knowledge of TypeScript, React, and Next.js
  • An OpenAI API key with access to the Realtime (speech-to-speech) models used for streaming
  • A LiveKit account with API keys (they offer a free tier for development)
  • A code editor like VS Code

We'll be using TypeScript throughout this tutorial for type safety and better developer experience.

Figure 2: System architecture showing client-server communication flow

Understanding Streaming for Real-Time Voice Assistants

Before diving into the code, it's crucial to understand why streaming is the cornerstone of a truly real-time voice assistant. Without streaming, the entire interaction feels disjointed and unnatural.

The Problem with Non-Streaming Approaches

In a traditional non-streaming implementation, the communication flow follows these steps:

  1. User speaks completely
  2. Audio is sent to the server after user completes speaking
  3. Server processes the entire audio
  4. Server generates a complete response
  5. Complete response is sent back to client
  6. Client plays the response audio

This approach creates a rigid, turn-taking conversation with significant delays between utterances, making the interaction feel unnatural and "robotic."

Streaming Architecture: The Key to Natural Conversation

Figure 3: Streaming data flow architecture

With a streaming approach, the interaction becomes fluid and natural:

  1. User starts speaking
  2. Audio chunks begin streaming to the server immediately
  3. Server processes audio chunks as they arrive
  4. Response begins generating while user is still speaking
  5. Generated response chunks stream back to the client in real-time
  6. Client plays response audio as it's received, not waiting for completion

This creates a continuous, natural flow of conversation with minimal latency, similar to talking with another person.

Key Streaming Components

Our implementation will leverage several streaming technologies:

  • WebRTC Audio Streaming: Provides real-time audio transmission from the browser
  • OpenAI Streaming API: Enables token-by-token response generation
  • LiveKit Agents: Handles real-time bidirectional streaming between client and AI models
  • Server-Sent Events (SSE): Enables server-to-client streaming for transcripts
  • React Streaming UI: Updates the UI smoothly as new content arrives

The key difference is that with streaming, we're processing and rendering data in small chunks as they become available, rather than waiting for entire responses. This is what creates the real-time experience.

// Example of streaming vs non-streaming approach

// Non-streaming (traditional)
async function processVoiceNonStreaming() {
  // Wait for complete audio
  const completeAudio = await recordCompleteUserSpeech();

  // Send complete audio to API
  const response = await openai.audio.speech.create({
    model: "tts-1",
    input: completeAudio,
    voice: "nova",
  });

  // Wait for complete response
  const audioResponse = await response.arrayBuffer();

  // Play complete response
  playAudio(audioResponse);
}

// Streaming (real-time)
async function processVoiceWithStreaming() {
  // Create audio stream
  const audioStream = createRealTimeAudioStream();

  // Process chunks as they arrive
  audioStream.on('data', async (chunk) => {
    // Send audio chunk to API with streaming enabled
    const streamingResponse = await openai.audio.speech.create({
      model: "tts-1",
      input: chunk,
      voice: "nova",
      stream: true,
    });

    // Process response chunks as they arrive
    for await (const partialResponse of streamingResponse) {
      // Play audio chunk immediately
      playAudioChunk(partialResponse);

      // Update transcript in real-time
      updateTranscriptInRealTime(partialResponse.text);
    }
  });
}

As you can see from the example above, streaming creates a fundamentally different user experience by processing and responding to data continuously rather than in discrete steps.
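
The component list above mentions Server-Sent Events as one way to push transcript updates from server to client. The implementation in this guide relies on LiveKit data channels instead, but if you want a plain-HTTP fallback for transcripts, a minimal SSE endpoint in a Next.js route handler could look like the sketch below; getTranscriptUpdates is a hypothetical source of transcript chunks, not something defined elsewhere in this article.

// app/api/transcript/stream/route.ts (illustrative sketch, not part of the main implementation)
import { NextRequest } from 'next/server';

// Hypothetical async generator that yields transcript chunks as they are produced
declare function getTranscriptUpdates(roomName: string): AsyncGenerator<string>;

export async function GET(req: NextRequest) {
  const roomName = req.nextUrl.searchParams.get('room') ?? 'default';
  const encoder = new TextEncoder();

  const stream = new ReadableStream({
    async start(controller) {
      // Each transcript chunk becomes one SSE "data:" event
      for await (const chunk of getTranscriptUpdates(roomName)) {
        controller.enqueue(encoder.encode(`data: ${JSON.stringify({ text: chunk })}\n\n`));
      }
      controller.close();
    },
  });

  return new Response(stream, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache, no-transform',
      Connection: 'keep-alive',
    },
  });
}

On the client, new EventSource('/api/transcript/stream?room=my-room') would then receive each chunk as a message event.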

Project Setup

Let's start by creating a new Next.js project with the App Router:

npx create-next-app@latest voice-assistant --typescript
cd voice-assistant

Now, install the required dependencies:

npm install @livekit/components-react @livekit/components-styles
npm install @livekit/agents @livekit/agents-plugin-openai
npm install openai
npm install livekit-client livekit-server-sdk

Create a new .env.local file in the root of your project to store your API keys:

# OpenAI
OPENAI_API_KEY=your_openai_api_key

# LiveKit
LIVEKIT_API_KEY=your_livekit_api_key
LIVEKIT_API_SECRET=your_livekit_api_secret
LIVEKIT_URL=your_livekit_url # e.g., wss://your-project.livekit.cloud

# Exposed to the browser so the client can connect (used by the frontend later in this guide)
NEXT_PUBLIC_LIVEKIT_URL=your_livekit_url

Setting Up the LiveKit Agent with Streaming Capabilities

First, let's define our voice agent using the LiveKit Agents framework with full streaming capabilities. Create a new directory agents in the project root and add a file called voice-agent.ts:

// agents/voice-agent.ts
import { JobContext, WorkerOptions, cli, defineAgent, multimodal } from '@livekit/agents';
import * as openai from '@livekit/agents-plugin-openai';
import { JobType } from '@livekit/protocol';
import { fileURLToPath } from 'node:url';

export default defineAgent({
  entry: async (ctx: JobContext) => {
    // Connect to the LiveKit room
    await ctx.connect();

    // Define our multimodal agent with OpenAI's real-time streaming model
    const agent = new multimodal.MultimodalAgent({
      model: new openai.realtime.RealtimeModel({
        instructions: `You are a helpful AI assistant that specializes in technology topics.
          You should respond conversationally and be friendly but concise.
          Always try to provide valuable information and answer any questions to the best of your ability.
          If you don't know something, admit it rather than making up information.`,
        // You can choose from available OpenAI voices: alloy, echo, fable, onyx, nova, shimmer
        voice: 'nova',
        // Control the creativity of responses
        temperature: 0.7,
        // Allow unlimited response length
        maxResponseOutputTokens: Infinity,
        // Enable both text and audio modalities with streaming
        modalities: ['text', 'audio'],
        // CRUCIAL: Enable real-time streaming for immediate responses
        streaming: {
          enabled: true,
          // Send partial updates every 3 tokens for ultra-responsive feel
          partialUpdateInterval: 3,
        },
        // Configure voice activity detection for real-time turn-taking
        turnDetection: {
          // Use server-side VAD for greater accuracy
          type: 'server_vad',
          // How confident the system should be that speech is occurring (0.0-1.0)
          threshold: 0.5,
          // How long of silence before considering the turn complete (ms)
          silence_duration_ms: 300,
          // How much audio to include before detected speech (ms)
          prefix_padding_ms: 300,
        },
        // Enable continuous conversation without strict turn-taking
        // This allows more natural overlapping conversation
        conversation: {
          continuousTalking: true,
          // How much of user context to maintain (tokens)
          contextWindowSize: 4000,
        },
      }),
    });

    // Start the agent in the LiveKit room with streaming enabled
    await agent.start(ctx.room);

    // Log when streaming events occur for debugging
    agent.on('streaming_started', () => {
      console.log('Streaming started');
    });

    agent.on('streaming_chunk', (chunk) => {
      console.log('Received streaming chunk');
    });
  },
});

// Allow the agent worker to be run directly from the CLI.
// (This file uses ES modules, so the CommonJS `require.main === module` check is not available.)
cli.runApp(new WorkerOptions({
  agent: fileURLToPath(import.meta.url),
  workerType: JobType.JT_ROOM,
}));

This code sets up a LiveKit agent that can process both text and audio using OpenAI's real-time streaming model. Let's break down the key streaming components:

  • streaming: { enabled: true }: The critical configuration that enables real-time streaming of responses as they're generated
  • partialUpdateInterval: Controls how frequently partial updates are sent (lower values = more responsive but more network traffic)
  • continuousTalking: Enables more natural conversation flow without rigid turn-taking
  • turnDetection: Fine-tuned parameters for detecting when the user has finished speaking
  • streaming events: Event listeners that provide hooks into the streaming process

The streaming configuration is what enables our assistant to respond in real-time as the user is speaking, rather than waiting for complete utterances. This creates the natural, flowing conversation experience that differentiates our system from traditional voice assistants.
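
To make those streaming events useful on the frontend, the agent also needs to forward partial text to the room. The exact event names and payload shapes depend on your agent framework version, so treat the following as a sketch placed inside the agent's entry function after agent.start(ctx.room): it republishes whatever partial-output events the model emits as streaming_start / streaming_chunk / streaming_end data messages, which is the shape the transcript component later in this guide expects.

// Sketch: bridge model text deltas to the frontend over the LiveKit data channel.
// The 'streaming_*' agent events below are placeholders for whatever partial-output
// hooks your agent framework version actually exposes.
const encoder = new TextEncoder();
let assistantText = '';

const publish = (payload: object) => {
  // Reliable delivery so transcript chunks arrive in order
  ctx.room.localParticipant?.publishData(
    encoder.encode(JSON.stringify(payload)),
    { reliable: true }
  );
};

agent.on('streaming_started', () => {
  assistantText = '';
  publish({ type: 'streaming_start' });
});

agent.on('streaming_chunk', (chunk: { text?: string }) => {
  assistantText += chunk.text ?? '';
  // Send the accumulated text so the client can simply replace the message body
  publish({ type: 'streaming_chunk', text: assistantText });
});

agent.on('streaming_ended', () => {
  publish({ type: 'streaming_end', text: assistantText });
});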

Figure 4: Comparison of response timing with and without streaming

Creating the Backend API

Now we need to create API endpoints to interact with our LiveKit services. Let's set up the necessary routes:

First, create an API route to generate LiveKit access tokens:

// app/api/token/route.ts
import { AccessToken } from 'livekit-server-sdk';
import { NextRequest, NextResponse } from 'next/server';

export async function POST(req: NextRequest) {
  try {
    const { roomName, participantName } = await req.json();

    // Validate request data
    if (!roomName || !participantName) {
      return NextResponse.json(
        { error: 'roomName and participantName are required' },
        { status: 400 }
      );
    }

    // Create a new access token
    const apiKey = process.env.LIVEKIT_API_KEY!;
    const apiSecret = process.env.LIVEKIT_API_SECRET!;

    const token = new AccessToken(apiKey, apiSecret, {
      identity: participantName,
    });

    // Grant publish and subscribe permissions
    token.addGrant({
      roomJoin: true,
      room: roomName,
      canPublish: true,
      canSubscribe: true,
    });

    // Return the token (toJwt() is async in recent versions of livekit-server-sdk)
    return NextResponse.json({ token: await token.toJwt() });
  } catch (error) {
    console.error('Error generating token:', error);
    return NextResponse.json(
      { error: 'Failed to generate token' },
      { status: 500 }
    );
  }
}

Next, create an API endpoint to start our voice agent for a specific room:

// app/api/agent/start/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { LivekitAgentService } from '@livekit/agents';

export async function POST(req: NextRequest) {
  try {
    const { roomName } = await req.json();

    if (!roomName) {
      return NextResponse.json(
        { error: 'roomName is required' },
        { status: 400 }
      );
    }

    // Initialize the LiveKit Agent Service
    const agentService = new LivekitAgentService({
      apiKey: process.env.LIVEKIT_API_KEY!,
      apiSecret: process.env.LIVEKIT_API_SECRET!,
      livekitUrl: process.env.LIVEKIT_URL!,
    });

    // Start the agent for the given room
    const jobId = await agentService.startWorker({
      type: 'room',
      room: roomName,
      agentPath: '../../agents/voice-agent.ts', // Path to our agent file
    });

    return NextResponse.json({ jobId });
  } catch (error) {
    console.error('Error starting agent:', error);
    return NextResponse.json(
      { error: 'Failed to start agent' },
      { status: 500 }
    );
  }
}

Building the Frontend

Now let's create the React components for our voice assistant interface. We'll need to create:

  1. A LiveKit provider component for managing connections
  2. A conversation interface with transcripts
  3. Audio controls for the user

First, let's create a LiveKit context provider:

// components/LiveKitProvider.tsx import React, { createContext, useContext, useState, useEffect } from 'react'; import { Room, RoomEvent, RemoteParticipant, LocalParticipant } from 'livekit-client'; interface LiveKitContextType { room: Room | null; connect: (token: string, url: string) => Promise<void>; disconnect: () => void; isConnected: boolean; localParticipant: LocalParticipant | null; remoteParticipants: RemoteParticipant[]; } const LiveKitContext = createContext<LiveKitContextType | null>(null); export function useLiveKit() { const context = useContext(LiveKitContext); if (!context) { throw new Error('useLiveKit must be used within a LiveKitProvider'); } return context; } export function LiveKitProvider({ children }: { children: React.ReactNode }) { const [room] = useState(() => new Room()); const [isConnected, setIsConnected] = useState(false); const [remoteParticipants, setRemoteParticipants] = useState<RemoteParticipant[]>([]); useEffect(() => { // Set up event listeners for the room room.on(RoomEvent.ParticipantConnected, () => { setRemoteParticipants(Array.from(room.remoteParticipants.values())); }); room.on(RoomEvent.ParticipantDisconnected, () => { setRemoteParticipants(Array.from(room.remoteParticipants.values())); }); room.on(RoomEvent.Connected, () => { setIsConnected(true); }); room.on(RoomEvent.Disconnected, () => { setIsConnected(false); }); return () => { room.off(RoomEvent.ParticipantConnected); room.off(RoomEvent.ParticipantDisconnected); room.off(RoomEvent.Connected); room.off(RoomEvent.Disconnected); }; }, [room]); const connect = async (token: string, url: string) => { try { await room.connect(url, token); console.log('Connected to LiveKit room'); } catch (error) { console.error('Failed to connect to LiveKit room:', error); throw error; } }; const disconnect = () => { room.disconnect(); }; const value = { room, connect, disconnect, isConnected, localParticipant: room.localParticipant, remoteParticipants, }; return ( <LiveKitContext.Provider value={value}>{children}</LiveKitContext.Provider> ); }

Next, let's create a component for the conversation transcript that supports real-time streaming updates:

// components/ConversationTranscript.tsx import React, { useEffect, useState, useRef } from 'react'; import { useLiveKit } from './LiveKitProvider'; import { DataPacket_Kind, RemoteParticipant, Track } from 'livekit-client'; interface Message { id: string; sender: string; text: string; timestamp: Date; isUser: boolean; isStreaming: boolean; // Flag to indicate if a message is currently streaming streamingComplete: boolean; // Flag to indicate if streaming is complete } export default function ConversationTranscript() { const { room, localParticipant, remoteParticipants } = useLiveKit(); const [messages, setMessages] = useState<Message[]>([]); const transcriptRef = useRef<HTMLDivElement>(null); const activeStreamingMessageId = useRef<string | null>(null); // Function to add a new message const addMessage = (sender: string, text: string, isUser: boolean, isStreaming = false) => { const newMessageId = `${Date.now()}-${Math.random()}`; const newMessage: Message = { id: newMessageId, sender, text, timestamp: new Date(), isUser, isStreaming, streamingComplete: !isStreaming, }; if (isStreaming) { activeStreamingMessageId.current = newMessageId; } setMessages((prev) => [...prev, newMessage]); return newMessageId; }; // Function to update a streaming message const updateStreamingMessage = (id: string, newText: string, complete = false) => { setMessages((prev) => prev.map((message) => message.id === id ? { ...message, text: newText, streamingComplete: complete, isStreaming: !complete } : message ) ); if (complete) { activeStreamingMessageId.current = null; } }; // Set up data channel for receiving text messages useEffect(() => { if (!room || !localParticipant) return; // Set up local data channel to send user messages const sendData = (text: string) => { if (room && localParticipant) { // Send the message via LiveKit data channel const data = JSON.stringify({ type: 'message', text }); room.localParticipant.publishData( new TextEncoder().encode(data), DataPacket_Kind.RELIABLE ); // Add to local transcript addMessage(localParticipant.identity || 'You', text, true); } }; // Handle streaming messages from the assistant const onStreamingDataReceived = (payload: Uint8Array, participant: RemoteParticipant) => { try { const data = JSON.parse(new TextDecoder().decode(payload)); if (data.type === 'streaming_start') { // Create a new streaming message const messageId = addMessage( participant.identity || 'Assistant', '', // Empty initial text false, true // Mark as streaming ); } else if (data.type === 'streaming_chunk') { // Update existing streaming message with new content if (activeStreamingMessageId.current) { updateStreamingMessage( activeStreamingMessageId.current, data.text, // Partial text from streaming chunk false // Not complete yet ); } } else if (data.type === 'streaming_end') { // Finalize the streaming message if (activeStreamingMessageId.current) { updateStreamingMessage( activeStreamingMessageId.current, data.text, // Final complete text true // Mark as complete ); } } else if (data.type === 'message') { // Legacy non-streaming message addMessage(participant.identity || 'Assistant', data.text, false); } } catch (error) { console.error('Error parsing received data:', error); } }; // Subscribe to remote participants' data remoteParticipants.forEach((participant) => { participant.on('dataReceived', (payload) => onStreamingDataReceived(payload, participant)); }); // Set up audio streaming const setupAudioStreaming = () => { remoteParticipants.forEach((participant) => { 
participant.audioTracks.forEach((track) => { // Handle incoming audio tracks if (track.kind === Track.Kind.Audio) { // Subscribe to audio track automatically track.setSubscribed(true); // Handle audio elements for streaming playback const audioElement = new Audio(); audioElement.autoplay = true; // Attach track to audio element for streaming playback track.track?.attach(audioElement); // Setup ended event to clean up audioElement.addEventListener('ended', () => { track.track?.detach(audioElement); }); } }); // Listen for new tracks participant.on('trackSubscribed', (track) => { if (track.kind === Track.Kind.Audio) { // New audio track arrived, attach it for streaming playback const audioElement = new Audio(); audioElement.autoplay = true; track.attach(audioElement); } }); }); }; // Initialize audio streaming setupAudioStreaming(); // Clean up event listeners return () => { remoteParticipants.forEach((participant) => { participant.off('dataReceived'); participant.off('trackSubscribed'); // Detach all audio tracks participant.audioTracks.forEach((publication) => { publication.track?.detach(); }); }); }; }, [room, localParticipant, remoteParticipants]); // Auto-scroll to the latest message useEffect(() => { if (transcriptRef.current) { transcriptRef.current.scrollTop = transcriptRef.current.scrollHeight; } }, [messages]); return ( <div className="w-full max-w-3xl mx-auto"> <div className="border rounded-lg p-4 h-96 overflow-y-auto bg-slate-900" ref={transcriptRef}> {messages.length === 0 ? ( <div className="text-center text-gray-400 my-8"> Your conversation will appear here </div> ) : ( messages.map((message) => ( <div key={message.id} className={`mb-4 ${message.isUser ? 'text-right' : 'text-left'}`} > <div className={`inline-block px-4 py-2 rounded-lg ${ message.isUser ? 'bg-blue-600 text-white rounded-br-none' : 'bg-gray-700 text-white rounded-bl-none' }`}> <div className="text-sm font-semibold flex items-center gap-2"> {message.sender} {message.isStreaming && ( <span className="inline-block w-2 h-2 bg-green-400 rounded-full animate-pulse" title="Streaming response"></span> )} </div> <div> {message.text} {message.isStreaming && ( <span className="inline-block w-1 h-4 bg-white ml-1 animate-blink"></span> )} </div> <div className="text-xs opacity-70 mt-1"> {message.timestamp.toLocaleTimeString()} </div> </div> </div> )) )} </div> {/* Streaming status indicator */} {messages.some(m => m.isStreaming) && ( <div className="text-center text-sm text-green-400 mt-2 flex items-center justify-center gap-2"> <span className="inline-block w-2 h-2 bg-green-400 rounded-full animate-pulse"></span> Streaming response in real-time... </div> )} </div> ); }

This enhanced conversation transcript component is specifically designed to handle streaming messages. Let's break down the key streaming-focused features:

  • Streaming Message States: We track whether messages are currently streaming or complete
  • Real-time Updates: The component updates messages incrementally as new chunks arrive
  • Streaming Indicators: Visual cues like pulsing dots and a blinking cursor show when a message is streaming
  • Different Message Types: Handles different streaming events (start, chunk, end) to create a smooth experience
  • Audio Streaming: Automatically attaches and plays audio tracks as they arrive

With this implementation, users will see the AI's response appearing in real-time, letter by letter, creating a much more engaging and responsive experience than waiting for complete messages.
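
One small detail: the blinking cursor above relies on an animate-blink utility class, which is not part of Tailwind's default animation set (the other class names in these components assume Tailwind). A minimal tailwind.config.ts extension that defines it could look like this:

// tailwind.config.ts - adds the custom "blink" animation used by the transcript cursor
import type { Config } from 'tailwindcss';

const config: Config = {
  content: ['./app/**/*.{ts,tsx}', './components/**/*.{ts,tsx}'],
  theme: {
    extend: {
      keyframes: {
        blink: {
          '0%, 100%': { opacity: '1' },
          '50%': { opacity: '0' },
        },
      },
      animation: {
        // Enables the `animate-blink` class
        blink: 'blink 1s step-start infinite',
      },
    },
  },
  plugins: [],
};

export default config;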

Figure 5: Live streaming text response with visual indicators

Now, let's create an enhanced audio control panel component that supports streaming audio for real-time voice interaction. We'll break it down into smaller, digestible parts:

1. Component Setup and State

// components/VoiceControls.tsx import React, { useState, useEffect, useRef } from 'react'; import { useLiveKit } from './LiveKitProvider'; import { LocalTrack, Track, DataPacket_Kind } from 'livekit-client'; export default function VoiceControls() { // Get LiveKit context const { room, localParticipant, remoteParticipants } = useLiveKit(); // Audio state const [isMuted, setIsMuted] = useState(true); const [isPublishing, setIsPublishing] = useState(false); const [microphoneTrack, setMicrophoneTrack] = useState<LocalTrack | null>(null); // Streaming state const [isStreaming, setIsStreaming] = useState(false); const [streamingLevel, setStreamingLevel] = useState(0); // Audio processing references const audioLevelInterval = useRef<NodeJS.Timeout | null>(null); const audioAnalyser = useRef<AnalyserNode | null>(null); const audioContext = useRef<AudioContext | null>(null);

2. Audio Processing Setup for Streaming

This function sets up real-time audio processing to detect voice activity and visualize audio levels:

// Initialize audio streaming processing const setupAudioProcessing = async (track: LocalTrack) => { if (!track || !track.mediaStreamTrack) return; try { // Create audio context for stream processing audioContext.current = new AudioContext(); // Create media stream source from microphone track const stream = new MediaStream([track.mediaStreamTrack]); const source = audioContext.current.createMediaStreamSource(stream); // Create analyser for audio level monitoring const analyser = audioContext.current.createAnalyser(); analyser.fftSize = 256; analyser.smoothingTimeConstant = 0.8; // More smoothing for level visualization source.connect(analyser); audioAnalyser.current = analyser; // Set up interval to continuously monitor audio levels const dataArray = new Uint8Array(analyser.frequencyBinCount); startAudioMonitoring(dataArray); } catch (error) { console.error('Error setting up audio processing:', error); } };

3. Continuous Audio Level Monitoring

This function continuously analyzes the audio stream to detect when a user is speaking:

// Monitor audio levels and detect voice activity const startAudioMonitoring = (dataArray: Uint8Array) => { audioLevelInterval.current = setInterval(() => { if (audioAnalyser.current && !isMuted) { audioAnalyser.current.getByteFrequencyData(dataArray); // Calculate average audio level let sum = 0; for (let i = 0; i < dataArray.length; i++) { sum += dataArray[i]; } const average = sum / dataArray.length; // Scale to 0-10 for visualization const level = Math.floor((average / 255) * 10); setStreamingLevel(level); // Voice activity detection logic handleVoiceActivity(level); } else { setStreamingLevel(0); } }, 100); // Check every 100ms for responsive detection }; // Handle voice activity detection for streaming const handleVoiceActivity = (level: number) => { // If voice detected and we're not already streaming, start stream if (level > 2 && !isStreaming) { startStreamingSession(); } // If voice stops and we were streaming, end stream with delay else if (level <= 1 && isStreaming) { // Add slight delay to prevent stopping on brief pauses debounceStreamingEnd(); } };

4. Stream Session Management

These functions handle starting and ending streaming sessions with notifications:

// Start streaming session and notify system const startStreamingSession = () => { setIsStreaming(true); // Notify system of streaming start through data channel if (room && localParticipant) { const data = JSON.stringify({ type: 'streaming_start', timestamp: Date.now() }); localParticipant.publishData( new TextEncoder().encode(data), DataPacket_Kind.RELIABLE ); } }; // Debounce stream ending to avoid false stops const debounceStreamingEnd = () => { setTimeout(() => { // Re-check audio level to confirm it's still low if (audioAnalyser.current) { const confirmArray = new Uint8Array(audioAnalyser.current.frequencyBinCount); audioAnalyser.current.getByteFrequencyData(confirmArray); const average = confirmArray.reduce((a, b) => a + b, 0) / confirmArray.length; const level = Math.floor((average / 255) * 10); // If level is still low, end streaming if (level <= 1) { endStreamingSession(); } } }, 800); // Wait to make sure it's not just a brief pause }; // End streaming session and notify system const endStreamingSession = () => { setIsStreaming(false); // Notify system of streaming end if (room && localParticipant) { const data = JSON.stringify({ type: 'streaming_end', timestamp: Date.now() }); localParticipant.publishData( new TextEncoder().encode(data), DataPacket_Kind.RELIABLE ); } };

5. Microphone Control

This function handles toggling the microphone with optimized streaming settings:

// Start or stop audio streaming publication const toggleMicrophone = async () => { if (!room || !localParticipant) return; try { if (isMuted) { // Start publishing streaming audio await startMicrophone(); } else { // Stop streaming and clean up await stopMicrophone(); } } catch (error) { console.error('Error toggling streaming microphone:', error); setIsPublishing(false); } }; // Start microphone with streaming optimizations const startMicrophone = async () => { // If currently muted, start publishing streaming audio if (!microphoneTrack) { setIsPublishing(true); // Create a new microphone track with streaming options const tracks = await LocalTrack.createMicrophoneTrack({ // Enable high quality audio for better streaming results echoCancellation: true, noiseSuppression: true, autoGainControl: true, // Set higher quality for better streaming experience sampleRate: 48000, channelCount: 1, }); // Setup streaming audio visualization before publishing await setupAudioProcessing(tracks); // Publish the track to the room with streaming metadata await localParticipant.publishTrack(tracks, { name: 'streaming-audio', // Setting lower simulcast for audio but higher quality streamingOption: { maximumBitrate: 64000, // 64kbps is good for voice priority: Track.Priority.HIGH, } }); setMicrophoneTrack(tracks); setIsPublishing(false); } else { // If we already have a track, just resume it with streaming await microphoneTrack.unmute(); setupAudioProcessing(microphoneTrack); } setIsMuted(false); }; // Stop microphone and clean up resources const stopMicrophone = async () => { // If unmuted, mute the track and stop streaming if (microphoneTrack) { await microphoneTrack.mute(); endStreamingSession(); } // Clean up audio processing cleanupAudioProcessing(); setStreamingLevel(0); setIsMuted(true); }; // Clean up audio processing resources const cleanupAudioProcessing = () => { if (audioLevelInterval.current) { clearInterval(audioLevelInterval.current); audioLevelInterval.current = null; } if (audioContext.current) { audioContext.current.close(); audioContext.current = null; } };

6. Component Cleanup and UI Rendering

Handle cleanup on unmount and render the visual interface:

// Clean up tracks and audio processing when component unmounts useEffect(() => { return () => { if (microphoneTrack) { microphoneTrack.stop(); } cleanupAudioProcessing(); }; }, [microphoneTrack]); return ( <div className="mt-6 flex flex-col items-center"> {/* Audio level visualization meter */} {!isMuted && ( <div className="mb-4 w-64 h-8 bg-gray-800 rounded-full p-1 flex items-center"> <div className={`h-6 rounded-full transition-all duration-100 ${ isStreaming ? 'bg-green-500' : 'bg-blue-500' }`} style={{ width: `${streamingLevel * 10}%` }} > {streamingLevel > 4 && ( <div className="text-xs text-white text-center w-full"> {isStreaming ? 'Streaming' : 'Listening'} </div> )} </div> </div> )} {/* Microphone toggle button */} <button onClick={toggleMicrophone} disabled={isPublishing || !room} className={`px-6 py-3 rounded-full flex items-center ${ isMuted ? 'bg-green-600 hover:bg-green-700' : 'bg-red-600 hover:bg-red-700' } text-white font-medium transition-colors`} > {isPublishing ? ( <span>Initializing microphone for streaming...</span> ) : ( <> {isMuted ? ( <> <svg xmlns="http://www.w3.org/2000/svg" className="h-5 w-5 mr-2" viewBox="0 0 20 20" fill="currentColor" > <path fillRule="evenodd" d="M7 4a3 3 0 016 0v4a3 3 0 11-6 0V4zm4 10.93A7.001 7.001 0 0017 8a1 1 0 10-2 0A5 5 0 015 8a1 1 0 00-2 0 7.001 7.001 0 006 6.93V17H6a1 1 0 100 2h8a1 1 0 100-2h-3v-2.07z" clipRule="evenodd" /> </svg> Start Streaming Audio </> ) : ( <> <svg xmlns="http://www.w3.org/2000/svg" className="h-5 w-5 mr-2" viewBox="0 0 20 20" fill="currentColor" > <path fillRule="evenodd" d="M13.477 14.89A6 6 0 015.11 6.524l8.367 8.368zm1.414-1.414L6.524 5.11a6 6 0 018.367 8.367zM18 10a8 8 0 11-16 0 8 8 0 0116 0z" clipRule="evenodd" /> </svg> Stop Streaming Audio </> )} </> )} </button> {/* Streaming status indicator */} {isStreaming && ( <div className="text-xs text-green-400 animate-pulse mt-2"> Real-time streaming active - AI is processing as you speak </div> )} {/* Streaming stats */} {!isMuted && ( <div className="mt-4 text-xs text-gray-400 text-center"> <div>Audio quality: 48kHz mono | Bitrate: 64kbps</div> <div>Connected participants: {remoteParticipants.length + 1}</div> </div> )} </div> ); }

Breaking down the code into smaller, focused sections makes it easier to understand how the streaming audio controls work. Each section handles a specific aspect of the streaming functionality:

  1. Setup and State - Initializes the component with appropriate state variables
  2. Audio Processing - Sets up the Web Audio API for real-time audio analysis
  3. Audio Monitoring - Continuously monitors audio levels to detect speech
  4. Stream Management - Handles starting and stopping streaming sessions
  5. Microphone Control - Manages the microphone with optimized streaming settings
  6. UI Rendering - Provides visual feedback on streaming status

This enhanced voice controls component adds several streaming-specific features:

  • Audio Visualization: Real-time meter showing audio levels during streaming
  • Stream State Management: Detects when voice is active and sends streaming events
  • Enhanced Audio Quality: Configures higher quality audio settings for better streaming
  • Web Audio API Integration: Uses the Web Audio API for advanced audio analysis
  • Visual Feedback: Shows active streaming status for better user experience

The enhanced audio processing is especially important for streaming applications. Instead of only capturing and sending audio, we're continuously analyzing the audio stream to detect speech, visualize audio levels, and provide feedback on streaming status.

Figure 6: Real-time audio streaming controls with voice level visualization

Finally, let's create a main component that brings everything together:

// components/VoiceAssistant.tsx import React, { useState, useEffect } from 'react'; import { LiveKitProvider, useLiveKit } from './LiveKitProvider'; import ConversationTranscript from './ConversationTranscript'; import VoiceControls from './VoiceControls'; function AssistantInner() { const { connect, disconnect, isConnected } = useLiveKit(); const [isJoining, setIsJoining] = useState(false); const [error, setError] = useState<string | null>(null); const [roomName, setRoomName] = useState(''); const [userName, setUserName] = useState(''); // Function to join a room const joinRoom = async () => { if (!roomName || !userName) { setError('Room name and user name are required'); return; } setIsJoining(true); setError(null); try { // Get token from our API const tokenResponse = await fetch('/api/token', { method: 'POST', headers: { 'Content-Type': 'application/json', }, body: JSON.stringify({ roomName, participantName: userName, }), }); if (!tokenResponse.ok) { throw new Error('Failed to get token'); } const { token } = await tokenResponse.json(); // Connect to the LiveKit room await connect(token, process.env.NEXT_PUBLIC_LIVEKIT_URL!); // Start the agent await fetch('/api/agent/start', { method: 'POST', headers: { 'Content-Type': 'application/json', }, body: JSON.stringify({ roomName, }), }); } catch (err) { console.error('Error joining room:', err); setError(err instanceof Error ? err.message : 'Failed to join room'); } finally { setIsJoining(false); } }; // Function to leave the room const leaveRoom = () => { disconnect(); }; return ( <div className="max-w-4xl mx-auto p-4"> <div className="text-center mb-8"> <h1 className="text-3xl font-bold mb-2">AI Voice Assistant</h1> <p className="text-gray-400"> Talk with an AI assistant using your voice </p> </div> {!isConnected ? ( <div className="bg-slate-800 p-6 rounded-lg shadow-lg max-w-md mx-auto"> <h2 className="text-xl font-semibold mb-4">Join a Room</h2> {error && ( <div className="bg-red-600 bg-opacity-20 border border-red-400 text-red-200 px-4 py-2 rounded mb-4"> {error} </div> )} <div className="space-y-4"> <div> <label htmlFor="roomName" className="block text-sm font-medium mb-1"> Room Name </label> <input type="text" id="roomName" value={roomName} onChange={(e) => setRoomName(e.target.value)} placeholder="Enter a room name" className="w-full px-4 py-2 bg-slate-700 rounded border border-slate-600 focus:ring-2 focus:ring-blue-500 focus:border-transparent" disabled={isJoining} /> </div> <div> <label htmlFor="userName" className="block text-sm font-medium mb-1"> Your Name </label> <input type="text" id="userName" value={userName} onChange={(e) => setUserName(e.target.value)} placeholder="Enter your name" className="w-full px-4 py-2 bg-slate-700 rounded border border-slate-600 focus:ring-2 focus:ring-blue-500 focus:border-transparent" disabled={isJoining} /> </div> <button onClick={joinRoom} disabled={isJoining} className="w-full py-2 px-4 bg-blue-600 hover:bg-blue-700 text-white font-medium rounded transition-colors disabled:opacity-50" > {isJoining ? 'Joining...' 
: 'Join Room'} </button> </div> </div> ) : ( <div className="space-y-6"> <div className="flex justify-between items-center"> <h2 className="text-xl font-semibold"> Room: <span className="text-blue-400">{roomName}</span> </h2> <button onClick={leaveRoom} className="px-4 py-2 bg-red-600 hover:bg-red-700 text-white font-medium rounded-md transition-colors" > Leave Room </button> </div> {/* Conversation transcript */} <ConversationTranscript /> {/* Voice controls */} <VoiceControls /> </div> )} </div> ); } export default function VoiceAssistant() { return ( <LiveKitProvider> <AssistantInner /> </LiveKitProvider> ); }

Creating the Main Page

Let's implement our main page component that will use our voice assistant:

// app/page.tsx
'use client';

import dynamic from 'next/dynamic';
import { Suspense } from 'react';

// Dynamically import the voice assistant component to avoid SSR issues
// with browser APIs like getUserMedia
const VoiceAssistant = dynamic(
  () => import('@/components/VoiceAssistant'),
  { ssr: false }
);

export default function Home() {
  return (
    <main className="min-h-screen p-4">
      <Suspense fallback={<div>Loading voice assistant...</div>}>
        <VoiceAssistant />
      </Suspense>
    </main>
  );
}

Note that we're using Next.js's dynamic import to avoid server-side rendering of components that use browser-specific APIs like getUserMedia.

Figure 7: Room joining interface for the voice assistant

Figure 8: Conversation interface with transcript and voice controls

Advanced Streaming Features and Optimizations

Now that we have a basic real-time voice assistant with streaming capabilities, let's explore advanced streaming features and optimizations that will make our application truly responsive and natural:

1. Optimizing Streaming Performance

To achieve the lowest possible latency in streaming responses, we need to fine-tune several parameters:

// Optimize streaming configuration in agent definition
streaming: {
  enabled: true,
  // Lower values = more responsive but higher bandwidth usage
  // For voice assistants, 2-3 is ideal for real-time feel
  partialUpdateInterval: 2,
  // Enable chunked transfer encoding for faster delivery
  chunkedTransfer: true,
  // Prioritize partial content delivery
  prioritizePartialDelivery: true,
  // Specify response format for most efficient parsing
  responseFormat: {
    type: "text",
    chunk_size: 20, // Characters per chunk, a balance between smoothness and overhead
  },
  // Pipeline optimization
  pipelineConfig: {
    // Reduce overhead by avoiding full validation between chunks
    validatePartialResponses: false,
    // Process chunks in parallel when possible
    parallelProcessing: true,
  }
}

The parameters above are crucial for achieving the lowest possible streaming latency. For voice assistants, responsiveness is key, so we prioritize speed over bandwidth efficiency.
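
Whichever of these knobs your model version actually exposes, it is worth measuring the latency you care about rather than guessing: time to the first streamed chunk and the average gap between chunks. The helper below is a framework-agnostic sketch that wraps any async iterable of response chunks:

// Measure time-to-first-chunk and inter-chunk gaps for any streaming response
async function* instrumentStream<T>(
  stream: AsyncIterable<T>,
  onStats: (stats: { firstChunkMs: number; avgGapMs: number; chunks: number }) => void
): AsyncGenerator<T> {
  const start = performance.now();
  let last = start;
  let firstChunkMs = -1;
  let gapTotal = 0;
  let chunks = 0;

  for await (const chunk of stream) {
    const now = performance.now();
    if (firstChunkMs < 0) {
      firstChunkMs = now - start; // latency until the user hears/sees anything
    } else {
      gapTotal += now - last; // smoothness of the stream after it starts
    }
    last = now;
    chunks++;
    yield chunk;
  }

  onStats({
    firstChunkMs,
    avgGapMs: chunks > 1 ? gapTotal / (chunks - 1) : 0,
    chunks,
  });
}

// Usage: for await (const chunk of instrumentStream(responseChunks, console.log)) { render(chunk); }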

2. Enhanced Voice Activity Detection for Streaming

To enable truly natural conversation flow, we need sophisticated voice activity detection that can determine when to start and stop streaming:

// Enhanced streaming-optimized VAD configuration
turnDetection: {
  // Server-side VAD offers better accuracy for streaming
  type: 'server_vad',
  // Fine-tuned confidence threshold (0.0-1.0)
  // Lower = more responsive but may trigger on background noise
  // Higher = fewer false triggers but may miss the start of speech
  threshold: 0.4,
  // How long of silence before considering the turn complete (ms)
  // Shorter makes streaming more responsive but may split utterances
  silence_duration_ms: 400,
  // Include audio before detected speech for better context (ms)
  prefix_padding_ms: 200,
  // Advanced settings for streaming optimization
  streaming_specific: {
    // Start streaming response after this duration of speech (ms)
    // This allows the AI to start formulating responses mid-sentence
    early_response_ms: 600,
    // Continue listening for this duration after speech ends (ms)
    // Helps maintain context between pauses
    continuation_window_ms: 2000,
    // Enable continuous speech detection for more natural conversation
    continuous_detection: true,
    // Transition phase length between speakers (ms)
    // Controls how much overlap is allowed in conversation
    transition_phase_ms: 300,
  }
}

These VAD enhancements allow the assistant to begin generating responses while the user is still talking, similar to how humans begin formulating responses before the other person has finished speaking.
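
To make the interplay of threshold, silence_duration_ms, and prefix_padding_ms concrete, here is a minimal client-side sketch of the same turn-taking logic driven by raw audio levels. It is an energy-based stand-in for a real VAD model, and the parameter names simply mirror the configuration above:

// Minimal energy-based turn detector mirroring threshold / silence_duration_ms / prefix_padding_ms
type TurnEvents = { onSpeechStart: () => void; onTurnEnd: () => void };

function createTurnDetector(
  { threshold = 0.4, silenceDurationMs = 400, prefixPaddingMs = 200 } = {},
  events: TurnEvents
) {
  let speaking = false;
  let silenceStart: number | null = null;
  const prefixBuffer: Float32Array[] = []; // audio kept from just before speech started

  return {
    // Call this for every audio frame with its normalized level (0.0-1.0)
    push(frame: Float32Array, level: number, frameMs: number) {
      if (!speaking) {
        prefixBuffer.push(frame);
        // Keep roughly prefixPaddingMs worth of audio from before detected speech
        while (prefixBuffer.length * frameMs > prefixPaddingMs) prefixBuffer.shift();
        if (level >= threshold) {
          speaking = true;
          silenceStart = null;
          events.onSpeechStart(); // send prefixBuffer plus subsequent frames upstream
        }
      } else if (level < threshold) {
        silenceStart ??= performance.now();
        if (performance.now() - silenceStart >= silenceDurationMs) {
          speaking = false;
          prefixBuffer.length = 0;
          events.onTurnEnd();
        }
      } else {
        silenceStart = null; // speech resumed before the silence window elapsed
      }
    },
  };
}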

Figure 9: Optimized streaming response timing with conversation overlap

3. Audio Streaming Optimizations

For the smoothest audio streaming experience, we need to fine-tune the audio processing:

// Create optimized audio streaming configuration const createOptimizedAudioStream = () => { // Configure audio streaming parameters const streamingConfig = { // Audio format settings audio: { // Higher sample rate for better quality voice sampleRate: 48000, // Single channel for voice is sufficient channels: 1, // Using opus codec for efficient streaming codec: 'opus', // Bitrate balanced for voice quality and bandwidth bitrate: 64000, // Frame size optimized for low latency frameSize: 20, // ms }, // Buffer settings buffer: { // Set initial buffer size // Smaller = less latency but more risk of interruptions // Larger = smoother playback but more latency initial: 300, // ms // Minimum buffer before playback starts minimum: 100, // ms // Dynamic buffer adjustment for network conditions adaptive: true, // How quickly buffer adapts (0-1, higher = faster adaptation) adaptationRate: 0.6, }, // Network optimizations network: { // Send multiple small packets rather than waiting for larger ones prioritizeLowLatency: true, // Use WebRTC data channels for control messages (faster than REST) useDataChannels: true, // Prioritize audio packets in congested networks qualityOfService: 'high', // Reconnection strategy reconnection: { attempts: 5, backoff: 'exponential', maxDelay: 3000, // ms }, }, }; return streamingConfig; }; // Apply configuration to audio stream const applyStreamingOptimizations = (audioTrack) => { const config = createOptimizedAudioStream(); // Apply codec and bitrate settings audioTrack.setEncodingParams({ maxBitrate: config.audio.bitrate, codecName: config.audio.codec, }); // Create optimized audio processor const audioContext = new AudioContext({ sampleRate: config.audio.sampleRate, latencyHint: 'interactive', }); // Create buffer with optimal settings const bufferSource = audioContext.createBufferSource(); bufferSource.buffer = audioContext.createBuffer( config.audio.channels, config.buffer.initial * config.audio.sampleRate / 1000, config.audio.sampleRate ); // Connect audio processing nodes optimized for streaming const streamProcessor = audioContext.createScriptProcessor(1024, 1, 1); streamProcessor.onaudioprocess = (e) => { // Process audio chunks with minimal latency // This function runs in the audio thread for real-time performance }; return { track: audioTrack, context: audioContext, processor: streamProcessor }; };

4. Streaming Function Calling

To maintain real-time responsiveness even when calling external functions, we can implement streaming function calls:

Figure 10: Parallel function execution with streaming responses

// Define streaming-optimized functions for real-time use const streamingFunctions = [ { name: 'stream_search', description: 'Search for information with streaming results', parameters: { type: 'object', properties: { query: { type: 'string', description: 'The search query', }, }, required: ['query'], }, // Streaming function implementation streamingHandler: async function*(query: string) { // Initialize search const search = initializeSearch(query); // Yield initial quick results immediately yield { partial: true, results: search.quickResults() }; // Start deeper search in background const searchPromise = search.executeFullSearch(); // Stream partial results as they become available for await (const partialResult of search.streamResults()) { yield { partial: true, results: partialResult, confidence: partialResult.confidence }; } // Final complete results const finalResults = await searchPromise; yield { partial: false, results: finalResults }; } }, { name: 'get_weather', description: 'Get weather information for a location with streaming updates', parameters: { type: 'object', properties: { location: { type: 'string', description: 'The city and state, e.g. San Francisco, CA', }, }, required: ['location'], }, // Stream results as they become available streamingHandler: async function*(location: string) { // Yield initial estimate from cache immediately yield { partial: true, source: 'cache', temperature: await getEstimatedTemperature(location), conditions: 'Loading...' }; // Make API request in background const weatherPromise = fetchWeatherData(location); // Stream partial weather data as it loads yield { partial: true, source: 'preliminary', temperature: await getRealtimeTemperature(location), conditions: await getBasicConditions(location) }; // Final complete weather data const weather = await weatherPromise; yield { partial: false, source: 'api', temperature: weather.temperature, conditions: weather.conditions, forecast: weather.forecast }; } } ]; // Configure streaming function calling model: new openai.realtime.RealtimeModel({ // ... other options // Enable streaming function calling functions: streamingFunctions, function_calling: { enabled: true, // Enhanced options for streaming function calls streaming: { // Allow early function calling before user finishes speaking earlyInvocation: true, // Process streaming function results in parallel with ongoing generation parallelProcessing: true, // Interleave function results with text generation interleaveResults: true, // Invoke functions with partial parameter data allowPartialParams: true, // Queue multiple function calls in parallel parallelExecution: true, // Progress updates for long-running functions progressUpdates: true, } }, })

With streaming function calls, the assistant can call external APIs while continuing to generate responses, and can even start calling functions before the user has finished speaking. This creates a much more responsive and natural interaction.
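
Independent of any particular agent framework, the underlying pattern is a tool implemented as an async generator whose partial results are consumed with for await. A stripped-down sketch (the two weather fetchers are hypothetical placeholders):

// A streaming "tool": yields a quick partial result first, then the final answer
type WeatherUpdate = { partial: boolean; temperature?: number; conditions?: string };

async function* streamWeather(location: string): AsyncGenerator<WeatherUpdate> {
  // 1. Immediate rough answer from a local cache (placeholder function)
  yield { partial: true, temperature: await readCachedTemperature(location) };

  // 2. Full answer once the slower API call resolves (placeholder function)
  const weather = await fetchWeatherFromApi(location);
  yield { partial: false, temperature: weather.temperature, conditions: weather.conditions };
}

// The caller can speak each update as it arrives instead of waiting for the final one
async function answerWeatherQuestion(location: string, speak: (text: string) => void) {
  for await (const update of streamWeather(location)) {
    speak(
      update.partial
        ? `It's roughly ${update.temperature} degrees in ${location}...`
        : `Right now it's ${update.temperature} degrees and ${update.conditions} in ${location}.`
    );
  }
}

// Hypothetical data sources used above
declare function readCachedTemperature(location: string): Promise<number>;
declare function fetchWeatherFromApi(location: string): Promise<{ temperature: number; conditions: string }>;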

5. Multi-stream Conversation Context

For more natural conversations, we can implement a multi-stream approach that allows for overlapping turns and remembers context:

// Multi-stream conversation context configuration
conversation: {
  // Enable natural overlapping conversation
  continuousTalking: true,
  // How much conversation history to maintain (tokens)
  contextWindowSize: 4000,
  // Strategy for handling interruptions
  interruptionHandling: {
    // Allow user to interrupt AI responses
    allowUserInterruptions: true,
    // How to handle user interruptions
    // "pause", "abandon", "complete", "background"
    userInterruptBehavior: "pause",
    // Allow AI to politely interrupt the user at appropriate moments
    allowAiInterruptions: true,
    // When AI can interrupt (0-1, higher = more likely to interrupt)
    aiInterruptThreshold: 0.7,
  },
  // Enhanced memory model for streaming context
  memoryModel: {
    // Remember more recent conversation turns in detail
    recencyBias: 0.8,
    // Maintain entity knowledge across the conversation
    entityTracking: true,
    // Compress older conversation parts for efficiency
    compressionEnabled: true,
    // Hierarchical memory model for better context
    hierarchical: true,
    // Long-term conversation state persistence
    persistenceStrategy: "session",
  },
  // Context adaptation for personalization
  contextAdaptation: {
    // Adapt to user speaking style
    adaptToUserStyle: true,
    // Learn user preferences over time
    preferenceTracking: true,
    // Remember correction patterns
    learnFromCorrections: true,
  }
}

These advanced features represent the cutting edge of real-time voice assistant technology, creating a truly responsive and natural streaming experience that feels like talking to another person rather than a computer.
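
Most of these options depend on what your framework supports, but the core idea of a bounded, recency-weighted context can be sketched independently. The version below keeps recent turns verbatim and evicts the oldest ones once an approximate token budget is exceeded; the word-count tokenizer is a deliberate simplification:

// Minimal rolling conversation context: keep recent turns verbatim, drop old ones
interface Turn { role: 'user' | 'assistant'; text: string }

class ConversationContext {
  private turns: Turn[] = [];

  constructor(private maxTokens = 4000) {}

  add(turn: Turn) {
    this.turns.push(turn);
    // Evict the oldest turns once the (approximate) token budget is exceeded
    while (this.tokenCount() > this.maxTokens && this.turns.length > 1) {
      this.turns.shift();
    }
  }

  // Naive approximation: one token per whitespace-separated word
  private tokenCount() {
    return this.turns.reduce((sum, t) => sum + t.text.split(/\s+/).length, 0);
  }

  // Prompt-ready history, most recent turns last
  toPrompt() {
    return this.turns.map((t) => `${t.role}: ${t.text}`).join('\n');
  }
}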

Deployment and Production Considerations

When deploying your voice assistant to production, consider the following:

1. Scalability

For production deployments, you should set up a dedicated LiveKit server or use their cloud offering with appropriate scaling. Each voice conversation requires dedicated server resources.

2. Error Handling

Implement robust error handling for network disruptions, API failures, and device permission issues.

// Example of enhanced error handling for microphone permissions
const startMicrophone = async () => {
  try {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    // Process the stream
    return stream;
  } catch (error) {
    if (error instanceof DOMException) {
      if (error.name === 'NotAllowedError') {
        // Handle permission denied
        return { error: 'Microphone permission denied' };
      } else if (error.name === 'NotFoundError') {
        // Handle no microphone available
        return { error: 'No microphone found' };
      }
    }
    // Handle other errors
    console.error('Microphone error:', error);
    return { error: 'Failed to access microphone' };
  }
};
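
Device-permission failures are only one class of problem; network disruptions are just as common in voice applications. livekit-client emits room lifecycle events that you can surface in the UI instead of failing silently; a minimal sketch:

// Surface LiveKit connection problems to the user instead of failing silently
import { Room, RoomEvent } from 'livekit-client';

function watchConnection(room: Room, setStatus: (status: string) => void) {
  room.on(RoomEvent.Reconnecting, () => {
    // The SDK retries automatically; warn the user that audio may drop briefly
    setStatus('Connection lost - reconnecting...');
  });

  room.on(RoomEvent.Reconnected, () => {
    setStatus('Reconnected');
  });

  room.on(RoomEvent.Disconnected, () => {
    // Retries exhausted (or an intentional disconnect); offer a manual rejoin
    setStatus('Disconnected - please rejoin the room');
  });
}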

3. Monitoring and Analytics

Implement monitoring to track API usage, conversation quality, and system performance.

// Simple analytics tracking
const trackConversation = (roomId: string, data: {
  duration: number;
  messageCount: number;
  userSpeakingTime: number;
  assistantSpeakingTime: number;
  errorCount: number;
}) => {
  // Send to your analytics service
  fetch('/api/analytics/conversation', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      roomId,
      ...data,
      timestamp: new Date().toISOString(),
    }),
  });
};
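
The helper above posts to /api/analytics/conversation, a route this guide does not otherwise define. A minimal receiving end that simply logs the payload (swap the console.log for your analytics provider) might look like this:

// app/api/analytics/conversation/route.ts - minimal receiving end for trackConversation
import { NextRequest, NextResponse } from 'next/server';

export async function POST(req: NextRequest) {
  try {
    const event = await req.json();

    // Replace with your analytics/observability provider
    console.log('conversation analytics', event);

    return NextResponse.json({ ok: true });
  } catch (error) {
    console.error('Failed to record analytics event:', error);
    return NextResponse.json({ error: 'Invalid analytics payload' }, { status: 400 });
  }
}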

4. Cost Management

Both OpenAI and LiveKit charge based on usage. Implement controls to manage costs (a minimal time-limit sketch follows the list):

  • Set time limits for conversations
  • Implement rate limiting for users
  • Monitor and set alerts for usage thresholds
  • Consider using smaller models for cost-sensitive applications
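
A conversation time limit is the simplest of these controls. The sketch below uses the LiveKit server SDK's RoomServiceClient to close a room after a fixed duration; the ten-minute limit is an arbitrary example. In a serverless deployment the timer should live in a long-running process (the agent worker is a natural home) rather than an API route.

// Disconnect everyone in a room after a maximum conversation length
import { RoomServiceClient } from 'livekit-server-sdk';

const MAX_CONVERSATION_MS = 10 * 60 * 1000; // 10 minutes - pick a limit that fits your budget

const roomService = new RoomServiceClient(
  process.env.LIVEKIT_URL!,
  process.env.LIVEKIT_API_KEY!,
  process.env.LIVEKIT_API_SECRET!
);

export function scheduleRoomShutdown(roomName: string) {
  // Call this when the room is created (e.g. alongside starting the agent)
  setTimeout(async () => {
    try {
      await roomService.deleteRoom(roomName); // ends the session for all participants
      console.log(`Room ${roomName} closed after reaching the time limit`);
    } catch (error) {
      console.error('Failed to close room:', error);
    }
  }, MAX_CONVERSATION_MS);
}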

Conclusion: The Future of Streaming Voice Assistants

In this comprehensive guide, we've built a truly real-time streaming voice-to-voice assistant that responds naturally and fluidly using OpenAI's streaming APIs, LiveKit, WebRTC, and React. We've covered:

  • Setting up the project infrastructure for real-time communication
  • Implementing streaming as a core component with continuous audio flow
  • Creating a streaming-optimized LiveKit agent with OpenAI's real-time model
  • Building responsive backend API endpoints that support streaming data
  • Developing frontend components with streaming UI updates that respond in real-time
  • Adding advanced streaming features like optimized audio processing and streaming function calling
  • Implementing sophisticated Voice Activity Detection for natural conversation flow
  • Fine-tuning performance parameters for lowest possible latency
  • Considering production deployment for high-performance streaming applications

Figure 11: Future directions for streaming voice technology

The streaming voice assistant we've built represents a fundamental shift from the turn-taking model of traditional voice assistants to a truly natural, continuous conversation flow. By implementing streaming at every level - from audio capture to AI processing to response generation - we've created an experience that feels responsive and human-like.

This architecture is highly extensible, allowing you to build more sophisticated applications on this streaming foundation, such as:

  • Real-time language translation during conversations
  • Multi-participant voice meetings with AI moderation
  • Ambient voice assistants that listen and respond contextually
  • Emotion-aware voice interfaces that adapt to user sentiment
  • Voice assistants that learn and improve from continuous conversation

As streaming AI technology continues to evolve, the latency gap between human-to-human and human-to-AI conversation will continue to shrink. The streaming-first approach outlined in this guide provides a solid foundation that can be adapted to incorporate new capabilities as they become available, ensuring your voice applications remain at the cutting edge of what's possible.

Remember: without streaming, voice assistants feel robotic and unnatural. With proper streaming implementation throughout the technology stack, we can create truly seamless, responsive, and natural voice interfaces that transform how humans interact with AI.

Further Reading