Overview

Open WebUI provides comprehensive voice and video capabilities, enabling natural spoken interactions with AI models through multiple Speech-to-Text (STT) and Text-to-Speech (TTS) providers.

Speech-to-Text (STT)

Convert spoken audio to text using various providers.

Supported Providers

Faster Whisper - Self-hosted transcription
  • No API costs
  • Privacy-focused (local processing)
  • Multiple model sizes
  • GPU acceleration support
  • VAD filtering
{
  "STT_ENGINE": "",  // empty string selects the built-in local Whisper engine
  "WHISPER_MODEL": "base",  // tiny, base, small, medium, large
  "WHISPER_MODEL_AUTO_UPDATE": true
}

Transcription Workflow

1. Audio Input

User records or uploads audio:
  • Microphone capture
  • File upload
  • Supported formats: WAV, MP3, WebM, M4A, FLAC

2. Preprocessing

Audio preparation:
  • Format conversion (if needed)
  • Compression for large files
  • Splitting if the file exceeds size limits

3. Transcription

The audio is sent to the configured STT provider:
  • Process in chunks if necessary
  • Apply language settings
  • Handle diarization (Azure)

4. Result Assembly

Combine and format results:
  • Merge chunk transcriptions
  • Clean up text
  • Return to chat interface
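
The four steps above can be sketched as a small pipeline. This is an illustrative outline under simplifying assumptions, not Open WebUI's implementation: `split_into_chunks`, `transcribe_file`, and the `transcribe_chunk` callback are hypothetical names, and audio is modeled as raw bytes split at arbitrary offsets (real audio has to be split on frame boundaries).

```python
from typing import Callable, List


def split_into_chunks(audio: bytes, max_bytes: int) -> List[bytes]:
    """Split raw audio bytes into pieces of at most max_bytes (illustrative only)."""
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    return [audio[i:i + max_bytes] for i in range(0, len(audio), max_bytes)]


def transcribe_file(
    audio: bytes,
    transcribe_chunk: Callable[[bytes], str],
    max_bytes: int,
) -> str:
    """Chunk if needed, transcribe each piece, then merge and tidy the text."""
    chunks = split_into_chunks(audio, max_bytes) if len(audio) > max_bytes else [audio]
    parts = [transcribe_chunk(chunk) for chunk in chunks]
    # Merge chunk transcriptions and collapse stray whitespace.
    return " ".join(" ".join(parts).split())
```

Passing the provider call in as `transcribe_chunk` keeps the chunking and merging logic independent of which STT engine is configured.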

Audio File Processing

Format Support

Accepted formats:
  • WAV, MP3, WebM
  • M4A, FLAC, MPEG
  • MP4 (audio track)
  • Automatic conversion if needed

Size Limits

Maximum file sizes:
  • OpenAI/Deepgram/Mistral: 20MB
  • Azure: 200MB
  • Auto-compression if exceeded
  • Intelligent chunking
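
As a rough sketch of how these limits might be applied, the check below decides whether a file has to be compressed before upload. The provider keys, the `MAX_UPLOAD_BYTES` table, and `needs_compression` are illustrative names, and 1 MB is assumed to mean 1024 × 1024 bytes.

```python
MB = 1024 * 1024  # assumption: binary megabytes

# Limits taken from the list above; the provider keys are illustrative.
MAX_UPLOAD_BYTES = {
    "openai": 20 * MB,
    "deepgram": 20 * MB,
    "mistral": 20 * MB,
    "azure": 200 * MB,
}


def needs_compression(size_bytes: int, provider: str) -> bool:
    """Return True when a file exceeds the provider's upload limit."""
    try:
        limit = MAX_UPLOAD_BYTES[provider.lower()]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None
    return size_bytes > limit
```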

Compression & Chunking

Automatically handled by Open WebUI:
# From routers/audio.py:1109-1168
# Compression
- Sample rate (frame rate) reduced to 16 kHz
- Mono conversion
- 32kbps bitrate

# Chunking
- Splits large files intelligently
- Maintains audio quality
- Parallel processing
- Automatic cleanup
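
The compression settings above make it possible to estimate output size ahead of time: at a constant 32 kbps, audio takes 32,000 / 8 = 4,000 bytes per second. The sketch below uses that figure to estimate compressed size and to plan equal-length chunks that fit under a byte limit. `compressed_size_bytes` and `plan_chunks` are hypothetical helpers, and the estimate ignores container overhead.

```python
import math
from typing import List, Tuple

BITRATE_BPS = 32_000                 # 32 kbps, from the settings above
BYTES_PER_SECOND = BITRATE_BPS // 8  # 4,000 bytes of audio per second


def compressed_size_bytes(duration_s: float) -> int:
    """Estimate the size of constant-bitrate audio (ignores container overhead)."""
    return math.ceil(duration_s * BYTES_PER_SECOND)


def plan_chunks(duration_s: float, limit_bytes: int) -> List[Tuple[float, float]]:
    """Return (start, end) ranges in seconds so each chunk fits within limit_bytes."""
    max_chunk_s = limit_bytes / BYTES_PER_SECOND
    count = max(1, math.ceil(duration_s / max_chunk_s))
    step = duration_s / count
    return [(i * step, min((i + 1) * step, duration_s)) for i in range(count)]
```

Under these assumptions a one-hour recording compresses to roughly 14.4 million bytes, which fits in a single 20 MB upload, while a two-hour recording needs two chunks.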

Text-to-Speech (TTS)

Convert AI responses to natural-sounding speech.

Supported Engines

OpenAI API - High-quality voices
  • Multiple voices (alloy, echo, fable, onyx, nova, shimmer)
  • Natural intonation
  • Fast generation
  • API-based
{
  "TTS_ENGINE": "openai",
  "TTS_MODEL": "tts-1-hd",
  "TTS_VOICE": "nova",
  "TTS_OPENAI_API_BASE_URL": "https://api.openai.com/v1",
  "TTS_OPENAI_API_KEY": "sk-..."
}

Speech Generation API

# POST /api/v1/audio/speech
curl -X POST "https://your-instance/api/v1/audio/speech" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  --output speech.mp3 \
  -d '{
    "input": "Hello, this is a test of text to speech.",
    "voice": "nova",
    "model": "tts-1-hd"
  }'

# Writes the returned MP3 audio to speech.mp3
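
The same call can be made from Python using only the standard library. This is a minimal sketch: `build_speech_request` and `save_speech` are hypothetical helper names, and the base URL and token are placeholders as in the curl example.

```python
import json
import urllib.request


def build_speech_request(base_url: str, token: str, text: str,
                         voice: str = "nova",
                         model: str = "tts-1-hd") -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/v1/audio/speech."""
    payload = json.dumps({"input": text, "voice": voice, "model": model}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/audio/speech",
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def save_speech(request: urllib.request.Request, path: str) -> None:
    """Send the request and write the returned audio bytes to path."""
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())
```

Separating request construction from sending makes the payload easy to inspect or log before any network call is made.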

Voice Selection

Get available voices for configured engine:
# GET /api/v1/audio/voices
{
  "voices": [
    {"id": "alloy", "name": "alloy"},
    {"id": "echo", "name": "echo"},
    {"id": "fable", "name": "fable"},
    {"id": "onyx", "name": "onyx"},
    {"id": "nova", "name": "nova"},
    {"id": "shimmer", "name": "shimmer"}
  ]
}
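
A client might use this response to confirm that a configured voice exists before requesting speech. The helpers below are a sketch that assumes the response shape shown above; `voice_ids` and `pick_voice` are hypothetical names.

```python
from typing import Dict, List


def voice_ids(payload: Dict) -> List[str]:
    """Extract voice IDs from a /api/v1/audio/voices response body."""
    return [voice["id"] for voice in payload.get("voices", [])]


def pick_voice(payload: Dict, preferred: str) -> str:
    """Return preferred if the engine offers it, otherwise the first available voice."""
    ids = voice_ids(payload)
    if not ids:
        raise ValueError("engine reported no voices")
    return preferred if preferred in ids else ids[0]
```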

Response Caching

Generated speech is cached based on:
  • Input text hash
  • Engine and model
  • Voice selection
Identical requests are served from the cache without calling the TTS provider again.
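
One way to derive such a cache key is to hash a canonical serialization of the request parameters. The function below is a sketch of that idea and not necessarily the exact scheme Open WebUI uses; `speech_cache_key` is a hypothetical name.

```python
import hashlib
import json


def speech_cache_key(text: str, engine: str, model: str, voice: str) -> str:
    """Derive a deterministic cache key from the request parameters."""
    canonical = json.dumps(
        {"input": text, "engine": engine, "model": model, "voice": voice},
        sort_keys=True,       # stable key order, so equal requests hash equally
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the voice, model, and engine all feed into the hash, changing any one of them produces a different key and therefore a fresh synthesis.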

Voice & Video Calling

Real-time voice and video chat with AI models.

Features

Hands-Free Mode

Voice-only conversations:
  • Continuous listening
  • Automatic speech detection
  • Voice activity detection (VAD)
  • Wake word support

Video Calling

Face-to-face AI interaction:
  • Real-time video feed
  • Avatar display
  • Screen sharing
  • Multi-modal input

Multi-Language

Global communication:
  • Auto language detection
  • Translation support
  • Multi-language voices
  • Accent options

Low Latency

Optimized performance:
  • Streaming transcription
  • Parallel processing
  • Edge caching
  • WebSocket support

Enabling Voice/Video

// Configuration
{
  // STT setup
  "STT_ENGINE": "openai",
  "STT_MODEL": "whisper-1",

  // TTS setup
  "TTS_ENGINE": "openai",
  "TTS_MODEL": "tts-1-hd",
  "TTS_VOICE": "nova",

  // Performance
  "TTS_SPLIT_ON": "sentence"  // Split long responses
}

Advanced Configuration

Text Splitting for TTS

Improve responsiveness by splitting long texts:
{
  "TTS_SPLIT_ON": "sentence"  // sentence, paragraph, none
}
Benefits:
  • Faster initial audio playback
  • Smoother streaming experience
  • Better for long responses
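
To illustrate what splitting means in practice, the sketch below breaks a response into pieces that can be synthesized and played one at a time. It uses a deliberately naive regular expression and the mode names from the setting above; `split_for_tts` is a hypothetical helper, not the splitter Open WebUI ships.

```python
import re
from typing import List

_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")  # whitespace that follows ., ! or ?
_PARAGRAPH_END = re.compile(r"\n\s*\n")        # one or more blank lines


def split_for_tts(text: str, mode: str = "sentence") -> List[str]:
    """Split text into pieces that can be synthesized one at a time."""
    if mode == "none":
        pieces = [text]
    elif mode == "paragraph":
        pieces = _PARAGRAPH_END.split(text)
    elif mode == "sentence":
        pieces = _SENTENCE_END.split(text)
    else:
        raise ValueError(f"unsupported mode: {mode!r}")
    return [piece.strip() for piece in pieces if piece.strip()]
```

A pattern this simple also splits after abbreviations such as "Dr.", so a production splitter would need additional rules.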

Language Settings

Configure transcription language:
# Whisper local
WHISPER_LANGUAGE = "en"  # ISO 639-1 code
WHISPER_MULTILINGUAL = true

# API-based (sent with request)
language = "en-US"
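
The two settings above use different tag styles: a bare language code for local Whisper and a region-qualified tag for API providers. If a single user preference has to feed both, a small normalizer like the following could reduce a tag to its primary language subtag. `to_whisper_language` is a hypothetical helper, and it checks only the shape of the tag, not whether the language is actually supported.

```python
def to_whisper_language(tag: str) -> str:
    """Reduce a tag such as 'en-US' or 'pt_BR' to its primary language subtag."""
    primary = tag.strip().replace("_", "-").split("-", 1)[0].lower()
    if not (primary.isalpha() and len(primary) in (2, 3)):
        raise ValueError(f"not a recognizable language tag: {tag!r}")
    return primary
```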

Content Type Filtering

Restrict accepted audio formats:
{
  "STT_SUPPORTED_CONTENT_TYPES": [
    "audio/wav",
    "audio/mpeg",
    "audio/webm"
  ]
}
Unsupported formats will be rejected at upload, preventing unnecessary processing.
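
An allow-list check of this kind can be sketched in a few lines. The version below strips MIME parameters before matching and, as an assumption beyond the exact-match list above, also accepts wildcard patterns such as `audio/*`; `is_supported_content_type` is a hypothetical name.

```python
from fnmatch import fnmatch
from typing import Iterable


def is_supported_content_type(content_type: str, allowed: Iterable[str]) -> bool:
    """Check a MIME type against an allow-list; entries may use wildcards."""
    # Drop parameters such as '; codecs=opus' before matching.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return any(fnmatch(media_type, pattern.lower()) for pattern in allowed)
```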

Permissions

Control access to voice features:
{
  "USER_PERMISSIONS": {
    "chat.stt": true,  // Speech-to-text
    "chat.tts": true   // Text-to-speech
  }
}

Performance Optimization

Whisper Model Selection

Balance quality vs. speed:
tiny: fastest, lowest accuracy
  • Use for: Quick transcription, low-resource environments
  • Size: ~75MB
  • Speed: Real-time on CPU

GPU Acceleration

Enable for Whisper:
# Environment
DEVICE_TYPE = "cuda"  # cuda, cpu, mps

# Whisper config
WHISPER_COMPUTE_TYPE = "float16"  # float16, int8, float32
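
The compute type is usually chosen to match the device. The mapping below is a common rule of thumb rather than a documented default, and the `mps` entry in particular is an assumption to verify on your own hardware; `default_compute_type` is a hypothetical helper.

```python
def default_compute_type(device: str) -> str:
    """Suggest a Whisper compute type for a device (a heuristic, not a rule)."""
    choices = {
        "cuda": "float16",  # half precision is typically fastest on recent NVIDIA GPUs
        "cpu": "int8",      # quantized weights reduce memory use and CPU time
        "mps": "float32",   # conservative fallback; an assumption, not a tested default
    }
    try:
        return choices[device.lower()]
    except KeyError:
        raise ValueError(f"unknown device type: {device!r}") from None
```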

VAD Filtering

Voice Activity Detection for better quality:
WHISPER_VAD_FILTER = true
Benefits:
  • Removes silence
  • Reduces hallucinations
  • Improves accuracy
  • Faster processing
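
To show why removing silence helps, here is a toy energy-based filter that drops frames whose loudness falls below a threshold. It is a stand-in for illustration only: faster-whisper's VAD filter relies on a trained Silero VAD model, not a simple energy threshold, and `frame_rms` and `drop_silence` are hypothetical names.

```python
import math
from typing import List, Sequence


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0


def drop_silence(samples: Sequence[float], frame_size: int, threshold: float) -> List[float]:
    """Keep only frames whose RMS energy reaches the threshold."""
    kept: List[float] = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        if frame_rms(frame) >= threshold:
            kept.extend(frame)
    return kept
```

Setting the threshold too high discards quiet speech along with silence, which is the over-filtering failure that disabling VAD is meant to work around.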

API Reference

Configuration Endpoints

# Get audio config
GET /api/v1/audio/config

# Update config
POST /api/v1/audio/config/update
{
  "tts": {...},
  "stt": {...}
}

Transcription

# Transcribe audio file
POST /api/v1/audio/transcriptions

Content-Type: multipart/form-data
- file: [audio file]
- language: "en" (optional)
Response:
{
  "text": "Transcribed text content",
  "filename": "uploaded-file.mp3"
}
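
A multipart upload to this endpoint can be assembled with the standard library alone. The sketch below follows the field names listed above (`file` and the optional `language`); `build_transcription_request` is a hypothetical helper, and the base URL and token are placeholders.

```python
import uuid
import urllib.request
from typing import Optional


def build_transcription_request(base_url: str, token: str, filename: str,
                                audio: bytes, content_type: str = "audio/mpeg",
                                language: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a multipart POST for /api/v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex

    def field(headers: str, body: bytes) -> bytes:
        # One multipart section: boundary line, headers, blank line, body.
        return f"--{boundary}\r\n{headers}\r\n\r\n".encode("utf-8") + body + b"\r\n"

    parts = []
    if language:
        parts.append(field('Content-Disposition: form-data; name="language"',
                           language.encode("utf-8")))
    parts.append(field(
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}",
        audio,
    ))
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))  # closing boundary

    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/audio/transcriptions",
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Sending the request with `urllib.request.urlopen` and parsing the body with `json.loads` should yield the `text` and `filename` fields shown in the response above.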

Speech Generation

POST /api/v1/audio/speech
{
  "input": "Text to convert to speech",
  "voice": "nova",
  "model": "tts-1-hd"
}

# Returns audio/mpeg file

Models & Voices

# Get available TTS models
GET /api/v1/audio/models

# Get available voices
GET /api/v1/audio/voices

Best Practices

Choose Right Provider

Consider:
  • Privacy needs (local vs. cloud)
  • Accuracy requirements
  • Budget constraints
  • Language support
  • Latency tolerance

Optimize Audio Quality

Tips:
  • Use high-quality microphone
  • Reduce background noise
  • Clear pronunciation
  • Proper audio levels
  • Supported format

Manage Costs

Strategies:
  • Use local Whisper when possible
  • Cache common phrases
  • Monitor API usage
  • Set usage quotas
  • Consider hybrid approach

User Experience

Enhance UX:
  • Enable text splitting for TTS
  • Use appropriate voice
  • Match language settings
  • Provide visual feedback
  • Handle errors gracefully

Troubleshooting

Transcription fails or returns no text

Check:
  • Audio file is not silent or empty
  • Format is supported
  • File size is within limits
  • API key is valid
  • Language setting is correct
  • VAD is not filtering out the entire audio

Poor transcription accuracy

Solutions:
  • Use a larger Whisper model
  • Improve audio quality
  • Reduce background noise
  • Specify the correct language
  • Disable VAD if it is over-filtering
  • Try a different provider

Unnatural-sounding speech output

Try:
  • A different voice option
  • A higher-quality model (tts-1-hd vs. tts-1)
  • Azure neural voices
  • ElevenLabs for premium quality
  • Adjusting SSML (Azure)

Slow processing

Optimize:
  • Use a GPU for Whisper
  • Reduce audio file size
  • Enable compression
  • Use a smaller model
  • Increase timeout settings
  • Check network latency