Overview
Open WebUI provides voice and video capabilities that enable spoken interaction with AI models through multiple Speech-to-Text (STT) and Text-to-Speech (TTS) providers.
Speech-to-Text (STT)
Convert spoken audio to text using one of several providers.
Supported Providers
- Local Whisper
- OpenAI Whisper
- Deepgram
- Azure Speech
- Mistral
Local Whisper (Faster Whisper) - Self-hosted transcription
- No API costs
- Privacy-focused (local processing)
- Multiple model sizes
- GPU acceleration support
- VAD filtering
Transcription Workflow
Audio Input
User records or uploads audio:
- Microphone capture
- File upload
- Supported formats: WAV, MP3, WebM, M4A, FLAC
Preprocessing
Audio preparation:
- Format conversion (if needed)
- Compression for large files
- Splitting if exceeds size limits
Transcription
Send to configured STT provider:
- Process in chunks if necessary
- Apply language settings
- Handle diarization (Azure)
Audio File Processing
Format Support
Accepted formats:
- WAV, MP3, WebM
- M4A, FLAC, MPEG
- MP4 (audio track)
- Automatic conversion if needed
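The format check above can be sketched as a simple extension lookup. This is a minimal illustration, not Open WebUI's actual implementation: the function name `needs_conversion` is hypothetical, and mapping "MPEG" to the `.mpeg` extension is an assumption made for this example.

```python
from pathlib import Path

# Extensions derived from the "Accepted formats" list above.
# Treating "MPEG" as the .mpeg extension is an assumption of this sketch.
ACCEPTED_EXTENSIONS = {".wav", ".mp3", ".webm", ".m4a", ".flac", ".mpeg", ".mp4"}


def needs_conversion(filename: str) -> bool:
    """Return True when the file extension is not in the accepted set."""
    return Path(filename).suffix.lower() not in ACCEPTED_EXTENSIONS
```

A file such as `note.ogg` would be flagged for conversion, while `note.WAV` would pass through unchanged because the comparison ignores case.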
Size Limits
Maximum file sizes:
- OpenAI/Deepgram/Mistral: 20MB
- Azure: 200MB
- Auto-compression if exceeded
- Intelligent chunking
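To make the limits above concrete, the sketch below estimates how many pieces a file would need if it still exceeds the provider limit after compression. The provider keys, the `plan_chunks` name, and the choice of 1 MB = 1024 * 1024 bytes are assumptions for illustration; the real chunking logic may split on duration or silence rather than on byte size.

```python
import math

MB = 1024 * 1024  # assumption: binary megabytes

# Limits taken from the list above; the dictionary keys are hypothetical labels.
MAX_FILE_SIZE = {
    "openai": 20 * MB,
    "deepgram": 20 * MB,
    "mistral": 20 * MB,
    "azure": 200 * MB,
}


def plan_chunks(size_bytes: int, provider: str) -> int:
    """Return the minimum number of chunks that keeps each chunk within the limit."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    limit = MAX_FILE_SIZE[provider]
    return math.ceil(size_bytes / limit)
```

Under these assumptions a 50 MB recording needs three chunks for a 20 MB provider and a single request for Azure.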
Compression & Chunking
Open WebUI handles compression and chunking automatically.
Text-to-Speech (TTS)
Convert AI responses to natural-sounding speech.
Supported Engines
- OpenAI TTS
- ElevenLabs
- Azure Speech
- Transformers
OpenAI API - High-quality voices
- Multiple voices (alloy, echo, fable, onyx, nova, shimmer)
- Natural intonation
- Fast generation
- API-based
Speech Generation API
Voice Selection
Get the available voices for the configured engine.
Response Caching
Generated speech is cached based on:
- Input text hash
- Engine and model
- Voice selection
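One way to combine the three inputs above into a single cache key is to hash them together. This is a sketch under assumptions: the function name `tts_cache_key`, the use of SHA-256, and the JSON serialization are illustrative choices, not a description of how Open WebUI derives its keys.

```python
import hashlib
import json


def tts_cache_key(text: str, engine: str, model: str, voice: str) -> str:
    """Derive a deterministic key from the text, engine, model, and voice."""
    # sort_keys makes the serialized form independent of dict ordering.
    payload = json.dumps(
        {"text": text, "engine": engine, "model": model, "voice": voice},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Identical requests map to the same key and can reuse the stored audio, while changing any single field (for example the voice) produces a different key.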
Voice & Video Calling
Real-time voice and video chat with AI models.
Features
Hands-Free Mode
Voice-only conversations:
- Continuous listening
- Automatic speech detection
- Voice activity detection (VAD)
- Wake word support
Video Calling
Face-to-face AI interaction:
- Real-time video feed
- Avatar display
- Screen sharing
- Multi-modal input
Multi-Language
Global communication:
- Auto language detection
- Translation support
- Multi-language voices
- Accent options
Low Latency
Optimized performance:
- Streaming transcription
- Parallel processing
- Edge caching
- WebSocket support
Enabling Voice/Video
Advanced Configuration
Text Splitting for TTS
Improve responsiveness by splitting long texts:
- Faster initial audio playback
- Smoother streaming experience
- Better for long responses
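The idea behind text splitting can be illustrated with a sentence-level splitter: each piece can be sent to the TTS engine as soon as it is available, so playback starts before the full response is synthesized. The function name `split_for_tts` and the punctuation-based rule are assumptions for this sketch; the actual splitting strategy may differ.

```python
import re


def split_for_tts(text: str) -> list[str]:
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    # The lookbehind keeps the punctuation attached to the preceding sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]
```

For example, `split_for_tts("Hello there. How are you? Fine!")` yields three pieces, one per sentence. Text without sentence punctuation is returned as a single piece.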
Language Settings
- STT Language
- TTS Language
Configure transcription language:
Content Type Filtering
Restrict accepted audio formats.
Permissions
Control access to voice features.
Performance Optimization
Whisper Model Selection
Balance quality vs. speed:
- tiny
- base
- small
- medium/large
tiny: fastest, lowest accuracy
- Use for: Quick transcription, low-resource environments
- Size: ~75MB
- Speed: Real-time on CPU
GPU Acceleration
Enable GPU acceleration for Whisper.
VAD Filtering
Voice Activity Detection for better quality:
- Removes silence
- Reduces hallucinations
- Improves accuracy
- Faster processing
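The "removes silence" step can be illustrated with a simple energy threshold. This is a deliberately simplified stand-in: production VAD is typically model-based and far more robust. The function name `drop_silent_frames`, the frame size, and the threshold are all assumptions chosen for the example.

```python
import math


def drop_silent_frames(samples, frame_size=160, threshold=0.01):
    """Keep only frames whose RMS energy reaches the threshold.

    `samples` is a sequence of floats in [-1.0, 1.0]; frame_size and
    threshold are illustrative defaults, not tuned values.
    """
    kept = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms >= threshold:
            kept.extend(frame)
    return kept
```

Removing silent frames before transcription shortens the audio the model has to process. A threshold set too high would discard quiet speech as well, which corresponds to the over-filtering case mentioned under Troubleshooting.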
API Reference
Configuration Endpoints
Transcription
Speech Generation
Models & Voices
Best Practices
Choose the Right Provider
Consider:
- Privacy needs (local vs. cloud)
- Accuracy requirements
- Budget constraints
- Language support
- Latency tolerance
Optimize Audio Quality
Tips:
- Use a high-quality microphone
- Reduce background noise
- Speak clearly
- Set proper audio levels
- Use a supported format
Manage Costs
Strategies:
- Use local Whisper when possible
- Cache common phrases
- Monitor API usage
- Set usage quotas
- Consider hybrid approach
User Experience
Enhance UX:
- Enable text splitting for TTS
- Use appropriate voice
- Match language settings
- Provide visual feedback
- Handle errors gracefully
Troubleshooting
Transcription fails or empty result
Check:
- Audio file is not silent/empty
- Format is supported
- File size within limits
- API key is valid
- Language setting correct
- VAD not filtering entire audio
Poor transcription quality
Solutions:
- Use larger Whisper model
- Improve audio quality
- Reduce background noise
- Specify correct language
- Disable VAD if over-filtering
- Try different provider
TTS voice sounds unnatural
Try:
- Different voice option
- Higher quality model (tts-1-hd vs tts-1)
- Azure neural voices
- ElevenLabs for premium quality
- Adjust SSML (Azure)
Slow processing
Optimize:
- Use GPU for Whisper
- Reduce audio file size
- Enable compression
- Use smaller model
- Increase timeout settings
- Check network latency