Transcribe Audio

curl --request POST \
  --url https://api.example.com/api/audio/transcriptions \
  --header 'Authorization: Bearer <token>' \
  --form 'file=@<path-to-audio-file>' \
  --form 'language=<string>'

{
  "text": "<string>",
  "filename": "<string>"
}

POST /api/audio/transcriptions

Transcribes audio files to text using the configured speech-to-text (STT) engine. Supported engines are a local Faster Whisper model (the default), the OpenAI Whisper API, Deepgram, Azure Speech Services, and Mistral.

Request

Headers

Authorization

string

required

Bearer token for authentication

Body

file

required

Audio file to transcribe. Supported formats: flac, m4a, mp3, mp4, mpeg, wav, webm.

Maximum file size varies by engine:

Default engines: 20 MB
Azure: 200 MB
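
The format list and size limits above can be checked on the client before uploading. The following is a minimal sketch, not part of the API: the function name `check_upload`, the engine keys `"default"` and `"azure"`, and the reading of 1 MB as 1024 * 1024 bytes are all assumptions. Because the server may convert or chunk files itself (see Audio Processing), treat the result as advisory rather than as a guarantee of acceptance or rejection.

```python
from pathlib import Path

# Formats listed in the Body section of this page.
SUPPORTED_EXTENSIONS = {"flac", "m4a", "mp3", "mp4", "mpeg", "wav", "webm"}

# Assumption: 1 MB = 1024 * 1024 bytes; this page does not say which unit is meant.
MB = 1024 * 1024
MAX_BYTES = {"default": 20 * MB, "azure": 200 * MB}  # hypothetical engine keys


def check_upload(filename: str, size_bytes: int, engine: str = "default") -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    # Unknown engine names fall back to the default limit.
    limit = MAX_BYTES.get(engine, MAX_BYTES["default"])
    if size_bytes > limit:
        problems.append(f"file is {size_bytes} bytes; limit for {engine} is {limit}")
    return problems
```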

language

string

Language code for transcription (e.g., en, es, fr). Used as a hint to improve accuracy.

Response

text

string

The transcribed text from the audio file

filename

string

Name of the processed audio file
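
On the client side, the two response fields can be read into a small typed structure. This is a sketch only: `Transcription` and `parse_transcription` are illustrative names, and the strict check for missing fields is a design choice rather than documented behavior.

```python
import json
from dataclasses import dataclass


@dataclass
class Transcription:
    text: str       # the transcribed text
    filename: str   # name of the processed audio file


def parse_transcription(raw: str) -> Transcription:
    """Parse a JSON response body into a Transcription, failing loudly on missing fields."""
    data = json.loads(raw)
    missing = [key for key in ("text", "filename") if key not in data]
    if missing:
        raise ValueError(f"response is missing fields: {missing}")
    return Transcription(text=data["text"], filename=data["filename"])
```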

Example

curl -X POST https://your-domain.com/api/audio/transcriptions \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -F "file=@recording.mp3" \
  -F "language=en"

Response

{
  "text": "Hello, this is a sample audio transcription. The quick brown fox jumps over the lazy dog.",
  "filename": "550e8400-e29b-41d4-a716-446655440000.mp3"
}
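
The same request can be sent from Python. The sketch below uses only the standard library and builds the multipart/form-data body by hand; `build_multipart` and `transcribe` are hypothetical helper names, and the base URL and token are placeholders as in the curl example. It does not handle error responses.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path
from typing import Optional


def build_multipart(fields: dict[str, str], file_field: str, path: Path) -> tuple[bytes, str]:
    """Encode text fields and one file as multipart/form-data; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    # Fall back to a generic type when the extension is not recognized.
    mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{path.name}"\r\nContent-Type: {mime}\r\n\r\n'.encode()
        + path.read_bytes()
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(base_url: str, token: str, path: Path, language: Optional[str] = None) -> dict:
    """POST the file to /api/audio/transcriptions and return the decoded JSON response."""
    fields = {"language": language} if language else {}
    body, content_type = build_multipart(fields, "file", path)
    request = urllib.request.Request(
        f"{base_url}/api/audio/transcriptions",
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {token}", "Content-Type": content_type},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A call equivalent to the curl example would be `transcribe("https://your-domain.com", "YOUR_TOKEN", Path("recording.mp3"), language="en")`.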

Supported Engines

Configure the STT engine in Admin Settings > Audio:

Local Whisper (Default)

Uses Faster Whisper model running locally
Configurable model size and compute type
Supports VAD filtering and multilingual mode

OpenAI Whisper API

Cloud-based transcription using OpenAI’s API
Requires OpenAI API key
Supports language parameter

Deepgram

High-accuracy transcription API
Smart formatting enabled by default
Requires Deepgram API key

Azure Speech Services

Microsoft Azure cognitive services
Supports speaker diarization (up to 3 speakers by default)
Multi-locale detection
Requires Azure subscription key and region

Mistral

Uses Voxtral models for transcription
Two methods: dedicated transcriptions API or chat completions
Requires Mistral API key

Audio Processing

The API automatically handles:

Format conversion: Unsupported formats are converted to MP3
Compression: Large files are compressed to reduce size
Chunking: Files exceeding the engine's size limit are split into chunks
Parallel processing: Multiple chunks are processed concurrently
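
The chunking and parallel processing steps can be pictured with the sketch below. It is an illustration under stated assumptions, not the server's implementation: it splits by duration rather than by file size to keep the example simple, and `plan_chunks`, `transcribe_chunks`, and the `transcribe_window` worker are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def plan_chunks(total_seconds: float, chunk_seconds: float) -> list[tuple[float, float]]:
    """Split [0, total_seconds) into consecutive (start, end) windows of at most chunk_seconds."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    windows = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        windows.append((start, end))
        start = end
    return windows


def transcribe_chunks(
    windows: list[tuple[float, float]],
    transcribe_window: Callable[[float, float], str],  # hypothetical per-chunk worker
    max_workers: int = 4,
) -> str:
    """Run the worker on every window concurrently and join results in original order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order even when chunks finish out of order.
        texts = list(pool.map(lambda w: transcribe_window(*w), windows))
    return " ".join(t.strip() for t in texts if t.strip())
```

Joining results in input order matters here: chunks may finish in any order, but the final transcript has to follow the audio.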

Permissions

Requires the chat.stt permission. Admin users have access by default.

Error Responses

400

Invalid file format or transcription failed

403

User does not have permission to use transcription
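
A client can map these status codes to distinct exceptions so callers can tell a bad file from a permissions problem. The exception classes and `check_status` below are illustrative names, and the format of the error body is not documented here, so `detail` is passed through as an opaque string.

```python
class TranscriptionError(Exception):
    """Base class for errors returned by the transcription endpoint."""


class InvalidAudioError(TranscriptionError):
    """HTTP 400: invalid file format or transcription failed."""


class PermissionDeniedError(TranscriptionError):
    """HTTP 403: the user lacks permission to use transcription."""


def check_status(status: int, detail: str = "") -> None:
    """Raise a descriptive exception for the documented error codes; return None on 2xx."""
    if 200 <= status < 300:
        return
    if status == 400:
        raise InvalidAudioError(detail or "invalid file format or transcription failed")
    if status == 403:
        raise PermissionDeniedError(detail or "missing the chat.stt permission")
    # Codes not listed in this document are surfaced generically.
    raise TranscriptionError(f"unexpected HTTP status {status}: {detail}")
```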

Configuration

To configure STT settings:

POST /api/audio/config/update

Admin-only endpoint to update STT engine, model, and API credentials.
