Speech to Text
Transcribe speech from an audio or video URL into text, with optional speaker diarization and word-level timestamps.
Authorizations
Bearer token (API key). Format: Bearer {your_api_key}
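As a minimal sketch of the header format described above, the helper below builds the Authorization header from an API key. The function name bearer_headers is hypothetical, and the JSON Content-Type is an assumption, since this page does not state the request encoding.

```python
from typing import Dict


def bearer_headers(api_key: str) -> Dict[str, str]:
    """Build request headers that carry the API key as a Bearer token."""
    key = api_key.strip()
    if not key:
        raise ValueError("API key must not be empty")
    return {
        # Format taken from this page: "Bearer {your_api_key}"
        "Authorization": f"Bearer {key}",
        # Assumption: the request body is sent as JSON.
        "Content-Type": "application/json",
    }
```

The key is stripped of surrounding whitespace because a stray newline from an environment file is a common cause of rejected tokens.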
Body
The speech-to-text request payload.
The source audio or video URL.
"https://example.com/audio.mp3"
The source language code (e.g. en, zh, ja).
"en"
An optional task title for reference.
The transcript output format. Supported: srt, vtt, json, txt.
"json"
The transcription engine to use. Supported: whisper, funasr.
"whisper"
Set to true to process asynchronously.
true
Whether to include word-level timestamp alignment.
false
Whether to enable speaker diarization (identify multiple speakers).
false
Minimum number of speakers for diarization.
Maximum number of speakers for diarization.
The transcription batch size.
The target output script style or format.
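The body fields above could be assembled and checked on the client side roughly as follows. This is a sketch under stated assumptions: this page does not show the JSON field names, so every key in the payload (url, language, format, engine, async, and so on) is a hypothetical placeholder, and the defaults simply mirror the example values shown above rather than documented API defaults. Only the supported format and engine values come from this page.

```python
from typing import Any, Dict, Optional

# Values listed on this page.
SUPPORTED_FORMATS = {"srt", "vtt", "json", "txt"}
SUPPORTED_ENGINES = {"whisper", "funasr"}


def build_stt_payload(
    audio_url: str,
    language: str = "en",
    output_format: str = "json",
    engine: str = "whisper",
    run_async: bool = True,
    word_timestamps: bool = False,
    diarize: bool = False,
    min_speakers: Optional[int] = None,
    max_speakers: Optional[int] = None,
) -> Dict[str, Any]:
    """Validate the options and return a request payload dictionary."""
    if output_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {output_format!r}")
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"unsupported engine: {engine!r}")
    if (
        min_speakers is not None
        and max_speakers is not None
        and min_speakers > max_speakers
    ):
        raise ValueError("min_speakers must not exceed max_speakers")

    # NOTE: all key names below are hypothetical; check the real schema.
    payload: Dict[str, Any] = {
        "url": audio_url,
        "language": language,
        "format": output_format,
        "engine": engine,
        "async": run_async,
        "word_timestamps": word_timestamps,
        "diarize": diarize,
    }
    # Speaker bounds only make sense when diarization is enabled.
    if diarize:
        if min_speakers is not None:
            payload["min_speakers"] = min_speakers
        if max_speakers is not None:
            payload["max_speakers"] = max_speakers
    return payload
```

For example, build_stt_payload("https://example.com/audio.mp3", diarize=True, min_speakers=1, max_speakers=3) yields a dictionary that includes both speaker bounds, while the same call without diarize=True omits them.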
Response
Task accepted. Returns a taskId for async polling or the result URL directly.
Standard response for audio processing tasks.
HTTP-style status code (200 for success, 202 for in-progress).
Download URL of the generated output file (available once the task has completed).
Task identifier for async polling. Use with GET /v1/task/{task_id}.
"a1b2c3d4-e5f6-7890-abcd-ef1234567890"
Multiple output URLs (e.g. for separation stems).
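For asynchronous requests, the status codes and task identifier described above suggest a polling loop along these lines. This is an illustrative sketch, not the service's client: the response key "code" is a hypothetical name for the status field, and fetch_status stands in for whatever function performs GET /v1/task/{task_id} against the service's base URL, which this page does not give. Treating any code other than 200 or 202 as a failure is also an assumption.

```python
import time
from typing import Any, Callable, Dict


def wait_for_result(
    task_id: str,
    fetch_status: Callable[[str], Dict[str, Any]],
    interval_s: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll a task until it reports success, fails, or attempts run out.

    fetch_status(task_id) should return the parsed JSON body of
    GET /v1/task/{task_id}.
    """
    for _ in range(max_attempts):
        body = fetch_status(task_id)
        code = body.get("code")  # hypothetical key for the status code
        if code == 200:
            # Success: the body should now carry the result URL(s).
            return body
        if code != 202:
            # Assumption: anything other than 200/202 is an error.
            raise RuntimeError(f"task {task_id} failed with code {code!r}")
        # 202 means the task is still in progress; wait and retry.
        sleep(interval_s)
    raise TimeoutError(f"task {task_id} not finished after {max_attempts} polls")
```

Passing fetch_status and sleep in as arguments keeps the loop independent of any particular HTTP library and makes it easy to exercise with canned responses.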