Overview
The Live Transcription API turns a stream of audio into text in real time over a
single WebSocket connection. You send raw audio frames as you capture them and
receive interim (partial) results immediately, followed by final,
stabilized segments. It is suited to live captions, voice agents, dictation,
and meeting transcription.
Live transcription runs on Premium accuracy (punctuation, per-word
confidence, language detection, and optional speaker labels).
Key features
Sub-second interim results as the speaker talks
Stabilized final segments with word timestamps
Optional speaker labels (diarize)
50+ languages with auto-detect
One artifact: a finished session is saved as a normal transcription job you
can fetch, edit, and export (SRT/VTT/TXT/JSON) like any batch transcript
Endpoint
wss://api.audiopod.ai/api/v1/transcription/stream
Authentication
WebSocket handshakes can’t carry custom headers from every client, so the API
key travels as a query parameter:
wss://api.audiopod.ai/api/v1/transcription/stream?api_key=ap_your_key
Browser/session clients may instead pass a short-lived JWT as ?token=....
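As a rough sketch of the two options above, the helper below builds the connection URL with either credential. The function name `build_stream_url` is hypothetical, and the rule that exactly one credential is passed is an assumption; only the endpoint and the `api_key` and `token` query parameters come from this page.

```python
from urllib.parse import urlencode

STREAM_ENDPOINT = "wss://api.audiopod.ai/api/v1/transcription/stream"


def build_stream_url(api_key=None, token=None):
    """Return the WebSocket URL carrying one credential as a query parameter."""
    # Assumption: the two credentials are alternatives, so require exactly one.
    if (api_key is None) == (token is None):
        raise ValueError("pass exactly one of api_key or token")
    query = {"api_key": api_key} if api_key is not None else {"token": token}
    # urlencode percent-escapes characters that are unsafe in a query string.
    return f"{STREAM_ENDPOINT}?{urlencode(query)}"
```

Because the credential ends up in the URL, it is worth keeping such URLs out of logs.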
Protocol
Messages are JSON text frames, except the audio you upload, which is binary.
Client → server
| Frame | Type | Payload |
| --- | --- | --- |
| Start (first frame) | text | { "type": "start", "language": "en", "diarize": false, "sample_rate": 16000, "channels": 1 } |
| Audio | binary | Raw 16-bit little-endian PCM, 16 kHz, mono |
| End | text | { "type": "end" } |
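The two text frames in the client-to-server list can be serialized along these lines. This is a minimal sketch: the helper names `start_frame` and `end_frame` are hypothetical, and the field names and values are taken from the payloads shown above.

```python
import json


def start_frame(language="en", diarize=False):
    """Serialize the first text frame of a session."""
    return json.dumps({
        "type": "start",
        "language": language,
        "diarize": diarize,
        "sample_rate": 16000,  # the only sample rate this page lists
        "channels": 1,         # mono
    })


def end_frame():
    """Serialize the text frame that ends the session."""
    return json.dumps({"type": "end"})
```

Audio frames need no such helper: they are the raw PCM bytes, sent as binary WebSocket messages between these two text frames.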
Server → client
| type | Meaning |
| --- | --- |
| ready | Session accepted; you may start sending audio. Includes job_id, session_id, taste, max_duration_seconds. |
| partial | Interim transcript (text); it will change as more audio arrives. |
| final | Stabilized segment (text, words[], speech_final). |
| usage | Running credit usage (credits_used). |
| completed | Session finished. Includes job_id, credits_used, streamed_seconds, transcript_path. |
| error | Something went wrong (code, message). |
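One way to consume these server frames is to fold them into a small state object, replacing the interim text on every `partial` and appending on every `final`. The class below is an illustrative sketch, not part of any SDK: `TranscriptState` is a hypothetical name, and it assumes that a `final` event supersedes whatever partial text was pending.

```python
import json


class TranscriptState:
    """Accumulates server events into a running transcript."""

    def __init__(self):
        self.finals = []    # stabilized segments, in arrival order
        self.partial = ""   # latest interim text, replaced on every update
        self.done = False
        self.error = None

    def apply(self, raw_frame):
        """Parse one JSON text frame and update the state."""
        event = json.loads(raw_frame)
        kind = event.get("type")
        if kind == "partial":
            self.partial = event.get("text", "")
        elif kind == "final":
            self.finals.append(event.get("text", ""))
            self.partial = ""  # assumption: a final replaces the pending partial
        elif kind == "completed":
            self.done = True
        elif kind == "error":
            self.error = (event.get("code"), event.get("message"))
            self.done = True
        return event

    def text(self):
        """Finalized segments followed by the current interim text, if any."""
        parts = self.finals + ([self.partial] if self.partial else [])
        return " ".join(parts)
```

Frame types the sketch does not act on, such as `ready` and `usage`, pass through `apply` unchanged, so a caller can still inspect the returned event.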
The first server frame is always ready. Send audio in small chunks (e.g. 100 ms)
for the lowest latency.
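For 16-bit mono audio at 16 kHz, a 100 ms chunk works out to 16000 × 2 × 0.1 = 3200 bytes. The sketch below slices a PCM buffer into chunks of that size; the names `chunk_size_bytes` and `iter_chunks` are hypothetical.

```python
SAMPLE_RATE = 16000     # Hz
BYTES_PER_SAMPLE = 2    # 16-bit PCM
CHANNELS = 1            # mono


def chunk_size_bytes(chunk_ms=100):
    """Number of bytes of PCM that cover chunk_ms milliseconds."""
    return SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS * chunk_ms // 1000


def iter_chunks(pcm, chunk_ms=100):
    """Yield consecutive slices of pcm; the last one may be shorter."""
    size = chunk_size_bytes(chunk_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]
```

When replaying a recorded file rather than a live capture, pacing the sends to roughly one chunk per 100 ms keeps the stream close to real time.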
| Property | Value |
| --- | --- |
| Encoding | 16-bit signed PCM (little-endian) |
| Sample rate | 16000 Hz |
| Channels | 1 (mono) |
Resample to 16 kHz mono before sending. Higher sample rates or stereo audio are
rejected with an INVALID_CONFIG error.
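A dependency-free sketch of that conversion is shown below: it averages the channels of each frame and then resamples by linear interpolation. The function name `to_mono_16k` is hypothetical. Linear interpolation applies no anti-aliasing filter, so treat this as an illustration; a dedicated resampling library is the safer choice for production audio.

```python
import struct

TARGET_RATE = 16000


def to_mono_16k(pcm, src_rate, channels):
    """Convert interleaved Int16LE PCM to 16 kHz mono Int16LE PCM."""
    count = len(pcm) // 2
    samples = struct.unpack(f"<{count}h", pcm[:count * 2])
    # Downmix: average the channels of each complete frame.
    mono = [
        sum(samples[i:i + channels]) // channels
        for i in range(0, count - count % channels, channels)
    ]
    if src_rate == TARGET_RATE or not mono:
        return struct.pack(f"<{len(mono)}h", *mono)
    # Resample: interpolate linearly between neighbouring source samples.
    out_len = len(mono) * TARGET_RATE // src_rate
    out = []
    for j in range(out_len):
        pos = j * src_rate / TARGET_RATE   # position in the source signal
        i = int(pos)
        frac = pos - i
        nxt = mono[min(i + 1, len(mono) - 1)]
        out.append(round(mono[i] + (nxt - mono[i]) * frac))
    return struct.pack(f"<{len(out)}h", *out)
```

For example, 10 ms of 48 kHz stereo (480 frames) comes out as 160 mono samples, or 320 bytes.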
Prepaid billing. Live transcription is prepaid at the standard transcription
rate (220 credits/minute). Credits for the maximum session length are reserved
up front and settled to your actual streamed duration when the session ends
(the unused reservation is released). If you don’t have enough credits to start,
the session is rejected with INSUFFICIENT_CREDITS before any audio flows.
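The reserve-then-settle arithmetic can be sketched as follows. The 220 credits/minute rate comes from the paragraph above; the function names, the linear pro-rating by the second, and the absence of rounding are assumptions, since this page does not say how partial minutes are billed.

```python
CREDITS_PER_MINUTE = 220  # standard transcription rate quoted above


def credits_for(seconds):
    """Credits for a duration, pro-rated by the second (assumed, unrounded)."""
    return seconds * CREDITS_PER_MINUTE / 60


def settle(max_duration_seconds, streamed_seconds):
    """Return the reservation, the settled charge, and the released remainder."""
    reserved = credits_for(max_duration_seconds)
    used = credits_for(min(streamed_seconds, max_duration_seconds))
    return {"reserved": reserved, "used": used, "released": reserved - used}
```

Under these assumptions, a session with an illustrative `max_duration_seconds` of 3600 that streams 90 seconds reserves 13200 credits, settles at 330, and releases 12870.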
Quickstart
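The three examples below show the same session using the Node.js SDK, the
Python SDK, and a raw WebSocket client (Python with aiohttp), in that order.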
Node.js

import AudioPod from "audiopod";

const client = new AudioPod({ apiKey: process.env.AUDIOPOD_API_KEY });
const stream = client.transcription.stream({ language: "en", diarize: true });

// Event style
stream.on("partial", (e) => process.stdout.write("\r" + e.text));
stream.on("final", (e) => console.log("\n" + e.text));
stream.on("completed", (e) =>
  console.log(`done - ${e.credits_used} credits, ${e.streamed_seconds}s`),
);
stream.on("error", (e) => console.error(`${e.code}: ${e.message}`));

// Wait for the server `ready` frame, then stream 16-bit PCM (16 kHz, mono).
await stream.ready;
for (const chunk of pcmChunks) {
  stream.sendAudio(chunk); // Buffer of Int16LE samples
}
stream.end(); // flush + finalize

You can also consume events as an async iterable instead of with .on():

for await (const event of stream) {
  if (event.type === "final") console.log(event.text);
}
Python

import asyncio
from audiopod import AsyncClient

async def main():
    async with AsyncClient() as client:
        async with client.transcription.stream(language="en", diarize=True) as session:
            # Feed audio and read events concurrently for true real-time.
            async def feed():
                for chunk in pcm_chunks:  # your iterable of 16-bit PCM, 16 kHz, mono
                    await session.send_audio(chunk)
                await session.end()

            # Keep a reference so the task is not garbage-collected mid-stream.
            feeder = asyncio.create_task(feed())

            async for event in session:
                if event["type"] == "partial":
                    print("\r" + event["text"], end="", flush=True)
                elif event["type"] == "final":
                    print("\n" + event["text"])
                elif event["type"] == "completed":
                    print(f"done - {event.get('credits_used')} credits")

            await feeder  # surface any exception raised while feeding

asyncio.run(main())

Streaming requires the async client (AsyncClient).
Raw WebSocket

import asyncio
import json
import os

import aiohttp

API_KEY = os.environ["AUDIOPOD_API_KEY"]

async def main():
    url = "wss://api.audiopod.ai/api/v1/transcription/stream"
    async with aiohttp.ClientSession() as s:
        async with s.ws_connect(url, params={"api_key": API_KEY}) as ws:
            await ws.send_str(json.dumps({
                "type": "start", "language": "en",
                "diarize": False, "sample_rate": 16000, "channels": 1,
            }))

            async def feed():
                for chunk in pcm_chunks:  # your iterable of Int16LE PCM bytes
                    await ws.send_bytes(chunk)
                await ws.send_str(json.dumps({"type": "end"}))

            feeder = None
            async for msg in ws:
                if msg.type != aiohttp.WSMsgType.TEXT:
                    continue
                event = json.loads(msg.data)
                if event["type"] == "ready" and feeder is None:
                    # Start uploading only once the server has accepted the session.
                    feeder = asyncio.create_task(feed())
                if event["type"] in ("final", "completed", "error"):
                    print(event)
                if event["type"] in ("completed", "error"):
                    break

            if feeder is not None:
                feeder.cancel()  # no-op if feeding already finished

asyncio.run(main())
Capturing microphone audio in the browser
Browsers capture at the hardware sample rate, so request a 16 kHz AudioContext
and convert float samples to 16-bit PCM before sending:
// `ws` is assumed to be an open WebSocket to the live endpoint that has
// already sent the start frame and received `ready`.
const ctx = new AudioContext({ sampleRate: 16000 });
const source = ctx.createMediaStreamSource(
  await navigator.mediaDevices.getUserMedia({ audio: { channelCount: 1 } }),
);
// ScriptProcessorNode is deprecated in favor of AudioWorklet but is still
// widely supported. A 4096-sample buffer is 256 ms of audio at 16 kHz.
const proc = ctx.createScriptProcessor(4096, 1, 1);
proc.onaudioprocess = (e) => {
  const input = e.inputBuffer.getChannelData(0); // Float32 samples in [-1, 1]
  const pcm = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i])); // clamp before scaling
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  ws.send(pcm.buffer); // send to the live endpoint
};
source.connect(proc);
proc.connect(ctx.destination);
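Outside the browser, capture libraries often hand back float samples as well. As a hedged sketch, the same clamp-and-scale step looks like this in Python; the function name `float_to_pcm16` is hypothetical and the scaling mirrors the browser loop above.

```python
import struct


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to Int16LE bytes, clamping out-of-range values."""
    ints = []
    for s in samples:
        s = max(-1.0, min(1.0, s))
        # Same asymmetric scaling as the browser snippet:
        # -1.0 maps to -32768 and 1.0 maps to 32767.
        ints.append(int(s * 0x8000) if s < 0 else int(s * 0x7FFF))
    return struct.pack(f"<{len(ints)}h", *ints)
```

The `<` in the format string forces little-endian output regardless of the host byte order.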
Error codes
| code | Meaning |
| --- | --- |
| PREMIUM_TIER_REQUIRED | Live transcription requires a paid plan (free sessions get a short preview). |
| INSUFFICIENT_CREDITS | Not enough credits to reserve the session. Top up and retry. |
| MAX_DURATION_REACHED | The session hit max_duration_seconds; it was finalized and saved. |
| INVALID_CONFIG | The start frame was malformed (e.g. unsupported sample rate). |
| UPSTREAM_ERROR | A transient error in the transcription engine; reconnect and retry. |
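A client might map these codes to follow-up actions roughly as sketched below. The action labels, the function names, and the exponential backoff schedule are illustrative assumptions; only the error codes and the advice to reconnect on UPSTREAM_ERROR come from the list above.

```python
# Hypothetical action labels keyed by the documented error codes.
ACTIONS = {
    "UPSTREAM_ERROR": "retry",                     # transient: reconnect and retry
    "INSUFFICIENT_CREDITS": "top_up",              # needs user action before retrying
    "PREMIUM_TIER_REQUIRED": "upgrade",
    "INVALID_CONFIG": "fix_request",               # retrying unchanged will fail again
    "MAX_DURATION_REACHED": "start_new_session",   # assumption: continue in a new session
}


def next_action(code):
    """Return the follow-up action for an error code, or 'report' if unknown."""
    return ACTIONS.get(code, "report")


def backoff_delays(attempts, base=0.5, cap=8.0):
    """Exponential backoff delays in seconds for automatic reconnects."""
    return [min(cap, base * 2 ** n) for n in range(attempts)]
```

Only the `retry` action is a candidate for automatic reconnection; the others need a change on the caller's side first.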
Next steps
Speech-to-Text (batch): Transcribe files and URLs, with Standard or Premium accuracy.
Authentication: Create an API key to authenticate your streams.