Overview
AudioPod AI’s Speech-to-Text API converts audio and video content into accurate text transcriptions using advanced AI models including WhisperX and Faster-Whisper. Get detailed transcriptions with speaker diarization, word-level timestamps, and confidence scores.
Key Features
- Multi-Model Support: WhisperX, Whisper-Timestamped, Faster-Whisper
- Speaker Diarization: Automatic speaker identification and separation
- Word-Level Timestamps: Precise timing for each word
- Confidence Scores: Quality metrics for transcription accuracy
- 50+ Languages: Automatic language detection or manual specification
- Large File Support: Handle videos up to 15 hours with chunking
- Multiple Sources: Upload files or provide YouTube/video URLs
- Editable Transcripts: Edit and refine transcription results
Authentication
All endpoints require authentication:
- API Key: Authorization: Bearer your_api_key
- JWT Token: Authorization: Bearer your_jwt_token
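Both credential types travel in the same header, so a single helper can cover them. The sketch below is illustrative only; the function name `auth_headers` is not part of any AudioPod AI SDK.

```python
def auth_headers(token: str) -> dict[str, str]:
    """Build the Authorization header for an API key or a JWT.

    Both credential types use the Bearer scheme described above.
    `auth_headers` is a hypothetical helper name, not an SDK function.
    """
    token = token.strip()
    if not token:
        raise ValueError("token must be a non-empty string")
    return {"Authorization": f"Bearer {token}"}


if __name__ == "__main__":
    # Placeholder credential, as in the header examples above.
    print(auth_headers("your_api_key"))
```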
Transcribe from URLs
Transcribe YouTube Videos
Transcribe audio from YouTube or other video platforms.
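As a rough illustration of a URL-based submission, the sketch below builds (but does not send) a JSON POST request with the Python standard library. The base URL `https://api.example.com`, the path `/transcribe/url`, and every payload field name are placeholders assumed for this example; consult the endpoint reference for the real values.

```python
import json
import urllib.request
from typing import Optional

# ASSUMPTIONS: placeholder host, path, and field names. None of these are
# taken from the AudioPod AI reference.
BASE_URL = "https://api.example.com"
TRANSCRIBE_URL_PATH = "/transcribe/url"


def build_url_transcription_request(
    token: str,
    source_url: str,
    model: str = "whisperx",
    language: Optional[str] = None,
) -> urllib.request.Request:
    """Prepare a POST request asking the service to transcribe a video URL."""
    payload = {"source_url": source_url, "model": model}
    if language is not None:
        # Omitting the language leaves detection to the service.
        payload["language"] = language
    return urllib.request.Request(
        BASE_URL + TRANSCRIBE_URL_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_url_transcription_request(
        "your_api_key", "https://www.youtube.com/watch?v=VIDEO_ID"
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the prepared request to `urllib.request.urlopen` would send it; that step is left out here because the endpoint is a placeholder.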
Transcribe from Files
Upload Audio/Video Files
Transcribe from uploaded audio or video files.
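File uploads over HTTP are commonly sent as `multipart/form-data`. The following sketch encodes such a body with the standard library only. Whether the API expects multipart at all, and the form field names used here (`file`, `model`), are assumptions made for illustration.

```python
import mimetypes
import uuid


def encode_multipart(fields, filename, file_bytes, file_field="file"):
    """Encode text fields plus one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    The field names are hypothetical; the real API may use different ones.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        text_part = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
        parts.append(text_part.encode("utf-8"))
    # Fall back to a generic type when the extension is unknown.
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    )
    parts.append(file_header.encode("utf-8") + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


if __name__ == "__main__":
    body, content_type = encode_multipart(
        {"model": "whisperx"}, "meeting.mp3", b"\x00fake-audio-bytes"
    )
    print(content_type)
    print(len(body), "bytes")
```

The returned content type goes in the `Content-Type` header and the bytes in the request body, alongside the Bearer `Authorization` header.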
Job Management
Get Transcription Status
Check the progress and status of transcription jobs with a GET request.
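Because transcription runs as a job, clients typically poll the status until it reaches a final state. The sketch below shows one way to do that. The status fetch is injected as a callable so the loop does not depend on any particular endpoint, and the status names `completed` and `failed` are assumed, not confirmed by this page.

```python
import time
from typing import Callable

# ASSUMPTION: the names of the terminal states are guesses for illustration.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(
    fetch_status: Callable[[], dict],
    interval: float = 5.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status() until the job reaches a terminal state.

    fetch_status must return a dict with a "status" key (assumed shape).
    Raises TimeoutError if no terminal state is seen within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status()
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("transcription job did not finish in time")
        sleep(interval)


if __name__ == "__main__":
    # Simulated responses stand in for real status requests.
    fake = iter([{"status": "processing"}, {"status": "completed"}])
    print(wait_for_job(lambda: next(fake), interval=0.0))
```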
List Transcription Jobs
Get all transcription jobs for the authenticated user with a GET request.
Download Transcripts
Get Transcript in Multiple Formats
Download transcripts with a GET request in various formats including JSON, TXT, PDF, SRT, VTT, DOCX, and HTML.
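Validating the requested format on the client avoids a wasted round trip. This sketch checks the format against the list above and builds a download URL. The URL shape (`/jobs/<id>/transcript?format=...`) is a placeholder assumed for illustration, not the documented route.

```python
from urllib.parse import urlencode

# Formats listed in this section, lowercased for comparison.
SUPPORTED_FORMATS = ("json", "txt", "pdf", "srt", "vtt", "docx", "html")


def transcript_download_url(base_url: str, job_id: int, fmt: str) -> str:
    """Return a download URL for the given job and format.

    ASSUMPTION: the path layout and the `format` query parameter are
    hypothetical; only the list of formats comes from the documentation.
    """
    fmt = fmt.lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(
            f"unsupported format {fmt!r}; choose one of {', '.join(SUPPORTED_FORMATS)}"
        )
    query = urlencode({"format": fmt})
    return f"{base_url.rstrip('/')}/jobs/{job_id}/transcript?{query}"


if __name__ == "__main__":
    print(transcript_download_url("https://api.example.com", 42, "SRT"))
```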
Edit Transcripts
Get Editable Transcript
Retrieve the transcript in an editable format for corrections with a GET request.
Update Transcript
Submit the edited transcript with corrections using a PUT request.
Get Transcript Versions
View the edit history and versions of transcripts with a GET request.
Extract Audio
Download Extracted Audio
Get clean audio files extracted from videos during transcription with a GET request.
Delete Jobs
Delete Transcription Job
Remove transcription jobs and associated data with a DELETE request.
Supported Languages
AudioPod AI supports automatic language detection or manual specification for 50+ languages:

| Language | Code | Quality | Notes |
|---|---|---|---|
| English | en | Excellent | Best supported language |
| Spanish | es | Excellent | High accuracy |
| French | fr | Excellent | Good speaker diarization |
| German | de | Excellent | Technical content support |
| Portuguese | pt | Very Good | Brazilian and European |
| Italian | it | Very Good | Good word timestamps |
| Russian | ru | Very Good | Cyrillic text support |
| Japanese | ja | Good | Hiragana/Katakana/Kanji |
| Chinese | zh | Good | Simplified and Traditional |
| Arabic | ar | Good | RTL text support |
| Hindi | hi | Good | Devanagari script |
| Korean | ko | Good | Hangul script |
Model Comparison
Choose the best model for your use case:

| Model | Speed | Accuracy | Speaker Diarization | Best For |
|---|---|---|---|---|
| whisperx | Medium | Highest | Excellent | Production transcription |
| faster-whisper | Fastest | High | Good | Real-time applications |
| whisper-timestamped | Slow | High | Good | Detailed analysis |
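The comparison table can be turned into a simple lookup. The sketch below maps a stated priority to a model identifier. The priority labels (`accuracy`, `speed`, `detail`) are invented for this example; only the model names and their strengths come from the table.

```python
# Model identifiers as they appear in the comparison table.
MODEL_BY_PRIORITY = {
    "accuracy": "whisperx",           # highest accuracy, production transcription
    "speed": "faster-whisper",        # fastest, real-time applications
    "detail": "whisper-timestamped",  # slower, suited to detailed analysis
}


def choose_model(priority: str) -> str:
    """Pick a model name for a priority label (labels are hypothetical)."""
    try:
        return MODEL_BY_PRIORITY[priority.lower()]
    except KeyError:
        options = ", ".join(sorted(MODEL_BY_PRIORITY))
        raise ValueError(f"unknown priority {priority!r}; expected one of: {options}")


if __name__ == "__main__":
    for label in ("accuracy", "speed", "detail"):
        print(label, "->", choose_model(label))
```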
Best Practices
Audio Quality Guidelines
For best transcription results:
Cost Optimization
Error Handling
400 Bad Request - Invalid Audio
Causes:
- Unsupported audio format
- Corrupted audio file
- Audio too short
Solutions:
- Use supported formats (WAV, MP3, M4A, MP4)
- Verify file integrity
- Ensure at least 10 seconds of audio
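Two of the causes above can be caught before uploading. This sketch checks the file extension against the supported formats and the duration against the 10-second minimum. It assumes the caller already knows the duration and that the extension reflects the actual container; `preflight_problems` is a hypothetical helper name.

```python
from pathlib import Path

# Taken from the solutions listed above.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".mp4"}
MIN_DURATION_SECONDS = 10


def preflight_problems(filename: str, duration_seconds: float) -> list[str]:
    """Return a list of likely 400-error causes; empty means no issue found.

    This does not detect corrupted files, which needs real decoding.
    """
    problems = []
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format {suffix or '(none)'}")
    if duration_seconds < MIN_DURATION_SECONDS:
        problems.append(
            f"audio too short: {duration_seconds}s, need at least {MIN_DURATION_SECONDS}s"
        )
    return problems


if __name__ == "__main__":
    print(preflight_problems("interview.ogg", 4.5))
    print(preflight_problems("interview.MP3", 95))
```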
413 Payload Too Large
Causes:
- File size exceeds limits
- Too many files in a single request
Solutions:
- Split large files into smaller chunks
- Reduce the number of files per request
- Use URL transcription for large videos
422 Processing Error
Causes:
- Audio has no speech content
- Extremely poor audio quality
Solutions:
- Verify the audio contains speech
- Improve audio quality
- Try a different transcription model
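A client can surface these suggestions directly when a request fails. The sketch below maps the three status codes above to their documented solutions. The fallback message for other codes is an assumption, since this page does not describe them.

```python
# Status codes and solutions as listed in this section.
ERROR_HINTS = {
    400: ("Invalid audio", [
        "use a supported format (WAV, MP3, M4A, MP4)",
        "verify file integrity",
        "ensure at least 10 seconds of audio",
    ]),
    413: ("Payload too large", [
        "split large files into smaller chunks",
        "reduce the number of files per request",
        "use URL transcription for large videos",
    ]),
    422: ("Processing error", [
        "verify the audio contains speech",
        "improve audio quality",
        "try a different transcription model",
    ]),
}


def explain_error(status_code: int) -> str:
    """Return a one-line, human-readable hint for an HTTP status code."""
    # The default branch is a generic guess for codes not covered here.
    title, hints = ERROR_HINTS.get(
        status_code, ("Unexpected error", ["inspect the response body for details"])
    )
    return f"{status_code} {title}: " + "; ".join(hints)


if __name__ == "__main__":
    print(explain_error(413))
```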
Pricing
Transcription pricing is based on audio duration:

| Service | Cost | Description |
|---|---|---|
| Basic Transcription | 660 credits/minute | Text-only transcription |
| With Speaker Diarization | 660 credits/minute | Speaker identification included |
| With Word Timestamps | 660 credits/minute | Word-level timing data |
| Transcript Editing | Free | No additional cost for edits |
Cost Examples
| Duration | Features | Credits | USD Cost |
|---|---|---|---|
| 10 minutes | Basic transcription | 6600 | $0.88 |
| 30 minutes | With speakers + timestamps | 19800 | $2.64 |
| 1 hour | Full features | 39600 | $5.28 |
| 2 hours | Full features | 79200 | $10.56 |
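The cost examples follow from simple arithmetic, sketched below. The rate of 660 credits per minute is from the pricing table. The conversion of 7,500 credits per US dollar is not stated anywhere on this page; it is inferred from the examples (6,600 credits for $0.88) and should be treated as an assumption. How partial minutes are billed is also unknown, so the function accepts whole minutes only.

```python
from decimal import Decimal, ROUND_HALF_UP

CREDITS_PER_MINUTE = 660  # from the pricing table
CREDITS_PER_USD = 7500    # ASSUMPTION: inferred from 6600 credits = $0.88


def estimate_cost(minutes: int) -> tuple[int, Decimal]:
    """Return (credits, usd) for a whole number of audio minutes."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    credits = minutes * CREDITS_PER_MINUTE
    usd = (Decimal(credits) / Decimal(CREDITS_PER_USD)).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )
    return credits, usd


if __name__ == "__main__":
    for minutes in (10, 30, 60, 120):
        credits, usd = estimate_cost(minutes)
        print(f"{minutes} min: {credits} credits, ${usd}")
```

Run against the four durations in the table, this reproduces the listed credit and USD figures.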
