Overview

AudioPod AI’s Speaker Extraction API automatically separates multiple speakers in audio recordings into individual speaker-specific audio files. The service identifies who speaks when and creates clean, separate audio tracks for each speaker while preserving original audio quality.

Key Features

  • Speaker Separation: Generate separate audio files for each detected speaker
  • Timeline Generation: Get detailed RTTM files with speaker timestamps
  • Speaker Analytics: Duration and quality statistics for each speaker
  • Multi-Format Support: Process audio and video files (WAV, MP3, M4A, MP4, etc.)
  • URL Processing: Extract speakers from YouTube and other video platforms
  • Smart Detection: Detect the number of speakers automatically, or specify the expected count
  • Quality Preservation: Maintains original audio quality in extracted files

Authentication

All endpoints require authentication. Send either credential as a bearer token in the Authorization header:
  • API Key: Authorization: Bearer your_api_key
  • JWT Token: Authorization: Bearer your_jwt_token
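
Both credential types use the same header shape, so a client can build it in one place. The helper below is a minimal sketch; the name auth_headers is hypothetical and not part of the API.

```python
def auth_headers(token: str) -> dict[str, str]:
    """Build the Authorization header for an API key or a JWT.

    Hypothetical helper: both credential types are sent the same way.
    """
    token = token.strip()
    if not token:
        raise ValueError("token must not be empty")
    return {"Authorization": f"Bearer {token}"}


headers = auth_headers("your_api_key")
```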

Speaker Extraction

Extract from File Upload

Upload an audio or video file to extract individual speaker tracks.
POST /api/v1/speaker/extract
Authorization: Bearer {api_key}
Content-Type: multipart/form-data

file: (audio/video file)
num_speakers: 4
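
The upload request above could be assembled with Python's standard library roughly as follows. This is a sketch under assumptions: BASE_URL is a placeholder host (the page does not state one), build_upload_request is a hypothetical name, and the file part is labeled application/octet-stream because the expected part content type is not documented here.

```python
import urllib.request
import uuid

BASE_URL = "https://api.example.com"  # assumption: placeholder, not the real host


def build_upload_request(api_key, filename, data, num_speakers=None):
    """Build (but do not send) a multipart POST to /api/v1/speaker/extract."""
    boundary = uuid.uuid4().hex
    parts = []
    if num_speakers is not None:
        # Plain form field carrying the expected speaker count.
        parts.append(
            (
                f"--{boundary}\r\n"
                'Content-Disposition: form-data; name="num_speakers"\r\n\r\n'
                f"{num_speakers}\r\n"
            ).encode()
        )
    # File field: part headers, a blank line, then the raw bytes.
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode()
        + data
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())  # closing boundary
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/speaker/extract",
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


req = build_upload_request("your_api_key", "meeting.wav", b"RIFF....", num_speakers=4)
```

Passing req to urllib.request.urlopen would send it; the sketch stops before that so it can be inspected without network access.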

Extract from URL

Extract speakers from audio/video URLs (YouTube, Vimeo, etc.).
POST /api/v1/speaker/extract
Authorization: Bearer {api_key}
Content-Type: application/x-www-form-urlencoded

url=https://youtube.com/watch?v=example123&num_speakers=3
Response:
{
  "id": 123,
  "job_type": "extraction",
  "status": "PENDING",
  "created_at": "2024-01-15T10:30:00Z",
  "user_id": "550e8400-e29b-41d4-a716-446655440000",
  "task_id": "celery_task_uuid_here"
}
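
For the URL variant, the form body can be built with urllib.parse.urlencode, which percent-encodes the source URL so that any "&" inside it is not read as a field separator. A minimal sketch, assuming a placeholder BASE_URL and a hypothetical function name:

```python
import urllib.parse
import urllib.request

BASE_URL = "https://api.example.com"  # assumption: placeholder, not the real host


def build_url_request(api_key, source_url, num_speakers=None):
    """Build (but do not send) a form-encoded POST to /api/v1/speaker/extract."""
    fields = {"url": source_url}
    if num_speakers is not None:
        fields["num_speakers"] = str(num_speakers)
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/speaker/extract",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


req = build_url_request(
    "your_api_key", "https://youtube.com/watch?v=example123", num_speakers=3
)
```

The JSON response carries the job id, which is the value to keep for later status checks.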

Job Management

Get Job Status

Monitor the progress of speaker extraction jobs.
GET /api/v1/speaker/jobs/{job_id}
Authorization: Bearer {api_key}
Response (Completed Extraction):
{
  "id": 123,
  "job_type": "extraction",
  "status": "COMPLETED",
  "created_at": "2024-01-15T10:30:00Z",
  "completed_at": "2024-01-15T10:35:30Z",
  "user_id": "550e8400-e29b-41d4-a716-446655440000",
  "task_id": "celery_task_uuid_here",
  "result": {
    "speakers": [
      {
        "id": 0,
        "label": "SPEAKER_0",
        "audio_path": "processed/123/speaker_0.wav",
        "download_url": "https://s3.amazonaws.com/...",
        "audio_stats": {
          "rms_db": -12.3,
          "peak": 0.85
        }
      },
      {
        "id": 1,
        "label": "SPEAKER_1",
        "audio_path": "processed/123/speaker_1.wav",
        "download_url": "https://s3.amazonaws.com/...",
        "audio_stats": {
          "rms_db": -15.7,
          "peak": 0.72
        }
      }
    ],
    "files": [
      {
        "type": "audio",
        "speaker": "SPEAKER_0",
        "path": "processed/123/speaker_0.wav",
        "download_url": "https://s3.amazonaws.com/..."
      },
      {
        "type": "audio",
        "speaker": "SPEAKER_1",
        "path": "processed/123/speaker_1.wav",
        "download_url": "https://s3.amazonaws.com/..."
      },
      {
        "type": "rttm",
        "path": "processed/123/extraction.rttm",
        "download_url": "https://s3.amazonaws.com/..."
      }
    ],
    "rttm_path": "processed/123/extraction.rttm"
  }
}
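
Because extraction runs asynchronously, a client typically polls this endpoint until the job reaches a final state. The sketch below keeps the HTTP call out of the loop: fetch_job is any caller-supplied function that returns the decoded job JSON. The status string "FAILED" is an assumption (this page shows only PENDING, PROCESSING, and COMPLETED), and the function names are hypothetical.

```python
import time

# "COMPLETED" appears in the responses above; "FAILED" is an assumed status string.
TERMINAL_STATUSES = {"COMPLETED", "FAILED"}


def wait_for_job(fetch_job, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Call fetch_job() until the job reaches a terminal status or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job()
        if job["status"] in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"job {job.get('id')} still {job['status']} after {timeout}s"
            )
        sleep(interval)


def speaker_downloads(job):
    """Map each speaker label to its download URL; empty if no result yet."""
    result = job.get("result") or {}
    return {s["label"]: s["download_url"] for s in result.get("speakers", [])}
```

With a completed job shaped like the response above, speaker_downloads returns one entry per speaker, keyed by SPEAKER_0, SPEAKER_1, and so on.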

List Extraction Jobs

Get all speaker extraction jobs for the authenticated user.
GET /api/v1/speaker/jobs?job_type=extraction&status=COMPLETED&limit=50
Authorization: Bearer {api_key}
Response:
{
  "items": [
    {
      "id": 123,
      "job_type": "extraction",
      "status": "COMPLETED",
      "created_at": "2024-01-15T10:30:00Z",
      "completed_at": "2024-01-15T10:35:30Z",
      "user_id": "550e8400-e29b-41d4-a716-446655440000",
      "task_id": "celery_task_uuid_here",
      "filename": "podcast_episode.mp3",
      "display_name": "podcast_episode.mp3",
      "outputFiles": [
        {
          "type": "audio",
          "speaker": "SPEAKER_0",
          "path": "processed/123/speaker_0.wav"
        },
        {
          "type": "audio", 
          "speaker": "SPEAKER_1",
          "path": "processed/123/speaker_1.wav"
        }
      ]
    }
  ],
  "hasMore": false,
  "total": 1
}
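
The query string and the list response can be handled with two small helpers. This is a sketch: the function names are hypothetical, and only the three query parameters shown in the request above (job_type, status, limit) are used, since no paging parameter is documented here.

```python
from urllib.parse import urlencode


def jobs_list_path(job_type="extraction", status=None, limit=50):
    """Build the path and query string for the job list endpoint."""
    params = {"job_type": job_type}
    if status is not None:
        params["status"] = status
    params["limit"] = limit
    return "/api/v1/speaker/jobs?" + urlencode(params)


def output_paths(page):
    """Map each job id in a list response to the paths of its output files."""
    return {
        job["id"]: [f["path"] for f in job.get("outputFiles", [])]
        for job in page["items"]
    }


path = jobs_list_path(status="COMPLETED")
```

A hasMore value of true in the response indicates that more jobs exist than were returned.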

Retry Failed Job

Retry a failed speaker extraction job.
POST /api/v1/speaker/jobs/{job_id}/retry
Authorization: Bearer {api_key}
Response:
{
  "id": 123,
  "job_type": "extraction",
  "status": "PROCESSING",
  "created_at": "2024-01-15T10:30:00Z",
  "task_id": "new_celery_task_uuid_here",
  "user_id": "550e8400-e29b-41d4-a716-446655440000"
}

Delete Job

Remove a speaker extraction job and its associated files.
DELETE /api/v1/speaker/jobs/{job_id}
Authorization: Bearer {api_key}
Response: 204 No Content on successful deletion
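
The retry and delete calls differ only in HTTP method and path, so one builder can cover both. A minimal sketch, assuming a placeholder BASE_URL and a hypothetical function name:

```python
import urllib.request

BASE_URL = "https://api.example.com"  # assumption: placeholder, not the real host


def job_request(api_key, job_id, action):
    """Build (but do not send) a retry or delete request for one job."""
    if action == "retry":
        method, path = "POST", f"/api/v1/speaker/jobs/{job_id}/retry"
    elif action == "delete":
        method, path = "DELETE", f"/api/v1/speaker/jobs/{job_id}"
    else:
        raise ValueError(f"unknown action: {action!r}")
    return urllib.request.Request(
        BASE_URL + path,
        method=method,
        headers={"Authorization": f"Bearer {api_key}"},
    )


delete_req = job_request("your_api_key", 123, "delete")
retry_req = job_request("your_api_key", 123, "retry")
```

When the delete request is sent with urllib.request.urlopen, a successful deletion shows up as a response whose status attribute is 204 and whose body is empty.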

Supported Formats

Audio Formats:
  • WAV, MP3, M4A, AAC, FLAC, OGG, OPUS, WebM
  • WMA, Speex, and other common formats
Video Formats:
  • MP4, AVI, MOV, MKV, WebM
  • Audio will be extracted automatically from video files
URL Sources:
  • YouTube, Vimeo, and other video platforms
  • Direct audio/video file URLs

Error Handling

Pricing

Speaker extraction costs are based on audio duration:
Service               Cost                    Description
Speaker Extraction    1,650 credits/minute    Generate separate audio files for each speaker
Note: Credits are charged per second of audio (27.5 credits/second)
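
The per-minute rate and the per-second note describe the same price: 1,650 / 60 = 27.5 credits per second. A small estimator makes the arithmetic explicit. It is a sketch only: the function name is hypothetical, and it does not round, because how fractional seconds or credits are rounded is not stated here.

```python
CREDITS_PER_MINUTE = 1650


def extraction_credits(duration_seconds):
    """Estimate credits for a recording, billed per second of audio."""
    if duration_seconds < 0:
        raise ValueError("duration must be non-negative")
    return duration_seconds * CREDITS_PER_MINUTE / 60


per_second = extraction_credits(1)        # 27.5
five_minutes = extraction_credits(5 * 60)  # 8250.0
```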

Cost Examples

Duration      Service       Credits    USD Cost*
5 minutes     Extraction    8,250      ~$1.10
15 minutes    Extraction    24,750     ~$3.30
30 minutes    Extraction    49,500     ~$6.60
1 hour        Extraction    99,000     ~$13.20
*USD cost estimates based on standard credit pricing. Actual costs may vary based on subscription plan.

Rate Limits

  • 100 requests per minute per API key
  • Rate limits apply per endpoint
  • Exceeding limits returns 429 Too Many Requests
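
One common way for a client to cope with 429 responses is to retry with exponential backoff. The sketch below is a generic pattern, not something this API prescribes: send is any caller-supplied function that performs one request and returns a (status, body) pair, and the delay schedule is an arbitrary choice. Whether the service returns a Retry-After header is not stated here, so the sketch does not rely on one.

```python
import time


def call_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry send() while it reports HTTP 429, doubling the wait each time."""
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```

At 100 requests per minute, spacing calls at least 0.6 seconds apart keeps a single client under the limit without needing retries.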

Next Steps