This repository contains the Faster Whisper Worker for RunPod, optimized to run distilled Whisper models. The Worker is designed to process audio files using efficient distilled variants of Whisper that maintain high accuracy while offering improved performance. It supports transcription formatting, language translation, and more as part of the RunPod Workers collection.
Input | Type | Description |
---|---|---|
audio |
Path | Audio file |
audio_base64 |
str | Base64-encoded audio file |
model |
str | Choose a Whisper model. Choices: "distil-large-v2", "distil-large-v3". Default: "distil-large-v2" |
transcription |
str | Choose the format for the transcription. Choices: "plain_text", "formatted_text", "srt", "vtt". Default: "plain_text" |
translate |
bool | Translate the text to English when set to True. Default: False |
translation |
str | Choose the format for the translation. Choices: "plain_text", "formatted_text", "srt", "vtt". Default: "plain_text" |
language |
str | Language spoken in the audio, specify None to perform language detection. Default: None |
temperature |
float | Temperature to use for sampling. Default: 0 |
best_of |
int | Number of candidates when sampling with non-zero temperature. Default: 5 |
beam_size |
int | Number of beams in beam search, only applicable when temperature is zero. Default: 5 |
patience |
float | Optional patience value to use in beam decoding. Default: None |
length_penalty |
float | Optional token length penalty coefficient (alpha). Default: None |
suppress_tokens |
str | Comma-separated list of token ids to suppress during sampling. Default: "-1" |
initial_prompt |
str | Optional text to provide as a prompt for the first window. Default: None |
condition_on_previous_text |
bool | If True, provide the previous output of the model as a prompt for the next window. Default: True |
temperature_increment_on_fallback |
float | Temperature to increase when falling back when the decoding fails. Default: 0.2 |
compression_ratio_threshold |
float | If the gzip compression ratio is higher than this value, treat the decoding as failed. Default: 2.4 |
logprob_threshold |
float | If the average log probability is lower than this value, treat the decoding as failed. Default: -1.0 |
no_speech_threshold |
float | If the probability of the token is higher than this value, consider the segment as silence. Default: 0.6 |
enable_vad |
bool | If True, use the voice activity detection (VAD) to filter out parts of the audio without speech. This step is using the Silero VAD model. Default: False |
word_timestamps |
bool | If True, include word timestamps in the output. Default: False |
The following inputs can be used for testing the model:
{
"input": {
"audio": "https://github.com/runpod-workers/sample-inputs/raw/main/audio/gettysburg.wav"
}
}
{
"segments": [
{
"id": 1,
"seek": 106,
"start": 0.11,
"end": 3.11,
"text": " Hola mundo.",
"tokens": [
50364,
22637,
7968,
13,
50514
],
"temperature": 0.1,
"avg_logprob": -0.8348079785480325,
"compression_ratio": 0.5789473684210527,
"no_speech_prob": 0.1453857421875
}
],
"detected_language": "es",
"transcription": "Hola mundo.",
"translation": null,
"device": "cuda",
"model": "distil-large-v2",
"translation_time": 0.7796223163604736
}