The WhisperX API is a containerized service for transcribing audio files with speaker diarization, built on the whisperX
project. It exposes an HTTP endpoint for audio transcription and is packaged as a Docker container for easy deployment.
- Docker with GPU support. Follow the instructions to install NVIDIA Docker.
- A Hugging Face API token.
Include your Hugging Face access token, which you can generate in your Hugging Face account settings. After generating the token, accept the user agreement for the following models:
- Rename config.py.example to config.py and update it with your Hugging Face token:
mv config.py.example config.py
echo "HF_TOKEN = '<my-hf-token>'" > config.py
Replace <my-hf-token> with your actual Hugging Face token.
Build the Docker image for the WhisperX API:
docker build -t whisperx-api --network=host --build-arg hftoken=<my-hf-token> .
Again, replace <my-hf-token> with your Hugging Face token. The build might take a while.
After building the Docker image, you can run the WhisperX API with:
docker run --gpus all -p 5000:5000 whisperx-api
This will start the API and make it accessible on port 5000.
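Before sending requests, it can help to wait until something is listening on the published port. The following is a minimal sketch using only the Python standard library; the helper names `port_open` and `wait_for_port` are hypothetical and not part of this project. An open port is only a rough signal: it shows that a listener accepts connections, not that the model has finished loading.

```python
import socket
import time


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


def wait_for_port(host: str, port: int,
                  deadline_s: float = 60.0, interval_s: float = 1.0) -> bool:
    """Poll until the port accepts connections or the deadline passes."""
    end = time.monotonic() + deadline_s
    while time.monotonic() < end:
        if port_open(host, port):
            return True
        time.sleep(interval_s)
    return False


# Example call, matching the `-p 5000:5000` mapping used above:
# wait_for_port("127.0.0.1", 5000)
```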
To transcribe an audio file, send a POST request to the API endpoint. Here's an example using curl:
curl http://127.0.0.1:5000/transcribe -X POST -F "file=@./audio_en.mp3"
Replace ./audio_en.mp3 with the path to your audio file.
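The same request can be sent from Python. This is a sketch using only the standard library, assuming the endpoint accepts a multipart/form-data upload in a form field named `file`, as the curl command suggests. The helpers `build_multipart` and `transcribe` are hypothetical names introduced for illustration, not part of this project.

```python
import json
import mimetypes
import uuid
from pathlib import Path
from typing import Tuple
from urllib import request


def build_multipart(field: str, filename: str, data: bytes) -> Tuple[bytes, str]:
    """Encode one file as a multipart/form-data body.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def transcribe(path: str, url: str = "http://127.0.0.1:5000/transcribe") -> dict:
    """POST the audio file in the form field "file" and return the parsed JSON."""
    audio = Path(path)
    body, content_type = build_multipart("file", audio.name, audio.read_bytes())
    req = request.Request(url, data=body, method="POST",
                          headers={"Content-Type": content_type})
    # No timeout is set here because transcription can take a while.
    with request.urlopen(req) as resp:
        return json.load(resp)


# Example call, equivalent to the curl command above:
# result = transcribe("./audio_en.mp3")
```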
The output looks like the following:
{
"segments" : [
{
"end" : 10.192,
"speaker" : "SPEAKER_01",
"start" : 2.883,
"text" : " This is a test audio file of about phone line quality in English.",
"words" : [
{
"end" : 3.043,
"score" : 0.718,
"speaker" : "SPEAKER_00",
"start" : 2.883,
"word" : "This"
},
{
"end" : 3.163,
"score" : 0.096,
"speaker" : "SPEAKER_00",
"start" : 3.123,
"word" : "is"
},
{
"end" : 3.344,
"score" : 0.456,
"speaker" : "SPEAKER_00",
"start" : 3.324,
"word" : "a"
},
<...>
]
}
],
"word_segments" : [
{
"end" : 3.043,
"score" : 0.718,
"speaker" : "SPEAKER_00",
"start" : 2.883,
"word" : "This"
},
{
"end" : 3.163,
"score" : 0.096,
"speaker" : "SPEAKER_00",
"start" : 3.123,
"word" : "is"
},
{
"end" : 3.344,
"score" : 0.456,
"speaker" : "SPEAKER_00",
"start" : 3.324,
"word" : "a"
},
<...>
]
}
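Once parsed, the response is an ordinary dictionary. The sketch below shows one way to turn it into a readable transcript and to flag uncertain words. The function names are hypothetical, and it assumes that `score` is a per-word confidence where higher is better and that some keys may be missing from individual entries.

```python
from typing import Dict, List


def format_segments(result: Dict) -> List[str]:
    """Render each segment as "[start-end] speaker: text"."""
    lines = []
    for seg in result.get("segments", []):
        speaker = seg.get("speaker", "UNKNOWN")  # assumption: key may be absent
        text = seg.get("text", "").strip()
        lines.append(f"[{seg['start']:.2f}-{seg['end']:.2f}] {speaker}: {text}")
    return lines


def low_confidence_words(result: Dict, threshold: float = 0.5) -> List[str]:
    """Return words whose score is below the threshold.

    Words without a score are treated as 0.0 and are therefore included.
    """
    return [w["word"] for w in result.get("word_segments", [])
            if w.get("score", 0.0) < threshold]


# Trimmed copy of the example response above.
sample = {
    "segments": [
        {"start": 2.883, "end": 10.192, "speaker": "SPEAKER_01",
         "text": " This is a test audio file of about phone line quality in English."},
    ],
    "word_segments": [
        {"word": "This", "start": 2.883, "end": 3.043, "score": 0.718, "speaker": "SPEAKER_00"},
        {"word": "is", "start": 3.123, "end": 3.163, "score": 0.096, "speaker": "SPEAKER_00"},
        {"word": "a", "start": 3.324, "end": 3.344, "score": 0.456, "speaker": "SPEAKER_00"},
    ],
}

print("\n".join(format_segments(sample)))
# [2.88-10.19] SPEAKER_01: This is a test audio file of about phone line quality in English.
print(low_confidence_words(sample))
# ['is', 'a']
```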