LinTO-STT-Kaldi is an API for Automatic Speech Recognition (ASR) based on models trained with Kaldi.
LinTO-STT-Kaldi can either be used as a standalone transcription service or deployed within a micro-services infrastructure using a message broker connector.
It can be used for offline or real-time transcription.
To run the transcription models you'll need:
- At least 7GB of disk space to build the docker image.
- Up to 7GB of RAM depending on the model used.
- One CPU per worker. Inference time scales with CPU performance.
LinTO-STT-Kaldi accepts two kinds of models:
- LinTO Acoustic and Languages models.
- Vosk models.
We provide home-curated models (v2) on dl.linto.ai. You can also use the Vosk models available here.
The transcription service requires Docker to be up and running.
In task mode, the only entry point to the STT service is tasks posted on a message broker. Supported message brokers are RabbitMQ, Redis and Amazon SQS. In addition, to prevent large audio files from transiting through the message broker, the STT worker uses a shared storage folder (SHARED_FOLDER).
1- First, build or pull the image:
git clone https://github.com/linto-ai/linto-stt.git
cd linto-stt
docker build . -f kaldi/Dockerfile -t linto-stt-kaldi:latest
or
docker pull lintoai/linto-stt-kaldi
2- Download the models
Have the acoustic and language model ready at AM_PATH and LM_PATH if you are using LinTO models. If you are using a Vosk model, have it ready at MODEL.
3- Fill the .env file
An example of .env file is provided in kaldi/.envdefault.
PARAMETER | DESCRIPTION | EXAMPLE |
---|---|---|
SERVICE_MODE | STT serving mode (see Serving mode) | http|task|websocket |
MODEL_TYPE | Type of STT model used | lin|vosk |
ENABLE_STREAMING | In http serving mode, enables the /streaming websocket route | true|false |
SERVICE_NAME | In task mode, the name of the queue used for task processing | my-stt |
SERVICE_BROKER | In task mode, the URL of the message broker | redis://my-broker:6379 |
BROKER_PASS | In task mode, the broker password | my-password |
STREAMING_PORT | In websocket mode, the listening port for incoming WS connections | 80 |
CONCURRENCY | Maximum number of parallel requests | >1 |
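As a sketch, a minimal .env for the task mode could look like the fragment below; all values are illustrative placeholders mirroring the examples in the table above, to be adapted to your deployment:

```ini
# Serving mode and model type
SERVICE_MODE=task
MODEL_TYPE=lin

# Task-mode settings (placeholder broker URL and queue name)
SERVICE_NAME=my-stt
SERVICE_BROKER=redis://my-broker:6379
BROKER_PASS=my-password

# Number of parallel requests handled by the worker
CONCURRENCY=2
```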
STT can be used in three ways:
- Through an HTTP API, using the http mode.
- Through a message broker, using the task mode.
- Through a websocket server, using the websocket mode.
The mode is specified using the SERVICE_MODE value in the .env file or as an environment variable:
SERVICE_MODE=http
The HTTP serving mode deploys an HTTP server and a swagger-ui to allow transcription requests on a dedicated route.
The SERVICE_MODE value in the .env file should be set to http.
docker run --rm \
-p HOST_SERVING_PORT:80 \
-v AM_PATH:/opt/AM \
-v LM_PATH:/opt/LM \
--env-file .env \
linto-stt-kaldi:latest
This will run a container providing an HTTP API bound to the HOST_SERVING_PORT port on the host.
Parameters:
Variables | Description | Example |
---|---|---|
HOST_SERVING_PORT | Host serving port | 80 |
AM_PATH | Path to the acoustic model on the host machine mounted to /opt/AM | /my/path/to/models/AM_fr-FR_v2.2.0 |
LM_PATH | Path to the language model on the host machine mounted to /opt/LM | /my/path/to/models/fr-FR_big-v2.2.0 |
MODEL_PATH | Path to the model (using MODEL_TYPE=vosk) mounted to /opt/model | /my/path/to/models/vosk-model |
The task serving mode connects a celery worker to a message broker.
The SERVICE_MODE value in the .env file should be set to task.
You need a message broker up and running at MY_SERVICE_BROKER.
docker run --rm \
-v AM_PATH:/opt/AM \
-v LM_PATH:/opt/LM \
-v SHARED_AUDIO_FOLDER:/opt/audio \
--env-file .env \
linto-stt-kaldi:latest
Parameters:
Variables | Description | Example |
---|---|---|
AM_PATH | Path to the acoustic model on the host machine mounted to /opt/AM | /my/path/to/models/AM_fr-FR_v2.2.0 |
LM_PATH | Path to the language model on the host machine mounted to /opt/LM | /my/path/to/models/fr-FR_big-v2.2.0 |
MODEL_PATH | Path to the model (using MODEL_TYPE=vosk) mounted to /opt/model | /my/path/to/models/vosk-model |
SHARED_AUDIO_FOLDER | Shared audio folder mounted to /opt/audio | /my/path/to/shared/audio |
The websocket serving mode deploys a streaming transcription service only.
The SERVICE_MODE value in the .env file should be set to websocket.
Usage is the same as the http streaming API.
Returns the state of the API
Method: GET
Returns "1" if healthcheck passes.
Transcription API
- Method: POST
- Response content: text/plain or application/json
- File: A WAVE file (16-bit, 16kHz)
Returns the transcribed text as "text/plain", or a JSON object when using "application/json", structured as follows:
{
  "text" : "This is the transcription",
  "words" : [
    {"word":"This", "start": 0.123, "end": 0.453, "conf": 0.9},
    ...
  ],
  "confidence-score": 0.879
}
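For instance, a client can unpack this response with the standard library alone; the payload below is a made-up example that mirrors the structure documented above:

```python
import json

# Illustrative payload matching the documented response structure
payload = '''
{
  "text": "This is the transcription",
  "words": [
    {"word": "This", "start": 0.123, "end": 0.453, "conf": 0.9}
  ],
  "confidence-score": 0.879
}
'''

result = json.loads(payload)
transcript = result["text"]
# Per-word duration in seconds, from the start/end timestamps
durations = [w["end"] - w["start"] for w in result["words"]]
print(transcript)              # This is the transcription
print(round(durations[0], 3))  # 0.33
```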
The /streaming route is accessible if the ENABLE_STREAMING environment variable is set to true.
The route accepts websocket connections. Exchanges are structured as follows:
1- The client sends a JSON message {"config": {"sample_rate":16000}}.
2- The client sends an audio chunk (go to 3-) or {"eof" : 1} (go to 5-).
3- The server sends either a partial result {"partial" : "this is a "} or a final result {"text": "this is a transcription"}.
4- Back to 2-.
5- The server sends a final result and closes the connection.
The connection will be closed and the worker freed if no chunk is received for 10s.
The /docs route offers an OpenAPI/swagger interface.
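The /streaming exchange can be sketched from the client side. The snippet below only builds and classifies the protocol messages with the standard library; an actual client would send them over a websocket library of your choice, which is not shown here:

```python
import json

def config_message(sample_rate=16000):
    """First message of a /streaming session: declares the audio sample rate."""
    return json.dumps({"config": {"sample_rate": sample_rate}})

def eof_message():
    """Last client message: tells the server no more audio will follow."""
    return json.dumps({"eof": 1})

def is_final(server_message):
    """A server message with a 'text' key is a final result; 'partial' is interim."""
    return "text" in json.loads(server_message)

print(config_message())                              # {"config": {"sample_rate": 16000}}
print(is_final('{"partial": "this is a "}'))         # False
print(is_final('{"text": "this is a transcription"}'))  # True
```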
STT-Worker accepts requests with the following arguments:
file_path: str, with_metadata: bool
- file_path: the location of the file within the shared folder: /.../SHARED_FOLDER/{file_path}
- with_metadata: if True, word timestamps and confidence are computed and returned. If False, those fields are empty.
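As a minimal sketch of how a client could prepare such a request: copy the audio into the shared folder, then build the task arguments. The folder and filename below are placeholders, and posting the task to the message broker itself is not shown:

```python
import json
import tempfile
from pathlib import Path

# Stand-in for the shared folder (placeholder for the real SHARED_FOLDER mount)
shared_folder = Path(tempfile.mkdtemp())

# Pretend this is the audio to transcribe (a real client would copy a WAVE file)
audio = shared_folder / "meeting.wav"
audio.write_bytes(b"RIFF....WAVE")  # dummy bytes, not a real WAVE file

# Task arguments as documented: file_path is relative to the shared folder
task_args = {"file_path": "meeting.wav", "with_metadata": True}
print(json.dumps(task_args))  # {"file_path": "meeting.wav", "with_metadata": true}
```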
On a successful transcription, the returned object is a JSON object structured as follows:
{
"text" : "this is the transcription as text",
"words": [
{
"word" : "this",
"start": 0.0,
"end": 0.124,
"conf": 1.0
},
...
],
"confidence-score": ""
}
- The text field contains the raw transcription.
- The words field contains each word with its timestamps and individual confidence. (Empty if with_metadata=False)
- The confidence-score field contains the overall confidence of the transcription. (0.0 if with_metadata=False)
You can test your HTTP API using curl:
curl -X POST "http://YOUR_SERVICE:YOUR_PORT/transcribe" -H "accept: application/json" -H "Content-Type: multipart/form-data" -F "file=@YOUR_FILE;type=audio/x-wav"
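An equivalent client can also be written with Python's standard library; the multipart body is built by hand. The URL and filename are placeholders, and the actual request line is commented out so the snippet stands on its own without a running service:

```python
import io
import urllib.request

def build_multipart(filename, audio_bytes, boundary="linto-stt-boundary"):
    """Build a multipart/form-data body with a single 'file' field."""
    body = io.BytesIO()
    body.write(f"--{boundary}\r\n".encode())
    body.write(
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode()
    )
    body.write(b"Content-Type: audio/x-wav\r\n\r\n")
    body.write(audio_bytes)
    body.write(f"\r\n--{boundary}--\r\n".encode())
    return body.getvalue(), f"multipart/form-data; boundary={boundary}"

payload, content_type = build_multipart("test.wav", b"RIFF....WAVE")
req = urllib.request.Request(
    "http://YOUR_SERVICE:YOUR_PORT/transcribe",  # placeholder URL
    data=payload,
    headers={"accept": "application/json", "Content-Type": content_type},
    method="POST",
)
# urllib.request.urlopen(req)  # uncomment once the service is reachable
print(content_type)  # multipart/form-data; boundary=linto-stt-boundary
```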
This project is developed under the AGPLv3 License (see LICENSE).