A synthetic dataset for keyword spotting (KWS) in Russian.
This repository provides the dataset and additional materials for the paper "Synth-ruSC: Construction and Validation of Synthetic Dataset to Solve the Problem of Keyword Spotting in Russian" (submitted for review).
The Synth-ruSC dataset is available for download at this link.
Some additional materials are necessary to work with the repository:
- The Google Speech Commands dataset, version 0.02 (GSC-v2). We used scripts from NeMo to download the dataset and create manifests; for more details, see this link. A minimal manifest-reading sketch follows this list;
- Voice-Activity-Detection (VAD) model: SG-VAD;
- Text-to-Speech model for generating audio: xtts-v2 (a generation sketch follows this list);
- Several Speech-To-Text (STT) models.
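For orientation, NeMo manifests are plain JSON-lines files (one JSON object per audio clip). Below is a minimal sketch of loading such a manifest; the file path and the exact label key (for example "command" or "label", depending on the NeMo script and version) are assumptions for illustration.

```python
import json
from pathlib import Path

def read_manifest(manifest_path):
    """Read a NeMo-style JSON-lines manifest into a list of dicts.

    Each line is expected to hold a JSON object with at least an
    "audio_filepath" and a "duration" field plus a keyword label
    (the label key may be "command" or "label" depending on the
    script/version used to build the manifest).
    """
    entries = []
    with open(manifest_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries

if __name__ == "__main__":
    # Hypothetical path; adjust to wherever the GSC-v2 manifests were created.
    manifest = read_manifest(Path("data/gsc_v2/train_manifest.json"))
    print(f"{len(manifest)} entries, first entry: {manifest[0]}")
```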
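As a rough illustration of the generation step, the sketch below synthesizes one keyword with xtts-v2 through the Coqui TTS API. The keyword, the speaker reference recording, and the output path are placeholders; this is not the actual generation script used for the paper.

```python
import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the multilingual XTTS-v2 checkpoint via the Coqui TTS API.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# XTTS-v2 clones the voice of a short reference recording, so a reference
# speaker file is required (illustrative path below).
tts.tts_to_file(
    text="вперёд",                                     # illustrative Russian keyword
    language="ru",
    speaker_wav="reference_speakers/speaker_001.wav",  # hypothetical reference clip
    file_path="synth/vpered_speaker_001.wav",          # hypothetical output path
)
```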
See "Requirements" for the rest of the required python libs.
The dataset construction process included the following main stages:
- Generation of synthetic audio recordings using a Text-to-Speech (TTS) model;
- Trimming of the generated audio recordings using a Voice-Activity-Detection (VAD) model (see the trimming sketch after this list);
- Filtering out recordings with faulty generation, artifacts, or low audio quality using Speech-to-Text (STT) models (see the filtering sketch below).
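To give an idea of the trimming stage, here is a minimal sketch: it assumes the VAD model (for example SG-VAD) outputs per-frame speech probabilities at a fixed hop and keeps only the span between the first and last voiced frames plus a small margin. The hop size, threshold, and margin are illustrative assumptions, not the parameters used in the paper.

```python
import torch

def trim_silence(wav: torch.Tensor, speech_probs: torch.Tensor,
                 sample_rate: int = 16000, hop_s: float = 0.01,
                 threshold: float = 0.5, margin_s: float = 0.05) -> torch.Tensor:
    """Trim leading/trailing non-speech from `wav` given per-frame speech
    probabilities `speech_probs` (assumed to come from a VAD model).

    The frame hop `hop_s` must match the hop actually used by the VAD.
    """
    voiced = (speech_probs > threshold).nonzero(as_tuple=True)[0]
    if voiced.numel() == 0:
        return wav  # no speech detected; leave the recording unchanged
    hop = int(hop_s * sample_rate)
    margin = int(margin_s * sample_rate)
    start = max(int(voiced[0]) * hop - margin, 0)
    end = min((int(voiced[-1]) + 1) * hop + margin, wav.shape[-1])
    return wav[..., start:end]

if __name__ == "__main__":
    # Toy example: 1 s of low-level noise, with the VAD (assumed) flagging
    # frames 30-69 (0.30-0.70 s) as speech.
    sr = 16000
    wav = 0.01 * torch.randn(1, sr)
    probs = torch.zeros(100)
    probs[30:70] = 1.0
    trimmed = trim_silence(wav, probs, sample_rate=sr)
    print(wav.shape, "->", trimmed.shape)  # (1, 16000) -> (1, 8000)
```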
There is a Python notebook with code for each of these stages in the ./notebooks/ directory.
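As an illustration of the filtering stage, the sketch below transcribes a generated clip and keeps it only if the transcription reduces to the target keyword. The Whisper checkpoint, the exact matching rule, and the file paths are assumptions for illustration; the notebooks define the actual STT models and filtering criteria.

```python
import re
from transformers import pipeline

# The checkpoint is an illustrative choice; the paper uses several STT models.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

def keyword_matches(audio_path: str, keyword: str) -> bool:
    """Transcribe a generated clip and check whether the transcription
    reduces to exactly the target keyword (a simple proxy for a
    recognition-based filtering criterion)."""
    text = asr(audio_path, generate_kwargs={"language": "russian"})["text"]
    normalized = " ".join(re.findall(r"\w+", text.lower()))
    return normalized == keyword.lower()

# Illustrative usage with hypothetical paths/keywords:
# keep_clip = keyword_matches("trimmed/vpered_speaker_001.wav", "вперёд")
```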
Table 1. The statistics of the collected Synth-ruSC dataset ("good" samples group).
Additionally, 896 audio recordings (the "real" column) were collected from 23 people (13 men and 10 women). Each participant pronounced each of the 39 keywords once, giving 23 × 39 = 897 recordings; one recording of the word "over" by one man was pronounced incorrectly and was removed from the test set, leaving 896.
The main requirements are as follows:
- Python 3.9+
- torch==2.3.0
- transformers==4.41.1
- TTS
- sgvad
- nemo
- vosk-transcriber
All libraries used in the project are listed in the requirements.txt file.
Note: the STT libraries may conflict with each other, so you may want to create a separate virtual environment for each of them.
If you have found our results helpful in your work, feel free to cite our publication and this repository as follows:
Coming soon.
Thanks to @naumov-al for adding the code and materials.