A synthetic dataset for keyword spotting (KWS) in Russian.
This repository provides the dataset and additional materials for the paper "Synth-ruSC: Construction and Validation of Synthetic Dataset to Solve the Problem of Keyword Spotting in Russian" (submitted for review).
The Synth-ruSC dataset is available for download at this link.
Some additional materials are necessary to work with the repository:
- The Google Speech Commands dataset, version 0.02 (GSC-v2). We used scripts from NeMo to download the dataset and create manifests; for more details, see this link. A minimal manifest-reading sketch follows this list;
- Voice-Activity-Detection (VAD) model: SG-VAD;
- Text-to-Speech model for generating audio: xtts-v2 (a generation sketch follows this list);
- Several Speech-To-Text (STT) models.
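For orientation, NeMo manifests are plain JSON-lines files (one JSON object per audio clip). Below is a minimal sketch of loading such a manifest; the file path and the exact label key (for example "command" or "label", depending on the NeMo script and version) are assumptions for illustration.

```python
import json
from pathlib import Path

def read_manifest(manifest_path):
    """Read a NeMo-style JSON-lines manifest into a list of dicts.

    Each line is expected to hold a JSON object with at least an
    "audio_filepath" and a "duration" field plus a keyword label
    (the label key may be "command" or "label" depending on the
    script/version used to build the manifest).
    """
    entries = []
    with open(manifest_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries

if __name__ == "__main__":
    # Hypothetical path; adjust to wherever the GSC-v2 manifests were created.
    manifest = read_manifest(Path("data/gsc_v2/train_manifest.json"))
    print(f"{len(manifest)} entries, first entry: {manifest[0]}")
```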
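As a rough illustration of the generation step, the sketch below synthesizes one keyword with xtts-v2 through the Coqui TTS API. The keyword, the speaker reference recording, and the output path are placeholders; this is not the actual generation script used for the paper.

```python
import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the multilingual XTTS-v2 checkpoint via the Coqui TTS API.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# XTTS-v2 clones the voice of a short reference recording, so a reference
# speaker file is required (illustrative path below).
tts.tts_to_file(
    text="вперёд",                                     # illustrative Russian keyword
    language="ru",
    speaker_wav="reference_speakers/speaker_001.wav",  # hypothetical reference clip
    file_path="synth/vpered_speaker_001.wav",          # hypothetical output path
)
```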
See "Requirements" for the rest of the required python libs.
The dataset construction process included the following main stages:
- Generation of synthetic audio recordings using a Text-to-Speech (TTS) model;
- Trimming of the generated audio recordings using a Voice-Activity-Detection (VAD) model (see the trimming sketch after this list);
- Filtering out recordings with faulty generation, artifacts, or low audio quality using Speech-to-Text (STT) models (see the filtering sketch below).
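To give an idea of the trimming stage, here is a minimal sketch: it assumes the VAD model (for example SG-VAD) outputs per-frame speech probabilities at a fixed hop and keeps only the span between the first and last voiced frames plus a small margin. The hop size, threshold, and margin are illustrative assumptions, not the parameters used in the paper.

```python
import torch

def trim_silence(wav: torch.Tensor, speech_probs: torch.Tensor,
                 sample_rate: int = 16000, hop_s: float = 0.01,
                 threshold: float = 0.5, margin_s: float = 0.05) -> torch.Tensor:
    """Trim leading/trailing non-speech from `wav` given per-frame speech
    probabilities `speech_probs` (assumed to come from a VAD model).

    The frame hop `hop_s` must match the hop actually used by the VAD.
    """
    voiced = (speech_probs > threshold).nonzero(as_tuple=True)[0]
    if voiced.numel() == 0:
        return wav  # no speech detected; leave the recording unchanged
    hop = int(hop_s * sample_rate)
    margin = int(margin_s * sample_rate)
    start = max(int(voiced[0]) * hop - margin, 0)
    end = min((int(voiced[-1]) + 1) * hop + margin, wav.shape[-1])
    return wav[..., start:end]

if __name__ == "__main__":
    # Toy example: 1 s of low-level noise, with the VAD (assumed) flagging
    # frames 30-69 (0.30-0.70 s) as speech.
    sr = 16000
    wav = 0.01 * torch.randn(1, sr)
    probs = torch.zeros(100)
    probs[30:70] = 1.0
    trimmed = trim_silence(wav, probs, sample_rate=sr)
    print(wav.shape, "->", trimmed.shape)  # (1, 16000) -> (1, 8000)
```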
There is a Python notebook with code for each of these stages in the ./notebooks/ directory.
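As an illustration of the filtering stage, the sketch below transcribes a generated clip and keeps it only if the transcription reduces to the target keyword. The Whisper checkpoint, the exact matching rule, and the file paths are assumptions for illustration; the notebooks define the actual STT models and filtering criteria.

```python
import re
from transformers import pipeline

# The checkpoint is an illustrative choice; the paper uses several STT models.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

def keyword_matches(audio_path: str, keyword: str) -> bool:
    """Transcribe a generated clip and check whether the transcription
    reduces to exactly the target keyword (a simple proxy for a
    recognition-based filtering criterion)."""
    text = asr(audio_path, generate_kwargs={"language": "russian"})["text"]
    normalized = " ".join(re.findall(r"\w+", text.lower()))
    return normalized == keyword.lower()

# Illustrative usage with hypothetical paths/keywords:
# keep_clip = keyword_matches("trimmed/vpered_speaker_001.wav", "вперёд")
```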
Table 1. The statistics of the collected Synth-ruSC dataset ("good" samples group).
Additionally, 896 audio recordings (the "real" column) were collected from 23 people (13 men and 10 women). Each participant pronounced each of the 39 keywords once, giving 23 × 39 = 897 recordings; one recording of the word "over" by one man was pronounced incorrectly and was removed from the test set, leaving 896.
The main requirements are as follows:
- Python 3.9+
- torch==2.3.0
- transformers==4.41.1
- TTS
- sgvad
- nemo
- vosk-transcriber
All libraries used in the project are listed in the requirements.txt file.
Note: the STT libraries may conflict with each other, so you may want to create a separate virtual environment for each of them.
If you have found our results helpful in your work, feel free to cite our publication and this repository as follows:
Coming soon.
Thanks to @naumov-al for adding the code and materials.