Synthetic dataset to solve the KWS problem in Russian

sag111/Synth-ruSC

Synth-ruSC

[Figure: fig_1]

Synthetic dataset to solve the keyword spotting (KWS) problem in Russian.

This repository provides the dataset and additional materials for the paper "Synth-ruSC: Construction and Validation of Synthetic Dataset to Solve the Problem of Keyword Spotting in Russian" (submitted for review).

The Synth-ruSC dataset is available for download at this link.

Required external libs

Some additional materials are necessary to work with the repository:

  1. The Speech Commands dataset version 0.02 from Google (GSC-v2). We used scripts from NeMo to download the Google Speech Commands dataset and create manifests. For more details, see this link;
  2. A Voice Activity Detection (VAD) model: SG-VAD;
  3. A model for generating audio: xtts-v2;
  4. A few Speech-To-Text (STT) models:

See "Requirements" for the remaining Python libraries.
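NeMo manifests are JSON-lines files, with one JSON object per audio file. The sketch below shows one way to inspect such a manifest and count recordings per keyword. It assumes the label is stored under the key `command`; that key name and the `train_manifest.json` path in the usage note are assumptions, so check them against the manifest you actually generated.

```python
import json
from collections import Counter
from pathlib import Path


def read_manifest(path):
    """Read a JSON-lines manifest: one JSON object per non-empty line."""
    entries = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries


def count_labels(entries, label_key="command"):
    """Count entries per label. The default label_key is an assumption."""
    return Counter(entry[label_key] for entry in entries)
```

For example, `count_labels(read_manifest("train_manifest.json"))` returns a `Counter` that maps each keyword to its number of recordings.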

Pipeline

The process included the following main stages:

  1. Generation of synthetic audio recordings using a Text-to-Speech (TTS) model;
  2. Trimming of the generated audio recordings using a Voice Activity Detection (VAD) model;
  3. Filtering out recordings with faulty generation, artifacts, or low audio quality using Speech-to-Text (STT) models.
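The three stages above can be wired together as in the following minimal sketch. It is not the code from the notebooks: `synthesize`, `trim`, and `accept` are hypothetical callables standing in for the TTS model, the VAD-based trimming, and the STT-based filter, and audio is represented as a plain list of floats for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

Audio = List[float]  # mono samples; a stand-in for a real waveform array


@dataclass
class Sample:
    word: str
    audio: Audio


def build_dataset(
    words: Iterable[str],
    synthesize: Callable[[str], Audio],   # stage 1: TTS (hypothetical wrapper)
    trim: Callable[[Audio], Audio],       # stage 2: VAD-based trimming
    accept: Callable[[str, Audio], bool], # stage 3: STT-based filtering
) -> List[Sample]:
    """Run generate -> trim -> filter and return the samples that pass."""
    kept = []
    for word in words:
        audio = trim(synthesize(word))
        if not audio:
            # Nothing left after trimming: treat as a failed generation.
            continue
        if accept(word, audio):
            kept.append(Sample(word, audio))
    return kept
```

Passing the stages in as callables keeps the skeleton independent of any particular TTS, VAD, or STT library, which also helps when those libraries live in separate environments.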

The ./notebooks/ directory contains a Python notebook with the code for each stage.
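For the filtering stage, one possible acceptance rule is to compare normalized STT transcripts with the target word and require a minimum number of matching models. The sketch below illustrates that idea only; the normalization steps, the `min_votes` parameter, and the default of requiring all models to agree are assumptions, not the rule used in the paper.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, fold 'ё' to 'е', and drop punctuation and extra spaces."""
    text = unicodedata.normalize("NFC", text).lower().replace("ё", "е")
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def passes_filter(target: str, transcripts, min_votes=None) -> bool:
    """Return True if enough STT transcripts match the target word.

    min_votes=None means every transcript must match (an assumed default).
    """
    transcripts = list(transcripts)
    if not transcripts:
        return False
    wanted = normalize(target)
    votes = sum(normalize(t) == wanted for t in transcripts)
    needed = len(transcripts) if min_votes is None else min_votes
    return votes >= needed
```

A call such as `passes_filter("вперёд", ["Вперед.", "вперёд"])` accepts the sample, while lowering `min_votes` trades precision for a larger retained set.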

Dataset Statistics

Table 1. Statistics of the collected Synth-ruSC dataset ("good" samples group). [Table image: tab_1]

Additionally, 896 real audio recordings (the "real" column) were collected from 23 people (13 men and 10 women). Each participant spoke each of the 39 words once, which gives 23 × 39 = 897 recordings; one recording of the word "over" by one male speaker was pronounced incorrectly and was removed from the test set, leaving 896.

Requirements

The main requirements are as follows:

All libraries used in the project are listed in the requirements.txt file.

Note: The STT libraries may conflict with each other, so consider creating a separate virtual environment for each of them.
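One way to set up such isolated environments is with the standard-library `venv` module, as in the sketch below. The environment names in the usage note are hypothetical placeholders; which STT library goes into which environment is up to you.

```python
import venv
from pathlib import Path


def create_envs(base_dir, names, with_pip=True):
    """Create one virtual environment per name under base_dir."""
    base = Path(base_dir)
    created = []
    for name in names:
        env_dir = base / name
        # venv.create builds the environment directory, including pyvenv.cfg.
        venv.create(str(env_dir), with_pip=with_pip)
        created.append(env_dir)
    return created
```

For example, `create_envs("envs", ["stt_a", "stt_b"])` creates `envs/stt_a` and `envs/stt_b`; each STT library can then be installed with the `pip` from its own environment.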

Citing & Authors

If you have found our results helpful in your work, feel free to cite our publication and this repository as

coming soon

Contributions

Thanks to @naumov-al for adding the code and materials.
