Data Prep

This repository has scripts to extract data from various sources for ASR and speaker diarization.

It was created to normalize the text by RÚV but it should be easy to use it to normalize other text as well, given that each segment has a segmentID

Rúv Data Preparation

Extract subtitle and video files from the RÚV api

Cleaning subtitle files and converting video to audio only

Given a subtitle file, segments and text files for ASR can be created. The script, create_segments_and_text.py assumes that the subtitle file is in a directory named after the show. The segments and text files will appear in in the existing local data directory.

Currently this repository works with Python 3.

888 is the name for the RÚV teletext subtitles. It is what needs to be pressed to make the subtitles appear on a TV screen. Expanded 888 are subtitles which have been text normalized to remove capitalizations, numbers(digits), and punctuation.

Template versions of the json files have been uploaded to the json folder. All the values in root_urls.json need to be filled out to extract anything from the api.

After the subtitle files are cleaned, they can be compared to automatic speech recognition(ASR) output using compare_hypothesis_and_expanded_888.sh To use, compare_hypothesis_and_expanded_888.sh, you need to set up kaldi. Then, within path.sh, change /data/kaldi to the path of your version of kaldi. Both the ASR transcript and the expanded 888 need to have reference ids at the beginning of every line.

Usage

python3 extract_from_ruv_api.py --url-file json/root_urls.json --shows-file json/all_shows.json

python3 create_segments_and_text.py --subtitle-file /absolute/path/for/webvtt/file

./compare_hypothesis_and_expanded_888.sh -h

./compare_hypothesis_and_expanded_888.sh data/ASR/transcript.txt data/ref/expanded_no_punct.txt

ELAN files extraction

ELAN is a professional tool for creating annotations for audio or video. Extract data from ELAN files. ELAN is a very populator annotation tool for conversational annotators. It works on top of PRAAT.

.psfx

files contains the speakerIDS can use this to create reco2num_spk

.eaf and .eaf001

contain the segments/time_slots, speakerIDs, and transcriptions along with annotations

.wav

contains the audio of the corresponding conversation

Credits

Developer

Judy Fong - [email protected]

This is part of the Language Technology Program by The Icelandic Government through Almannaromur.

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
elan		elan
ruv		ruv
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data Prep

Rúv Data Preparation

Usage

ELAN files extraction

.psfx

.eaf and .eaf001

.wav

Credits

About

Releases

Packages

Languages

License

cadia-lvl/broadcast_data_prep

Folders and files

Latest commit

History

Repository files navigation

Data Prep

Rúv Data Preparation

Usage

ELAN files extraction

.psfx

.eaf and .eaf001

.wav

Credits

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages