arabic-speech-to-text

This repository contains code for training the QuartzNet ASR model from NVIDIA NeMo on the QCRI-AL Jazeera Corpus.

Data preprocessing

Download the QCRI-AL Jazeera Corpus, then run the preprocessing scripts in order. The script a_preprocess_xml.py extracts the text segments from the XML files. The script b_filter_ds.py removes segments that contain Latin script or numerals. The script c_split_ds.py splits the remaining segments into a training set and a test set.
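The filtering and splitting steps can be illustrated with a minimal sketch. This is not the code from b_filter_ds.py or c_split_ds.py. The function names, the 10% test fraction, and the fixed seed are hypothetical. The sketch also assumes that "numerals" covers both ASCII digits and Arabic-Indic digits, and that empty segments are dropped as well.

```python
import random
import re

# Assumption: "Latin script or numerals" means ASCII letters, ASCII digits,
# and Arabic-Indic digits (U+0660 to U+0669). The real scripts may differ.
_REJECT = re.compile(r"[A-Za-z0-9\u0660-\u0669]")


def keep_segment(text: str) -> bool:
    """Return True for a non-empty segment with no Latin letters or digits."""
    return bool(text.strip()) and _REJECT.search(text) is None


def filter_segments(segments):
    """Keep only the segments accepted by keep_segment, preserving order."""
    return [s for s in segments if keep_segment(s)]


def split_segments(segments, test_fraction=0.1, seed=0):
    """Shuffle deterministically and return (train, test) lists.

    test_fraction and seed are illustrative defaults, not values
    taken from the repository.
    """
    shuffled = list(segments)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    raw = ["مرحبا بكم", "hello world", "عام 2020", "الجزيرة"]
    clean = filter_segments(raw)
    train, test = split_segments(clean, test_fraction=0.5)
    print(len(clean), len(train), len(test))
```

Seeding the shuffle keeps the train/test split reproducible between runs, which helps when comparing models trained at different times.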

TODO

  • Upload pretrained model
  • ...