Data Form of the MaLa-ASR #130

zsLin177 · 2024-08-28T02:11:53Z

System Info

torch 2.1

Information

The official example scripts
My own modified scripts

🐛 Describe the bug

bash decode_MaLa-ASR_withkeywords_L95.sh

Hi, I'm currently working on reproducing the results of MaLa-ASR and have downloaded the slidespeech dataset from https://www.openslr.org/144/. While running the provided decoding script, I noticed that it requires the file located at /nfs/yangguanrou.ygr/slidespeech/${split}_oracle_v1/. Could you please clarify what the format of this file is? Do I need to preprocess the downloaded data in any specific way, such as splitting the audio based on timestamps?

Error logs

no file named test_oracle_v1

Expected behavior

Could you please provide the steps for data processing and explain the format of the data? Thanks, looking forward to your reply.

yanghaha0908 · 2024-09-14T09:38:30Z

The location of the SlideSpeech dataset is set in the config file "mala_asr_config.py".
You can change "/nfs/yangguanrou.ygr/slidespeech/${split}_oracle_v1/" to your own path.

The dataset directory must contain four files: "my_wav.scp", "utt2num_samples", "text", and "hot_related/ocr_1gram_top50_mmr070_hotwords_list".

"my_wav.scp" lists the audio paths, one utterance per line. We converted the wav files to ark format, so the file looks like:
ID1 xxx/slidespeech/dev_oracle_v1/data/format.1/data_wav.ark:22
ID2 xxx/slidespeech/dev_oracle_v1/data/format.1/data_wav.ark:90445
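Each line above can be read as an utterance ID followed by a location. A minimal sketch of parsing such a line, assuming the usual Kaldi-style convention in which the number after the final colon is a byte offset into the ark file (the thread does not state this explicitly, and `parse_scp_line` is a hypothetical helper name):

```python
def parse_scp_line(line):
    """Split one scp line into (utterance ID, ark path, offset).

    Assumption: the entry has the form '<utt_id> <ark_path>:<offset>',
    where <offset> is an integer position inside the ark file.
    """
    utt_id, location = line.split(maxsplit=1)
    # Split on the LAST colon so that colons inside the path are preserved.
    ark_path, _, offset = location.strip().rpartition(":")
    return utt_id, ark_path, int(offset)


entry = parse_scp_line("ID1 xxx/slidespeech/dev_oracle_v1/data/format.1/data_wav.ark:22")
```

Here `entry` holds the ID, the ark path, and the integer offset as three separate fields.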

To generate this file, download the audio wavs from https://www.openslr.org/144/ and get the time segments from https://slidespeech.github.io/. That site provides the segments, transcription text, and OCR results at https://speech-lab-share-data.oss-cn-shanghai.aliyuncs.com/SlideSpeech/related_files.tar.gz (~1.37GB). You need to cut each wav into utterances using the timestamps in the segments file.
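The segmentation step could be sketched as follows. This is only an illustration using Python's standard `wave` module: it assumes the segments file uses Kaldi-style lines of the form `<utt_id> <recording_id> <start_sec> <end_sec>` and that the recordings are uncompressed PCM wav files. Neither is confirmed in this thread, and `parse_segments` and `cut_segment` are hypothetical helper names. It writes plain wav files and does not cover the conversion to ark.

```python
import wave


def parse_segments(lines):
    """Parse segment lines, assumed to be '<utt_id> <rec_id> <start_sec> <end_sec>'."""
    segments = []
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        utt_id, rec_id, start, end = parts
        segments.append((utt_id, rec_id, float(start), float(end)))
    return segments


def cut_segment(src_path, dst_path, start_sec, end_sec):
    """Write the [start_sec, end_sec) span of a PCM wav to dst_path.

    Returns the number of frames written.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        # Convert times to frame indices and clamp them to the file length.
        start_frame = min(int(round(start_sec * rate)), total)
        end_frame = min(int(round(end_sec * rate)), total)
        n_frames = max(end_frame - start_frame, 0)
        src.setpos(start_frame)
        frames = src.readframes(n_frames)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)  # the frame count in the header is fixed up on close
        dst.writeframes(frames)
    return n_frames
```

A typical loop would call `parse_segments` on the lines of the segments file and then `cut_segment` once per utterance, mapping each recording ID to its wav path.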

related_files.tar.gz also provides "text" and a file named "keywords". The "keywords" file corresponds to "hot_related/ocr_1gram_top50_mmr070_hotwords_list", which contains the hotword list.

"utt2num_samples" contains the length of each wav in samples, and looks like:
ID1 103680
ID2 181600
...
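One way to produce such a file from already segmented wav files is sketched below. It assumes mono PCM wavs, so that the frame count reported by the standard `wave` module equals the number of samples. `count_samples` and `utt2num_samples_lines` are hypothetical helper names, not part of the MaLa-ASR code.

```python
import wave


def count_samples(wav_path):
    """Return the number of frames in a PCM wav file (equal to samples for mono audio)."""
    with wave.open(wav_path, "rb") as wav:
        return wav.getnframes()


def utt2num_samples_lines(utt_to_wav):
    """Build '<utt_id> <num_samples>' lines from a dict of utterance ID -> wav path."""
    return [
        f"{utt_id} {count_samples(path)}"
        for utt_id, path in sorted(utt_to_wav.items())
    ]
```

Writing the returned lines to a file, one per line, gives the two-column layout shown above.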

Sorry for the late reply; I have been busy lately. I hope your reproduction goes well!

nuaalixu · 2024-10-09T03:39:27Z

@yanghaha0908 Thank you for your answer. It is strongly recommended that this answer be written into the mala README file.

yanghaha0908 · 2024-11-08T12:07:05Z

I have added it to the README.md file of Mala-ASR, refer to #168.

ddlBoJack assigned yanghaha0908 Aug 28, 2024

nuaalixu mentioned this issue Nov 5, 2024

fix #130 update data preparation guidance for mala_asr #166

Closed

yanghaha0908 closed this as completed Nov 8, 2024
