
Merge japanese-to-english multilingual branch #1860

Merged: 36 commits from the einichi branch into k2-fsa:master on Feb 3, 2025
Conversation

@baileyeet
Contributor

No description provided.

@baileyeet changed the title from "Merge multilingual branch" to "Merge japanese-to-english multilingual branch" on Jan 7, 2025
@baileyeet marked this pull request as draft on January 7, 2025 at 05:41

```shell
./zipformer/streaming_decode.py \
  --epoch 28 \
```

Collaborator

Since ReazonSpeech is a large dataset, I suggest that you replace --epoch with --iter. See also RESULTS.md from our GigaSpeech recipe; you can find example usages of --iter there.
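For reference, a minimal hedged sketch of the suggested --iter usage (the iteration number, --avg value, and --exp-dir path below are illustrative placeholders, not values from this recipe; --iter points decoding at a checkpoint saved as checkpoint-&lt;iteration&gt;.pt rather than epoch-&lt;epoch&gt;.pt):

```shell
# Illustrative placeholders only; see the GigaSpeech recipe's RESULTS.md for
# real --iter/--avg combinations. This decodes from checkpoint-468000.pt,
# averaging over the last 16 saved iteration checkpoints.
./zipformer/streaming_decode.py \
  --iter 468000 \
  --avg 16 \
  --exp-dir zipformer/exp
```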

@baileyeet
Contributor Author

Regarding the "Setup Python 3.10.15" CI failure: how do I resolve it? I didn't change anything related to Python.

@csukuangfj
Collaborator

> Regarding the "Setup Python 3.10.15" CI failure: how do I resolve it? I didn't change anything related to Python.

Please change

`python-version: [3.10.15]`

to

`python-version: [3.10]`

That is, change 3.10.15 to 3.10. It is an issue with GitHub Actions and is not related to your PR.

@baileyeet
Contributor Author

> Please change `python-version: [3.10.15]` to `python-version: [3.10]`. It is an issue with GitHub Actions and is not related to your PR.

Looks like this hasn't resolved the issue.

@csukuangfj
Collaborator

Please use `"3.10"`, not `3.10`.
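As an aside, the quoting matters because YAML parses an unquoted 3.10 as the float 3.1, so setup-python would look for Python 3.1. A minimal sketch of the corrected matrix entry (the surrounding strategy/matrix keys are assumed for illustration, not copied from this repository's workflow file):

```yaml
# Sketch only: surrounding keys are assumed. Quoting keeps "3.10" a string;
# unquoted, YAML reads 3.10 as the float 3.1.
strategy:
  matrix:
    python-version: ["3.10"]
```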

@baileyeet marked this pull request as ready for review on January 14, 2025 at 22:41
@csukuangfj
Collaborator

@JinZr Can you review this?

@JinZr
Collaborator

JinZr commented Jan 21, 2025 via email

@JinZr
Collaborator

JinZr left a review:

I've completed reviewing this PR, and it looks great overall!

There are also some unnecessary changes to other dependencies that might need to be addressed before merging.

icefall/utils.py (Outdated)

@@ -644,7 +644,8 @@ def write_error_stats(
         results[i] = (cut_id, ref, hyp)

     for cut_id, ref, hyp in results:
-        ali = kaldialign.align(ref, hyp, ERR, sclite_mode=sclite_mode)
+        # ali = kaldialign.align(ref, hyp, ERR, sclite_mode=sclite_mode)
+        ali = kaldialign.align(ref, hyp, ERR)

Collaborator

Could you provide some context on why the `sclite_mode` argument was removed?

Collaborator

Please avoid unnecessary modifications to built-in files.

Collaborator

Please avoid unnecessary modifications to built-in files, thanks.

@JinZr
Collaborator

JinZr left a review:

Thanks!

I left a few commits to remove changes applied to built-in scripts; I think the PR is ready to be merged now.

@baileyeet
Contributor Author

> I left a few commits to remove changes applied to built-in scripts; I think the PR is ready to be merged now.

Great, thank you!

@baileyeet
Contributor Author

To clarify, there's no further action needed on my end, right?

@JinZr
Collaborator

JinZr commented Feb 3, 2025 via email

@baileyeet
Contributor Author

> sure! pls feel free to merge it

Thanks, I don't have write access to merge, so I will kindly wait :) Thanks for the review.

@JinZr
Collaborator

JinZr commented Feb 3, 2025

Sorry, I didn't realize that; I'll do the operation now.

@JinZr merged commit 0855b03 into k2-fsa:master on Feb 3, 2025
131 of 213 checks passed
@baileyeet deleted the einichi branch on February 3, 2025 at 21:11
@baileyeet
Contributor Author

I am investigating improving the English accuracy of the recently merged multi_ja_en model and saw that both the LibriSpeech and multi_zh_en models use 3x speed perturbation. Is there somewhere I can access that version of the LibriSpeech data? Also, did multi_zh_en ever look into using a GigaSpeech model for the English side?

@JinZr
Collaborator

JinZr commented Feb 27, 2025 via email

@kinanmartin

Hello, I'm working together with @baileyeet on improving the English accuracy of this model.

We are considering training a version of this model using a larger English dataset. One dataset we are considering is the People's Speech corpus. I see there was a pull request by @yfyeung that added support for data loading of this dataset in the past, but it seems it was removed because of poor training results. Here are the relevant PRs: #1101 #1778 .

Can anyone vouch for the quality of the People's Speech corpus, or explain why it might not be a suitable dataset? It seems that it has not yet been evaluated with results.

As an alternative, we are considering using GigaSpeech, but it has less data, and it seemingly has more restrictive licensing. Would this be a better choice? Are there any other English labeled datasets that would be better, apart from these two?

Thank you for any responses!

@yfyeung
Collaborator

yfyeung commented Feb 28, 2025

> Can anyone vouch for the quality of the People's Speech corpus, or explain why it might not be a suitable dataset?

Hi, according to the paper on the People's Speech corpus, models trained on it report the following WER (Word Error Rate) on LibriSpeech:

dev-clean: 9.93%
dev-other: 25.53%
test-clean: 9.98%
test-other: 26.91%
