
Merge japanese-to-english multilingual branch #1860

Merged: 36 commits from the einichi branch into k2-fsa:master on Feb 3, 2025
Conversation

@baileyeet
Contributor

No description provided.

@baileyeet changed the title from "Merge multilingual branch" to "Merge japanese-to-english multilingual branch" on Jan 7, 2025
@baileyeet marked this pull request as draft on January 7, 2025 at 05:41

```shell
./zipformer/streaming_decode.py \
  --epoch 28 \
```

Collaborator

Since ReazonSpeech is a large dataset, I suggest that you replace --epoch with --iter. See also RESULTS.md from our GigaSpeech recipe; you can find example usages of --iter there.
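For reference, a minimal hedged sketch of the suggested --iter usage (the iteration number, --avg value, and --exp-dir path below are illustrative placeholders, not values from this recipe; --iter points decoding at a checkpoint saved as checkpoint-&lt;iteration&gt;.pt rather than epoch-&lt;epoch&gt;.pt):

```shell
# Illustrative placeholders only; see the GigaSpeech recipe's RESULTS.md for
# real --iter/--avg combinations. This decodes from checkpoint-468000.pt,
# averaging over the last 16 saved iteration checkpoints.
./zipformer/streaming_decode.py \
  --iter 468000 \
  --avg 16 \
  --exp-dir zipformer/exp
```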

@baileyeet
Contributor Author

Regarding the "Setup Python 3.10.15" CI failure: how do I resolve it? I didn't change anything related to Python.

@csukuangfj
Collaborator

> Regarding the "Setup Python 3.10.15" CI failure: how do I resolve it? I didn't change anything related to Python.

Please change

`python-version: [3.10.15]`

to

`python-version: [3.10]`

That is, change 3.10.15 to 3.10. It is an issue with GitHub Actions and is not related to your PR.

@baileyeet
Contributor Author

> Please change `python-version: [3.10.15]` to `python-version: [3.10]`. It is an issue with GitHub Actions and is not related to your PR.

Looks like this hasn't resolved the issue.

@csukuangfj
Collaborator

Please use `"3.10"`, not `3.10`.
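As an aside, the quoting matters because YAML parses an unquoted 3.10 as the float 3.1, so setup-python would look for Python 3.1. A minimal sketch of the corrected matrix entry (the surrounding strategy/matrix keys are assumed for illustration, not copied from this repository's workflow file):

```yaml
# Sketch only: surrounding keys are assumed. Quoting keeps "3.10" a string;
# unquoted, YAML reads 3.10 as the float 3.1.
strategy:
  matrix:
    python-version: ["3.10"]
```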

@baileyeet marked this pull request as ready for review on January 14, 2025 at 22:41
@csukuangfj
Collaborator

@JinZr Can you review this?

@JinZr
Collaborator

JinZr commented Jan 21, 2025 via email

@JinZr
Collaborator

JinZr left a review:

I've completed reviewing this PR, and it looks great overall!

There are also some unnecessary changes to other dependencies that might need to be addressed before merging.

icefall/utils.py (Outdated)

@@ -644,7 +644,8 @@ def write_error_stats(
         results[i] = (cut_id, ref, hyp)

     for cut_id, ref, hyp in results:
-        ali = kaldialign.align(ref, hyp, ERR, sclite_mode=sclite_mode)
+        # ali = kaldialign.align(ref, hyp, ERR, sclite_mode=sclite_mode)
+        ali = kaldialign.align(ref, hyp, ERR)

Collaborator

Could you provide some context on why the `sclite_mode` argument was removed?

Collaborator

Please avoid unnecessary modifications to built-in files.

Collaborator

Please avoid unnecessary modifications to built-in files, thanks.

@JinZr
Collaborator

JinZr left a review:

Thanks!

I left a few commits to remove changes applied to built-in scripts; I think the PR is ready to be merged now.

@baileyeet
Contributor Author

> I left a few commits to remove changes applied to built-in scripts; I think the PR is ready to be merged now.

Great, thank you!

@baileyeet
Contributor Author

To clarify, there's no further action needed on my end, right?

@JinZr
Collaborator

JinZr commented Feb 3, 2025 via email

@baileyeet
Contributor Author

> sure! pls feel free to merge it

Thanks, I don't have write access to merge, so I will kindly wait :) Thanks for the review.

@JinZr
Collaborator

JinZr commented Feb 3, 2025

Sorry, I didn't realize that; I'll do the operation now.

@JinZr merged commit 0855b03 into k2-fsa:master on Feb 3, 2025
131 of 213 checks passed
@baileyeet deleted the einichi branch on February 3, 2025 at 21:11
@baileyeet
Contributor Author

I am investigating improving the English accuracy of the recently merged multi_ja_en model and saw that both the LibriSpeech and multi_zh_en models use 3x speed perturbation. Is there somewhere I can access that version of the LibriSpeech data? Also, did multi_zh_en ever look into using a GigaSpeech model for the English side?

@JinZr
Collaborator

JinZr commented Feb 27, 2025 via email

@kinanmartin

Hello, I'm working together with @baileyeet on improving the English accuracy of this model.

We are considering training a version of this model using a larger English dataset. One dataset we are considering is the People's Speech corpus. I see there was a pull request by @yfyeung that added support for data loading of this dataset in the past, but it seems it was removed because of poor training results. Here are the relevant PRs: #1101 #1778 .

Can anyone vouch for the quality of the People's Speech corpus, or explain why it might not be a suitable dataset? It seems that it has not yet been evaluated with results.

As an alternative, we are considering using GigaSpeech, but it has less data, and it seemingly has more restrictive licensing. Would this be a better choice? Are there any other English labeled datasets that would be better, apart from these two?

Thank you for any responses!

@yfyeung
Collaborator

yfyeung commented Feb 28, 2025

> Can anyone vouch for the quality of the People's Speech corpus, or explain why it might not be a suitable dataset?

Hi, according to the paper on the People's Speech corpus, models trained on it report the following WER (Word Error Rate) on LibriSpeech:

dev-clean: 9.93%
dev-other: 25.53%
test-clean: 9.98%
test-other: 26.91%
