
Regarding ASR testing #40

Open
Simplesss opened this issue Aug 28, 2024 · 2 comments
Comments

@Simplesss

Hello, thank you very much for your work. I would like to reproduce the ASR performance of the AnyGPT base model on LibriSpeech test-clean. Your paper reports a WER of 8.5, but my test result was 14.5 (using the command format speech | text | {speech file path}). Could this gap be caused by the prompt being randomly selected for each ASR inference? If possible, could you share the code used to compute WER (I used a Compose of seven transforms from jiwer), as well as the text transcriptions produced by the model? Looking forward to your reply.

@JunZhan2000
Collaborator

JunZhan2000 commented Sep 30, 2024

Hello, I don't think it's an issue with the prompt; each prompt was seen many times during training.
I would like to confirm two things. First, are you using beam search as your decoding strategy? It generally produces the best results. Second, you need to post-process the transcriptions to standardize them, because the LLM's output format differs considerably from the ground truth, including punctuation and contractions such as "you're", which appears as "you are" in the ground truth.
I also use jiwer for calculating WER.
Regarding the test code, unfortunately it was lost during an environment migration, but I believe that if you use GPT to write some standardization code, you should be able to reach the results reported in the paper (I didn't handle all of the standardization cases).
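
For reference, here is a minimal sketch of the kind of normalization-plus-WER computation described above, built from standard jiwer transforms. The exact set of transforms behind the paper's numbers is not known, so treat this only as a starting point:

```python
import jiwer

# One possible normalization pipeline (all names are standard jiwer transforms);
# the precise transforms used for the paper's WER are unknown.
normalize = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.ExpandCommonEnglishContractions(),  # e.g. "you're" -> "you are"
    jiwer.RemovePunctuation(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
])

refs = ["YOU ARE WELCOME HERE"]   # LibriSpeech ground truth (uppercase, no punctuation)
hyps = ["You're welcome here."]   # raw LLM output

# Apply the same normalization to both sides before scoring.
wer = jiwer.wer(
    [normalize(r) for r in refs],
    [normalize(h) for h in hyps],
)
print(f"WER: {wer:.3f}")
```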

@Changhao-Xiang


Hello, have you managed to reproduce the results? My reproduced performance on LibriSpeech test-clean is also a WER of around 15 with the following generation config:

```json
{
    "do_sample": false,
    "max_new_tokens": 100,
    "min_new_tokens": 1,
    "repetition_penalty": 1.0,
    "num_beams": 5
}
```
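
For what it's worth, this is a minimal sketch of how such a config maps onto Hugging Face `generate()`. The model path and prompt below are placeholders, not the actual AnyGPT inference code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/AnyGPT-base"   # placeholder; use the actual checkpoint path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# Placeholder prompt: the real one is the speech-token ASR instruction
# built according to the repo's prompt format.
prompt = "..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(
    **inputs,
    do_sample=False,        # no sampling
    num_beams=5,            # beam search, as recommended above
    max_new_tokens=100,
    min_new_tokens=1,
    repetition_penalty=1.0,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```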
