Hello! Thanks for your wonderful work. I'm trying to reproduce your results on the TTS task, and I'm wondering if you could provide more details about the evaluation, especially:
How many samples from the VCTK dataset are used, and which ones?
Which ASR model is used to transcribe the generated speech?
How is the WER calculated, and what text normalization is applied before the calculation? (My current pipeline is sketched below.)
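For context, here is the WER pipeline I'm currently using while trying to reproduce the numbers. The choice of Whisper large-v2 as the ASR model, its English text normalizer, and jiwer for scoring are all my own assumptions, not details taken from the paper; please let me know where your setup differs.

```python
# Hypothetical reproduction of the WER evaluation; the ASR model,
# normalizer, and scoring library are my assumptions, not the paper's.
import whisper
import jiwer
from whisper.normalizers import EnglishTextNormalizer

asr_model = whisper.load_model("large-v2")  # assumed ASR model
normalizer = EnglishTextNormalizer()        # assumed normalization

def compute_wer(wav_paths, reference_texts):
    """Transcribe generated speech and score it against the input text."""
    hyps, refs = [], []
    for wav_path, ref_text in zip(wav_paths, reference_texts):
        hyp_text = asr_model.transcribe(wav_path)["text"]
        hyps.append(normalizer(hyp_text))
        refs.append(normalizer(ref_text))
    # Corpus-level WER over all samples; another assumption, since the
    # paper might instead average per-utterance WER.
    return jiwer.wer(refs, hyps)
```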
Thanks!
Hi, I have a few questions about zero-shot TTS evaluation using the VCTK dataset.
1. Evaluation Methodology:
> We randomly select a 3-second clip from each speaker as the vocal prompt along with a separate text as input.

The description in the paper, particularly in Appendix C, seems a bit open to interpretation. Could you please describe in detail how the evaluation was conducted?
Additionally, did you use all audio files from the VCTK dataset for the evaluation, and did you truncate clips longer than 3 seconds? (I've sketched my current interpretation of the prompt selection below.)
2. Dataset Usage:
I am curious about the specific version of the VCTK dataset used in your study. Did you only utilize audio files from the "mic2" recordings in the VCTK 0.92 version?
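For reference, this is how I'm currently interpreting the prompt-selection step. The use of the mic2 recordings, librosa for loading, and a uniformly random 3-second crop are all assumptions on my part, not something stated in the paper; corrections are very welcome.

```python
# Sketch of my current interpretation: pick a random 3-second clip per
# speaker as the vocal prompt. File layout and crop strategy are assumed.
import random
import librosa
import soundfile as sf

def extract_prompt(wav_path, out_path, prompt_seconds=3.0, seed=0):
    """Crop a random 3-second segment from a VCTK recording."""
    wav, sr = librosa.load(wav_path, sr=None)  # keep native sample rate
    prompt_len = int(prompt_seconds * sr)
    if len(wav) <= prompt_len:
        clip = wav  # shorter recordings are kept whole (assumption)
    else:
        random.seed(seed)
        start = random.randint(0, len(wav) - prompt_len)
        clip = wav[start:start + prompt_len]
    sf.write(out_path, clip, sr)
```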