Coqui consistency #225
Comments
Temperature problem? The default is 0.85; try something lower.
Hi Nenesh, I tried to decrease the temperature to the minimum (0.01), but when I do that only half of the first word of the sentence is pronounced, and I have no idea why.
Maybe something more in the middle, like 0.45. I have no such problem at 0.65.
I have tried different temperature values (0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15). Down to 0.35, the pronunciation is not consistent across repetitions of the same sentence. Below that value, the sentence still has different pronunciations and, furthermore, for a sentence like "Hello! What's your name? My name is Bob.", the "Hello" is pronounced (always with a different tone inflection), then there is a long pause (even 15 to 20 seconds) in which "What's your name?" is not pronounced, and then "My name is Bob" is pronounced. I am quite perplexed by this behavior.
The XTTS v2 model is based on a GPT-2 decoder transformer. The varying intonation comes from the model's stochastic nature, very much like LLMs that generate different outputs for the same prompt. At very low temperatures sampling becomes more deterministic, but this can cause unnatural behavior, such as incomplete outputs (e.g., cut-off words) or long pauses, because the model struggles to transition fluently between tokens when diversity is too constrained. A clean reference with clear prosody becomes essential at lower temperatures, as the model relies more heavily on the reference for consistent outputs.
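To make the temperature effect concrete, here is a minimal sketch of temperature-scaled sampling in plain Python. It is not XTTS code: the function names and the three toy logits are made up for illustration, and the real model samples over a much larger token vocabulary with additional settings.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Hypothetical scores for three candidate tokens (illustration only).
toy_logits = [2.0, 1.5, 0.5]

for t in (0.85, 0.35, 0.01):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At 0.85 the three candidates all keep a noticeable share of the probability, so repeated runs can pick different tokens. At 0.01 nearly all of the mass sits on the top candidate, which is the "more deterministic" regime described above.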
@KoljaB, so if I understand correctly, if I wanted a more repeatable output I would have to use a low temperature and at the same time have a long enough .wav file to clone, where the voice is clear and clean, right? Do you have any tips or references that you've used in the past for this purpose?
For really consistent output, Coqui XTTS probably isn't the best fit. It's inherently stochastic, similar to how LLMs generate diverse responses at higher temperatures. Setting the temperature to 0 can make LLMs deterministic, but XTTS can still show variability due to its sensitivity to the reference voice (or input formatting). With reference voices, it's not so much about the length but more about the prosody. Make sure the voice sample is natural and free of weird pauses, especially at the start or end. Awkward timing in the middle can also throw things off; a natural rhythm makes it easier for the model to mimic consistently. If repeatability is critical, you might want to look at a different TTS model that's more deterministic. StyleTTS2, for example, as a diffusion model does much better in terms of consistency.
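As a rough illustration of the advice about pauses, the sketch below trims silence from the start and end of a reference clip and reports the longest quiet stretch left in the middle. It is a crude per-sample amplitude check on a list of floats in the range -1.0 to 1.0, not part of Coqui or RealtimeTTS. The function names, the 0.01 threshold, and the toy clip are assumptions, and loading the .wav file is left out.

```python
def trim_silence(samples, threshold=0.01):
    """Drop leading and trailing samples quieter than the threshold."""
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []  # the whole clip is below the threshold
    return samples[loud[0]:loud[-1] + 1]


def longest_pause(samples, sample_rate, threshold=0.01):
    """Return the longest run of quiet samples, in seconds."""
    longest = run = 0
    for s in samples:
        run = run + 1 if abs(s) < threshold else 0
        longest = max(longest, run)
    return longest / sample_rate


# Toy clip at a made-up rate of 10 samples per second:
# 0.5 s of silence, some speech, a short pause, more speech, trailing silence.
sample_rate = 10
clip = [0.0] * 5 + [0.5, -0.4] + [0.0] * 3 + [0.3] + [0.0] * 4

trimmed = trim_silence(clip)
print(len(clip), len(trimmed), longest_pause(trimmed, sample_rate))
```

A real check would more likely measure energy over short frames rather than single samples, but the idea is the same: no dead air at the edges, and no unusually long gap inside the sample.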
Thank you! Is StyleTTS2 compatible with the Coqui engine? Or is there some modification I need to take into account?
There is currently no official support for StyleTTS2 in RealtimeTTS, mostly because it requires EspeakNG, which is a pain to install on Windows. If you want to use it on Linux, or get StyleTTS2 running under Windows, I can send you the unreleased StyleEngine for RealtimeTTS though.
Thank you so much for your help. It would be great to have the engine for StyleTTS2 as well, so as to have a reference for diffusion models!
Please contact me at [email protected] and I'll send it to you.
Hello! I'm using the Coqui engine and have tried both the default voice and the voice cloning feature with a .wav file. However, I've noticed that when I repeatedly input the same sentence, the intonation or cadence varies slightly - for instance, the voice may sound more excited, tired, or serious. Why does this happen if the text and voice settings remain unchanged? Is there a way to ensure the text is always pronounced the same way?