Coqui consistency #225

blewis-11 · 2024-12-02T09:16:03Z

Hello! I'm using the Coqui engine and have tried both the default voice and the voice cloning feature with a .wav file. However, I've noticed that when I repeatedly input the same sentence, the intonation or cadence varies slightly - for instance, the voice may sound more excited, tired, or serious. Why does this happen if the text and voice settings remain unchanged? Is there a way to ensure the text is always pronounced the same way?

Nenesh · 2024-12-02T15:56:20Z

Temperature problem? The default is 0.85; try something lower.

blewis-11 · 2024-12-02T16:22:09Z

Hi Nenesh, I tried decreasing the temperature to the minimum (0.01), but when I do that, only half of the first word of the sentence is pronounced, and I have no idea why.

Nenesh · 2024-12-02T19:03:42Z

Maybe something more in the middle like 0.45. I have no such problem at 0.65.

blewis-11 · 2024-12-03T10:52:01Z

I have tried different temperature values (0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15). Down to 0.35, the pronunciation is not consistent across repetitions of the same sentence. Below that value, the pronunciation still varies, and there is a further problem. Take a sentence like "Hello! What's your name? My name is Bob.": the "Hello" is pronounced (always with a different inflection), then there is a long pause (as long as 15 to 20 seconds) during which "What's your name?" is not pronounced, and then "My name is Bob" is pronounced. I am quite perplexed by this behavior.

KoljaB · 2024-12-04T13:12:09Z

The XTTS v2 model is based on a GPT-2 decoder transformer. The varying intonation comes from the model's stochastic nature, very much like LLMs that generate different outputs for the same prompt.

At very low temperatures the model becomes more deterministic, but this can cause unnatural behavior, such as incomplete outputs (e.g., cutting off words) or long pauses, because the model struggles to transition fluently between tokens when diversity is too constrained. A clean reference with clear prosody becomes essential at lower temperatures, as the model relies more heavily on the reference for consistent outputs.
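To make the effect of temperature concrete, here is a minimal sketch of temperature-scaled sampling over a toy set of token scores. It is not XTTS code: the logits are made up, and `softmax_with_temperature` and `sample_token` are hypothetical helper names used only for illustration. It shows why the same input can give different outputs at a moderate temperature, and why very low values collapse the choice onto a single token.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one token index according to the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Made-up scores for four candidate tokens (an assumption, not model output).
toy_logits = [2.0, 1.5, 0.5, -1.0]

for t in (0.85, 0.35, 0.05):
    rng = random.Random(0)  # fixed seed so the demo itself is repeatable
    draws = [sample_token(toy_logits, t, rng) for _ in range(1000)]
    share = draws.count(0) / len(draws)
    print(f"temperature={t}: top token chosen {share:.0%} of the time")
```

At 0.85 the lower-scored tokens are still picked regularly, which is where the run-to-run variation comes from. Near 0 almost every draw is the top token, so the output is more repeatable but the model has little room to recover if that token leads somewhere awkward.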

blewis-11 · 2024-12-04T13:27:03Z

@KoljaB, so if I understand correctly, to get more repeatable output I would need a low temperature and also a long enough .wav file to clone, in which the voice is clear and clean, right? Do you have any tips or references that you've used in the past for this purpose?

KoljaB · 2024-12-04T13:45:07Z

For really consistent output Coqui XTTS probably isn’t the best fit. It’s inherently stochastic, similar to how LLMs generate diverse responses at higher temperatures. Setting the temperature to 0 can make LLMs deterministic, but XTTS can still have variability due to sensitivity to reference voice (or input formatting).

With reference voices, it’s not so much about the length but more the prosody. Make sure the voice sample is natural and free of weird pauses, especially at the start or end. If there’s awkward timing in the middle, that can throw things off; a natural rhythm makes it easier for the model to mimic consistently.
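As a rough illustration of cleaning up the start and end of a reference clip, the sketch below trims leading and trailing silence using only the Python standard library. It assumes a 16-bit mono PCM .wav file; `trim_silence`, `trim_wav`, the amplitude threshold of 500, and the 50 ms of padding are all illustrative choices, not values from RealtimeTTS or XTTS, and would need tuning per recording.

```python
import array
import sys
import wave


def trim_silence(samples, threshold=500, keep=0):
    """Drop leading/trailing samples quieter than `threshold`, keeping `keep` samples of padding."""
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return samples[:0]  # nothing above the threshold: return an empty clip
    start = max(loud[0] - keep, 0)
    end = min(loud[-1] + 1 + keep, len(samples))
    return samples[start:end]


def trim_wav(src_path, dst_path, threshold=500, keep_ms=50):
    """Write a copy of a 16-bit mono WAV with silence trimmed from both ends."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        if params.sampwidth != 2 or params.nchannels != 1:
            raise ValueError("this sketch only handles 16-bit mono PCM")
        samples = array.array("h", src.readframes(params.nframes))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    keep = params.framerate * keep_ms // 1000
    trimmed = trim_silence(samples, threshold, keep)
    if sys.byteorder == "big":
        trimmed.byteswap()
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(trimmed.tobytes())  # the frame count in the header is corrected on write
```

For example, `trim_wav("reference.wav", "reference_trimmed.wav")` (hypothetical file names) would write a trimmed copy to use as the cloning reference. This only handles the edges; awkward pauses in the middle of the clip still need to be fixed by hand or by re-recording.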

If repeatability is critical, you might want to look at a different TTS model that’s more deterministic. StyleTTS2, for example, as a diffusion model, does way better in terms of consistency.

blewis-11 · 2024-12-04T14:43:18Z

Thank you! Is StyleTTS2 compatible with the Coqui engine? Or is there some modification I need to take into account?

KoljaB · 2024-12-04T15:50:16Z

There is currently no official support for StyleTTS2 in RealtimeTTS, mostly because it requires eSpeak NG, which is a pain to install on Windows. If you want to use it on Linux, or can get StyleTTS2 running under Windows, I can send you the unreleased StyleEngine for RealtimeTTS though.

blewis-11 · 2024-12-05T09:02:23Z

Thank you so much for your help! It would be great to have the engine for StyleTTS2 as well, so as to have a reference for diffusion models.

KoljaB · 2024-12-05T10:32:46Z

Please contact me at [email protected] and I'll send it to you

samrainax mentioned this issue Dec 31, 2024

Voice Quality Issues On Linux (Potentially Due To eSpeak TTS Driver) #239

Open
