OpenVoice emotion style transfer? #322
-
I noticed that the OpenVoice repo allows for Voice Style Control, not only voice-to-voice conversion: https://github.com/myshell-ai/OpenVoice/blob/main/demo_part1.ipynb How would one go about doing that in coqui-tts? Or is this not a feature when using it through coqui-tts?
Replies: 4 comments
-
Ah, perhaps this needs to be implemented in the coqui-tts pipeline, as it seems the speaker embeddings aren't used for the OpenVoice voice transfer.
-
Could that embedding model be used to add specific emotions to other models like XTTS as well? Or...
-
There is a lot of marketing speak in that repo, so it is unfortunately not immediately obvious how it works. There are actually two separate components:

1. The first can be any TTS system that gives you an audio output. They mostly use their own MeloTTS, which is based on VITS.
2. The second is just a separate voice conversion model that takes the TTS output and a reference speaker audio and returns the converted speech. But I guess "tone color converter" sounds fancier... These are the actual OpenVoice VC models (v1 and v2) that were added to Coqui, so you can use them with any Coqui TTS model.

In that notebook they use a single-speaker TTS model that can generate different emotions (friendly, cheerful, excited, sad, angry, terrified, shouting, whispering). It is trained like a multi-speaker model, except that instead of different speaker labels they use emotion labels for data from a single speaker. Then they run the OpenVoice VC afterwards, so you can get different emotions for any speaker. That TTS model is not available in Coqui, but their codebase is mostly taken from Coqui, so it would be easy to integrate. I'm not planning to do this myself anytime soon, but I'd merge PRs for it.

That is not possible, because the emotions come from the TTS model, not the OpenVoice VC.
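The two-component pipeline above can be sketched with the Coqui TTS Python API: run any TTS model first, then pass its output through the OpenVoice VC model. This is a minimal sketch; the exact model ids below are assumptions based on Coqui's naming scheme, so check `tts --list_models` in your install for the ids actually available.

```python
# Sketch of the two-step OpenVoice pipeline in Coqui TTS.
# Model ids are assumptions -- verify with `tts --list_models`.
OPENVOICE_VC = "voice_conversion_models/multilingual/multi-dataset/openvoice_v2"
TTS_MODEL = "tts_models/en/ljspeech/vits"  # any Coqui TTS model works here

if __name__ == "__main__":
    # Import locally so the sketch is readable without the package installed.
    from TTS.api import TTS

    # Step 1: plain TTS -> intermediate audio in the TTS model's own voice.
    tts = TTS(TTS_MODEL)
    tts.tts_to_file(text="Hello world!", file_path="tts_output.wav")

    # Step 2: OpenVoice VC -> convert that audio to the reference speaker's voice.
    vc = TTS(OPENVOICE_VC)
    vc.voice_conversion_to_file(
        source_wav="tts_output.wav",
        target_wav="reference_speaker.wav",  # your reference speaker audio
        file_path="converted_output.wav",
    )
```

Since the VC step only needs a source wav and a reference wav, any TTS model (or even recorded speech) can be swapped into step 1 without changing step 2.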
-
Thanks for the info. I'll continue messing with it and see if I find a solution that I can include in a PR for you 👍