OpenVoice emotion style transfer? #322
-
I noticed that the OpenVoice repo allows for Voice Style Control, not only voice-to-voice conversion: https://github.com/myshell-ai/OpenVoice/blob/main/demo_part1.ipynb How would one go about doing that in coqui-tts? Or is this not a feature when using it through coqui-tts?
Replies: 4 comments
-
Ah, perhaps this needs to be implemented in the coqui-tts pipeline, as it seems the speaker embeddings aren't used for the OpenVoice voice transfer.
-
Could that embedding model be used to add specific emotions to other models like XTTS as well? Or...
-
There is a lot of marketing speak in that repo, so it is unfortunately not immediately obvious how it works. There are actually two separate components:

1. The first can be any TTS system that gives you an audio output. They mostly use their own MeloTTS, which is based on VITS.
2. The second is just a separate voice conversion model that takes the TTS output and a reference speaker audio and returns the converted speech. But I guess "tone color converter" sounds fancier... These are the actual OpenVoice VC models (v1 and v2) that were added to Coqui, so you can use them with any Coqui TTS model.

In that notebook they use a single-speaker TTS model that can generate different emotions (friendly, cheerful, excited, sad, angry, terrified, shouting, whispering). It is trained like a multi-speaker model, except that instead of different speaker labels they use emotion labels for data from a single speaker. Then they run the OpenVoice VC afterwards, so you can get different emotions for any speaker. That TTS model is not available in Coqui, but their codebase is mostly taken from Coqui, so it would be easy to integrate. I'm not planning to do this myself anytime soon, but I'd merge PRs for it.

That is not possible, because the emotions come from the TTS model, not the OpenVoice VC.
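The two-component pipeline above can be sketched with the Coqui TTS Python API: run any TTS model first, then pass its output through the OpenVoice VC model. This is a minimal sketch; the exact model ids below are assumptions based on Coqui's naming scheme, so check `tts --list_models` in your install for the ids actually available.

```python
# Sketch of the two-step OpenVoice pipeline in Coqui TTS.
# Model ids are assumptions -- verify with `tts --list_models`.
OPENVOICE_VC = "voice_conversion_models/multilingual/multi-dataset/openvoice_v2"
TTS_MODEL = "tts_models/en/ljspeech/vits"  # any Coqui TTS model works here

if __name__ == "__main__":
    # Import locally so the sketch is readable without the package installed.
    from TTS.api import TTS

    # Step 1: plain TTS -> intermediate audio in the TTS model's own voice.
    tts = TTS(TTS_MODEL)
    tts.tts_to_file(text="Hello world!", file_path="tts_output.wav")

    # Step 2: OpenVoice VC -> convert that audio to the reference speaker's voice.
    vc = TTS(OPENVOICE_VC)
    vc.voice_conversion_to_file(
        source_wav="tts_output.wav",
        target_wav="reference_speaker.wav",  # your reference speaker audio
        file_path="converted_output.wav",
    )
```

Since the VC step only needs a source wav and a reference wav, any TTS model (or even recorded speech) can be swapped into step 1 without changing step 2.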
-
Thanks for the info. I'll continue messing with it and see if I find a solution that I can include in a PR for you 👍