Confusion in injecting speaker information while finetuning the model. #302
-
Hi, I'm trying to finetune a custom FastPitch model provided by AI4 Bharat on the OpenSLR dataset. It looks like they trained the model with 4 speakers, and they provide the speaker IDs of those 4 speakers in the form of a .pth file, i.e., they ship best_model.pth, speakers.pth, and an IndicTTS config file. With all this information, I'm able to finetune the model using the Coqui-ai implementation. However, I would like to use my own speakers instead of the available speaker embeddings. So my question is: is it possible to directly inject my custom multi-speaker audio data into the Coqui-ai implementation to derive the speaker embeddings alone and then generate the output speech using a selected speaker_id? Or should I train a speaker embedding model as a standalone module and then inject the speaker embeddings externally? Or should I train a single model from scratch to derive both the acoustic and speaker information, keeping the speaker_id information in a separate .json file, as shown below?
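For reference, this is roughly how I'm inspecting the speaker table that ships with the checkpoint (a minimal sketch assuming speakers.pth is a plain torch-serialized mapping of speaker names to IDs; the exact structure may differ between Coqui versions):

```python
import torch

# Peek at the speaker table shipped with the checkpoint.
# Assumption: speakers.pth is a plain torch-serialized dict of speaker name -> ID
# (some Coqui versions store embedding vectors instead), so the layout may differ.
speakers = torch.load("speakers.pth", map_location="cpu")
print(type(speakers))
if isinstance(speakers, dict):
    for name, value in speakers.items():
        print(name, value)
else:
    print(speakers)
```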
-
There are 2 ways in Coqui to encode speaker information:

1. A speaker embedding layer learned inside the TTS model itself (`"use_speaker_embedding": true,` in the config)
2. External d-vectors computed by a separate speaker encoder and loaded from a file (`"use_d_vector_file": true,` in the config)

When fine-tuning a model, you'd need to check its config to see what method it uses.
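For example (a minimal sketch; the config file name is a placeholder, and the keys are assumed to sit at the top level of the usual Coqui config layout), you can check which mechanism the checkpoint expects like this:

```python
import json

# Load the config that ships with the checkpoint (path is hypothetical).
with open("config.json") as f:
    config = json.load(f)

# Typically only one of these is enabled for a multi-speaker model:
# - use_speaker_embedding: speaker embedding layer learned inside the model
# - use_d_vector_file: external d-vectors loaded from a file
print("use_speaker_embedding:", config.get("use_speaker_embedding", False))
print("use_d_vector_file:", config.get("use_d_vector_file", False))
```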
Original repo crosslink: coqui-ai#4155