Confusion in injecting speaker information while finetuning the model. #302
-
Hi, I'm trying to finetune a custom FastPitch model provided by AI4 Bharat on the OpenSLR dataset. It looks like they trained the model with 4 speakers, and they provide the speaker IDs of those 4 speakers in the form of a .pth file, i.e., they ship best_model.pth, speakers.pth, and an IndicTTS config file. With all this information, I'm able to finetune the model using the Coqui-ai implementation. However, I would like to use my own speakers instead of the available speaker embeddings. So my question is: is it possible to directly inject my custom multi-speaker audio data into the Coqui-ai implementation to derive the speaker embeddings alone and then generate the output speech using a selected speaker_id? Or should I train a speaker embedding model as a standalone module and then inject the speaker embeddings externally? Or should I train a single model from scratch to derive both the acoustic and speaker information, keeping the speaker_id information in a separate .json file, as shown below?
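For reference, this is roughly how I'm inspecting the speaker table that ships with the checkpoint (a minimal sketch assuming speakers.pth is a plain torch-serialized mapping of speaker names to IDs; the exact structure may differ between Coqui versions):

```python
import torch

# Peek at the speaker table shipped with the checkpoint.
# Assumption: speakers.pth is a plain torch-serialized dict of speaker name -> ID
# (some Coqui versions store embedding vectors instead), so the layout may differ.
speakers = torch.load("speakers.pth", map_location="cpu")
print(type(speakers))
if isinstance(speakers, dict):
    for name, value in speakers.items():
        print(name, value)
else:
    print(speakers)
```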
-
There are 2 ways in Coqui to encode speaker information:

1. A speaker embedding layer learned inside the TTS model itself (`"use_speaker_embedding": true,` in the config)
2. External d-vectors computed by a separate speaker encoder and loaded from a file (`"use_d_vector_file": true,` in the config)

When fine-tuning a model, you'd need to check its config to see what method it uses.
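For example (a minimal sketch; the config file name is a placeholder, and the keys are assumed to sit at the top level of the usual Coqui config layout), you can check which mechanism the checkpoint expects like this:

```python
import json

# Load the config that ships with the checkpoint (path is hypothetical).
with open("config.json") as f:
    config = json.load(f)

# Typically only one of these is enabled for a multi-speaker model:
# - use_speaker_embedding: speaker embedding layer learned inside the model
# - use_d_vector_file: external d-vectors loaded from a file
print("use_speaker_embedding:", config.get("use_speaker_embedding", False))
print("use_d_vector_file:", config.get("use_d_vector_file", False))
```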
Original repo crosslink: coqui-ai#4155