Information on the 24kHz model #2
Thank you for your quick answer @adelacvg. One last question, if you don't mind: for point number 3, is there a specific config (target layers, dimensions, etc.) for each stage, especially for flowvae? I see there are specific configs for gpt and diff. Thanks once again. I am planning to reproduce your code, but with multilingual data (English and Malay), so I need to train a BPE tokenizer first (see the sketch below).
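A minimal sketch of training such a multilingual BPE tokenizer with the HuggingFace `tokenizers` library; the corpus file names and special-token names here are placeholders, not part of detail_tts:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Train a shared English + Malay BPE vocab of size 512 (matching the GPT
# vocab size discussed below). Corpus paths are hypothetical.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=512, special_tokens=["[UNK]", "[START]", "[STOP]"])
tokenizer.train(files=["corpus_en.txt", "corpus_ms.txt"], trainer=trainer)
tokenizer.save("bpe_en_ms.json")
```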
For the vqvae- and flowvae-specific configs, you can check:
I just finished 50% of the flowvae steps (13M samples, 300k of 600k steps). For the next training step (vqvae), I need to load the flowvae model .pt and then continue training, right @adelacvg? Here is a sample from the flowvae:
Yes, just use the results from the previous step for the next step of the training. |
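A minimal sketch of loading the previous stage's checkpoint before starting the next one, assuming a standard PyTorch `.pt` file; the checkpoint key layout is a guess, so check how train.py actually saves it:

```python
import torch

def load_pretrained(model: torch.nn.Module, ckpt_path: str) -> torch.nn.Module:
    """Load weights from an earlier training stage into the next stage's model."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    # Some checkpoints nest the weights under a key such as "model";
    # fall back to the raw dict otherwise (assumption, verify against train.py).
    state = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt
    missing, unexpected = model.load_state_dict(state, strict=False)
    print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
    return model
```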
Hmm, it seems my vqvae training loss is stuck; after 2 days it stays the same, and the samples are also unintelligible compared to the ground truth.
It's normal; VQ-VAE only needs to capture the semantics approximately. |
OK.
Ground truth: [audio attachment]
Sample: sample-1049.mp4
By the way, I changed my GPT vocab size to 512 because of the multilinguality. I just changed number_text_tokens and start_text_token in the config; that is correct, right? Thank you again.
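A sketch of the corresponding config edit; the nesting under a "gpt" section and the start-token convention are assumptions here, so verify them against the actual config_24k.json schema:

```python
import json

with open("vqvae/configs/config_24k.json") as f:
    cfg = json.load(f)

# Hypothetical nesting; match the real schema of config_24k.json.
cfg["gpt"]["number_text_tokens"] = 512  # new multilingual BPE vocab size
cfg["gpt"]["start_text_token"] = 511    # often vocab_size - 1, but verify the convention

with open("vqvae/configs/config_24k.json", "w") as f:
    json.dump(cfg, f, indent=2)
```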
In the GPT step, the inference results are close to those of VQ-VAE. You just need to ensure that the semantics are correct; after diffusion, they will become high quality.
Ensure that the reference mel is a short segment of audio, to avoid GPT overfitting to the speaker's conditioning.
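A sketch of trimming the reference audio to a short conditioning clip with torchaudio; the 3-second cap is an illustrative choice, not a value from the repo:

```python
import torchaudio

wav, sr = torchaudio.load("reference.wav")
max_len = 3 * sr                 # keep only ~3 seconds of conditioning audio (assumption)
ref = wav[:, :max_len]
torchaudio.save("reference_short.wav", ref, sr)
```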
@adelacvg, by the way, how can I infer the diffusion part? It seems api.py only provides vqvae and gpt (old commit). I just finished GPT training and am continuing with diff now.
The infer_diffusion function works the same way as the infer function.
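A purely illustrative sketch of that call pattern; the wrapper name `TTS` and the argument order are guesses, so check api.py for the real class and signatures:

```python
from api import TTS  # hypothetical wrapper; see api.py for the actual entry point

tts = TTS(config_path="vqvae/configs/config_24k.json", model_path="model.pt")
text = "Some text to synthesize."
wav_gpt = tts.infer("reference.wav", text)             # GPT/VQ-VAE path
wav_diff = tts.infer_diffusion("reference.wav", text)  # same call pattern, diffusion path
```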
@adelacvg, have you gotten good results? After training diff for 2 days, I get the same result as GPT (robotic sound, but the semantics are there).
After using the last commit, I finally got good results, thank you! Any tips on how to make inference faster @adelacvg? (Maybe Tortoise-style?)
For the GPT part, you can use acceleration frameworks similar to vLLM, which also support GPT-2. For the diffusion part, you can adopt faster sampling methods with fewer sampling steps. Alternatively, like XTTS, you can use a GAN instead of diffusion; although performance may decrease, it can be very fast for the timbres in the training dataset.
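The "fewer sampling steps" idea boils down to respacing the timesteps, as DDIM-style samplers do. A generic sketch of that respacing (not code from this repo):

```python
import numpy as np

def spaced_timesteps(train_steps: int = 1000, sample_steps: int = 50) -> np.ndarray:
    """Pick an evenly spaced subset of the training timesteps for faster sampling."""
    return np.linspace(0, train_steps - 1, sample_steps).round().astype(int)[::-1]

print(spaced_timesteps(1000, 10))  # e.g. 10 denoising steps instead of 1000
```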
Hey @adelacvg, thanks for sharing the code. After reading it, I want to ask a few questions about the new 24k model, if you don't mind:
1. What makes this model different from the previous one (https://huggingface.co/adelacvg/Detail/tree/main), besides the sample rate?
2. Did you not use a speech encoder in the 24k model? (I see there are speech encoders in utils, like HuBERT, Whisper, etc., but I think those are from the previous model.) Do you also still use ContentVec768L12.py?
3. I see train_target in https://github.com/adelacvg/detail_tts/blob/master/vqvae/configs/config_24k.json, so I assume there are multiple steps of training. If I want to train from scratch, do I need to change it, say to "gpt" first, then flowvae, then diff (is this correct)? (See the staging sketch after this comment.)
4. If I want to train from scratch, can I just remove this line? (detail_tts/train.py, line 461 at 7e24668)
Sorry if this is a lot of questions; thanks in advance!
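A sketch of how the staged training implied by train_target could be driven; the stage order here follows this thread (flowvae → vqvae → gpt → diff), and `launch_training` is a hypothetical entry point, not a function in train.py:

```python
import json

# Stage order as described in this thread; verify against train.py.
STAGES = ["flowvae", "vqvae", "gpt", "diff"]

with open("vqvae/configs/config_24k.json") as f:
    cfg = json.load(f)

for stage in STAGES:
    cfg["train_target"] = stage
    # launch_training(cfg)  # hypothetical; each stage resumes from the previous checkpoint
```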