
information on 24khz model #2

Open
acul3 opened this issue Aug 19, 2024 · 16 comments
Comments

@acul3

acul3 commented Aug 19, 2024

Hey @adelacvg, thanks for sharing the code.

After reading the code, I want to ask you a few questions about the new 24k model, if you don't mind:

  1. What is different about this model compared to the previous one (https://huggingface.co/adelacvg/Detail/tree/main), besides the sample rate?

  2. Did you drop the speech encoder in the 24k model? (I see speech encoders in utils, like HuBERT, Whisper, etc., but I think those are from the previous model.) Do you also still use ContentVec768L12.py?

  3. I see train_target in https://github.com/adelacvg/detail_tts/blob/master/vqvae/configs/config_24k.json, so I assume training has multiple stages. If I want to train from scratch, do I need to change it? Say "gpt" first, then flowvae, then diff (is this correct?)

  4. If I want to train from scratch, I just remove

    trainer.load('/home/hyc/detail_tts/logs/2024-08-19-14-46-30/model-474.pt')

    right?

Sorry if this is a lot of questions. Thanks in advance.

@adelacvg
Owner

  1. There are many differences, such as the method of adding speaker information in diffusion, the approach to normalization, and which latent feature to use, among others. All these changes were made to create a more stable and hi-fi model.
  2. No SSL features like Whisper or ContentVec were used; those files were merely copied from other projects.
  3. Yes, the training sequence is "flowvae" -> "vqvae" -> "gpt" -> "diff". The reason for adding step-by-step training is that it allows for better gradient accumulation, which is crucial for training the VQ-VAE and GPT.
  4. Yes, to train from scratch, you just need to remove the load code. Please make sure to pre-process the data into a text-audiopath pair format in advance (see the sketch below). The dataset part is written very simply, so it should be easy to modify.
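
A minimal sketch of that kind of pre-processing, assuming a "text|audiopath" line format; the separator and column order are assumptions, so align them with whatever the Datasets class in this repo actually parses:

    # Hypothetical sketch: collect (text, audio path) pairs into one training list file.
    # The "text|audiopath" layout is an assumption; match it to the Datasets code.
    import csv
    from pathlib import Path

    def build_pair_file(transcript_csv: str, audio_dir: str, out_path: str) -> None:
        """transcript_csv rows: filename,text  ->  out_path lines: text|/abs/path.wav"""
        root = Path(audio_dir)
        with open(transcript_csv, newline="", encoding="utf-8") as f_in, \
             open(out_path, "w", encoding="utf-8") as f_out:
            for filename, text in csv.reader(f_in):   # assumes exactly two columns per row
                wav = root / filename
                if wav.exists():                      # skip rows whose audio file is missing
                    f_out.write(f"{text.strip()}|{wav.resolve()}\n")

    build_pair_file("transcripts.csv", "wavs/", "train_pairs.txt")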

@acul3
Author

acul3 commented Aug 19, 2024


Thank you for your quick answer @adelacvg.

One last question, if you don't mind: for point 3, is there a specific config (layers, dimensions, etc. for each target), especially for flowvae? I see there are specific configs for gpt and diff.

Thanks once again.

I am planning to reproduce your results, but with multilingual data (English and Malay), so I need to train a BPE tokenizer first.

@adelacvg
Owner


For the vqvae and flowvae configs, check the vaegan part of config_24k.json. For multilingual training, you can use voice_tokenizer.py to train your custom BPE tokenizer.
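
For reference, a minimal sketch of training a multilingual BPE tokenizer with the Hugging Face tokenizers library; voice_tokenizer.py may do something similar, but the vocab size and special-token names here are assumptions to be aligned with the repo:

    # Hypothetical sketch: train a BPE tokenizer on English + Malay transcripts.
    # Vocab size and special tokens are assumptions; match them to the gpt config.
    from tokenizers import Tokenizer, models, pre_tokenizers, trainers

    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

    trainer = trainers.BpeTrainer(
        vocab_size=512,                      # the 512-token vocab discussed later in this thread
        special_tokens=["[UNK]", "[STOP]"],  # placeholder names, not from the repo
    )
    tokenizer.train(["train_texts_en.txt", "train_texts_ms.txt"], trainer)
    tokenizer.save("bpe_512.json")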

@acul3
Author

acul3 commented Aug 20, 2024

Just finished 50% of the flowvae steps (13M samples, 300k of 600k steps).

For the next training stage (vqvae), I need to load the flowvae model .pt and then continue with the next training target, right? @adelacvg

Here is a sample from the flowvae:
https://github.com/user-attachments/assets/a0b5151e-e13a-4f5f-86bc-e38edb4ead2a

@adelacvg
Owner

Yes, just use the results from the previous step for the next step of the training.
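
As a rough sketch of what that hand-off could look like, assuming train_target is a top-level key in config_24k.json (the checkpoint path is a placeholder):

    # Hypothetical sketch: point the config at the next training stage before resuming.
    # The key name "train_target" comes from config_24k.json, as discussed above.
    import json

    cfg_path = "vqvae/configs/config_24k.json"
    with open(cfg_path) as f:
        cfg = json.load(f)
    cfg["train_target"] = "vqvae"            # previous stage was "flowvae"
    with open(cfg_path, "w") as f:
        json.dump(cfg, f, indent=2)
    # Then resume from the checkpoint of the flowvae run, e.g.
    # trainer.load('logs/<flowvae_run>/model-<step>.pt')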

@acul3
Author

acul3 commented Aug 24, 2024

Hmm, it seems my vqvae training loss is stuck. After 2 days it stays the same, and the samples are also not intelligible compared to the ground truth.


@adelacvg
Owner


It's normal; VQ-VAE only needs to capture the semantics approximately.

@acul3
Author

acul3 commented Aug 30, 2024


OK, I am at the gpt stage now. After training for 2 days, the result sounds like the ground truth, but it's still not quite there.

ground truth:
https://github.com/user-attachments/assets/5c27cf96-7921-4ca1-af1f-dc8d2050bfe2

sample:

sample-1049.mp4

@acul3
Author

acul3 commented Aug 30, 2024

@adelacvg

BTW, I changed my GPT vocab size to 512 because of the multilingual data.

I just changed the config:

  "gpt":{
    "model_dim":768,
    "max_mel_tokens":1600,
    "max_text_tokens":800,
    "heads":16,
    "mel_length_compression":1024,
    "use_mel_codes_as_input":true,
    "layers":10,
    "number_text_tokens":513,
    "number_mel_codes":8194,
    "start_mel_token":8192,
    "stop_mel_token":8193,
    "start_text_token":512,
    "train_solo_embeddings":false,
    "spec_channels":128

Specifically, I changed number_text_tokens and start_text_token.

Is this correct?

thank you again
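
My reading of how those two fields relate to the tokenizer vocab is the following; this is an assumption, not something the repo documents:

    # Assumed relationship between the BPE vocab size and the gpt config (not confirmed by the repo):
    vocab_size = 512                      # size of the trained BPE vocab
    start_text_token = vocab_size         # 512: one extra id reserved as the start token
    number_text_tokens = vocab_size + 1   # 513: the vocab plus that reserved token
    assert number_text_tokens == start_text_token + 1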

@adelacvg
Owner


In the GPT step, the inference results are close to those of the VQ-VAE. You just need to ensure that the semantics are correct; after diffusion, they will become high quality.

@adelacvg
Owner

adelacvg commented Aug 31, 2024

Ensure that the reference mel is a short segment of audio, to avoid GPT overfitting to the speaker condition.
I have updated some parameters of the VQ-VAE, resulting in a higher codebook utilization, which should lead to better results.
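
Regarding the short reference segment, a minimal sketch of cropping the reference audio before computing its mel; the 3-second length and the use of torchaudio are assumptions, not the repo's actual conditioning code:

    # Hypothetical sketch: keep only a few seconds of the reference clip,
    # so GPT conditions on timbre rather than a long speaker-specific recording.
    import torchaudio

    def crop_reference(wav_path: str, seconds: float = 3.0):
        wav, sr = torchaudio.load(wav_path)   # wav shape: (channels, samples)
        max_len = int(seconds * sr)
        return wav[:, :max_len], sr           # truncate to the first few seconds

    ref_wav, sr = crop_reference("reference.wav", seconds=3.0)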

@acul3
Author

acul3 commented Sep 7, 2024

@adelacvg BTW, how can I run inference for the diffusion part? It seems api.py only provides vqvae and gpt (from an old commit).

Finishing the gpt training and continuing with diff now.

@adelacvg
Owner

The infer_diffusion function is used the same way as the infer function; the do_spectrogram_diffusion part does the sampling process.

@acul3
Author

acul3 commented Sep 16, 2024

@adelacvg have you gotten good results?

After training diff for 2 days, I get the same result as gpt (robotic sound, but the semantics are there).

@acul3
Author

acul3 commented Sep 18, 2024

After using the last commit, I finally got good results, thank you.

Any tips on how to make inference faster @adelacvg? (Maybe Tortoise-style?)

@adelacvg
Owner

For the GPT part, you can use LLM acceleration frameworks such as vLLM, which also support GPT-2. For the diffusion part, you can adopt faster sampling methods with fewer sampling steps. Alternatively, like XTTS, you can use a GAN instead of diffusion; although quality may decrease, it can be very fast for the timbres in the training dataset.
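
As a generic illustration of the "fewer sampling steps" idea only; this uses the diffusers DDIMScheduler as a stand-in, not this repo's do_spectrogram_diffusion, and the denoiser and latent shape are placeholders:

    # Generic illustration: fewer sampling steps trade quality for speed.
    import torch
    from diffusers import DDIMScheduler

    scheduler = DDIMScheduler(num_train_timesteps=1000)
    scheduler.set_timesteps(25)                      # e.g. 25 steps instead of hundreds

    denoiser = lambda x, t: torch.zeros_like(x)      # stand-in for the trained denoising network
    sample = torch.randn(1, 100, 256)                # placeholder latent shape

    for t in scheduler.timesteps:
        noise_pred = denoiser(sample, t)
        sample = scheduler.step(noise_pred, t, sample).prev_sample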
