Question about transcribe only singing voice data #10

Closed
Joanna1212 opened this issue Sep 10, 2024 · 10 comments

@Joanna1212

Hello,

I am trying to train a model to transcribe only vocal data. I set the parameters as follows: '-tk' 'singing_v1' '-d' 'all_singing_v1', i.e. the task and the training data. However, I encountered an error in the model code at './amt/src/model/t5mod.py', line 633, 'b, k, t, d = inputs_embeds.size()': inputs_embeds has only three dimensions, torch.Size([6, 1024, 512]).
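
To illustrate, the failure reduces to unpacking a 3-D tensor into four variables; a minimal standalone reproduction (hypothetical shapes, outside the project code) looks like this:

    import torch

    # Hypothetical shapes: the multi-channel decoder path expects (batch, channels, time, dim),
    # but in this configuration the embeddings arrive as 3-D (batch, time, dim).
    inputs_embeds = torch.randn(6, 1024, 512)

    try:
        b, k, t, d = inputs_embeds.size()  # same pattern as t5mod.py line 633
    except ValueError as err:
        print(err)  # not enough values to unpack (expected 4, got 3)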

How should I modify this to train successfully? Should I set any other parameters?
Thanks!

@Joanna1212 changed the title from "Regarding the issue of transcribing only the vocals" to "Question about transcribe only singing voice data" Sep 10, 2024
@mimbres self-assigned this Sep 10, 2024
@mimbres
Owner

mimbres commented Sep 10, 2024

Hi @Joanna1212

  • Can you show me all of your train.py options? That error seems to be related to the encoder/decoder type.

  • The singing_v1 task is an experimental option. It uses a singing prefix token, which is not covered in the paper. all_singing_v1 is also just for quick experimentation, with the sampling probability of the singing dataset increased.

@Joanna1212
Author

args=('yourmt3_only_sing_voice_3' '-tk' 'singing_v1' '-d' 'all_singing_v1' '-dec' 'multi-t5' '-nl' '26' '-enc' 'perceiver-tf' '-sqr' '1' '-ff' 'moe' '-wf' '4' '-nmoe' '8' '-kmoe' '2' '-act' 'silu' '-epe' 'rope' '-rp' '1' '-ac' 'spec' '-hop' '300' '-atc' '1' '-pr' '16-mixed' '-bsz' '12' '12' '-st' 'ddp' '-se' '1000000')
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py "${args[@]}"
This way 👆!

I only want to transcribe the singing voice track (single-track prediction).

thanks!

@Joanna1212
Author

I set "num_channels" in config.py from 13 to 1, and it seems to work. Let me try the training.

@mimbres
Owner

mimbres commented Sep 10, 2024

@Joanna1212
Sorry for the confusion about the task prefix. I looked further into the code, and in the current version the 'singing_v1' task is no longer supported. We deprecated the use of prefix tokens for exclusive transcription of specific instruments because they showed no performance benefit.

  • If you set num_channel=1 with a multi-channel T5 decoder, it will behave the same as a single-channel decoder. As mentioned earlier, it will not use any prefix tokens for singing-only transcription. For single-channel decoding, it is currently recommended to set the decoder type to 't5' and the task to 'mt3_full_plus'.
  • When using a multi-channel decoder, it is recommended to set the decoder type to 'multi-t5' and the task to 'mc13_full_plus_256'.
  • The recommended approach for now is to transcribe only singing by extracting the singing program (100) through post-processing, without modifying the code (a minimal sketch follows below). I'll provide an alternative in the next update through an "exclusive" task (as prototyped in exc_v1 of config/task.py).
  • About max iterations, I prefer adjusting -it over using -se or epoch-based counting, to better manage the cosine scheduler. See Cannot wait to use this project~ #2 (comment)
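
For the post-processing route above, here is a minimal sketch using pretty_midi (not part of this repo's tooling; it assumes the transcribed MIDI file stores the singing voice as program 100):

    import pretty_midi

    def extract_singing_track(in_path: str, out_path: str, singing_program: int = 100) -> None:
        """Keep only the instrument tracks whose program matches the singing program."""
        midi = pretty_midi.PrettyMIDI(in_path)
        midi.instruments = [
            inst for inst in midi.instruments
            if inst.program == singing_program and not inst.is_drum
        ]
        midi.write(out_path)

    # Hypothetical usage:
    # extract_singing_track("full_transcription.mid", "vocals_only.mid")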

@Joanna1212
Author

Joanna1212 commented Sep 11, 2024

Thank you for your detailed response. I'll try training your final model.
Extracting the singing track (100) through post-processing is very easy. I have already completed it.

However, I noticed some minor errors in the singing voice on some pop music (as you mentioned in your paper). Therefore, I hope to add some vocal transcription data to improve vocal transcription accuracy.

The dataset I want to add consists of complete songs (vocals mixed with accompaniment, plus a separated vocal track) and the corresponding vocal MIDI, just this one track.
I noticed you only use the vocal tracks of MIR-ST500 and CMedia.
Do you think using plenty of converted_Mixture.wav files could work better than just online augmentation? 🫡

@Joanna1212
Author

Joanna1212 commented Sep 11, 2024

Perhaps I should add vocal datasets to the current data preset "all_cross_final",
continuing to add separated-vocal datasets like mir_st500_voc,
and keep the "mc13_full_plus_256" task with the multi-channel decoder,

or complete the "p_include_singing" part (the probability of including singing for cross-augmented examples).

Maybe this would enhance vocal performance on top of multi-track transcription?

@Joanna1212
Author

Joanna1212 commented Sep 11, 2024

I noticed that you used temperature-based sampling in your paper to determine the proportions of each dataset.

For my scenario, where I am only interested in vocals, do you think I should increase the proportion of the singing voice datasets (MIR-ST500, CMedia)?

Additionally, you mentioned, 'We identified the dataset most prone to over-fitting, as shown by its validation loss curve.' Did you train each dataset separately to observe this, or did you observe the validation results of individual datasets during the overall training?
Thanks!

@mimbres
Owner

mimbres commented Sep 11, 2024

@Joanna1212

Do you think using plenty of converted_Mixture.wav files could work better than just online augmentation?

This pre-release code lacks the unannotated instrument masking (for training) feature, which will be added in an update later this month. I've seen a 1-2% performance improvement, which could be higher with more data.
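
As a rough conceptual sketch of that idea (the feature itself is not in this pre-release, and the actual implementation may differ): the training loss can simply ignore tokens that belong to instruments a given dataset does not annotate, e.g.:

    import torch
    import torch.nn.functional as F

    def masked_token_loss(logits: torch.Tensor,
                          targets: torch.Tensor,
                          unannotated: torch.Tensor) -> torch.Tensor:
        """Cross-entropy that skips tokens of unannotated instruments.

        logits:      (batch, seq, vocab) decoder outputs
        targets:     (batch, seq) target token ids
        unannotated: (batch, seq) bool mask, True where the target token refers to
                     an instrument the source dataset does not annotate.
        """
        ignore_index = -100
        targets = targets.masked_fill(unannotated, ignore_index)
        # cross_entropy expects (batch, vocab, seq) for sequence inputs
        return F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=ignore_index)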

"all_cross_final"

Yes, I recommend modifying all_cross_final in data_preset.py. For example:

 "all_cross_final": {
        "presets": [
            ...
           `YOUR_DATASET_NAME`
        ],
       "weights": [..., `YOUR_SAMPLING_WEIGHT`],
       "eval_vocab": [..., SINGING_SOLO_CLASS],
       ...
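
For instance, a hypothetical filled-in entry could look like the following (the preset name "my_singing_voc" and the weight 0.6 are placeholders, not values from the repo):

    "all_cross_final": {
        "presets": [
            ...,                 # existing presets unchanged
            "my_singing_voc",    # placeholder name for your separated-vocal dataset preset
        ],
        "weights": [..., 0.6],   # a weight comparable to MIR-ST500 if the dataset sizes are similar
        "eval_vocab": [..., SINGING_SOLO_CLASS],
        ...
    },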

I noticed that you used temperature-based sampling...

The main point of our paper is that exact temperature-based sampling (of the original MT3) significantly degrades performance. See more details in Appendix G (not F; 😬 found a typo). However, if the datasets are of similar quality, you can weight them proportionally. For example, if your custom singing data is similar in size to MIR-ST500, assign them similar weights. It’s okay if the total sum of the added weights exceeds 1.
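
For concreteness, here is a tiny illustration (made-up dataset sizes, not the real track counts) of how temperature-based sampling probabilities, p_i ∝ n_i^(1/τ), compare with simple size-proportional weights:

    # Illustrative only: hypothetical dataset sizes.
    sizes = {"mir_st500_voc": 330, "cmedia_voc": 100, "my_singing_data": 300}

    def sampling_probs(sizes: dict, tau: float = 1.0) -> dict:
        scaled = {name: n ** (1.0 / tau) for name, n in sizes.items()}
        total = sum(scaled.values())
        return {name: round(s / total, 3) for name, s in scaled.items()}

    print(sampling_probs(sizes, tau=1.0))  # proportional to dataset size
    print(sampling_probs(sizes, tau=3.0))  # higher temperature flattens toward uniform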

did you observe the validation results of individual dataset...

Yes. In the wandb logger, dataloader_idx is in the same order as the datasets defined in the data preset.
[Screenshot 2024-09-11 at 13 00 47 attached]

@mimbres pinned this issue Sep 11, 2024
@Joanna1212
Author

Thanks, I'll try this with more vocal data.
I understand your explanation about the wandb logger. Thank you for your response and advice.

@Joanna1212
Author

This pre-release code lacks the unannotated instrument masking (for training) feature, which will be added in an update later this month. I've seen a 1-2% performance improvement, which could be higher with more data.

I tried adding some vocal data. Initially, the metrics showed a slight improvement, but soon there was a gradient explosion. The metrics were slightly better on cmedia and mir_st500. 👍

BTW, please notify me if there is an update 😄. Thanks!
