Question about transcribing only singing voice data #10
Hi @Joanna1212

> args=('yourmt3_only_sing_voice_3' '-tk' 'singing_v1' '-d' 'all_singing_v1' '-dec' 'multi-t5' '-nl' '26' '-enc' 'perceiver-tf' '-sqr' '1' '-ff' 'moe' '-wf' '4' '-nmoe' '8' '-kmoe' '2' '-act' 'silu' '-epe' 'rope' '-rp' '1' '-ac' 'spec' '-hop' '300' '-atc' '1' '-pr' '16-mixed' '-bsz' '12' '12' '-st' 'ddp' '-se' '1000000')
>
> I only want to transcribe the singing voice track (single-track prediction). Thanks!
I set config.py's "num_channels" from 13 to 1, and it seems to work. Let's try the training.
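For reference, a minimal sketch of the kind of edit described, assuming "num_channels" is a key in one of the model config dicts in config.py (the dict name below is illustrative, not necessarily the repo's actual one):

```python
# config.py (illustrative; the actual dict name and location may differ)
model_cfg = {
    # 13 channels correspond to the multi-track decoding setup;
    # 1 channel restricts decoding to a single (vocal-only) track.
    "num_channels": 1,  # was 13
}
```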
@Joanna1212
Thank you for your detailed response. I'll try training your final model. However, I noticed some minor errors in the singing voice on some pop music (as you mentioned in your paper). Therefore, I hope to supplement the training with some vocal transcription data to improve the accuracy of vocal transcription. The dataset I want to add consists of complete songs (vocals mixed with accompaniment, plus the separated vocal track) and the corresponding vocal MIDI, just this one track.
Perhaps I should add the vocal datasets to the current dataset preset in "all_cross_final", or use them to complement the "p_include_singing" part (the probability of including singing for cross-augmented examples). Maybe this would enhance vocal performance built on multi-track transcription?
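Purely as an illustration of that idea, a hypothetical sketch of how such a probability could gate whether the vocal stem is mixed into a cross-augmented example; none of these names are the repo's actual API:

```python
import random

def build_cross_augmented_mix(instrument_stems, vocal_stem, p_include_singing=0.8):
    """Hypothetical: randomly decide whether the vocal stem joins this augmented mix."""
    stems = list(instrument_stems)
    if vocal_stem is not None and random.random() < p_include_singing:
        # Raising p_include_singing exposes the model to singing more often.
        stems.append(vocal_stem)
    return stems
```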
I noticed that you used temperature-based sampling in your paper to determine the proportions of each dataset. For my scenario, where I am only interested in vocals, do you think I should raise the proportion of the singing voice datasets (MIR-ST500, CMedia)? Additionally, you mentioned, 'We identified the dataset most prone to over-fitting, as shown by its validation loss curve.' Did you train each dataset separately to observe this, or did you observe the validation results of the individual datasets during the overall training?
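For context, temperature-based sampling typically draws dataset i with probability proportional to n_i^(1/tau), where n_i is the dataset size and tau > 1 flattens the distribution so small datasets (like the vocal ones) are seen more often. A minimal sketch; the dataset sizes below are made-up placeholders, not the paper's actual numbers:

```python
def temperature_sampling_weights(sizes, tau=2.0):
    """Return sampling probabilities p_i proportional to n_i**(1/tau).

    Larger tau flattens the distribution, upweighting small datasets.
    """
    scaled = [n ** (1.0 / tau) for n in sizes]
    total = sum(scaled)
    return [s / total for s in scaled]

# Hypothetical example: the two vocal datasets are relatively small.
sizes = {"slakh": 2100, "mir_st500": 400, "cmedia": 100}  # placeholder counts
print(dict(zip(sizes, temperature_sampling_weights(list(sizes.values()), tau=2.0))))
```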
Thanks, I'll try this with more vocal data.
I tried adding some vocal data. Initially, the metrics showed a slight improvement, but soon there was a gradient explosion. The metrics were slightly better on CMedia and MIR-ST500. 👍 BTW, please notify me if there is an update 😄. Thanks!
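As an aside, a common first mitigation for exploding gradients is gradient clipping, often paired with a lower learning rate. A minimal sketch assuming a PyTorch Lightning Trainer, which the '-st' 'ddp' and '-pr' '16-mixed' flags suggest; the clip value is illustrative:

```python
import lightning as L  # or: from pytorch_lightning import Trainer

trainer = L.Trainer(
    strategy="ddp",
    precision="16-mixed",
    gradient_clip_val=1.0,            # cap the global gradient norm; tune for your data
    gradient_clip_algorithm="norm",
)
```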
Hello,
I am trying to train a model to transcribe only vocal data. I set the parameters as follows: '-tk' 'singing_v1' '-d' 'all_singing_v1', which select the task and the training data. However, I encountered an error in the model part: './amt/src/model/t5mod.py', line 633, where 'b, k, t, d = inputs_embeds.size()' fails because inputs_embeds has only three dimensions: torch.Size([6, 1024, 512]).
How should I modify this to train successfully? Should I set any other parameters?
Thanks!
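For reference, that unpack expects a 4-D tensor of shape (batch, channel, time, dim), while the batch arrives as 3-D (batch, time, dim). A minimal sketch of the mismatch and the kind of reshape that resolves it; whether the proper fix is to reshape here or to configure the pipeline to emit the channel axis depends on the repo's multi-channel setup:

```python
import torch

inputs_embeds = torch.randn(6, 1024, 512)       # (batch, time, dim): what arrives
if inputs_embeds.dim() == 3:
    inputs_embeds = inputs_embeds.unsqueeze(1)  # -> (batch, 1, time, dim): one vocal channel
b, k, t, d = inputs_embeds.size()               # now unpacks, with k == 1 (single track)
```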