Model Performance after Full Fine-tuning by LOMOTrainer #25
Comments
Hi, can you provide more training details, for example the training loss curve? If the loss jitters a lot, the best choice may be to use a lower learning rate, say 1e-3. One possible verification is to check the data preprocessing: you can print the input_ids in the model's forward function and decode them back to natural language. Another possible check is to load the model and run the eval step without any tuning, which is often a good way to detect data preprocessing and generation errors.
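Something like this minimal sketch, for example (it assumes a Hugging Face tokenizer and a batch dict containing input_ids; the inspect_batch helper is purely illustrative):

```python
from transformers import AutoTokenizer

# Illustrative check only: decode input_ids back to text to spot preprocessing bugs.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-7b1-mt")  # backbone used later in this thread

def inspect_batch(batch):
    """Print the first few samples of a batch as natural language."""
    for ids in batch["input_ids"][:2]:
        text = tokenizer.decode(ids, skip_special_tokens=False)
        print(repr(text))  # repr() makes stray special tokens and garbled characters visible

# Call inspect_batch(batch) right before model(**batch), or do the same decode
# inside the model's forward() as suggested above.
```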
Hmm, I have disabled wandb, so I can't show you my loss curve. Do you think 1e-3 is still too high for a learning rate? When I tune, I usually use about 1e-5.
1e-5 is common for Adam. The learning-rate scale for Adam is often much smaller than for SGD, since Adam rescales its learning rate before it has enough steps to compute the actual momentum.
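As a rough illustration of that scale difference (a toy sketch, not the LOMO setup itself):

```python
import torch

model = torch.nn.Linear(10, 10)  # stand-in for the real model

# Adam adapts per-parameter step sizes, so it typically uses a much smaller base lr.
adam = torch.optim.Adam(model.parameters(), lr=1e-5)
# Plain SGD-style updates (which LOMO's update resembles) usually need a larger lr.
sgd = torch.optim.SGD(model.parameters(), lr=3e-2)
```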
As I mentioned above, I see that in your full fine-tuning config you don't set the optimizer for this case. So when I full fine-tune without LoRA, must I set the optimizer to SGD, since the default is Adam?
Please check this line, https://github.com/OpenLMLab/LOMO/blob/main/src/train_lomo.py#L107 |
@dat-browny Hello bro. I am doing similar work to yours. Can we discuss the training details further?
Okay, can you tell me your problem? |
I want to use the optimizer in another project of mine. I failed, and I guess the reason is that I don't understand the related code.
From my experience, I think the problem is the learning rate. It isn't scaled the way Adam's is, so you have to choose one that suits your training target. I checked all my checkpoints at every learning rate I tried. The best checkpoint, in my opinion, is the first-epoch checkpoint with learning rate 3e-2; with this learning rate, all checkpoints after it (ep2, ep3) generate many messy characters, as I said in the first comment. I also tried two other learning rates, 1e-3 and 1e-5. It seems even 1e-3 is still too small, because at the third-epoch checkpoint the model still cannot learn the pattern of the training dataset. In conclusion, depending on your training objective you should pick a suitable learning rate in roughly [1e-3, 3e-2]; you could try around 5e-3 to 7e-3. Training this task took me a lot of time, so I dropped the idea of training with LOMO. But if you find a method that works significantly better, please let me know so I can do the same.
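A minimal sketch of that kind of sweep (train_one_run is a hypothetical stand-in for whatever launches a single fine-tuning run; the values come from the range discussed above):

```python
# Hypothetical sweep over the learning-rate range discussed above.
def train_one_run(learning_rate: float, output_dir: str) -> None:
    ...  # placeholder: launch one fine-tuning run with this learning rate

for lr in [1e-3, 5e-3, 7e-3, 3e-2]:
    train_one_run(learning_rate=lr, output_dir=f"checkpoints/lr_{lr}")
```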
I have fully fine-tuned my model with LOMO. In more detail, I'm using bloomz-7b1-mt as the backbone and fine-tuning it on the Alpaca instruction dataset. I'm using my own data processing pipeline and just replaced the Trainer with your LOMOTrainer. However, the result I get when using your repository is quite bad: inference generates many messy characters, while full fine-tuning or LoRA the normal way does not, and I am sure those messy characters are not in my training dataset. I think the problem is in the optimizer and the training config. Could you take a look at my training scripts?
I'm not using your train.py directly; I just modified the data processing pipeline and replaced your 'DataArgument' and 'ModelArgument', but kept 'MyTrainingArgument' for the LOMOTrainer.
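For reference, a minimal sketch of the kind of Alpaca-style prompt formatting such a pipeline usually does (the field names follow the public Alpaca dataset; this is an illustration, not my actual code):

```python
# Illustrative Alpaca-style prompt builder (not the actual pipeline from this issue).
def build_prompt(example: dict) -> str:
    """Turn one Alpaca record (instruction / input / output) into a training string."""
    if example.get("input"):
        return (
            "Below is an instruction that describes a task, paired with an input that provides further context.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return (
        "Below is an instruction that describes a task.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )
```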
And my deepspeed config file:
The reason I don't add an evaluation dataset to track training directly is that instruction-following fine-tuning doesn't have a clear benchmark.
For all the important config values like learning rate, weight_decay, lr_scheduler_type, ... I took them from your sample config files, but I see that in args_lomo.yaml you don't set the optimizer, while in args_lomo_lora.yaml it is 'SGD'. So my understanding is: when training without LoRA, you use the default optimizer of Seq2SeqArguments, which is Adam, and when switching to LOMO + LoRA, it becomes SGD for the pretrained weights and AdamW for the LoRA weights. Am I understanding that right?

The final question: could you help me address the issue where full fine-tuning generates many messy characters, and help me optimize my training config? I'm using 2 A100 40GB GPUs.