bloomz-mt universal checkpoint #20
Comments
Thanks, it seems that option 3 is best for me.
------------------ Original ------------------
From: Niklas Muennighoff
Date: Fri, May 26, 2023 7:20 AM
To: bigscience-workshop/xmtf
Cc: LiuShixing, Author
Subject: Re: [bigscience-workshop/xmtf] bloomz-mt universal checkpoint (Issue #20)
I didn't create universal ckpts for them, unfortunately. The options I see are:
1. Try to get resources to train for 1 step so you can convert to a universal ckpt. Since PP=72 and TP=1, and you can drop the DP states, you need at least 72 GPUs (you will probably need some DP to fit it, though). You can set LR=0 for that step.
2. Figure out how to reshape the checkpoints or convert them to universal ckpts. I would hope that DeepSpeed has some code for this by now, but maybe not.
3. Restart from bloom. The models were trained for only 500 steps, so it's not too much effort.
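
The GPU count in option 1 follows from the usual 3D-parallelism rule that the number of GPUs is TP x PP x DP. Below is a minimal sketch of that arithmetic, comparing the layout the checkpoints were saved with against the TP=4, PP=12 layout requested in this issue. `min_gpus` is a hypothetical helper written for illustration, not part of Megatron-DeepSpeed, and it says nothing about whether the model fits in memory at a given layout.

```python
def min_gpus(tp: int, pp: int, dp: int = 1) -> int:
    """Return the GPU count for a 3D-parallel layout: TP x PP x DP."""
    if min(tp, pp, dp) < 1:
        raise ValueError("parallel degrees must be >= 1")
    return tp * pp * dp


# Layout the bloomz checkpoints were saved with (DP states dropped, so DP=1).
saved_layout = min_gpus(tp=1, pp=72)
# Layout requested in this issue (one data-parallel replica).
requested_layout = min_gpus(tp=4, pp=12)

print(saved_layout, requested_layout)  # 72 48
```

Because the two layouts differ in both TP and PP, the saved checkpoint cannot be loaded directly into the requested layout, which is why a universal checkpoint is needed.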
Hello!
Thanks a lot for your work!
I want to finetune bloomz-mt with your Megatron-DeepSpeed, but I cannot find a universal checkpoint for bloomz-mt or bloomz. I only found the bloom universal checkpoint below.
https://huggingface.co/bigscience/bloom-optimizer-states/tree/global_step95000_universal
With limited GPUs, I have to finetune with TP=4 and PP=12, but the document below advises against merging TP, so I am looking for a bloomz-mt universal checkpoint.
https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/finetune.md
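
For anyone restarting from the bloom universal checkpoint linked above, note that it lives on a branch of the Hub repository, so both the repository id and the revision have to be passed when downloading. The sketch below splits the URL into those two parts and, only if called explicitly, hands them to `huggingface_hub.snapshot_download`. The helper names `split_hub_tree_url` and `fetch_checkpoint` are hypothetical, and the download assumes `huggingface_hub` is installed and that enough disk space is available for a very large checkpoint.

```python
from urllib.parse import urlparse


def split_hub_tree_url(url: str) -> tuple[str, str]:
    """Split a huggingface.co '<org>/<repo>/tree/<revision>' URL into (repo_id, revision)."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) < 4 or parts[2] != "tree":
        raise ValueError(f"not a Hub tree URL: {url}")
    return "/".join(parts[:2]), parts[3]


def fetch_checkpoint(url: str, local_dir: str) -> str:
    """Download the branch named in the URL; not called here because of its size."""
    # Imported lazily so the URL helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    repo_id, revision = split_hub_tree_url(url)
    return snapshot_download(repo_id=repo_id, revision=revision, local_dir=local_dir)


URL = "https://huggingface.co/bigscience/bloom-optimizer-states/tree/global_step95000_universal"
print(split_hub_tree_url(URL))
```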