Pretraining Divergence #524
Comments
This is caused by flash attention. Please disable it and use the original self-attention. Also use the default training config; here is the batch setting for 4 A100 80GB GPUs: 30 batch size * 1024 block size * 4 grad accum * 4 GPUs = 491,520 tokens per iteration.
How do you disable flash attention? I can't find anything on the PyTorch website suggesting it is togglable.
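For context, nanoGPT chooses flash attention inside `CausalSelfAttention` in `model.py` rather than through a PyTorch-level switch, so the toggle lives in the model code. Here is a minimal sketch of that pattern; the `self.flash` attribute name follows nanoGPT's `model.py`, but treat the surrounding details as assumptions:

```python
import math
import torch
import torch.nn.functional as F

class ToyCausalAttention:
    """Simplified causal attention with a flash/manual toggle."""

    def __init__(self, use_flash=True):
        # nanoGPT sets: self.flash = hasattr(F, 'scaled_dot_product_attention')
        # Hard-coding this to False forces the "original" manual path.
        self.flash = use_flash and hasattr(F, "scaled_dot_product_attention")

    def __call__(self, q, k, v):
        # q, k, v: (batch, heads, seq_len, head_dim)
        if self.flash:
            return F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Manual causal self-attention (the pre-flash code path)
        T, hs = q.size(-2), q.size(-1)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(hs)
        mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
        att = att.masked_fill(~mask, float("-inf"))
        return F.softmax(att, dim=-1) @ v
```

Both paths compute the same function, so disabling flash attention means editing the `self.flash` assignment (or the `hasattr` check) in `CausalSelfAttention.__init__`, not changing anything in PyTorch itself.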
Is there a way to find the correct configuration for an arbitrary setup? Based on your comment and the original script, I'm not exactly sure when to alter the batch size vs. the grad accum. From iminfine's comment (4 A100s, one node): 30 batch size * 1024 block size * 4 grad accum * 4 GPUs = 491,520. From
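The arithmetic behind that setting is just a product, and the usual rule of thumb is that batch size and grad accum are interchangeable as long as their product (effective tokens per optimizer step) stays fixed. A small sketch of the trade-off; this is a general heuristic, not guidance specific to this repo:

```python
# Effective tokens consumed per optimizer step; divergence reports often
# trace back to changing one factor without compensating in another.
def tokens_per_iter(batch_size, block_size, grad_accum, n_gpus):
    return batch_size * block_size * grad_accum * n_gpus

# iminfine's setting: 30 * 1024 * 4 * 4 = 491,520 tokens/iter
assert tokens_per_iter(30, 1024, 4, 4) == 491520

# Shrinking the per-GPU batch size (e.g. for memory) while raising
# grad_accum keeps the optimization equivalent in expectation:
assert tokens_per_iter(24, 1024, 5, 4) == 491520
```

So the practical recipe is: pick the largest `batch_size` that fits in GPU memory, then set `grad_accum` so the product matches the target.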
I have been trying to follow the steps listed under "reproducing GPT-2" in the README.md. Unfortunately, whenever I run the model, training always diverges. I have tried varying my learning rate and gradient accumulation, but neither tactic worked, although I did have to fix a bug in the learning-rate code after changing those parameters. I could try changing those variables again, but my latest runs lead me to think that neither parameter is the issue:
Here are the last two runs. The orange run decays the learning rate over 300,000 steps while the pink run decays the learning rate over 600,000 steps. For these runs the learning rate starts at 6e-5 and hits its minimum at 6e-6.
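For reference, the schedule being varied here is nanoGPT's linear-warmup-plus-cosine decay. A sketch of that shape, with the function and parameter names following `train.py`'s `get_lr` and the defaults filled in from the numbers above; the exact warmup handling is an assumption:

```python
import math

def get_lr(it, learning_rate=6e-5, min_lr=6e-6,
           warmup_iters=2000, lr_decay_iters=300_000):
    # 1) Linear warmup from 0 up to learning_rate
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) After the decay window, hold at the floor
    if it > lr_decay_iters:
        return min_lr
    # 3) Cosine decay from learning_rate down to min_lr
    ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))  # goes 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)
```

The pink run above corresponds to stretching `lr_decay_iters` from 300,000 to 600,000 while leaving the endpoints (6e-5 down to 6e-6) unchanged.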
Here are some of my hyperparameters:
```python
batch_size = 24
block_size = 1024
max_iters = 300000
lr_decay_iters = 300000
eval_interval = 1000
eval_iters = 200
log_interval = 100
weight_decay = 5e-2
```
I am running this model on 4 A100 80GB GPUs.
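If it helps to compare setups, the README's reproduction path launches DDP with `torchrun`; for a 4-GPU node that would look roughly like the following. The config filename and override flags are assumptions based on nanoGPT's configurator, so check them against the repo:

```shell
# One node, 4 GPUs; overrides are passed as --key=value after the config file
torchrun --standalone --nproc_per_node=4 train.py config/train_gpt2.py \
    --batch_size=30 --gradient_accumulation_steps=16
```

Note that in recent versions of `train.py` the `gradient_accumulation_steps` value is divided across the DDP ranks, so 16 here would correspond to 4 micro-steps per GPU, matching the 4-grad-accum setting quoted above.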