Serious conclusion: LOMO does not significantly reduce GPU memory usage! #72
Comments
The author calls it mixed precision training, but it is not! How much memory LOMO itself saves needs to be strictly verified through experiments, rather than attributing the effect of lower precision and DeepSpeed to LOMO. The problem with adaLOMO is similar.
Thank you for expressing your interest in our work. I appreciate the opportunity to address your queries and provide further insights.
I hope these responses adequately address your concerns. Please review this email carefully, and feel free to ask any further questions. However, I kindly request that you thoroughly understand my replies regarding these specific queries before reiterating them.
I don't question the advantages of fused updates, but on their own they are far from achieving what the author claims, and that is the important point! Mixed precision training only reduces dynamic memory usage, which is also important. If you run normal mixed precision training with Hugging Face, you will observe different results. What the author's approach actually does is closer to quantization: it is set to 16 bits here, and if the default were set to 8 bits the memory would drop even further. You did not explain any of this in the paper, which also matters.
Mixed precision only reduces dynamic memory usage. Even for comparative experiments, I think the author should clearly tell readers which module reduces memory usage. With LLaMA-7B, a batch size of 1, and a sequence length of 512, training already occupies about 63 GB of memory.
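For concreteness, here is a rough static-memory accounting for a 7B-parameter model under the usual mixed-precision recipe; these are generic per-parameter byte counts, not measurements from the paper or this repository:

```python
# Back-of-the-envelope static memory for a 7B-parameter model.
# Figures are standard estimates, not measurements from the LOMO code.
params = 7e9
gb = 1e9

weights_fp16 = params * 2      # 14 GB: fp16 model weights
grads_fp16   = params * 2      # 14 GB: fp16 gradients (the part a fused update avoids holding all at once)
adamw_fp32   = params * 4 * 2  # 56 GB: AdamW momentum + variance kept in fp32
master_fp32  = params * 4      # 28 GB: fp32 master weights in classic mixed precision

print(f"weights        {weights_fp16 / gb:5.0f} GB")
print(f"gradients      {grads_fp16 / gb:5.0f} GB")
print(f"AdamW states   {adamw_fp32 / gb:5.0f} GB")
print(f"master weights {master_fp32 / gb:5.0f} GB")
# Activations ("dynamic" memory) come on top of this and depend on batch size,
# sequence length, and whether gradient checkpointing is enabled.
```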
The author should state how much memory the LOMO optimizer alone can save, rather than relying on global 16-bit precision and DeepSpeed. This is very confusing: if LOMO itself can do it, why use the other technologies? The code is here for anyone to try, and the conclusion is pretty clear.
I understand the author's idea very well, and the author does not need to explain it further. It is impossible to achieve the effect claimed in the paper by relying solely on the fused-update idea. Remove the global 16-bit settings, use the LOMO optimizer alone with the Hugging Face Trainer under mixed precision, modify its training logic to apply the author's idea, and this is easy to verify!
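For reference, a minimal sketch of that fused-update idea in plain PyTorch, using per-parameter hooks so the update happens during the backward pass and each gradient is freed immediately. This is an illustration only, not the authors' implementation: it uses plain SGD, omits gradient-norm handling, and assumes PyTorch >= 2.1 for `register_post_accumulate_grad_hook`.

```python
import torch

def _sgd_in_backward(param: torch.Tensor, lr: float) -> None:
    # Update this parameter as soon as its gradient is ready, then free the
    # gradient so full-model gradients never coexist in memory.
    with torch.no_grad():
        param.add_(param.grad, alpha=-lr)
    param.grad = None

def attach_fused_sgd(model: torch.nn.Module, lr: float = 1e-3) -> None:
    # Hypothetical helper: hooks every trainable parameter so that a training
    # step is just forward + loss.backward(), with no optimizer.step().
    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(
                lambda param, lr=lr: _sgd_in_backward(param, lr)
            )
```

With this attached, a training step is just `loss.backward()`; comparing its peak memory against a standard AdamW loop would isolate what the fused update alone saves.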
Alternatively, you should provide a clear comparative experiment showing which component is responsible for reducing memory.
For example, you should compare deepspeed+fp16+... against deepspeed+fp16+...+lomo, just as a paper that builds on BERT should report the results of plain BERT alongside BERT+whatever is added.
Thank you for your patient responses and the pleasant discussion; we just hope the results are rigorous rather than vague.
yes, we compared deepspeed+fp16+adamw and deepspeed+fp16+lomo, didn't we? @misonsky |
I thought the author would listen and make corrections, but quite the opposite. Which experimental results can support the author's conclusion? Did the author tell readers which module reduces memory? Is it torch.set_default_dtype(torch.float16), gradient checkpointing, or LOMO? Previous work has estimated that the computation graph (activations) accounts for more than 50% of memory, but the author's conclusion is exactly the opposite.
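One way to answer "which module reduces memory" is to toggle a single factor and record the peak allocation of one forward/backward pass. A minimal probe along those lines, where the checkpoint name, batch size, and sequence length are placeholders rather than the paper's exact setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Measure peak GPU memory for one forward/backward pass while toggling a
# single factor (here: gradient checkpointing). All settings are placeholders.
name = "huggyllama/llama-7b"
tok = AutoTokenizer.from_pretrained(name)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).cuda()
model.gradient_checkpointing_enable()  # comment this out to measure without it

batch = tok(["a test sentence"], return_tensors="pt", padding="max_length",
            max_length=512, truncation=True).to("cuda")
torch.cuda.reset_peak_memory_stats()
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")
```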
Through comparative experiments, we found that what really reduces GPU memory is `torch.set_default_dtype(torch.float16)` and DeepSpeed. We ran experiments with LLaMA-7B, using the following configuration to effectively disable DeepSpeed's optimizations:
```json
{
  "zero_optimization": {
    "stage": 0
  },
  "gradient_accumulation_steps": 1,
  "steps_per_print": 2000,
  "train_micro_batch_size_per_gpu": 1,
  "wall_clock_breakdown": false
}
```
When mixed precision is not enabled, the model output is still fp16, which is clearly abnormal. After checking, we found that `torch.set_default_dtype(torch.float16)` plays the key role. When we remove both DeepSpeed and `torch.set_default_dtype(torch.float16)` and run the default configuration on the WiC dataset, training runs out of memory on an 80 GB A100. After adding `torch.set_default_dtype(torch.float16)` back, memory drops to about 35 GB. Under normal mixed precision training, the author's LOMO still runs out of memory on the 80 GB A100.
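To make the claim about `torch.set_default_dtype(torch.float16)` concrete, here is a standalone demonstration (not code from this repository) that the global default dtype, rather than the optimizer, determines the precision and hence the size of freshly constructed weights:

```python
import torch

# Weights built under the fp32 default vs. after switching the global default
# to fp16: same layer, half the static memory.
layer_fp32 = torch.nn.Linear(4096, 4096)
print(layer_fp32.weight.dtype)  # torch.float32

torch.set_default_dtype(torch.float16)
layer_fp16 = torch.nn.Linear(4096, 4096)
print(layer_fp16.weight.dtype)  # torch.float16

mib = 2**20
print(layer_fp32.weight.numel() * layer_fp32.weight.element_size() // mib, "MiB")  # 64 MiB
print(layer_fp16.weight.numel() * layer_fp16.weight.element_size() // mib, "MiB")  # 32 MiB
```

Scaled to a 7B-parameter model, that is roughly the difference between 28 GB and 14 GB of weight storage before any optimizer state or activations are counted.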