Reproduce w/o model-parallel usability numbers #284
Comments
@g-karthik, sorry for the late response. In case this is still relevant to you, these numbers are for a GPT-2 style language model, though you should be able to run any model of similar size using ZeRO-1 and ZeRO-2. The exact model configurations, batch sizes, and GPU counts are given in the table in the appendix of our arXiv paper (https://arxiv.org/pdf/1910.02054.pdf); the sequence length is 1K. You can find more details in the paper itself. To reproduce the results, you can use our Megatron-LM tutorial (https://www.deepspeed.ai/tutorials/megatron/). Once you have Megatron-LM running with DeepSpeed, set the model parallelism degree to 1 to disable model parallelism, use the model configurations and GPU counts shown in that table, and enable ZeRO-2 in the DeepSpeed config file. Please note that to run a 13B parameter model with ZeRO-2, you would need around 128 GPUs with 32 GB memory each to partition the optimizer and gradient states. But you should be able to fit a 5-8B parameter model with as little as 16 GPUs.
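(As a rough illustration of that last step, here is a minimal sketch of writing out a ZeRO-2 DeepSpeed config for the tutorial. This is my own example, not the exact config behind the paper numbers; the batch size and bucket sizes are placeholder values, and model parallelism is disabled on the Megatron side via `--model-parallel-size 1`.)

```python
# Minimal sketch (not the exact config used for the paper results): write a
# ZeRO-2 DeepSpeed config to pass to the Megatron-LM tutorial scripts.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 8,     # placeholder value
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                          # partition optimizer states and gradients
        "contiguous_gradients": True,
        "reduce_bucket_size": 500000000,     # placeholder bucket sizes
        "allgather_bucket_size": 500000000,
    },
}

with open("ds_zero2_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```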
Hi @samyam, thanks for the detailed info! I did try to use the tutorial you linked earlier, and it seemed to work in some cases and failed in others. Back then, I faced this error in those other cases (pulled from my logs). Note that I did not create a symbolic link to the data as mentioned in the tutorial you linked; I put the actual data there directly. Do you know why this is happening? UPDATE: I tried running it again now and got the same FileExistsError (multiple rows showing it in the logs), but after the error the run eventually fails with a Socket Timeout.
Hi @g-karthik. I am surprised by the FileExistsError. The expected path for the webtext data is set in the script. Also, can you share the model-parallel (MP) and data-parallel (DP) degrees of your run? Are you able to run with MP=DP=1, which should fit your 0.3B parameter model?
Hi @tjruwase, yes, I'd modified the expected path to the webtext data in the line you pointed to so that it contains the path to my copy of the data. The MP degree is 2 and the DP degree is 8 when I get this FileExistsError followed by the Socket Timeout. See the command below:
I haven't tried the MP=DP=1 case with this script yet. I just kicked off the above once again and got the FileExistsError again, but the processes continue to run. I'm waiting to see if I get the Socket Timeout again (I suspect I will).
@g-karthik thanks for sharing those details. Another question: why are you using python to launch rather than the deepspeed launcher?
@tjruwase oh, that's just the contents of the example script. I do use the deepspeed launcher with my integration of DeepSpeed into my own codebase, outside of this repo.
@tjruwase just an update on the MP=1, DP=1 case you asked me to run: I set model-parallel-size=1 and GPUS_PER_NODE=1. It seems to be running fine; it hasn't thrown a FileExistsError or a Socket Timeout error.
@g-karthik thanks for sharing that update. So it seems we can assume the problem is related to distributed training. Can you enable NCCL debugging information for the distributed case? I believe you can do so by setting the NCCL_DEBUG environment variable. Also, I would suggest using the deepspeed launcher because it handles all the NCCL configuration issues for distributed training; it would be great to find out if it can fix the Socket Timeout error. Finally, for the distributed case I am curious whether all or just some of the processes report the FileExistsError. I see that this filename is created here. I have not looked closely at this code before, but I suspect this file is meant to be created only once per node.
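(A minimal sketch of what setting that variable could look like; this is my own illustration rather than a snippet from the thread, and in practice the variable is usually exported in the launch environment so that every rank inherits it.)

```python
# Sketch: enable NCCL debug logging before any NCCL communicator is created
# in the process; normally this is exported in the shell that launches the job.
import os

os.environ["NCCL_DEBUG"] = "INFO"        # print NCCL init/transport details
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"  # optional: log all NCCL subsystems

import deepspeed
deepspeed.init_distributed()  # NCCL reads the variables when it initializes
```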
Hope you're doing well! I'm having some trouble reproducing the 1K sequence length setting with Hugging Face's GPT-2 class. Specifically, I'm using 8 V100 GPUs (32 GB each) and the following deepspeed configuration:
I'm kicking these jobs off with python.
@tjruwase @jeffra @samyam @ShadenSmith also note that with the above config, I'm able to train with a sequence length of 512 just fine. I vaguely remember seeing somewhere in the DeepSpeed docs (or the paper) that you use 16 attention heads for the 1.5B model, whereas Hugging Face's version uses 25 attention heads. Assuming my memory is correct, I wonder whether your 16 attention head version was really a 1.5B model, or whether you compensated by increasing another dimension like the hidden size?
@g-karthik, I think you should be able to train a 1.5B model easily on a single 32GB GPU with cpu-offload. Can you please share a link to the GPT-2 script? I am curious whether you are passing an optimizer into deepspeed.initialize(). I am also interested in repro'ing the OOM.
@tjruwase yeah, I'm using this implementation of the Hugging Face GPT-2 class. No, I'm not passing an optimizer into deepspeed.initialize().
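(For readers following along, a minimal sketch of that pattern, i.e. letting DeepSpeed construct the optimizer from its config rather than passing one to deepspeed.initialize(). This is my own illustration with placeholder values, not the poster's script; the `config=` keyword and the `cpu_offload` key spelling vary across DeepSpeed versions.)

```python
# Sketch (not the poster's actual script): DeepSpeed builds and ZeRO-partitions the
# optimizer from the config; no optimizer= argument is passed to deepspeed.initialize().
import deepspeed
from transformers import GPT2Config, GPT2LMHeadModel

model = GPT2LMHeadModel(GPT2Config())  # default ~124M GPT-2, just for illustration

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2, "cpu_offload": True},  # key spelling varies by version
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,  # older releases take a JSON path via the launcher args instead
)
```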
I am unable to reproduce the claim in Section 10.4 of the ZeRO paper: "Fig. 4 shows that ZeRO-100B can train models with up to 13B parameters without MP on 128 GPUs, achieving throughput over 40 TFlops per GPU on average." I am using Hugging Face's GPT-2 implementation. I am also unable to fit a 1024 sequence length in my micro-batches, due to apparent deficiencies in deepspeed's activation checkpointing; see the relevant discussions in #598 (comment) and #598 (comment). It would be great if these could be addressed in a subsequent PR. Could you please help identify how I could hit 40 TFLOPS per GPU with the Hugging Face implementation, just as you seem to have hit that with the Megatron implementation with MP=1?
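(For context, a back-of-the-envelope way to compute the TFLOPS/GPU figure being discussed. This is my own sketch using the common 6·params·tokens estimate for forward plus backward, with roughly one extra forward pass when full activation recomputation is enabled; it is not a method taken from the thread or the paper.)

```python
# Rough achieved-throughput estimator (my own convention, not from the thread):
# forward+backward costs about 6*params FLOPs per token, plus roughly one extra
# forward pass (2*params per token) when full activation recomputation is enabled.
def achieved_tflops_per_gpu(params, global_batch, seq_len, iter_seconds, num_gpus,
                            activation_recompute=True):
    flops_per_token = (8 if activation_recompute else 6) * params
    total_flops = flops_per_token * global_batch * seq_len
    return total_flops / iter_seconds / num_gpus / 1e12

# Usage: plug in the iteration time measured from your own logs, e.g.
# achieved_tflops_per_gpu(8.3e9, global_batch=8 * 128, seq_len=1024,
#                         iter_seconds=measured_time, num_gpus=128)
```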
@ShadenSmith @tjruwase @jeffra have you had a chance to look at the above?
@ShadenSmith @tjruwase @jeffra @samyam Hey guys, I'd really appreciate a response on the specific points I raised above, because it seems that DeepSpeed CANNOT help achieve high scaling efficiency as claimed in the paper on a standard V100 cluster with high RDMA bandwidth unless these are addressed. I would greatly appreciate a quick response on this.
@g-karthik, I really apologize for the radio silence on this issue. Thanks for your patience. I am working on this now, trying to recover the configuration settings for the results. I will update you asap.
I refreshed myself a bit on the numbers, and perhaps I can now be of help. It sounds like you want to repro the 8B numbers of 46 TFLOPs/GPU. The configuration for that result on 128 GPUs is as follows.

GPT-2 config: hidden=3072, layers=72, attention-heads=24, batch-size=8

DS config:

```json
{
  "zero_optimization": {
    "stage": 2,
    "contiguous_gradients": true,
    "reduce_scatter": false,
    "reduce_bucket_size": 1000000000,
    "allgather_bucket_size": 200000000
  },
  "activation_checkpointing": {
    "partitioned_activations": false,
    "number_checkpoints": 1,
    "contiguous_memory_optimization": true,
    "cpu_checkpointing": false,
    "profile": true,
    "synchronize_checkpoint_boundary": true
  }
}
```

Please let us know how it goes.
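(A quick sanity check on the size of that model, my own arithmetic rather than anything stated in the thread: transformer parameters scale roughly as 12·layers·hidden², so the attention-head count does not change the total, only how the hidden dimension is split across heads. With hidden=3072 and 72 layers this lands at roughly 8B, matching the label above.)

```python
# Back-of-the-envelope parameter count for hidden=3072, layers=72 (my own estimate;
# biases and layernorms are ignored since they don't move the headline figure).
hidden, layers, vocab = 3072, 72, 50257

attention = 4 * hidden * hidden           # Q, K, V and output projections per layer
mlp = 2 * hidden * (4 * hidden)           # two 4x-wide feed-forward projections per layer
transformer = layers * (attention + mlp)  # = 12 * layers * hidden^2
embeddings = vocab * hidden               # token embedding table

print(f"~{(transformer + embeddings) / 1e9:.2f}B parameters")  # prints ~8.31B
```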
@tjruwase thank you so much for your response! As I mention above, I'm not using Megatron-style model parallelism, and the numbers I'm seeing are for Hugging Face-style GPT-2 models. What surprises me is the vast difference between your numbers and mine, when all you're doing is setting MP=1 (i.e., degree of model parallelism = 1). Do you have any thoughts on the specific points and references I report above in #284 (comment)? Given that y'all are supporting Hugging Face-style models via a direct integration into the HF library, performance on those models seems worth verifying directly. This is the config for a GPT-2 XL model on Hugging Face; you can instantiate the model straight from that config. I created an equivalent config JSON file for an 8B parameter version of GPT-2 by setting the model dimensions accordingly,
with the remaining config keys exactly the same as for the XL model I linked above. Then I passed the path to that new JSON when constructing the model and tried to train that version of GPT-2. This is where I see terrible TFLOPS/GPU. Can you please try reproing this?
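(For concreteness, a hypothetical version of such a model built with the Hugging Face classes. The poster's exact dimension values were not preserved in this thread, so the numbers below are assumptions that simply mirror the hidden=3072 / layers=72 / heads=24 Megatron configuration quoted earlier.)

```python
# Hypothetical ~8B GPT-2 built with Hugging Face classes; the dimensions are assumed
# (mirroring the Megatron config above), not the poster's actual values.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    n_embd=3072,       # hidden size
    n_layer=72,        # transformer layers
    n_head=24,         # attention heads (3072 is divisible by 24)
    n_positions=1024,  # max sequence length
)
model = GPT2LMHeadModel(config)  # note: ~33 GB of CPU RAM in fp32 just to instantiate
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```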
Hi @g-karthik, so sorry about our late replies and your issues here :( this is a tricky issue. One issue is that we haven't done much performance analysis on the Hugging Face implementation of GPT. I think in the medium term we would really love to dive into this deeper, as there are lots of folks using this implementation that could really benefit from various deepspeed features. In terms of the issues you're seeing w.r.t. performance, I suspect there could be non-deepspeed-related differences going on here that could be contributing to the low perf. I see that you were trying to use the Megatron version at one point earlier in this thread; did you ever get that running, or was data the main issue here? If data is an issue I would be happy to send you our small test dataset we use for performance tuning. If so, please email me or DM me on Twitter (see email/Twitter in my GitHub profile).
@jeffra @tjruwase I don't think the recent observations I report above have anything to do with non-DeepSpeed issues or even data issues. In fact, as you can see in #284 (comment), I actually link to detailed observations and discussions regarding apparent deficiencies in DeepSpeed's activation checkpointing when attempting to use it with no model parallelism. Yes, I originally tried to use the Megatron version for reproducibility, only because y'all used it to report your no-model-parallel (MP=1) numbers. Seeing as I was getting socket timeouts and FileExistsErrors when using a world size > 1, I eventually just gave up on that version of the code.
@tjruwase @jeffra I think it should be fairly straightforward to find out what kind of performance (TFLOPS/GPU and all-reduce times) you're seeing with larger Hugging Face models. You could even just use T5-11B with the HF integration. cc'ing @stas00 since I believe he has done some testing with T5-11B.
I'd be totally happy to do the performance analysis for HF - if you could just give me some guidelines on how to do it. And yes, I did experiments with T5-11B DS/HF. But let's finish the integration of ZeRO-3 first, and then count me in!
@g-karthik, I just wanted to chime in with my agreement on the importance of DeepSpeed maintaining high TFLOPS/GPU on HF models in general. I also think that both GPT-2 and T5-11B are exciting starting points once the ZeRO-3 integration is completed, thanks to the amazing work of @stas00.
@stas00 thanks for your help - when running T5-11B, just enable the flops profiler and wall-clock breakdown in your DS config JSON as described here: https://www.deepspeed.ai/features/#performance-analysis-and-debugging You can also check out #284 (comment) to see how I tested this for GPT-2 with a different setup. ZeRO-3 increases communication volume by 1.5x, so if ZeRO-2 itself is not giving good TFLOPS/GPU for HF models, then ZeRO-3 likely won't either. @tjruwase @jeffra can you please take a look at #598 (comment) and #598 (comment) and let me know what I am missing?
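(For anyone trying this, a sketch of the two profiling switches being referred to, based on the linked docs. This is my own fragment; the surrounding config and the exact defaults depend on your DeepSpeed version.)

```python
# Sketch: config fragment enabling DeepSpeed's flops profiler and wall-clock breakdown.
# Merge these keys into your existing DS config; the values here are illustrative.
profiling_fragment = {
    "wall_clock_breakdown": True,   # per-step timing of forward / backward / optimizer step
    "flops_profiler": {
        "enabled": True,
        "profile_step": 5,          # profile one step after warmup
        "module_depth": -1,         # profile modules at every depth
        "detailed": True,
    },
}
```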
@g-karthik, regarding this issue, quite a number of improvements have gone into activation checkpointing support. Can you please check which of your concerns remain?
@tjruwase I just installed the latest version of deepspeed, and now I see that activation checkpointing cannot be configured prior to distributed initialization any more, so I'm unable to test those flags now. Specifically, @samyam's recent addition means the configuration now requires distributed initialization to have already happened. And understandably so, because the previous approach to configuring deepspeed's activation checkpointing was to configure it PRIOR to initialization. Can you please fix this?
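(To make the ordering being discussed concrete, a minimal sketch of my own; it assumes a hypothetical ds_config.json path and is not code from the thread or from DeepSpeed's examples.)

```python
# Sketch of the ordering discussed above: distributed state is set up first, and only
# then is DeepSpeed activation checkpointing configured from the (hypothetical) config file.
import deepspeed

deepspeed.init_distributed()  # in newer releases this must come first

deepspeed.checkpointing.configure(
    mpu_=None,                          # no model parallelism (MP=1)
    deepspeed_config="ds_config.json",  # hypothetical path containing "activation_checkpointing"
)

# deepspeed.checkpointing.checkpoint(...) can then be used in place of
# torch.utils.checkpoint.checkpoint(...) inside the model's forward pass.
```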
Also @stas00 are you free to do some profiling with HF models now? |
At this moment I don't have any resources to dedicate to this, as I'm trying to figure out why bfloat16-pretrained models are not doing well on deepspeed or any other mixed-precision platform. But when I'm done, absolutely! Before we do that, though, we need to merge #910, and also #873, but that one breaks the transformers tests. I suggest making the next deepspeed release a performance-improvement release.
@tjruwase @jeffra @ShadenSmith this issue still remains. I ran two jobs with a Hugging Face GPT-2 model with DeepSpeed's activation checkpointing:
The embedding dimension I used was the same in both jobs. Job 1 ran fine; Job 2 threw an error.
Can you please help with this ASAP? I would greatly appreciate detailed and specific answers/explanations for #598 (comment) and #598 (comment). Also, how are PyTorch Lightning and DeepSpeedExamples able to use these flags without model parallelism (MP=1)? Does ZeRO-3 need to be enabled to get this to work?
@g-karthik, can you please share how to repro this failure? I could not repro it with default HF GPT-2 settings; it ran without problems for me.
@tjruwase I suspect you ran with those specific activation checkpointing args set differently from my failing job; change them to match the failing configuration above and you should be able to repro the error.
I've been using DeepSpeed successfully for my large-model training jobs. But this blog post says ZeRO-1 and ZeRO-2 power training of up to 6B and 13B parameter models, respectively, w/o model parallelism.
Where exactly is the code that validates/enables others to reproduce this claim? I want to see what model was used, what the batch size was, what the max sequence length was, etc., for this claim.