S3 checkpoints not working with distributed training on sagemaker #11390
Comments
cc @philschmid
Hey @laphang, could you please share your …
@laphang I tried to reproduce your error and for me it works using the following:

```python
# estimator
huggingface_estimator = HuggingFace(entry_point='run_glue.py',
                                    source_dir='./scripts',
                                    metric_definitions=metric_definitions,
                                    instance_type=instance_type,
                                    instance_count=instance_count,
                                    volume_size=volume_size,
                                    role=role,
                                    transformers_version='4.4.2',
                                    pytorch_version='1.6.0',
                                    checkpoint_s3_uri=f's3://{sess.default_bucket()}/checkpoints',
                                    py_version='py36',
                                    distribution=distribution,
                                    hyperparameters=hyperparameters,
                                    debugger_hook_config=False)
```

This estimator just extends the estimator from our 04_distributed_training_model_parallelism example and includes the `checkpoint_s3_uri`.
Reading your environment, it seems that you are not yet using the new Hugging Face Deep Learning Containers for Amazon SageMaker. Is that true, or have you updated them?
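The `distribution` object referenced in that estimator is not shown in the snippet. As a rough sketch (the parameter values are illustrative assumptions, not taken from this thread), a SageMaker model-parallel configuration of that shape generally looks like this:

```python
# Illustrative sketch only: a distribution config enabling SageMaker
# distributed model parallelism. The parameter values below are assumptions
# and would need to be tuned to the actual instance type and model.
distribution = {
    "smdistributed": {
        "modelparallel": {
            "enabled": True,
            "parameters": {
                "partitions": 2,      # number of model partitions (assumed)
                "microbatches": 4,    # pipeline micro-batches (assumed)
                "ddp": True,
            },
        }
    },
    "mpi": {
        "enabled": True,
        "processes_per_host": 8,      # e.g. one process per GPU (assumed)
    },
}
```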
@philschmid yeah, I made the changes below, switching from the PyTorch estimator to the HuggingFace one, and distributed training with S3 checkpoints is now working properly (the training job completes successfully, and all the checkpoints are synced to S3). It works both with SageMaker distributed model parallel and with torch.distributed.launch. Also, I just wanted to say that I was pleasantly surprised by how seamlessly Transformers works with SageMaker model parallel. Great work guys!
@laphang that's great news, and thank you for the kind words! 🤗
@philschmid I am getting the same error as @laphang was getting, even with the Hugging Face estimator. Only the first checkpoint is getting saved to the `checkpoint_s3_uri` location, and the rest don't appear in S3. After the end of the training job, it shows Uploading for an hour and then ends with the error "InternalServerError: We encountered an internal error. Please try again". This started when I added SageMaker distributed data parallel to the Hugging Face estimator. It has become a blocker for our model training, so any help would be really appreciated.

```python
distribution = {'smdistributed': {'dataparallel': {'enabled': True}}}

huggingface_estimator = HuggingFace(
```
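Purely as a hypothetical reconstruction (the entry point, role, bucket, instance count, and hyperparameters below are placeholders, not the reporter's actual values), the truncated estimator above would pair that data-parallel `distribution` with a `HuggingFace` estimator of the same shape as the one shown earlier in the thread:

```python
from sagemaker.huggingface import HuggingFace

# Hypothetical reconstruction, not the reporter's actual job definition:
# the earlier estimator with the distribution block swapped to SageMaker
# distributed data parallelism. All names and values are placeholders
# unless noted.
role = "arn:aws:iam::<account-id>:role/<sagemaker-execution-role>"  # placeholder
hyperparameters = {"epochs": 3, "per_device_train_batch_size": 8}   # placeholder

distribution = {"smdistributed": {"dataparallel": {"enabled": True}}}

huggingface_estimator = HuggingFace(
    entry_point="train.py",             # placeholder training script
    source_dir="./scripts",
    instance_type="ml.p4d.24xlarge",    # instance type mentioned in the thread
    instance_count=2,                   # placeholder
    role=role,
    transformers_version="4.4.2",       # versions taken from the earlier snippet
    pytorch_version="1.6.0",
    py_version="py36",
    distribution=distribution,
    checkpoint_s3_uri="s3://<bucket>/checkpoints",  # placeholder bucket
    hyperparameters=hyperparameters,
    debugger_hook_config=False,
)
```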
Hey @Harshitcmd, could you maybe share your training script?
It might be possible that you are saving your checkpoint in …
Hey @philschmid, thanks for replying. I have been saving my checkpoints into "check_dir": "/opt/ml/checkpoints". Before integrating data parallelism on p4d.24xlarge, I was using p3.2xlarge with the same training arguments, and there all the checkpoints were getting saved to S3 on the go. Please have a look at my training arguments: …
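The reporter's actual training arguments were cut off above. As a minimal sketch (all values here are assumptions, not the reporter's), the relevant part is pointing `TrainingArguments.output_dir` at `/opt/ml/checkpoints`, the local directory SageMaker mirrors to the estimator's `checkpoint_s3_uri`:

```python
from transformers import TrainingArguments

# Minimal sketch; values are assumptions, not the reporter's actual arguments.
# /opt/ml/checkpoints is the local directory SageMaker syncs to the
# estimator's checkpoint_s3_uri.
training_args = TrainingArguments(
    output_dir="/opt/ml/checkpoints",
    num_train_epochs=3,               # assumed
    per_device_train_batch_size=8,    # assumed
    save_steps=500,                   # checkpoint every 500 steps (assumed)
    save_total_limit=2,               # keep only the two most recent checkpoints
    logging_steps=100,                # assumed
)
```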
You might need to add `overwrite_output_dir=True if get_last_checkpoint(args.output_dir) is not None else False`, and to solve your upload issue you should save the model into …
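A sketch of how that suggestion could be wired into the training script (paths and values are assumptions, and the model/dataset/Trainer setup is omitted):

```python
import os

from transformers import TrainingArguments
from transformers.trainer_utils import get_last_checkpoint

# Sketch of the suggested pattern: resume from the newest checkpoint under
# /opt/ml/checkpoints if one exists, and only then allow the output
# directory to be overwritten.
output_dir = "/opt/ml/checkpoints"
last_checkpoint = get_last_checkpoint(output_dir) if os.path.isdir(output_dir) else None

training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True if last_checkpoint is not None else False,
    save_steps=500,  # assumed value
)

# Later, after building the Trainer (model and dataset omitted here):
# trainer.train(resume_from_checkpoint=last_checkpoint)
```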
Hey @philschmid, I tried adding overwrite_output_dir=True and it partially solved my issue. Now the checkpoints are in sync with S3 (all the checkpoints and model artifacts are getting saved at the desired location). But even though all the checkpoints got uploaded to S3, the job still showed the status as Uploading for an hour and ended with an internal error (weird). PS: when I didn't integrate data parallelism with the same instance type (p4d.24xlarge), everything worked seamlessly.
Is there any way to do this locally without the estimator? I am trying to train ModernBERT from scratch, which as far as I know requires the development version of transformers, which I have to install myself using …, and that seems to break the Estimator environment (pardon the bad description, I have a very shallow understanding of what the estimator actually does). So I am trying to set up my own container, simply passing the S3 URI to …
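Outside of a SageMaker training job nothing watches `/opt/ml/checkpoints`, so the sync to S3 would have to be done by the script itself. A rough, unofficial sketch (bucket and paths are placeholders; it assumes the AWS CLI and credentials are set up) is to shell out to `aws s3 sync`, which could also be wrapped in a `TrainerCallback`'s `on_save` hook so checkpoints are pushed as they are written:

```python
import subprocess

# Rough, unofficial sketch: sync a local checkpoint directory to S3 yourself
# when not running inside a SageMaker training job. Assumes the AWS CLI is
# installed and credentials are configured; bucket and paths are placeholders.
def sync_checkpoints(local_dir: str = "/opt/ml/checkpoints",
                     s3_uri: str = "s3://<bucket>/checkpoints") -> None:
    subprocess.run(["aws", "s3", "sync", local_dir, s3_uri], check=True)

if __name__ == "__main__":
    sync_checkpoints()
```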
Environment info
`transformers` version: 4.5.0

Who can help
@sgugger
Information
Model I am using (Bert, XLNet ...): gpt-neo
The problem arises when using:
The tasks I am working on are:
To reproduce
Steps to reproduce the behavior:
InternalServerError: We encountered an internal error. Please try again.
Expected behavior
I expected the training to work normally, and all the checkpoints and final model to get synced to the S3 location.
NB: training works when I don't use checkpoint_s3_uri (with both torch.distributed.launch and SageMaker distributed model parallel).
Also, with a single GPU (on a p3.2xlarge), training with checkpoint_s3_uri works: all the checkpoints and the final model are synced to S3.