
S3 checkpoints not working with distributed training on sagemaker #11390

Closed

laphang opened this issue Apr 23, 2021 · 12 comments


laphang commented Apr 23, 2021

Environment info

  • transformers version: 4.5.0
  • Platform: AWS Sagemaker
  • Python version: 3.6
  • PyTorch version (GPU?): 1.7.1
  • Tensorflow version (GPU?):
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Who can help

@sgugger

Information

Model I am using (Bert, XLNet ...): gpt-neo

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Use the run_clm.py example script to fine-tune gpt-neo on SageMaker with either torch.distributed.launch or SageMaker distributed model parallel (say, on a p4d.24xlarge with 8 GPUs).
  2. Only the first checkpoint is synced to the checkpoint_s3_uri location. Subsequent checkpoints do not appear in S3
  3. Also, at the end of the training job, it spends around 1 hour in the "Uploading" state and ends with the error below.

InternalServerError: We encountered an internal error. Please try again.

Expected behavior

I expected the training to work normally, and all the checkpoints and final model to get synced to the S3 location.

NB: training works when I don't use checkpoint_s3_uri (with both torch.distributed.launch and SageMaker distributed model parallel).

Also, with a single GPU (on a p3.2xlarge), training with checkpoint_s3_uri works: all the checkpoints and the final model are synced to S3.
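
For context, here is a minimal sketch of the kind of hyperparameters passed to run_clm.py (values are illustrative, not my exact settings); output_dir points at /opt/ml/checkpoints, the local directory that SageMaker syncs to checkpoint_s3_uri:

# Hyperparameters forwarded to run_clm.py as command-line arguments.
# Values are illustrative; /opt/ml/checkpoints is the default local path
# that SageMaker syncs to checkpoint_s3_uri.
hyperparameters = {
    'model_name_or_path': 'EleutherAI/gpt-neo-1.3B',  # assumed model size
    'dataset_name': 'wikitext',
    'dataset_config_name': 'wikitext-2-raw-v1',
    'do_train': True,
    'output_dir': '/opt/ml/checkpoints',
    'save_steps': 500,
    'per_device_train_batch_size': 2,
}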

laphang changed the title from "s3 checkpoints not working with distributed training on sagemaker" to "S3 checkpoints not working with distributed training on sagemaker" on Apr 23, 2021
sgugger (Collaborator) commented Apr 23, 2021

cc @philschmid

@philschmid (Contributor)

Hey @laphang,

Could you please share your estimator configuration? That would help debug and reproduce your problem. Thanks!

philschmid (Contributor) commented Apr 23, 2021

@laphang I tried to reproduce your error, and for me it works using the following HuggingFace estimator.

from sagemaker.huggingface import HuggingFace

# estimator
huggingface_estimator = HuggingFace(entry_point='run_glue.py',
                                    source_dir='./scripts',
                                    metric_definitions=metric_definitions,
                                    instance_type=instance_type,
                                    instance_count=instance_count,
                                    volume_size=volume_size,
                                    role=role,
                                    transformers_version='4.4.2',
                                    pytorch_version='1.6.0',
                                    checkpoint_s3_uri=f's3://{sess.default_bucket()}/checkpoints',
                                    py_version='py36',
                                    distribution=distribution,
                                    hyperparameters=hyperparameters,
                                    debugger_hook_config=False)

This estimator just extends the one from our 04_distributed_training_model_parallelism example and adds checkpoint_s3_uri.
[Screenshot, 2021-04-23 14:18:47]
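
For completeness, a minimal sketch of the distribution dict used for SageMaker model parallelism there (parameter values are illustrative; the exact values are in the linked example):

# Illustrative SageMaker model parallelism configuration; the values in the
# 04_distributed_training_model_parallelism example may differ.
distribution = {
    'smdistributed': {
        'modelparallel': {
            'enabled': True,
            'parameters': {
                'microbatches': 4,
                'placement_strategy': 'spread',
                'pipeline': 'interleaved',
                'optimize': 'speed',
                'partitions': 4,
                'ddp': True,
            },
        }
    },
    'mpi': {
        'enabled': True,
        'processes_per_host': 8,
    },
}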

Environment info

  • transformers version: 4.5.0
  • Platform: AWS Sagemaker
  • Python version: 3.6
  • PyTorch version (GPU?): 1.7.1
  • Tensorflow version (GPU?):
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Reading your environment info, it seems that you are not yet using the new Hugging Face Deep Learning Containers for Amazon SageMaker. Is that true, or have you updated them?

laphang (Author) commented Apr 23, 2021

@philschmid
Ah yes, I'm still using the PyTorchEstimator and installing transformers via requirements.txt. I'll try again with the HuggingFace Estimator and get back to you guys. Thanks for the quick response.

laphang (Author) commented Apr 26, 2021

@philschmid yeah, I made the changes below, switching from the PyTorch estimator to the HuggingFace one, and distributed training with S3 checkpoints is now working properly (the training job completes successfully, and all the checkpoints are synced to S3). It works both with SageMaker distributed model parallel and with torch.distributed.launch.

Also just wanted to say that I was pleasantly surprised with how seamlessly Transformers is working with SageMaker model parallel. Great work guys!

before:

from sagemaker.pytorch import PyTorch

estimator = PyTorch(base_job_name=job_name,
                    entry_point='run_clm.py',
                    source_dir=source_dir,
                    code_location=output_path,
                    role=role,
                    framework_version='1.7.1',
                    py_version='py3',
                    hyperparameters=hyperparameters,
                    tags=tags,
                    output_path=output_path,
                    checkpoint_s3_uri=checkpoint_path,
                    instance_count=1,
                    instance_type='ml.p4d.24xlarge',
                    distribution=distribution,
                    use_spot_instances=train_use_spot_instances,
                    max_run=train_max_run,
                    max_wait=train_max_wait,
                    metric_definitions=metric_definition)

after:

from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(base_job_name=job_name,
                        entry_point='run_clm.py',
                        source_dir=source_dir,
                        code_location=output_path,
                        role=role,
                        transformers_version='4.4.2',
                        pytorch_version='1.6.0',
                        py_version='py36',
                        hyperparameters=hyperparameters,
                        tags=tags,
                        output_path=output_path,
                        checkpoint_s3_uri=checkpoint_s3_uri,
                        debugger_hook_config=False,
                        instance_count=1,
                        instance_type='ml.p4d.24xlarge',
                        distribution=distribution,
                        use_spot_instances=train_use_spot_instances,
                        max_run=train_max_run,
                        max_wait=train_max_wait,
                        metric_definitions=metric_definition)

laphang closed this as completed on Apr 26, 2021
@philschmid (Contributor)

@laphang that is great news, and thank you for the kind words! 🤗
Should you have any questions or problems in the future, feel free to tag me directly in the issue.

@Harshitcmd

@philschmid I am getting the same error as @laphang was, even with the Hugging Face estimator. Only the first checkpoint gets saved to the checkpoint_s3_uri location; the rest don't appear in S3. At the end of the training job, it spends about an hour in the "Uploading" state and ends with the error "InternalServerError: We encountered an internal error. Please try again".

This started when I added SageMaker distributed data parallel to the Hugging Face estimator. It has become a blocker for our model training; any help would be really appreciated.

from sagemaker.huggingface import HuggingFace

distribution = {'smdistributed': {'dataparallel': {'enabled': True}}}

huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    sagemaker_session=sess,
    instance_type='ml.p4d.24xlarge',
    instance_count=1,
    volume_size=60,
    code_location=output_path,
    output_path=output_path,
    checkpoint_s3_uri=checkpoint_s3_uri,
    tensorboard_output_config=tensorboard_output_config,
    role=role,
    transformers_version='4.6.1',
    pytorch_version='1.7.1',
    py_version='py36',
    hyperparameters=hyperparameters,
    distribution=distribution
)

@philschmid (Contributor)

Hey @Harshitcmd,

Could you maybe share your training script? Which TrainingArguments are you using?

Regarding:

At the end of the training job, it spends about an hour in the "Uploading" state and ends with the error "InternalServerError: We encountered an internal error. Please try again".

It might be that you are saving your checkpoints in /opt/ml/model (which is uploaded to S3 after training), so the upload has to work through all the saved checkpoints.


Harshitcmd commented Oct 26, 2021

Hey @philschmid, thanks for replying.

I have been saving my checkpoints to "check_dir": "/opt/ml/checkpoints". Before integrating data parallelism on p4d.24xlarge, I was using a p3.2xlarge with the same training arguments, and there all the checkpoints were getting saved to S3 on the fly.

Please have a look at my training arguments.

from transformers import Trainer, TrainingArguments
from transformers.integrations import TensorBoardCallback

training_args = TrainingArguments(
    output_dir=args.check_dir,
    num_train_epochs=args.epochs,
    per_device_train_batch_size=args.train_batch_size,
    per_device_eval_batch_size=args.eval_batch_size,
    eval_accumulation_steps=1,
    warmup_ratio=args.warmup_steps,
    evaluation_strategy="no",
    logging_dir="/opt/ml/output/tensorboard/",
    learning_rate=float(args.learning_rate),
    save_total_limit=10,
    save_steps=200,
    logging_steps=20,
)

# create Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    callbacks=[TensorBoardCallback],
)

@philschmid
Copy link
Contributor

You might need to add overwrite_output_dir to your TrainingArguments

overwrite_output_dir (bool, optional, defaults to False) – If True, overwrite the content of the output directory. Use this to continue training if output_dir points to a checkpoint directory.
I added it, for example, like this:

overwrite_output_dir=True if get_last_checkpoint(args.output_dir) is not None else False,

And to solve your upload issue, you should save the final model to /opt/ml/model.
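
Roughly, a minimal sketch of that pattern (reusing args.check_dir and trainer from your snippet above; paths follow the SageMaker defaults):

import os

from transformers.trainer_utils import get_last_checkpoint

# Resume from the newest checkpoint in /opt/ml/checkpoints if one exists;
# otherwise start training from scratch.
last_checkpoint = get_last_checkpoint(args.check_dir) if os.path.isdir(args.check_dir) else None
trainer.train(resume_from_checkpoint=last_checkpoint)

# Save only the final model to /opt/ml/model (SM_MODEL_DIR); SageMaker
# uploads this directory to the job's output_path after training finishes.
trainer.save_model(os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))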

@Harshitcmd

Hey @philschmid,

I tried adding overwrite_output_dir=True, and it partially solved my issue. Now the checkpoints are in sync with S3 (all the checkpoints and model artifacts are saved to the desired location). But even though all the checkpoints got uploaded to S3, the job still showed the status as "Uploading" for an hour and ended with an internal error (weird).

PS: When I didn't use data parallelism on the same instance type (p4d.24xlarge), everything worked seamlessly.

@nicolaiberk

Is there any way to do this locally without the estimator? I am trying to train modernBERT from scratch, which afaik requires the development version of transformers, which I have to install myself using

pip install git+https://github.com/huggingface/transformers.git

which seems to break the Estimator environment (pardon for the bad description, I have a very shallow understanding of what the estimator actually does).

So I am trying to set up my own container, simply passing the s3 URI to get_last_checkpoint (and later trainer.save_model), but this results in FileNotFoundError. The weird thing is that the data loads fine from s3 using load_dataset, at least when using the RevisedDownloadConfig referenced here.
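
A minimal sketch of what I'm trying next, on the assumption that get_last_checkpoint only works on a local path (the bucket/prefix below are placeholders):

import os
import subprocess

from transformers.trainer_utils import get_last_checkpoint

# Placeholder locations -- not the real bucket/prefix.
checkpoint_s3_uri = "s3://my-bucket/checkpoints/"
local_checkpoint_dir = "/tmp/checkpoints"

# get_last_checkpoint() lists a local directory, so sync the S3 prefix
# down first (here via the AWS CLI).
os.makedirs(local_checkpoint_dir, exist_ok=True)
subprocess.run(
    ["aws", "s3", "sync", checkpoint_s3_uri, local_checkpoint_dir],
    check=True,
)

last_checkpoint = get_last_checkpoint(local_checkpoint_dir)
print("Resuming from:", last_checkpoint)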
