
S3 checkpoints not working with distributed training on sagemaker #11390

Closed

laphang opened this issue Apr 23, 2021 · 12 comments


laphang commented Apr 23, 2021

Environment info

  • transformers version: 4.5.0
  • Platform: AWS Sagemaker
  • Python version: 3.6
  • PyTorch version (GPU?): 1.7.1
  • Tensorflow version (GPU?):
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Who can help

@sgugger

Information

Model I am using (Bert, XLNet ...): gpt-neo

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Use the run_clm.py example script to fine-tune gpt-neo on SageMaker with either torch.distributed.launch or SageMaker distributed model parallel (say, on a p4d.24xlarge with 8 GPUs).
  2. Only the first checkpoint is synced to the checkpoint_s3_uri location. Subsequent checkpoints do not appear in S3
  3. Also, at the end of the training job, it spends around 1 hour in the "Uploading" state and ends with the error below.

InternalServerError: We encountered an internal error. Please try again.

Expected behavior

I expected the training to work normally, and all the checkpoints and final model to get synced to the S3 location.

NB: training works when I don't use checkpoint_s3_uri (with both torch.distributed.launch and SageMaker distributed model parallel).

Also, with a single GPU (on a p3.2xlarge), training with checkpoint_s3_uri works: all the checkpoints and the final model are synced to S3.
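
For context, here is a minimal sketch of the kind of hyperparameters passed to run_clm.py (values are illustrative, not my exact settings); output_dir points at /opt/ml/checkpoints, the local directory that SageMaker syncs to checkpoint_s3_uri:

# Hyperparameters forwarded to run_clm.py as command-line arguments.
# Values are illustrative; /opt/ml/checkpoints is the default local path
# that SageMaker syncs to checkpoint_s3_uri.
hyperparameters = {
    'model_name_or_path': 'EleutherAI/gpt-neo-1.3B',  # assumed model size
    'dataset_name': 'wikitext',
    'dataset_config_name': 'wikitext-2-raw-v1',
    'do_train': True,
    'output_dir': '/opt/ml/checkpoints',
    'save_steps': 500,
    'per_device_train_batch_size': 2,
}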

laphang changed the title from "s3 checkpoints not working with distributed training on sagemaker" to "S3 checkpoints not working with distributed training on sagemaker" on Apr 23, 2021
sgugger (Collaborator) commented Apr 23, 2021

cc @philschmid

@philschmid (Contributor)

Hey @laphang,

Could you please share your estimator configuration? That would help debug and reproduce your problem. Thanks!

philschmid (Contributor) commented Apr 23, 2021

@laphang I tried to reproduce your error, and for me it works using the following HuggingFace estimator.

from sagemaker.huggingface import HuggingFace

# estimator
huggingface_estimator = HuggingFace(entry_point='run_glue.py',
                                    source_dir='./scripts',
                                    metric_definitions=metric_definitions,
                                    instance_type=instance_type,
                                    instance_count=instance_count,
                                    volume_size=volume_size,
                                    role=role,
                                    transformers_version='4.4.2',
                                    pytorch_version='1.6.0',
                                    checkpoint_s3_uri=f's3://{sess.default_bucket()}/checkpoints',
                                    py_version='py36',
                                    distribution=distribution,
                                    hyperparameters=hyperparameters,
                                    debugger_hook_config=False)

This estimator just extends the one from our 04_distributed_training_model_parallelism example and adds checkpoint_s3_uri.
[Screenshot, 2021-04-23 14:18:47]
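
For completeness, a minimal sketch of the distribution dict used for SageMaker model parallelism there (parameter values are illustrative; the exact values are in the linked example):

# Illustrative SageMaker model parallelism configuration; the values in the
# 04_distributed_training_model_parallelism example may differ.
distribution = {
    'smdistributed': {
        'modelparallel': {
            'enabled': True,
            'parameters': {
                'microbatches': 4,
                'placement_strategy': 'spread',
                'pipeline': 'interleaved',
                'optimize': 'speed',
                'partitions': 4,
                'ddp': True,
            },
        }
    },
    'mpi': {
        'enabled': True,
        'processes_per_host': 8,
    },
}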

Environment info

  • transformers version: 4.5.0
  • Platform: AWS Sagemaker
  • Python version: 3.6
  • PyTorch version (GPU?): 1.7.1
  • Tensorflow version (GPU?):
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Reading your environment info, it seems that you are not yet using the new Hugging Face Deep Learning Containers for Amazon SageMaker. Is that true, or have you updated them?

laphang (Author) commented Apr 23, 2021

@philschmid
Ah yes, I'm still using the PyTorchEstimator and installing transformers via requirements.txt. I'll try again with the HuggingFace Estimator and get back to you guys. Thanks for the quick response.

laphang (Author) commented Apr 26, 2021

@philschmid yeah, I made the changes below, switching from the PyTorch estimator to the HuggingFace one, and distributed training with S3 checkpoints is now working properly (the training job completes successfully, and all the checkpoints are synced to S3). It works both with SageMaker distributed model parallel and with torch.distributed.launch.

Also just wanted to say that I was pleasantly surprised with how seamlessly Transformers is working with SageMaker model parallel. Great work guys!

before:

from sagemaker.pytorch import PyTorch

estimator = PyTorch(base_job_name=job_name,
                    entry_point='run_clm.py',
                    source_dir=source_dir,
                    code_location=output_path,
                    role=role,
                    framework_version='1.7.1',
                    py_version='py3',
                    hyperparameters=hyperparameters,
                    tags=tags,
                    output_path=output_path,
                    checkpoint_s3_uri=checkpoint_path,
                    instance_count=1,
                    instance_type='ml.p4d.24xlarge',
                    distribution=distribution,
                    use_spot_instances=train_use_spot_instances,
                    max_run=train_max_run,
                    max_wait=train_max_wait,
                    metric_definitions=metric_definition)

after:

from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(base_job_name=job_name,
                        entry_point='run_clm.py',
                        source_dir=source_dir,
                        code_location=output_path,
                        role=role,
                        transformers_version='4.4.2',
                        pytorch_version='1.6.0',
                        py_version='py36',
                        hyperparameters=hyperparameters,
                        tags=tags,
                        output_path=output_path,
                        checkpoint_s3_uri=checkpoint_s3_uri,
                        debugger_hook_config=False,
                        instance_count=1,
                        instance_type='ml.p4d.24xlarge',
                        distribution=distribution,
                        use_spot_instances=train_use_spot_instances,
                        max_run=train_max_run,
                        max_wait=train_max_wait,
                        metric_definitions=metric_definition)

laphang closed this as completed on Apr 26, 2021
@philschmid (Contributor)

@laphang that is great news, and thank you for the kind words! 🤗
Should you have any questions or problems in the future, feel free to tag me directly in the issue.

@Harshitcmd

@philschmid I am getting the same error as @laphang was, even with the Hugging Face estimator. Only the first checkpoint gets saved to the checkpoint_s3_uri location; the rest don't appear in S3. At the end of the training job, it spends about an hour in the "Uploading" state and ends with the error "InternalServerError: We encountered an internal error. Please try again".

This started when I added SageMaker distributed data parallel to the Hugging Face estimator. It has become a blocker for our model training; any help would be really appreciated.

from sagemaker.huggingface import HuggingFace

distribution = {'smdistributed': {'dataparallel': {'enabled': True}}}

huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    sagemaker_session=sess,
    instance_type='ml.p4d.24xlarge',
    instance_count=1,
    volume_size=60,
    code_location=output_path,
    output_path=output_path,
    checkpoint_s3_uri=checkpoint_s3_uri,
    tensorboard_output_config=tensorboard_output_config,
    role=role,
    transformers_version='4.6.1',
    pytorch_version='1.7.1',
    py_version='py36',
    hyperparameters=hyperparameters,
    distribution=distribution
)

@philschmid (Contributor)

Hey @Harshitcmd,

Could you maybe share your training script? Which TrainingArguments are you using?

Regarding:

At the end of the training job, it spends about an hour in the "Uploading" state and ends with the error "InternalServerError: We encountered an internal error. Please try again".

It might be that you are saving your checkpoints in /opt/ml/model (which is uploaded to S3 after training), so the upload has to work through all the saved checkpoints.


Harshitcmd commented Oct 26, 2021

Hey @philschmid, thanks for replying.

I have been saving my checkpoints to "check_dir": "/opt/ml/checkpoints". Before integrating data parallelism on p4d.24xlarge, I was using a p3.2xlarge with the same training arguments, and there all the checkpoints were getting saved to S3 on the fly.

Please have a look at my training arguments.

from transformers import Trainer, TrainingArguments
from transformers.integrations import TensorBoardCallback

training_args = TrainingArguments(
    output_dir=args.check_dir,
    num_train_epochs=args.epochs,
    per_device_train_batch_size=args.train_batch_size,
    per_device_eval_batch_size=args.eval_batch_size,
    eval_accumulation_steps=1,
    warmup_ratio=args.warmup_steps,
    evaluation_strategy="no",
    logging_dir="/opt/ml/output/tensorboard/",
    learning_rate=float(args.learning_rate),
    save_total_limit=10,
    save_steps=200,
    logging_steps=20,
)

# create Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    callbacks=[TensorBoardCallback],
)

@philschmid
Copy link
Contributor

You might need to add overwrite_output_dir to your TrainingArguments

overwrite_output_dir (bool, optional, defaults to False) – If True, overwrite the content of the output directory. Use this to continue training if output_dir points to a checkpoint directory.
I added it, for example, like this:

overwrite_output_dir=True if get_last_checkpoint(args.output_dir) is not None else False,

And to solve your upload issue, you should save the final model to /opt/ml/model.
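
Roughly, a minimal sketch of that pattern (reusing args.check_dir and trainer from your snippet above; paths follow the SageMaker defaults):

import os

from transformers.trainer_utils import get_last_checkpoint

# Resume from the newest checkpoint in /opt/ml/checkpoints if one exists;
# otherwise start training from scratch.
last_checkpoint = get_last_checkpoint(args.check_dir) if os.path.isdir(args.check_dir) else None
trainer.train(resume_from_checkpoint=last_checkpoint)

# Save only the final model to /opt/ml/model (SM_MODEL_DIR); SageMaker
# uploads this directory to the job's output_path after training finishes.
trainer.save_model(os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))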

@Harshitcmd

Hey @philschmid,

I tried adding overwrite_output_dir=True, and it partially solved my issue. Now the checkpoints are in sync with S3 (all the checkpoints and model artifacts are saved to the desired location). But even though all the checkpoints got uploaded to S3, the job still showed the status as "Uploading" for an hour and ended with an internal error (weird).

PS: When I didn't use data parallelism on the same instance type (p4d.24xlarge), everything worked seamlessly.

@nicolaiberk

Is there any way to do this locally without the estimator? I am trying to train modernBERT from scratch, which afaik requires the development version of transformers, which I have to install myself using

pip install git+https://github.com/huggingface/transformers.git

which seems to break the Estimator environment (pardon for the bad description, I have a very shallow understanding of what the estimator actually does).

So I am trying to set up my own container, simply passing the s3 URI to get_last_checkpoint (and later trainer.save_model), but this results in FileNotFoundError. The weird thing is that the data loads fine from s3 using load_dataset, at least when using the RevisedDownloadConfig referenced here.
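
A minimal sketch of what I'm trying next, on the assumption that get_last_checkpoint only works on a local path (the bucket/prefix below are placeholders):

import os
import subprocess

from transformers.trainer_utils import get_last_checkpoint

# Placeholder locations -- not the real bucket/prefix.
checkpoint_s3_uri = "s3://my-bucket/checkpoints/"
local_checkpoint_dir = "/tmp/checkpoints"

# get_last_checkpoint() lists a local directory, so sync the S3 prefix
# down first (here via the AWS CLI).
os.makedirs(local_checkpoint_dir, exist_ok=True)
subprocess.run(
    ["aws", "s3", "sync", checkpoint_s3_uri, local_checkpoint_dir],
    check=True,
)

last_checkpoint = get_last_checkpoint(local_checkpoint_dir)
print("Resuming from:", last_checkpoint)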
