Distributed training tutorial and general tutorial improvements (#87)
Weisu Yin authored Aug 28, 2023
1 parent 22a0abc commit 133848f
Showing 6 changed files with 178 additions and 68 deletions.
86 changes: 19 additions & 67 deletions docs/tutorials/autogluon-cloud.md
@@ -74,7 +74,7 @@ It should look similar to this: `INFO:sagemaker:Creating training-job with name:
Alternatively, you can go to the SageMaker console and find the ongoing training job and its corresponding job name.

```python
another_cloud_predictor = TabularCloudPredictor(cloud_output_path='YOUR_S3_BUCKET_PATH')
another_cloud_predictor = TabularCloudPredictor()
another_cloud_predictor.attach_job(job_name="JOB_NAME")
```

@@ -148,6 +148,24 @@ One key inside would be `endpoint`, and it will tell you the name of the endpoin
}
```

### Invoke the Endpoint without AutoGluon Cloud
The deployed endpoint is a normal SageMaker endpoint, and you can invoke it through other methods as well. For example, to invoke the endpoint directly with boto3:

```python
import boto3

client = boto3.client('sagemaker-runtime')
response = client.invoke_endpoint(
    EndpointName=ENDPOINT_NAME,
    ContentType='text/csv',
    Accept='application/json',
    Body=test_data.to_csv()
)

# Print the model endpoint's output.
print(response['Body'].read().decode())
```
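Since `Accept='application/json'` is requested, the response body is a JSON string. A minimal parsing sketch with a hypothetical payload (the actual schema depends on the deployed predictor):

```python
import json

# Hypothetical response body; the real schema depends on your predictor.
raw_body = '{"predictions": [0, 1, 0]}'
result = json.loads(raw_body)
print(result["predictions"])  # [0, 1, 0]
```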

## Batch Inference
When minimizing latency isn't a concern, the batch inference functionality may be easier, more scalable, and cheaper, as compute is automatically terminated after the batch inference job completes.

@@ -256,69 +274,3 @@ local_predictor = cloud_predictor.to_local_predictor(
```

`to_local_predictor()` downloads the tarball, extracts it to your local disk, and loads it as the corresponding AutoGluon predictor.

## Training/Inference with Image Modality
If your training and inference tasks involve an image modality, your data would contain a column with paths to the image files, e.g.

```python
feature_1 image label
0 1 image/train/train_1.png 0
1 2 image/train/train_1.png 1
```

### Preparing the Image Column
Currently, AutoGluon only supports one image per row.
If your dataset contains more than one image per row, you first need to preprocess the image column so it contains only the first image of each row.

For example, if your image paths are separated with `;`, you can preprocess the column via:

```python
# image_col is the column name containing the image path. In the example above, it would be `image`
train_data[image_col] = train_data[image_col].apply(lambda ele: ele.split(';')[0])
test_data[image_col] = test_data[image_col].apply(lambda ele: ele.split(';')[0])
```

Next, we update the paths to absolute paths.

For example, if your directory is similar to this:

```bash
.
└── current_working_directory/
├── train.csv
├── test.csv
└── images/
├── train/
│ └── train_1.png
└── test/
└── test_1.png
```

You can convert your image column to absolute paths via:

```python
import os

train_data[image_col] = train_data[image_col].apply(lambda path: os.path.abspath(path))
test_data[image_col] = test_data[image_col].apply(lambda path: os.path.abspath(path))
```

### Perform Training/Inference with Image Modality
Provide the argument `image_column` as the name of the column containing image paths to the `CloudPredictor` fit/inference APIs.
In the example above, `image_column` would be `image`.

```python
cloud_predictor.fit(..., image_column="IMAGE_COLUMN_NAME")
cloud_predictor.predict_real_time(..., image_column="IMAGE_COLUMN_NAME")
cloud_predictor.predict(..., image_column="IMAGE_COLUMN_NAME")
```

## Supported Docker Containers
`autogluon.cloud` supports AutoGluon Deep Learning Containers version 0.6.0 and newer.

### Use Custom Containers
Though not recommended, `autogluon.cloud` supports using a custom container by specifying `custom_image_uri`.

```python
cloud_predictor.fit(..., custom_image_uri="CUSTOM_IMAGE_URI")
cloud_predictor.predict_real_time(..., custom_image_uri="CUSTOM_IMAGE_URI")
cloud_predictor.predict(..., custom_image_uri="CUSTOM_IMAGE_URI")
```
39 changes: 39 additions & 0 deletions docs/tutorials/distributed-training.md
@@ -0,0 +1,39 @@
# AutoGluon Cloud Distributed Training
AutoGluon Cloud currently supports distributed training for Tabular.

`TabularPredictor` trains multiple folds of models under the hood and parallelizes model training on a single machine. It is natural to extend this strategy to a cluster of machines.

With AutoGluon Cloud, we help you spin up the cluster, dispatch the jobs, and tear down the cluster. It is not much different from how you would normally train a `TabularCloudPredictor`. All you need to do is specify `backend="ray_aws"` when you initialize the predictor:

```python
cloud_predictor = TabularCloudPredictor(
...,
backend="ray_aws"
)
```

Then call `fit()` as you normally would:

```python
cloud_predictor.fit(predictor_init_args=predictor_init_args, predictor_fit_args=predictor_fit_args)
```

## How to Control the Number of Instances in the Cluster
You can control how many instances are created in the cluster by passing `instance_count` to `fit()`. The default value is described in the following section.
```python
cloud_predictor.fit(..., instance_count=4)
```

### General Strategy on How to Decide `instance_count`

#### Non-HPO
By default, this value is determined by the number of folds (`num_bag_folds` in `TabularPredictor`). We launch as many instances as there are folds, so each fold is trained on a dedicated machine. The default should work most of the time.

You can of course lower this value to save cost, but doing so slows training, as a single instance will need to train multiple folds in parallel and split its resources. Setting a value larger than the number of folds is pointless, as we do not support distributed training of a single model.
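As a back-of-the-envelope sketch (the names below are illustrative only, not `autogluon.cloud` APIs): with 8 folds on 4 instances, each machine ends up training 2 folds concurrently.

```python
import math

# Illustrative arithmetic only; these variables are not autogluon.cloud APIs.
num_bag_folds = 8
instance_count = 4

# Each instance trains this many folds in parallel, splitting its resources.
folds_per_instance = math.ceil(num_bag_folds / instance_count)
print(folds_per_instance)  # 2
```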

#### HPO
When doing HPO, it's very hard to pre-determine how many instances to use. Therefore, we default to 1 instance, and you need to specify the number of instances you want to use. The fastest option would be to match the number of instances to the number of trials; however, this is often impractical, as HPO typically involves a large number of trials.

In general, the recommendation is to make the number of trials divisible by (#vcpus_per_instance * #instances). We evenly distribute resources to tasks; therefore, a non-divisible configuration would result in some resources not being utilized.

For a concrete example, suppose you want to do HPO with 128 trials. Choosing 8 `m5.2xlarge` instances (8 vCPUs each) makes the number of trials divisible by the total vCPU count: 128 / (8 * 8) = 2. This results in two batches, each with 64 jobs distributed across 64 vCPUs.
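The rule of thumb above can be sketched as a small helper (illustrative only, not part of `autogluon.cloud`):

```python
def hpo_batches(num_trials: int, instance_count: int, vcpus_per_instance: int) -> float:
    """Batches needed if each trial occupies one vCPU at a time.

    A whole number means trials divide evenly across the cluster's vCPUs;
    a fractional result means the last batch leaves some vCPUs idle.
    """
    total_vcpus = instance_count * vcpus_per_instance
    return num_trials / total_vcpus

# 128 trials on 8 m5.2xlarge instances (8 vCPUs each): exactly 2 full batches.
print(hpo_batches(128, 8, 8))  # 2.0
```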
51 changes: 51 additions & 0 deletions docs/tutorials/faq.md
@@ -0,0 +1,51 @@
# AutoGluon Cloud FAQ

## Supported Docker Containers
`autogluon.cloud` supports AutoGluon Deep Learning Containers version 0.6.0 and newer.

## How to Use Previous Versions of AutoGluon Containers
By default, `autogluon.cloud` fetches the latest version of the AutoGluon DLC. However, you can supply `framework_version` to fit/inference APIs to access previous versions, e.g.
```python
cloud_predictor.fit(..., framework_version="0.6")
```
It is always recommended to use the latest version as it has more features and up-to-date security patches.


## How to Build a Cloud Compatible Custom Container
If the official DLC doesn't meet your requirements, you can build your own container.

You can either build on top of our [DLC](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#autogluon-training-containers)
or refer to our [Dockerfiles](https://github.com/aws/deep-learning-containers/tree/master/autogluon).

## How to Use Custom Containers
Though not recommended, `autogluon.cloud` supports using a custom container by specifying `custom_image_uri`.

```python
cloud_predictor.fit(..., custom_image_uri="CUSTOM_IMAGE_URI")
cloud_predictor.predict_real_time(..., custom_image_uri="CUSTOM_IMAGE_URI")
cloud_predictor.predict(..., custom_image_uri="CUSTOM_IMAGE_URI")
```

If this custom image lives in your Amazon ECR repository, you would need to grant access permission to the IAM role used by the Cloud module.

## Running into Permission Issues
You can try to generate the necessary IAM permissions and trust relationship through:
```python
from autogluon.cloud import TabularCloudPredictor # Can be other CloudPredictor as well

TabularCloudPredictor.generate_default_permission(
    backend="BACKEND_YOU_WANT",  # We currently support "sagemaker" and "ray_aws"
    account_id="YOUR_ACCOUNT_ID",  # The AWS account ID you plan to use for CloudPredictor
    cloud_output_bucket="S3_BUCKET"  # S3 bucket name where intermediate artifacts and trained models will be saved. Create this bucket beforehand.
)
```

The util function above generates two JSON files describing the trust relationship and the IAM policy.
**Make sure you review those files and make necessary changes according to your use case before applying them.**

We recommend creating an IAM Role for your IAM User to assume, as an IAM Role doesn't have permanent long-term credentials and is used to interact with AWS services directly.
Refer to this [tutorial](https://aws.amazon.com/premiumsupport/knowledge-center/iam-assume-role-cli/) to

1. create the IAM Role with the trust relationship and IAM policy you generated above
2. set up the credentials
3. assume the role
54 changes: 54 additions & 0 deletions docs/tutorials/image-modality.md
@@ -0,0 +1,54 @@
# Training/Inference with Image Modality
If your training and inference tasks involve an image modality, your data would contain a column with paths to the image files, e.g.

```python
feature_1 image label
0 1 image/train/train_1.png 0
1 2 image/train/train_1.png 1
```

### Preparing the Image Column
Currently, AutoGluon Cloud only supports one image per row.
If your dataset contains more than one image per row, you first need to preprocess the image column so it contains only the first image of each row.

For example, if your image paths are separated with `;`, you can preprocess the column via:

```python
# image_col is the column name containing the image path. In the example above, it would be `image`
train_data[image_col] = train_data[image_col].apply(lambda ele: ele.split(';')[0])
test_data[image_col] = test_data[image_col].apply(lambda ele: ele.split(';')[0])
```
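To illustrate the split on a toy frame (hypothetical paths, assuming pandas is available):

```python
import pandas as pd

# Toy data where one cell holds two ;-separated paths; only the first is kept.
train_data = pd.DataFrame({"image": ["img/a.png;img/b.png", "img/c.png"]})
train_data["image"] = train_data["image"].apply(lambda ele: ele.split(";")[0])
print(train_data["image"].tolist())  # ['img/a.png', 'img/c.png']
```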

Next, we update the paths to absolute paths.

For example, if your directory is similar to this:

```bash
.
└── current_working_directory/
├── train.csv
├── test.csv
└── images/
├── train/
│ └── train_1.png
└── test/
└── test_1.png
```

You can convert your image column to absolute paths via:

```python
import os

train_data[image_col] = train_data[image_col].apply(lambda path: os.path.abspath(path))
test_data[image_col] = test_data[image_col].apply(lambda path: os.path.abspath(path))
```
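For instance, assuming the working directory layout shown above, `os.path.abspath` simply resolves a relative path against the current working directory:

```python
import os

# Hypothetical relative path matching the directory tree above.
rel = os.path.join("images", "train", "train_1.png")
abs_path = os.path.abspath(rel)
print(os.path.isabs(abs_path))  # True
print(abs_path.endswith(rel))   # True
```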

### Perform Training/Inference with Image Modality
Provide the argument `image_column` as the name of the column containing image paths to the `CloudPredictor` fit/inference APIs, along with the other arguments you would normally pass to a `CloudPredictor`.
In the example above, `image_column` would be `image`.

```python
cloud_predictor = TabularCloudPredictor(cloud_output_path="YOUR_S3_BUCKET_PATH")
cloud_predictor.fit(..., image_column="IMAGE_COLUMN_NAME")
cloud_predictor.predict_real_time(..., image_column="IMAGE_COLUMN_NAME")
cloud_predictor.predict(..., image_column="IMAGE_COLUMN_NAME")
```
12 changes: 12 additions & 0 deletions docs/tutorials/index.md
@@ -8,3 +8,15 @@

A tutorial on using AutoGluon Cloud module to train/deploy AutoGluon backed models on SageMaker.
:::

```{toctree}
---
maxdepth: 2
hidden: true
---
Essentials <autogluon-cloud>
Image Modality <image-modality>
Distributed Training <distributed-training>
FAQ <faq>
```
4 changes: 3 additions & 1 deletion src/autogluon/cloud/backend/ray_backend.py
@@ -187,6 +187,7 @@ def fit(
tune_data = predictor_fit_args.pop("tuning_data", None)
presets = predictor_fit_args.pop("presets", [])
num_bag_folds = predictor_fit_args.get("num_bag_folds", None)
hyperparameter_tune_kwargs = predictor_fit_args.get("hyperparameter_tune_kwargs", None)

if instance_count == "auto":
instance_count = num_bag_folds
@@ -197,9 +198,10 @@
and "high_quality" not in presets
and "good_quality" not in presets
and num_bag_folds is None
and hyperparameter_tune_kwargs is None
):
logger.warning(
f"Tabular Predictor will be trained without bagging hence not distributed, but you specified instance count > 1: {instance_count}."
f"Tabular Predictor will be trained without bagging nor HPO hence not distributed, but you specified instance count > 1: {instance_count}."
)
logger.warning("Will deploy cluster with 1 instance only to save costs")
instance_count = 1
