Distributed training tutorial and general tutorial improvements (#87), authored by Weisu Yin, Aug 28, 2023. Commit 133848f, 1 parent 22a0abc.
Showing 6 changed files with 178 additions and 68 deletions.
# AutoGluon Cloud Distributed Training
AutoGluon Cloud currently supports distributed training for Tabular.

`TabularPredictor` trains multiple folds of models under the hood and parallelizes model training on a single machine. It is natural to extend this strategy to a cluster of machines.

With AutoGluon Cloud, we help you spin up the cluster, dispatch the jobs, and tear down the cluster. The workflow is not much different from how you would normally train a `TabularCloudPredictor`. All you need to do is specify `backend="ray_aws"` when you initialize the predictor:

```python
cloud_predictor = TabularCloudPredictor(
    ...,
    backend="ray_aws"
)
```

Then call `fit()` as usual:

```python
cloud_predictor.fit(predictor_init_args=predictor_init_args, predictor_fit_args=predictor_fit_args)
```

## How to Control the Number of Instances in the Cluster
The default number of instances launched is described in the following section. You can control how many instances are created in the cluster by passing `instance_count` to `fit()`:
```python
cloud_predictor.fit(..., instance_count=4)
```

### General Strategy on How to Decide `instance_count`

#### Non-HPO
By default, this value is determined by the number of folds (`num_bag_folds` in `TabularPredictor`). We launch as many instances as there are folds, so each fold is trained on a dedicated machine. The default value should work most of the time.

You can of course lower this value to save budget, but training will be slower because a single instance then has to train multiple folds in parallel and split its resources between them. Setting a value larger than the number of folds is meaningless, as we do not support distributed training of a single model.
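The trade-off above can be sketched in a few lines. This helper is ours for illustration, not part of the AutoGluon Cloud API:

```python
# Hypothetical helper (not part of the AutoGluon Cloud API): pick an
# instance_count for non-HPO training under a budget cap. The default
# equals the number of bagged folds (one fold per machine); anything
# beyond the fold count would sit idle.
def choose_instance_count(num_bag_folds: int, max_instances: int) -> int:
    return min(num_bag_folds, max_instances)

print(choose_instance_count(8, 4))   # budget-capped: 4
print(choose_instance_count(8, 16))  # never exceeds the fold count: 8
```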
#### HPO
When doing HPO, it is very hard to pre-determine how many instances to use. Therefore, we default to 1 instance, and you need to specify the number of instances yourself. The fastest option would be to match the number of instances to the number of trials; however, this is usually impractical, as HPO typically involves a large number of trials.

In general, we recommend making the number of trials divisible by (#vcpus_per_instance * #instances). We distribute resources evenly across tasks; a non-divisible value would therefore leave some resources underutilized.

For example, suppose you want to do HPO with 128 trials. Choosing 8 `m5.2xlarge` instances (8 vCPUs each) makes the number of trials divisible by the total computing resources: 128 / (8 * 8) = 2. This results in two batches, each containing 64 jobs distributed across 64 vCPUs.
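The arithmetic above can be written out as follows, under the simplifying assumption that each trial occupies exactly one vCPU (the helper name is ours, not part of the API):

```python
import math

def hpo_batches(num_trials: int, num_instances: int, vcpus_per_instance: int) -> int:
    # Trials are spread one per vCPU; any remainder forms an extra,
    # partially filled batch in which some vCPUs sit idle.
    total_vcpus = num_instances * vcpus_per_instance
    return math.ceil(num_trials / total_vcpus)

# 128 trials on 8 m5.2xlarge instances (8 vCPUs each): two full batches of 64
print(hpo_batches(128, 8, 8))  # 2
```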
# AutoGluon Cloud FAQ

## Supported Docker Containers
`autogluon.cloud` supports AutoGluon Deep Learning Containers version 0.6.0 and newer.

## How to Use Previous Versions of AutoGluon Containers
By default, `autogluon.cloud` fetches the latest version of the AutoGluon DLC. However, you can supply `framework_version` to the fit/inference APIs to access previous versions, e.g.
```python
cloud_predictor.fit(..., framework_version="0.6")
```
We always recommend using the latest version, as it has more features and up-to-date security patches.

## How to Build a Cloud-Compatible Custom Container
If the official DLC doesn't meet your requirements and you would like to build your own container, you can either build on top of our [DLC](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#autogluon-training-containers) or refer to our [Dockerfiles](https://github.com/aws/deep-learning-containers/tree/master/autogluon).

## How to Use Custom Containers
Though not recommended, `autogluon.cloud` supports using your own custom containers by specifying `custom_image_uri`.

```python
cloud_predictor.fit(..., custom_image_uri="CUSTOM_IMAGE_URI")
cloud_predictor.predict_real_time(..., custom_image_uri="CUSTOM_IMAGE_URI")
cloud_predictor.predict(..., custom_image_uri="CUSTOM_IMAGE_URI")
```

If this custom image lives under a certain ECR repository, you need to grant access permission for it to the IAM role used by the Cloud module.

## Run into Permission Issues
You can generate the necessary IAM permissions and trust relationship with
```python
from autogluon.cloud import TabularCloudPredictor  # Can be any other CloudPredictor as well

TabularCloudPredictor.generate_default_permission(
    backend="BACKEND_YOU_WANT",  # We currently support "sagemaker" and "ray_aws"
    account_id="YOUR_ACCOUNT_ID",  # The AWS account ID you plan to use for the CloudPredictor
    cloud_output_bucket="S3_BUCKET"  # S3 bucket name where intermediate artifacts will be uploaded and trained models saved. You need to create this bucket beforehand.
)
```

The util function above produces two JSON files describing the trust relationship and the IAM policy.
**Make sure you review those files and make any changes necessary for your use case before applying them.**

We recommend creating an IAM role for your IAM user to assume, as an IAM role has no permanent long-term credentials and is used to interact directly with AWS services.
Refer to this [tutorial](https://aws.amazon.com/premiumsupport/knowledge-center/iam-assume-role-cli/) to

1. create the IAM role with the trust relationship and IAM policy you generated above,
2. set up the credentials, and
3. assume the role.
# Training/Inference with Image Modality
If your training and inference tasks involve an image modality, your data should contain a column with the path to the image file, e.g.

```python
   feature_1                    image  label
0          1  image/train/train_1.png      0
1          2  image/train/train_1.png      1
```

### Preparing the Image Column
Currently, AutoGluon Cloud supports only one image per row.
If your dataset contains more than one image per row, you first need to preprocess the image column so that it contains only the first image of each row.

For example, if your images are separated with `;`, you can preprocess them via:

```python
# image_col is the name of the column containing the image path. In the example above, it would be `image`
train_data[image_col] = train_data[image_col].apply(lambda ele: ele.split(';')[0])
test_data[image_col] = test_data[image_col].apply(lambda ele: ele.split(';')[0])
```
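As a runnable sketch, with a toy DataFrame standing in for your data (the column name and paths are made up):

```python
import pandas as pd

# Toy data: the first row holds two images separated by ';'
train_data = pd.DataFrame({"image": ["a.png;b.png", "c.png"]})
image_col = "image"

# Keep only the first image of each row
train_data[image_col] = train_data[image_col].apply(lambda ele: ele.split(';')[0])
print(train_data[image_col].tolist())  # ['a.png', 'c.png']
```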
Next, update the paths to absolute paths.

For example, if your directory looks like this:

```bash
.
└── current_working_directory/
    ├── train.csv
    ├── test.csv
    └── images/
        ├── train/
        │   └── train_1.png
        └── test/
            └── test_1.png
```

You can replace your image column with absolute paths via:

```python
import os

train_data[image_col] = train_data[image_col].apply(lambda path: os.path.abspath(path))
test_data[image_col] = test_data[image_col].apply(lambda path: os.path.abspath(path))
```
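A self-contained sketch of the same conversion, using a toy relative path (the file need not exist for `os.path.abspath` to resolve it against the current working directory):

```python
import os
import pandas as pd

test_df = pd.DataFrame({"image": ["images/test/test_1.png"]})  # toy relative path
test_df["image"] = test_df["image"].apply(os.path.abspath)
print(os.path.isabs(test_df["image"].iloc[0]))  # True
```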
### Performing Training/Inference with Image Modality
Pass the argument `image_column`, set to the name of the column containing the image paths, to the `CloudPredictor` fit/inference APIs along with the other arguments you would normally pass to a `CloudPredictor`.
In the example above, `image_column` would be `"image"`.

```python
cloud_predictor = TabularCloudPredictor(cloud_output_path="YOUR_S3_BUCKET_PATH")
cloud_predictor.fit(..., image_column="IMAGE_COLUMN_NAME")
cloud_predictor.predict_real_time(..., image_column="IMAGE_COLUMN_NAME")
cloud_predictor.predict(..., image_column="IMAGE_COLUMN_NAME")
```