Distributed training tutorial and general tutorial improvements (#87)
Weisu Yin authored Aug 28, 2023
1 parent 22a0abc commit 133848f
Showing 6 changed files with 178 additions and 68 deletions.
86 changes: 19 additions & 67 deletions docs/tutorials/autogluon-cloud.md
@@ -74,7 +74,7 @@ It should look similar to this: `INFO:sagemaker:Creating training-job with name:
Alternatively, you can go to the SageMaker console and find the ongoing training job and its corresponding job name.

```python
another_cloud_predictor = TabularCloudPredictor(cloud_output_path='YOUR_S3_BUCKET_PATH')
another_cloud_predictor = TabularCloudPredictor()
another_cloud_predictor.attach_job(job_name="JOB_NAME")
```

@@ -148,6 +148,24 @@ One key inside would be `endpoint`, and it will tell you the name of the endpoin
}
```

### Invoke the Endpoint without AutoGluon Cloud
The deployed endpoint is a normal SageMaker endpoint, and you can invoke it through other methods as well. For example, to invoke the endpoint directly with boto3:

```python
import boto3

client = boto3.client('sagemaker-runtime')
response = client.invoke_endpoint(
    EndpointName=ENDPOINT_NAME,
    ContentType='text/csv',
    Accept='application/json',
    Body=test_data.to_csv()
)

# Print the model endpoint's output.
print(response['Body'].read().decode())
```
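Since `Accept='application/json'` is requested, the response body is a JSON string. A minimal parsing sketch with a hypothetical payload (the actual schema depends on the deployed predictor):

```python
import json

# Hypothetical response body; the real schema depends on your predictor.
raw_body = '{"predictions": [0, 1, 0]}'
result = json.loads(raw_body)
print(result["predictions"])  # [0, 1, 0]
```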

## Batch Inference
When minimizing latency isn't a concern, the batch inference functionality may be easier, more scalable, and cheaper, as compute is automatically terminated after the batch inference job completes.

@@ -256,69 +274,3 @@ local_predictor = cloud_predictor.to_local_predictor(
```

`to_local_predictor()` downloads the tarball, extracts it to your local disk, and loads it as the corresponding AutoGluon predictor.

## Training/Inference with Image Modality
If your training and inference tasks involve an image modality, your data would contain a column with paths to the image files, e.g.

```python
feature_1 image label
0 1 image/train/train_1.png 0
1 2 image/train/train_1.png 1
```

### Preparing the Image Column
Currently, AutoGluon only supports one image per row.
If your dataset contains more than one image per row, you first need to preprocess the image column so it contains only the first image of each row.

For example, if your image paths are separated with `;`, you can preprocess the column via:

```python
# image_col is the column name containing the image path. In the example above, it would be `image`
train_data[image_col] = train_data[image_col].apply(lambda ele: ele.split(';')[0])
test_data[image_col] = test_data[image_col].apply(lambda ele: ele.split(';')[0])
```

Next, we update the paths to absolute paths.

For example, if your directory is similar to this:

```bash
.
└── current_working_directory/
├── train.csv
├── test.csv
└── images/
├── train/
│ └── train_1.png
└── test/
└── test_1.png
```

You can convert your image column to absolute paths via:

```python
import os

train_data[image_col] = train_data[image_col].apply(lambda path: os.path.abspath(path))
test_data[image_col] = test_data[image_col].apply(lambda path: os.path.abspath(path))
```

### Perform Training/Inference with Image Modality
Provide the argument `image_column` as the name of the column containing image paths to the `CloudPredictor` fit/inference APIs.
In the example above, `image_column` would be `image`.

```python
cloud_predictor.fit(..., image_column="IMAGE_COLUMN_NAME")
cloud_predictor.predict_real_time(..., image_column="IMAGE_COLUMN_NAME")
cloud_predictor.predict(..., image_column="IMAGE_COLUMN_NAME")
```

## Supported Docker Containers
`autogluon.cloud` supports AutoGluon Deep Learning Containers version 0.6.0 and newer.

### Use Custom Containers
Though not recommended, `autogluon.cloud` supports using a custom container by specifying `custom_image_uri`.

```python
cloud_predictor.fit(..., custom_image_uri="CUSTOM_IMAGE_URI")
cloud_predictor.predict_real_time(..., custom_image_uri="CUSTOM_IMAGE_URI")
cloud_predictor.predict(..., custom_image_uri="CUSTOM_IMAGE_URI")
```
39 changes: 39 additions & 0 deletions docs/tutorials/distributed-training.md
@@ -0,0 +1,39 @@
# AutoGluon Cloud Distributed Training
AutoGluon Cloud currently supports distributed training for Tabular.

`TabularPredictor` trains multiple folds of models under the hood and parallelizes model training on a single machine. It is natural to extend this strategy to a cluster of machines.

With AutoGluon Cloud, we help you spin up the cluster, dispatch the jobs, and tear down the cluster. It is not much different from how you would normally train a `TabularCloudPredictor`. All you need to do is specify `backend="ray_aws"` when you initialize the predictor:

```python
cloud_predictor = TabularCloudPredictor(
...,
backend="ray_aws"
)
```

Then call `fit()` as you normally would:

```python
cloud_predictor.fit(predictor_init_args=predictor_init_args, predictor_fit_args=predictor_fit_args)
```

## How to Control the Number of Instances in the Cluster
You can control how many instances are created in the cluster by passing `instance_count` to `fit()`. The default value is described in the following section.
```python
cloud_predictor.fit(..., instance_count=4)
```

### General Strategy on How to Decide `instance_count`

#### Non-HPO
By default, this value is determined by the number of folds (`num_bag_folds` in `TabularPredictor`). We launch as many instances as there are folds, so each fold is trained on a dedicated machine. The default should work most of the time.

You can of course lower this value to save cost, but doing so slows training, as a single instance will need to train multiple folds in parallel and split its resources. Setting a value larger than the number of folds is pointless, as we do not support distributed training of a single model.
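As a back-of-the-envelope sketch (the names below are illustrative only, not `autogluon.cloud` APIs): with 8 folds on 4 instances, each machine ends up training 2 folds concurrently.

```python
import math

# Illustrative arithmetic only; these variables are not autogluon.cloud APIs.
num_bag_folds = 8
instance_count = 4

# Each instance trains this many folds in parallel, splitting its resources.
folds_per_instance = math.ceil(num_bag_folds / instance_count)
print(folds_per_instance)  # 2
```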

#### HPO
When doing HPO, it's very hard to pre-determine how many instances to use. Therefore, we default to 1 instance, and you need to specify the number of instances you want to use. The fastest option would be to match the number of instances to the number of trials; however, this is often impractical, as HPO typically involves a large number of trials.

In general, the recommendation is to make the number of trials divisible by (#vcpus_per_instance * #instances). We evenly distribute resources to tasks; therefore, a non-divisible configuration would result in some resources not being utilized.

For a concrete example, suppose you want to do HPO with 128 trials. Choosing 8 `m5.2xlarge` instances (8 vCPUs each) makes the number of trials divisible by the total vCPU count: 128 / (8 * 8) = 2. This results in two batches, each with 64 jobs distributed across 64 vCPUs.
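The rule of thumb above can be sketched as a small helper (illustrative only, not part of `autogluon.cloud`):

```python
def hpo_batches(num_trials: int, instance_count: int, vcpus_per_instance: int) -> float:
    """Batches needed if each trial occupies one vCPU at a time.

    A whole number means trials divide evenly across the cluster's vCPUs;
    a fractional result means the last batch leaves some vCPUs idle.
    """
    total_vcpus = instance_count * vcpus_per_instance
    return num_trials / total_vcpus

# 128 trials on 8 m5.2xlarge instances (8 vCPUs each): exactly 2 full batches.
print(hpo_batches(128, 8, 8))  # 2.0
```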
51 changes: 51 additions & 0 deletions docs/tutorials/faq.md
@@ -0,0 +1,51 @@
# AutoGluon Cloud FAQ

## Supported Docker Containers
`autogluon.cloud` supports AutoGluon Deep Learning Containers version 0.6.0 and newer.

## How to Use Previous Versions of AutoGluon Containers
By default, `autogluon.cloud` fetches the latest version of the AutoGluon DLC. However, you can supply `framework_version` to fit/inference APIs to access previous versions, e.g.
```python
cloud_predictor.fit(..., framework_version="0.6")
```
It is always recommended to use the latest version as it has more features and up-to-date security patches.


## How to Build a Cloud Compatible Custom Container
If the official DLC doesn't meet your requirements, you can build your own container.

You can either build on top of our [DLC](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#autogluon-training-containers)
or refer to our [Dockerfiles](https://github.com/aws/deep-learning-containers/tree/master/autogluon).

## How to Use Custom Containers
Though not recommended, `autogluon.cloud` supports using a custom container by specifying `custom_image_uri`.

```python
cloud_predictor.fit(..., custom_image_uri="CUSTOM_IMAGE_URI")
cloud_predictor.predict_real_time(..., custom_image_uri="CUSTOM_IMAGE_URI")
cloud_predictor.predict(..., custom_image_uri="CUSTOM_IMAGE_URI")
```

If this custom image lives in your Amazon ECR repository, you would need to grant access permission to the IAM role used by the Cloud module.

## Running into Permission Issues
You can try to generate the necessary IAM permissions and trust relationship through:
```python
from autogluon.cloud import TabularCloudPredictor # Can be other CloudPredictor as well

TabularCloudPredictor.generate_default_permission(
    backend="BACKEND_YOU_WANT",  # We currently support "sagemaker" and "ray_aws"
    account_id="YOUR_ACCOUNT_ID",  # The AWS account ID you plan to use for CloudPredictor
    cloud_output_bucket="S3_BUCKET"  # S3 bucket name where intermediate artifacts and trained models will be saved. Create this bucket beforehand.
)
```

The util function above generates two JSON files describing the trust relationship and the IAM policy.
**Make sure you review those files and make necessary changes according to your use case before applying them.**

We recommend creating an IAM Role for your IAM User to assume, as an IAM Role doesn't have permanent long-term credentials and is used to interact with AWS services directly.
Refer to this [tutorial](https://aws.amazon.com/premiumsupport/knowledge-center/iam-assume-role-cli/) to

1. create the IAM Role with the trust relationship and IAM policy you generated above
2. set up the credentials
3. assume the role
54 changes: 54 additions & 0 deletions docs/tutorials/image-modality.md
@@ -0,0 +1,54 @@
# Training/Inference with Image Modality
If your training and inference tasks involve an image modality, your data would contain a column with paths to the image files, e.g.

```python
feature_1 image label
0 1 image/train/train_1.png 0
1 2 image/train/train_1.png 1
```

### Preparing the Image Column
Currently, AutoGluon Cloud only supports one image per row.
If your dataset contains more than one image per row, you first need to preprocess the image column so it contains only the first image of each row.

For example, if your image paths are separated with `;`, you can preprocess the column via:

```python
# image_col is the column name containing the image path. In the example above, it would be `image`
train_data[image_col] = train_data[image_col].apply(lambda ele: ele.split(';')[0])
test_data[image_col] = test_data[image_col].apply(lambda ele: ele.split(';')[0])
```
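To illustrate the split on a toy frame (hypothetical paths, assuming pandas is available):

```python
import pandas as pd

# Toy data where one cell holds two ;-separated paths; only the first is kept.
train_data = pd.DataFrame({"image": ["img/a.png;img/b.png", "img/c.png"]})
train_data["image"] = train_data["image"].apply(lambda ele: ele.split(";")[0])
print(train_data["image"].tolist())  # ['img/a.png', 'img/c.png']
```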

Next, we update the paths to absolute paths.

For example, if your directory is similar to this:

```bash
.
└── current_working_directory/
├── train.csv
├── test.csv
└── images/
├── train/
│ └── train_1.png
└── test/
└── test_1.png
```

You can convert your image column to absolute paths via:

```python
import os

train_data[image_col] = train_data[image_col].apply(lambda path: os.path.abspath(path))
test_data[image_col] = test_data[image_col].apply(lambda path: os.path.abspath(path))
```
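For instance, assuming the working directory layout shown above, `os.path.abspath` simply resolves a relative path against the current working directory:

```python
import os

# Hypothetical relative path matching the directory tree above.
rel = os.path.join("images", "train", "train_1.png")
abs_path = os.path.abspath(rel)
print(os.path.isabs(abs_path))  # True
print(abs_path.endswith(rel))   # True
```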

### Perform Training/Inference with Image Modality
Provide the argument `image_column` as the name of the column containing image paths to the `CloudPredictor` fit/inference APIs, along with the other arguments you would normally pass to a `CloudPredictor`.
In the example above, `image_column` would be `image`.

```python
cloud_predictor = TabularCloudPredictor(cloud_output_path="YOUR_S3_BUCKET_PATH")
cloud_predictor.fit(..., image_column="IMAGE_COLUMN_NAME")
cloud_predictor.predict_real_time(..., image_column="IMAGE_COLUMN_NAME")
cloud_predictor.predict(..., image_column="IMAGE_COLUMN_NAME")
```
12 changes: 12 additions & 0 deletions docs/tutorials/index.md
@@ -8,3 +8,15 @@

A tutorial on using AutoGluon Cloud module to train/deploy AutoGluon backed models on SageMaker.
:::

```{toctree}
---
maxdepth: 2
hidden: true
---
Essentials <autogluon-cloud>
Image Modality <image-modality>
Distributed Training <distributed-training>
FAQ <faq>
```
4 changes: 3 additions & 1 deletion src/autogluon/cloud/backend/ray_backend.py
@@ -187,6 +187,7 @@ def fit(
tune_data = predictor_fit_args.pop("tuning_data", None)
presets = predictor_fit_args.pop("presets", [])
num_bag_folds = predictor_fit_args.get("num_bag_folds", None)
hyperparameter_tune_kwargs = predictor_fit_args.get("hyperparameter_tune_kwargs", None)

if instance_count == "auto":
instance_count = num_bag_folds
@@ -197,9 +198,10 @@
and "high_quality" not in presets
and "good_quality" not in presets
and num_bag_folds is None
and hyperparameter_tune_kwargs is None
):
logger.warning(
f"Tabular Predictor will be trained without bagging hence not distributed, but you specified instance count > 1: {instance_count}."
f"Tabular Predictor will be trained without bagging nor HPO hence not distributed, but you specified instance count > 1: {instance_count}."
)
logger.warning("Will deploy cluster with 1 instance only to save costs")
instance_count = 1
