- Enabling SageMaker Debugger
- Configuring SageMaker Debugger
- Interactive Exploration
- SageMaker Studio
- TensorBoard Visualization
- Example Notebooks
There are two ways in which you can enable SageMaker Debugger while training on SageMaker.
We have equipped the official SageMaker Framework containers with custom builds of the supported frameworks (TensorFlow, PyTorch, MXNet, and XGBoost). These containers let you use SageMaker Debugger with no changes to your training script, by automatically adding SageMaker Debugger's Hook.
Here's the list of framework versions that support this experience.
Framework | Version |
---|---|
TensorFlow | 1.15, 2.1, 2.2 |
MXNet | 1.6 |
PyTorch | 1.4, 1.5 |
XGBoost | >=0.90-2 (as a built-in algorithm) |
More details on the containers these deep learning frameworks ship in can be found here: SageMaker Framework Containers and AWS Deep Learning Containers. You do not have to specify a training container image to use them on SageMaker; specifying one of the versions above is enough.
The `smdebug` library itself supports more versions than those listed above. If you want to use SageMaker Debugger with a different version, you will have to instrument your training script with a few extra lines. Before we discuss what these changes look like, let us take a look at the versions supported.
Framework | Versions |
---|---|
TensorFlow | 1.13, 1.14, 1.15, 2.1, 2.2 |
Keras (with TensorFlow backend) | 2.3 |
MXNet | 1.4, 1.5, 1.6 |
PyTorch | 1.2, 1.3, 1.4, 1.5 |
XGBoost | 0.90-2, 1.0-1 |
- Ensure that you are using a Python 3 runtime, as `smdebug` only supports Python 3.
- Install the `smdebug` binary through `pip install smdebug`.
- Make some minimal modifications to your training script to add SageMaker Debugger's Hook (a sketch follows this list). Please refer to the framework pages linked below for instructions on how to do that.
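As an illustrative sketch of the last step (PyTorch shown; the other frameworks have analogous hooks, and the framework pages document the exact APIs):

```python
import torch
import smdebug.pytorch as smd  # assumption: PyTorch; other frameworks are analogous

model = torch.nn.Linear(10, 2)
loss_fn = torch.nn.CrossEntropyLoss()

# Create the hook and register the model and loss so their tensors are
# captured. On SageMaker, the output path and save behavior are typically
# picked up from the DebuggerHookConfig instead of being hard-coded here.
hook = smd.Hook(out_dir="/opt/ml/output/tensors",
                save_config=smd.SaveConfig(save_interval=100))
hook.register_module(model)
hook.register_loss(loss_fn)
# ... the rest of the training loop remains unchanged ...
```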
Regardless of which of the two ways above you used to enable SageMaker Debugger, you can configure it using the SageMaker Python SDK. There are two aspects to this configuration.
- You can specify what tensors to save, when to save them, and in what form to save them.
- You can specify which Rule you want to monitor your training job with. This can be either a built-in rule that SageMaker provides, or a custom rule that you write yourself.
SageMaker Debugger gives you a powerful and flexible API to save the tensors you choose at the frequencies you want. These configurations are made available in the SageMaker Python SDK through the `DebuggerHookConfig` class. The example below saves several of SageMaker Debugger's built-in collections; learn more about these built-in collections here.
```python
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

hook_config = DebuggerHookConfig(
    s3_output_path='s3://smdebug-dev-demo-pdx/mnist',
    hook_parameters={
        "save_interval": 100
    },
    collection_configs=[
        CollectionConfig("weights"),
        CollectionConfig("gradients"),
        CollectionConfig("losses"),
        CollectionConfig(
            name="biases",
            parameters={
                "save_interval": 10,
                "end_step": 500
            }
        ),
    ]
)
```
```python
import sagemaker as sm

sagemaker_estimator = sm.tensorflow.TensorFlow(
    entry_point='src/mnist.py',
    role=sm.get_execution_role(),
    base_job_name='smdebug-demo-job',
    train_instance_count=1,
    train_instance_type="ml.m4.xlarge",
    framework_version="1.15",
    py_version="py3",
    # smdebug-specific arguments below
    debugger_hook_config=hook_config
)
sagemaker_estimator.fit()
```
You can also define your own collection of tensors, and you can choose to save only certain reductions of tensors instead of the full tensor, to reduce the amount of data saved. Please note that when you save reductions, unless you pass the flag `save_raw_tensor`, only these reductions will be available for analysis; the raw tensor will not be saved.
```python
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

hook_config = DebuggerHookConfig(
    s3_output_path='s3://smdebug-dev-demo-pdx/mnist',
    collection_configs=[
        CollectionConfig(
            name="activations",
            parameters={
                "include_regex": "relu|tanh",
                "reductions": "mean,variance,max,abs_mean,abs_variance,abs_max"
            })
    ]
)
```
```python
import sagemaker as sm

sagemaker_estimator = sm.tensorflow.TensorFlow(
    entry_point='src/mnist.py',
    role=sm.get_execution_role(),
    base_job_name='smdebug-demo-job',
    train_instance_count=1,
    train_instance_type="ml.m4.xlarge",
    framework_version="1.15",
    py_version="py3",
    # smdebug-specific arguments below
    debugger_hook_config=hook_config
)
sagemaker_estimator.fit()
```
SageMaker Debugger can automatically generate TensorBoard scalar summaries, distributions, and histograms for the tensors saved. This can be enabled by passing a `TensorBoardOutputConfig` object when creating an Estimator, as follows. You can also choose to enable or disable histograms for specific collections; by default a collection has the `save_histogram` flag set to `True`. Note that scalar summaries are added to TensorBoard for all `ScalarCollections` and for any scalar saved through `hook.save_scalar`, as sketched below. Refer to the API page for more details on scalar collections and the `save_scalar` method.
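As a minimal sketch (assuming PyTorch, and that the hook is recreated from the SageMaker-provided JSON config), a custom scalar can be saved from the training script like this:

```python
import smdebug.pytorch as smd  # assumption: PyTorch; other frameworks are analogous

# On SageMaker, the hook configured through DebuggerHookConfig can be
# recreated inside the container from the JSON config SageMaker writes.
hook = smd.Hook.create_from_json_file()

eval_accuracy = 0.92  # hypothetical value computed by your evaluation loop
# sm_metric=True additionally publishes the value as a SageMaker metric.
hook.save_scalar("eval_accuracy", eval_accuracy, sm_metric=True)
```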
The example below saves weights and gradients as full tensors, and also saves the gradients as histograms and distributions to visualize in TensorBoard. These are saved to the location passed in the `TensorBoardOutputConfig` object.
```python
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig, TensorBoardOutputConfig

hook_config = DebuggerHookConfig(
    s3_output_path='s3://smdebug-dev-demo-pdx/mnist',
    collection_configs=[
        CollectionConfig(
            name="weights",
            parameters={"save_histogram": False}),
        CollectionConfig(name="gradients"),
    ]
)

tb_config = TensorBoardOutputConfig('s3://smdebug-dev-demo-pdx/mnist/tensorboard')
```
```python
import sagemaker as sm

sagemaker_estimator = sm.tensorflow.TensorFlow(
    entry_point='src/mnist.py',
    role=sm.get_execution_role(),
    base_job_name='smdebug-demo-job',
    train_instance_count=1,
    train_instance_type="ml.m4.xlarge",
    framework_version="1.15",
    py_version="py3",
    # smdebug-specific arguments below
    debugger_hook_config=hook_config,
    tensorboard_output_config=tb_config
)
sagemaker_estimator.fit()
```
For more details, refer to our API page.
Here are some examples of how to run Rules with your training jobs.
Note that passing a `CollectionConfig` object to the Rule as `collections_to_save` is equivalent to passing it to the `DebuggerHookConfig` object as `collection_configs`; this is just a shortcut for your convenience.
The Built-in Rules, or SageMaker Rules, are described in detail on this page. Each rule applies to one of the following scopes of validity (the rules in each scope are listed on that page):
- Generic deep learning models (TensorFlow, Apache MXNet, and PyTorch)
- Generic deep learning models (TensorFlow, MXNet, and PyTorch) and the XGBoost algorithm
- Deep learning applications
- The XGBoost algorithm
You can run a SageMaker built-in Rule as follows, using the `Rule.sagemaker` method. The first argument to this method is the base configuration associated with the Rule; we pre-configure these as much as possible. You can take a look at the rule configs that we populate for all built-in rules here, and you can customize them through the other parameters. These rules run on our pre-built Docker images, which are listed here; you are not charged for the instances when running SageMaker built-in rules. All of the built-in rules are listed on the page linked above.
```python
from sagemaker.debugger import Rule, CollectionConfig, rule_configs

exploding_tensor_rule = Rule.sagemaker(
    base_config=rule_configs.exploding_tensor(),
    rule_parameters={"collection_names": "weights,losses"},
    collections_to_save=[
        CollectionConfig("weights"),
        CollectionConfig("losses")
    ]
)

vanishing_gradient_rule = Rule.sagemaker(
    base_config=rule_configs.vanishing_gradient()
)
```
```python
import sagemaker as sm

sagemaker_estimator = sm.tensorflow.TensorFlow(
    entry_point='src/mnist.py',
    role=sm.get_execution_role(),
    base_job_name='smdebug-demo-job',
    train_instance_count=1,
    train_instance_type="ml.m4.xlarge",
    framework_version="1.15",
    py_version="py3",
    # smdebug-specific arguments below
    rules=[exploding_tensor_rule, vanishing_gradient_rule],
)
sagemaker_estimator.fit()
```
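Once the job is running (or finished), you can check the status of each attached rule evaluation from the estimator; a minimal sketch, assuming the SageMaker Python SDK's `rule_job_summary` helper:

```python
# Each entry describes one rule evaluation job attached to the training job.
for summary in sagemaker_estimator.latest_training_job.rule_job_summary():
    print(summary["RuleConfigurationName"], summary["RuleEvaluationStatus"])
```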
You can write your own rule, custom-made for your application, and provide it so SageMaker can monitor your training job using it. To do so, you need to understand the programming model that `smdebug` provides. Our page on Programming Model for Analysis describes the APIs we provide to help you write your own rule.
Please refer to this example notebook for a demonstration of creating your custom rule and running it on SageMaker.
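As an illustrative sketch (not the notebook's implementation), a custom rule subclasses `smdebug`'s `Rule` class and implements `invoke_at_step`; the `ImproperActivation` class and its logic below are hypothetical, chosen to match the example that follows:

```python
# rules/custom_rules.py (hypothetical file matching the example below)
from smdebug.rules.rule import Rule


class ImproperActivation(Rule):
    def __init__(self, base_trial, collection_names="relu_activations"):
        super().__init__(base_trial)
        self.collection_names = collection_names.split(",")

    def invoke_at_step(self, step):
        # Returning True marks the rule condition as met at this step.
        for coll in self.collection_names:
            for tname in self.base_trial.tensor_names(collection=coll):
                values = self.base_trial.tensor(tname).value(step)
                if (values <= 0).all():  # e.g. a ReLU that never activates
                    return True
        return False
```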
To run a custom rule, you have to provide a few additional parameters. Key parameters to note are a file containing the implementation of your Rule class (`source`), the name of the Rule class (`rule_to_invoke`), the type of instance to run the Rule job on (`instance_type`), the size of the volume on that instance (`volume_size_in_gb`), and the Docker image to use for running this job (`image_uri`). Please refer to the documentation here for more details.
We have pre-built Docker images that you can use to run your custom rules. These are listed here. You can also choose to build your own Docker image for custom rule evaluation. Please refer to the repository SageMaker Debugger Rules Container for instructions on how to build such an image.
```python
from sagemaker.debugger import Rule, CollectionConfig

custom_coll = CollectionConfig(
    name="relu_activations",
    parameters={
        "include_regex": "relu",
        "save_interval": 500,
        "end_step": 5000
    })

improper_activation_rule = Rule.custom(
    name='improper_activation_job',
    image_uri='552407032007.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-debugger-rule-evaluator:latest',
    instance_type='ml.c4.xlarge',
    volume_size_in_gb=400,
    source='rules/custom_rules.py',
    rule_to_invoke='ImproperActivation',
    rule_parameters={"collection_names": "relu_activations"},
    collections_to_save=[custom_coll]
)
```
```python
import sagemaker as sm

sagemaker_estimator = sm.tensorflow.TensorFlow(
    entry_point='src/mnist.py',
    role=sm.get_execution_role(),
    base_job_name='smdebug-demo-job',
    train_instance_count=1,
    train_instance_type="ml.m4.xlarge",
    framework_version="1.15",
    py_version="py3",
    # smdebug-specific arguments below
    rules=[improper_activation_rule],
)
sagemaker_estimator.fit()
```
For more details, refer to our Analysis page.
The `smdebug` SDK also allows you to perform interactive and real-time exploration of the data saved. You can choose to inspect the saved tensors or visualize them through custom plots. You can retrieve these tensors as NumPy arrays, allowing you to use your favorite analysis libraries right in a SageMaker notebook instance, as sketched below. We have a couple of example notebooks demonstrating this.
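A minimal sketch of this workflow using `smdebug`'s trial API (the S3 path matches the examples above):

```python
from smdebug.trials import create_trial

# Point a trial at the s3_output_path used in the DebuggerHookConfig above.
trial = create_trial("s3://smdebug-dev-demo-pdx/mnist")

print(trial.tensor_names())  # every tensor that was saved
print(trial.steps())         # steps at which data was saved

# Retrieve one tensor's values as numpy arrays, step by step.
loss = trial.tensor(trial.tensor_names(collection="losses")[0])
for step in loss.steps():
    print(step, loss.value(step))
```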
SageMaker Debugger is on by default for supported training jobs run on the official SageMaker Framework containers (or AWS Deep Learning Containers). In this default scenario, SageMaker Debugger takes the losses and metrics from your training job and publishes them to SageMaker Metrics, allowing you to track them in SageMaker Studio. You can also see the status of any Rules you have enabled for your training job right in Studio. Here are screenshots of that experience.
If you have enabled TensorBoard outputs for your training job through SageMaker Debugger, TensorBoard artifacts will automatically be generated for the tensors saved. You can then point your TensorBoard instance to that S3 location and review the visualizations for the tensors saved.
We have a bunch of example notebooks here demonstrating different aspects of SageMaker Debugger.