- Enabling SageMaker Debugger
- Configuring SageMaker Debugger
- Interactive Exploration
- SageMaker Studio
- TensorBoard Visualization
- Example Notebooks
There are two ways in which you can enable SageMaker Debugger while training on SageMaker.
We have equipped the official SageMaker Framework containers with custom builds of the supported frameworks (TensorFlow, PyTorch, MXNet, and XGBoost). These containers let you use SageMaker Debugger with no changes to your training script, by automatically adding SageMaker Debugger's Hook.
Here's the list of framework versions that support this experience.
Framework | Version |
---|---|
TensorFlow | 1.15, 2.1, 2.2 |
MXNet | 1.6 |
PyTorch | 1.4, 1.5 |
XGBoost | >=0.90-2 (as a built-in algorithm) |
More details on the containers these deep learning frameworks ship in can be found here: SageMaker Framework Containers and AWS Deep Learning Containers. You do not have to specify a training container image to use them on SageMaker; specifying one of the versions above is enough.
The `smdebug` library itself supports more versions than those listed above. If you want to use SageMaker Debugger with a different version, you will have to instrument your training script with a few extra lines. Before we discuss what these changes look like, let us take a look at the versions supported.
Framework | Versions |
---|---|
TensorFlow | 1.13, 1.14, 1.15, 2.1, 2.2 |
Keras (with TensorFlow backend) | 2.3 |
MXNet | 1.4, 1.5, 1.6 |
PyTorch | 1.2, 1.3, 1.4, 1.5 |
XGBoost | 0.90-2, 1.0-1 |
- Ensure that you are using a Python 3 runtime, as `smdebug` only supports Python 3.
- Install the `smdebug` binary through `pip install smdebug`.
- Make some minimal modifications to your training script to add SageMaker Debugger's Hook (a sketch follows this list). Please refer to the framework pages linked below for instructions on how to do that.
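As an illustrative sketch of the last step (PyTorch shown; the other frameworks have analogous hooks, and the framework pages document the exact APIs):

```python
import torch
import smdebug.pytorch as smd  # assumption: PyTorch; other frameworks are analogous

model = torch.nn.Linear(10, 2)
loss_fn = torch.nn.CrossEntropyLoss()

# Create the hook and register the model and loss so their tensors are
# captured. On SageMaker, the output path and save behavior are typically
# picked up from the DebuggerHookConfig instead of being hard-coded here.
hook = smd.Hook(out_dir="/opt/ml/output/tensors",
                save_config=smd.SaveConfig(save_interval=100))
hook.register_module(model)
hook.register_loss(loss_fn)
# ... the rest of the training loop remains unchanged ...
```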
Regardless of which of the two ways above you used to enable SageMaker Debugger, you can configure it using the SageMaker Python SDK. There are two aspects to this configuration.
- You can specify what tensors to save, when to save them, and in what form to save them.
- You can specify which Rule you want to monitor your training job with. This can be either a built-in rule that SageMaker provides, or a custom rule that you write yourself.
SageMaker Debugger gives you a powerful and flexible API to save the tensors you choose at the frequencies you want. These configurations are made available in the SageMaker Python SDK through the `DebuggerHookConfig` class. The example below saves several of SageMaker Debugger's built-in collections; learn more about these built-in collections here.
```python
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

hook_config = DebuggerHookConfig(
    s3_output_path='s3://smdebug-dev-demo-pdx/mnist',
    hook_parameters={
        "save_interval": 100
    },
    collection_configs=[
        CollectionConfig("weights"),
        CollectionConfig("gradients"),
        CollectionConfig("losses"),
        CollectionConfig(
            name="biases",
            parameters={
                "save_interval": 10,
                "end_step": 500
            }
        ),
    ]
)
```
```python
import sagemaker as sm

sagemaker_estimator = sm.tensorflow.TensorFlow(
    entry_point='src/mnist.py',
    role=sm.get_execution_role(),
    base_job_name='smdebug-demo-job',
    train_instance_count=1,
    train_instance_type="ml.m4.xlarge",
    framework_version="1.15",
    py_version="py3",
    # smdebug-specific arguments below
    debugger_hook_config=hook_config
)
sagemaker_estimator.fit()
```
You can also define your own collection of tensors, and you can choose to save only certain reductions of tensors instead of the full tensor, to reduce the amount of data saved. Please note that when you save reductions, unless you pass the flag `save_raw_tensor`, only these reductions will be available for analysis; the raw tensor will not be saved.
```python
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

hook_config = DebuggerHookConfig(
    s3_output_path='s3://smdebug-dev-demo-pdx/mnist',
    collection_configs=[
        CollectionConfig(
            name="activations",
            parameters={
                "include_regex": "relu|tanh",
                "reductions": "mean,variance,max,abs_mean,abs_variance,abs_max"
            })
    ]
)
```
```python
import sagemaker as sm

sagemaker_estimator = sm.tensorflow.TensorFlow(
    entry_point='src/mnist.py',
    role=sm.get_execution_role(),
    base_job_name='smdebug-demo-job',
    train_instance_count=1,
    train_instance_type="ml.m4.xlarge",
    framework_version="1.15",
    py_version="py3",
    # smdebug-specific arguments below
    debugger_hook_config=hook_config
)
sagemaker_estimator.fit()
```
SageMaker Debugger can automatically generate TensorBoard scalar summaries, distributions, and histograms for the tensors saved. This can be enabled by passing a `TensorBoardOutputConfig` object when creating an Estimator, as follows. You can also choose to enable or disable histograms for specific collections; by default a collection has the `save_histogram` flag set to `True`. Note that scalar summaries are added to TensorBoard for all `ScalarCollections` and for any scalar saved through `hook.save_scalar`, as sketched below. Refer to the API page for more details on scalar collections and the `save_scalar` method.
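As a minimal sketch (assuming PyTorch, and that the hook is recreated from the SageMaker-provided JSON config), a custom scalar can be saved from the training script like this:

```python
import smdebug.pytorch as smd  # assumption: PyTorch; other frameworks are analogous

# On SageMaker, the hook configured through DebuggerHookConfig can be
# recreated inside the container from the JSON config SageMaker writes.
hook = smd.Hook.create_from_json_file()

eval_accuracy = 0.92  # hypothetical value computed by your evaluation loop
# sm_metric=True additionally publishes the value as a SageMaker metric.
hook.save_scalar("eval_accuracy", eval_accuracy, sm_metric=True)
```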
The example below saves weights and gradients as full tensors, and also saves the gradients as histograms and distributions to visualize in TensorBoard. These are saved to the location passed in the `TensorBoardOutputConfig` object.
```python
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig, TensorBoardOutputConfig

hook_config = DebuggerHookConfig(
    s3_output_path='s3://smdebug-dev-demo-pdx/mnist',
    collection_configs=[
        CollectionConfig(
            name="weights",
            parameters={"save_histogram": False}),
        CollectionConfig(name="gradients"),
    ]
)

tb_config = TensorBoardOutputConfig('s3://smdebug-dev-demo-pdx/mnist/tensorboard')
```
```python
import sagemaker as sm

sagemaker_estimator = sm.tensorflow.TensorFlow(
    entry_point='src/mnist.py',
    role=sm.get_execution_role(),
    base_job_name='smdebug-demo-job',
    train_instance_count=1,
    train_instance_type="ml.m4.xlarge",
    framework_version="1.15",
    py_version="py3",
    # smdebug-specific arguments below
    debugger_hook_config=hook_config,
    tensorboard_output_config=tb_config
)
sagemaker_estimator.fit()
```
For more details, refer to our API page.
Here are some examples of how to run Rules with your training jobs.
Note that passing a `CollectionConfig` object to the Rule as `collections_to_save` is equivalent to passing it to the `DebuggerHookConfig` object as `collection_configs`; this is just a shortcut for your convenience.
The Built-in Rules, or SageMaker Rules, are described in detail on this page. Each rule applies to one of the following scopes of validity (the rules in each scope are listed on that page):
- Generic deep learning models (TensorFlow, Apache MXNet, and PyTorch)
- Generic deep learning models (TensorFlow, MXNet, and PyTorch) and the XGBoost algorithm
- Deep learning applications
- The XGBoost algorithm
You can run a SageMaker built-in Rule as follows, using the `Rule.sagemaker` method. The first argument to this method is the base configuration associated with the Rule; we pre-configure these as much as possible. You can take a look at the rule configs that we populate for all built-in rules here, and you can customize them through the other parameters. These rules run on our pre-built Docker images, which are listed here; you are not charged for the instances when running SageMaker built-in rules. All of the built-in rules are listed on the page linked above.
```python
from sagemaker.debugger import Rule, CollectionConfig, rule_configs

exploding_tensor_rule = Rule.sagemaker(
    base_config=rule_configs.exploding_tensor(),
    rule_parameters={"collection_names": "weights,losses"},
    collections_to_save=[
        CollectionConfig("weights"),
        CollectionConfig("losses")
    ]
)

vanishing_gradient_rule = Rule.sagemaker(
    base_config=rule_configs.vanishing_gradient()
)
```
```python
import sagemaker as sm

sagemaker_estimator = sm.tensorflow.TensorFlow(
    entry_point='src/mnist.py',
    role=sm.get_execution_role(),
    base_job_name='smdebug-demo-job',
    train_instance_count=1,
    train_instance_type="ml.m4.xlarge",
    framework_version="1.15",
    py_version="py3",
    # smdebug-specific arguments below
    rules=[exploding_tensor_rule, vanishing_gradient_rule],
)
sagemaker_estimator.fit()
```
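Once the job is running (or finished), you can check the status of each attached rule evaluation from the estimator; a minimal sketch, assuming the SageMaker Python SDK's `rule_job_summary` helper:

```python
# Each entry describes one rule evaluation job attached to the training job.
for summary in sagemaker_estimator.latest_training_job.rule_job_summary():
    print(summary["RuleConfigurationName"], summary["RuleEvaluationStatus"])
```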
You can write your own rule, custom-made for your application, and provide it so SageMaker can monitor your training job using it. To do so, you need to understand the programming model that `smdebug` provides. Our page on Programming Model for Analysis describes the APIs we provide to help you write your own rule.
Please refer to this example notebook for a demonstration of creating your custom rule and running it on SageMaker.
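As an illustrative sketch (not the notebook's implementation), a custom rule subclasses `smdebug`'s `Rule` class and implements `invoke_at_step`; the `ImproperActivation` class and its logic below are hypothetical, chosen to match the example that follows:

```python
# rules/custom_rules.py (hypothetical file matching the example below)
from smdebug.rules.rule import Rule


class ImproperActivation(Rule):
    def __init__(self, base_trial, collection_names="relu_activations"):
        super().__init__(base_trial)
        self.collection_names = collection_names.split(",")

    def invoke_at_step(self, step):
        # Returning True marks the rule condition as met at this step.
        for coll in self.collection_names:
            for tname in self.base_trial.tensor_names(collection=coll):
                values = self.base_trial.tensor(tname).value(step)
                if (values <= 0).all():  # e.g. a ReLU that never activates
                    return True
        return False
```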
To run a custom rule, you have to provide a few additional parameters. Key parameters to note are a file containing the implementation of your Rule class (`source`), the name of the Rule class (`rule_to_invoke`), the type of instance to run the Rule job on (`instance_type`), the size of the volume on that instance (`volume_size_in_gb`), and the Docker image to use for running this job (`image_uri`). Please refer to the documentation here for more details.
We have pre-built Docker images that you can use to run your custom rules. These are listed here. You can also choose to build your own Docker image for custom rule evaluation. Please refer to the repository SageMaker Debugger Rules Container for instructions on how to build such an image.
```python
from sagemaker.debugger import Rule, CollectionConfig

custom_coll = CollectionConfig(
    name="relu_activations",
    parameters={
        "include_regex": "relu",
        "save_interval": 500,
        "end_step": 5000
    })

improper_activation_rule = Rule.custom(
    name='improper_activation_job',
    image_uri='552407032007.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-debugger-rule-evaluator:latest',
    instance_type='ml.c4.xlarge',
    volume_size_in_gb=400,
    source='rules/custom_rules.py',
    rule_to_invoke='ImproperActivation',
    rule_parameters={"collection_names": "relu_activations"},
    collections_to_save=[custom_coll]
)
```
```python
import sagemaker as sm

sagemaker_estimator = sm.tensorflow.TensorFlow(
    entry_point='src/mnist.py',
    role=sm.get_execution_role(),
    base_job_name='smdebug-demo-job',
    train_instance_count=1,
    train_instance_type="ml.m4.xlarge",
    framework_version="1.15",
    py_version="py3",
    # smdebug-specific arguments below
    rules=[improper_activation_rule],
)
sagemaker_estimator.fit()
```
For more details, refer to our Analysis page.
The `smdebug` SDK also allows you to perform interactive and real-time exploration of the data saved. You can choose to inspect the saved tensors or visualize them through custom plots. You can retrieve these tensors as NumPy arrays, allowing you to use your favorite analysis libraries right in a SageMaker notebook instance, as sketched below. We have a couple of example notebooks demonstrating this.
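A minimal sketch of this workflow using `smdebug`'s trial API (the S3 path matches the examples above):

```python
from smdebug.trials import create_trial

# Point a trial at the s3_output_path used in the DebuggerHookConfig above.
trial = create_trial("s3://smdebug-dev-demo-pdx/mnist")

print(trial.tensor_names())  # every tensor that was saved
print(trial.steps())         # steps at which data was saved

# Retrieve one tensor's values as numpy arrays, step by step.
loss = trial.tensor(trial.tensor_names(collection="losses")[0])
for step in loss.steps():
    print(step, loss.value(step))
```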
SageMaker Debugger is on by default for supported training jobs run on the official SageMaker Framework containers (or AWS Deep Learning Containers). In this default scenario, SageMaker Debugger takes the losses and metrics from your training job and publishes them to SageMaker Metrics, allowing you to track them in SageMaker Studio. You can also see the status of any Rules you have enabled for your training job right in Studio. Here are screenshots of that experience.
If you have enabled TensorBoard outputs for your training job through SageMaker Debugger, TensorBoard artifacts will automatically be generated for the tensors saved. You can then point your TensorBoard instance to that S3 location and review the visualizations for the tensors saved.
We have a bunch of example notebooks here demonstrating different aspects of SageMaker Debugger.