Debugging the performance of distributed AI applications can be challenging. Currently, developers can profile their PyTorch training applications programmatically using the PyTorch Profiler API. However, this requires editing the Python code and likely relaunching the job.
Dynolog enables profiling distributed AI applications on demand, i.e. without any code changes. Essentially, Dynolog provides a service that lets users request a PyTorch profile. On receiving a profiling request, Dynolog uses inter-process communication to configure the PyTorch profiler within the application. This is illustrated in the following diagram.
The following instructions walk through how to:
- Set up PyTorch to enable on-demand profiling.
- Configure the PyTorch application environment.
- Use Dynolog along with a job scheduler like SLURM.
- Profile a distributed training job (full example).
Note that these instructions are likely to change: this feature depends on recent commits in PyTorch, so once a new PyTorch release ships, users will no longer need to build PyTorch from source. For now, we need to build PyTorch from source to pick up the latest changes that make it compatible with Dynolog.
Please follow the instructions in the PyTorch README to set up a source build. Ensure that your checkout includes PR #88020 (use git log to check).
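For example, PyTorch squash-merges pull requests into single commits whose titles usually reference the PR number, so a quick (heuristic) way to check is:
# Look for the merge commit of PR #88020 in the history; this assumes
# the commit title references the PR number, as PyTorch merges usually do.
git log --oneline | grep "#88020"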
Before building, also pull in the latest changes from the Kineto dependency:
cd third_party/kineto/libkineto
git checkout main
cd -
The minimum version needed is one containing PR #653 (commit a36e47e6d080abd62ca26dd2a007c8f4cd0bd600).
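To verify that your Kineto checkout is new enough, you can check whether that commit is an ancestor of your current HEAD:
cd third_party/kineto/libkineto
# git merge-base --is-ancestor exits 0 if the given commit is an
# ancestor of HEAD, i.e. your checkout already contains it.
git merge-base --is-ancestor a36e47e6d080abd62ca26dd2a007c8f4cd0bd600 HEAD && echo "Kineto is new enough"
cd -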
Lastly, disable the USE_LITE_INTERPRETER_PROFILER setting. One way to do this is to change the option in pytorch/CMakeLists.txt:
if(USE_LITE_INTERPRETER_PROFILER)
- string(APPEND CMAKE_CXX_FLAGS " -DEDGE_PROFILER_USE_KINETO")
+ # string(APPEND CMAKE_CXX_FLAGS " -DEDGE_PROFILER_USE_KINETO")
endif()
You are all set to build and install PyTorch.
python setup.py install
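If you prefer not to edit CMakeLists.txt, note that PyTorch's build scripts generally forward USE_* environment variables to CMake, so the following may achieve the same effect (a sketch, not verified for this particular option):
# Turn the option off via the environment instead of editing CMakeLists.txt.
USE_LITE_INTERPRETER_PROFILER=0 python setup.py install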
On a local machine, you can run dynolog and use the dyno command line tool to trace your application. To run dynolog, execute the following in a new terminal window, or install it using a package manager:
./build/dynolog/src/dynolog --enable_ipc_monitor
For PyTorch, we first need to set up the following environment:
echo "ENABLE_IPC_FABRIC=YES" > ~/libkineto.conf
export KINETO_CONFIG=~/libkineto.conf
export KINETO_USE_DAEMON=1
Now you can run the PyTorch program. We have included a simple example PyTorch script for you to test out this flow.
python3 scripts/pytorch/linear_model_example.py
Alternatively, you can set the environment variables on the command line.
KINETO_CONFIG=~/libkineto.conf KINETO_USE_DAEMON=1 python3 scripts/pytorch/linear_model_example.py
In the output of the script, you should see a message like the following:
INFO:2022-10-22 00:59:13 151209:151209 init.cpp:98] Registering daemon config loader
You can now trigger a trace using the dyno command line tool. Dynolog will try to find a PyTorch process running on the machine; it can also filter processes by PID or job ID. On a successful match, the command line tool prints the path the trace file will be written to:
$> ./build/release/dyno gputrace --log-file /tmp/libkineto_trace.json
Matched 1 processes
Trace output files will be written to:
/tmp/libkineto_trace_151419.json
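The output is a standard Chrome trace (JSON) that can be opened in chrome://tracing or Perfetto. As a quick sanity check from the shell (assuming jq is installed), you can count the recorded events by name:
# Tally trace events by name and show the ten most frequent ones.
jq -r '.traceEvents[].name' /tmp/libkineto_trace_151419.json | sort | uniq -c | sort -rn | head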
To trigger a trace on a remote node, use the --hostname flag.
dyno --hostname a100-st-p4d24xlarge-42 gputrace --log-file /path/to/log/file/libkineto_trace.json
A single node may run multiple jobs. PyTorch/Kineto can recognize the job IDs of processes running under the SLURM job scheduler, so you can pass a job ID to filter requests as follows.
dyno --hostname a100-st-p4d24xlarge-42 gputrace --job-id 4242 --log-file /path/to/log/file/libkineto_trace.json
Lastly, this section describes how to profile a full-scale training job spanning multiple nodes. We provide an example with SLURM, a job scheduler commonly used in HPC environments.
In case dynolog is not deployed on your cloud/cluster, you can launch the dynolog binary with your job. The script run_with_dyno_wrapper.sh can be used for this purpose.
This script first spins up a dynolog process and then runs your job's command. It also sets up the required environment variables. An example is shown below:
srun ./scripts/slurm/run_with_dyno_wrapper.sh <desired launch cmd>
# example
srun ./scripts/slurm/run_with_dyno_wrapper.sh python3 ./scripts/pytorch/linear_model_example.py
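For reference, the wrapper boils down to something like the sketch below (simplified, with an illustrative dynolog binary path; see the actual script under scripts/slurm/ for details):
#!/bin/bash
# Sketch of run_with_dyno_wrapper.sh: start a dynolog daemon, point
# Kineto at it via the environment, then exec the job command.
./build/dynolog/src/dynolog --enable_ipc_monitor &
echo "ENABLE_IPC_FABRIC=YES" > ~/libkineto.conf
export KINETO_CONFIG=~/libkineto.conf
export KINETO_USE_DAEMON=1
exec "$@"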
After the job attaches to dynolog, we can use the utility script unitrace.py to profile the entire job. unitrace.py requires the job ID and allows users to:
- Lookup the nodes running the job.
- Send dyno gputrace requests to all nodes as required.
Currently, this script only works with the SLURM job scheduler, and job IDs containing an underscore (_) are not supported. It can easily be extended to support more schedulers.
The usage of the unitrace.py script is straightforward.
Usage: ./scripts/pytorch/unitrace.py [slurm_job_id] [output_dir]
- where output_dir is the top-level directory for saving traces.
unitrace.py will save a new trace file for each process in the job, using the naming convention libkineto_trace_<slurm_job_id>_<pid_of_process>.json
For example, the following command traces job 65788044:
./scripts/pytorch/unitrace.py 65788044 /tmp/trace_65788044/
This produces output such as:
Found hosts string = learnfair[1773,1775-1776,1798,1800,1809,1811-1812]
Hosts = ['learnfair1773', 'learnfair1775', 'learnfair1776', 'learnfair1798', 'learnfair1800', 'learnfair1809', 'learnfair1811', 'learnfair1812']
Running cmd: ./build/release/dyno --hostname learnfair1773 gputrace --job-id 65788044 --log-file /tmp/trace_65788044/libkineto_trace_learnfair1773.json
...
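Under the hood, this amounts to expanding the SLURM host list for the job and issuing one dyno request per node. A rough shell equivalent (assuming the standard SLURM squeue/scontrol utilities are available) would be:
JOB_ID=65788044
OUT_DIR=/tmp/trace_${JOB_ID}
mkdir -p "$OUT_DIR"
# Expand the job's compact nodelist (e.g. learnfair[1773,1775-1776])
# into individual hostnames, then send a gputrace request to each node.
for host in $(scontrol show hostnames "$(squeue -j "$JOB_ID" -h -o %N)"); do
  ./build/release/dyno --hostname "$host" gputrace --job-id "$JOB_ID" \
    --log-file "$OUT_DIR/libkineto_trace_${host}.json"
done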
For support with this feature, please raise a GitHub issue or contact us via the email mentioned in the README.