[doc][c10d] fix up distributed tutorial (#3132)
* [doc][c10d] fix up distributed tutorial

Summary:
Minor fixups to the distributed tutorial.
1. Fix broken links.

Test Plan:
Ran the referenced code locally to make sure that it works.

Reviewers:

Subscribers:

Tasks:

Tags:

* Update intermediate_source/dist_tuto.rst

Co-authored-by: Svetlana Karslioglu <[email protected]>

---------

Co-authored-by: Svetlana Karslioglu <[email protected]>
c-p-i-o and svekars authored Oct 31, 2024
1 parent c7e1819 commit 76bd6d3
Showing 1 changed file with 36 additions and 22 deletions.
58 changes: 36 additions & 22 deletions intermediate_source/dist_tuto.rst
@@ -38,7 +38,7 @@ simultaneously. If you have access to compute cluster you should check
with your local sysadmin or use your favorite coordination tool (e.g.,
`pdsh <https://linux.die.net/man/1/pdsh>`__,
`clustershell <https://cea-hpc.github.io/clustershell/>`__, or
-`others <https://slurm.schedmd.com/>`__). For the purpose of this
+`slurm <https://slurm.schedmd.com/>`__). For the purpose of this
tutorial, we will use a single machine and spawn multiple processes using
the following template.

@@ -64,11 +64,11 @@ the following template.
if __name__ == "__main__":
-    size = 2
+    world_size = 2
    processes = []
    mp.set_start_method("spawn")
-    for rank in range(size):
-        p = mp.Process(target=init_process, args=(rank, size, run))
+    for rank in range(world_size):
+        p = mp.Process(target=init_process, args=(rank, world_size, run))
        p.start()
        processes.append(p)
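
For readers who want to run this hunk in isolation, a complete, runnable version of the template might look like the sketch below. The ``run`` body (a no-op placeholder), the ``gloo`` backend, and the ``MASTER_ADDR``/``MASTER_PORT`` values are illustrative assumptions, not part of this diff.

.. code:: python

    import os

    import torch.distributed as dist
    import torch.multiprocessing as mp


    def run(rank, world_size):
        # Placeholder for the distributed function each process will execute.
        pass


    def init_process(rank, world_size, fn, backend="gloo"):
        # Assumed rendezvous settings; any reachable address and free port work.
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = "29500"
        dist.init_process_group(backend, rank=rank, world_size=world_size)
        fn(rank, world_size)


    if __name__ == "__main__":
        world_size = 2
        processes = []
        mp.set_start_method("spawn")
        for rank in range(world_size):
            p = mp.Process(target=init_process, args=(rank, world_size, run))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()
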
@@ -125,7 +125,7 @@ process 0 increments the tensor and sends it to process 1 so that they
both end up with 1.0. Notice that process 1 needs to allocate memory in
order to store the data it will receive.

-Also notice that ``send``/``recv`` are **blocking**: both processes stop
+Also notice that ``send/recv`` are **blocking**: both processes block
until the communication is completed. On the other hand immediates are
**non-blocking**; the script continues its execution and the methods
return a ``Work`` object upon which we can choose to
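
As a side illustration of the blocking and non-blocking variants discussed above, the point-to-point calls might be used as in the following sketch. It assumes the process group has already been initialized as in the template earlier; the ranks and tensor values are illustrative.

.. code:: python

    import torch
    import torch.distributed as dist


    def run(rank, world_size):
        tensor = torch.zeros(1)
        if rank == 0:
            tensor += 1
            # Blocking send: returns only after the data has been handed off.
            dist.send(tensor=tensor, dst=1)
            # Non-blocking send: returns a Work handle immediately.
            req = dist.isend(tensor=tensor, dst=1)
            req.wait()
        else:
            # Blocking receive into pre-allocated memory.
            dist.recv(tensor=tensor, src=0)
            # Non-blocking receive; wait() blocks until the data has arrived.
            req = dist.irecv(tensor=tensor, src=0)
            req.wait()
        print(f"Rank {rank} has data {tensor[0]}")
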
@@ -219,16 +219,23 @@ to obtain the sum of all tensors on all processes, we can use the
Since we want the sum of all tensors in the group, we use
``dist.ReduceOp.SUM`` as the reduce operator. Generally speaking, any
commutative mathematical operation can be used as an operator.
-Out-of-the-box, PyTorch comes with 4 such operators, all working at the
+Out-of-the-box, PyTorch comes with many such operators, all working at the
element-wise level:

- ``dist.ReduceOp.SUM``,
- ``dist.ReduceOp.PRODUCT``,
- ``dist.ReduceOp.MAX``,
-- ``dist.ReduceOp.MIN``.
+- ``dist.ReduceOp.MIN``,
+- ``dist.ReduceOp.BAND``,
+- ``dist.ReduceOp.BOR``,
+- ``dist.ReduceOp.BXOR``,
+- ``dist.ReduceOp.PREMUL_SUM``.

-In addition to ``dist.all_reduce(tensor, op, group)``, there are a total
-of 6 collectives currently implemented in PyTorch.
+The full list of supported operators is
+`here <https://pytorch.org/docs/stable/distributed.html#torch.distributed.ReduceOp>`__.
+
+In addition to ``dist.all_reduce(tensor, op, group)``, there are many additional collectives currently implemented in
+PyTorch. Here are a few supported collectives.

- ``dist.broadcast(tensor, src, group)``: Copies ``tensor`` from
``src`` to all other processes.
@@ -244,6 +251,12 @@ of 6 collectives currently implemented in PyTorch.
- ``dist.all_gather(tensor_list, tensor, group)``: Copies ``tensor``
from all processes to ``tensor_list``, on all processes.
- ``dist.barrier(group)``: Blocks all processes in `group` until each one has entered this function.
+- ``dist.all_to_all(output_tensor_list, input_tensor_list, group)``: Scatters the list of input tensors to all processes in
+  a group and returns the gathered list of tensors in the output list.
+
+The full list of supported collectives can be found by looking at the latest documentation for PyTorch Distributed
+`(link) <https://pytorch.org/docs/stable/distributed.html>`__.
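
As a concrete illustration of the collective most used in this tutorial, a minimal ``run`` function built around ``dist.all_reduce`` might look like the sketch below; the group creation and tensor values are illustrative assumptions.

.. code:: python

    import torch
    import torch.distributed as dist


    def run(rank, world_size):
        """All-reduce example: every rank ends up with the sum of all ranks' tensors."""
        group = dist.new_group(list(range(world_size)))
        tensor = torch.ones(1)
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=group)
        # After the call, tensor[0] equals world_size on every rank.
        print(f"Rank {rank} has data {tensor[0]}")
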


Distributed Training
--------------------
@@ -275,7 +288,7 @@ gradients of their model on their batch of data and then average their
gradients. In order to ensure similar convergence results when changing
the number of processes, we will first have to partition our dataset.
(You could also use
-`tnt.dataset.SplitDataset <https://github.com/pytorch/tnt/blob/master/torchnet/dataset/splitdataset.py#L4>`__,
+`torch.utils.data.random_split <https://pytorch.org/docs/stable/data.html#torch.utils.data.random_split>`__,
instead of the snippet below.)
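
For example, a ``random_split``-based partition might look roughly like the sketch below; the function name, the fixed seed, and the remainder handling are assumptions for illustration, not the tutorial's own partitioning code.

.. code:: python

    import torch
    from torch.utils.data import random_split


    def partition_dataset_random(dataset, rank, world_size):
        # Split into world_size nearly equal parts; each rank keeps one part.
        part_len = len(dataset) // world_size
        lengths = [part_len] * world_size
        lengths[-1] += len(dataset) - sum(lengths)  # absorb any remainder
        # A fixed seed keeps the split identical across all processes.
        generator = torch.Generator().manual_seed(1234)
        parts = random_split(dataset, lengths, generator=generator)
        return parts[rank]
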

.. code:: python
Expand Down Expand Up @@ -389,7 +402,7 @@ could train any model on a large computer cluster.
lot more tricks <https://seba-1511.github.io/dist_blog>`__ required to
implement a production-level implementation of synchronous SGD. Again,
use what `has been tested and
-optimized <https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel>`__.
+optimized <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel>`__.
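
For orientation, wrapping a model in ``DistributedDataParallel`` typically looks like the following sketch; the model, loss, optimizer, and CPU placement are illustrative assumptions, and the process group is assumed to be initialized already.

.. code:: python

    import torch
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP


    def run(rank, world_size):
        model = nn.Linear(10, 1)
        # Gradients are averaged across processes automatically during backward().
        ddp_model = DDP(model)
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

        data, target = torch.randn(20, 10), torch.randn(20, 1)
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(ddp_model(data), target)
        loss.backward()
        optimizer.step()
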

Our Own Ring-Allreduce
~~~~~~~~~~~~~~~~~~~~~~
@@ -451,8 +464,9 @@ Communication Backends

One of the most elegant aspects of ``torch.distributed`` is its ability
to abstract and build on top of different backends. As mentioned before,
-there are currently three backends implemented in PyTorch: Gloo, NCCL, and
-MPI. They each have different specifications and tradeoffs, depending
+there are multiple backends implemented in PyTorch.
+Some of the most popular ones are Gloo, NCCL, and MPI.
+They each have different specifications and tradeoffs, depending
on the desired use case. A comparative table of supported functions can
be found
`here <https://pytorch.org/docs/stable/distributed.html#module-torch.distributed>`__.
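
In practice, a common pattern is to choose the backend based on whether CUDA is available, as in this sketch; the rendezvous environment variables are assumed to be set elsewhere, and NCCL additionally expects one process per GPU.

.. code:: python

    import torch
    import torch.distributed as dist


    def init_process(rank, world_size, fn):
        # NCCL is generally preferred for GPU tensors, Gloo for CPU tensors.
        backend = "nccl" if torch.cuda.is_available() else "gloo"
        dist.init_process_group(backend, rank=rank, world_size=world_size)
        fn(rank, world_size)
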
@@ -544,15 +558,15 @@ NCCL backend is included in the pre-built binaries with CUDA support.
Initialization Methods
~~~~~~~~~~~~~~~~~~~~~~

-To finish this tutorial, let's talk about the very first function we
-called: ``dist.init_process_group(backend, init_method)``. In
-particular, we will go over the different initialization methods which
-are responsible for the initial coordination step between each process.
-Those methods allow you to define how this coordination is done.
-Depending on your hardware setup, one of these methods should be
-naturally more suitable than the others. In addition to the following
-sections, you should also have a look at the `official
-documentation <https://pytorch.org/docs/stable/distributed.html#initialization>`__.
+To conclude this tutorial, let's examine the initial function we invoked:
+``dist.init_process_group(backend, init_method)``. Specifically, we will discuss the various
+initialization methods responsible for the preliminary coordination step between each process.
+These methods enable you to define how this coordination is accomplished.
+
+The choice of initialization method depends on your hardware setup, and one method may be more
+suitable than others. In addition to the following sections, please refer to the `official
+documentation <https://pytorch.org/docs/stable/distributed.html#initialization>`__ for further information.


**Environment Variable**
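
As a sketch of this method, the coordinating information is supplied through the ``MASTER_ADDR``, ``MASTER_PORT``, ``WORLD_SIZE``, and ``RANK`` environment variables before calling ``init_process_group``; the address, port, and backend below are placeholders.

.. code:: python

    import os

    import torch.distributed as dist

    os.environ["MASTER_ADDR"] = "10.1.1.20"  # placeholder: address of the rank 0 machine
    os.environ["MASTER_PORT"] = "29500"      # placeholder: a free port on that machine
    os.environ["WORLD_SIZE"] = "2"
    os.environ["RANK"] = "0"                 # set differently on each process

    # With init_method="env://" (the default), these variables are read automatically.
    dist.init_process_group(backend="gloo", init_method="env://")
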
