Notes on debugging NCCL-based software and tuning it for peak performance
The full list can be found here. That list is long, but many of those variables are no longer in use.
The following env vars are most useful during debugging NCCL-related issues such as hanging and crashing.
NCCL_DEBUG is the most commonly used env var to debug networking issues.
Values:
- VERSION - Prints the NCCL version at the start of the program.
- WARN - Prints an explicit error message whenever any NCCL call errors out.
- INFO - Prints debug information.
- TRACE - Prints replayable trace information on every call.
For example:
NCCL_DEBUG=INFO python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py
This will dump a lot of NCCL-related debug information, which you can then search online should it report any problems.
NCCL_DEBUG_FILE should be very useful when using NCCL_DEBUG, as the information is copious, especially if using many nodes. When the NCCL_DEBUG env var is set, this variable redirects all NCCL debug logging output to a file.
The default is stdout. When using many GPUs it can be very useful to save each process' debug info into its own log file, which can be done like so:
NCCL_DEBUG_FILE=/path/to/nccl-log.%h.%p.txt
- %h is replaced with the hostname
- %p is replaced with the process PID
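For example, combining this with the launcher used earlier (the /tmp path here is just an illustration):
NCCL_DEBUG=INFO NCCL_DEBUG_FILE=/tmp/nccl-log.%h.%p.txt python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py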
If you then need to analyse hundreds of these at once, here are some useful shortcuts:
- grep for a specific match and also print the file and line number where it was found:
grep -n "Init COMPLETE" nccl-log*
- show the last line (tail -1) of all nccl log files followed by the name of each file:
find . -name "nccl*" -exec sh -c 'echo "$(tail -1 "$1") ($1)"' _ {} \;
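And if you just want to know how many of the per-process logs got as far as completing initialization, a quick sketch reusing the "Init COMPLETE" pattern from above:
grep -l "Init COMPLETE" nccl-log* | wc -l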
NCCL_DEBUG_SUBSYS, used in combination with NCCL_DEBUG, tells the latter which subsystems to show. Normally you don't have to specify this variable, but sometimes the developers helping you may ask to limit the output to only some subsystems, for example:
NCCL_DEBUG_SUBSYS=INIT,GRAPH,ENV,TUNING
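Put together with the same test launcher as earlier, that would look something like:
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,GRAPH,ENV,TUNING python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py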
NCCL_P2P_DISABLE disables P2P comms - e.g., NVLink won't be used if there is one, and performance will be much slower as a result.
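If you suspect that P2P/NVLink is the source of a problem, a quick way to test that hypothesis is to re-run the same test with P2P disabled and compare (expect it to be slower):
NCCL_P2P_DISABLE=1 python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py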
NCCL_SOCKET_IFNAME is very useful if you have multiple network interfaces and you want to choose a specific one to be used.
By default NCCL will try to use the fastest type of interface, which is typically ib (InfiniBand).
But say you want to use an Ethernet interface instead; then you can override this with:
NCCL_SOCKET_IFNAME=eth
This env var can also be used to debug connectivity issues - say, if one of the interfaces is firewalled while the others aren't, they can be tried instead. Or if you are not sure whether some problem is related to the network interface or something else, testing other interfaces helps rule out the network as the source of the issue.
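For instance, to see which interfaces a node has and then pin NCCL to a specific Ethernet one (eth0 here is just an example - interface names vary per system):
ip -brief link
NCCL_SOCKET_IFNAME=eth0 python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py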
The following env vars are used primarily to tune performance.
NCCL_ALGO defines which algorithms NCCL will use. Typically it's one of tree, ring, collnetdirect and collnetchain.
I asked questions about how a user can do this optimization and was told in this NCCL Issue that basically the user shouldn't try to optimize anything, as NCCL has a ton of smart algorithms inside that will automatically switch from one algorithm to another depending on the concrete situation.
Sylvain Jeaugey shared:
There used to be a static threshold, but it's been replaced by a more complex tuning system. The new system builds a model of the latency and bandwidth of each algorithm/protocol combination (that's many, many combinations) and decides which one should perform best depending on the size. So there is no longer an env var and a static value, which is good because the performance of each algorithm depends on the number of nodes and number of GPUs per node and therefore we need to navigate a 2D space of algo/protocols which isn't easy. You can always force one algorithm with
NCCL_ALGO=TREE and NCCL_ALGO=RING
and see what performance you get and whether NCCL switches at the right point. I know it's hard to understand, but it's also the best solution we found to have the best performance across all platforms and users without users having to manually tune the switch points. Downside is, if you want to manually tune things, you can't.
If you use NCCL_ALGO you need to list the algorithms to consider, but otherwise you have no control over it. So, really, this is only useful if you want to make sure that one of the algorithms isn't used.
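A minimal sketch of "force one algorithm and compare", reusing the same test script as above (the timing numbers would come from whatever benchmark you normally run):
for algo in TREE RING; do
    echo "=== NCCL_ALGO=$algo ==="
    NCCL_ALGO=$algo python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py
done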
When asking about which algorithm is better, I received:
Roughly speaking, ring is superior in terms of peak bandwidth (except on 2 nodes), tree is superior in terms of base latency (especially as we scale).
Bandwidth = Size / Time, so whether you look at the time or the bandwidth for a given size, it will be a combination of both the peak bandwidth and the base latency. For a fixed size, as you scale, the base latency of ring will become prevalent and tree will be better.
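To unpack that a bit (this is just the generic latency/bandwidth cost model, not a formula taken from NCCL's code): Time ≈ base_latency + Size / peak_bandwidth, so the measured Bandwidth = Size / Time approaches peak_bandwidth for large messages (where ring shines) and is dominated by base_latency for small messages or large node counts (where tree shines).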
The NCCL_CROSS_NIC variable controls whether NCCL should allow rings/trees to use different NICs, causing inter-node communication to use different NICs on different nodes.
To maximize inter-node communication performance when using multiple NICs, NCCL tries to communicate between the same NICs on different nodes, to allow for network designs where each NIC from each node connects to a different network switch (network rail), avoiding any risk of traffic-flow interference. The NCCL_CROSS_NIC setting is therefore dependent on the network topology, in particular on whether the network fabric is rail-optimized or not.
This has no effect on systems with only one NIC.
Values accepted:
- 0: Always use the same NIC for the same ring/tree, to avoid crossing network rails. Suited for networks with per NIC switches (rails), with a slow inter-rail connection. Note there are corner cases for which NCCL may still cause cross-rail communication, so rails still need to be connected at the top.
- 1: Do not attempt to use the same NIC for the same ring/tree. This is suited for networks where all NICs from a node are connected to the same switch, hence trying to communicate across the same NICs does not help avoid flow collisions.
- 2: (Default) Try to use the same NIC for the same ring/tree, but still allow using different NICs if that would result in better performance.
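For example, if you know your fabric is rail-optimized and want to keep traffic strictly on the rails, you could set:
export NCCL_CROSS_NIC=0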
As it's often not easy to benchmark hundreds of nodes, we typically try to benchmark interconnect performance using, say, 4 nodes. I wasn't sure whether this would give the correct indication for when 40 or 400 nodes are used, so I asked about it here and the answer was:
Extrapolating at scale is not that hard for ring and tree (we have a function in
tuning.cc
predicting it, based on the ring linear latency and the tree log latency with reduced BW). Now as you scale, there are many factors which may cause your real performance to be very far off the prediction, like routing. Also note on an IB network you'll be able to use SHARP; that way your latency stays mostly constant as you scale, your bandwidth doesn't degrade much either, and you're always better than both ring and tree.
Enable NCCL debug logging for subsystems - collectives:
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=COLL
If you're working in a SLURM environment with many nodes, you probably want to do this only on rank 0, like so:
if [[ $SLURM_PROCID == "0" ]]; then
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=COLL
fi
Assuming your logs were all sent to main_log.txt, you can then count how many calls of each collective were performed with:
grep -a "NCCL INFO Broadcast" main_log.txt | wc -l
2590
grep -a "NCCL INFO AllReduce" main_log.txt | wc -l
5207
grep -a "NCCL INFO AllGather" main_log.txt | wc -l
1849749
grep -a "NCCL INFO ReduceScatter" main_log.txt | wc -l
82850
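The same counts can be collected in one pass over the log, using the same grep patterns as above:
for coll in Broadcast AllReduce AllGather ReduceScatter; do
    printf "%-14s %s\n" "$coll" "$(grep -ac "NCCL INFO $coll" main_log.txt)"
done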
It might be a good idea to first isolate a specific stage of the training, as loading and saving will have a very different pattern from training iterations.
So I typically first slice out one iteration, e.g. if each iteration's log starts with: iteration: ...
then I'd first do:
csplit main_log.txt '/iteration: /' "{*}"
and then analyse one of the resulting files that corresponds to an iteration. By default the files will be named something like xx02.
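Then the same counting trick can be applied to just that slice, e.g. assuming xx02 is the chunk that holds a full iteration:
grep -ac "NCCL INFO AllReduce" xx02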