Debugging and Troubleshooting

Guides

Tools

  • Debug Tools

  • torch-distributed-gpu-test.py - a torch.distributed diagnostics script that checks that all GPUs in the cluster (one or many nodes) can talk to each other and allocate GPU memory.

  • NicerTrace - an improved version of Python's trace module, with several additional flags added to the constructor and more useful output.
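
For context on what NicerTrace builds on, the following is a minimal sketch of line-by-line tracing with the standard library's trace module. It does not use NicerTrace itself, and its extra constructor flags are not shown here. The names `add` and `run_traced` are hypothetical and exist only for illustration.

```python
import contextlib
import io
import sys
import trace


def add(a, b):
    """Toy function to trace (hypothetical, for illustration only)."""
    return a + b


def run_traced(func, *args):
    """Run func under the stdlib tracer and return (result, trace_text).

    Assumption: this uses only the standard trace.Trace class,
    not NicerTrace.
    """
    tracer = trace.Trace(
        count=False,  # do not collect per-line execution counts
        trace=True,   # print each line as it is executed
        ignoredirs=[sys.prefix, sys.exec_prefix],  # skip the stdlib itself
    )
    buf = io.StringIO()
    # trace.Trace reports through print(), so capture stdout.
    with contextlib.redirect_stdout(buf):
        result = tracer.runfunc(func, *args)
    return result, buf.getvalue()


if __name__ == "__main__":
    value, report = run_traced(add, 2, 3)
    print(report)
    print("result:", value)
```

When run as a script, the report lists each executed line of `add` prefixed with its file name and line number, which is the kind of raw output that a wrapper such as NicerTrace aims to make more useful.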