This issue tracks the work needed to finalize and validate the modified version of Horovod we developed. This overarching goal covers several smaller issues.
Goal
By the end of the hackweek, have tested code with an associated pull request to https://github.com/horovod/horovod that fully supports our needs for Mesh TensorFlow.
Context
Together with @kimchitsigai and @mypey, we worked on modifications to Horovod that let it support multiple communicators. A description of what we did can be found here: DifferentiableUniverseInitiative/horovod#2
In parallel, a different approach to supporting multiple groups of devices was proposed here: horovod/horovod#2839
In the end, probably only one of these two implementations will be merged, but we can try to determine which one works best for our purposes.
<...>/horovod/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or
broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different
ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Missing ranks:
0: [iFFT3D_2/HorovodAlltoall_iFFT3D_2_stack_0]
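To make the report above easier to read: each line under "Missing ranks" names a rank that has not yet submitted the listed tensor. Below is a toy sketch of that bookkeeping in plain Python. It is not Horovod's implementation; the `missing_ranks` helper, the `submitted` mapping, and the two-rank world size are all assumptions made for illustration.

```python
from collections import defaultdict


def missing_ranks(submitted, world_size):
    """Map each absent rank to the tensor names it has not yet submitted.

    `submitted` maps a tensor name to the set of ranks that have
    already submitted it (hypothetical bookkeeping, not Horovod's).
    """
    report = defaultdict(list)
    for name, ranks in submitted.items():
        for rank in range(world_size):
            if rank not in ranks:
                report[rank].append(name)
    return dict(report)


# Assumed scenario: two ranks, and only rank 1 reached the all-to-all.
submitted = {"iFFT3D_2/HorovodAlltoall_iFFT3D_2_stack_0": {1}}
report = missing_ranks(submitted, world_size=2)

for rank, names in sorted(report.items()):
    print(f"{rank}: [{', '.join(names)}]")
```

Under these assumptions the sketch reports rank 0 as missing for the all-to-all tensor, which is the situation the log describes: the other ranks wait for rank 0, and the collective cannot complete until it submits the same tensor.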
Participants
The main participants in this task are:
Tasks