Testing, Benchmarking, and Optimizing Horovod collectives on Jean-Zay #4

Open
5 tasks
EiffL opened this issue May 12, 2021 · 2 comments
Labels: Hackathon Goal (High level goals for the hack week), Horovod (Issues related to the horovod backend)

EiffL commented May 12, 2021

This issue tracks the work needed to finalize and validate our modified version of Horovod. It is an overarching goal that will cover several smaller issues.

Goal

By the end of the hackweek, have tested code with an associated Pull Request to https://github.com/horovod/horovod that fully supports our needs for Mesh TensorFlow.

Context

With @kimchitsigai and @mypey we worked on some modifications to Horovod to support multiple communicators. A description of what we did can be found here: DifferentiableUniverseInitiative/horovod#2
In parallel, a different approach to supporting multiple groups of devices was proposed here: horovod/horovod#2839
In the end, probably only one of these two implementations will be merged, but we can try to find out which one works best for our purposes.
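
To illustrate why several communicators (or device groups) are useful here, below is a minimal sketch in plain Python of how the ranks of a 2D device mesh could be partitioned into row and column groups, each of which would need its own collective scope. The function name `mesh_groups`, the 2x4 mesh shape, and the row-major rank layout are assumptions made for this illustration only; they are not taken from either implementation linked above.

```python
def mesh_groups(mesh_shape):
    """Partition ranks of a 2D mesh into row groups and column groups.

    Assumes a row-major layout, i.e. rank = row * n_cols + col.
    This layout is an assumption of this sketch, not a Horovod convention.
    """
    n_rows, n_cols = mesh_shape
    # One group per mesh row: ranks that share the same row index.
    row_groups = [[r * n_cols + c for c in range(n_cols)] for r in range(n_rows)]
    # One group per mesh column: ranks that share the same column index.
    col_groups = [[r * n_cols + c for r in range(n_rows)] for c in range(n_cols)]
    return {"rows": row_groups, "cols": col_groups}


groups = mesh_groups((2, 4))
print(groups["rows"])  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(groups["cols"])  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Every rank belongs to exactly one row group and one column group, so a collective along one mesh dimension only involves a subset of the ranks.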

Participants

The main participants in this task are:

Tasks

EiffL added the Hackathon Goal and Horovod labels on May 12, 2021
EiffL self-assigned this on May 12, 2021
EiffL commented May 18, 2021

I have also added a new task, following some weird deadlocks identified by @andrevitorelli and documented in DifferentiableUniverseInitiative/horovod#5
@kimchitsigai this is probably a bug somewhere in our Horovod modifications :-|

andrevitorelli commented May 18, 2021

Just to leave the error message here:

<...>/horovod/horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or 
broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different 
ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.

Missing ranks:
0: [iFFT3D_2/HorovodAlltoall_iFFT3D_2_stack_0]
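
For readers unfamiliar with this warning, here is a small plain-Python sketch of the condition it describes: a collective can only complete once every participating rank has submitted the same named tensor, so any rank that has not submitted it shows up as missing. This is a toy model and not Horovod's actual stall inspector; the function `missing_by_rank` and the world size of 4 are assumptions for illustration, and only the tensor name is taken from the message above.

```python
def missing_by_rank(submissions, world_size):
    """Map each rank to the tensors other ranks submitted but it did not.

    submissions: dict mapping rank -> set of tensor names that rank submitted.
    Toy model of the situation described in the warning, not Horovod code.
    """
    # Collect every tensor name submitted by at least one rank.
    all_names = set()
    for names in submissions.values():
        all_names |= names

    report = {}
    for rank in range(world_size):
        absent = sorted(all_names - submissions.get(rank, set()))
        if absent:
            report[rank] = absent
    return report


name = "iFFT3D_2/HorovodAlltoall_iFFT3D_2_stack_0"
# Hypothetical situation: ranks 1 to 3 reached the alltoall, rank 0 did not.
submissions = {0: set(), 1: {name}, 2: {name}, 3: {name}}
for rank, tensors in missing_by_rank(submissions, world_size=4).items():
    print(f"{rank}: {tensors}")
```

In this toy setup only rank 0 is reported, which has the same shape as the "Missing ranks" report above: the other ranks are blocked waiting on a rank that never entered the collective.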
