Some clarifications of global reduction operations needed #44
Comments
Re 2): "is finished on all processes of the group" sounds reasonable, because otherwise the cleanup of the not-yet-completed operation might interfere with one that has already been started remotely. The postcondition of allreduce says "all group members have invoked the procedure", so a barrier, or some other method of global synchronization, is indeed required. Changing the postcondition to "all group members have completed the procedure" would push that barrier into the library. Such an internal barrier is pure overhead whenever there is already (implicit) global synchronization between the calls to allreduce. Maybe the specification needs an advice to users in order to create awareness?
Re 1): Even though the timeout value is part of the parameters, two allreduce calls that differ only in the timeout value should be considered the same, in order to allow patterns like:
or
Any objections?
Re 2: Indeed, an advice to users would be helpful. OpenSHMEM works the same way and has a dedicated section about that scenario. BTW, in OpenSHMEM the work memory is user-provided. This gives you the chance to use double buffering, which makes consecutive reduce calls possible.
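To make the double-buffering idea concrete, here is a minimal sketch in plain C. It does not call OpenSHMEM or GASPI; `work_pool_t`, `next_work_buffer` and `WORK_ELEMS` are hypothetical names chosen for illustration. It only shows how consecutive reductions could alternate between two user-owned work buffers.

```c
#include <assert.h>
#include <stddef.h>

#define WORK_ELEMS 64  /* hypothetical size of one work buffer */

/* Two user-owned work buffers that consecutive reductions alternate between. */
typedef struct {
    double   buf[2][WORK_ELEMS];
    unsigned calls;              /* number of reductions started so far */
} work_pool_t;

/* Return the work buffer for the next reduction and advance the counter.
 * Reduction k uses buf[k % 2], so reduction k+1 never touches the memory
 * that a slower peer may still be using for reduction k. */
static double *next_work_buffer(work_pool_t *pool)
{
    assert(pool != NULL);
    return pool->buf[pool->calls++ % 2];
}
```

The reasoning behind reusing the first buffer for reduction k+2 is that a process can only complete reduction k+1 after every group member has invoked it, and therefore after every member has left reduction k. Whether that argument carries over to a particular library depends on its actual postconditions.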
Re 1: The exciting question is: how are an invalid and a valid gaspi_allreduce call during an already ongoing gaspi_allreduce distinguished? If both have the same parameters (barring the timeout), then the call is clearly valid. But what happens under the hood? Are the other parameters even used? Actually, they are already there. That said, a call gaspi_allreduce_continue(timeout) could be helpful here. The user would no longer need to store the parameters just to ensure a valid call.
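One way to picture such a continue call is a thin wrapper that remembers the parameters of the pending reduction, so that resuming needs only a timeout. The sketch below is plain C with a stand-in backend. It is not GASPI code, and every name in it (`pending_reduce_t`, `reduce_start`, `reduce_continue`, `fake_backend`, the `RED_*` codes) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { RED_SUCCESS, RED_TIMEOUT, RED_ERROR } red_status_t;

/* Hypothetical backend signature standing in for the real collective. */
typedef red_status_t (*red_backend_fn)(const double *send, double *recv,
                                       size_t n, unsigned timeout_ms);

/* Parameters of the reduction that is currently pending, if any. */
typedef struct {
    red_backend_fn backend;
    const double  *send;
    double        *recv;
    size_t         n;
    bool           active;
} pending_reduce_t;

/* Resume the pending reduction; only the timeout may differ between calls. */
static red_status_t reduce_continue(pending_reduce_t *p, unsigned timeout_ms)
{
    assert(p != NULL);
    if (!p->active) return RED_ERROR;          /* nothing to continue */
    red_status_t s = p->backend(p->send, p->recv, p->n, timeout_ms);
    if (s != RED_TIMEOUT) p->active = false;   /* finished or failed */
    return s;
}

/* Start a reduction; rejected while an earlier one is still pending. */
static red_status_t reduce_start(pending_reduce_t *p, red_backend_fn backend,
                                 const double *send, double *recv,
                                 size_t n, unsigned timeout_ms)
{
    assert(p != NULL && backend != NULL);
    if (p->active) return RED_ERROR;
    p->backend = backend; p->send = send; p->recv = recv; p->n = n;
    p->active = true;
    return reduce_continue(p, timeout_ms);
}

/* Stand-in backend for experimentation: reports a timeout on the first two
 * polls, then "completes" by copying the input (a one-process reduction). */
static int fake_polls = 0;
static red_status_t fake_backend(const double *send, double *recv,
                                 size_t n, unsigned timeout_ms)
{
    (void)timeout_ms;
    if (++fake_polls < 3) return RED_TIMEOUT;
    for (size_t i = 0; i < n; ++i) recv[i] = send[i];
    return RED_SUCCESS;
}
```

With this shape, a second start while a reduction is pending is always an error, and a continue is always valid, so the library would not have to compare parameter lists to tell the two cases apart. Whether the specification should adopt something like this is exactly the open question.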
The |
In sec. 11.3 there is the following sentence:
Starting a reduction operation for the same group in a separate thread before previously invoked operation is finished on all processes of the group is not allowed and yields undefined behavior.
gaspi_allreduce(..., myGroup, GASPI_TEST);
gaspi_allreduce(..., myGroup, GASPI_BLOCK);
Here both reduction operations are started in the same thread and potentially overlap.
gaspi_allreduce(..., myGroup, GASPI_BLOCK);
gaspi_allreduce(..., myGroup, GASPI_BLOCK);
since the local process doesn't know whether all other processes have already left the first gaspi_allreduce. You would have to place a barrier in between the two calls.
Another issue: it should be guaranteed that the results of the reduction are bitwise equal across all ranks of the group.
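The reason this guarantee is not automatic is that floating-point addition is not associative: if ranks combine the same contributions in different orders, their results can differ. The following sketch in plain C (no GASPI involved; `sum_in_order` is a hypothetical helper) sums the same three values in two orders.

```c
#include <assert.h>
#include <stddef.h>

/* Sum the contributions in the order given by `order`.
 * Different orders may yield different results in floating point. */
static double sum_in_order(const double *vals, const size_t *order, size_t n)
{
    assert(vals != NULL && order != NULL);
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += vals[order[i]];
    return acc;
}
```

For the values {1e20, 1.0, -1e20}, the order 0, 1, 2 gives 0.0 because the 1.0 is absorbed by 1e20, while the order 0, 2, 1 gives 1.0. An implementation could avoid this by having every rank apply the same combination order, or by reducing on one rank and distributing that result. These are possible approaches, not a statement about what any existing implementation does.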