Some clarifications of global reduction operations needed #44

Open
krzikalla opened this issue Aug 15, 2018 · 5 comments


@krzikalla
Collaborator

In sec. 11.3 there is the following sentence:

Starting a reduction operation for the same group in a separate thread before previously invoked operation is finished on all processes of the group is not allowed and yields undefined behavior.

  1. "in a separate thread" is not needed here, since it is also undefined behavior to do it like this:

gaspi_allreduce(..., myGroup, GASPI_TEST);
gaspi_allreduce(..., myGroup, GASPI_BLOCK);

Here both reduction operations are started in the same thread and potentially overlap.

  2. I don't think that "is finished on all processes of the group" is intended, since then you cannot write

gaspi_allreduce(..., myGroup, GASPI_BLOCK);
gaspi_allreduce(..., myGroup, GASPI_BLOCK);

since the local process doesn't know whether all other processes have already left the first gaspi_allreduce. You would have to place a barrier in between the two calls.

Another issue: it should be guaranteed that the results of the reduction are bitwise equal across all ranks of the group.
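
The bitwise-equality concern stems from floating-point addition not being associative: ranks that combine the same contributions in a different order can end up with different bits. The following is a minimal, self-contained sketch of that effect. It makes no GASPI calls, and `sum_left_to_right`, `sum_right_to_left` and `bitwise_equal` are hypothetical helper names chosen for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: accumulate the contributions front to back,
// as one rank might do.
double sum_left_to_right(const std::vector<double>& v)
{
  double acc = 0.0;
  for (std::size_t i = 0; i < v.size(); ++i) acc += v[i];
  return acc;
}

// Hypothetical helper: accumulate the same contributions back to front,
// as another rank might do if the reduction order is not fixed.
double sum_right_to_left(const std::vector<double>& v)
{
  double acc = 0.0;
  for (std::size_t i = v.size(); i-- > 0;) acc += v[i];
  return acc;
}

// Compare the object representations rather than the values.
bool bitwise_equal(double a, double b)
{
  return std::memcmp(&a, &b, sizeof(double)) == 0;
}
```

With the contributions `{1e20, -1e20, 1.0}` the front-to-back sum keeps the `1.0`, while the back-to-front sum loses it to rounding, so `bitwise_equal` reports a mismatch. A guarantee of bitwise-equal results therefore constrains how an implementation may order or distribute the reduction.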

@mrahn
Collaborator

mrahn commented Aug 16, 2018

add 2): "is finished on all processes of the group" sounds okay, as otherwise the cleanup of the not-yet-completed operation might interfere with the one already started remotely. The postcondition of allreduce says "all group members have invoked the procedure", so indeed the barrier, or any other method of global synchronization, is required. Changing the postcondition to "all group members have completed the procedure" would push that barrier into the library. This internal barrier is overhead in case there is (implicit) global synchronization in between the calls to allreduce anyway. Maybe the specification needs a user advice in order to create awareness?
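
A rough sketch of the calling pattern this implies for applications, assuming blocking calls. `RecordingComm` is a hypothetical stand-in that only records the call order; in a real program its two members would forward to `gaspi_allreduce` and `gaspi_barrier` (or to whatever other global synchronization the application already performs).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the communication layer; it only records
// the order in which the operations are invoked.
struct RecordingComm
{
  std::vector<std::string> calls;
  void allreduce() { calls.push_back("allreduce"); }
  void barrier()   { calls.push_back("barrier"); }
};

// Two consecutive blocking reductions on the same group, separated by
// an explicit global synchronization.
template <class Comm>
void reduce_twice(Comm& comm)
{
  comm.allreduce();  // returns once all members have *invoked* the reduction
  comm.barrier();    // returns once all members have *left* the first reduction
  comm.allreduce();  // no member can still be inside the first reduction
}
```

If the application already synchronizes globally between the two reductions, the explicit barrier is redundant, which is the overhead argument made above against moving it into the library.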

@mrahn
Collaborator

mrahn commented Aug 16, 2018

add 1): Even though the timeout value is part of the parameters, two allreduce calls that differ only in the timeout value should be considered the same in order to allow patterns like:

timeout = TEST;
PROGRESS:
switch (allreduce (..., timeout))
{
case TIMEOUT:
  if (local_work())
  {
    do_local_work();
  }
  else
  {
    timeout = BLOCK;
  }
  goto PROGRESS;
case SUCCESS:
  break;
default:
  throw "BUMMER";
}

or

timeout = 1;
PROGRESS:
switch (allreduce (..., timeout))
{
case TIMEOUT: 
  timeout = std::min (timeout << 1, maximal_reasonable_timeout);
  goto PROGRESS;
case SUCCESS:
  break;
default:
  throw "Fieeep";
}

Any objections?
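
For readers who prefer a loop over `goto`, the second pattern could be sketched as below. This is only an illustration under stated assumptions: `Status` and `reduce_with_backoff` are hypothetical names, and `reduce` is a hypothetical callable that wraps the actual allreduce call with every argument except the timeout already fixed, so that repeated calls differ only in the timeout.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>

// Hypothetical result codes standing in for the library's return values.
enum class Status { Success, Timeout, Error };

// Re-invoke the same reduction with a doubling timeout until it completes.
// Returns the number of attempts that were needed.
int reduce_with_backoff(const std::function<Status(unsigned)>& reduce,
                        unsigned initial_timeout,
                        unsigned max_timeout)
{
  unsigned timeout = initial_timeout;
  for (int attempts = 1;; ++attempts)
  {
    switch (reduce(timeout))
    {
    case Status::Success:
      return attempts;
    case Status::Timeout:
      // Double the timeout, capped at max_timeout, without overflowing.
      timeout = (timeout > max_timeout / 2) ? max_timeout : timeout * 2;
      break;  // leaves the switch; the loop retries
    default:
      throw std::runtime_error("allreduce failed");
    }
  }
}
```

A caller would bind the send buffer, receive buffer, operation and group into `reduce` once and pass, for example, an initial timeout of 1 together with an application-chosen upper bound.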

@krzikalla
Collaborator Author

re 2: indeed, a user advice would be helpful. OpenSHMEM works the same way and has a dedicated section about that scenario. BTW, in OpenSHMEM the work memory is user-provided. This gives you the chance to use double buffering, which makes consecutive reduce calls possible.

@krzikalla
Collaborator Author

krzikalla commented Sep 14, 2018

re 1: The exciting question is: how is an invalid gaspi_allreduce call distinguished from a valid one while a gaspi_allreduce is already ongoing? If both have the same parameters (barring the timeout), then it is clearly valid. But what happens under the hood? Are the other parameters even used? Actually they are already there. That said, a call gaspi_allreduce_continue(timeout) could be helpful here. The user wouldn't need to store the parameters anymore just to ensure a valid call.
My initial claim still holds anyway under the assumption that the first call doesn't return with GASPI_SUCCESS and that both calls differ in some arguments. "in a separate thread" is not the property that distinguishes between valid and invalid calls.

@mrahn
Collaborator

mrahn commented Sep 18, 2018

The allreduce_continue can be done in the application (which has a wrapper anyway). Can you give an example where an application has no easy way to avoid multiple concurrent allreduce calls?
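
One way such an application-side wrapper might look is sketched here. It is not part of GASPI: `PendingAllreduce`, `resume` and `Status` are hypothetical names, and `call` is a hypothetical callable into which the application has bound every allreduce argument except the timeout, so a resumed call cannot accidentally differ from the original one.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <utility>

// Hypothetical result codes standing in for the library's return values.
enum class Status { Success, Timeout, Error };

// Hypothetical wrapper that remembers one reduction so the user can
// resume it by supplying only a new timeout.
class PendingAllreduce
{
public:
  explicit PendingAllreduce(std::function<Status(unsigned)> call)
    : call_(std::move(call)) {}

  // Start or resume the stored reduction with the given timeout.
  Status resume(unsigned timeout)
  {
    if (done_) throw std::logic_error("reduction already completed");
    const Status s = call_(timeout);
    if (s == Status::Success) done_ = true;
    return s;
  }

  // True once a call has returned Status::Success.
  bool done() const { return done_; }

private:
  std::function<Status(unsigned)> call_;
  bool done_ = false;
};
```

The application would keep one such object per group and call `resume` with a test-style timeout while local work remains, then with a blocking timeout. Whether the library itself should offer an equivalent call is the open question in this thread.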


2 participants