Some clarifications of global reduction operations needed #44

krzikalla · 2018-08-15T09:55:33Z

In sec. 11.3 there is the following sentence:

Starting a reduction operation for the same group in a separate thread before previously invoked operation is finished on all processes of the group is not allowed and yields undefined behavior.

"in a separate thread" is not needed here, since it is also undefined to do it like this:

gaspi_allreduce(..., myGroup, GASPI_TEST);
gaspi_allreduce(..., myGroup, GASPI_BLOCK);

Here both reduction operations are started in the same thread and potentially overlap.

I don't think that "is finished on all processes of the group" is intended, since then you cannot write

gaspi_allreduce(..., myGroup, GASPI_BLOCK);
gaspi_allreduce(..., myGroup, GASPI_BLOCK);

since the local process doesn't know whether all other processes have already left the first gaspi_allreduce. You would have to place a barrier in between the two calls.

Another issue: it should be guaranteed that the results of the reduction are bitwise equal across all ranks of the group.
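One reason bitwise equality is not automatic for floating-point data: addition is not associative, so two ranks that combine the same contributions in a different order can end up with different results. The following is a minimal sketch of that effect only. It assumes a hypothetical implementation in which ranks reduce in different orders, and the function names are made up for illustration.

```cpp
#include <cassert>
#include <vector>

// Sums the values strictly left to right, as a rank might do when it
// combines contributions in the order in which they arrive.
double sum_in_order(const std::vector<double>& values)
{
  double acc = 0.0;
  for (double v : values)
  {
    acc += v;
  }
  return acc;
}

// Hypothetical scenario: two ranks hold the same three contributions
// but combine them in a different order.
double result_on_rank_a() { return sum_in_order({1e20, 1.0, -1e20}); }
double result_on_rank_b() { return sum_in_order({1e20, -1e20, 1.0}); }
```

On rank A the 1.0 is absorbed by 1e20 before the cancellation, on rank B it is added after the cancellation, so the two results differ. A guarantee of bitwise equality would therefore constrain how an implementation orders or distributes the reduction.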

mrahn · 2018-08-16T08:26:47Z

ad 2): "is finished on all processes of the group" sounds okay, as otherwise the cleanup of the not-yet-completed operation might interfere with the one already started remotely. The postcondition of allreduce says "all group members have invoked the procedure", so the barrier, or some other method of global synchronization, is indeed required. Changing the postcondition to "all group members have completed the procedure" would push that barrier into the library. This internal barrier is pure overhead when there is already (implicit) global synchronization between the calls to allreduce. Maybe the specification needs user advice in order to create awareness?

mrahn · 2018-08-16T08:44:22Z

ad 1) Even though the timeout value is part of the parameters, two allreduce calls that differ only in the timeout value should be considered the same in order to allow patterns like:

timeout = TEST;
PROGRESS:
switch (allreduce (..., timeout))
{
case TIMEOUT:
  if (local_work())
  {
    do_local_work();
  }
  else
  {
    timeout = BLOCK;
  }
  goto PROGRESS;
case SUCCESS:
  break;
default:
  throw "BUMMER";
}

or

timeout = 1;
PROGRESS:
switch (allreduce (..., timeout))
{
case TIMEOUT: 
  timeout = std::min (timeout << 1, maximal_reasonable_timeout);
  goto PROGRESS;
case SUCCESS:
  break;
default:
  throw "Fieeep";
}

Any objections?
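For reference, the second pattern can be written as a compilable sketch without goto. Everything here is an assumption for illustration: Status and Timeout stand in for the GASPI return codes and timeout type, and reduce is a placeholder for an allreduce call whose other arguments are fixed.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <stdexcept>

// Hypothetical stand-ins for the GASPI return codes and timeout type.
enum class Status { Success, Timeout, Error };
using Timeout = unsigned long;

// Repeats the same reduction call and doubles the timeout after every
// Timeout result until max_timeout is reached. `reduce` is a placeholder
// for an allreduce call with all arguments except the timeout fixed.
// Returns the number of calls that were needed.
int reduce_with_backoff(const std::function<Status(Timeout)>& reduce,
                        Timeout max_timeout)
{
  assert(max_timeout >= 1);
  Timeout timeout = 1;
  for (int calls = 1;; ++calls)
  {
    switch (reduce(timeout))
    {
    case Status::Success:
      return calls;
    case Status::Timeout:
      timeout = std::min(timeout << 1, max_timeout);
      break;  // leaves the switch, the loop then retries
    default:
      throw std::runtime_error("allreduce failed");
    }
  }
}
```

The sketch only works if repeated calls that differ in nothing but the timeout count as the same operation, which is exactly the property proposed above.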

krzikalla · 2018-09-14T08:23:41Z

re 2: user advice would indeed be helpful. OpenShmem works the same way and has a dedicated section about that scenario. BTW, in OpenShmem the work memory is user-provided. This gives you the chance to use double-buffering, which makes consecutive reduce calls possible.
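The double-buffering idea can be sketched independently of any library: the application owns two work buffers and alternates between them, so that a new reduction never reuses the scratch memory the previous one may still be using on a remote process. The class below is a hypothetical helper, not part of OpenShmem or GASPI, and whether two buffers are sufficient depends on the synchronization guarantees of the library in use.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: owns two work buffers and hands them out in
// alternation, one per reduction call.
class AlternatingWorkBuffers
{
public:
  explicit AlternatingWorkBuffers(std::size_t elements)
    : buffers_{std::vector<double>(elements), std::vector<double>(elements)}
  {}

  // Returns the buffer to pass to the next reduction call and switches
  // to the other buffer for the call after that.
  std::vector<double>& next()
  {
    std::vector<double>& buffer = buffers_[index_];
    index_ = 1 - index_;
    return buffer;
  }

private:
  std::array<std::vector<double>, 2> buffers_;
  std::size_t index_ = 0;
};
```

Two consecutive calls to next() always return different buffers, and the third call returns the first buffer again.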

krzikalla · 2018-09-14T08:33:14Z

re 1: The exciting question is: how are an invalid and a valid gaspi_allreduce call during an already ongoing gaspi_allreduce distinguished? If both have the same parameters (barring the timeout), then it is clearly valid. But what happens under the hood? Are the other parameters even used? Actually they are already there. That said, a call gaspi_allreduce_continue(timeout) could be helpful here. The user would no longer need to store the parameters just to ensure a valid call.
My initial claim still holds anyway, under the assumption that the first call doesn't return GASPI_SUCCESS and that both calls differ in some arguments. "in a separate thread" is not the property that distinguishes valid from invalid calls.

mrahn · 2018-09-18T07:22:47Z

The allreduce_continue can be done in the application (which has a wrapper anyway). Can you give an example where an application has no easy way to avoid multiple concurrent allreduce calls?
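A minimal sketch of such an application-side wrapper, under stated assumptions: Status, Timeout, and PendingAllreduce are hypothetical names, and the stored callable stands in for a gaspi_allreduce call with every argument except the timeout already bound.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical stand-ins for the GASPI return codes and timeout type.
enum class Status { Success, Timeout, Error };
using Timeout = unsigned long;

// Hypothetical wrapper: binds all arguments except the timeout once, so
// that every later call is guaranteed to repeat the same reduction.
class PendingAllreduce
{
public:
  explicit PendingAllreduce(std::function<Status(Timeout)> call)
    : call_(std::move(call))
  {}

  // Starts the reduction on the first call and continues it on later
  // calls. Must not be called again once Success was returned.
  Status resume(Timeout timeout)
  {
    assert(!done_);
    Status status = call_(timeout);
    done_ = (status == Status::Success);
    return status;
  }

  bool done() const { return done_; }

private:
  std::function<Status(Timeout)> call_;
  bool done_ = false;
};
```

In an application, the callable would capture the remaining allreduce arguments and forward only the timeout, so the caller of resume cannot accidentally change them between attempts.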
