
Expose hardware fence capabilities #10

devreal opened this issue Apr 22, 2019 · 39 comments

@devreal

devreal commented Apr 22, 2019

Currently, the MPI standard does not provide any ordering guarantees for puts and gets unless remote completion is enforced between two operations. For atomic operations, only accesses to the same or overlapping memory regions are guaranteed to be ordered (and only if origin and target are the same).

As far as I can see, many (most?) high-performance networks provide the notion of a fence to inject an ordering request into a stream of operations, i.e., to ensure that one operation is completed before the next is performed without having to wait for remote completion at the origin.

Example:

a = 0;
PUT(1->win, 1);
FENCE();
GET(1->win, &a);
FLUSH();
assert(a == 1);

This may be more efficient than having two flushes because we only need one round-trip instead of two, which is especially useful for latency-bound use cases (single-value atomics/reads/writes) and for ensuring memory consistency in frameworks built on top of MPI RMA.

Libraries such as OpenSHMEM and UCX offer this functionality.

Hence my question: has this been discussed (and dismissed?) in the past? Is there a reason not to add something like MPI_Win_order (since MPI_Win_fence is already taken) that triggers a hardware fence where available? On networks that don't support fences in hardware, implementations can always fall back to a full flush, which is what the user would otherwise have to do today.
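
To make this concrete, here is a minimal sketch in MPI of the example above, assuming a passive-target epoch (e.g., MPI_Win_lock_all), with target and disp standing in for the target rank and displacement; MPI_Win_order is the hypothetical call suggested here, and today the marked line would have to be a full MPI_Win_flush:

    int one = 1, a = 0;
    MPI_Put(&one, 1, MPI_INT, target, disp, 1, MPI_INT, win);
    /* Hypothetical call (not in MPI today): inject an ordering point so the
       following get observes the put. With the current API this line would
       have to be MPI_Win_flush(target, win), a full remote-completion
       round-trip. */
    MPI_Win_order(target, win);
    MPI_Get(&a, 1, MPI_INT, target, disp, 1, MPI_INT, win);
    MPI_Win_flush(target, win);   /* completes both operations */
    assert(a == 1);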

@jeffhammond

jeffhammond commented Apr 22, 2019 via email

@pavanbalaji

No, the proposed approach is more flexible. For example,

for (1000 loops)
   MPI_Put
MPI_Win_order
for (1000 loops)
   MPI_Put

... can be more efficient than simply using accumulates, for which all operations have to be ordered.

@hjelmn

hjelmn commented Apr 22, 2019

I agree with this proposal. This brings me back to one of my pet peeves about MPI_Win_fence(): it is a fence + barrier operation. I wish it had been deprecated in MPI-3.0.

The question is what do we call this new operation given the obvious name is taken?

@jeffhammond

jeffhammond commented Apr 22, 2019 via email

@pavanbalaji

I think for 1000 operations you'll surely see some difference, at least with MPICH. If not, you can replace that with 100 or 500 or whatever number makes sense for your network. Moreover, for networks that natively provide ordered communication, the difference is even higher. The point is very simple: accumulate gives ordering for all operations; this proposal gives more fine-grained control over that ordering.

One aspect that the proposal does not explicitly mention (but I think is implied) is that of atomicity. Presumably, the operations after the MPI_Win_order call target the same memory locations as those before the MPI_Win_order (otherwise you wouldn't need ordering anyway). So this proposal allows the user to tell MPI that everything before MPI_Win_order is nonatomic.

@devreal

devreal commented Apr 23, 2019

There is another use case for fences that may be more intuitive and cannot be realized with the ordering guarantees of RMA atomics per se: any algorithmic requirement on the order in which different variables should be modified. An example I'm currently working on is a signalling scheme where a process writes to one or more variables on another process and eventually sets a flag to signal completion of the modifications:

Process 0:

PUT(1->win, val_offset, val);
FENCE(1->win);
PUT(1->win, flag_offset, 1);
FLUSH(1->win);

Process 1 (waiting for Process 0 to complete, flushes omitted):

while (GET(1->win, flag_offset) == 0) { }
GET(1->win, val_offset);

This can easily be extended to more complex scenarios (where you couldn't just replace the synchronization through flag_offset with a send/recv or barrier). If something like the above code is executed repeatedly in a critical part of the application, the latency of a second flush (instead of the fence) on Process 0 really adds up.
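
For reference, a minimal sketch of what Process 0 has to do today with the existing API (val and flag are local uint64_t variables, val_disp and flag_disp the corresponding displacements, win in a passive-target epoch); the first flush exists solely to order the two puts:

    /* Process 0: publish the value, then raise the flag. */
    MPI_Put(&val, 1, MPI_UINT64_T, 1, val_disp, 1, MPI_UINT64_T, win);
    MPI_Win_flush(1, win);   /* remote completion, needed only for ordering;
                                a fence/order call would avoid this round-trip */
    MPI_Put(&flag, 1, MPI_UINT64_T, 1, flag_disp, 1, MPI_UINT64_T, win);
    MPI_Win_flush(1, win);   /* make both updates visible before continuing */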

AFAICS, a flush with remote completion should really only be required in two scenarios:

  1. If there are data dependencies between two operations at the origin, i.e., operation Op2 has the result of a previous Op1 as input (e.g., fetching an offset in Op1 to write to in Op2); or
  2. If completion at the target has to be ensured before some out-of-band (read: non-RMA) signalling is done, e.g., before entering a barrier or sending a message.

I will try to come up with a benchmark to see whether there is a significant difference (I'd be surprised/disappointed if there wasn't; what would a fence be for otherwise? ^^).

@jeffhammond

@devreal You're reimplementing the SHMEM send-recv pattern, which is best implemented in MPI using MPI_Send and MPI_Recv. Put-with-notification paired with spinning on a read is overrated as a message-passing implementation.

I can't remember right now, but I also thought that it is undefined whether MPI_Get can read a memory location that may be updated concurrently. When I write such code, I always use MPI_Fetch_and_op(NO_OP).
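
For illustration, a minimal sketch of that idiom (names such as flag_disp and win are illustrative): the waiting process polls the flag with an atomic read via MPI_NO_OP instead of a plain MPI_Get:

    /* Atomically read a remote flag that may be updated concurrently. */
    uint64_t flag = 0, dummy = 0;
    do {
        MPI_Fetch_and_op(&dummy, &flag, MPI_UINT64_T, 1, flag_disp,
                         MPI_NO_OP, win);
        MPI_Win_flush(1, win);   /* complete the read before testing it */
    } while (flag == 0);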

@jeffhammond

@pavanbalaji I meant whether you'd notice fence vs flush in the middle of 1000 RMA ops. There are two scenarios to optimize for: latency and throughput. Accumulate ops probably minimize latency whereas non-atomic RMA plus flush optimizes for throughput. Sure, they aren't trivially interchangeable inside of OSHMPI, but a proper MPI application can use both depending on need.

@devreal

devreal commented Apr 23, 2019

@jeffhammond As I said, that is a trivial example. Consider all processes doing similar data exchange with random targets. Blocking send/recv won't get you very far because you easily end up with a pair of processes stuck in send. Once you use non-blocking sends and recvs you start spending quite some time testing for completed requests. I'm working on a paper that explores the merits of using RMA for this type of communication. Latency is not one of them (at least in its current state), throughput actually is on some systems.

Apart from that, in the PGAS framework I'm working on (DASH; based on MPI RMA as a communication backend) we are forced to finish pretty much every put with a remote flush (except if the users know what they are doing and tell us so) because the next put or get might point to the same remote memory location and we have to ensure correct ordering. Tracking accesses to individual global memory locations is not an option for us. Having support for ordered RMA without forcing remote completion would be tremendously helpful. Sure, serializing RMA accesses that way might incur some performance penalty compared to non-ordered puts (additional function call, serialized processing, ordering overhead), but I expect it to be significantly faster than what we have to do right now. (Of course we can discuss whether single-element global memory access is worth it after all, but that's beside the point here.)

Plus, many systems offer this feature so why not provide access to it?

I can't remember right now, but I also thought that it is undefined whether MPI_Get can read a memory location that may be updated concurrently.

The result is undefined in that you can end up reading partial updates. That shouldn't be a problem if you flip a value from zero to one. And the actual implementation indeed uses atomics because there may be more than one writer. The ordering issue I tried to show remains though.

@jdinan

jdinan commented Apr 23, 2019

@devreal Having spent a fair amount of time working on both MPI RMA and OpenSHMEM, I can tell you that (1) fence is a concept worth having in MPI RMA and (2) the performance benefit is network dependent. On ordered networks like InfiniBand, the fence is effectively a no-op (yay). On most unordered networks, the fence is effectively the same thing as a flush (oh...). If you happen to have triggered ops, you can do something more creative, but there are few such networks.

I'm all for adding this to MPI RMA, but bear in mind the performance caveats -- if you have a network that is good at PGAS (i.e. reliable, unordered datagrams) there's a good chance it will be bad at fence.

FWIW, you can use MPI accumulates if you want ordering. This works fine for scalar operations, but most networks have lower throughput for vector atomics versus put/get.
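
For example, a minimal sketch (illustrative names) of the original put/get example expressed with accumulates, relying on the default accumulate_ordering of rar,raw,war,waw so that no intermediate flush is needed:

    /* Ordered, element-wise atomic write then read of the same location. */
    uint64_t one = 1, dummy = 0, a = 0;
    MPI_Accumulate(&one, 1, MPI_UINT64_T, target, disp, 1, MPI_UINT64_T,
                   MPI_REPLACE, win);
    /* raw ordering (default accumulate_ordering) guarantees this read
       observes the accumulate above, without an intermediate flush. */
    MPI_Get_accumulate(&dummy, 1, MPI_UINT64_T, &a, 1, MPI_UINT64_T,
                       target, disp, 1, MPI_UINT64_T, MPI_NO_OP, win);
    MPI_Win_flush(target, win);
    assert(a == 1);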

@jeffhammond

jeffhammond commented Apr 24, 2019

We let RMA users turn off ordering in accumulate. Why not let them turn off atomicity as well? We'd have to change a nontrivial amount of text but the end result would be what appears to be desired here, and perhaps a more efficient implementation of it, since it would not require an additional function call to prescribe ordered put and get.

Feature      | ordered          | unordered
atomic       | accumulate ops   | accumulate ops + info
not atomic   | put/get + flush  | put/get

Performance  | ordered | unordered
atomic       | 🙂      | 🤔
not atomic   | 😕      | 🙂

@pavanbalaji

What does ordering mean without atomicity for a single memory location?

I think @devreal's use case is for multiple separate memory locations. So, in some sense, it's orthogonal to PUT/GET vs. ACC/GACC.

@pavanbalaji

pavanbalaji commented Apr 24, 2019

I think the point of this proposal is that FLUSH does two things: (1) ensures operations after the FLUSH are ordered with operations before FLUSH and (2) makes sure the data is visible to other processes. This proposal is to decouple them, so the user can get just (1) without (2), if needed. This is independent of ACC/GACC vs. PUT/GET.

@jdinan

jdinan commented Apr 24, 2019

Thanks, I had missed that. This is much stronger than what shmem_fence guarantees. shmem_fence only guarantees that the targeted process will see the operations in order. That is, point-to-point, not global ordering.

@pavanbalaji

I believe this proposal is point-to-point ordering too, i.e., a subset of FLUSH, not FLUSH_ALL.

@jeffhammond

What does ordering mean without atomicity for a single memory location?

It means that the updates from a single source have a well-defined ordering, which produces the same results as ordered atomic updates because the updates are not concurrent. It does not imply that updates from multiple sources are defined, in contrast to atomic operations.

This behavior is equivalent to non-atomic store instructions that execute in-order.

I think @devreal's use case is for multiple separate memory locations.

That isn't stated so I cannot assume it. It is completely reasonable to want updates that appear in order but are not atomic when the user knows that any element is only updated by a single source.

@pavanbalaji

What does ordering mean without atomicity for a single memory location?

It means that the updates from a single source have a well-defined ordering, which produces the same results as ordered atomic updates because the updates are not concurrent. It does not imply that updates from multiple sources are defined, in contrast to atomic operations.

Note that I said to a single memory location. In that case, the data will be corrupted. Why would one need ordering if the data is corrupted anyway?

This behavior is equivalent to non-atomic store instructions that execute in-order.

FWIW, store instructions are ordered only on some architectures, such as x86. But that's also an orthogonal discussion.

I think @devreal's use case is for multiple separate memory locations.

That isn't stated so I cannot assume it. It is completely reasonable to want updates that appear in order but are not atomic when the user knows that any element is only updated by a single source.

I was just clarifying what I believe to be the intent of the proposal based on the discussion. You are right that the proposal is not defined well enough to actually confirm that.

@devreal

devreal commented Apr 29, 2019

@jeffhammond @pavanbalaji You're right, I didn't mean to put up a full proposal but rather meant to ask whether this topic is of interest (it seems at least controversial ^^).

I put together a proposal as an extension to the MPI standard document (page 476, Ctrl+F for "ticket10").

Let me clarify what semantics I want to achieve. From the proposal:

MPI_WIN_ORDER ensures that all previously issued RMA operations initiated by the calling process to the target rank on the specified window complete at the target before operations issued following this call.

I want to cover both cases that were discussed here:

  1. Ordering writes and reads to the same memory location at the same target (my original example). Yes, this can be done using atomics. However, my intent is not to have atomic semantics, i.e., I know there won't be concurrent writes.

  2. Ordering writes to non-overlapping memory locations at the same target (my second example: #10 (comment)). This cannot be done using atomic operations but would be possible with fence semantics.

Both cases can be solved using flushes that wait for remote completion. However, there are valid use cases where only partial ordering is required and a flush with remote completion can be deferred until a later point, when all operations have been posted. In essence, I can communicate my intent to the MPI implementation and let it decide the best strategy for achieving that on the current platform. If the best strategy happens to be a flush, so be it (see the advice to implementors in the annotated standard).

Also to clarify: there is no notion of global ordering, the ordering is always only guaranteed for operations from the same origin to the same target (on the same window).


As promised, I did some measurements using OpenSHMEM: two put operations to distinct memory locations on the same target PE, required to be ordered, completed by a quiet:

      long val = 0;
      long* target = shmem_malloc(2 * sizeof(long));
      shmem_long_put(target, &val, 1, target_pe);
      shmem_fence();   /* or shmem_quiet() */
      shmem_long_put(target + 1, &val, 1, target_pe);
      shmem_quiet();

The ordering is either guaranteed by a fence or quiet. I measured 100k repetitions and the values were reproducible across several runs.

On a Cray XC40 using Cray's shmem implementation, I do not observe any difference between fence and quiet. Unfortunately, I wasn't able to get either the OpenSHMEM reference implementation or Open MPI's shmem implementation to work in finite time (Open MPI + UCX went up in flames, and the reference implementation comes with a tail of dependencies I haven't bothered installing yet).

However, on a Mellanox ConnectX-3 cluster using Open MPI + UCX, the difference between fence and quiet is almost 1.8x, i.e., calling shmem_fence between the two operations instead of shmem_quiet reduces the time for the two operations from 4.1us to 2.2us.

That seems to confirm what @jdinan said about the differences in networks (or is it just that Cray's implementation is bad at fences?)


@jdinan I took a closer look at the OpenSHMEM standard, which explicitly states that non-blocking get operations are not included in the ordering guarantees of a fence and that a fence only guarantees delivery, not completion. I don't see these limitations explicitly stated for UCX or ibverbs. Any insight into why OpenSHMEM provides weaker guarantees here?

@jeffhammond

What does ordering mean without atomicity for a single memory location?

It means that the updates from a single source have a well-defined ordering, which produces the same results as ordered atomic updates because the updates are not concurrent. It does not imply that updates from multiple sources are defined, in contrast to atomic operations.

Note that I said to a single memory location. In that case, the data will be corrupted. Why would one need ordering if the data is corrupted anyway?

This comment makes no sense. If writes are ordered, data is not corrupted, because they happen in a well-defined sequence. Obviously, this only works for a single source, but that is what we are discussing.

@jeffhammond

@devreal I believe that Cray's implementation of fence is quiet, because they use end-to-end rather than link-level reliability. This is merely speculation however - only somebody from Cray can speak with authority on what their proprietary blobs do.

@pavanbalaji

If writes are ordered, data is not corrupted, because they happen in a well-defined sequence.

That's a reasonable requirement. If that was what was proposed, then I missed it.

@jdinan

jdinan commented May 1, 2019

@devreal OpenSHMEM fence is intended to be a write fence. Ordering for gets remains an open topic.

@devreal

devreal commented May 2, 2019

@jdinan Is that because it makes the implementation of fence easier/feasible on more architectures? Should we carry this restriction over to MPI?

@jeffhammond

jeffhammond commented May 2, 2019 via email

@jdinan

jdinan commented May 2, 2019

It's common to be able to order/complete reads and writes separately. I don't think separating them is harmful, as long as it doesn't leave gaps in the memory model. There is an assumption in SHMEM that blocking Gets are ordered (because they complete in program order). If you implement Get operations using ld/st instructions on a weakly ordered architecture, the operations may not be ordered. This has forced some implementations to add a read fence to Gets, which is bad for performance.

@jeffhammond

@jdinan Yes, these should be separable. Something like MPI_Win_fence assertions should do the trick.

@devreal

devreal commented May 3, 2019

@jdinan @jeffhammond I was just about to add an assert parameter that accepts something like MPI_MODE_NOGET to limit the ordering to puts and atomics. However, Section 11.5.5 states:

The assert argument does not change program semantics if it provides correct information on the program

AFAICS, the assert MPI_MODE_NOGET would change program semantics. Turning things around and allowing the user to pass MPI_MODE_NOPUT and/or MPI_MODE_NOACC (tbd) to MPI_WIN_ORDER to signal that there have not been preceding modifications would allow the implementation to avoid the read fence and still provide ordering with succeeding puts. It's not the same as limiting MPI_WIN_ORDER to put and atomics.

As an alternative, we could add a fence_type parameter (similar to lock_type for MPI_Win_lock) that determines whether MPI_WIN_ORDER includes get or not. In that case we could have MPI_ORDER_FULL and MPI_ORDER_NOGET. Unfortunately, MPI already defines MPI_ORDER_C and MPI_ORDER_FORTRAN so we have to find another namespace...

@pavanbalaji

The semantics of these asserts need to be clearly stated. Does MPI_MODE_NOGET mean that I cannot use MPI_GET in my program, or does it mean that the MPI_GET calls are not ordered? If it's the latter, is that only for ordering with respect to other MPI_GET operations or for all operations?

As an alternative approach, would MPI_WIN_IFLUSH solve the same issue?

@devreal

devreal commented May 3, 2019

The semantics of these asserts need be clearly stated. Does MPI_MODE_NOGET mean that I cannot use MPI_GET in my program, or does it mean that the MPI_GET calls are not ordered? If it's the latter, is that only for ordering with respect to other MPI_GET operations or all operations?

MPI_MODE_NOGET would mean that the ordering request does not apply to preceding and succeeding MPI_GET calls but only affects put and atomic operations. The user is still allowed to use MPI_GET before and after the call. I'm not sure there is much use for ordering MPI_GET operations independently of put/atomic operations. MPI_MODE_NOGET isn't strictly an assertion but a request, something I didn't immediately think about. Alternatively, MPI_MODE_NOPUT could be used to signal that there have not been conflicting put/accumulate operations that would require a memory read fence on some architectures. Maybe that is enough already?

Would MPI_WIN_IFLUSH trigger a flush and return a request that can be used to wait for all previously posted operations to complete? Would it imply ordering relative to operations posted after the call to MPI_WIN_IFLUSH? In principle, this seems like an interesting concept, also in light of the discussion on request-based atomic operations (mpi-forum/mpi-issues#128). It seems a bit heavier and involves more work at the user level due to request handling, though.

@pavanbalaji

If we need to order future GETs with previous PUTs, for example, does this proposal require us to force ordering for all operations?

@devreal

devreal commented May 5, 2019

@pavanbalaji I'm not sure I fully understand your question. Do you mean whether or not the proposal requires ordering of non-conflicting PUTs and GETs? In its current form yes: all previous PUTs would have to complete at the target before any future GETs may read their value from the target.
The current wording is mostly borrowed from OpenSHMEM, whose fence doesn't cover GETs, so that hasn't been an issue there...

I guess this can be relaxed to only require ordering of conflicting PUTs and GETs. After all, GETs do not observe the outcome of non-conflicting PUTs...

@hjelmn

hjelmn commented May 5, 2019

Should we make it a flag that controls what the fence applies to? Make the flags bitwise-combinable: MPI_ORDER_READ, MPI_ORDER_WRITE. That would be useful for shared memory.

@pavanbalaji

@devreal My point is that your proposed hints do not have read-after-write or write-after-read ordering. They only have write-after-write (i.e., put-order-put) and read-after-read (i.e., get-order-get). I don't know if they are needed, but I'm just calling out the holes so we can make an explicit decision to either add them or ignore them.

@devreal

devreal commented May 6, 2019

@hjelmn Is it a problem that MPI_ORDER_C already exists, i.e., that there is somewhat of a namespace clash?

@pavanbalaji @hjelmn Is it sufficient to control whether GETs are included in the ordering (as per @hjelmn's suggestion), or is more fine-grained control similar to the accumulate_ordering info key more helpful (bitwise-combinable MPI_ORDER_RAW, MPI_ORDER_WAR, MPI_ORDER_WAW, MPI_ORDER_RAR)? Also, would passing MPI_ORDER_READ or MPI_ORDER_RAR alone be anything other than a no-op?

@pavanbalaji

I generally like having more control over the actual guarantees provided (so more options). But I was merely asking the question, not suggesting you change your proposal one way or the other.

@devreal

devreal commented May 6, 2019

@pavanbalaji I understand, and your question pointed at something I hadn't considered :) The current wording with regard to GETs is too strict, and there seems to be common interest in adding fine-grained control. I'd like to reach a consensus on what that control should look like :)

@pavanbalaji

pavanbalaji commented May 6, 2019

Also, would passing MPI_ORDER_READ or MPI_ORDER_RAR alone be anything else but a noop?

I don't think this would be a no-op. Consider this, admittedly convoluted, example:

All processes create a window with no accumulate ordering. Initial value of X (located on process P2) is 0.

P0:
    a = GET_ACCUMULATE(x, NO_OP)
    WIN_ORDER(RAR)
    b = GET_ACCUMULATE(x, NO_OP)
    WIN_FLUSH
    if (b == 0)
        assert(a == 0);

P1:
   ACCUMULATE(x = 1)
   WIN_FLUSH

@devreal

devreal commented May 8, 2019

After skimming through the documentation of existing memory fences I updated the proposal to include an order_type (as proposed by @hjelmn) that is the bit-wise OR combination of one or more of MPI_WIN_ORDER_READ, MPI_WIN_ORDER_WRITE, and MPI_WIN_ORDER_DATA, plus MPI_WIN_ORDER_FULL, which is a shorthand for all three.

MPI_WIN_ORDER_READ requires completion of all previously issued reads at the target before reads issued following the call. This is similar to an lfence.

MPI_WIN_ORDER_WRITE requires completion of all previously issued RMA writes at the target before writes issued following the call. This is similar to an sfence.

MPI_WIN_ORDER_DATA requires the ordering of reads and writes on overlapping memory regions, similar to the default for accumulate operations but including non-atomic operations. This should not need special instructions on most shared memory architectures.

MPI_WIN_ORDER_FULL is similar to an mfence.

For inter-node communication, all ordering types will require a fence in network hardware (where available) or some other form of ordering mechanism.
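
To illustrate, here is a usage sketch for the earlier flag-signalling example, assuming a signature along the lines of MPI_Win_order(order_type, target_rank, win); none of these calls or constants exist in MPI today:

    /* Proposed: order the value put before the flag put with a write fence only. */
    MPI_Put(&val, 1, MPI_UINT64_T, 1, val_disp, 1, MPI_UINT64_T, win);
    MPI_Win_order(MPI_WIN_ORDER_WRITE, 1, win);   /* proposed call, sfence-like */
    MPI_Put(&flag, 1, MPI_UINT64_T, 1, flag_disp, 1, MPI_UINT64_T, win);
    MPI_Win_flush(1, win);   /* single remote completion at the end */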

The proposal now also contains two examples, a rationale for the order types, and an adapted Section 11.7.2 (Ordering):
mpi-report.pdf

Limitations

  • The ordering can only be enforced on the same window, not across window boundaries to the same target process.
  • Load/store operations to local window memory (performed by the application) are explicitly excluded from the ordering. I'm not quite sure whether that is reasonable but my intuition tells me that otherwise additional memory barriers might be required.

Any feedback would be much appreciated :)

@devreal

devreal commented May 10, 2019

I updated the draft document:
mpi-report-memalign.pdf

I am not sure about the exclusion of loads and stores from the ordering. Including them might be helpful for users of shared memory windows, but it would prevent implementations from deferring the memory fence to the next put/get to/from shared-memory targets. Opinions?

Unless there are objections or concerns about the current design, I'd like to put this into a PR in time for a first reading in Chicago. Please let me know if you think that's premature.
