Expose hardware fence capabilities #10
I would just write this:
ACCUMULATE(op=REPLACE)
GET_ACCUMULATE(op=NOOP)
This is ordered, and the implementation uses the best method it can for ordering. If there's a hardware fence operation (there almost certainly isn't), it can inject that after the ACCUMULATE or before the GET_ACCUMULATE.
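A rough sketch of how this maps onto existing MPI-3 calls (the wrapper function, buffer names, and the assumption of an open passive-target epoch are mine, for illustration):

```c
#include <mpi.h>
#include <stdint.h>

/* Ordered remote update followed by an ordered read of the same target
 * location, using only existing MPI-3 calls: with the default
 * accumulate_ordering, same-location atomics from one origin are ordered. */
void ordered_update_and_read(const uint64_t *value, uint64_t *result,
                             int target, MPI_Aint disp, MPI_Win win)
{
    /* "PUT": an accumulate with MPI_REPLACE simply overwrites the target. */
    MPI_Accumulate(value, 1, MPI_UINT64_T, target, disp, 1, MPI_UINT64_T,
                   MPI_REPLACE, win);

    /* "GET": a get-accumulate with MPI_NO_OP only reads the target
     * (the origin buffer arguments are ignored for MPI_NO_OP). */
    MPI_Get_accumulate(NULL, 0, MPI_UINT64_T, result, 1, MPI_UINT64_T,
                       target, disp, 1, MPI_UINT64_T, MPI_NO_OP, win);

    /* Local completion only; no remote completion is needed for the ordering. */
    MPI_Win_flush_local(target, win);
}
```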
No, the proposed approach is more flexible. For example, ... can be more efficient than simply using accumulates, which has to order all operations.
I agree with this proposal. This brings me back to one of my pet peeves about MPI_Win_fence(): it is a fence + barrier operation. I wish it had been deprecated in MPI-3.0. The question is: what do we call this new operation, given that the obvious name is taken?
No, the proposed approach is more flexible. For example,
```
for (1000 loops)
    MPI_Put
MPI_Win_order
for (1000 loops)
    MPI_Put
```
Can the difference between order and flush be observed reliably relative to 2000 RMA operations?
I'm not sure you'll even be able to see a difference in shared memory with an appropriately weakly ordered CPU like POWER. How many cycles of difference between msync and mfence (or whatever the instructions are called)? How many cycles for loads from memory (not cache)?
To be clear, we should not add features to the MPI standard to optimize SHMEM microbenchmarks.
... can be more efficient than simply using accumulates, which has to order all operations.
It has to order the operations that are used on the window in question. Add an info key to subset the operations per window if that would help.
I think for 1000 operations you'll surely see some difference, at least with MPICH. If not, you can replace that with 100 or 500 or whatever number makes sense for your network. Moreover, for networks that natively provide ordered communication, the difference is even larger. The point is very simple: accumulate gives ordering for all operations; this proposal gives more fine-grained control over that ordering. One aspect that the proposal does not explicitly mention (but I think is implied) is atomicity. Presumably, the operations on either side of the ordering point are still regular, non-atomic puts and gets.
There is another use-case for fences that may be more intuitive and cannot be realized with the ordering guarantees of RMA atomics per se: any algorithmic requirement on the order in which different variables should be modified. An example I'm currently working on is a signalling scheme where Process 0 writes to one or more variables on another process and eventually sets a flag to signal completion of the modifications, while Process 1 waits for that flag before reading the modified data (flushes omitted); see the sketch below.
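A minimal sketch of the described pattern (MPI_Win_order is the operation proposed in this issue, not an existing MPI call; displacements, names, and the polling loop are illustrative assumptions):

```c
#include <mpi.h>

/* Process 0: write the payload, order it before the flag, then set the flag.
 * A single remote completion is deferred to the very end. */
void signal_target(const double *data, int n, int target,
                   MPI_Aint data_disp, MPI_Aint flag_disp, MPI_Win win)
{
    const int one = 1;
    MPI_Put(data, n, MPI_DOUBLE, target, data_disp, n, MPI_DOUBLE, win);
    MPI_Win_order(win);   /* hypothetical: payload must precede the flag */
    MPI_Put(&one, 1, MPI_INT, target, flag_disp, 1, MPI_INT, win);
    MPI_Win_flush(target, win);
}

/* Process 1: wait until the flag in its own window memory becomes non-zero
 * (flushes omitted; a real implementation would read the flag atomically). */
void wait_for_signal(volatile int *flag_in_window, MPI_Win win)
{
    while (*flag_in_window == 0)
        MPI_Win_sync(win);   /* refresh the local view in the separate memory model */
}
```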
This can easily be extended to more complex scenarios where you couldn't simply replace the synchronization. AFAICS, a flush with remote completion should really only be required in two scenarios. I will try to come up with a benchmark to see whether there is a significant difference (I'd be surprised/disappointed if there wasn't; what would a fence be for otherwise? ^^).
@devreal You're reimplementing the SHMEM send-recv pattern, which is best implemented in MPI using plain point-to-point (MPI_Send/MPI_Recv). I can't remember right now, but I also thought that it is undefined whether a flag written with a regular MPI_Put can be read reliably at the target, i.e., whether you might see a partial update.
@pavanbalaji I meant whether you'd notice fence vs flush in the middle of 1000 RMA ops. There are two scenarios to optimize for: latency and throughput. Accumulate ops probably minimize latency whereas non-atomic RMA plus flush optimizes for throughput. Sure, they aren't trivially interchangeable inside of OSHMPI, but a proper MPI application can use both depending on need.
@jeffhammond As I said, that is a trivial example. Consider all processes doing similar data exchange with random targets. Blocking send/recv won't get you very far because you easily end up with a pair of processes stuck in send. Once you use non-blocking sends and recvs you start spending quite some time testing for completed requests. I'm working on a paper that explores the merits of using RMA for this type of communication. Latency is not one of them (at least in its current state); throughput actually is on some systems. Apart from that, in the PGAS framework I'm working on (DASH, based on MPI RMA as a communication backend) we are forced to finish pretty much every put with a remote flush (except if the users know what they are doing and tell us so), because the next put or get might point to the same remote memory location and we have to ensure correct ordering. Tracking accesses to individual global memory locations is not an option for us. Having support for ordered RMA without forcing remote completion would be tremendously helpful. Sure, serializing RMA accesses that way might incur some performance penalty compared to non-ordered puts (additional function call, serialized processing, ordering overhead), but I expect it to be significantly faster than what we have to do right now. (Of course we can discuss whether single-element global memory access is worth it after all, but that's beside the point here.) Plus, many systems offer this feature, so why not provide access to it?
The result is undefined in that you can end up reading partial updates. That shouldn't be a problem if you flip a value from zero to one. And the actual implementation indeed uses atomics because there may be more than one writer. The ordering issue I tried to show remains, though.
@devreal Having spent a fair amount of time working on both MPI RMA and OpenSHMEM, I can tell you that (1) fence is a concept worth having in MPI RMA and (2) the performance benefit is network dependent. On ordered networks like InfiniBand, the fence is effectively a no-op (yay). On most unordered networks, the fence is effectively the same thing as a flush (oh...). If you happen to have triggered ops, you can do something more creative, but there are few such networks. I'm all for adding this to MPI RMA, but bear in mind the performance caveats -- if you have a network that is good at PGAS (i.e. reliable, unordered datagrams) there's a good chance it will be bad at fence. FWIW, you can use MPI accumulates if you want ordering. This works fine for scalar operations, but most networks have lower throughput for vector atomics versus put/get.
We let RMA users turn off ordering in accumulate. Why not let them turn off atomicity as well? We'd have to change a nontrivial amount of text but the end result would be what appears to be desired here, and perhaps a more efficient implementation of it, since it would not require an additional function call to prescribe ordered put and get.
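For reference, this is how the existing ordering knob is set today, together with a hypothetical atomicity knob along the lines suggested above (the accumulate_atomicity key does not exist in the standard):

```c
#include <mpi.h>

/* Allocate a window with accumulate ordering switched off (existing MPI-3
 * info key) and, hypothetically, with atomicity switched off as well. */
MPI_Win make_relaxed_window(MPI_Aint size, MPI_Comm comm, void **base)
{
    MPI_Info info;
    MPI_Win  win;

    MPI_Info_create(&info);
    MPI_Info_set(info, "accumulate_ordering", "none");   /* existing key */
    MPI_Info_set(info, "accumulate_atomicity", "none");  /* hypothetical key */
    MPI_Win_allocate(size, 1, info, comm, base, &win);
    MPI_Info_free(&info);
    return win;
}
```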
What does ordering mean without atomicity for a single memory location? I think @devreal's use case is for multiple separate memory locations. So, in some sense, it's orthogonal to PUT/GET vs. ACC/GACC.
I think the point of this proposal is that …
Thanks, I had missed that. This is much stronger than what …
I believe this proposal is point-to-point ordering too, i.e., a subset of …
It means that the updates from a single source have a well-defined ordering, which produces the same results as ordered atomic updates because the updates are not concurrent. It does not imply that updates from multiple sources are defined, in contrast to atomic operations. This behavior is equivalent to non-atomic store instructions that execute in-order.
That isn't stated, so I cannot assume it. It is completely reasonable to want updates that appear in order but are not atomic when the user knows that any element is only updated by a single source.
Note that I said to a single memory location. In that case, the data will be corrupted. Why would one need ordering if the data is corrupted anyway?
FWIW, store instructions are ordered only on some architectures, such as x86.
I was just clarifying what I believe to be the intent of the proposal based on the discussion. You are right that the proposal is not defined well enough to actually confirm that.
@jeffhammond @pavanbalaji You're right, I didn't mean to put up a full proposal but rather meant to ask whether this topic is of interest (it seems at least controversial ^^). I put together a proposal as an extension to the … Let me clarify what semantics I want to achieve. From the proposal: …

I want to cover both cases that were discussed here. Both cases can be solved using flushes waiting for remote completion. However, there are valid use-cases where only partial ordering is required and a flush with remote completion can be deferred to a later point, when all operations have been posted. In essence, I can communicate my intent to the MPI implementation and let it decide what the best strategy is to achieve that on the current platform. If the best strategy happens to be a flush, so be it (see the advice to implementors in the annotated standard). Also to clarify: there is no notion of global ordering; ordering is only ever guaranteed between operations from the same origin to the same target (on the same window).

As promised, I did some measurements using OpenSHMEM: two put operations to distinct memory locations on the same target PE, required to be ordered and completed by a final shmem_quiet:

```c
long val = 0;
long *target = shmem_malloc(2 * sizeof(long));

shmem_long_put(target, &val, 1, target_pe);
shmem_fence();   /* or shmem_quiet() in the second variant */
shmem_long_put(target + 1, &val, 1, target_pe);
shmem_quiet();
```

The ordering between the two puts is thus guaranteed either by a shmem_fence or by a shmem_quiet. On a Cray XC40 using Cray's shmem implementation, I do not observe any difference between the two variants. However, on a Mellanox ConnectX-3 cluster using Open MPI + UCX, the difference between shmem_fence and shmem_quiet is clearly visible. That seems to confirm what @jdinan said about the differences between networks (or is it just that Cray's implementation is bad at fences?).

@jdinan I took a closer look at the OpenSHMEM standard, which explicitly states that non-blocking get operations are not included in the ordering guarantees of a shmem_fence.
This comment makes no sense. If writes are ordered, data is not corrupted, because they happen in a well-defined sequence. Obviously, this only works for a single source, but that is what we are discussing.
@devreal I believe that Cray's implementation of fence is quiet, because they use end-to-end rather than link-level reliability. This is merely speculation however - only somebody from Cray can speak with authority on what their proprietary blobs do. |
That's a reasonable requirement. If that was what was proposed, then I missed it.
@devreal OpenSHMEM fence is intended to be a write fence. Ordering for gets remains an open topic.
@jdinan Is that because it makes the implementation of fence easier/feasible on more architectures? Should we carry this restriction over to MPI?
No, we should define order in a way that meets MPI’s goals of completeness and orthogonality. SHMEM is full of historical artifacts that only made sense on the Cray T3E or which don’t support the full range of one-sided communication usage models.
It's common to be able to order/complete reads and writes separately. I don't think separating them is harmful, as long as it doesn't leave gaps in the memory model. There is an assumption in SHMEM that blocking Gets are ordered (because they complete in program order). If you implement Get operations using ld/st instructions on a weakly ordered architecture, the operations may not be ordered. This has forced some implementations to add a read fence to Gets, which is bad for performance.
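A sketch of that problem for a shared-memory "Get" implemented as a plain load (names and the exact fence choice are illustrative):

```c
#include <stdatomic.h>
#include <stdint.h>

/* A "Get" from a shared-memory window implemented as a plain load. On a
 * weakly ordered CPU (e.g., POWER or Arm), consecutive loads to different
 * locations may be observed out of program order, so an implementation that
 * promises ordered blocking Gets has to add a read fence, which costs cycles. */
static inline uint64_t shm_get(const uint64_t *remote_addr)
{
    uint64_t v = *remote_addr;                  /* plain load */
    atomic_thread_fence(memory_order_acquire);  /* the read fence in question */
    return v;
}
```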
@jdinan Yes, these should be separable. Something like …
@jdinan @jeffhammond I was just about to add an … AFAICS, the assert … As an alternative, we could add a …
The semantics of these asserts need to be clearly stated. Does … As an alternative approach, would …
Would …
If we need to order future GETs with previous PUTs, for example, does this proposal require us to force ordering for all operations? |
@pavanbalaji I'm not sure I fully understand your question. Do you mean whether or not the proposal requires ordering of non-conflicting PUTs and GETs? In its current form, yes: all previous PUTs would have to complete at the target before any future GETs may read their value from the target. I guess this can be relaxed to only require ordering of conflicting PUTs and GETs. After all, GETs do not observe the outcome of non-conflicting PUTs...
Should we make what the fence applies to a flag? Make the flags bitwise: MPI_ORDER_READ, MPI_ORDER_WRITE. That would be useful for shared memory.
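At the call site that could look roughly like this (both the function and the flag constants are hypothetical):

```c
/* Hypothetical interface: order only what the caller asks for. */
void order_examples(MPI_Win win)
{
    MPI_Win_order(MPI_ORDER_WRITE, win);                  /* order puts only  */
    MPI_Win_order(MPI_ORDER_READ,  win);                  /* order gets only  */
    MPI_Win_order(MPI_ORDER_READ | MPI_ORDER_WRITE, win); /* order everything */
}
```

Note that MPI already uses the MPI_ORDER_ prefix for the array-ordering constants MPI_ORDER_C and MPI_ORDER_FORTRAN, so the final names would probably have to differ.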
@devreal My point is that your proposed hints do not have read-after-write or write-after-read ordering. They only have write-after-write (i.e., put-order-put) and read-after-read (i.e., get-order-get). I don't know if they are needed, but I'm just calling out the holes so we can make an explicit decision to either add them or ignore them.
@hjelmn Is it a problem that …? @pavanbalaji @hjelmn Is it sufficient to control whether GETs are included in the ordering (as per @hjelmn's suggestion), or is more fine-grained control similar to the … needed?
I generally like having more control over the actual guarantees provided (so more options). But I was merely asking the question, not suggesting you change your proposal to be one way or another.
@pavanbalaji I understand, and your question pointed at something I hadn't considered :) The current wording with regard to GETs is too strict, and there seems to be common interest in adding fine-grained control. I'd like to reach a consensus on how that control should look :)
I don't think this would be a no-op. Consider this, admittedly convoluted, example: all processes create a window with no accumulate ordering; the initial value of …
After skimming through the documentation of existing memory fences, I updated the proposal to include an … For inter-node communication, all ordering types will require a fence in network hardware (where available) or some other form of ordering mechanism. The proposal now also contains two examples, a rationale for the order types, and an adapted Section 11.7.2 (Ordering). Limitations: …

Any feedback would be much appreciated :)
I updated the draft document. I am not sure about the exclusion of loads and stores from the ordering: it might be helpful for users of shared memory windows, but it would prevent implementations from deferring the memory fence to the next put/get to/from shared-memory targets. Opinions? Unless there are objections or concerns about the current design, I'd like to put this into a PR in time for a first reading in Chicago. Please let me know if you think that's premature.
Currently the MPI standard does not provide any ordering guarantees for puts and gets unless remote completion is performed in between two operations. For atomic operations, only accesses to same/overlapping memory regions are guaranteed to be ordered (if origin and target are the same).
As far as I can see, many (most?) high-performance networks provide the notion of a fence to inject an ordering request into a stream of operations, i.e., to ensure that one operation is completed before the next is performed, without having to wait for remote completion at the origin. Example:
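A minimal sketch of the intended usage (MPI_Win_order is the proposed call, not part of MPI today; buffers and displacements are placeholders):

```c
/* Today: two flushes, i.e., two round-trips, to guarantee that the first
 * put is applied at the target before the second one. */
MPI_Put(&a, 1, MPI_DOUBLE, target, disp_a, 1, MPI_DOUBLE, win);
MPI_Win_flush(target, win);
MPI_Put(&b, 1, MPI_DOUBLE, target, disp_b, 1, MPI_DOUBLE, win);
MPI_Win_flush(target, win);

/* Proposed: inject an ordering point instead of the first flush, leaving a
 * single round-trip for remote completion. */
MPI_Put(&a, 1, MPI_DOUBLE, target, disp_a, 1, MPI_DOUBLE, win);
MPI_Win_order(win);   /* hypothetical */
MPI_Put(&b, 1, MPI_DOUBLE, target, disp_b, 1, MPI_DOUBLE, win);
MPI_Win_flush(target, win);
```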
This may be more efficient than having two flushes because we only have one round-trip instead of two, which is especially useful for latency-bound use-cases (single-value atomics/reads/writes) and to ensure memory consistency in frameworks built on top of MPI RMA. Libraries such as OpenSHMEM and UCX both offer this functionality.
Hence my question: has this been discussed (and dismissed?) in the past? Is there a reason not to add something like MPI_Win_order that triggers a hardware fence if available (since MPI_Win_fence is already taken)? On networks that don't support fences in hardware, implementations may always fall back to a full flush, which is what the user would do otherwise at the moment.