MPI 3.1/4 – how can progress be ensured with passive-target synchronization? #28
Comments
We've debated this topic for most of the years I've been in the MPI Forum, without making any progress on it (pun intentional). https://pmodels.github.io/casper-www/ was the pragmatic response from Argonne when I was there.
This is what most people use:

```c
int MPIX_Engage_progress(void)
{
    return MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```
It does, yes, but I get worried about a naive implementation of flush that has some communication overhead even when no RMA operations are pending.
In that case, might I request a note in a future MPI spec to state that in a passive communication epoch, even processes that only expect to be the targets of communication may need to periodically ensure MPI progress? The headline description of a passive target epoch (p588, v4.1RFC) states:
… which gives the vague impression that no matter the RMA calls executed by other processes, the process that owns the target window does not need to do anything to ensure their eventual completion. It'd be better yet if the specification could split the RMA calls into ones that may require progress on the target and ones that definitely do not, but I suspect that would be controversial.
Oh, that makes a lot of sense, and I can use an Iprobe version in the RMA context. I tend to forget that the …
It shouldn't be expensive. Checking a queue for emptiness is trivial. It's going to be less expensive than error checking, probably by a lot.
I think it would be useful to add an example of the use case you have, which is not uncommon, and show how to do it correctly.
Yes, but eventually is a long time 😉
I doubt it would be controversial. @Wee-Free-Scot and others have been looking at issues like this for a while.
The new (in MPI-4.1) subsection 2.9 in the Terms chapter defines "progress" in the MPI Standard for the first time. It only gives example procedures from point-to-point, but the rules stated there can be used to put RMA procedures into categories "must guarantee progress" and "others". @csubich Once you've had a chance to read that and to apply it to RMA, any remaining confusion or unresolved conflicts would be really useful feedback.
@Wee-Free-Scot I apologize for having missed your request for feedback earlier. I've come back to this project and I've had the opportunity to look at the new language. The extra language is nice, and having a black-and-white definition of 'progress' certainly helps. The definition of 'decoupled MPI activities' also helps explain the sense in which RMA calls might need help from a progress engine. One subtlety, however, is that the text permits an MPI implementation to not make progress on calls that might block but do not. For example, a process that calls … This leads to a perverse interpretation of 'progress' that (I think) is permitted by 4.1 but might still cause the above code to deadlock:
A minor extension to the specification to include a "must make RMA progress" call would be to permit …
The "perverse interpretations" that you mention are all intentional implementation flexibility. The canonical "kick the progress engine" call is a call to MPI_Iprobe with parameters that select a message that is never sent. Each call must return false, the repetition of such calls must be guaranteed to make progress (even though no individual call is required to make progress). Being required to make progress is MPI-wide -- so it includes RMA operations, even though the canonical procedure is from the point-to-point chapter. Given this, overloading the MPI_Win_test procedure in this way seems superfluous. Separately, it seems problematic because you can (although certainly shouldn't) mix the RMA synchronisation methods on the same window, so calling MPI_Win_test has an existing meaning. |
First, to take a step back, I'll reiterate that it is frustrating and deeply unintuitive that legal sequences of RMA calls might require non-RMA calls to avoid deadlock. I accept that's how the specification works, but the complete absence of "messages" from the 'RMA way' makes …

It doesn't help that this issue seems to be limited to passive-target synchronization. Both fence and PSCW contain blocking calls that will force progress, so the originating problem can't occur. Here, the obvious way of writing the passive-target code did lead to the problem, and I didn't ignore any bright signposts in the specification. I would have preferred things if …

Finally, if "a call to …" … This feels absurd. The semantic intent of "make progress" is fundamental to MPI itself, but I'm really contorting myself to find a way to just engage the progress engine (if necessary) without making additional assumptions about the MPI environment.
I don't truly like proposing such an extension, but it's the only way I can see to "test" using only MPI objects guaranteed to have meaning for the RMA code, without side-effects elsewhere.
I agree with much of your complaint and share much of your frustration. IMHO, it was a major step forward to include a definition of strong progress vs. weak progress and to state that MPI only requires weak progress. IMHO, the best fix would be to require strong progress for RMA passive-target operations, even if the MPI library only supports weak progress everywhere else. Tangential to that would be a query procedure that permitted the user to ask whether strong progress is supported (to permit the user to code around the limitations of weak progress only when absolutely needed). Your proposal of adding … Have you tried (and measured the performance of) JeffH's suggestion of using …?
I agree with this (mostly): passive target RMA really doesn't make much sense without strong progress. The exception is the lockall case - a subset of the uses here would work with weak progress. But that's rather ugly and probably not worth it.
The problem is that the specification desperately wants to assume strong progress for passive-target synchronization, IMO to the point that it gives a misleading impression about progress. Consider this bit from §12.7 (4.1):
… which describes the deadlock at issue here. The thin wedge of correctness comes in the "update to a public window copy" part, which is only guaranteed (earlier) by:
… which, when parsed very finely, does not give any guarantee that the origin's call to … The Rationale section on p616 starts to bring this problem to light, but on p617 it only warns about deadlock when using shared memory for loads:
From the implementation's point of view, a local load/store from a window seems to be equivalent to a shared memory load/store, thus permitting the deadlock in this issue. For fence and PSCW window synchronizations, the synchronization primitives enforce completion-or-blocking, but passive-target synchronization doesn't have a natural way to do this. The frustrating part is that I don't even really need proper, strong progress (although it'd be nice, certainly), just something to add to what would otherwise be busy-wait loops. The …

To summarize, my thesis statement is this: it is very surprising that a non-erroneous sequence of passive-target window operations, RMA calls, loads, and stores that always completes with strong progress can nonetheless deadlock with weak progress, absent explicit and seemingly extraneous calls to a progress-causing MPI routine from another chapter of the specification.
In my particular case with Intel-MPI, … For OpenMPI (4.1.2a1), … At the moment, I can't give good performance measurements of the cost of the latter. On the cluster I'm using, Open-MPI seems to randomly and intermittently take 10-40 ms to resolve the deadlock even with non-flush progress-inducers, so the impact of flush_all isn't easy to see. The system administrators there prefer (and have tuned) Intel-MPI, hence why I use it as the default. Besides that, however, …
NWChem has always worked without background progress. It just runs at half the speed because, on average, half the processes are waiting in Get while the other half are in DGEMM. Global Arrays in general does not deadlock in the absence of background progress although the performance can be terrible. The Casper project provides a way to get background progress in RMA without changing an implementation. It has been challenging to add strong progress for some networks without intrusive changes that implementations aren't willing to do.
We talked about this at the January 18, 2024 WG meeting: the outcome was that it would be nice to have a procedure that ensures progress of incoming RMA operations in passive target without the overhead of iprobe (triggers all other parts of MPI) or flush (waits for outgoing operations to complete) and the ability to become a no-op if no progress is needed. This could be a simple addition to the standard within the context of RMA and should be fairly straightforward to implement. We just need a name for it :)
This is a new semantic in MPI -- currently everything that ensures progress is required to ensure all types of progress, that is, progress for all of the pending MPI operations (equivalently, for all of the MPI decoupled activities) irrespective of the "type". That is not an argument against the proposed addition, but it is a note of caution regarding how tricky it might be to write a good definition of what the new procedure is required to do and what it is permitted not to do.
Example wording: …
Name: MPI_PASSIVE_PROGRESS?
We could name it … Alternatives, if we want to avoid …
Question: will it be required that the user must ensure the program is correct even if this new procedure is implemented as a no-op? So, if the user places calls to this new procedure judiciously and the implementation chooses to do something, then the correct program might perform better, but it is still correct regardless of these conditions?

Question: will the new procedure indicate anything to the caller, such as "something happened!" vs. "nothing yet" or "the number of passive target operations completed since the last window synchronization is now X"? Procedures like MPI_TESTALL indicate whether all the referenced operations (I'm thinking MPI_RPUT, etc.) are complete; MPI_WIN_TEST indicates whether all operations originating at the calling process are complete. Those are origin-side completion; this new procedure is target-side -- should it indicate completion at the target and, if so, of which operation(s)? Can I use it to ensure that a passive target operation has completed at the target (i.e., here) and then use the message that was put or re-use the memory from which a get was reading a message?

Summary: will this new procedure change user-visible semantics of RMA?
I'd rather just extend …
I agree that that could be easier. My concerns with extending the semantics of …
I'm not sure I fully understand. A program is incorrect if it does not complete due to lack of progress. This procedure is a tool to help passive target RMA programs ensure correctness by giving them a chance to process outstanding operations. The use of this procedure is required if there are no other well-defined (and obvious) means to progress decoupled RMA activities (e.g., …).
I'm not sure this could be done reliably. Some incoming operations may have been handled by a previous progress invocation (test/wait on a P2P request). Should they be counted? What happens if multiple threads end up in the progress engine even though they do different things? Clearly the information returned by a call to this facility cannot be relied upon to accurately count the number of completed incoming operations so the value of this information is limited. So I'd rather not even try to count them. If the application must know then there are ways to count on the application side (using atomic ops). |
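As a hedged illustration of the application-side counting mentioned here (assumed details: the counter lives at displacement 0 of the target's window, both ranks are already inside a passive-target access epoch such as MPI_Win_lock_all, and MPI_NO_OP is used for the atomic read):

```c
#include <mpi.h>

/* Origin side: after issuing its passive-target operation(s), atomically
 * bump a completion counter in the target's window. */
static void bump_counter(int target, MPI_Win win)
{
    const int one = 1;
    MPI_Accumulate(&one, 1, MPI_INT, target, /* disp = */ 0, 1, MPI_INT,
                   MPI_SUM, win);
    MPI_Win_flush(target, win);   /* complete the update at the target */
}

/* Target side: atomically read the local counter without modifying it. */
static int read_counter(int myrank, MPI_Win win)
{
    int dummy = 0, value = 0;
    MPI_Fetch_and_op(&dummy, &value, MPI_INT, myrank, /* disp = */ 0,
                     MPI_NO_OP, win);
    return value;
}
```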
A call to …
A subtle point: attaching this to … The minimum function necessary to address my top-of-thread issue is not a "do it now" progress, but a "do it eventually" progress. Calling …
However, the "in a loop" part might be an important restriction. An MPI implementation that offers background progress threads would be able to implement …
This suggests we could look at expanding the duties of the MPI_WIN_SYNC procedure, rather than the MPI_WIN_FLUSH procedure. MPI_WIN_SYNC is already a local procedure (indeed, an immediate procedure) and it is restricted to copying data that is already present within the calling MPI process. It is, perhaps, a small step to include "the queue of incoming passive target actions" in the "public window copy" concept.

In the separate model, we currently have an implementation differentiation between "data contained in an incoming put instruction that will affect a chunk of memory the user cannot access directly" and "data already contained in that chunk of memory the user cannot access directly". The user cannot tell the difference; they know only that the data has not yet arrived in the private window copy. Strangely, MPI_WIN_SYNC is currently only required to copy data that has already arrived in the public window copy, but is permitted to leave the incoming put/accumulate instructions unexecuted, and so (arguably) it only does half a job.

In the unified model, the public and private window copies update each other eventually without user intervention, i.e. without user calls to procedures that ensure progress. This copy/update also eventually happens without user calls to MPI_WIN_SYNC (if the user is happy to wait for "eventually" to elapse) but can be hastened by calling MPI_WIN_SYNC. If "public copy" is expanded to include "incoming passive target instructions", then we could say either that the unified model can only be claimed when the implementation provides strong progress for passive target operations (either a capable NIC or a software agent), even if the rest of MPI only has weak progress, or that repeated calls to MPI_WIN_SYNC would ensure progress of incoming passive target operations as well as making changes in either window copy visible in both window copies. This pushes the problem of "my RMA message got stuck somewhere" back one step towards the origin.
I would love this; it would exactly match the false intuition I had when first uncovering this conceptual mismatch. However, the existing language is fairly clear that …
Is this guaranteed without another form of memory barrier, under common HPC memory models? |
MPI_WIN_SYNC was designed to address the shortcomings of a library-based implementation of shared-memory atomics. We are breaking the backwards compatibility of its performance characteristics if we make it poke the progress engine. @jdinan I don't hate your idea, but it's complicated.
"make it poke a small part of the progress engine" -- it could be permitted to poke the whole ugly mess of the general progress engine (to maintain ease of implementation) or it could be prohibited from doing so (to maintain as much of the performance as possible). |
As a spur-of-the-moment suggestion, how about … ? In an MPI implementation, background progress is analogous to preemptive multitasking in a time-sharing system. Without background progress, we only have cooperative multitasking. A cooperative multitasking system requires that processes periodically yield to others in order to multitask, and in MPI this is usually analogized as "engaging progress."
An MPI implementation that ensures background progress MAY implement …
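To make the idea concrete, here is a hedged sketch of how such a target-side progress call might be used in the spin-wait at the top of this thread. The name MPIX_Win_progress, its signature, and the MPI_Iprobe-based fallback body are hypothetical placeholders, not part of any standard or existing proposal text.

```c
#include <mpi.h>

/* Hypothetical: ensure eventual progress on incoming passive-target
 * operations for 'win'.  An implementation with background (strong)
 * progress could make this a no-op; here we fall back to the generic
 * MPI_Iprobe idiom discussed earlier in the thread. */
static int MPIX_Win_progress(MPI_Win win)
{
    int flag;
    (void)win;   /* a per-window implementation would use this */
    return MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_SELF,
                      &flag, MPI_STATUS_IGNORE);
}

/* Target-side spin-wait on a flag that lives in the window memory
 * (unified memory model assumed). */
static void wait_for_flag(volatile int *flag, MPI_Win win)
{
    while (*flag == 0) {
        MPIX_Win_progress(win);  /* let incoming RMA ops be applied */
        MPI_Win_sync(win);       /* make the update visible to the load */
    }
}
```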
@jeffhammond and I had a quick chat yesterday about this. We came up with the following:
I will try to put that into a PR over the next couple weeks (I'll be on the road throughout most of April) so we can discuss the details.
Please check https://github.com/mpiwg-rma/mpi-standard/pull/10 for a proposal. We could bring this up at the Forum meeting in 3 weeks if there is some level of agreement about it in the WG.
Intel has been working on a new proposal for notified-RMA (including implementation work in Intel MPI) that is related to this question. Intel MPI has strong progress for RMA, but we have tackled the question "what about MPI implementations that only support weak progress?" We are almost ready to present this work to the RMA WG -- perhaps @devreal could schedule a WG meeting for that before the MPI Forum meeting?
Greetings,
I initially interpreted the below issue as a problem within the Intel MPI implementation, but after I posted on their community forums [1], Intel confirmed that this is allowable behaviour per their interpretation. Either I'm misinterpreting the specification (and thus doing something undefined), Intel is wrong, or the specification is ambiguous.
Problem
Somewhat recently, I was flummoxed by a deadlock in MPI code that used passive-target synchronization. A local process would spin-wait on a variable in a shared-memory window (using local load and MPI_Win_sync), and a remote process would (eventually) update that variable with MPI_Fetch_and_op. The expected result was… well, progress. In fact, the Intel MPI implementation would reliably deadlock when the RMA operation involved communication over an interconnect (i.e. something more than a local shared memory access). Sample code is as follows:
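The original sample code is not reproduced in this extract; the following is a minimal sketch of the pattern described above, under stated assumptions (exactly two ranks, a plain MPI_Win_allocate window rather than a shared-memory window, the unified memory model, and MPI_Win_flush on the origin side):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Win win;
    volatile int *flag;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, (void *)&flag, &win);
    *flag = 0;
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
    if (rank == 0) {
        int one = 1, old;
        /* Atomically set rank 1's flag; completion needs the target's help. */
        MPI_Fetch_and_op(&one, &old, MPI_INT, 1, 0, MPI_REPLACE, win);
        MPI_Win_flush(1, win);   /* blocks until complete at the target */
    } else if (rank == 1) {
        /* Spin-wait on the local window memory.  MPI_Win_sync is only a
         * memory barrier; with weak progress and no other MPI call in the
         * loop, the incoming fetch-and-op may never be applied: deadlock. */
        while (*flag == 0) {
            MPI_Win_sync(win);
        }
    }
    MPI_Win_unlock_all(win);

    printf("rank %d done\n", rank);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```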
The problem is visible even on a single node of a cluster when the shared-memory interconnect is disabled:
The root problem appears to be that rank 0 is waiting on the assistance of rank 1 to complete the Fetch part of Fetch_and_op, but the MPI_Win_sync inside the spinlock on rank 1 does not engage the MPI progress engine. Per the specification, I think that this behaviour is surprising if not spec-noncompliant. Per §12.7 of the 4.1 draft (atomicity isn't the problem here):
Replacing the Fetch_and_op call on rank 0 with separate MPI_Get and MPI_Put calls does function properly, without deadlock, even if it has ambiguous correctness (I'm not sure about the combination of Get and Put with the same RMA target) and is absolutely erroneous in the general case of multiple writers.
with the same RMA target) and is absolutely erroneous in the general case of multiple writers.Proposal
Prior to the MPI 4.1 draft, I would have asked that the call to Win_sync engage the progress engine even in the unified memory model, but that's now explicitly not required (p608). I'm not sure what the required change now is, if this deadlock is not in fact an implementation bug. The trouble seems to be twofold:

- the target of Fetch_and_op must actively participate (via progress), even if the window has passive-target synchronization, and
- the target has no natural way to ensure that progress. Win_flush works in Intel MPI, but it's an unnatural fit because the target rank in general may have no idea what other rank is executing the atomic update. Additionally, the flush calls might implicitly cause communication even if there's nothing to do, and that's unnecessary in this case where the target rank certainly 'knows' if it needs to do something with respect to an atomic RMA operation. MPI_Test with MPI_REQUEST_NULL doesn't force progress. Rank 1 also can't force progress by sending a zero-length message to itself.

Changes to the Text
Again presuming this behaviour is intended or allowable:

- add MPI_Engage_progress() to the API as the minimal call that will tell the MPI implementation to make progress on any outstanding requests, with no further semantic intent.

Impact on Implementations
Implementations will likely complain about any specification-guaranteed background progress. In the Intel forum thread linked above, the Intel representative closed the issue (then presented as a question about progress during Win_sync) because Win_sync was simply a memory barrier.

Impact on Users
At minimum, black-and-white documentation about participation requirements would have saved me, a user, a few headaches trying to distill the unexpected deadlock into a minimal example. (The original case involved Fetch_and_op to one location, then an MPI_Put of a status flag to a second; the latter was spinwaited upon, blocking completion of the original Fetch_and_op.)

References and pull requests:
Footnotes

[1] https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/Intel-MPI-fails-to-ensure-progress-for-one-sided-operations/m-p/1519581