Proposed RMA Overhaul #25
For 7), I propose:
Let me add my wish list here:
@devreal This is essentially what dynamic windows are supposed to provide. The producer process attaches memory to the window and the MPI library has some protocol for exchanging memory registration handles with all peers in the background. I guess the difference you are proposing is that
Yes, that's what dynamic windows were meant to provide, but in practice their performance is sub-par, mostly because a pointer doesn't provide enough information to utilize RDMA without querying the target for the registration handle. So you end up with extra round trips on each access. I'd rather hoist the movement of the registration handle out of MPI and have the application manage it, instead of relying on MPI to eventually figure it out (which it hasn't, after over a decade).
Note from the January 19 meeting:
Another idea I had: tying datatypes to windows (or access objects) instead of operations (datatypes in MPI are expensive to parse and don't transfer well to devices, for example). Restricting the set of datatypes used at window creation time (#22) might not be enough, since we would still need to pick a type in the operation. Window views with a single type might be an option: the operation then just specifies an offset and count, but not a type.
RMA operations that only take a count + predefined datatypes would be useful.
For scientific articles, surprise is great. You managed to do that? Wow. For performance, it is the opposite: you don't want to surprise MPI. MPI: You want me to do what? This is the story of persistent collectives. An MPI_Start does not surprise MPI anymore. MPI: I knew that was coming. Maybe you can achieve the same for RMA: register ops at window generation. Then you can spend the rest of the day with
I have been thinking about persistent RMA, but what stops me every time is that it won't be much different from partitioned P2P. I'd like to retain some level of flexibility (random access, arguably one of the strengths of RMA) while cutting out the parts that are hard to optimize. But I am open to being convinced that persistent RMA is actually useful :)
I see. This sounds like partial persistence: the expensive parts, the surprise, are persistent, while the cheap ones (random offsets) are not.
RMA is already persistent in most ways. All of the memory registration is persistent (except perhaps in dynamic windows, which we should fix). I suppose there are cases where networks need to register peers, and that is not guaranteed to happen during window construction, but there is nothing preventing an implementation from doing so. If they don't, it's likely because peer registration caching is a memory hog. We could solve that by adding a collective lock-all (which was proposed during MPI-3 by Charles Archer, IIRC) that would make the communication peers persistent for an epoch.
A large number of RMA use cases, including NWChem and many traditional PGAS workloads where SHMEM and UPC are used, have random access patterns. It is contrary to these use cases to attempt to make offsets persistent. I'd argue that use cases with persistent offsets are better matched to persistent send-recv anyway.
If I have 1,000,000 ranks, it is cheaper to promise MPI: I will only talk to these x ranks, where x is small.
I want to see the details of that argument. It's not valid for Blue Gene or Cray Aries networks, for example. If it's true for Slingshot or Mellanox, please walk me through why. |
That is true for the target memory but not for the origin, which can be any buffer. My understanding is that some networks require registration of the origin buffer too. Would it help (especially on accelerators) to get guarantees from the user on the memory range used?
I believe that you are mixing two concepts. Slingshot and Mellanox are RoCE and InfiniBand, respectively; thus you need connections for communication. The other challenge is memory registration and the exchange of keys. There is a notable difference between exchanging keys with 1,000,000 ranks and with a small x of them. Libfabric pretends that Aries does not need connections, but I believe underneath they do use connections.
We're kinda digressing in this issue. I'd prefer to keep this rather clean and focus on features. Otherwise it will get too crowded, which makes it hard to find the important bits later. |
If partial persistence is a thing, then MPI supports the full range from no to full persistence. The user will give as much information as possible during window creation: the offset will be random; I will only talk to these ranks; I don't know yet to which ranks I will talk; the datatype is fixed at window creation.
Problem Statement
RMA is too complicated. As a result, few high-quality implementations exist, user adoption suffers, and the chapter itself lags behind the rest of the MPI specification in adopting new terms and other specification-wide changes.
Proposed Changes
Deprecate/remove:
Introduce:
- A new window flavor:
- A new window synchronization operation (like MPI_Win_fence) that flushes pending communication and synchronizes processes (similar to a SHMEM barrier)
- Atomic APIs beyond MPI_Accumulate and MPI_Get_accumulate (e.g. non-fetching atomics)
Fix: