Deprecate MPI_Win_fence #4
Comments
I don't see how. If the general feeling is that we need another call to make us feel better, then fine. If that is the path to eliminating fence, then so be it :). |
On Blue Gene, there is likely a huge difference between what one can do with each of these. In any case, the benchmark I would use to evaluate the various options is an unstructured all-to-all like the following:

#include <mpi.h>
#include <tuple>
#include <vector>

// <input buffer, count, target rank>
typedef std::tuple<void*,int,int> msg;

void RMA_Alltoallv(std::vector<msg> const & messages, MPI_Win const & win)
{
  MPI_Aint disp = 0;
#ifdef FENCE
  MPI_Win_fence(0, win);
#endif
  for (auto const & m : messages) {
    auto buf    = std::get<0>(m);
    auto count  = std::get<1>(m);
    auto target = std::get<2>(m);
    MPI_Put(buf, count, MPI_BYTE, target, disp, count, MPI_BYTE, win);
    disp += count;
  }
#ifdef FENCE
  MPI_Win_fence(0, win);
#else
  MPI_Win_flush_all(win);            // assumes an enclosing MPI_Win_lock_all epoch
  MPI_Barrier(MPI_COMM_WORLD);       // assumes win spans MPI_COMM_WORLD
#endif
} |
@jeffhammond let's separate out the issues here so we can handle them cleanly. So far I see the following:
Argument against: I'm not sure this is very helpful for RDMA networks, since the target will not know of any remote completion and the origin has to inform it anyway. It'll be beneficial for active-message based networks, e.g., TCP/IP based networks, but I don't think we should care about optimizing such networks.
Argument against: Closing epochs might not be as important in passive-target because we have flush/flush_local. It might be important if the user mixes exclusive and shared modes, but algorithmically, the need for it seems minimal. I agree with Nathan that detecting errors is harder with Fence, since the only way the MPI implementation can detect a completion epoch (unless the user passes an assert) is to wait and see if the user does another PUT/GET or not. Performance-wise, this also adds an additional branch to check for request completion since, in MPICH, we do an |
I know it is a weaker argument to add in support of removing |
Hmm. RDMA with immediate could be useful in theory. For instance, with InfiniBand, if the origin side did local bounce buffer copies, local completion is trivial. For data coming in, the target could, in theory, check for RDMA immediate buffer notifications instead of getting a separate completion message from the origin. In practice, however, I doubt this will show any benefit because the multiple additional DMA operations for the immediate data will likely wipe out any benefit compared with a single "I am done" message from the origin at the end. In any case, I'm agreeing with you that |
Fence gives local completion for the operations that I issued, before allowing me to move to a different epoch. With unlock, we only get remote completion.
No. Fence imparts remote completion. It was defined in MPI-2 when local and remote completion were equivalent. We definitely did not change it to mean local completion in MPI-3. That would break existing MPI-2 code.
|
It'll be beneficial for active-message based networks, e.g., TCP/IP based networks, but I don't think we should care about optimizing such networks.
Then we should do it. Ethernet is the most popular network by a large margin. I care about MPI adoption in data centers where Ethernet is deployed.
|
A collective flush operation is potentially helpful to application developers as a means of determining that the calling process has received all of the data sent to it (this is what fence guarantees). However, we would need to establish that such an API routine provides a performance improvement over a local flush followed by a barrier. Notified access (Torsten's proposal) had some semantics that were challenging to offload, and IIRC the discussion had stalled there. There were a couple of other proposals for notifying RMA operations (see: mpi-forum/mpi-issues#59). These were discussed several times in the RMA WG, but the activity on this fell off when the WG became inactive. If there is interest in pursuing a proposal again, I would be happy to revive this topic. |
I don't think so. When a Fence completes, the origin process only knows that the data is out of its local memory, but not that it's available in the remote memory. For example, if you have overlapping windows, and you do fence on one epoch and lock/get/unlock on the other window, you can get old data.
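A minimal sketch of the overlapping-window hazard described above, assuming two ranks and two windows winA and winB that expose the same buffer on rank 1 (the names are illustrative, not taken from the discussion):

/* Sketch only; window setup and error checking omitted. */
int one = 1, val = 0;

MPI_Win_fence(0, winA);                       /* open an epoch on winA */
if (rank == 0)
  MPI_Put(&one, 1, MPI_INT, 1, 0, 1, MPI_INT, winA);
MPI_Win_fence(0, winA);                       /* close the epoch on winA */

if (rank == 0) {
  /* Rank 0's fence only says the put has left rank 0's memory; rank 1 may
   * not have completed (or even reached) its own fence yet, so this get on
   * the overlapping window winB can still observe the old value. */
  MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, winB);
  MPI_Get(&val, 1, MPI_INT, 1, 0, 1, MPI_INT, winB);
  MPI_Win_unlock(1, winB);
}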
Apart from the fact that Ethernet does not necessarily mean TCP/IP, which I'll ignore for this discussion, my point is that when all the PUT/GET operations are going over active messages, the relative performance impact of doing FLUSH_ALL/BARRIER instead of FENCE would be negligible. |
From MPI 3.1 Section 11.5:
So at this point, Pavan is right. However, a few lines down, it says this:
So I contend that |
@jeffhammond Pavan is correct. When fence returns two things are true:
So it is slightly weaker than remote completion. The exact wording: "MPI_Win_fence synchronizes RMA calls on win. The call is collective on the group of win. All RMA operations on win originating at a given process and started before the fence call will complete at that process before the fence call returns. They will be completed at their target before the fence call returns at the target. RMA operations on win started by a process after the fence call returns will access their target window only after MPI_Win_fence has been called by the target process." |
Hah, oops. @jeffhammond Posted just as I posted :) |
No, that's not true either. If the FENCE is not an opening epoch, a process needs to know that all PUTs/GETs for which it is the target have been deposited in its memory. That's the reason it needs to know that the remote fence has been called. However, it doesn't actually need to know that the PUTs/GETs that it has issued have been deposited into the target memory. So, the text that you quoted from the MPI standard is correct, although I agree that it's misleading. To answer your question about whether the MPI implementation knows -- yes, in MPICH we keep track of the fact that this is the first FENCE and if it is the first FENCE, we simply issue the ibarrier. |
@pavanbalaji So how does a user know when remote PUT/GET are complete using only FENCE synchronization? |
They don't. They'll need to do a BARRIER after the FENCE for that information. |
I should also point out that FENCE guarantees that a GET in the next epoch will get the correct data from the previous epoch. So if you are accessing the data only using FENCE epochs, you don't need the additional barrier. However, if you want to mix FENCE and LOCK/UNLOCK, for example, then you need the extra barrier. |
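A small fence-only sketch of that guarantee, assuming a window win over MPI_COMM_WORLD whose integer buffer starts at 0 (variable names are illustrative):

int one = 1, val = 0;

MPI_Win_fence(0, win);                        /* epoch 1 opens */
if (rank == 0)
  MPI_Put(&one, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
MPI_Win_fence(0, win);                        /* epoch 1 closes, epoch 2 opens */

if (rank == 0)
  MPI_Get(&val, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
MPI_Win_fence(0, win);                        /* epoch 2 closes */
/* val == 1 on rank 0: a get in the next fence epoch is guaranteed to see
 * the put from the previous epoch, with no extra barrier needed. */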
So FENCE orders RMA without remote completion and is thus equivalent to every rank calling a function like
But this is wrong?
If so, then we really need to remove "fence call usually entails a barrier synchronization" and do a much better job of explaining ourselves. |
@jeffhammond One small change:
but otherwise I believe that is correct. |
@jeffhammond As for the sendrecv example: that will also get a == 1, as the remote side had to finish the fence to get to the send. |
I think the example @pavanbalaji had in mind was:
In that case |
Right. The second example that @hjelmn gave was what I had in mind, when I said that the result might be 0 or 1 (or really any other value, since it's not atomic). This is particularly true when you give the |
This is correct, btw:
That's because the target process had to call FENCE too. |
@jdinan Do you have an opinion on the deprecation of |
If you're going to deprecate I'm not sure I've formed an opinion yet on the pro/con of deprecation. On one hand, I think that |
@jdinan PSCW with MPI_Accumulate is the best approximation to MPI_Recv_reduce. I don't want to lose that. |
@jeffhammond You could just as well call |
@jdinan You are right. While we're at it, we should deprecate Send-Recv and tell everybody to use MPI_Neighbor_alltoallw instead 😝 More seriously, this was discussed in the past, but it isn't sufficient because of ANY_SOURCE and tags. Also, if I am using PSCW+Accumulate for halo exchanges, I may need to create a very large number of communicators. |
@jeffhammond If you are doing halo exchange then active target is the right model anyway, so why are we considering deprecating it? How are you simulating ANY_SOURCE and tags with PSCW+Accumulate? Every process in the group passed to |
@jdinan Then maybe a collective flush + neighborhood-aware windows is a good alternative. I could see this as a good way to do a halo exchange:
I know this is mixing the concepts of non-blocking synchronization and a collective flush. Seems like a clean way to handle situations where active messages might be beneficial. |
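Purely as an illustration of the kind of pattern being suggested (not an existing API): MPIX_Win_iflush_all below is hypothetical, and the neighbor bookkeeping (nneighbors, neighbor[], counts[], sendbuf[], recv_disp[]) is assumed to be set up elsewhere.

MPI_Win_lock_all(0, win);
for (int n = 0; n < nneighbors; n++) {
  MPI_Put(sendbuf[n], counts[n], MPI_BYTE,
          neighbor[n], recv_disp[n], counts[n], MPI_BYTE, win);
}

MPI_Request req;
MPIX_Win_iflush_all(win, &req);   /* hypothetical nonblocking, collective flush */
/* ... overlap independent computation here ... */
MPI_Wait(&req, MPI_STATUS_IGNORE);

MPI_Barrier(MPI_COMM_WORLD);      /* full barrier; a neighborhood-aware notification
                                   * on a neighborhood window would ideally replace this */
MPI_Win_unlock_all(win);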
@hjelmn We first need MPI_Win_iflush and related nonblocking equivalents of the existing MPI-3 synchronization functions. The MPICH team has a paper on that already but they haven't brought forward a proposal to the Forum that I recall. Note that somebody should trademark iFlush just in case Apple is planning to release a toilet that runs iOS 😆 |
@jdinan Sketch the implementation of PRK transpose (B+=A using 1D distribution) using MPI_Recv_reduce, PSCW Accumulate, and MPI_Comm_create_group+MPI_Reduce and tell me if you think what you are proposing is reasonable. |
@jeffhammond To be clear, I want to eliminate PSCW as well. I want to see if all-passive RMA + some existing and new functions will work as an effective replacement. |
@jeffhammond I agree. We probably should move forward on getting the non-blocking synchronization into the standard. We will have to see where that is during the June meeting, and we will have to make sure there is RMA WG time then. |
@hjelmn PSCW and FENCE were both good ideas. In practice, PSCW is unused because Send-Recv meets the need and is more optimized in implementations. FENCE is a true implementation of BSP but, as we've demonstrated here, is specified in a rather confusing way. While I understand the sentiment - particularly from an implementer - to deprecate everything but passive target, I do not like the precedent it sets. It basically says that we should deprecate features because we've failed as a community to specify and implement them in a way that actually helps users. By that standard, there are many other parts of MPI that we should deprecate 😉 What I'd prefer to do is add an info key that allows users to specify which synchronization modes will (not) be used for a given window (as well as a key to assert something about overlapping windows). I think that will address the implementation issues, since you can then focus on optimizing for the passive-only case and ignore everything else. I don't think it is a burden to maintain existing RMA code that supports the other synchronization modes. |
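A sketch of what such a hint might look like at window creation; the two info keys shown are hypothetical (not defined by the MPI standard), and size/MPI_COMM_WORLD are placeholder choices:

MPI_Info info;
MPI_Info_create(&info);
/* Hypothetical keys, purely illustrative: */
MPI_Info_set(info, "rma_sync_modes", "passive_only");
MPI_Info_set(info, "no_overlapping_windows", "true");

void *base;
MPI_Win win;
MPI_Win_allocate(size, 1, info, MPI_COMM_WORLD, &base, &win);
MPI_Info_free(&info);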
@jeffhammond While I agree we don't want to set a precedent, I think we should still consider getting rid of active target. It probably should be discussed when most of us are in the same room in June. And, yes, it is a bit of a burden to support active target. The code to check for synchronization errors is a nightmare. Then there is trying to figure out when passive target is ok after a fence. It isn't too terrible but it limits performance. |
@hjelmn But didn't you already write that code? What new code has to be written here? I want to understand the cost to you versus the users whose code will be broken by PSCW/FENCE removal. |
@jeffhammond The code is already written, but every line of code comes with a support cost. Being able to simplify the standard greatly reduces the cost of maintaining that code. It also will improve the performance of the passive-target paths, as we can remove a bunch of checks that try (and never quite manage) to detect incorrect synchronization. I know we don't have to detect this, but a high quality implementation should do its best to detect synchronization errors and report MPI_ERR_RMA_SYNC. The cost of the synchronization checks in Open MPI on an Aries network is ~10-20% in small-message latency. |
I should also add from experience at LANL that code developers often find (and use) |
@hjelmn Code isn't like fruit. It doesn't rot if you leave it alone for years. If you want to reduce your support burden, propose to deprecate Fortran support. |
@jdinan None of the examples given there are applicable to RMA implementations. |
@jeffhammond Considering I have broken PSCW in Open MPI multiple times now I think software rot is very relevant. Fence is more stable but as has already been brought up, on high performance networks it is not clear there is an advantage to it. It does, however, increase the complexity of the RMA implementation. |
@hjelmn If you break your code during refactoring, that is not code rot. Is it really your intent to deprecate the BSP programming model from MPI because your job as an MPI implementer isn't trivial? Why don't we first go to WG14 and propose to deprecate every nasty feature of ISO C that has ever caused us pain? This has devolved into a truly stupid discussion. "Programming is hard" is not a reason to deprecate anything from any standard. The RMA text clearly needs improving. Propose that. |
@jeffhammond It's not refactoring. It's changing unrelated paths that just happen to cause PSCW to break. That fits the definition of code rot. I get it, you do not like this proposal. I still intend to move forward with proposing the end to active target. I mainly started this discussion to see what others thought about the idea. Also, this is not deprecating BSP at all. RMA is just not the place to implement BSP. The only exception I see is maybe a |
From "The implementation of MPI-2 one-sided communication for the NEC SX-5":
(emphasis mine) |
Except that paper was published before MPI-3. Now we have MPI_Win_lock_all and MPI_Win_flush_all. Any code using fence can replace it with MPI_Win_flush_all + MPI_Barrier without changing the semantics. |
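A minimal sketch of the substitution being claimed here, assuming a window win created over MPI_COMM_WORLD and placeholder buf/count/target/disp arguments:

/* Fence-style epoch: */
MPI_Win_fence(0, win);
MPI_Put(buf, count, MPI_BYTE, target, disp, count, MPI_BYTE, win);
MPI_Win_fence(0, win);

/* Passive-target replacement (sketch): */
MPI_Win_lock_all(0, win);                 /* once, up front */
MPI_Put(buf, count, MPI_BYTE, target, disp, count, MPI_BYTE, win);
MPI_Win_flush_all(win);                   /* my operations are remotely complete */
MPI_Barrier(MPI_COMM_WORLD);              /* everyone else's are, too */
/* ... further put/flush/barrier "epochs" ... */
MPI_Win_unlock_all(win);                  /* once, at the end */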
The intent of this issue is to start the discussion (if it hasn't already been started) on whether we should look at deprecating the support for active-target RMA. Fence seems to be the obvious first one to look at for a couple of reasons:
1. I think (I may be wrong) it can be trivially replaced in user codes by the use of MPI_Win_lock_all(), MPI_Win_flush_all(), and MPI_Barrier(). In Open MPI with an RDMA-capable network, fence is effectively equivalent (with some variation with different asserts) to using these functions. The only real difference is the synchronization checks. Which leads to:
2. The existence of active target (especially fence) makes it much more difficult to detect and accurately report synchronization errors.
I am sure there are other reasons but, as above, this issue is intended to start a discussion.
As for PSCW, that one is a little harder to justify deprecation. I can see how it might be useful in some algorithms. I think I may have a good replacement (topology-aware windows) which I hope to bring to the WG later this year.