-
Notifications
You must be signed in to change notification settings - Fork 886
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix core dump when performing misaligned atomics. #8574
Conversation
@hjelmn if you have ideas on how to do this better let me know. |
opal/mca/btl/base/btl_base_am_rdma.c
Outdated
default: | ||
return OPAL_ERR_BAD_PARAM; | ||
if(OPAL_LIKELY(0 == ((((intptr_t) addr) & 0x7)))) { | ||
switch (op) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I am comfortable with this change. It is my understanding that MPI_Accumulate, etc the MPI implementation is allowed to require (or is required to) only operated on naturally aligned types. @bosilca Is this correct or is the user allowed to assume they can do atomics on byte-aligned data?
The problem with this change is that you essentially would need to disable the memory atomics interoperability because this lock is only in the BTL. That would break a lot of assumptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For what it's worth this test passes with MPICH and Spectrum MPI (IBM mpi).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's my understanding that the window alignment of 1, like the test is doing is valid.
The accumulate call itself is not doing anything funny with offsets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that the MPI standard currently does not mandate any alignment requirements for user buffers: mpiwg-rma/rma-issues#3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, a window alignment of 1 is fine. The problem is working on a datatype with natural alignment 8 on an off-alignment. That is not safe on many architectures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hjelmn I agree, but the optics are bad that OMPI segv's where other implementations don't.
Even if we don't want to support this, maybe we can add a check at a higher level to not segv? Not sure off the top of my head where that can be prevented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. This is really messy. What I am worried about is silent wrong answers. Crashing is better there. Let me think about the whole stack that leads to this. Maybe this will be ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few things:
- If we accept to use a lock approach if the data is misaligned, then on the same window we cannot use both the atomic and the lock based version, because we will lose the atomicity on the data. This seems like a big issue to me.
- The standard stresses on the fact that all atomic operations apply on predefined datatype, which as @hjelmn suggested have specific alignment requirements. Unfortunately, I can't find anything in the standard that explicitly talks about this.
- "Also, on some systems, performance is improved when window boundaries are aligned at “natural” boundaries (word, double-word, cache line, page frame, etc.)", this talks about performance impact not correctness.
I don't think we should provide more atomicity than available on the system. Thus, if the windows alignment is 1, the safest approach is to force the lock approach, while for matching alignments we can use the atomic implementation, and error out if the user-provided pointer is misaligned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should provide more atomicity than available on the system. Thus, if the windows alignment is 1, the safest approach is to force the lock approach, while for matching alignments we can use the atomic implementation, and error out if the user-provided pointer is misaligned.
That sounds good to me. If the user creates the window with matching alignment they would have to try extra hard to misalign the pointer, so letting it segfault seems appropriate.
26995cf
to
9f6e870
Compare
bot:aws:retest |
9f6e870
to
89ee72d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few nitpicks, looks good to me otherwise.
if(OPAL_UNLIKELY((module -> acc_use_lock_fallback == true) || (module -> disp_unit % extent))) { | ||
flags |= MCA_BTL_ATOMIC_FLAG_USE_LOCK_FALLBACK; | ||
module -> acc_use_lock_fallback = true; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the interest of readability:
if(OPAL_UNLIKELY((module -> acc_use_lock_fallback == true) || (module -> disp_unit % extent))) { | |
flags |= MCA_BTL_ATOMIC_FLAG_USE_LOCK_FALLBACK; | |
module -> acc_use_lock_fallback = true; | |
} | |
if(OPAL_UNLIKELY((module->acc_use_lock_fallback == true) || (module->disp_unit % extent))) { | |
flags |= MCA_BTL_ATOMIC_FLAG_USE_LOCK_FALLBACK; | |
module->acc_use_lock_fallback = true; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I let my own preference sneak in. Thanks. Fixed.
@@ -221,7 +240,7 @@ static int ompi_osc_rdma_acc_single_atomic (ompi_osc_rdma_sync_t *sync, const vo | |||
ompi_osc_rdma_module_t *module = sync->module; | |||
mca_btl_base_module_t *selected_btl = ompi_osc_rdma_selected_btl (module, peer->data_btl_index); | |||
int32_t atomic_flags = selected_btl->btl_atomic_flags; | |||
int btl_op, flags; | |||
int btl_op, flags = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change seems unnecessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, fixed.
opal/mca/btl/base/btl_base_am_rdma.c
Outdated
break; | ||
default: | ||
return OPAL_ERR_BAD_PARAM; | ||
case MCA_BTL_ATOMIC_ADD: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I randomly picked two places where switch
statements are used in opal and both do not indent the case
labels. These white-space changes seem unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My own preference sneaking in again. Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good that we now have the clang-format to keep all this in a consistent state ;)
c7abf64
to
a7e6426
Compare
@devreal thanks for the review - changes addressed. I also squashed down and added the copyright stuffs. |
opal/mca/btl/sm/btl_sm_module.c
Outdated
@@ -178,6 +179,7 @@ static int sm_btl_first_time_init(mca_btl_sm_t *sm_btl, int n) | |||
return rc; | |||
} | |||
} | |||
OBJ_CONSTRUCT(&sm_btl->super.btl_atomic_fallback_lock, opal_mutex_t); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify: I tried to understand the mutex construction/destruction and couldn't find a matching destructor call for this construction here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good catch, it might be. Will check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the lock to base, I think it makes more sense to have it there, I think the btls should all use the same lock to preserve atomicity. That look good? If so I can squish it down.
@bosilca ^^
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the lock be per window instead of global ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The answer is not simple, it depends on the memory model. For the two you just mentionned, SM and TCP, it would work because they both do atomic operations in the target memory. Take SM and IB and you would have a totally different story but a less clear outcome.
In any case, the global lock is providing a stronger consistency model than required by MPI, with certain performance implications as well. A lock per window (not per BTL) will at least alleviate the cost of fighting for locks in a highly contentious case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it should be per-window, not per-btl.
Hacking at the code though, I'm not sure of the best way to make this work. The btl has no knowledge of windows, and that's where the actual accumulate operation happens on the target - and where the lock needs to happen. I suspect we would need a largish refactor to make this work, but perhaps not and I am missing something?
I'm happy to take a crack at implementing this, but at this point I'm not sure how to proceed. It doesn't look to me like the current window has its own btl module pointer, it shares it from a global btl pointer with other windows. I guess the easiest thing to do would be to give each window it's own copy and stuff the lock in there, but I don't know if there's other ramifications to that.
That said, this is such a small edge case, I wonder if it's worth the effort. Users who mis-align their windows shouldn't expect great performance, anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm OK with a global lock. People using misaligned atomics should have no expectation of performance whatsoever and ripping the module structure apart for it isn't worth it. We should push the alignment discussion within the MPI forum to eventually disallow it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I am thinking along the same lines.
One way we can do this on a per-window basis would be to give the btl a callback function into the rdma layer with window info to look it up on the target side.
If the guts is already present to do something like (I don't think it is, but not 100% sure) that wouldn't be too bad. But if we need to rip things apart to get that working, it probably isn't worth it.
opal/mca/btl/base/btl_base_am_rdma.c
Outdated
if(lock) { | ||
OPAL_THREAD_LOCK(lock); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please move the result = *addr;
outside the switch case.
opal/mca/btl/base/btl_base_am_rdma.c
Outdated
} | ||
break; | ||
default: { | ||
if(lock) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the internal version of the op, I am almost certain we can't hit this code with an incorrect opcode (all the checks should have been done at the upper level).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood, I just added it for completeness. I can remove it if you want
} | ||
|
||
switch (op) { | ||
case MCA_BTL_ATOMIC_ADD: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comments as above.
|
||
} | ||
|
||
if(OPAL_UNLIKELY((module->acc_use_lock_fallback == true) || (module->disp_unit % extent))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We try to put the constant first (as you did it in opal/mca/btl/base/btl_base_am_rdma.c)
0110f52
to
25949f4
Compare
bot:aws:retest |
25949f4
to
596b8f6
Compare
c_reqops MTT test has a weird alignment when creating the window, which is perfectly valid. This will cause a segv when performing atomic operations. While this was found on Power, it probably is not limited to it. Detect a misalignment, and fallback to a lock + copy method. If it's detected that the window alignment doesn't match the type, abandon the atomics and use a lock implementation for all atomic operations in that window. Signed-off-by: Austen Lauria <[email protected]>
596b8f6
to
587f4a9
Compare
Thinking about this. Couldn't we just force the read-modify-write path if the address is not 64-bit or 32-bit aligned? Just a top level check in osc/rdma when deciding whether to use network atomics, CAS, or read-modify-write. |
Closing as #8599 was merged. |
c_reqops MTT test has a weird alignment when creating the window,
which is perfectly valid. This will cause a segv
when performing atomic operations. While this was found on
Power, it probably is not limited to it.
Detect a misalignment, and fallback to a lock + copy method.
Signed-off-by: Austen Lauria [email protected]