Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix core dump when performing misaligned atomics. #8574

Closed
wants to merge 1 commit into from

Conversation

awlauria
Copy link
Contributor

@awlauria awlauria commented Mar 9, 2021

c_reqops MTT test has a weird alignment when creating the window,
which is perfectly valid. This will cause a segv
when performing atomic operations. While this was found on
Power, it probably is not limited to it.

Detect a misalignment, and fallback to a lock + copy method.

Signed-off-by: Austen Lauria [email protected]

@awlauria
Copy link
Contributor Author

awlauria commented Mar 9, 2021

@hjelmn if you have ideas on how to do this better let me know.

@awlauria awlauria requested a review from hjelmn March 9, 2021 15:49
default:
return OPAL_ERR_BAD_PARAM;
if(OPAL_LIKELY(0 == ((((intptr_t) addr) & 0x7)))) {
switch (op) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I am comfortable with this change. It is my understanding that MPI_Accumulate, etc the MPI implementation is allowed to require (or is required to) only operated on naturally aligned types. @bosilca Is this correct or is the user allowed to assume they can do atomics on byte-aligned data?

The problem with this change is that you essentially would need to disable the memory atomics interoperability because this lock is only in the BTL. That would break a lot of assumptions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what it's worth this test passes with MPICH and Spectrum MPI (IBM mpi).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's my understanding that the window alignment of 1, like the test is doing is valid.

The accumulate call itself is not doing anything funny with offsets.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that the MPI standard currently does not mandate any alignment requirements for user buffers: mpiwg-rma/rma-issues#3

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, a window alignment of 1 is fine. The problem is working on a datatype with natural alignment 8 on an off-alignment. That is not safe on many architectures.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devreal Maybe look at the datatype chapter. There is some language about alignment in that chapter. Not sure if any of it is relevant. Thats why I would like @bosilca to comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hjelmn I agree, but the optics are bad that OMPI segv's where other implementations don't.

Even if we don't want to support this, maybe we can add a check at a higher level to not segv? Not sure off the top of my head where that can be prevented.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. This is really messy. What I am worried about is silent wrong answers. Crashing is better there. Let me think about the whole stack that leads to this. Maybe this will be ok.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few things:

  1. If we accept to use a lock approach if the data is misaligned, then on the same window we cannot use both the atomic and the lock based version, because we will lose the atomicity on the data. This seems like a big issue to me.
  2. The standard stresses on the fact that all atomic operations apply on predefined datatype, which as @hjelmn suggested have specific alignment requirements. Unfortunately, I can't find anything in the standard that explicitly talks about this.
  3. "Also, on some systems, performance is improved when window boundaries are aligned at “natural” boundaries (word, double-word, cache line, page frame, etc.)", this talks about performance impact not correctness.

I don't think we should provide more atomicity than available on the system. Thus, if the windows alignment is 1, the safest approach is to force the lock approach, while for matching alignments we can use the atomic implementation, and error out if the user-provided pointer is misaligned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should provide more atomicity than available on the system. Thus, if the windows alignment is 1, the safest approach is to force the lock approach, while for matching alignments we can use the atomic implementation, and error out if the user-provided pointer is misaligned.

That sounds good to me. If the user creates the window with matching alignment they would have to try extra hard to misalign the pointer, so letting it segfault seems appropriate.

@awlauria awlauria force-pushed the misaligned_atomics branch 3 times, most recently from 26995cf to 9f6e870 Compare March 10, 2021 02:08
@awlauria
Copy link
Contributor Author

@bosilca @hjelmn I added a patch to detect the bad alignment and fall back to locks.

@awlauria
Copy link
Contributor Author

bot:aws:retest

@awlauria awlauria force-pushed the misaligned_atomics branch from 9f6e870 to 89ee72d Compare March 10, 2021 13:53
Copy link
Contributor

@devreal devreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few nitpicks, looks good to me otherwise.

Comment on lines 146 to 150
if(OPAL_UNLIKELY((module -> acc_use_lock_fallback == true) || (module -> disp_unit % extent))) {
flags |= MCA_BTL_ATOMIC_FLAG_USE_LOCK_FALLBACK;
module -> acc_use_lock_fallback = true;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the interest of readability:

Suggested change
if(OPAL_UNLIKELY((module -> acc_use_lock_fallback == true) || (module -> disp_unit % extent))) {
flags |= MCA_BTL_ATOMIC_FLAG_USE_LOCK_FALLBACK;
module -> acc_use_lock_fallback = true;
}
if(OPAL_UNLIKELY((module->acc_use_lock_fallback == true) || (module->disp_unit % extent))) {
flags |= MCA_BTL_ATOMIC_FLAG_USE_LOCK_FALLBACK;
module->acc_use_lock_fallback = true;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I let my own preference sneak in. Thanks. Fixed.

@@ -221,7 +240,7 @@ static int ompi_osc_rdma_acc_single_atomic (ompi_osc_rdma_sync_t *sync, const vo
ompi_osc_rdma_module_t *module = sync->module;
mca_btl_base_module_t *selected_btl = ompi_osc_rdma_selected_btl (module, peer->data_btl_index);
int32_t atomic_flags = selected_btl->btl_atomic_flags;
int btl_op, flags;
int btl_op, flags = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change seems unnecessary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, fixed.

break;
default:
return OPAL_ERR_BAD_PARAM;
case MCA_BTL_ATOMIC_ADD:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I randomly picked two places where switch statements are used in opal and both do not indent the case labels. These white-space changes seem unnecessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My own preference sneaking in again. Fixed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good that we now have the clang-format to keep all this in a consistent state ;)

@awlauria awlauria force-pushed the misaligned_atomics branch 2 times, most recently from c7abf64 to a7e6426 Compare March 10, 2021 17:10
@awlauria
Copy link
Contributor Author

@devreal thanks for the review - changes addressed. I also squashed down and added the copyright stuffs.

@@ -178,6 +179,7 @@ static int sm_btl_first_time_init(mca_btl_sm_t *sm_btl, int n)
return rc;
}
}
OBJ_CONSTRUCT(&sm_btl->super.btl_atomic_fallback_lock, opal_mutex_t);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify: I tried to understand the mutex construction/destruction and couldn't find a matching destructor call for this construction here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good catch, it might be. Will check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the lock to base, I think it makes more sense to have it there, I think the btls should all use the same lock to preserve atomicity. That look good? If so I can squish it down.

@bosilca ^^

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the lock be per window instead of global ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The answer is not simple, it depends on the memory model. For the two you just mentionned, SM and TCP, it would work because they both do atomic operations in the target memory. Take SM and IB and you would have a totally different story but a less clear outcome.

In any case, the global lock is providing a stronger consistency model than required by MPI, with certain performance implications as well. A lock per window (not per BTL) will at least alleviate the cost of fighting for locks in a highly contentious case.

Copy link
Contributor Author

@awlauria awlauria Mar 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it should be per-window, not per-btl.

Hacking at the code though, I'm not sure of the best way to make this work. The btl has no knowledge of windows, and that's where the actual accumulate operation happens on the target - and where the lock needs to happen. I suspect we would need a largish refactor to make this work, but perhaps not and I am missing something?

I'm happy to take a crack at implementing this, but at this point I'm not sure how to proceed. It doesn't look to me like the current window has its own btl module pointer, it shares it from a global btl pointer with other windows. I guess the easiest thing to do would be to give each window it's own copy and stuff the lock in there, but I don't know if there's other ramifications to that.

That said, this is such a small edge case, I wonder if it's worth the effort. Users who mis-align their windows shouldn't expect great performance, anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hjelmn and @devreal do you have any thoughts on this?

This locking issue seems to be a bit thorny.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with a global lock. People using misaligned atomics should have no expectation of performance whatsoever and ripping the module structure apart for it isn't worth it. We should push the alignment discussion within the MPI forum to eventually disallow it.

Copy link
Contributor Author

@awlauria awlauria Mar 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I am thinking along the same lines.

One way we can do this on a per-window basis would be to give the btl a callback function into the rdma layer with window info to look it up on the target side.

If the guts is already present to do something like (I don't think it is, but not 100% sure) that wouldn't be too bad. But if we need to rip things apart to get that working, it probably isn't worth it.

if(lock) {
OPAL_THREAD_LOCK(lock);
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please move the result = *addr; outside the switch case.

}
break;
default: {
if(lock) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the internal version of the op, I am almost certain we can't hit this code with an incorrect opcode (all the checks should have been done at the upper level).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, I just added it for completeness. I can remove it if you want

}

switch (op) {
case MCA_BTL_ATOMIC_ADD:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comments as above.


}

if(OPAL_UNLIKELY((module->acc_use_lock_fallback == true) || (module->disp_unit % extent))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We try to put the constant first (as you did it in opal/mca/btl/base/btl_base_am_rdma.c)

@awlauria awlauria force-pushed the misaligned_atomics branch 3 times, most recently from 0110f52 to 25949f4 Compare March 11, 2021 20:13
@awlauria
Copy link
Contributor Author

@bosilca @devreal I think I addressed all change requests, let me know if I missed anything. If it looks good I can rebase.

I'm still open to moving/reworking the lock but I am unsure how to proceed on that.

@awlauria
Copy link
Contributor Author

bot:aws:retest

@awlauria awlauria force-pushed the misaligned_atomics branch from 25949f4 to 596b8f6 Compare March 11, 2021 23:57
c_reqops MTT test has a weird alignment when creating the window,
which is perfectly valid. This will cause a segv
when performing atomic operations. While this was found on
Power, it probably is not limited to it.

Detect a misalignment, and fallback to a lock + copy method.

If it's detected that the window alignment doesn't match
the type, abandon the atomics and use a lock implementation
for all atomic operations in that window.

Signed-off-by: Austen Lauria <[email protected]>
@awlauria awlauria force-pushed the misaligned_atomics branch from 596b8f6 to 587f4a9 Compare March 11, 2021 23:59
@hjelmn
Copy link
Member

hjelmn commented Mar 12, 2021

Thinking about this. Couldn't we just force the read-modify-write path if the address is not 64-bit or 32-bit aligned? Just a top level check in osc/rdma when deciding whether to use network atomics, CAS, or read-modify-write.

@awlauria
Copy link
Contributor Author

@hjelmn I think that's a good point - and probably the better way to go. I checked the v4.1 branch, and the c_reqops test where I found this passes because it does take the rmw path. I opened #8599 to replace this.

@awlauria
Copy link
Contributor Author

Closing as #8599 was merged.

@awlauria awlauria closed this Mar 15, 2021
@awlauria awlauria deleted the misaligned_atomics branch March 16, 2021 19:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants