2016 06 08
Wesley Bland edited this page Jun 13, 2016 · 2 revisions
- Rename `MPI_IS_CATASTROPHIC`
  - Call it `MPI_GET_STATE`, which would return one of a number of predefined values
    - The only current value would be `MPI_UNDEFINED`
- Remove `MPI_ERR_IS_CATASTROPHIC`
  - The value of the function is absorbed by the other function.
- We might want to add a session argument in the future.
- Section 2.8 - Change `MPI_ISEND` to `MPI_SEND` without freeing the request
- p. 21 L. 34 - In such as case
- Pavan: We shouldn't copy error handlers for anything but `MPI_COMM_DUP` because it would be inconsistent. This could be a backward compatibility issue.
  - Further discussion leaned toward a more extreme version of this where all communicators would start with `MPI_ERRORS_ARE_FATAL`.
- Pavan: Abort should cover intercommunicators so it aborts only the remote group.
- All uses of "implementation specific" should become "implementation-specific"
- If we decide not to propagate error handlers, we need to say "all communicators" at the top of p. 345.
- Changelog line 28: changed, line 32: abort to the, line 33: replace "on" with "using"
- Straw Vote:
  - Slides option 1: 6
  - Option 2: 3 (would move to option 1 if all error handlers were `MPI_ERRORS_ARE_FATAL`)
  - Option 3: 9
- Martin/Pavan: Using ULFM may not add overhead over not using ULFM when error checking is enabled, but if it is disabled, the overhead could be more significant.
- Pavan: This is especially true for offloading networks where it may become necessary to maintain a software request queue to return errors and handle revoke correctly.
- p. 20 L. 95 - When a process
- p. 21 - Combine the advice to users
- p. 337 L. 10 - "wether" -> "whether"
- Pavan: Can we get a requested / provided type of semantic for FT the same way we have for threads?
  - Jeff: Or standardize an `mpiexec` flag?
    - This is probably more desirable because it would let the implementation pick a different library at runtime to decrease overhead.
- Martin: The definition of `MPI_FT` is insufficient. We should better specify what is supported or not.
- Definition of `MPI_FT`: "chapter 15" -> "Chapter 15" (this occurs elsewhere too)
- p. 361 L. 42 - It should be noted
- p. 601 L. 39 - "ranks" -> "MPI processes"
- p. 601 L. 47 - "these" -> "those"
- Martin: What about routing problems?
- Aurelien: Those should be masked or reported as a different error.
- Wesley: How can the implementation tell the difference?
- Dan: Saying "initialization" (or initiation) for nonblocking calls may not cover both regular nonblocking and persistent nonblocking operations.
- p. 603 L. 10 Fix something (I missed this discussion)
- p. 603 L. 18 "involved processes" -> "at least one involved process"
- p. 603 L. 28 - Dan: "Future communication" is unclear. How about "All outstanding and future communication" or change "communication" to "operations".
- Search and replace all "ranks" with "MPI processes"
- p. 604 L. 5 - "new communicator" -> "new communicator handle" (do this everywhere)
- p. 603 L. 41 - "some operations' semantics" is too unclear to be useful. Add "for example, `MPI_BARRIER`".
- p. 605 L. 15-18 - Doesn't have to be advice
- p. 605 L. 12 - The application isn't in an undefined state, MPI is in an undefined state.
- p. 604 L. 47-48 - Instead of saying that no processes are spawned, can we say this in terms of `MPI_COMM_GET_PARENT`?
- p. 607 L. 18 - "knowledge" -> "notification" or "automatic notification", "communication" -> "non-local" or drop entirely because we've already defined involved.
- p. 607 L. 19 - eventually
- p. 607 L. 42 - "as soon as either" -> "either when"
- p. 608 L. 20-22 - "whose failure raised" -> "whose failure caused an exception of class ... to be raised"
- Definition of `MPI_COMM_SHRINK`: Call out that failed processes may or may not already be known
  - Make sure that we run the new text by Dan since he was most concerned about the wording.
- p. 608 L. 25 - Make the first sentence normative
- Ryan has concerns about `MPI_COMM_FAILURE_ACK` (missed the specifics)
- p. 608 L. 46 - "proceed" -> "complete"
- p. 608 L. 47 - "previously acknowledged" is bad for some reason
- p. 610 L. 25 - "correct" -> "alive"
- p. 613 L. 22 - Switch check order
- p. 613 L. 34 - `if (split_ok) {`
- p. 615 L. 18-19 - Need to handle `MPI_ERR_IN_STATUS`
- p. 615 L. 34 - "recieve" -> "receive"
- If we say output values are invalid, we need to at least say that the error value in the status object is correct.
- Dan: Define "alive"
- Martin: `MPI_ERR_CLASS` returns an error code, which is a problem because we can't translate it. Can we have this return an error class?
- The examples on www.fault-tolerance.org are wrong with respect to error codes vs. classes
- What if send fails and we try to replay it inside the error handler? There could have been a partial message sent.
- This could be reflected by calling the error catastrophic.
- Squyres: Want to be able to tell the user that something was masked. Possibly by attaching a string to the error handler that can be returned to the user.
- Another alternative is to have two errors returned from the error handler, input and output.
- Martin: Should much of this be handled by PMPI/QMPI?
- This was generally agreed upon. More later.
- Anh: We might want to turn this off to avoid performance impact. For example, MS-MPI might not be long jump safe.
- Kathryn/Ignacio: It might be nice to have a way to "clean up" some of MPI (e.g. invalidate everything since a certain point or inside a session or something)
- Squyres: Could the handles be a union instead of a `void *`?
  - Martin: It's already done as a `void *` in the tools interface.
- `handle_types` doesn't need to be an array anymore
- What about generalized requests with the new function to give back a handle associated with a request?
- Squyres: What if you mix setting old and new error handlers on the same error handler?
- To be more backward compatible, we could choose the last error handler set.
- We went through the wish list to see which were still worth pursuing:
  - Clarify what you are allowed to do in an error handler
  - Return new error codes from an error handler
    - Handle with QMPI
  - Pick which error classes we handle in a single function
  - Multiple error handlers attached to a single object
  - Be able to recreate operations if desired
    - Handle with QMPI
  - Combine the three error handler functions into a single one to be able to assign generically
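
The union-versus-`void *` question from Squyres and the wish-list item about a single generic error handler function could be sketched roughly as follows. This is only an illustration under stated assumptions, not proposed standard text: every `demo_` name is hypothetical, and the `demo_comm`, `demo_win`, and `demo_file` structs are stand-ins for the real MPI handle types so the sketch compiles without an MPI installation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for communicator, window, and file handles. */
typedef struct { int id; } demo_comm;
typedef struct { int id; } demo_win;
typedef struct { int id; } demo_file;

/* Hypothetical tag that says which union member is valid. */
typedef enum {
    DEMO_HANDLE_COMM,
    DEMO_HANDLE_WIN,
    DEMO_HANDLE_FILE
} demo_handle_type;

/* The union alternative to passing the handle as a void pointer. */
typedef union {
    demo_comm comm;
    demo_win  win;
    demo_file file;
} demo_handle;

/* One handler signature shared by all three object kinds. */
typedef void (*demo_errhandler_fn)(demo_handle_type type, demo_handle handle,
                                   int error_code);

/* An object that can carry an error handler, whatever its kind. */
typedef struct {
    demo_handle_type   type;
    demo_handle        handle;
    demo_errhandler_fn errhandler; /* NULL until a handler is attached */
} demo_object;

/* Build a communicator-kind object with no handler attached. */
demo_object demo_make_comm(int id)
{
    demo_object obj;
    obj.type = DEMO_HANDLE_COMM;
    obj.handle.comm.id = id;
    obj.errhandler = NULL;
    return obj;
}

/* Single generic setter in place of one setter per object kind. */
void demo_set_errhandler(demo_object *obj, demo_errhandler_fn fn)
{
    assert(obj != NULL);
    obj->errhandler = fn;
}

/* Invoke the attached handler; returns 1 if one ran, 0 otherwise. */
int demo_raise(const demo_object *obj, int error_code)
{
    assert(obj != NULL);
    if (obj->errhandler == NULL)
        return 0;
    obj->errhandler(obj->type, obj->handle, error_code);
    return 1;
}

/* Record of the most recent handler invocation, for inspection. */
struct { demo_handle_type type; int id; int error_code; } demo_last;

/* Example handler: switch on the tag to read the matching union member. */
void demo_record_handler(demo_handle_type type, demo_handle handle,
                         int error_code)
{
    demo_last.type = type;
    demo_last.error_code = error_code;
    switch (type) {
    case DEMO_HANDLE_COMM: demo_last.id = handle.comm.id; break;
    case DEMO_HANDLE_WIN:  demo_last.id = handle.win.id;  break;
    case DEMO_HANDLE_FILE: demo_last.id = handle.file.id; break;
    }
}
```

For example, attaching `demo_record_handler` to an object from `demo_make_comm(7)` with `demo_set_errhandler` and then calling `demo_raise(&obj, 42)` leaves `demo_last.id` at 7 and `demo_last.error_code` at 42. With the tag passed next to the union, the handler needs no cast, which is the main difference from the `void *` form Martin mentions.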