
2016 06 08

Wesley Bland edited this page Jun 13, 2016 · 2 revisions
  • Rename MPI_IS_CATASTROPHIC
    • Call it MPI_GET_STATE which would return one of a number of predefined values
      • The only current value would be MPI_UNDEFINED
  • Remove MPI_ERR_IS_CATASTROPHIC
    • Its functionality is absorbed by MPI_GET_STATE.
  • We might want to add a session argument in the future.
  • Section 2.8 - Change MPI_ISEND to MPI_SEND without freeing the request
  • p. 21 L. 34 - In such as case
  • Pavan: We shouldn't copy error handlers for anything but MPI_COMM_DUP because it would be inconsistent. This could be a backward compatibility issue.
    • Further discussion leaned toward a more extreme version of this where all communicators would start with MPI_ERRORS_ARE_FATAL.
  • Pavan: Abort should cover intercommunicators so it aborts only the remote group.
  • All uses of "implementation specific" should become "implementation-specific"
  • If we decide not to propagate error handlers, we need to say "all communicators" at the top of p. 345.
  • Changelog line 28: changed, line 32: abort to the, line 33: replace "on" with "using"
  • Straw Vote:
    • Slides option 1: 6
    • Option 2: 3 (would move to option 1 if all error handlers were MPI_ERRORS_ARE_FATAL)
    • Option 3: 9
  • Martin/Pavan: Using ULFM may not add overhead over not using ULFM when error checking is enabled, but if it is disabled, the overhead could be more significant.
  • Pavan: This is especially true for offloading networks where it may become necessary to maintain a software request queue to return errors and handle revoke correctly.
  • p. 20 L. 95 - When a process
  • p. 21 - Combine the advice to users
  • p. 337 L. 10 - "wether" -> "whether"
  • Pavan: Can we get a requested / provided type of semantic for FT the same way we have for threads?
    • Jeff: Or standardize an mpiexec flag?
      • This is probably more desirable because it would let the implementation pick a different library at runtime to decrease overhead.
  • Martin: The definition of MPI_FT is insufficient. We should better specify what is supported or not.
  • Definition of MPI_FT: "chapter 15" -> "Chapter 15" (this occurs elsewhere too)
  • p. 361 L. 42 - It should be noted
  • p. 601 L. 39 - "ranks" -> "MPI processes"
  • p. 601 L. 47 - "these" -> "those"
  • Martin: What about routing problems?
    • Aurelien: Those should be masked or reported as a different error.
    • Wesley: How can the implementation tell the difference?
  • Dan: Saying "initialization" (or initiation) for nonblocking calls may not cover both regular nonblocking and persistent nonblocking operations.
  • p. 603 L. 10 - Fix something (I missed this discussion)
  • p. 603 L. 18 "involved processes" -> "at least one involved process"
  • p. 603 L. 28 - Dan: "Future communication" is unclear. How about "All outstanding and future communication" or change "communication" to "operations".
  • Search and replace all "ranks" with "MPI processes"
  • p. 604 L. 5 - "new communicator" -> "new communicator handle" (do this everywhere)
  • p. 603 L. 41 - "some operations' semantics" is too unclear to be useful. Add "for example, MPI_BARRIER".
  • p. 605 L. 15-18 - Doesn't have to be advice
  • p. 605 L. 12 - The application isn't in an undefined state; MPI is.
  • p. 604 L. 47-48 - Instead of saying that no processes are spawned, can we say this in terms of MPI_COMM_GET_PARENT?
  • p. 607 L. 18 - "knowledge" -> "notification" or "automatic notification", "communication" -> "non-local" or drop entirely because we've already defined involved.
  • p. 607 L. 19 - eventually
  • p. 607 L. 42 - "as soon as either" -> "either when"
  • p. 608 L. 20-22 - "whose failure raised" -> "whose failure caused an exception of class ... to be raised"
  • Definition of MPI_COMM_SHRINK: Call out that failed processes may or may not already be known
    • Make sure that we run the new text by Dan since he was most concerned about the wording.
  • p. 608 L. 25 - Make the first sentence normative
  • Ryan has concerns about MPI_COMM_FAILURE_ACK (missed the specifics)
  • p. 608 L. 46 - "proceed" -> "complete"
  • p. 608 L. 47 - "previously acknowledged" is bad for some reason
  • p. 610 L. 25 - "correct" -> "alive"
  • p. 613 L. 22 - Switch check order
  • p. 613 L. 34 - if (split_ok) {
  • p. 615 L. 18-19 - Need to handle MPI_ERR_IN_STATUS
  • p. 615 L. 34 - "recieve" -> "receive"
  • If we say output values are invalid, we need to at least say that the error value in the status object is correct.
  • Dan: Define "alive"
  • Martin: MPI_ERR_CLASS returns an error code which is a problem because we can't translate it. Can we have this return an error class?
  • The examples on www.fault-tolerance.org are wrong with respect to error codes vs. classes
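The distinction between error codes and error classes raised in the last two items can be illustrated with a small stand-alone sketch. It does not use MPI itself: every `toy_` name, the numeric values, and the bit layout of the error code are assumptions made up for illustration. In real MPI the translation step is done by MPI_Error_class, and the encoding of error codes is up to the implementation.

```c
#include <assert.h>

/* Toy stand-ins for error classes. These are NOT real MPI values. */
enum { TOY_SUCCESS = 0, TOY_ERR_PROC_FAILED = 75, TOY_ERR_REVOKED = 77 };

/* Hypothetical encoding: the low 8 bits hold the error class and the
 * upper bits hold implementation-specific detail. */
int toy_make_code(int error_class, int detail)
{
    assert(error_class >= 0 && error_class <= 0xFF);
    return (detail << 8) | error_class;
}

/* Plays the role that MPI_Error_class plays in real MPI:
 * translate an error code into its error class. */
int toy_error_class(int error_code, int *error_class)
{
    *error_class = error_code & 0xFF;
    return TOY_SUCCESS;
}

/* Fragile: compares an error code directly against a class constant.
 * This only works when the code carries no extra detail. */
int is_proc_failed_wrong(int rc)
{
    return rc == TOY_ERR_PROC_FAILED;
}

/* Robust: translate the code to its class first, then compare. */
int is_proc_failed(int rc)
{
    int error_class;
    toy_error_class(rc, &error_class);
    return error_class == TOY_ERR_PROC_FAILED;
}
```

With this toy encoding, a code built as `toy_make_code(TOY_ERR_PROC_FAILED, 3)` is missed by `is_proc_failed_wrong` but recognized by `is_proc_failed`, which is the kind of mistake the note about the published examples refers to.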

Error Handlers Greenfield

  • What if send fails and we try to replay it inside the error handler? There could have been a partial message sent.
    • This could be reflected by calling the error catastrophic.
  • Squyres: Want to be able to tell the user that something was masked. Possibly by attaching a string to the error handler that can be returned to the user.
    • Another alternative is to have two errors returned from the error handler, input and output.
  • Martin: Should much of this be handled by PMPI/QMPI?
    • This was generally agreed upon. More later.
  • Anh: We might want to turn this off to avoid performance impact. For example, MS-MPI might not be long jump safe.
  • Kathryn/Ignacio: It might be nice to have a way to "clean up" some of MPI (e.g. invalidate everything since a certain point or inside a session or something)
  • Squyres: Could the handles be a union instead of a void *?
    • Martin: It's already done as a void * in the tools interface.
  • handle_types doesn't need to be an array anymore
  • What about generalized requests with the new function to give back a handle associated with a request?
  • Squyres: What if you mix setting old and new error handlers on the same error handler?
    • To be more backward compatible, we could choose the last error handler set.
  • We went through the wish list to see which were still worth pursuing:
    • Clarify what you are allowed to do in an error handler
    • Return new error codes from an error handler
      • Handle with QMPI
    • Pick which error classes we handle in a single function
    • Multiple error handlers attached to a single object
    • Be able to recreate operations if desired
      • Handle with QMPI
    • Combine the three error handler functions into a single one to be able to assign generically
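Several of the wish-list items (a single generic handler type, picking which error classes a handler covers, multiple handlers on one object, returning a new error code, and letting the last handler set win) can be pictured together in a small stand-alone sketch. This is only one possible shape, not something the group agreed on, and it does not use MPI: every `toy_` name, the class values, and the dispatch rule are assumptions made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_HANDLERS 4
#define TOY_CLASS_BIT(c) (1u << (c))

/* Toy error classes. These are NOT real MPI values. */
enum { TOY_SUCCESS = 0, TOY_ERR_COMM = 1, TOY_ERR_PROC_FAILED = 2 };

/* One generic handler type: the object is passed as a void * (as in the
 * tools interface discussion) and the handler returns the error code
 * that should be reported to the caller. */
typedef int (*toy_handler_fn)(void *handle, int error_class);

struct toy_handler {
    unsigned class_mask;   /* which error classes this handler covers */
    toy_handler_fn fn;
};

struct toy_object {
    struct toy_handler handlers[TOY_MAX_HANDLERS];
    size_t count;
};

/* Attach another handler; returns 0 on success, -1 if the table is full. */
int toy_attach(struct toy_object *obj, unsigned class_mask, toy_handler_fn fn)
{
    if (obj->count == TOY_MAX_HANDLERS)
        return -1;
    obj->handlers[obj->count].class_mask = class_mask;
    obj->handlers[obj->count].fn = fn;
    obj->count++;
    return 0;
}

/* Dispatch an error: the most recently attached handler whose mask covers
 * the class wins. If no handler matches, the class is returned unchanged. */
int toy_raise(struct toy_object *obj, int error_class)
{
    assert(error_class >= 0 && error_class < 32);
    for (size_t i = obj->count; i-- > 0;) {
        if (obj->handlers[i].class_mask & TOY_CLASS_BIT(error_class))
            return obj->handlers[i].fn(obj, error_class);
    }
    return error_class;
}

/* Sample handler that masks the error by reporting success. */
int toy_mask_error(void *handle, int error_class)
{
    (void)handle;
    (void)error_class;
    return TOY_SUCCESS;
}

/* Sample handler that reports the error unchanged. */
int toy_pass_through(void *handle, int error_class)
{
    (void)handle;
    return error_class;
}
```

Attaching `toy_mask_error` for `TOY_ERR_PROC_FAILED` only would hide process failures while still reporting `TOY_ERR_COMM`. Attaching `toy_pass_through` afterwards for the same class would override it, mirroring the "last error handler set" suggestion.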