2018 02 27

Wesley Bland edited this page Mar 21, 2018 · 2 revisions

F2F WG Agenda

  • Error Handlers
  • FT Interop
  • Checkpointable MPI
  • ULFM

Error Handlers

Because error handlers are not set uniformly across a communicator, triggering an error handler on any one process could cause an abort on every process. This is a problem for intercommunicators, where you want to be able to couple otherwise disconnected applications without giving one side the ability to abort the other side.

Options:

  1. Change the wording for MPI_ERRORS_ABORT to say that an error will trigger the error handler everywhere, rather than directly aborting.
  2. Vote on proposal and add changes later.
  3. Make no changes.

If we changed the proposal, the new wording could be something like:

If invoked, this error will trigger the error handler on all other MPI processes in this communicator and then call MPI_ABORT on MPI_COMM_SELF.

We decided that solving this problem for MPI_ERRORS_ABORT alone would not be sufficient. The default error handler, MPI_ERRORS_ARE_FATAL, would still abort all of the processes, and an application that is not already careful about how it handles errors probably would not have changed its error handler from MPI_ERRORS_ARE_FATAL to MPI_ERRORS_ABORT.

FT Interop

How do we tell the application that it's not using the same FT model during MPI_INTERCOMM_MERGE?

  • Returning an error is more user friendly.
  • Saying this is undefined is more consistent with the rest of the standard and backward compatible.

We didn't make a definitive decision here because we got onto many tangents.

Tony: There might be times that we want to allow multiple FT models at the same time when they are compatible (e.g. FA-MPI might want to use ULFM's MPI_COMM_SHRINK).

Dan: Are these models all actually incompatible to begin with?

  • With ULFM + Reinit combined, the ULFM processes would roll forward after a fault and shrink out the reinit processes. The reinit processes would either have a smaller MPI_COMM_WORLD or would recreate the ULFM processes. If they really needed to get back together, they could use connect/accept.

Aurelien: Could we allow nested models?

  • The model would now need to be a property of the communicator/window/file and every model would need to define its compatibility with itself and other models, including how it could be nested (or not).
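The idea of each model declaring its compatibility with itself and with other models can be sketched as a small lookup table that a merge operation would consult. This is only an illustration: the `FTModel` enum, the `COMPATIBLE` table, and `can_merge` are hypothetical names, and the table entries are placeholders rather than decisions made by the working group.

```python
from enum import Enum


class FTModel(Enum):
    """Hypothetical tags for the FT model attached to a communicator."""
    NONE = "none"
    ULFM = "ulfm"
    REINIT = "reinit"
    FA_MPI = "fa-mpi"


# Hypothetical compatibility table. Each entry is an unordered pair
# (or a single model, meaning "compatible with itself").
# The entries are illustrative assumptions, not agreed semantics.
COMPATIBLE = {
    frozenset({FTModel.NONE}),
    frozenset({FTModel.ULFM}),
    frozenset({FTModel.REINIT}),
    frozenset({FTModel.FA_MPI}),
    # Assumed for illustration: FA-MPI reusing ULFM's shrink operation.
    frozenset({FTModel.ULFM, FTModel.FA_MPI}),
}


def can_merge(a: FTModel, b: FTModel) -> bool:
    """Return True if the two models are declared compatible."""
    return frozenset({a, b}) in COMPATIBLE


if __name__ == "__main__":
    print(can_merge(FTModel.ULFM, FTModel.FA_MPI))
    print(can_merge(FTModel.ULFM, FTModel.REINIT))
```

A real design would also have to say how models nest, which a flat pairwise table like this one does not capture.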

Tony: Interoperability could be somewhat solved by breaking down the FT models into more composable features, e.g. use MPI_COMM_SHRINK with some other error detector (or user error detector) and agreement algorithms and functions.

Checkpointable MPI (Nawrin - Auburn)

Nawrin presented a "more friendly" version of reinit that doesn't long jump, but does recovery only during MPI calls. The implementation will eventually cause all functions to return the same error code which signals that everyone should roll back to a previous checkpoint. Checkpointing the application and implementation is also a "first class citizen" in the implementation and includes an agreement to complete the previous "transaction". The trade-off here from reinit is that errors might not be discovered as quickly.
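The control flow described above, where a fault surfaces only as an error code from a communication call and the application responds by rolling back to its last checkpoint, might look roughly like the following single-process simulation. Nothing here is taken from the presented implementation: `ERR_ROLLBACK`, `run`, and the failure injection through `fail_at` are hypothetical, and no real MPI calls are made.

```python
import copy

SUCCESS = 0
ERR_ROLLBACK = 1  # hypothetical error code, not an MPI constant


def run(steps, fail_at, checkpoint_every=2):
    """Simulate one process doing `steps` units of work.

    `fail_at` is a set of step indices at which the simulated
    communication call reports a fault exactly once.
    Returns (final_total, number_of_rollbacks).
    """
    state = {"step": 0, "total": 0}
    checkpoint = copy.deepcopy(state)
    pending = set(fail_at)
    rollbacks = 0

    while state["step"] < steps:
        # Simulated communication call: the fault is reported here,
        # as a return code, rather than through a long jump.
        if state["step"] in pending:
            pending.discard(state["step"])
            rc = ERR_ROLLBACK
        else:
            rc = SUCCESS

        if rc == ERR_ROLLBACK:
            # Restore the last checkpoint and redo the lost work.
            state = copy.deepcopy(checkpoint)
            rollbacks += 1
            continue

        state["total"] += state["step"]
        state["step"] += 1

        # Take a checkpoint at the end of every "transaction".
        if state["step"] % checkpoint_every == 0:
            checkpoint = copy.deepcopy(state)

    return state["total"], rollbacks


if __name__ == "__main__":
    print(run(5, {3}))
```

In this toy version the result is the same with or without the injected fault; only the rollback count differs. The agreement step that completes a transaction across processes is not modeled.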

This work is still in progress. It currently covers only p2p communication on MPI_COMM_WORLD, and more work is ongoing to both expand the current scope and integrate with existing implementations.

ULFM

Aurelien attempted a reading of the ULFM proposal and presented supporting slides. The general feedback from the forum is that ULFM still has too many problems to standardize as a whole. Individual pieces might be acceptable (e.g. MPI_COMM_FAILURE_ACK/GET_ACKED), but others are still objectionable (e.g. MPI_COMM_SHRINK and MPI_COMM_REVOKE).

The suggestion is to break this into three pieces:

  1. New error classes and MPI_COMM_FAILURE_ACK/GET_ACKED
  2. MPI_COMM_AGREE
  3. MPI_COMM_REVOKE and MPI_COMM_SHRINK

Pieces 1 and 2 are probably standardizable as is, while piece 3 is the part that is a problem for many. If we have 1 and 2, there might be other solutions that can replace 3, or tweaks to 3 that fix its problems, which we would discover in the meantime.
