2018 02 27
- Error Handlers
- FT Interop
- Checkpointable MPI
- ULFM
Because error handlers are not set uniformly across a communicator, triggering an error handler anywhere would cause an abort everywhere. This is a problem with intercommunicators when you want to couple otherwise disconnected applications without giving one side the ability to abort the other side.
Options:
- Change the wording for MPI_ERRORS_ABORT to say that an error will trigger the error handler everywhere, rather than directly aborting.
- Vote on proposal and add changes later.
- Make no changes.
If we changed the proposal, the new wording could be something like:
If invoked, this error will trigger the error handler on all other MPI processes in this communicator and then call MPI_ABORT on MPI_COMM_SELF.
We decided that solving this problem for MPI_ERRORS_ABORT would not be sufficient, because the default error handler, MPI_ERRORS_ARE_FATAL, would still abort all of the processes. An application that is already not being careful about how it handles errors probably wouldn't have changed the error handler from MPI_ERRORS_ARE_FATAL to MPI_ERRORS_ABORT.
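The reasoning above can be checked with a small toy model. This is only a sketch, not MPI code: the handler names are plain strings, `raise_error` is a hypothetical helper, and the rule that a notified MPI_ERRORS_ABORT process aborts only itself is an assumption made to keep the model finite.

```python
# Toy model of abort scope under the reworded MPI_ERRORS_ABORT semantics.
# Nothing here calls MPI; the handler names are just labels.
FATAL = "MPI_ERRORS_ARE_FATAL"   # default handler: takes every process down
ABORT = "MPI_ERRORS_ABORT"       # handler whose wording was under discussion
RETURN = "MPI_ERRORS_RETURN"     # error code is returned to the caller


def raise_error(handlers, failing_rank):
    """Return the set of ranks still alive after one error on `failing_rank`.

    `handlers[i]` is the handler set by rank i. Assumed rewording: the
    failing rank triggers the handler of every other rank, then aborts
    only itself (MPI_ABORT on MPI_COMM_SELF).
    """
    alive = set(range(len(handlers)))
    origin = handlers[failing_rank]
    if origin == RETURN:
        return alive              # caller sees an error code, nobody aborts
    if origin == FATAL:
        return set()              # default handler aborts everyone
    alive.discard(failing_rank)   # reworded ABORT: abort self only
    for rank, handler in enumerate(handlers):
        if rank == failing_rank:
            continue
        if handler == FATAL:
            return set()          # a default handler elsewhere still kills all
        if handler == ABORT:
            alive.discard(rank)   # assumption: notified rank aborts itself only
    return alive
```

With handlers `[ABORT, RETURN, RETURN]` an error on rank 0 leaves ranks 1 and 2 running, but with `[ABORT, FATAL, RETURN]` nothing survives, which is the objection recorded above: the rewording does not help as long as any process keeps the default handler.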
How do we tell the application that it's not using the same FT model during MPI_INTERCOMM_MERGE?
- Returning an error is more user friendly.
- Saying this is undefined is more consistent with the rest of the standard and backward compatible.
We didn't make a definitive decision here because we got onto many tangents.
Tony: There might be times that we want to allow multiple FT models at the same time when they are compatible (e.g. FA-MPI might want to use ULFM's MPI_COMM_SHRINK).
Dan: Are these models all actually incompatible to begin with?
- With ULFM + Reinit combined, the ULFM procs would roll forward after a fault and shrink out the reinit processes. The reinit processes would have a smaller MPI_COMM_WORLD or would recreate the ULFM processes. If they really needed to get back together, they could use connect/accept.
Aurelien: Could we allow nested models?
- The model would now need to be a property of the communicator/window/file and every model would need to define its compatibility with itself and other models, including how it could be nested (or not).
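A rough sketch of what "the model is a property of the communicator" could look like, combined with the more user-friendly option of returning an error at merge time. Everything here is hypothetical: `ToyComm`, `FTModelMismatch`, `intercomm_merge` and the contents of `EXAMPLE_TABLE` are illustrative only, and the notes did not settle which models are actually compatible.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


class FTModelMismatch(Exception):
    """Raised by the toy merge when the two sides use incompatible FT models."""


@dataclass
class ToyComm:
    ranks: List[int]
    ft_models: FrozenSet[str]   # FT model(s) attached to this communicator


# Illustrative compatibility table, NOT a statement about the real models.
# frozenset({"ULFM"}) means "ULFM is compatible with itself".
EXAMPLE_TABLE: Set[FrozenSet[str]] = {
    frozenset({"ULFM"}),
    frozenset({"FA-MPI"}),
    frozenset({"Reinit"}),
    frozenset({"ULFM", "FA-MPI"}),
}


def intercomm_merge(local: ToyComm, remote: ToyComm,
                    table: Set[FrozenSet[str]]) -> ToyComm:
    """Merge two toy communicators, or raise if any model pair is not listed."""
    for a in local.ft_models:
        for b in remote.ft_models:
            if frozenset((a, b)) not in table:
                raise FTModelMismatch(f"{a} cannot be merged with {b}")
    # The merged communicator carries the union of both sides' models.
    return ToyComm(local.ranks + remote.ranks,
                   local.ft_models | remote.ft_models)
```

The "undefined behavior" option would correspond to dropping the check entirely; the sketch only shows that an explicit check needs every model to declare its pairs, including the pair with itself.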
Tony: Interoperability could be somewhat solved by breaking down the FT models into more composable features, e.g. use MPI_COMM_SHRINK with some other error detector (or user error detector) and agreement algorithms and functions.
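As a sketch of the composability idea, the toy `shrink` below takes the failure detector and the agreement function as parameters instead of hard-wiring them. The names `shrink`, `union_agreement`, `intersection_agreement` and `example_detector` are hypothetical and are not part of any proposal; for simplicity every member contributes a view, whereas a real implementation would only hear from live processes.

```python
from typing import Callable, Dict, List, Set

# A detector maps (my rank, current members) to the ranks I believe failed.
Detector = Callable[[int, List[int]], Set[int]]
# An agreement function turns the per-rank views into one common answer.
Agreement = Callable[[Dict[int, Set[int]]], Set[int]]


def union_agreement(views: Dict[int, Set[int]]) -> Set[int]:
    """A rank is treated as failed if at least one member suspects it."""
    failed: Set[int] = set()
    for view in views.values():
        failed |= view
    return failed


def intersection_agreement(views: Dict[int, Set[int]]) -> Set[int]:
    """A rank is treated as failed only if every member suspects it."""
    all_views = list(views.values())
    if not all_views:
        return set()
    failed = set(all_views[0])
    for view in all_views[1:]:
        failed &= view
    return failed


def shrink(members: List[int], detector: Detector,
           agree: Agreement) -> List[int]:
    """Return the member list with the agreed-upon failed ranks removed."""
    views = {rank: detector(rank, members) for rank in members}
    failed = agree(views)
    return [rank for rank in members if rank not in failed]


def example_detector(rank: int, members: List[int]) -> Set[int]:
    """Hypothetical user detector: only even ranks have noticed rank 3 failing."""
    return {3} if rank % 2 == 0 else set()
```

Swapping `union_agreement` for `intersection_agreement` changes the result of the same `shrink` call, which is the kind of mix-and-match the comment above is pointing at.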
Nawrin presented a "more friendly" version of reinit that doesn't long jump, but does recovery only during MPI calls. The implementation will eventually cause all functions to return the same error code, which signals that everyone should roll back to a previous checkpoint. Checkpointing the application and implementation is also a "first class citizen" in the implementation and includes an agreement to complete the previous "transaction". The trade-off compared with reinit is that errors might not be discovered as quickly.
This work is still in progress: so far it covers only p2p communication on MPI_COMM_WORLD, and more work is ongoing to both expand the current scope and integrate with existing implementations.
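The control flow described above (errors surface only as return codes from communication calls, and the application rolls back to its last checkpoint) can be sketched with a single-process toy. `ToyRuntime`, `ERR_ROLLBACK` and `run` are hypothetical names, the fault injection is artificial, and the agreement step that completes a "transaction" is reduced to a plain local save.

```python
import copy

SUCCESS = 0
ERR_ROLLBACK = 1   # hypothetical code meaning "everyone roll back"


class ToyRuntime:
    """Stand-in for the implementation; injects each fault exactly once."""

    def __init__(self, fail_steps):
        self.fail_steps = set(fail_steps)
        self.saved = None

    def send_recv(self, step):
        # Faults are only reported from inside a communication call.
        if step in self.fail_steps:
            self.fail_steps.remove(step)
            return ERR_ROLLBACK
        return SUCCESS

    def checkpoint(self, state):
        # A real version would first agree that the previous
        # transaction completed on every process.
        self.saved = copy.deepcopy(state)

    def restore(self):
        return copy.deepcopy(self.saved)


def run(total_steps, fail_steps, checkpoint_every=2):
    """Sum 0..total_steps-1, rolling back whenever a call reports a fault."""
    rt = ToyRuntime(fail_steps)
    state = {"step": 0, "acc": 0}
    rt.checkpoint(state)
    rollbacks = 0
    while state["step"] < total_steps:
        if rt.send_recv(state["step"]) == ERR_ROLLBACK:
            state = rt.restore()      # no long jump, just resume from the save
            rollbacks += 1
            continue
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            rt.checkpoint(state)
    return state["acc"], rollbacks
```

A fault at step 3 with checkpoints every 2 steps repeats step 2 once and still produces the same sum, so `run(6, [3])` returns `(15, 1)`. The trade-off mentioned above is also visible: a fault is noticed only at the next communication call.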
Aurelien attempted a reading of the ULFM proposal and presented supporting slides. The general feedback from the forum is that ULFM still has too many problems to standardize as a whole. Individual pieces might be acceptable (e.g. MPI_COMM_FAILURE_ACK/GET_ACKED), but others are still objectionable (e.g. MPI_COMM_SHRINK and MPI_COMM_REVOKE).
The suggestion is to break this into three pieces:
1. New error classes and MPI_COMM_FAILURE_ACK/GET_ACKED
2. MPI_COMM_AGREE
3. MPI_COMM_REVOKE and MPI_COMM_SHRINK
1 and 2 are probably standardizable as is, while 3 is the part that is a problem for many. If we have 1 and 2, there might be other solutions that can replace 3 (or tweaks to 3 that can fix the problems which we would discover in the meantime).
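For readers less familiar with pieces 1 and 2, here is a toy model of their intended behavior as described in the ULFM proposal: acknowledging failures takes a snapshot of the failures known locally at that moment, and agreement combines an integer flag across live processes with a bitwise AND. `ToyUlfmComm` and `agree` are hypothetical stand-ins, not the proposed API, and the model ignores the distributed aspects entirely.

```python
from typing import Dict, List, Set


class ToyUlfmComm:
    """Local view of one process; no communication is modeled."""

    def __init__(self, size: int):
        self.size = size
        self.known_failed: Set[int] = set()   # failures noticed so far
        self.acked: Set[int] = set()          # snapshot taken by failure_ack

    def notice_failure(self, rank: int) -> None:
        self.known_failed.add(rank)

    def failure_ack(self) -> None:
        # Acknowledge everything known right now; later failures are not included.
        self.acked = set(self.known_failed)

    def failure_get_acked(self) -> List[int]:
        return sorted(self.acked)


def agree(flags: Dict[int, int]) -> int:
    """Bitwise AND of the flags contributed by the live ranks."""
    result = ~0
    for flag in flags.values():
        result &= flag
    return result
```

A failure noticed after `failure_ack` does not show up in `failure_get_acked` until the next acknowledgement, and `agree({0: 0b111, 1: 0b101})` yields `0b101`.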