2017 09 20

WG Time

Catastrophic Errors

Went over slides again for new folks in the room
- Confirmed that we still like the current proposal and cleared up confusion around it.
- When presenting to the forum, we should demonstrate plenty of use cases.
  - Resource exhaustion is easy (and perhaps sufficient), but what about others?
inquiry.tex:524 - The text implies that a return value of MPI_IS_OK means an error will never occur.
- Reframe this as a statement about what has happened so far, not a guarantee about the future.
Update issue description for reading.

Process Failure Recovery

Reasons Reinit can't live in ULFM
- A failure detector, process recovery, etc. can be implemented in SLURM faster than in MPI.
- Reinit uses the PMPI interface.
- Reinit fails faster because it does not do agreement or revocation.
Ignacio: Let's have both models and add an API function to pick which model you want.
- Probably can't have both in the same app, but it might be possible if you can make strong guarantees about your application + libraries.
- Aurelien: Could we allow you to pick with error handlers?
  - Set MPI_ERRORS_REINIT on your communicator if you want reinit.

ULFM

Concern about overlapping communicators comes from overlapping shrinks (not revokes)
- We think it is enough to improve our advice on the safest way to do MPI recovery: all recovery should live in one place rather than happen at multiple layers (unless you are sure that is ok).

Reading

The reading was not a success because of concern about backward incompatibility. The incompatibility itself is ok, but we need to add a new chapter on backward-incompatible changes that points it out.

Other notes are on the pull request itself: https://github.com/mpi-forum/mpi-standard/pull/1#pullrequestreview-64681897
