
2017 09 20

Wesley Bland edited this page Sep 22, 2017 · 1 revision

WG Time

Catastrophic Errors

  • Went over slides again for new folks in the room
    • Confirmed that we still like the current proposal and cleared up confusion around it.
    • When presenting to the forum, we should demonstrate plenty of use cases.
      • Resource exhaustion is easy (and perhaps sufficient), but what about others?
  • inquiry.tex:524 - The text implies that returning MPI_IS_OK means we'll never have an error.
    • Reword it to state what has happened so far, not to guarantee anything about the future.
  • Update the issue description ahead of the reading.

Process Failure Recovery

  • Reasons Reinit can't live in ULFM
    • A failure detector, process recovery, etc. can be implemented in SLURM faster than in MPI.
    • Reinit uses the PMPI interface.
    • Reinit fails faster because it does no agreement or revoking.
  • Ignacio: Let's have both models and add an API function to pick which model you want.
    • Probably can't have both in the same app, but it might be possible if you can make strong guarantees about your application + libraries.
    • Aurelien: Could we allow you to pick with error handlers?
      • Set MPI_ERRORS_REINIT on your communicator if you want reinit.
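
The error handler idea above can be pictured with a small toy model. This is only a sketch under stated assumptions: every name below (`toy_comm`, `toy_errhandler`, `toy_model_for`, `toy_models_consistent`) is hypothetical and is not MPI API, and `MPI_ERRORS_REINIT` itself is only a suggestion from this discussion, not a constant in any released standard. The sketch assumes that a communicator without the reinit handler falls back to ULFM-style recovery, and it adds a check for the "probably can't have both in the same app" concern.

```c
#include <stddef.h>

/* Hypothetical toy model: none of these names are part of the MPI API. */
typedef enum {
    TOY_ERRORS_RETURN, /* stands in for an existing handler such as MPI_ERRORS_RETURN */
    TOY_ERRORS_REINIT  /* stands in for the MPI_ERRORS_REINIT idea from the notes */
} toy_errhandler;

typedef enum { TOY_MODEL_ULFM, TOY_MODEL_REINIT } toy_model;

typedef struct {
    const char *name;       /* label used only for illustration */
    toy_errhandler handler; /* handler currently set on this communicator */
} toy_comm;

/* The handler set on a communicator selects its recovery model.
 * Assumption: anything other than the reinit handler means ULFM-style recovery. */
toy_model toy_model_for(const toy_comm *comm)
{
    return comm->handler == TOY_ERRORS_REINIT ? TOY_MODEL_REINIT : TOY_MODEL_ULFM;
}

/* Returns 1 if every communicator selects the same model (or n is 0 or 1),
 * and 0 if the application mixes both models. */
int toy_models_consistent(const toy_comm *comms, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        if (toy_model_for(&comms[i]) != toy_model_for(&comms[0]))
            return 0;
    }
    return 1;
}
```

An application and its libraries could run a check like `toy_models_consistent` over their communicators before relying on either model. Whether mixing should be an error or merely discouraged was left open in the discussion.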

ULFM

  • Concern about overlapping communicators comes from overlapping shrinks (not revokes)
    • We think it is enough to improve our advice on the safest way to do MPI recovery: all MPI recovery should live in one place instead of happening at multiple layers (unless you are sure it's ok).

Reading

The reading did not succeed because of concern about backward incompatibility. The incompatibility itself is ok, but we need to add a new chapter on backward-incompatible changes that points it out.

Other notes are on the pull request itself: https://github.com/mpi-forum/mpi-standard/pull/1#pullrequestreview-64681897
