2017 08 16

Aurelien Bouteiller edited this page Aug 16, 2017 · 1 revision

Attendees

  • UTK - Aurelien Bouteiller
  • LLNL - Ignacio Laguna
  • ORNL - Geoffroy Valle

Catastrophic errors

  • Pavan wants to keep supporting this idea.
  • Ignacio asked why we need to have an intermediate state between fully catastrophic and fully defined.
    • In POSIX, only these two states exist (e.g. a segfault aborts the process, whereas after ENOMEM you can continue or retry and everything remains well defined)
  • Aurelien remarks that
    • POSIX does not have to concern itself as much with distributed state (remaining globally defined is an easier problem for POSIX than for MPI, whose state is distributed).
    • POSIX favors software engineering over performance; MPI tends to make the opposite tradeoff.
    • Therefore it makes sense that some errors in MPI will blow up the MPI state.
  • A third idea is to return an error (vs aborting silently) only if the state remains defined
    • Would work for resource errors
    • Would prevent some good post-failure use cases (e.g. MPI is dysfunctional from now on because of an error, failure, or resource limit, but the application still has a chance to save the dataset or continue without MPI).
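
For illustration only, the "fully defined" POSIX case Ignacio mentioned (ENOMEM, where the caller can continue or retry) could look like the C sketch below. The helper name `alloc_with_retry` and its halving policy are hypothetical and were not discussed at the meeting; the sketch only assumes that `malloc` returns NULL and sets `errno` to ENOMEM on failure, as POSIX specifies.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not from the meeting): try to allocate `want` bytes.
 * Each time malloc reports ENOMEM, halve the request and retry, stopping once
 * the request would drop below `min_size`. On success, returns the buffer and
 * stores its size in *got. On failure, returns NULL and leaves *got == 0.
 * The process state stays well defined in both cases. */
void *alloc_with_retry(size_t want, size_t min_size, size_t *got)
{
    assert(got != NULL);
    *got = 0;

    size_t size = want;
    while (size > 0 && size >= min_size) {
        errno = 0;
        void *buf = malloc(size);
        if (buf != NULL) {
            *got = size;
            return buf;
        }
        if (errno != ENOMEM) {
            break; /* unexpected failure: do not keep retrying */
        }
        size /= 2; /* recoverable error: retry with a smaller request */
    }
    return NULL;
}
```

A caller would write something like `size_t got; void *buf = alloc_with_retry(1 << 20, 4096, &got);` and then either use the `got` bytes or take a fallback path when `buf` is NULL. The contrast with the catastrophic case is that a segfault offers no such return path at all.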

TUM proposal in the Sessions group

  • Aurelien gave a very short overview.
  • In the Sessions WG, a proposal from TUM to add grow-shrink operations has been discussed.
  • We would like to evaluate whether and how that could help with error management.
  • Will discuss this in more depth at a future meeting where attendance is higher.