2017 08 16

Aurelien Bouteiller edited this page Aug 16, 2017 · 1 revision

Attendees

UTK - Aurelien Bouteiller
LLNL - Ignacio Laguna
ORNL - Geoffroy Valle

Catastrophic errors

Pavan wants to keep supporting this idea.
Ignacio asked why we need to have an intermediate state between fully catastrophic and fully defined.
- In POSIX, only these two states exist (e.g. SEGFAULT: the process aborts, vs ENOMEM: you can continue or retry, and everything remains well defined)
Aurelien remarks that
- POSIX does not concern itself as much with distributed state (remaining globally defined is an easier problem for POSIX than for MPI, whose state is distributed).
- POSIX favors software engineering over performance; MPI tends to make the opposite tradeoff
- Therefore it makes sense that some errors in MPI will blow up the MPI state
A third idea is to return an error (vs aborting silently) only if the state remains defined
- Would work for resource errors
- Would prevent some good post-failure use cases (e.g. MPI is dysfunctional from now on because of an error/failure/resource limit, but the application still has a chance to save the dataset or continue w/o MPI).
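
The difference between the options discussed above can be illustrated with a small model. This is only a sketch under stated assumptions: `Outcome`, `run_step`, `op`, and `save_dataset` are hypothetical names made up for illustration and are not part of MPI, POSIX, or any proposal. It models a library call that can succeed, fail with its state still defined (the ENOMEM-like case), or fail catastrophically (state undefined, but the process is still alive).

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical result classes for one library operation."""
    OK = "ok"                      # operation succeeded
    RECOVERABLE = "recoverable"    # error, library state still defined
    CATASTROPHIC = "catastrophic"  # error, library state undefined from now on


def run_step(op, save_dataset, max_retries=1):
    """Run `op`, retrying while the error is recoverable.

    If the library reports a catastrophic error, it can no longer be
    used, but the application still gets a chance to save its data.
    Returns the final Outcome.
    """
    outcome = op()
    retries = 0
    # Recoverable errors leave the state defined, so a retry is legal.
    while outcome is Outcome.RECOVERABLE and retries < max_retries:
        outcome = op()
        retries += 1
    if outcome is Outcome.CATASTROPHIC:
        # Intermediate state: no more library calls, but not an abort.
        save_dataset()
    return outcome
```

In this model, the third idea (return an error only if the state remains defined) would correspond to never returning `Outcome.CATASTROPHIC` to the caller and aborting instead, which is why the `save_dataset` path would be lost.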

TUM proposal in the Sessions group

Aurelien gave a very short overview.
In the Sessions WG, a proposal from TUM to add grow-shrink operations has been discussed.
We would like to evaluate how/if that can do something for error management.
Will discuss this in more depth at a future meeting where attendance is higher.