2017 08 16
Aurelien Bouteiller edited this page Aug 16, 2017 · 1 revision
- UTK - Aurelien Bouteiller
- LLNL - Ignacio Laguna
- ORNL - Geoffroy Valle
- Pavan wants to keep supporting this idea.
- Ignacio asked why we need an intermediate state between fully catastrophic and fully defined.
- In POSIX, only these two states exist (e.g. a segfault aborts the process, whereas after ENOMEM you can continue or retry and everything remains well defined).
- Aurelien remarks that:
- POSIX does not concern itself as much with distributed state (remaining globally defined is an easier problem for POSIX than for MPI, whose state is distributed).
- POSIX favors software engineering over performance; MPI tends to make the opposite tradeoff.
- Therefore it makes sense that some errors in MPI will blow up the MPI state.
- A third idea is to return an error (vs aborting silently) only if the state remains defined.
- Would work for resource errors.
- Would prevent some good post-failure use cases (e.g. MPI is dysfunctional from now on because of an error, failure, or resource limit, but the application still has a chance to save the dataset or continue without MPI).
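
The "fully defined" POSIX state mentioned above (ENOMEM: continue or retry) can be illustrated with a minimal C sketch. It was not part of the meeting; `alloc_with_fallback` is a hypothetical helper written for this illustration, not a POSIX or MPI function. It relies only on the guarantee that a failed `malloc` returns `NULL` and leaves the process in a well-defined state, so the caller may retry with a smaller request.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper (illustration only): try to allocate `want` bytes and,
 * on failure, retry with half the size until the request would drop below
 * `min_bytes`. On success, returns the buffer and stores its size in *got.
 * On failure, returns NULL and stores 0 in *got.
 */
static void *alloc_with_fallback(size_t want, size_t min_bytes, size_t *got)
{
    assert(got != NULL);

    for (size_t n = want; n > 0 && n >= min_bytes; n /= 2) {
        void *p = malloc(n);
        if (p != NULL) {
            *got = n;
            return p;
        }
        /* malloc failed (POSIX sets errno to ENOMEM). Nothing was
         * corrupted, so retrying with a smaller size is safe. */
    }

    *got = 0;
    return NULL;
}
```

Calling `alloc_with_fallback(SIZE_MAX, 1024, &got)` fails on the first attempt and then shrinks the request until an allocation succeeds. The catastrophic state (a segfault) has no such recovery path, and the MPI discussion above concerns errors that fall between these two cases.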
- Aurelien gave a very short overview.
- In the Session WG, a proposal from TUM to add grow-shrink operations was discussed.
- We would like to evaluate whether, and how, that could help with error management.
- Will discuss this in more depth at a future meeting where attendance is higher.