
Commit

Fix: Handle SIGTERM in kubeflow pytorch elastic training plugin (flyt…
…eorg#2064)

Signed-off-by: Fabio Graetz <[email protected]>
fg91 authored Dec 21, 2023
1 parent bf726b9 commit 8af01f2
Showing 1 changed file with 4 additions and 0 deletions.
4 changes: 4 additions & 0 deletions plugins/flytekit-kf-pytorch/flytekitplugins/kfpytorch/task.py
@@ -386,6 +386,7 @@ def fn_partial():
         else:
             raise Exception("Bad start method")

+        from torch.distributed.elastic.multiprocessing.api import SignalException
         from torch.distributed.elastic.multiprocessing.errors import ChildFailedError

         try:
@@ -399,6 +400,9 @@ def fn_partial():
                 raise FlyteRecoverableException(e.format_msg())
             else:
                 raise RuntimeError(e.format_msg())
+        except SignalException as e:
+            logger.exception(f"Elastic launch agent process terminating: {e}")
+            raise IgnoreOutputs()

         # `out` is a dictionary of rank (not local rank) -> result
         # Rank 0 returns the result of the task function
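The pattern this change adds can be tried without torch or flytekit installed. The sketch below is illustrative only and is not the plugin's code: `SignalException` and `IgnoreOutputs` are local stand-ins for the classes named in the diff, and `run_agent`, `simulated_work`, and `_raise_on_signal` are hypothetical helpers. It assumes a POSIX system where a process can send SIGTERM to itself, and it assumes the launch agent surfaces SIGTERM as an exception, as the new `except SignalException` branch suggests.

```python
import os
import signal
import time


class SignalException(Exception):
    """Local stand-in for the SignalException imported in the diff (not the torch class)."""

    def __init__(self, msg: str, sigval: signal.Signals) -> None:
        super().__init__(msg)
        self.sigval = sigval


class IgnoreOutputs(Exception):
    """Local stand-in for the IgnoreOutputs exception raised in the diff."""


def _raise_on_signal(signum, frame):
    # Convert the asynchronous signal into an exception raised in the main thread.
    raise SignalException(
        f"Process {os.getpid()} got signal: {signum}",
        sigval=signal.Signals(signum),
    )


def run_agent(work):
    """Run `work`, translating a SIGTERM into IgnoreOutputs (hypothetical helper)."""
    previous = signal.signal(signal.SIGTERM, _raise_on_signal)
    try:
        return work()
    except SignalException as e:
        # Mirrors the new branch in the diff: report the termination, then
        # signal to the caller that no outputs should be recorded.
        print(f"Elastic launch agent process terminating: {e}")
        raise IgnoreOutputs() from e
    finally:
        # Restore the earlier handler if it was one Python knows about.
        if previous is not None:
            signal.signal(signal.SIGTERM, previous)


def simulated_work():
    # Simulate an external termination request by sending SIGTERM to ourselves.
    os.kill(os.getpid(), signal.SIGTERM)
    time.sleep(1)  # the handler fires at or before this call
    return "finished"


if __name__ == "__main__":
    try:
        run_agent(simulated_work)
    except IgnoreOutputs:
        print("outputs ignored")
```

In the sketch, a termination signal becomes a distinct exception type rather than a generic failure, so the caller can tell "the process was asked to stop" apart from "the training function crashed", which the `ChildFailedError` branch in the diff handles.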
