When running a local job that logs to Atlas, the job remains stuck in a running state if the underlying Python process dies for some reason (e.g. OOM). The job can then no longer be manipulated from the Atlas UI.

It would be ideal if Atlas could fail such a job automatically when the underlying process dies. Failing that, the user should at least have the ability to "stop" these phantom jobs, which would then appear as failed.

On the issue of the wrong status being displayed to the user: one proposal is to change the job status update mechanism to a heartbeat mechanism, since these jobs can be executed locally (i.e. there is no natural way to supervise them the way a job running in the scheduler's cluster is supervised). Are there alternatives that can capture these catastrophic failure modes? A rough sketch of the heartbeat idea follows.
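A minimal sketch of what the client side of such a heartbeat could look like, assuming a hypothetical `POST /jobs/<job_id>/heartbeat` endpoint on the tracking server. The endpoint, class, and parameter names here are illustrative only and are not part of the Atlas API:

```python
# Sketch of a client-side heartbeat sender. A daemon thread pings the
# tracking server periodically; if the process dies (OOM kill, SIGKILL),
# the pings simply stop and the server can mark the job as failed.
import atexit
import threading

import requests


class Heartbeat:
    """Periodically notifies the tracking server that this job is alive."""

    def __init__(self, server_url, job_id, interval_seconds=30):
        # Hypothetical endpoint; Atlas would define its own route.
        self._url = f"{server_url}/jobs/{job_id}/heartbeat"
        self._interval = interval_seconds
        self._stop = threading.Event()
        # Daemon thread: never blocks interpreter shutdown, so a crashed
        # process stops heartbeating without any cleanup code running.
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()
        atexit.register(self.stop)

    def stop(self):
        self._stop.set()

    def _run(self):
        # wait() returns False on timeout, True once stop() is called.
        while not self._stop.wait(self._interval):
            try:
                requests.post(self._url, timeout=5)
            except requests.RequestException:
                # Transient network errors are ignored; the server should
                # only fail the job after several consecutive missed beats.
                pass
```

On the server side, a periodic sweep could mark any "running" job whose last heartbeat is older than a few intervals as failed. That would also cover the OOM and hard-kill cases where no exit hook ever runs.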
This is related to #77 and #137