
Add retry loop into the scheduler handling resubmitting when running state failed #6782

Merged
merged 1 commit into equinor:main from retry_loop on Dec 15, 2023

Conversation

@xjules (Contributor) commented Dec 11, 2023

Issue
Resolves #6771

Approach
Use a retry loop to move a job from the running state back to the waiting state when it fails. The change includes a simple test that checks whether a job has been started several times. max_submit is a parameter of job.__call__ that is passed on from the scheduler.
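Roughly, the change wraps the submit-and-wait step in a bounded loop. A minimal sketch of the idea (not the merged code; the helper _submit_and_run_once and the exact signature are illustrative only):

async def __call__(
    self, start: asyncio.Event, sem: asyncio.BoundedSemaphore, max_submit: int = 2
) -> None:
    # Try the RUNNING state up to max_submit times; on failure the job falls
    # back to WAITING and is resubmitted until the attempts are exhausted.
    for _ in range(max_submit):
        # _submit_and_run_once is a hypothetical helper standing in for the
        # submit / poll / await-returncode sequence.
        returncode = await self._submit_and_run_once(start, sem)
        if returncode == 0:
            return  # success, stop retrying
    # all max_submit attempts failed; the realization is reported as failed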

Additionally, the driver.finish function implements basic clean-up functionality. For the local driver it makes sure that all tasks have been awaited correctly.

Pre review checklist

  • Read through the code changes carefully after finishing work
  • Make sure tests pass locally (after every commit!)
  • Prepare changes in small commits for more convenient review (optional)
  • PR title captures the intent of the changes, and is fitting for release notes.
  • Updated documentation
  • Ensured that unit tests are added for all new behavior (See
    Ground Rules),
    and changes to existing code have good test coverage.

Pre merge checklist

  • Added appropriate release note label
  • Commit history is consistent and clean, in line with the contribution guidelines.

@xjules self-assigned this on Dec 11, 2023
@codecov-commenter commented Dec 11, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (f284e11) 83.86% compared to head (3c5e075) 83.87%.
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6782      +/-   ##
==========================================
+ Coverage   83.86%   83.87%   +0.01%     
==========================================
  Files         365      365              
  Lines       21353    21380      +27     
  Branches      948      948              
==========================================
+ Hits        17907    17932      +25     
- Misses       3152     3154       +2     
  Partials      294      294              


@xjules added the improvement label ("Something nice to have, that will make life easier for developers or users or both.") on Dec 11, 2023
@xjules marked this pull request as ready for review on December 12, 2023 15:01
@xjules force-pushed the retry_loop branch 3 times, most recently from 7698ac5 to 3e725e8, on December 12, 2023 21:19
Two outdated review threads on src/ert/scheduler/job.py were resolved.
@xjules force-pushed the retry_loop branch 2 times, most recently from 04defcf to b0b890c, on December 13, 2023 19:09
await asyncio.sleep(0.01)
returncode = await self.returncode
# we need to make sure that the task has finished too
await self.driver.wait(self.real.iens)
Contributor:

Is it optimal to use the name wait() for this? It looks the same as the lines with self.started.wait(), while the latter is waiting for an asyncio.Event.

Contributor:

Maybe a name that is clear enough to make the comment unnecessary.

Contributor:

This wait function leaks implementation details of the LocalDriver. Why can't the LocalDriver itself do the right thing when the job is ended / restarted?

Also, I'm not sure what we are waiting for anyway. The subprocess is done at this point.

Contributor Author:

Why can't the LocalDriver itself do the right thing when the job is ended / restarted?

Because we want it to be non-blocking, we need to await outside of that scope. The fact that the subprocess has ended does not guarantee that the task has ended too; I got test failures because of that.

Contributor:

The fact that it's a subprocess is an implementation detail. await self.returncode already means "wait for the job to end". Do we need another wait function to do this?

Contributor Author:

Correct, but as you said, it's the job that has finished, not the task. You can try to remove it and the test with retries will fail. I guess awaiting the task does more for that async context.
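A small, self-contained illustration of this point (plain asyncio, not ert code; the timings are arbitrary):

import asyncio

async def run_job() -> int:
    proc = await asyncio.create_subprocess_exec("true")
    returncode = await proc.wait()
    # Any cleanup after the subprocess exits still runs inside this task.
    await asyncio.sleep(0.2)
    return returncode

async def main() -> None:
    task = asyncio.create_task(run_job())
    await asyncio.sleep(0.1)          # the subprocess has exited by now...
    print("task done?", task.done())  # ...but the task has not: prints False
    await task                        # awaiting the task guarantees it has finished
    print("task done?", task.done())  # prints True

asyncio.run(main())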

sem.release()
retries = 0
retry: bool = True
while retry:
Contributor:

Maybe separate the retry functionality into its own function? This is beginning to be a mess.

Eg:

async def __call__(...):
    for _ in range(max_submit):
        if await self.actually_do_the_thing(...):
            # SUCCESS!
            break
    else:
        # FAILURE (no more retries)
        ...

Also, using a normal loop will probably avoid the retries and retry variables and the manual checks for when you should escape the loop.

Contributor:

Also, you may want to reset the events and futures on this job.
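One hypothetical shape of such a reset, assuming the job tracks a started event and a returncode future as seen elsewhere in the diff (names and details are illustrative, not part of the review suggestion):

def _reset_for_resubmission(self) -> None:
    # Give each attempt a fresh Event and Future so that a retry does not
    # observe state left over from the previous attempt. Must be called from
    # within the running event loop.
    self.started = asyncio.Event()
    self.returncode = asyncio.get_running_loop().create_future()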

Contributor Author:

Maybe separate the retry functionality into its own function? This is beginning to be a mess.

Agreed, I saw it coming; I will give splitting it in two a try.

Contributor Author:

Now it's split.

finally:
sem.release()
retries = 0
retry: bool = True
Contributor:

Suggested change:
- retry: bool = True
+ retry = True

True has type bool, just like 0 has type int. No need to specify it explicitly.


@@ -23,6 +23,9 @@ async def kill(self, iens: int) -> None:
except KeyError:
return

async def wait(self, iens: int) -> None:
Contributor:

What would an LSFDriver implementation of this function be like?

Contributor Author:

It could be an Event that is triggered?

Contributor:

I mean, what are we waiting for in LSF? The job being completed/failed is already covered by Job's returncode.

Contributor Author:

For the local driver, this task was never awaited:

self._tasks[iens] = asyncio.create_task(
    self._wait_until_finish(iens, executable, *args, cwd=cwd)
)

We need to move awaiting the task out of the driver and up to the Job level (i.e. await driver.wait()) so that the driver's atomic functions (init, kill, poll) stay non-blocking. For LSF 🤷 maybe it's empty then?
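A minimal sketch of what a local-driver wait() along these lines could look like, assuming tasks are kept in self._tasks as in the snippet above (illustrative only, not the merged implementation):

async def wait(self, iens: int) -> None:
    # Await the background task created at submit time so that it cannot
    # outlive the job; a driver without such local tasks could make this a no-op.
    task = self._tasks.get(iens)
    if task is not None:
        await task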

@xjules force-pushed the retry_loop branch 3 times, most recently from 6b5ec33 to b3737a3, on December 14, 2023 20:58
@@ -81,6 +80,9 @@ async def __call__(
while not self.returncode.done():
await asyncio.sleep(0.01)
returncode = await self.returncode
# we need to make sure that the task has finished too
await self.driver.wait(self.real.iens)
Contributor:

I think the presence of a comment here means that the function in the driver should have a clearer name.

Contributor Author:

Right, it was not clear that the name was the issue :)


@abstractmethod
async def wait(self, iens: int) -> None:
"""Blocks the execution of a job associated with a realization.
Contributor:

I think it is too vague what is to be waited for in implementations of this function. And should we use "block", as we never block anything, do we?

Contributor Author:

We do when we call it, but you are correct, it's not the job execution that is blocked, only the event loop, I guess.

Two more outdated review threads on src/ert/scheduler/job.py were resolved.
@@ -108,3 +107,16 @@ async def test_cancel(tmp_path: Path, realization):

assert (tmp_path / "a").exists()
assert not (tmp_path / "b").exists()


async def test_that_max_submit_was_reached(tmp_path: Path, realization):
Contributor:

Should we ensure that the edge case max_submit=1 is covered?

(what about max_submit=0, is that covered by the config parser?)

Contributor Author:

We can do a parametrization ranging from 1 to 3? Regarding 0, that's a matter of config and not of this test.
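A hedged sketch of the parametrization shape (the test body is elided and the fixtures are illustrative, not the merged test):

import pytest

@pytest.mark.parametrize("max_submit", [1, 2, 3])
async def test_that_max_submit_was_reached(tmp_path, realization, max_submit):
    # Submit a job that always fails and assert that it was started exactly
    # max_submit times before the scheduler gave up.
    ...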

Use a retry loop to move a job from the running state back to the waiting state when it fails. The change includes a simple test that checks whether a job has been started several times. max_submit is a parameter of job.__call__ that is passed on from the scheduler.

Additionally, the driver.finish function implements basic clean-up functionality. For the local driver it makes sure that all tasks have been awaited correctly.
@pinkwah (Contributor) left a comment:

LGTM!

@xjules merged commit 38348cd into equinor:main on Dec 15, 2023
41 of 42 checks passed
@xjules deleted the retry_loop branch on December 15, 2023 11:48
Labels
improvement ("Something nice to have, that will make life easier for developers or users or both.")
Development

Successfully merging this pull request may close these issues.

Implement retry functionality in the Scheduler
4 participants