
Add stop_long_running_jobs functionality to Scheduler #6865

Merged — 1 commit merged into equinor:main from long_running on Jan 3, 2024

Conversation

@xjules xjules (Contributor) commented Dec 29, 2023

Issue
Resolves #6710

Approach
This adds a task that handles long-running jobs; a rough sketch of the mechanism is shown below.
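
In sketch form, the idea looks roughly like this. This is an illustration only: the class name `LongRunningJobWatcher`, the `LONG_RUNNING_FACTOR` constant, and the `kill` callback are assumptions for readability, not the identifiers used in the actual diff.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, List


class LongRunningJobWatcher:
    """Illustrative sketch only; the real change is integrated into the Scheduler."""

    LONG_RUNNING_FACTOR = 1.25  # assumed multiplier over the average runtime

    def __init__(self, minimum_required_realizations: int) -> None:
        self._minimum_required_realizations = minimum_required_realizations
        self._completed_runtimes: List[float] = []
        self._start_times: Dict[int, float] = {}  # iens -> start timestamp

    def job_started(self, iens: int) -> None:
        self._start_times[iens] = time.time()

    def job_finished(self, iens: int) -> None:
        started = self._start_times.pop(iens, None)
        if started is not None:
            self._completed_runtimes.append(time.time() - started)

    async def stop_long_running_jobs(
        self, kill: Callable[[int], Awaitable[None]]
    ) -> None:
        # Periodically compare each running job's duration to the average
        # runtime of the completed jobs, once enough realizations have finished.
        while True:
            if len(self._completed_runtimes) >= self._minimum_required_realizations:
                average = sum(self._completed_runtimes) / len(self._completed_runtimes)
                now = time.time()
                for iens, started in list(self._start_times.items()):
                    if now - started > self.LONG_RUNNING_FACTOR * average:
                        await kill(iens)  # the actual kill is delegated to the driver
            await asyncio.sleep(0.1)
```

In the actual change this logic is wired into the existing Scheduler rather than a separate helper class.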


Pre review checklist

  • Read through the code changes carefully after finishing work
  • Make sure tests pass locally (after every commit!)
  • Prepare changes in small commits for more convenient review (optional)
  • PR title captures the intent of the changes, and is fitting for release notes.
  • Updated documentation
  • Ensured that unit tests are added for all new behavior (see Ground Rules), and changes to existing code have good test coverage.

Pre merge checklist

  • Added appropriate release note label
  • Commit history is consistent and clean, in line with the contribution guidelines.

@xjules xjules self-assigned this Dec 29, 2023
@codecov-commenter commented Dec 29, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison: base (fc7c5ba) 83.99% vs. head (89622d6) 84.01%.
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6865      +/-   ##
==========================================
+ Coverage   83.99%   84.01%   +0.02%     
==========================================
  Files         368      368              
  Lines       21656    21683      +27     
  Branches      948      948              
==========================================
+ Hits        18190    18218      +28     
+ Misses       3172     3171       -1     
  Partials      294      294              


@xjules xjules force-pushed the long_running branch 5 times, most recently from 7fb6349 to 513c3e3 Compare January 2, 2024 14:12
@xjules xjules marked this pull request as ready for review January 2, 2024 14:12
@xjules xjules added the release-notes:improvement, scheduler, and release-notes:skip labels and removed the release-notes:improvement label Jan 2, 2024
@pinkwah pinkwah (Contributor) left a comment


Looks pretty good, but the functionality still feels a bit odd to me. Also, regarding performance, we should try to avoid sleep loops if we can help it.

Resolved review threads: src/ert/scheduler/job.py, src/ert/scheduler/job.py (outdated), src/ert/scheduler/scheduler.py (outdated)
            ):
                task.cancel()
                await task
            await asyncio.sleep(0.1)
A contributor commented:

Perhaps this whole mechanism should be handled by _process_event_queue()? This task loops through all realisations multiple times, every 0.1 seconds.

The only time something can happen is when a job starts or finishes (either failing or succeeding). Otherwise, if no job has finished since the last loop, every job's running_duration has increased by 0.1 while average_runtime stays the same. So once len(completed_jobs) >= minimum_required_realizations is True, ERT keeps running until the jobs are 90% done and then kills them, and it keeps checking every 0.1 s even though the time at which we should start killing jobs is entirely predictable.

Instead, _process_event_queue() could contain this logic and execute it only when a job starts or stops. That would mean no more checking every 100 ms, which is nice for performance.
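
Something along these lines, as a rough sketch (the names here are made up for illustration; this is not the actual _process_event_queue() code):

```python
import asyncio
from typing import Callable, Dict


class KillTimers:
    """Sketch: arm one-shot timers per running job instead of polling every 100 ms."""

    def __init__(self, long_running_factor: float, kill: Callable[[int], None]) -> None:
        self._factor = long_running_factor
        self._kill = kill
        self._timers: Dict[int, asyncio.TimerHandle] = {}

    def rearm(self, running: Dict[int, float], average_runtime: float) -> None:
        # Call this only when a job starts or finishes, i.e. the only moments
        # at which average_runtime or the set of running jobs can change.
        # `running` maps iens to the loop.time() timestamp at which it started.
        loop = asyncio.get_running_loop()
        for handle in self._timers.values():
            handle.cancel()
        self._timers.clear()
        offset = self._factor * average_runtime
        for iens, started_at in running.items():
            delay = max(0.0, started_at + offset - loop.time())
            self._timers[iens] = loop.call_later(delay, self._kill, iens)
```

The kill moment is then computed once per event instead of being rediscovered every 100 ms.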

@xjules xjules (Contributor Author) commented Jan 3, 2024

I might need to rethink this a bit.

return True if result is None else bool(result)
return True

async def _kill(self, *args):
async def _kill(self, iens, *args):
A contributor commented:

_kill takes no more arguments

Suggested change
async def _kill(self, iens, *args):
async def _kill(self, iens):

@berland berland changed the title Add stop_long_running_jobs funcitonality to Scheduler Add stop_long_running_jobs functionality to Scheduler Jan 3, 2024
@xjules xjules force-pushed the long_running branch 4 times, most recently from 6bb7106 to 89622d6 Compare January 3, 2024 11:40
@pinkwah pinkwah (Contributor) left a comment

I still don't like how we are checking every 100ms when I think we can do better, but I think having basic functionality trumps this sort of performance consideration at this time.

@@ -67,6 +68,8 @@ def __init__(self, scheduler: Scheduler, real: Realization) -> None:
        self._scheduler: Scheduler = scheduler
        self._callback_status_msg: str = ""
        self._requested_max_submit: Optional[int] = None
        self._start_time: Optional[float] = None
A contributor commented:

why not datetime.datetime?

A contributor replied:

maybe faster

@xjules (Contributor Author) replied:

It all boils down to floats; to be honest, I tried datetime first but got a mypy error at some point.

        pass

    async def _update_avg_job_runtime(self) -> None:
        while True:
            job_id = await self.completed_jobs.get()
@berland berland (Contributor) commented Jan 3, 2024

use iens instead of job_id for consistency.

async def wait(iens):
    # all jobs with iens > 5 will sleep for 10 seconds and should be killed
    if iens < 6:
        await asyncio.sleep(0.5)
A contributor commented:

will 0.1 also work and give faster test execution?

@xjules xjules (Contributor Author) replied Jan 3, 2024

Will try. Update: yes, will do 0.1 👍

This adds two tasks to the scheduler: 1) processing the finished jobs and computing the running average of their runtimes, and 2) checking that the duration of still-running jobs is below the threshold and killing those jobs otherwise.
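
In sketch form, the two tasks could cooperate via a queue of finished realizations roughly as follows. The names here (completed_jobs, _job_runtimes, TwoTaskSketch) are assumptions for illustration, not necessarily the merged code.

```python
import asyncio
from typing import Dict


class TwoTaskSketch:
    """Illustration of the two cooperating scheduler tasks described in the commit."""

    def __init__(self) -> None:
        self.completed_jobs: "asyncio.Queue[int]" = asyncio.Queue()  # iens of finished jobs
        self._job_runtimes: Dict[int, float] = {}  # assumed: iens -> measured runtime
        self.average_runtime: float = 0.0
        self._completed_count: int = 0

    async def _update_avg_job_runtime(self) -> None:
        # Task 1: consume finished realizations and maintain a running average.
        while True:
            iens = await self.completed_jobs.get()
            self._completed_count += 1
            self.average_runtime += (
                self._job_runtimes[iens] - self.average_runtime
            ) / self._completed_count

    async def _stop_long_running_jobs(self, minimum_required_realizations: int) -> None:
        # Task 2: once enough realizations have completed, kill any job whose
        # running duration exceeds the threshold derived from average_runtime.
        while True:
            if self._completed_count >= minimum_required_realizations:
                ...  # compare running durations to the threshold and kill offenders
            await asyncio.sleep(0.1)
```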
@xjules xjules merged commit ba865c9 into equinor:main Jan 3, 2024
41 of 44 checks passed
@xjules xjules deleted the long_running branch January 3, 2024 14:33
Labels: release-notes:skip (If there should be no mention of this in release notes)
Projects: Archived in project
Linked issue (closed by this pull request): Implement stop_long_running_jobs in the Scheduler (#6710)
4 participants