Add stop_long_running_jobs functionality to Scheduler #6865
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #6865      +/-   ##
==========================================
+ Coverage   83.99%   84.01%   +0.02%
==========================================
  Files         368      368
  Lines       21656    21683      +27
  Branches      948      948
==========================================
+ Hits        18190    18218      +28
+ Misses       3172     3171       -1
  Partials      294      294

View full report in Codecov by Sentry.
Force-pushed from 7fb6349 to 513c3e3.
Looks pretty good, but the functionality still feels a bit odd to me. Also, regarding performance, we should try to avoid sleep loops if we can help it.
            ):
                task.cancel()
                await task
        await asyncio.sleep(0.1)
Perhaps this whole mechanism should be handled by `_process_event_queue()`? This task loops through all realisations multiple times, every 0.1 seconds. The only time something can happen is when a job starts or finishes (either failure or success). Otherwise, if no job has finished since the last loop, all jobs' `running_duration` have increased by 0.1, but `average_runtime` stays the same. So once `len(completed_jobs) >= minimum_required_realizations` is `True`, ERT will keep on running until the jobs get 90% done and then kill them, checking every 0.1 s even though the time when we should start killing jobs is entirely predictable.

Instead, `_process_event_queue()` could contain this logic, executed only when a job starts or stops. That would mean no more checking every 100 ms, which is nice for performance.
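To illustrate the event-driven alternative: a minimal sketch, assuming hypothetical names (`long_running_factor`, `average_runtime`, `schedule_kill`) that are not taken from the actual ERT code. The idea is to arm a one-shot timer per job and re-arm only on start/finish events:

```python
import asyncio
import time


class Scheduler:
    """Hypothetical sketch, not the actual ERT scheduler."""

    def __init__(self, long_running_factor: float = 1.25) -> None:
        self.long_running_factor = long_running_factor
        self.average_runtime = 0.0  # updated by the completion-event handler
        self._start_times: dict[int, float] = {}
        self._kill_handles: dict[int, asyncio.TimerHandle] = {}

    def schedule_kill(self, iens: int) -> None:
        # The instant a job becomes "too long" is entirely predictable:
        # start_time + factor * average_runtime. Arm a one-shot timer for
        # that instant instead of re-checking every 100 ms.
        loop = asyncio.get_running_loop()
        deadline = (
            self._start_times[iens]
            + self.long_running_factor * self.average_runtime
        )
        self._kill_handles[iens] = loop.call_later(
            max(0.0, deadline - time.time()),
            lambda: asyncio.create_task(self._kill(iens)),
        )

    def on_job_finished(self) -> None:
        # average_runtime has changed, so re-arm the timers; this runs only
        # on job events, never on a fixed interval.
        for handle in self._kill_handles.values():
            handle.cancel()
        for iens in list(self._start_times):
            self.schedule_kill(iens)

    async def _kill(self, iens: int) -> None:
        ...  # cancel the job's task and await it
```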
I might need to rethink this a bit.
        return True if result is None else bool(result)
    return True

-    async def _kill(self, *args):
+    async def _kill(self, iens, *args):
`_kill` takes no more arguments.
Suggested change:
-    async def _kill(self, iens, *args):
+    async def _kill(self, iens):
Force-pushed from 6bb7106 to 89622d6.
I still don't like that we are checking every 100 ms when I think we can do better, but having basic functionality trumps this sort of performance consideration at this time.
@@ -67,6 +68,8 @@ def __init__(self, scheduler: Scheduler, real: Realization) -> None:
         self._scheduler: Scheduler = scheduler
         self._callback_status_msg: str = ""
         self._requested_max_submit: Optional[int] = None
+        self._start_time: Optional[float] = None
Why not `datetime.datetime`?
maybe faster
It all boils down to floats anyway, but to be honest I tried `datetime` first and got a mypy error at some point.
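For reference, a minimal sketch of the float-based timing; `running_duration` follows the naming in this thread, but the class itself is illustrative, not the PR's code:

```python
import time
from typing import Optional


class Job:
    """Hypothetical sketch of float-based job timing."""

    def __init__(self) -> None:
        # time.time() returns a plain float, which sidesteps the mypy
        # friction mentioned above and keeps duration arithmetic trivial.
        self._start_time: Optional[float] = None
        self._end_time: Optional[float] = None

    def start(self) -> None:
        self._start_time = time.time()

    def finish(self) -> None:
        self._end_time = time.time()

    @property
    def running_duration(self) -> float:
        # Seconds since start; falls back to "now" while still running.
        if self._start_time is None:
            return 0.0
        end = self._end_time if self._end_time is not None else time.time()
        return end - self._start_time
```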
src/ert/scheduler/scheduler.py (outdated):
    pass

async def _update_avg_job_runtime(self) -> None:
    while True:
        job_id = await self.completed_jobs.get()
Use `iens` instead of `job_id` for consistency.
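With that rename applied, the consumer task might look roughly like this; the running-average bookkeeping and the surrounding attributes are assumptions for illustration, not the PR's exact code:

```python
import asyncio


class Scheduler:
    """Fragment sketch; the attributes here are illustrative assumptions."""

    def __init__(self) -> None:
        self.completed_jobs: asyncio.Queue[int] = asyncio.Queue()
        self._job_runtimes: dict[int, float] = {}
        self._average_job_runtime = 0.0
        self._n_completed = 0

    async def _update_avg_job_runtime(self) -> None:
        # Consume finished realisations and maintain an incremental running
        # average of their runtimes; iens (realisation index) replaces job_id.
        while True:
            iens = await self.completed_jobs.get()
            runtime = self._job_runtimes[iens]
            self._n_completed += 1
            self._average_job_runtime += (
                runtime - self._average_job_runtime
            ) / self._n_completed
```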
async def wait(iens):
    # all jobs with iens > 5 will sleep for 10 seconds and should be killed
    if iens < 6:
        await asyncio.sleep(0.5)
Will 0.1 also work and give faster test execution?
Will try. Update: yes, will do 0.1 👍
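With that change, the test helper would look something like this (the 10-second branch for long-running jobs is inferred from the comment in the diff above):

```python
import asyncio


async def wait(iens):
    # Jobs with iens > 5 sleep long enough to exceed the kill threshold
    # and should be killed; the rest finish quickly.
    if iens < 6:
        await asyncio.sleep(0.1)
    else:
        await asyncio.sleep(10)
```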
This adds two tasks to the Scheduler: 1) processing finished jobs and computing the running-average runtime, and 2) checking that the duration of still-running jobs is below the threshold, and killing those jobs otherwise.
Issue
Resolves #6710
Approach
This adds a task which detects and stops long-running jobs; a rough sketch of its shape is below.
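A rough sketch of the polling approach, using names from the review thread (`running_duration`, `minimum_required_realizations`); the threshold factor is an assumed placeholder, not the PR's actual value:

```python
import asyncio


class Scheduler:
    """Fragment sketch of the polling approach; attributes are assumed."""

    async def _stop_long_running_jobs(
        self, minimum_required_realizations: int
    ) -> None:
        long_running_factor = 1.25  # assumed placeholder threshold
        while True:
            if self._n_completed >= minimum_required_realizations:
                for iens, job in self._jobs.items():
                    too_long = (
                        job.running_duration
                        > long_running_factor * self._average_job_runtime
                    )
                    if too_long:
                        await self._kill(iens)
            await asyncio.sleep(0.1)
```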
(Screenshot of new behavior in GUI if applicable)
Pre review checklist

… (Ground Rules), and changes to existing code have good test coverage.

Pre merge checklist