
Add stop_long_running_jobs functionality to Scheduler #6865

Merged — 1 commit merged into equinor:main from long_running on Jan 3, 2024

Conversation

@xjules xjules (Contributor) commented Dec 29, 2023

Issue
Resolves #6710

Approach
This adds a task that handles long-running jobs; a rough sketch of the mechanism is shown below.
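
In sketch form, the idea looks roughly like this. This is an illustration only: the class name `LongRunningJobWatcher`, the `LONG_RUNNING_FACTOR` constant, and the `kill` callback are assumptions for readability, not the identifiers used in the actual diff.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, List


class LongRunningJobWatcher:
    """Illustrative sketch only; the real change is integrated into the Scheduler."""

    LONG_RUNNING_FACTOR = 1.25  # assumed multiplier over the average runtime

    def __init__(self, minimum_required_realizations: int) -> None:
        self._minimum_required_realizations = minimum_required_realizations
        self._completed_runtimes: List[float] = []
        self._start_times: Dict[int, float] = {}  # iens -> start timestamp

    def job_started(self, iens: int) -> None:
        self._start_times[iens] = time.time()

    def job_finished(self, iens: int) -> None:
        started = self._start_times.pop(iens, None)
        if started is not None:
            self._completed_runtimes.append(time.time() - started)

    async def stop_long_running_jobs(
        self, kill: Callable[[int], Awaitable[None]]
    ) -> None:
        # Periodically compare each running job's duration to the average
        # runtime of the completed jobs, once enough realizations have finished.
        while True:
            if len(self._completed_runtimes) >= self._minimum_required_realizations:
                average = sum(self._completed_runtimes) / len(self._completed_runtimes)
                now = time.time()
                for iens, started in list(self._start_times.items()):
                    if now - started > self.LONG_RUNNING_FACTOR * average:
                        await kill(iens)  # the actual kill is delegated to the driver
            await asyncio.sleep(0.1)
```

In the actual change this logic is wired into the existing Scheduler rather than a separate helper class.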


Pre review checklist

  • Read through the code changes carefully after finishing work
  • Make sure tests pass locally (after every commit!)
  • Prepare changes in small commits for more convenient review (optional)
  • PR title captures the intent of the changes, and is fitting for release notes.
  • Updated documentation
  • Ensured that unit tests are added for all new behavior (see Ground Rules), and changes to existing code have good test coverage.

Pre merge checklist

  • Added appropriate release note label
  • Commit history is consistent and clean, in line with the contribution guidelines.

@xjules xjules self-assigned this Dec 29, 2023
@codecov-commenter commented Dec 29, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison: base (fc7c5ba) 83.99% vs. head (89622d6) 84.01%.
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6865      +/-   ##
==========================================
+ Coverage   83.99%   84.01%   +0.02%     
==========================================
  Files         368      368              
  Lines       21656    21683      +27     
  Branches      948      948              
==========================================
+ Hits        18190    18218      +28     
+ Misses       3172     3171       -1     
  Partials      294      294              


@xjules xjules force-pushed the long_running branch 5 times, most recently from 7fb6349 to 513c3e3 Compare January 2, 2024 14:12
@xjules xjules marked this pull request as ready for review January 2, 2024 14:12
@xjules xjules added the release-notes:improvement, scheduler, and release-notes:skip labels and removed the release-notes:improvement label Jan 2, 2024
@pinkwah pinkwah (Contributor) left a comment


Looks pretty good, but the functionality still feels a bit odd to me. Also, regarding performance, we should try to avoid sleep loops if we can help it.

Resolved review threads: src/ert/scheduler/job.py, src/ert/scheduler/job.py (outdated), src/ert/scheduler/scheduler.py (outdated)
            ):
                task.cancel()
                await task
            await asyncio.sleep(0.1)
A contributor commented:

Perhaps this whole mechanism should be handled by _process_event_queue()? This task loops through all realisations multiple times, every 0.1 seconds.

The only time something can happen is when a job starts or finishes (either failing or succeeding). Otherwise, if no job has finished since the last loop, every job's running_duration has increased by 0.1 while average_runtime stays the same. So once len(completed_jobs) >= minimum_required_realizations is True, ERT keeps running until the jobs are 90% done and then kills them, and it keeps checking every 0.1 s even though the time at which we should start killing jobs is entirely predictable.

Instead, _process_event_queue() could contain this logic and execute it only when a job starts or stops. That would mean no more checking every 100 ms, which is nice for performance.
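
Something along these lines, as a rough sketch (the names here are made up for illustration; this is not the actual _process_event_queue() code):

```python
import asyncio
from typing import Callable, Dict


class KillTimers:
    """Sketch: arm one-shot timers per running job instead of polling every 100 ms."""

    def __init__(self, long_running_factor: float, kill: Callable[[int], None]) -> None:
        self._factor = long_running_factor
        self._kill = kill
        self._timers: Dict[int, asyncio.TimerHandle] = {}

    def rearm(self, running: Dict[int, float], average_runtime: float) -> None:
        # Call this only when a job starts or finishes, i.e. the only moments
        # at which average_runtime or the set of running jobs can change.
        # `running` maps iens to the loop.time() timestamp at which it started.
        loop = asyncio.get_running_loop()
        for handle in self._timers.values():
            handle.cancel()
        self._timers.clear()
        offset = self._factor * average_runtime
        for iens, started_at in running.items():
            delay = max(0.0, started_at + offset - loop.time())
            self._timers[iens] = loop.call_later(delay, self._kill, iens)
```

The kill moment is then computed once per event instead of being rediscovered every 100 ms.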

@xjules xjules (Contributor Author) commented Jan 3, 2024

I might need to rethink this a bit.

return True if result is None else bool(result)
return True

async def _kill(self, *args):
async def _kill(self, iens, *args):
A contributor commented:

_kill takes no more arguments

Suggested change
async def _kill(self, iens, *args):
async def _kill(self, iens):

@berland berland changed the title Add stop_long_running_jobs funcitonality to Scheduler Add stop_long_running_jobs functionality to Scheduler Jan 3, 2024
@xjules xjules force-pushed the long_running branch 4 times, most recently from 6bb7106 to 89622d6 Compare January 3, 2024 11:40
@pinkwah pinkwah (Contributor) left a comment

I still don't like how we are checking every 100ms when I think we can do better, but I think having basic functionality trumps this sort of performance consideration at this time.

@@ -67,6 +68,8 @@ def __init__(self, scheduler: Scheduler, real: Realization) -> None:
        self._scheduler: Scheduler = scheduler
        self._callback_status_msg: str = ""
        self._requested_max_submit: Optional[int] = None
        self._start_time: Optional[float] = None
A contributor commented:

why not datetime.datetime?

A contributor replied:

maybe faster

@xjules (Contributor Author) replied:

It all boils down to floats; to be honest, I tried datetime first but got a mypy error at some point.

        pass

    async def _update_avg_job_runtime(self) -> None:
        while True:
            job_id = await self.completed_jobs.get()
@berland berland (Contributor) commented Jan 3, 2024

use iens instead of job_id for consistency.

async def wait(iens):
    # all jobs with iens > 5 will sleep for 10 seconds and should be killed
    if iens < 6:
        await asyncio.sleep(0.5)
A contributor commented:

will 0.1 also work and give faster test execution?

@xjules xjules (Contributor Author) replied Jan 3, 2024

Will try. Update: yes, will do 0.1 👍

This adds two tasks to the scheduler: 1) processing the finished jobs and computing the running average of their runtimes, and 2) checking that the duration of still-running jobs is below the threshold and killing those jobs otherwise.
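
In sketch form, the two tasks could cooperate via a queue of finished realizations roughly as follows. The names here (completed_jobs, _job_runtimes, TwoTaskSketch) are assumptions for illustration, not necessarily the merged code.

```python
import asyncio
from typing import Dict


class TwoTaskSketch:
    """Illustration of the two cooperating scheduler tasks described in the commit."""

    def __init__(self) -> None:
        self.completed_jobs: "asyncio.Queue[int]" = asyncio.Queue()  # iens of finished jobs
        self._job_runtimes: Dict[int, float] = {}  # assumed: iens -> measured runtime
        self.average_runtime: float = 0.0
        self._completed_count: int = 0

    async def _update_avg_job_runtime(self) -> None:
        # Task 1: consume finished realizations and maintain a running average.
        while True:
            iens = await self.completed_jobs.get()
            self._completed_count += 1
            self.average_runtime += (
                self._job_runtimes[iens] - self.average_runtime
            ) / self._completed_count

    async def _stop_long_running_jobs(self, minimum_required_realizations: int) -> None:
        # Task 2: once enough realizations have completed, kill any job whose
        # running duration exceeds the threshold derived from average_runtime.
        while True:
            if self._completed_count >= minimum_required_realizations:
                ...  # compare running durations to the threshold and kill offenders
            await asyncio.sleep(0.1)
```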
@xjules xjules merged commit ba865c9 into equinor:main Jan 3, 2024
41 of 44 checks passed
@xjules xjules deleted the long_running branch January 3, 2024 14:33
Labels: release-notes:skip (If there should be no mention of this in release notes)
Projects: Archived in project
Linked issue (closed by this pull request): Implement stop_long_running_jobs in the Scheduler (#6710)
4 participants