drivers: update ordering of events in StartTask to fix executor leak #24495

Open: 2 commits to be merged into main

Conversation

mismithhisler
Member

@mismithhisler mismithhisler commented Nov 19, 2024

Description

If an error occurs after the executor process is created but before the task is started, the executor must be explicitly stopped; otherwise it is left running. This change updates the logic in StartTask so that launching the task happens immediately after creating the executor.
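
As a rough illustration of why the ordering matters, here is a minimal sketch and not the actual driver code. The names `executor`, `validateCaps`, `startLeaky`, and `startOrdered` are hypothetical, and the capability check stands in for any step that can fail before launch.

```go
package main

import "fmt"

// executor is a hypothetical stand-in for the executor process.
type executor struct{ running bool }

// validateCaps is a hypothetical pre-launch check: it fails if any requested
// capability is missing from the allowed list.
func validateCaps(requested, allowed []string) error {
	for _, c := range requested {
		found := false
		for _, a := range allowed {
			if a == c {
				found = true
				break
			}
		}
		if !found {
			return fmt.Errorf("capability %q is not allowed", c)
		}
	}
	return nil
}

// startLeaky creates the executor first and validates afterwards, so a
// validation error is returned while the executor is still running.
// The executor is returned only so the caller can observe the leak.
func startLeaky(requested, allowed []string) (*executor, error) {
	exec := &executor{running: true}
	if err := validateCaps(requested, allowed); err != nil {
		return exec, err
	}
	return exec, nil
}

// startOrdered runs the fallible step first, so the executor is created only
// when the launch can follow immediately.
func startOrdered(requested, allowed []string) (*executor, error) {
	if err := validateCaps(requested, allowed); err != nil {
		return nil, err
	}
	return &executor{running: true}, nil
}

func main() {
	requested := []string{"net_raw"}
	allowed := []string{"chown"} // assumed allow list, for illustration only

	leaked, err := startLeaky(requested, allowed)
	fmt.Println(err != nil, leaked.running)

	none, err := startOrdered(requested, allowed)
	fmt.Println(err != nil, none == nil)
}
```

In the first ordering the error path returns with a live executor; in the second there is nothing to clean up when the check fails.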

Testing & Reproduction steps

Run a job with a configuration that would fail after the executor process has started, but before the task is launched. For example, an exec task with cap_add = ["net_raw"]. After the job fails, the executor process will still be running.

Links

Fixes GH #11958

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad website documentation to reflect this. Refer to
    the website README for docs guidelines. Please also consider whether the
    change requires notes within the upgrade guide.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.

@mismithhisler mismithhisler marked this pull request as draft November 19, 2024 20:56
@mismithhisler
Member Author

mismithhisler commented Nov 19, 2024

I wonder if using named return values would be a better way to prevent a regression. With named return values, we can defer a function that checks whether err != nil and, if so, kills the plugin client. This way we don't have to worry about accidentally leaving the executor process running in any code paths added in the future.

We could also write a single test that forces any error after the executor is created, and that should suffice to catch this bug in most other error scenarios.
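
A minimal sketch of the named-return idea, assuming a hypothetical `fakeExecutor` type with a `Kill` method in place of the real plugin client; `startTask` and `failLaunch` are illustrative names, not the driver's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeExecutor is a hypothetical stand-in for the executor plugin client.
type fakeExecutor struct {
	killed bool
}

// Kill records that the executor process was stopped.
func (e *fakeExecutor) Kill() { e.killed = true }

// startTask mimics the shape of StartTask. Because err is a named return
// value, the deferred function sees whichever error is finally returned.
func startTask(exec *fakeExecutor, failLaunch bool) (err error) {
	// The executor exists from here on, so every error path must stop it.
	defer func() {
		if err != nil {
			exec.Kill()
		}
	}()

	if failLaunch {
		return errors.New("failed to launch task")
	}
	return nil
}

func main() {
	failed := &fakeExecutor{}
	fmt.Println(startTask(failed, true), failed.killed)

	succeeded := &fakeExecutor{}
	fmt.Println(startTask(succeeded, false), succeeded.killed)
}
```

The cleanup lives in one place, so an error return added later is covered without any extra code, and a single test that forces an error after this point exercises it.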

@jrasell
Member

jrasell commented Nov 20, 2024

> I wonder if using named return values would be a better way to prevent a regression. With named return values, we can defer a function that checks if err != nil, and kill the plugin client. This way we don't have to worry about accidentally leaving the executor process running in any added codepaths in the future.
>
> We could also write a single test that forces any error after the executor is created, and that should suffice to catch this bug in most other error scenarios.

If you think it's a solution worth exploring, then I would encourage you to look into it. The testing aspect in particular would be very useful, in my opinion. The current code in this PR looks like it would fix the problem, so we can always fall back to it if needed.

@mismithhisler mismithhisler marked this pull request as ready for review November 20, 2024 17:26
@mismithhisler mismithhisler added the backport/ent/1.7.x+ent, backport/ent/1.8.x+ent, and backport/1.9.x labels Nov 20, 2024
Member

@jrasell jrasell left a comment

LGTM and thanks @mismithhisler!

I tested this on a Linux workstation with the details supplied and it was easy to witness the fix.

I noticed the added test takes 12s to run, which seems a little long. It might be worth a follow-up to see if this can be lowered. Interestingly, the NoOrphanedTasks test is also long at 27s, so the long runtime may be something this kind of test needs.

Development

Successfully merging this pull request may close these issues.

exec driver leaks executor process after StartTask error
2 participants