Use async pid fd instead of blocking waitid to wait for a child process to exit #745

Open
wants to merge 2 commits into
base: main
from


jprendes
Collaborator

@jprendes jprendes commented Nov 22, 2024

This PR replaces the blocking waitid call with an async implementation based on pid fd.
There's a race condition where containerd-shim reaps child processes.
If the child process has already been reaped, query containerd-shim to get the process status.

Note that this race condition is already present in the current implementation, but runwasi's waitid is very likely to win the race; the introduction of async evens out the odds between runwasi and containerd-shim.
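
To make the fallback concrete, here is a minimal sketch of the "someone else reaped the child first" path. It is not the code in this PR: it uses a blocking `waitpid` declared by hand instead of a pid fd, assumes Linux with `sh` on the PATH, and `wait_exit_code`, `wait_with_fallback` and `demo` are hypothetical names. The `recorded_by_reaper` argument stands in for the status that would be queried from containerd-shim.

```rust
use std::io;
use std::process::Command;

// Declared by hand so the sketch needs no external crates (std already links libc).
unsafe extern "C" {
    fn waitpid(pid: i32, status: *mut i32, options: i32) -> i32;
}

const ECHILD: i32 = 10; // "no child processes" on Linux

/// Blocking wait that returns the exit code of a normally exited child.
fn wait_exit_code(pid: i32) -> io::Result<i32> {
    let mut status = 0;
    let ret = unsafe { waitpid(pid, &mut status, 0) };
    if ret < 0 {
        return Err(io::Error::last_os_error());
    }
    Ok((status >> 8) & 0xff) // same as WEXITSTATUS; assumes a normal exit
}

/// Wait for `pid`; if it was already reaped elsewhere (ECHILD),
/// fall back to the status recorded by whoever reaped it.
fn wait_with_fallback(pid: i32, recorded_by_reaper: Option<i32>) -> io::Result<i32> {
    match wait_exit_code(pid) {
        Err(e) if e.raw_os_error() == Some(ECHILD) => recorded_by_reaper.ok_or(e),
        other => other,
    }
}

/// Simulates the race: a "reaper" waits first, then our own wait hits ECHILD.
fn demo() -> io::Result<(i32, i32)> {
    let child = Command::new("sh").args(["-c", "exit 7"]).spawn()?;
    let pid = child.id() as i32;
    // Stand-in for the containerd-shim reaper winning the race.
    let reaped = wait_exit_code(pid)?;
    // Our wait now fails with ECHILD and uses the recorded status instead.
    let fallback = wait_with_fallback(pid, Some(reaped))?;
    Ok((reaped, fallback))
}
```

Calling `demo()` should yield the same exit code from both paths. If no status was recorded, the ECHILD error is returned unchanged.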

This PR is in preparation to move the whole shim implementation to async.

Member

@andreiltd andreiltd left a comment


LGTM!

Mossaka
Mossaka previously approved these changes Nov 26, 2024
Member

@Mossaka Mossaka left a comment


LGTM!

@jprendes
Collaborator Author

I still need to figure out why CI is failing. I think the http_poxy container is not receiving the SIGINT for some reason. But I don't see how that can relate to this. I'll keep digging tomorrow.

@jprendes jprendes force-pushed the async-shim-v3 branch 3 times, most recently from 29927c0 to 4a4f9b0 Compare January 23, 2025 11:02
@jprendes jprendes changed the title Use async pid fd instead of blocking waitid to wait for a child process to exit Use containerd-shim monitor to wait for child to exit Jan 23, 2025
@jprendes jprendes force-pushed the async-shim-v3 branch 2 times, most recently from f29e21c to 26d8ab0 Compare January 23, 2025 14:50
@jprendes jprendes changed the title Use containerd-shim monitor to wait for child to exit Use async pid fd instead of blocking waitid to wait for a child process to exit Jan 23, 2025
@jprendes jprendes force-pushed the async-shim-v3 branch 3 times, most recently from 92edd5b to e89619a Compare January 23, 2025 16:02
@jprendes jprendes requested review from Mossaka and andreiltd January 23, 2025 16:25
Member

@Mossaka Mossaka left a comment


Great! Thanks for working on this

What I'd like to see, perhaps as a follow-up, is some unit tests for edge cases like:

  1. What happens if the PID does not exist.
  2. What happens if the original PID is reused by a new process after the first one exits.
  3. What happens when wait() is called after containerd-shim has already reaped the process.

return Err(std::io::Error::last_os_error().into());
}
let fd = unsafe { OwnedFd::from_raw_fd(pidfd as RawFd) };
let subs = monitor_subscribe(Topic::Pid)?;
Member


Question: shouldn't we start the monitor subscription before the syscalls?
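
The ordering concern can be illustrated with a toy publish/subscribe monitor. This is only a sketch: `Monitor`, `subscribe` and `notify_exit` are hypothetical stand-ins built on `std::sync::mpsc`, not the containerd-shim monitor API, and it assumes events are delivered only to subscribers registered at the time of the exit.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical exit-event monitor; not containerd-shim's API.
#[derive(Default)]
struct Monitor {
    subscribers: Vec<Sender<(u32, i32)>>, // (pid, exit code)
}

impl Monitor {
    fn subscribe(&mut self) -> Receiver<(u32, i32)> {
        let (tx, rx) = channel();
        self.subscribers.push(tx);
        rx
    }

    /// Deliver an exit event to whoever is subscribed right now.
    fn notify_exit(&mut self, pid: u32, code: i32) {
        // Drop subscribers whose receiving end has gone away.
        self.subscribers.retain(|tx| tx.send((pid, code)).is_ok());
    }
}

/// Subscribing after the exit was published: the event is missed.
fn late_subscriber_sees_event() -> bool {
    let mut monitor = Monitor::default();
    monitor.notify_exit(42, 0);
    let rx = monitor.subscribe();
    rx.try_recv().is_ok()
}

/// Subscribing before the exit is published: the event is received.
fn early_subscriber_sees_event() -> bool {
    let mut monitor = Monitor::default();
    let rx = monitor.subscribe();
    monitor.notify_exit(42, 0);
    rx.try_recv() == Ok((42, 0))
}
```

Under that assumption, a child that exits between the syscalls and the subscription would never be reported to the late subscriber, which is why the subscription order matters.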

Err(Errno::ECHILD) => {
// The process has already been reaped by the containerd-shim reaper.
// Get the status from there.
let status = wait_pid(self.pid, self.subs);
Member


If containerd has already reaped the process, will this block the async runtime forever?

If so, I would suggest running this in a spawn_blocking call. I am not entirely sure whether containerd-shim handles the subscription, or how reliable that is.
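
As a rough illustration of the idea, here is a std-only sketch that moves a blocking wait onto its own thread and bounds it with a timeout. It is an analogue of what tokio's `spawn_blocking` provides, not the tokio API itself. `run_blocking` and `wait_with_timeout` are hypothetical helpers, and the timeout is an extra assumption that goes beyond the suggestion above.

```rust
use std::sync::mpsc::{channel, Receiver, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Run a blocking closure on a dedicated thread and return a channel for its result.
fn run_blocking<T, F>(f: F) -> Receiver<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = channel();
    thread::spawn(move || {
        // The caller may have stopped listening; ignore a failed send.
        let _ = tx.send(f());
    });
    rx
}

/// Wait for the result, giving up after `timeout` instead of blocking forever.
fn wait_with_timeout<T>(rx: &Receiver<T>, timeout: Duration) -> Option<T> {
    match rx.recv_timeout(timeout) {
        Ok(value) => Some(value),
        // Timed out, or the worker thread died without sending a result.
        Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => None,
    }
}
```

With this shape, a wait that never completes costs one parked thread rather than stalling the caller, and the caller can decide what to report when `None` comes back.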

@Mossaka
Member

Mossaka commented Jan 27, 2025

Note that this race condition is already present in the current implementation, but runwasi's waitid is very likely to win the race; the introduction of async evens out the odds between runwasi and containerd-shim.

Just to clarify, in this PR, you did not resolve this issue, right? If so, could you please open an issue against the repo so that we can track it?

3 participants