[feature request] *: introduce pidfd-socket flag #4045
The king of involution, is writing code during holidays also a form of rest? 😄
Could you please explain why? And why can't we use ‘runc kill’ or the runc libcontainer API directly?
I seem to remember that containerd has used kill(2) for many years.
runc-kill is used to send a signal to the container init process. As far as I know, there is no runc command to send a signal to an exec init process.
Yes, because pidfd has only been available since the v5.3 kernel. A pidfd ensures that we send the signal to the correct process, which matters especially because the exec probe has a timeout.
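To illustrate the point about signalling the correct process, here is a minimal sketch, not taken from this PR, of sending a signal through a pidfd in Go. It assumes Linux on amd64 or arm64 (where `pidfd_open` is syscall 434 and `pidfd_send_signal` is 424) and calls the raw syscalls only to stay inside the standard library; `golang.org/x/sys/unix` offers `PidfdOpen` and `PidfdSendSignal` wrappers. The helper names and the `sleep` child, which stands in for an exec'd process, are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"syscall"
)

// Raw syscall numbers (assumption: amd64 or arm64 Linux).
const (
	sysPidfdSendSignal = 424 // pidfd_send_signal(2), Linux >= 5.1
	sysPidfdOpen       = 434 // pidfd_open(2), Linux >= 5.3
)

// pidfdOpen returns a file descriptor that refers to the process pid.
func pidfdOpen(pid int) (int, error) {
	fd, _, errno := syscall.Syscall(sysPidfdOpen, uintptr(pid), 0, 0)
	if errno != 0 {
		return -1, errno
	}
	return int(fd), nil
}

// pidfdSendSignal delivers sig to the process that pidfd refers to.
func pidfdSendSignal(pidfd int, sig syscall.Signal) error {
	_, _, errno := syscall.Syscall6(sysPidfdSendSignal, uintptr(pidfd), uintptr(sig), 0, 0, 0, 0)
	if errno != 0 {
		return errno
	}
	return nil
}

// unsupported reports whether err means the pidfd syscalls are missing or blocked.
func unsupported(err error) bool {
	return errors.Is(err, syscall.ENOSYS) || errors.Is(err, syscall.EPERM)
}

func run() error {
	cmd := exec.Command("sleep", "60") // stand-in for an exec'd process
	if err := cmd.Start(); err != nil {
		return err
	}
	pidfd, err := pidfdOpen(cmd.Process.Pid)
	if err != nil {
		cmd.Process.Kill()
		cmd.Wait()
		return fmt.Errorf("pidfd_open: %w", err)
	}
	defer syscall.Close(pidfd)

	if err := pidfdSendSignal(pidfd, syscall.SIGKILL); err != nil {
		cmd.Process.Kill()
		cmd.Wait()
		return fmt.Errorf("pidfd_send_signal: %w", err)
	}
	cmd.Wait() // reap the child; from here on the kernel may reuse its PID

	// The pidfd still refers to the reaped process, so a late signal fails
	// with ESRCH instead of hitting whatever process now owns the PID.
	if err := pidfdSendSignal(pidfd, syscall.SIGKILL); !errors.Is(err, syscall.ESRCH) {
		return fmt.Errorf("expected ESRCH after reap, got %v", err)
	}
	return nil
}

func main() {
	switch err := run(); {
	case err == nil:
		fmt.Println("signal delivered via pidfd; stale pidfd rejected with ESRCH")
	case unsupported(err):
		fmt.Println("pidfd syscalls unavailable here:", err)
	default:
		fmt.Println("error:", err)
	}
}
```

With plain kill(2), the second signal could reach an unrelated process that had inherited the PID; with the pidfd it can only fail.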
ping @AkihiroSuda @thaJeztah
also ping @lifubang @cyphar @kolyshkin
I think this is a nice feature, because for many years I have hated the big for loop in containerd that finds out whether an exit signal comes from the init process. Just one question: I think we may be able to simplify the implementation, though I don't know whether my solution would work: Lines 270 to 278 in ee45b9b
Basically, yes. The runc-{create/exec/run} process is still the parent of the init process before it exits. We should check the status of the process by
It seems like it would've been a good idea to make FWIW, I don't like adding features to runc's command-line if we can avoid it -- it makes life harder for other OCI runtimes because we are creating non-standard behaviour that everyone has to copy from us in order to work with runtimes that depend on it. I made this mistake with But then again, I don't see another way of solving it, other than re-architecting runc... Hmmm...
I don't think adding features to runc's command line is a problem: even if we accepted a pidfd solution without a command-line flag, other OCI runtimes would still have to support it the same way runc does. What I mean is: can we solve this problem independently on the containerd side? @fuweid For example, once containerd has fetched the init process's pid, how about getting the pidfd on the containerd side?
Thanks for the comment! @cyphar
Understood. Currently, no spec describes what the command line looks like. For example, the standard init process has two steps to set up:
Just wondering what re-architecting runc would look like. If it's not related to a spec or standard, I think we would still have the problem of aligning all the runtime implementations. Any new feature could introduce a new flag.
Totally agree. I was thinking about introducing
Hi @lifubang
It requires the sub-reaper setting. The idea comes from refactoring the containerd-shim process manager. I think it's useful for the non-sub-reaper use case as well.
ping @opencontainers/runc-maintainers ~
Sorry to ping @cyphar @kolyshkin @AkihiroSuda @thaJeztah @lifubang again. Any thoughts on this pull request? Thanks
cli.StringFlag{
	Name:  "pidfd-socket",
	Usage: "path to an AF_UNIX socket which will receive a file descriptor referencing the init process",
},
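For readers unfamiliar with passing descriptors over an AF_UNIX socket, the receiving side of such a flag might look roughly like the sketch below. It assumes, by analogy with console-socket and without checking this PR's implementation, that the runtime dials the socket path and sends the descriptor in an SCM_RIGHTS control message. `recvFd`, `sendFd`, and `roundTrip` are hypothetical names, and a pipe stands in for the pidfd so the sketch also runs on kernels without pidfd support.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
	"syscall"
)

// recvFd reads one message from conn and returns the first file descriptor
// carried in its SCM_RIGHTS control message.
func recvFd(conn *net.UnixConn) (int, error) {
	buf := make([]byte, 64)                   // regular payload, ignored here
	oob := make([]byte, syscall.CmsgSpace(4)) // room for one 32-bit descriptor
	_, oobn, _, _, err := conn.ReadMsgUnix(buf, oob)
	if err != nil {
		return -1, err
	}
	msgs, err := syscall.ParseSocketControlMessage(oob[:oobn])
	if err != nil {
		return -1, err
	}
	for _, m := range msgs {
		if fds, err := syscall.ParseUnixRights(&m); err == nil && len(fds) > 0 {
			return fds[0], nil
		}
	}
	return -1, fmt.Errorf("no file descriptor in message")
}

// sendFd plays the role assumed for the runtime: dial the socket and pass fd.
func sendFd(path string, fd int) error {
	conn, err := net.DialUnix("unix", nil, &net.UnixAddr{Name: path, Net: "unix"})
	if err != nil {
		return err
	}
	defer conn.Close()
	_, _, err = conn.WriteMsgUnix([]byte("pidfd"), syscall.UnixRights(fd), nil)
	return err
}

// roundTrip passes the write end of a pipe through the socket, writes through
// the received descriptor, and returns what arrives on the read end.
func roundTrip() (string, error) {
	dir, err := os.MkdirTemp("", "pidfd-socket")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "pidfd.sock")

	ln, err := net.ListenUnix("unix", &net.UnixAddr{Name: path, Net: "unix"})
	if err != nil {
		return "", err
	}
	defer ln.Close()

	r, w, err := os.Pipe()
	if err != nil {
		return "", err
	}
	defer r.Close()
	defer w.Close()

	errc := make(chan error, 1)
	go func() { errc <- sendFd(path, int(w.Fd())) }()

	conn, err := ln.AcceptUnix()
	if err != nil {
		return "", err
	}
	defer conn.Close()
	fd, err := recvFd(conn)
	if err != nil {
		return "", err
	}
	if err := <-errc; err != nil {
		return "", err
	}

	received := os.NewFile(uintptr(fd), "received")
	defer received.Close()
	if _, err := received.Write([]byte("ok")); err != nil {
		return "", err
	}
	out := make([]byte, 2)
	n, err := r.Read(out)
	return string(out[:n]), err
}

func main() {
	fmt.Println(roundTrip())
}
```

The received descriptor is a new number in the receiver's table that refers to the same open file, which is what lets a manager hold a pidfd created by another process.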
Needs a test
Added. cc @lifubang
CI is green now. @lifubang @AkihiroSuda PTAL thanks
LGTM
Please add a changelog entry in your PR description.
@cyphar @kolyshkin @thaJeztah PTAL. If there are no objections, I will merge it next week.
A container manager such as containerd-shim can't use the cgroup.kill feature or freeze all the processes in the cgroup to terminate the exec init process. It's unsafe to call kill(2), since the pid can be recycled. It's good to provide the pidfd of the init process through the pidfd-socket, which is similar to the console-socket. With the pidfd, a container manager such as containerd-shim can send a signal to the target process safely.

And for the standard init process, we can use polling to get the exit event instead of blocking on wait4.

Signed-off-by: Wei Fu <[email protected]>
Let me explain why containerd-shim needs this feature for the container init process.

Without a pidfd, containerd-shim can't tell which process has exited, so it has to reap all the zombies. However, that requires every fork/exec operation to go through the reap-event framework in containerd. For example, the `mount` Go package needs to fork a child process that unshares to get a brand-new userns. If that child process has been killed and reaped by containerd-shim, its pid can be reused. To learn about the exit event, the `mount` Go package would have to use the reap-event framework, which doesn't make sense.

With pidfd support, we can use polling to learn which process has exited instead of calling the `wait4` syscall.

One more detail: containerd-shim only cares about the container init process and the exec init processes. Currently, containerd-shim uses `PR_SET_CHILD_SUBREAPER` and watches for `SIGCHLD` to reap all the zombie processes, including the container init process and the exec init processes. Before the v4.11 kernel change "exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction", a process X double-forked by the exec init process was reparented to containerd-shim so that containerd-shim could clean up the zombie.

Now (>= v4.11 kernel), containerd-shim only receives SIGCHLD from the init processes, because double-forked processes are reparented to pid 1 in the pid namespace. So containerd-shim doesn't need to care about any double-forked processes. The pidfd helps containerd-shim focus on the correct processes.
REF: containerd/containerd#9175
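The polling approach mentioned in the description can be sketched as follows. This is only an illustration under assumptions, not the containerd-shim or runc implementation: it assumes Linux >= 5.3 on amd64 or arm64 (raw syscall number 434 for `pidfd_open`), uses epoll from the standard `syscall` package, and the names `waitExit` and `run` plus the `sleep` child are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"syscall"
)

const sysPidfdOpen = 434 // pidfd_open(2); assumption: amd64 or arm64 Linux

// pidfdOpen returns a file descriptor that refers to the process pid.
func pidfdOpen(pid int) (int, error) {
	fd, _, errno := syscall.Syscall(sysPidfdOpen, uintptr(pid), 0, 0)
	if errno != 0 {
		return -1, errno
	}
	return int(fd), nil
}

// unsupported reports whether err means pidfd_open is missing or blocked.
func unsupported(err error) bool {
	return errors.Is(err, syscall.ENOSYS) || errors.Is(err, syscall.EPERM)
}

// waitExit reports whether the process behind pidfd exited within timeoutMs.
// A pidfd becomes readable once its process terminates, so no wait4 is needed
// to learn about the exit.
func waitExit(pidfd, timeoutMs int) (bool, error) {
	epfd, err := syscall.EpollCreate1(0)
	if err != nil {
		return false, err
	}
	defer syscall.Close(epfd)

	ev := syscall.EpollEvent{Events: syscall.EPOLLIN, Fd: int32(pidfd)}
	if err := syscall.EpollCtl(epfd, syscall.EPOLL_CTL_ADD, pidfd, &ev); err != nil {
		return false, err
	}
	events := make([]syscall.EpollEvent, 1)
	for {
		n, err := syscall.EpollWait(epfd, events, timeoutMs)
		if err == syscall.EINTR {
			continue // interrupted by a signal; retry (this restarts the timeout)
		}
		if err != nil {
			return false, err
		}
		return n == 1, nil
	}
}

// run polls a child once while it is alive and once after killing it.
func run() (bool, bool, error) {
	cmd := exec.Command("sleep", "60") // stand-in for a container init process
	if err := cmd.Start(); err != nil {
		return false, false, err
	}
	// Deferred calls run in reverse order: kill first, then reap.
	defer cmd.Wait()
	defer cmd.Process.Kill()

	pidfd, err := pidfdOpen(cmd.Process.Pid)
	if err != nil {
		return false, false, fmt.Errorf("pidfd_open: %w", err)
	}
	defer syscall.Close(pidfd)

	before, err := waitExit(pidfd, 100) // still sleeping: expect a timeout
	if err != nil {
		return false, false, err
	}
	if err := cmd.Process.Kill(); err != nil {
		return false, false, err
	}
	after, err := waitExit(pidfd, 5000) // killed: expect readiness
	return before, after, err
}

func main() {
	before, after, err := run()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("exited before kill:", before, "exited after kill:", after)
}
```

Because each pidfd is tied to one process, a manager can register only the init processes it cares about and ignore every other child, rather than filtering the results of a catch-all reap loop.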