Kill dangling subprocesses #632
base: rolling
@ivanpauno before commenting anything, why explicitly list all processes instead of using process groups (and whatever Windows has as an equivalent)? |
IIUC there's no equivalent to process groups on Windows. (edit) I have to double check this though |
I have a few reasons:
Maybe we shouldn't have this feature at all, and only log a warning if a "dangling" subprocess is detected. |
Seems like a reasonable feature to me.
Maybe we shouldn't have this feature at all, and only log a warning if a "dangling" subprocess is detected.
Is there a scenario you had in mind where we wouldn't want a subprocess to end?
Do we not need to wait for those children to exit?
Something like this example:
https://psutil.readthedocs.io/en/latest/index.html?highlight=children#kill-process-tree
where it uses psutil.wait_procs?
It would be nice to notify the user in the launch output if the child process still did not exit after some period of time.
Furthermore, does this timer always wait until it expires and then check the subprocesses, or does it only do this if the parent process has to be sent SIGTERM/SIGKILL?
Also, what happens when all the subprocess exit cleanly and quickly? Does this timer still wait and then check them?
It feels like we're missing something here where we await these processes exiting and only send them signals if they don't exit.
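A minimal sketch of the "wait first, signal only the stragglers" idea, assuming psutil is available. The function name `stop_children` and its timeout parameters are hypothetical and are not taken from this PR; only `Process.children`, `psutil.wait_procs`, `Process.terminate` and `Process.kill` are real psutil calls.

```python
def stop_children(root_pid, grace=2.0, term_timeout=2.0):
    """Wait for the descendants of root_pid; signal only those that linger.

    Hypothetical helper. Returns the psutil.Process objects that are
    still reported alive after the last escalation step.
    """
    import psutil  # third-party dependency (pip install psutil)

    try:
        children = psutil.Process(root_pid).children(recursive=True)
    except psutil.NoSuchProcess:
        return []  # the root is already gone, nothing to enumerate

    # Step 1: give every descendant a chance to exit on its own.
    _, alive = psutil.wait_procs(children, timeout=grace)

    # Step 2: escalate (terminate, then kill) only for the ones still running.
    for method_name in ('terminate', 'kill'):
        if not alive:
            break
        for proc in alive:
            try:
                getattr(proc, method_name)()
            except psutil.NoSuchProcess:
                pass  # exited between the check and the signal
        _, alive = psutil.wait_procs(alive, timeout=term_timeout)
    return alive
```

Anything left in the returned list is what launch could report to the user. Note that `psutil.wait_procs` blocks, so inside launch this would have to run in a thread or be replaced by a non-blocking poll.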
I don't think we can block when executing an action, so I would need another timer to check if the processes were killed.
It always sends the signals to all subprocesses, even if SIGTERM/SIGKILL was not needed.
Yes, and "process does not exist" errors are ignored when trying to kill the subprocesses that were previously detected in the tree for this reason.
Do you mean we should kill subprocesses going "level by level"? |
We can await the processes to exit, either with a thread that sets a future, or if possible using existing asyncio subprocess stuff.
It cannot be caught, but it doesn't mean that it will definitely exit, or how long that will take. I've definitely had some processes fail to exit on
My point was more about the waiting. If the default timeout of
Is that a safe assumption?
No, that's not what I'm suggesting. Instead I was thinking something like this:
This process is good in my opinion because it:
Obviously the escalation process for the main process and the child processes is the same, so maybe we can generalize that, so it can be used in What I outlined is a bit more work to implement, but I also think it's a lot more thorough and is likely to save us (and our users) some time in the future debugging these kinds of issues. |
Another comment about this. I wasn't thinking that, I was still thinking doing it in a batch operation, but I also don't think doing it layer by layer is a bad idea, though it might be really annoying if it takes a long time. The reason I like it as an idea is that I want to give the called processes every chance to behave, and doing a shutdown escalation level by level (top to bottom) is the best way to do that. I could see this as something of a configuration. If we decide to implement what I outlined above, then making it possible to do it step by step instead of in batch should be easy-ish to do. We just need to keep that in mind when working on it. |
I'm not sure. If that's not a safe assumption, then we cannot do anything.
But you cannot really do this:
|
The alternative is to create a new group id when launching a process, and then send a signal to the created group. |
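For reference, the process-group alternative could look roughly like this on POSIX, using only the standard library. The helper names `launch_in_new_group` and `signal_group` are hypothetical; `start_new_session=True` makes the child call `setsid()`, so its pid is also the id of a fresh process group.

```python
import os
import signal
import subprocess


def launch_in_new_group(cmd, **kwargs):
    """Start cmd as the leader of a new session and process group (POSIX only)."""
    return subprocess.Popen(cmd, start_new_session=True, **kwargs)


def signal_group(proc, sig=signal.SIGTERM):
    """Send sig to every process in proc's group.

    Returns False if the group no longer exists. Because the child called
    setsid(), its process group id equals its own pid.
    """
    try:
        os.killpg(proc.pid, sig)
    except ProcessLookupError:
        return False
    return True
```

Usage would be something like `proc = launch_in_new_group(['sh', '-c', 'sleep 60 & wait'])` followed by `signal_group(proc)`. One signal reaches the shell and everything it spawned, but a descendant that moves itself into another group or session is no longer covered, and whether each process actually exited still has to be monitored separately.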
Then how does https://psutil.readthedocs.io/en/latest/index.html?highlight=children#psutil.wait_procs work? In their example they enumerate the children and then wait for them to exit. |
polling |
Ok, I see, so no actual blocking is realistic, and it does it one at a time, so you may "miss" the exit of one process while waiting on another. So maybe that doesn't help with the race between process exit and pid reuse, but I still think waiting to see if they exit is a decent idea. There's nothing worse than having to go back to ps or something to figure out what was left behind, especially if launch could just tell us. And my proposed series of steps have some other advantages, even if this one doesn't pan out. |
What benefit does that give us? (curious) |
Just in case, polling is not only used to check the status of more than one process, it's also used to "wait for a process to exit" if the process is not a child.
Sounds good.
Do you mean escalating SIGINT -> SIGTERM -> SIGKILL -> "Log if subprocess(es) are still alive" for subprocesses as well?
You send only one signal to the group, though the part to monitor if the processes exited or not doesn't change. |
@wjwwood could you confirm this? |
@wjwwood friendly ping |
So, I was actually thinking we do something like what psutil does (perhaps using it), but in a thread or maybe as an async coroutine/task. Basically: create a list of the pids you're watching, start timers to escalate their signals, then run a thread/coroutine task that iterates over all the pids we're watching and, for each one that has exited, cancels its timers and removes it from the list. After each pass, sleep for a fixed short period and then poll again until the list is empty. If a process is still alive some period after SIGKILL, log it and then remove it from the list of pids to watch.
Correct. |
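The polling watcher described above could be sketched as an asyncio coroutine along these lines. This is a simplification under stated assumptions, not the code in this PR: the name `watch_and_escalate` and all of its parameters are hypothetical, per-pid timers are replaced by a schedule that is checked on every poll, and the liveness check, the signal sender and the logger are passed in as callables (for example `psutil.pid_exists`, `os.kill` and a logger method). The default schedule uses `signal.SIGKILL`, so as written it is POSIX only.

```python
import asyncio
import signal
import time


async def watch_and_escalate(pids, is_alive, send_signal, log,
                             schedule=((0.0, signal.SIGINT),
                                       (5.0, signal.SIGTERM),
                                       (10.0, signal.SIGKILL)),
                             give_up_after=15.0, poll_period=0.1):
    """Poll pids until each has exited, escalating signals on a schedule.

    schedule holds (delay_in_seconds, signal) pairs measured from the start
    of the watch. A pid that is still alive at give_up_after, once every
    signal in the schedule has been sent, is logged and dropped. The list
    of dropped pids is returned.
    """
    start = time.monotonic()
    next_step = {pid: 0 for pid in pids}  # per-pid index into schedule
    abandoned = []
    while next_step:
        elapsed = time.monotonic() - start
        for pid in list(next_step):
            if not is_alive(pid):
                del next_step[pid]  # exited: stop escalating for this pid
                continue
            step = next_step[pid]
            if step < len(schedule):
                delay, sig = schedule[step]
                if elapsed >= delay:
                    send_signal(pid, sig)  # at most one signal per poll
                    next_step[pid] = step + 1
            elif elapsed >= give_up_after:
                log(f'process {pid} is still alive after the full escalation')
                abandoned.append(pid)
                del next_step[pid]
        await asyncio.sleep(poll_period)
    return abandoned
```

Because each pid is checked on every pass, a process that exits while another is still being escalated is noticed on the next poll. The pid-reuse race discussed earlier in the thread is not addressed by this sketch.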
Please take a look at e326c0d |
Signed-off-by: Ivan Santiago Paunovic <[email protected]>
ead94fc to 5b43cdc
@ivanpauno just some flake8 linting issues |
mmm, I have to double check those failures, they seem related |
This is making Windows CI hang for some reason ... |
"""Test launching a process with an environment variable."""
executable = ExecuteLocal(
    process_description=Executable(
        cmd=['python3', '-c', f'"{PYTHON_SCRIPT}"'],
Is this sufficient to test the feature? Is it the shell=True part that makes this test useful?
shell=True will create a shell process, and that shell will create a subprocess.
So the launch process will have to kill the shell's subprocess, because the shell will not trap the signals and resend them to the child.
This shows that the feature works.
It's not super complete though, if you have more test case ideas I can add them.
My use case is a process spawns some children and then exits, and I do not want launch to exit until the child processes have exited. I have implemented this in Greenroom-Robotics/launch_ext@850a2f4#diff-7baf6e854cc3c937eaed0b127161c5f82cf86d8a8eaef0038171216a817a0c62R194-R222 My technique was to use the stdin/stdout/stderr inodes and if any shared the same inodes with the parent process they would be considered children to be waited on. I am not sure if this way is ideal but it does seem to work. |
Howdy @ivanpauno - any chance this could still be merged? I've encountered this issue recently, and have a workaround in place, but it would be ideal to handle the issue directly in |
Fixes #545.
I used psutil to figure out children of a process recursively.
It's an easy way to handle this issue platform independently.
For POSIX OSes, we could send a signal to the process group, but for that we would have to create a new process group when launching a process, and I'm not sure that's the best idea.