-
-
Notifications
You must be signed in to change notification settings - Fork 512
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix race condition around subprocess inference state tidyup #2008
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is insane. Thanks a lot for documenting all of these details. It's extremely rare to be gifted with pull requests of this quality! Feel free to merge and thanks again for dealing with all of my old code that's really not that easy to understand ❤️
Thanks :D For clarity: I'm keeping this on hold as I've started seeing test failures on my |
This removes some of the coupling between the management of the underlying process and the inference state itself, which intends to enable changing the origin of the id. This will be useful in the next commit.
I'm not sure where this was used in the past, however it appears to be unused now. Removing this simplifies a change I'm about to make to _InferenceStateProcess.
This is derived from my understanding of the code, plus a bit of experimentation.
There was a race condition due to the combination of Python's object ids being re-usable and Jedi persisting such ids beyond the real lifeteime of some objects. This could lead to the subprocess' view of the lifetime of `InferenceState` contexts getting out of step with that in the parent process and resulting in errors when removing them. It is also possible that this could result in erroneous results being reported, however this was not directly observed. The race was specifically: - `InferenceState` A created, gets id 1 - `InferenceStateSubprocess` A' created, uses `InferenceState` A which it stores as a weakref and an id - `InferenceStateSubprocess` A' is used, the sub-process learns about an `InferenceState` with id 1 - `InferenceState` A goes away, `InferenceStateSubprocess` A' is not yet garbage collected - `InferenceState` B created, gets id 1 - `InferenceStateSubprocess` B' created, uses `InferenceState` B which it stores as a weakref and an id - `InferenceStateSubprocess` B' is used, the sub-process re-uses its entry for an `InferenceState` with id 1 At this point the order of operations between the two `InferenceStateSubprocess` instances going away is immaterial -- both will trigger a removal of a state with id 1. As long as B' doesn't try to use the sub-process again after the first removal has happened then the second removal will fail. This commit resolves the race condition by coupling the context in the subprocess to the corresponding manager class instance in the parent process, rather than to the consumer `InferenceState`. See inline comments for further details.
a70bc50
to
a67deeb
Compare
I believe these are down to some changes in Python 3.13.0-beta.3 (where previously CI was using beta.2), and I'm happy that that's not related to the changes in this PR. |
That makes sense. The pull request itself feels very simple and I am not surprised it's not your fault :) |
Summary
There was a race condition due to the combination of Python's object ids being re-usable and Jedi persisting such ids beyond the real lifeteime of some objects. This could lead to the subprocess' view of the lifetime of
InferenceState
contexts getting out of step with that in the parent process and resulting in errors when removing them. It is also possible that this could result in erroneous results being reported, however this was not directly observed.The race was specifically:
InferenceState
A created, gets id 1InferenceStateSubprocess
A' created, usesInferenceState
A which it stores as a weakref and an idInferenceStateSubprocess
A' is used, the sub-process learns about anInferenceState
with id 1InferenceState
A goes away,InferenceStateSubprocess
A' is not yet garbage collectedInferenceState
B created, gets id 1InferenceStateSubprocess
B' created, usesInferenceState
B which it stores as a weakref and an idInferenceStateSubprocess
B' is used, the sub-process re-uses its entry for anInferenceState
with id 1At this point the order of operations between the two
InferenceStateSubprocess
instances going away is immaterial -- both will trigger a removal of a state with id 1. As long as B' doesn't try to use the sub-process again after the first removal has happened then the second removal will fail.This change resolves the race condition by coupling the context in the subprocess to the corresponding manager class instance in the parent process, rather than to the consumer
InferenceState
.See inline comments for further details.
Code review
Suggest review by commit. There's some small refactors in the early commits. The actual fix is the last commit on the branch.
Testing
I'm open to ideas on how to add a test for this.
Currently this is relying on Python 3.13 tests reproducing the issue fairly easily, though not entirely consistently.
Links
Currently builds on #2003 (as it was discovered there), though it doesn't really need to be part of the upgrade process.