-
-
Notifications
You must be signed in to change notification settings - Fork 30.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make the specializing interpreter thread-safe in --disable-gil
builds
#115999
Comments
(subscribing myself) |
…aded builds (#116013) For now, disable all specialization when the GIL might be disabled.
This is now a performance (rather than correctness) issue for free-threaded builds, so I'm going to focus on more time-sensitive issues for a while. |
…e-threaded builds (python#116013) For now, disable all specialization when the GIL might be disabled.
…e-threaded builds (python#116013) For now, disable all specialization when the GIL might be disabled.
…e-threaded builds (python#116013) For now, disable all specialization when the GIL might be disabled.
@swtaarrs Out of curiosity, is there any progress or plan for this issue? |
@corona10 I'm planning to work on this after I get the deferred reference stack in. However, there are no concrete plans as of now. I'm really happy for you or anyone else to propose a design for the specializing interpreter with free-threaded safety! |
@Fidget-Spinner cc @swtaarrs By the way, in the short term, can we enable the specializer to be used only for the main thread if we can not solve the issue before 3.13 is released? |
@corona10 for 3.13, I think generally we're focusing on scalability across multicore rather than single-threaded perf for 3.13. It's a bit too near to feature freeze for me to feel safe re-enabling specialization at this point. There are a lot of unsolved problems still even with specialization only on the main thread. Consider the following:
I'm reading a few papers to get some inspiration and also looking at how CRuby and other runtimes deal with this. Will post back when I have an actual plan. |
Stop the world when invalidating function versions The tier1 interpreter specializes `CALL` instructions based on the values of certain function attributes (e.g. `__code__`, `__defaults__`). The tier1 interpreter uses function versions to verify that the attributes of a function during execution of a specialization match those seen during specialization. A function's version is initialized in `MAKE_FUNCTION` and is invalidated when any of the critical function attributes are changed. The tier1 interpreter stores the function version in the inline cache during specialization. A guard is used by the specialized instruction to verify that the version of the function on the operand stack matches the cached version (and therefore has all of the expected attributes). It is assumed that once the guard passes, all attributes will remain unchanged while executing the rest of the specialized instruction. Stopping the world when invalidating function versions ensures that all critical function attributes will remain unchanged after the function version guard passes in free-threaded builds. It's important to note that this is only true if the remainder of the specialized instruction does not enter and exit a stop-the-world point. We will stop the world the first time any of the following function attributes are mutated: - defaults - vectorcall - kwdefaults - closure - code This should happen rarely and only happens once per function, so the performance impact on majority of code should be minimal. Additionally, refactor the API for manipulating function versions to more clearly match the stated semantics.
…ython#124997) Stop the world when invalidating function versions The tier1 interpreter specializes `CALL` instructions based on the values of certain function attributes (e.g. `__code__`, `__defaults__`). The tier1 interpreter uses function versions to verify that the attributes of a function during execution of a specialization match those seen during specialization. A function's version is initialized in `MAKE_FUNCTION` and is invalidated when any of the critical function attributes are changed. The tier1 interpreter stores the function version in the inline cache during specialization. A guard is used by the specialized instruction to verify that the version of the function on the operand stack matches the cached version (and therefore has all of the expected attributes). It is assumed that once the guard passes, all attributes will remain unchanged while executing the rest of the specialized instruction. Stopping the world when invalidating function versions ensures that all critical function attributes will remain unchanged after the function version guard passes in free-threaded builds. It's important to note that this is only true if the remainder of the specialized instruction does not enter and exit a stop-the-world point. We will stop the world the first time any of the following function attributes are mutated: - defaults - vectorcall - kwdefaults - closure - code This should happen rarely and only happens once per function, so the performance impact on majority of code should be minimal. Additionally, refactor the API for manipulating function versions to more clearly match the stated semantics.
I am releasing the assignee of the STORE_ATTR to anyone who has time to grab in, since I am going to focus on |
…27128) Use existing helpers to atomically modify the bytecode. Add unit tests to ensure specializing is happening as expected. Add test_specialize.py that can be used with ThreadSanitizer to detect data races. Fix thread safety issue with cell_set_contents().
…ded builds (#127123) The CALL family of instructions were mostly thread-safe already and only required a small number of changes, which are documented below. A few changes were needed to make CALL_ALLOC_AND_ENTER_INIT thread-safe: Added _PyType_LookupRefAndVersion, which returns the type version corresponding to the returned ref. Added _PyType_CacheInitForSpecialization, which takes an init method and the corresponding type version and only populates the specialization cache if the current type version matches the supplied version. This prevents potentially caching a stale value in free-threaded builds if we race with an update to __init__. Only cache __init__ functions that are deferred in free-threaded builds. This ensures that the reference to __init__ that is stored in the specialization cache is valid if the type version guard in _CHECK_AND_ALLOCATE_OBJECT passes. Fix a bug in _CREATE_INIT_FRAME where the frame is pushed to the stack on failure. A few other miscellaneous changes were also needed: Use {LOCK,UNLOCK}_OBJECT in LIST_APPEND. This ensures that the list's per-object lock is held while we are appending to it. Add missing co_tlbc for _Py_InitCleanup. Stop/start the world around setting the eval frame hook. This allows us to read interp->eval_frame non-atomically and preserves the behavior of _CHECK_PEP_523 documented below.
…ation for `BINARY_OP` (python#123926) Each thread specializes a thread-local copy of the bytecode, created on the first RESUME, in free-threaded builds. All copies of the bytecode for a code object are stored in the co_tlbc array on the code object. Threads reserve a globally unique index identifying its copy of the bytecode in all co_tlbc arrays at thread creation and release the index at thread destruction. The first entry in every co_tlbc array always points to the "main" copy of the bytecode that is stored at the end of the code object. This ensures that no bytecode is copied for programs that do not use threads. Thread-local bytecode can be disabled at runtime by providing either -X tlbc=0 or PYTHON_TLBC=0. Disabling thread-local bytecode also disables specialization. Concurrent modifications to the bytecode made by the specializing interpreter and instrumentation use atomics, with specialization taking care not to overwrite an instruction that was instrumented concurrently.
…bytecode change (python#126440) Fix the gdb pretty printer in the face of --enable-shared by delaying the attempt to load the _PyInterpreterFrame definition until after .so files are loaded.
…thongh-126450) - The specialization logic determines the appropriate specialization using only the operand's type, which is safe to read non-atomically (changing it requires stopping the world). We are guaranteed that the type will not change in between when it is checked and when we specialize the bytecode because the types involved are immutable (you cannot assign to `__class__` for exact instances of `dict`, `set`, or `frozenset`). The bytecode is mutated atomically using helpers. - The specialized instructions rely on the operand type not changing in between the `DEOPT_IF` checks and the calls to the appropriate type-specific helpers (e.g. `_PySet_Contains`). This is a correctness requirement in the default builds and there are no changes to the opcodes in the free-threaded builds that would invalidate this.
…ython#126414) Introduce helpers for (un)specializing instructions Consolidate the code to specialize/unspecialize instructions into two helper functions and use them in `_Py_Specialize_BinaryOp`. The resulting code is more concise and keeps all of the logic at the point where we decide to specialize/unspecialize an instruction.
* Enable specialization of CALL_KW * Fix bug pushing frame in _PY_FRAME_KW `_PY_FRAME_KW` pushes a pointer to the new frame onto the stack for consumption by the next uop. When pushing the frame fails, we do not want to push the result, `NULL`, to the stack because it is not a valid stackref. This works in the default build because `PyStackRef_NULL` and `NULL` are the same value, so the `PyStackRef_XCLOSE()` in the error handler ignores it. In the free-threaded build the values are not the same; `PyStackRef_XCLOSE()` will attempt to decref a null pointer.
…d builds (#127711) We use the same approach that was used for specialization of LOAD_GLOBAL in free-threaded builds: _CHECK_ATTR_MODULE is renamed to _CHECK_ATTR_MODULE_PUSH_KEYS; it pushes the keys object for the following _LOAD_ATTR_MODULE_FROM_KEYS (nee _LOAD_ATTR_MODULE). This arrangement avoids having to recheck the keys version. _LOAD_ATTR_MODULE is renamed to _LOAD_ATTR_MODULE_FROM_KEYS; it loads the value from the keys object pushed by the preceding _CHECK_ATTR_MODULE_PUSH_KEYS at the cached index.
…re-attr' into pythongh-115999-integrate-attr
Feature or enhancement
Proposal:
In free-threaded builds, the specializing adaptive interpreter needs to be made thread-safe. We should start with a small PR to simply disable it in free-threaded builds, which will be correct but will incur a performance penalty. Then we can work out how to properly support specialization in a free-threaded build.
These two commits from Sam's nogil-3.12 branch can serve as inspiration:
There are two primary concerns to balance while implementing this functionality on
main
:Has this already been discussed elsewhere?
I have already discussed this feature proposal on Discourse
Links to previous discussion of this feature:
Specialization Families
Tasks
Linked PRs
BINARY_OP
#123926LOAD_GLOBAL
specializations to avoid reloading {globals, builtins} keys #124953UNPACK_SEQUENCE
#126600LOAD_GLOBAL
in free-threaded builds #126607TO_BOOL
#126616CALL
instructions in free-threaded builds #127123LOAD_SUPER_ATTR
in free-threaded builds #127128specialize
#127167STORE_SUBSCR
#127169SEND
. #127426CALL_KW
in free-threaded builds #127713STORE_ATTR
in free-threaded builds. #127838The text was updated successfully, but these errors were encountered: