
gh-115999: Implement thread-local bytecode and enable specialization for BINARY_OP #123926

Merged
84 commits merged into python:main on Nov 4, 2024

Conversation

mpage
Contributor

@mpage mpage commented Sep 10, 2024

This PR implements the foundational work necessary for making the specializing interpreter thread-safe in free-threaded builds and enables specialization for BINARY_OP as an end-to-end example. To enable future incremental work, specialization can now be toggled on a per-family basis. Subsequent PRs will enable specialization in free-threaded builds for the remaining families.

In free-threaded builds, each thread specializes a thread-local copy of the bytecode, created on the first RESUME. All copies of the bytecode for a code object are stored in the co_tlbc array on the code object. Each thread reserves a globally unique index identifying its copy of the bytecode in all co_tlbc arrays at thread creation and releases the index at thread destruction. The first entry in every co_tlbc array always points to the "main" copy of the bytecode, which is stored at the end of the code object. This ensures that no bytecode is copied for programs that do not use threads.
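
As a rough sketch of the scheme, here is a pure-Python model of the per-code-object table of bytecode copies. The names (CodeArray, thread_index, get_local_copy) are illustrative only and do not correspond to the C-level structures, and the model omits releasing indices at thread destruction:

    import itertools
    import threading

    _next_index = itertools.count(1)   # index 0 is reserved for the "main" bytecode
    _thread_state = threading.local()

    def thread_index():
        # Each thread reserves one globally unique index for its lifetime.
        # (The real scheme also releases the index when the thread exits,
        # which this model omits.)
        if not hasattr(_thread_state, "index"):
            _thread_state.index = next(_next_index)
        return _thread_state.index

    class CodeArray:
        """Toy stand-in for the per-code-object array of bytecode copies."""

        def __init__(self, main_bytecode):
            self._lock = threading.Lock()
            self.entries = [bytes(main_bytecode)]   # entry 0: the main copy

        def get_local_copy(self):
            idx = thread_index()
            with self._lock:
                while len(self.entries) <= idx:
                    self.entries.append(None)
                if self.entries[idx] is None:
                    # Created lazily, mirroring the copy made on the first RESUME.
                    self.entries[idx] = bytearray(self.entries[0])
                return self.entries[idx]

A single-threaded program only ever touches entry 0, which is why no extra copies are made unless threads are actually used.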

Thread-local bytecode can be disabled at runtime by providing either -X tlbc=0 or PYTHON_TLBC=0. Disabling thread-local bytecode also disables specialization.
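
For example, both switches can be exercised from a parent process as below. This is a minimal sketch: it assumes an interpreter new enough to recognize the option; on other builds the setting may simply be ignored or rejected.

    import os
    import subprocess
    import sys

    # Disable thread-local bytecode for a child interpreter via the -X option.
    subprocess.run([sys.executable, "-X", "tlbc=0", "-c", "pass"], check=True)

    # Equivalently, disable it via the environment variable.
    env = dict(os.environ, PYTHON_TLBC="0")
    subprocess.run([sys.executable, "-c", "pass"], env=env, check=True)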

Concurrent modifications to the bytecode made by the specializing interpreter and instrumentation use atomics, with specialization taking care not to overwrite an instruction that was instrumented concurrently.
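
The rule can be modeled in Python roughly as follows. This is only a sketch of the logic: a lock stands in for the atomic compare-exchange used at the C level, and the opcode constants are made up.

    import threading

    GENERIC_OP = 1        # hypothetical unspecialized opcode
    SPECIALIZED_OP = 2    # hypothetical specialized form
    INSTRUMENTED_OP = 3   # hypothetical instrumented form

    class Instruction:
        def __init__(self, opcode):
            self.opcode = opcode
            self._lock = threading.Lock()  # stands in for an atomic compare-exchange

        def try_specialize(self):
            # Install the specialized opcode only if the instruction still holds
            # the generic one; a concurrently instrumented opcode is left alone.
            with self._lock:
                if self.opcode == GENERIC_OP:
                    self.opcode = SPECIALIZED_OP
                    return True
                return False

        def instrument(self):
            with self._lock:
                self.opcode = INSTRUMENTED_OP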

- Fix a few places where we were not using atomics to (de)instrument
  opcodes.
- Fix a few places where we weren't using atomics to reset adaptive
  counters.
- Remove some redundant non-atomic resets of adaptive counters that
  presumably snuck in as merge artifacts of python#118064 and
  python#117144 landing close together.
Read the opcode atomically; the interpreter may be specializing it.
@mpage
Contributor Author

mpage commented Oct 22, 2024

@markshannon - Would you take a look at this, please?

@markshannon
Member

I'm still concerned about not counting the tlbc memory blocks in the refleaks test.

Maybe you could count them separately, and still check that there aren't too many leaked, but be a bit more relaxed about the counts for tlbc than for other blocks?

@mpage
Contributor Author

mpage commented Oct 24, 2024

!buildbot nogil refleak

@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @mpage for commit 07f9140 🤖

The command will test the builders whose names match the following regular expression: nogil refleak

The builders matched are:

  • AMD64 CentOS9 NoGIL Refleaks PR
  • AMD64 Fedora Rawhide NoGIL refleaks PR
  • aarch64 Fedora Rawhide NoGIL refleaks PR
  • PPC64LE Fedora Rawhide NoGIL refleaks PR

@mpage
Contributor Author

mpage commented Oct 24, 2024

I'm still concerned about not counting the tlbc memory blocks in the refleaks test.

Maybe you could count them separately, and still check that there aren't too many leaked, but be a bit more relaxed about the counts for tlbc than for other blocks?

@markshannon - That would work, but I opted for clearing the cached TLBC for threads that aren't currently in use when we clear other internal caches. This should still catch leaks, doesn't require modifying refleaks.py, and is the same approach we use for tier2. Please have a look.
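
A minimal sketch of the kind of check this enables is shown below, using sys._clear_internal_caches() (Python 3.13+) and sys.getallocatedblocks(). The helper name leaked_blocks is illustrative; the actual refleak machinery is considerably more involved.

    import sys

    def leaked_blocks(fn, warmups=3, runs=5):
        """Rough estimate of memory blocks leaked per call of fn()."""
        for _ in range(warmups):
            fn()
        sys._clear_internal_caches()      # frees cached TLBC for idle threads, among other caches
        before = sys.getallocatedblocks()
        for _ in range(runs):
            fn()
        sys._clear_internal_caches()
        return (sys.getallocatedblocks() - before) / runs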

# code objects is a large fraction of the total number of
# references, this can cause the total number of allocated
# blocks to exceed the total number of references.
if not support.Py_GIL_DISABLED:
Member

@markshannon markshannon Oct 29, 2024


Now that we can free the unused tlbcs, can we replace this with sys._clear_internal_caches()?

Contributor Author

@mpage mpage Oct 29, 2024


Unfortunately, no. The assertion seems to be very sensitive to which kinds of objects are on the heap, as well as to the number of non-reference-counted allocations (blocks) per object. With the introduction of TLBC there is at least one additional block allocated per code object that is not reference counted, the _PyCodeArray, which is present even if we free the unused TLBCs. Its presence is enough to trigger the assertion.

This assertion feels pretty brittle and I'd be in favor of removing it, but that's probably worth doing in a separate PR.
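
To make the failure mode concrete, the invariant being asserted compares two interpreter-wide counters along these lines. This is a simplified illustration, not the actual test code, and sys.gettotalrefcount() exists only on debug builds.

    import sys

    blocks = sys.getallocatedblocks()                          # every live memory block
    refs = getattr(sys, "gettotalrefcount", lambda: None)()    # debug builds only
    # Non-reference-counted allocations, such as the per-code-object
    # _PyCodeArray, add to `blocks` without adding to `refs`, which is
    # what can push `blocks` above `refs` on free-threaded builds.
    print("blocks:", blocks, "total refs:", refs)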

Member


Maybe replace it with a more meaningful test rather than remove it. But in another PR.

Member

@markshannon markshannon left a comment


Looks good.

One question. Can we prefix the test for leaking blocks with sys._clear_internal_caches() instead of making it conditional on not using free-threading?

@mpage
Contributor Author

mpage commented Oct 29, 2024

One question. Can we prefix the test for leaking blocks with sys._clear_internal_caches() instead of making it conditional on not using free-threading?

@markshannon - Unfortunately that doesn't help. See my reply inline.

@markshannon markshannon self-requested a review October 29, 2024 16:54
Member

@markshannon markshannon left a comment


I still have concerns about memory use, but we can iterate on that in subsequent PRs.

@mpage
Contributor Author

mpage commented Nov 4, 2024

The JIT failures appear on main as well and are unrelated to this PR.

@mpage mpage merged commit 2e95c5b into python:main Nov 4, 2024
51 of 57 checks passed
@mpage mpage deleted the gh-115999-thread-local-bytecode branch November 4, 2024 19:14
@encukou
Member

encukou commented Nov 5, 2024

After this PR was merged, test_gdb started failing; see e.g. https://buildbot.python.org/#/builders/506/builds/9149
Do you think it's possible to fix this in a day?

@Yhg1s
Member

Yhg1s commented Nov 5, 2024

Looks like the problem is only in --enable-shared builds, and it's because we're now looking up _PyInterpreterFrame too early (before the .so file is loaded). I'll have a fix in a few minutes.

@Yhg1s
Member

Yhg1s commented Nov 5, 2024

PR #126440 should fix the failure.

picnixz pushed a commit to picnixz/cpython that referenced this pull request Dec 8, 2024
gh-115999: Implement thread-local bytecode and enable specialization for `BINARY_OP` (python#123926)
