Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-126644: Fix various thread safety issues in _interpreters #126696

Closed

Conversation

ZeroIntensity
Copy link
Member

@ZeroIntensity ZeroIntensity commented Nov 11, 2024

This PR fixes a few things:

  • Setting an error with _PyXI_ERR_ALREADY_RUNNING currently relies on the interpreter state after the thread state has already been detached, so it's possible, at least on the free-threaded build, that another thread stops the interpreter and breaks the FailIfRunningMain check. This fixes that by just raising the exception manually.
  • _interpreters.destroy doesn't expect another thread to start the interpreter after its initial check, so this adds a flag on the interpreter state that prevents future threads from starting after destruction has started.
  • As it turns out, creating a thread state has some thread safety issues as of gh-115999: Implement thread-local bytecode and enable specialization for BINARY_OP #123926, so this just slaps an ugly HEAD_LOCK on it for now.
  • Re-enables subinterpreter tests for free-threaded builds, because they should be thread-safe now.

It would be nice if we didn't have to throw an exception for everything and instead just make it thread-safe, but that would be too big of a PR to backport to 3.13.

Copy link
Contributor

@colesbury colesbury left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these thread-safety issues are not limited to the free threading build. The GIL does not really protect most of these cross-interpreter checks, like _PyInterpreterState_IsRunningMain, because different subinterpreters can concurrently try to run or destroy each other.

I also think it's better to try to fix these issues robustly rather than make a minimal PR that can be backported to 3.13. More generally, I'm not convinced it's worth the risk to backport non-trivial fixes to issues that solely come up when using the combination of free-threading + subinterpreters.

Comment on lines +1588 to +1589
/* _Py_ReserveTLBCIndex has thread safety issues */
HEAD_LOCK(runtime);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem right. _PyIndexPool_AllocIndex locks internally.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, that's what I thought too. I'm not too sure what's going on here, but this segfaults without this lock.

return NULL;
}

if (_Py_atomic_load_ptr_relaxed(&interp->threads.head) != NULL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand what you're doing here, but access to the linked list of thread states requires holding HEAD_LOCK().

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case is when _PyInterpreterState_PreventMain was able to set the prevention flag, but there was a thread that was able to set the main thread right before that happened, so we can't destroy the interpreter while that thread is still alive (but no more threads will be able to set main after its done).

@ZeroIntensity
Copy link
Member Author

More generally, I'm not convinced it's worth the risk to backport non-trivial fixes to issues that solely come up when using the combination of free-threading + subinterpreters.

Hmm, why? This will hurt PEP 734, because Eric's PyPI package for backporting features is targeted towards 3.13+, not 3.14+. I'm not sure how many people are using 3.13t as their working copy, but I would assume that the people who are using it are also the people who would like to use subinterpreters to get around the GIL as well.

@colesbury
Copy link
Contributor

  • Both free threading and _interpreters are experimental; the combination is especially so.
  • I think the scope of getting the two to work robustly together is large; more like a "feature" than a "bug fix"
  • Some of these changes are risky; we may introduce regressions to fix something that is not a regression.
  • I do not fully understand the thread-safety implications here. From the PR description, I'm not sure you fully understand them either. If you do, then the thread-safety requirements should be more clearly explained in the PR.

@ZeroIntensity
Copy link
Member Author

  • Some of these changes are risky; we may introduce regressions to fix something that is not a regression.

Judging by the failing tests, I see your point :)

What do you suggest we do for 3.13? Slap a big Py_MOD_GIL_USED on _interpreters?

  • I do not fully understand the thread-safety implications here. From the PR description, I'm not sure you fully understand them either. If you do, then the thread-safety requirements should be more clearly explained in the PR.

Yeah, I'm pretty sure that the cases I've fixed in this PR are truly fixed, but the test case isn't extensive at all. I'm doing my best to try and cover the bases that aren't thread-safe here, but as far as I can tell, interpreter switching without waiting on the GIL is a gray area. There are things that work right now that shouldn't, and I'm not totally sure why. (I'm not even addressing the C API here, that's probably still a mess--this change is assuming that subinterpreters created by _interpreters are only used by _interpreters.)

@ZeroIntensity ZeroIntensity removed the needs backport to 3.13 bugs and security fixes label Nov 11, 2024
@ZeroIntensity ZeroIntensity marked this pull request as draft November 11, 2024 20:12
@ZeroIntensity
Copy link
Member Author

@colesbury FWIW, I was being dumb. The exception problem is almost certainly present on the default builds (I said that in the original issue, but not in the description for whatever reason). I'm assuming the same goes for the destroy issue, but I haven't been able to reproduce it yet.

Apparently, there's some sort of race on both builds when creating thread states that's been screwing up my reproducers (I get hit with a thread state already initialized fatal error on both GIL and no-GIL):

from threading import Thread
import _interpreters

interp = _interpreters.create()

def run():
    this_interp = _interpreters.create()
    _interpreters.run_string(this_interp, f"import _interpreters; _interpreters.run_string({interp}, '1')")

threads = [Thread(target=run) for _ in range(1000)]
for thread in threads:
    thread.start()

for thread in threads:
    thread.join()

As far as I can tell, moving HEAD_LOCK here was an incidental fix for that. I'll investigate this further later today.

@ZeroIntensity
Copy link
Member Author

I'm going to split this into smaller PRs so we can address reviews and backporting on a case-by-case basis.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants