Skip to content

Commit

Permalink
pythongh-112175: Add eval_breaker to PyThreadState
Browse files Browse the repository at this point in the history
This change adds an `eval_breaker` field to `PyThreadState`, renaming the existing `eval_breaker` to
`interp_eval_breaker` (its uses are explained further down). The primary motivation is for
performance in free-threaded builds: with thread-local eval breakers, we can stop a specific thread
(e.g., for an async exception) without interrupting other threads.

There are still two situations where we want the first available thread to handle a request:
- Running a garbage collection: In normal builds, we set `_PY_GC_SCHEDULED_BIT` on the current
  thread. In case a thread suspends before handling the collection, the bit is copied to and from
  `interp_eval_breaker` on thread suspend and resume, respectively. In a free-threaded build, we
  simply iterate over all threads and set the bit. The first thread to check its eval breaker runs
  the collection, unsetting the bit on all threads.
  - Free-threaded builds could have multiple threads attempt a GC from one trigger if we get very
    unlucky with thread scheduling. I didn't put any protections against this in place because a)
    the consequences of it happening are just that one or more threads will check the GC thresholds
    right after a collection finishes, which won't affect correctness and b) it's incredibly,
    vanishingly unlikely.
- Pending calls not limited to the main thread (possible since python/cpython@757b402ea1c2). This is
  a little tricker, since the callback can be added from any thread, with or without the GIL
  held. If the targeted interpreter's GIL is locked, we signal the holding thread. When a thread is
  resumed, its `_PY_CALLS_TO_DO` bit is derived from the source of truth for pending calls (one of
  two `_pending_calls` structs). This handles situations where no thread held the GIL when the call
  was first added, or if the active thread did not handle the call before releasing the GIL. In a
  free-threaded build, all threads all signaled, similar to scheduling a GC.

The source of truth for the global instrumentation version is still in `interp_eval_breaker`, in
both normal and free-threaded builds. Threads usually read the version from their local
`eval_breaker`, where it continues to be colocated with the eval breaker bits, and the method for
keeping it up to date depends on build type. All builds first update the version in
`interp_eval_breaker`, and then:
- Normal builds update the version in the current thread's `eval_breaker`. When a thread takes the
  GIL, it copies the current version from `interp_eval_breaker` as part of the same operation that
  copies `_PY_GC_SCHEDULED_BIT`.
- Free-threaded builds again iterate over all threads in the current interpreter, updating the
  version on each one.
Instrumentation (and the specializing interpreter more generally) will need more work to be
compatible with free-threaded builds, so these changes are just intended to maintain the status quo
in normal builds for now.

Other notable changes are:
- The `_PY_*_BIT` macros now expand to the actual bit being set, rather than the bit's index. I
  think this is simpler overall. I also moved their definitions from `pycore_ceval.h` to
  `pycore_pystate.h`, since their main usage is on `PyThreadState`s now.
- Most manipulations of `eval_breaker` are done with a new pair of functions:
  `_PyThreadState_Signal()` and `_PyThreadState_Unsignal()`. Having two separate functions to
  set/unset a bit, rather than one function that takes the bit value to use, lets us use a single
  atomic `or`/`and`, rather than a loop around an atomic compare/exchange like the old
  `_Py_set_eval_breaker_bit` function.

Existing tests provide pretty good coverage for most of this functionality. The one new test I added
is to make sure a GC still happens if a thread schedules it then drops the GIL before the GC runs. I
don't love how complicated this test ended up so I'm open to other ideas for how to test this (or
other things to test in general).
  • Loading branch information
swtaarrs committed Feb 9, 2024
1 parent 5914a21 commit bc613e4
Show file tree
Hide file tree
Showing 20 changed files with 441 additions and 169 deletions.
5 changes: 5 additions & 0 deletions Include/cpython/pystate.h
Original file line number Diff line number Diff line change
Expand Up @@ -68,6 +68,11 @@ struct _ts {
PyThreadState *next;
PyInterpreterState *interp;

/* The global instrumentation version in high bits, plus flags indicating
when to break out of the interpreter loop in lower bits. See details in
pycore_pystate.h. */
uintptr_t eval_breaker;

struct {
/* Has been initialized to a safe state.
Expand Down
42 changes: 4 additions & 38 deletions Include/internal/pycore_ceval.h
Original file line number Diff line number Diff line change
Expand Up @@ -42,7 +42,7 @@ PyAPI_FUNC(int) _PyEval_MakePendingCalls(PyThreadState *);

extern void _Py_FinishPendingCalls(PyThreadState *tstate);
extern void _PyEval_InitState(PyInterpreterState *);
extern void _PyEval_SignalReceived(PyInterpreterState *interp);
extern void _PyEval_SignalReceived(void);

// bitwise flags:
#define _Py_PENDING_MAINTHREADONLY 1
Expand All @@ -55,7 +55,6 @@ PyAPI_FUNC(int) _PyEval_AddPendingCall(
void *arg,
int flags);

extern void _PyEval_SignalAsyncExc(PyInterpreterState *interp);
#ifdef HAVE_FORK
extern PyStatus _PyEval_ReInitThreads(PyThreadState *tstate);
#endif
Expand Down Expand Up @@ -181,8 +180,9 @@ extern struct _PyInterpreterFrame* _PyEval_GetFrame(void);
extern PyObject* _Py_MakeCoro(PyFunctionObject *func);

/* Handle signals, pending calls, GIL drop request
and asynchronous exception */
extern int _Py_HandlePending(PyThreadState *tstate);
and asynchronous exception.
Export for '_testinternalcapi' shared extension. */
PyAPI_FUNC(int) _Py_HandlePending(PyThreadState *tstate);

extern PyObject * _PyEval_GetFrameLocals(void);

Expand All @@ -200,40 +200,6 @@ int _PyEval_UnpackIterable(PyThreadState *tstate, PyObject *v, int argcnt, int a
void _PyEval_FrameClearAndPop(PyThreadState *tstate, _PyInterpreterFrame *frame);


#define _PY_GIL_DROP_REQUEST_BIT 0
#define _PY_SIGNALS_PENDING_BIT 1
#define _PY_CALLS_TO_DO_BIT 2
#define _PY_ASYNC_EXCEPTION_BIT 3
#define _PY_GC_SCHEDULED_BIT 4
#define _PY_EVAL_PLEASE_STOP_BIT 5

/* Reserve a few bits for future use */
#define _PY_EVAL_EVENTS_BITS 8
#define _PY_EVAL_EVENTS_MASK ((1 << _PY_EVAL_EVENTS_BITS)-1)

static inline void
_Py_set_eval_breaker_bit(PyInterpreterState *interp, uint32_t bit, uint32_t set)
{
assert(set == 0 || set == 1);
uintptr_t to_set = set << bit;
uintptr_t mask = ((uintptr_t)1) << bit;
uintptr_t old = _Py_atomic_load_uintptr(&interp->ceval.eval_breaker);
if ((old & mask) == to_set) {
return;
}
uintptr_t new;
do {
new = (old & ~mask) | to_set;
} while (!_Py_atomic_compare_exchange_uintptr(&interp->ceval.eval_breaker, &old, new));
}

static inline bool
_Py_eval_breaker_bit_is_set(PyInterpreterState *interp, int32_t bit)
{
return _Py_atomic_load_uintptr_relaxed(&interp->ceval.eval_breaker) & (((uintptr_t)1) << bit);
}


#ifdef __cplusplus
}
#endif
Expand Down
13 changes: 8 additions & 5 deletions Include/internal/pycore_ceval_state.h
Original file line number Diff line number Diff line change
Expand Up @@ -78,11 +78,14 @@ struct _ceval_runtime_state {


struct _ceval_state {
/* This single variable consolidates all requests to break out of
* the fast path in the eval loop.
* It is by far the hottest field in this struct and
* should be placed at the beginning. */
uintptr_t eval_breaker;
/* This single variable holds the global instrumentation version and some
* interpreter-global requests to break out of the fast path in the eval
* loop. PyThreadState also contains an eval_breaker, which is the source
* of truth when a thread is running.
*
* It is by far the hottest field in this struct and should be placed at
* the beginning. */
uintptr_t interp_eval_breaker;
/* Avoid false sharing */
int64_t padding[7];
int recursion_limit;
Expand Down
3 changes: 2 additions & 1 deletion Include/internal/pycore_gc.h
Original file line number Diff line number Diff line change
Expand Up @@ -287,7 +287,8 @@ extern void _PySlice_ClearCache(_PyFreeListState *state);
extern void _PyDict_ClearFreeList(_PyFreeListState *state, int is_finalization);
extern void _PyAsyncGen_ClearFreeLists(_PyFreeListState *state, int is_finalization);
extern void _PyContext_ClearFreeList(_PyFreeListState *state, int is_finalization);
extern void _Py_ScheduleGC(PyInterpreterState *interp);
// Export for '_testinternalcapi' shared extension.
PyAPI_FUNC(void) _Py_ScheduleGC(PyThreadState *interp);
extern void _Py_RunGC(PyThreadState *tstate);

#ifdef __cplusplus
Expand Down
36 changes: 36 additions & 0 deletions Include/internal/pycore_pystate.h
Original file line number Diff line number Diff line change
Expand Up @@ -282,6 +282,42 @@ static inline _PyFreeListState* _PyFreeListState_GET(void)
#endif
}

/* Bits that can be set in PyThreadState.eval_breaker */
#define _PY_GIL_DROP_REQUEST_BIT (1U << 0)
#define _PY_SIGNALS_PENDING_BIT (1U << 1)
#define _PY_CALLS_TO_DO_BIT (1U << 2)
#define _PY_ASYNC_EXCEPTION_BIT (1U << 3)
#define _PY_GC_SCHEDULED_BIT (1U << 4)
#define _PY_EVAL_PLEASE_STOP_BIT (1U << 5)

/* Reserve a few bits for future use */
#define _PY_EVAL_EVENTS_BITS 8
#define _PY_EVAL_EVENTS_MASK ((1U << _PY_EVAL_EVENTS_BITS)-1)

static inline void
_PyThreadState_Signal(PyThreadState *tstate, uintptr_t bit)
{
_Py_atomic_or_uintptr(&tstate->eval_breaker, bit);
}

static inline void
_PyThreadState_Unsignal(PyThreadState *tstate, uintptr_t bit)
{
_Py_atomic_and_uintptr(&tstate->eval_breaker, ~bit);
}

static inline int
_PyThreadState_IsSignalled(PyThreadState *tstate, uintptr_t bit)
{
uintptr_t b = _Py_atomic_load_uintptr_relaxed(&tstate->eval_breaker);
return (b & bit) != 0;
}

// Free-threaded builds use these functions to set or unset a bit on all
// threads in the given interpreter.
void _PyInterpreterState_SignalAll(PyInterpreterState *interp, uintptr_t bit);
void _PyInterpreterState_UnsignalAll(PyInterpreterState *interp, uintptr_t bit);

#ifdef __cplusplus
}
#endif
Expand Down
3 changes: 3 additions & 0 deletions Include/internal/pycore_runtime.h
Original file line number Diff line number Diff line change
Expand Up @@ -191,7 +191,10 @@ typedef struct pyruntimestate {
int64_t next_id;
} interpreters;

/* Platform-specific identifier and PyThreadState, respectively, for the
main thread in the main interpreter. */
unsigned long main_thread;
PyThreadState *main_tstate;

/* ---------- IMPORTANT ---------------------------
The fields above this line are declared as early as
Expand Down
41 changes: 41 additions & 0 deletions Lib/test/test_gc.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,10 @@
from test.support.os_helper import temp_dir, TESTFN, unlink
from test.support.script_helper import assert_python_ok, make_script
from test.support import threading_helper
try:
import _testinternalcapi
except ImportError:
_testinternalcapi = None

import gc
import sys
Expand Down Expand Up @@ -1418,6 +1422,43 @@ def test_ast_fini(self):
assert_python_ok("-c", code)


class GCSchedulingTests(unittest.TestCase):
@unittest.skipIf(_testinternalcapi is None,
"Requires functions from _testinternalcapi")
@threading_helper.requires_working_threading()
def test_gc_schedule_before_thread_switch(self):
# Ensure that a scheduled collection is not lost due to thread
# switching. Most of the work happens in helper functions in
# _testinternalcapi.

class Cycle:
def __init__(self):
self._self = self

thresholds = gc.get_threshold()
gc.enable()

try:
state = _testinternalcapi.schedule_gc_new_state()

def thread1():
_testinternalcapi.schedule_gc_do_schedule(state)

gc.set_threshold(1)
threads = [threading.Thread(target=thread1)]
with threading_helper.start_threads(threads):
r = weakref.ref(Cycle())
_testinternalcapi.schedule_gc_do_wait(state)

# Ensure that at least one GC has happened
for i in range(5):
self.assertEqual(1, 1)
self.assertIsNone(r())
finally:
gc.disable()
gc.set_threshold(*thresholds)


def setUpModule():
global enabled, debug
enabled = gc.isenabled()
Expand Down
114 changes: 114 additions & 0 deletions Modules/_testinternalcapi.c
Original file line number Diff line number Diff line change
Expand Up @@ -1650,6 +1650,117 @@ get_rare_event_counters(PyObject *self, PyObject *type)
);
}

// The schedule_gc_* functions work together to test GC timing and the eval
// breaker, when used by
// test_gc.py:GCSchedulingTests.test_gc_schedule_before_thread_switch().
//
// The expected sequence of events is:
// - thread 2 waits for thread 1 to be ready
// - thread 1 waits for thread 2 to be ready
// (both threads are now at known locations in their respective C functions)
// - thread 1 clears out pending eval breaker flags
// - thread 2 checks that a GC is not scheduled
// - thread 1 schedules a GC and releases the GIL without checking its eval breaker
// - thread 2 checks that a GC is scheduled and returns
// - thread 1 sees that thread 2 is done and returns, allowing Python code to run again
typedef enum {
SCHEDULE_GC_INIT,
SCHEDULE_GC_THREAD1_READY,
SCHEDULE_GC_THREAD2_READY,
SCHEDULE_GC_THREAD1_CLEARED,
SCHEDULE_GC_THREAD2_VERIFIED,
SCHEDULE_GC_THREAD1_SCHEDULED,
SCHEDULE_GC_THREAD2_DONE,

SCHEDULE_GC_STOP,
} schedule_gc_state;

static void
schedule_gc_state_destructor(PyObject *capsule)
{
void *state = PyCapsule_GetPointer(capsule, NULL);
assert(state != NULL);
free(state);
}

static PyObject *
schedule_gc_new_state(PyObject *self, PyObject *Py_UNUSED(ignored))
{
schedule_gc_state *state = malloc(sizeof(schedule_gc_state));
if (state == NULL) {
PyErr_SetString(PyExc_RuntimeError, "Failed to allocate state");
return NULL;
}
*state = SCHEDULE_GC_INIT;
return PyCapsule_New(state, NULL, schedule_gc_state_destructor);
}

// Repeatedly release the GIL until the desired state appears in *state.
#define SCHEDULE_GC_WAIT_FOR(desired) \
do { \
while (*state != desired) { \
if (*state == SCHEDULE_GC_STOP) { \
Py_RETURN_NONE; \
} \
PyEval_RestoreThread(PyEval_SaveThread()); \
} \
} while (0)

static PyObject *
schedule_gc_do_schedule(PyObject *self, PyObject *capsule)
{
PyThreadState *tstate = PyThreadState_Get();
schedule_gc_state *state = PyCapsule_GetPointer(capsule, NULL);
assert(state != NULL);

*state = SCHEDULE_GC_THREAD1_READY;
SCHEDULE_GC_WAIT_FOR(SCHEDULE_GC_THREAD2_READY);

if (_Py_HandlePending(tstate) < 0) {
*state = SCHEDULE_GC_STOP;
return NULL;
}
*state = SCHEDULE_GC_THREAD1_CLEARED;
SCHEDULE_GC_WAIT_FOR(SCHEDULE_GC_THREAD2_VERIFIED);

_Py_ScheduleGC(tstate);
*state = SCHEDULE_GC_THREAD1_SCHEDULED;
SCHEDULE_GC_WAIT_FOR(SCHEDULE_GC_THREAD2_DONE);

Py_RETURN_NONE;
}

static PyObject *
schedule_gc_do_wait(PyObject *self, PyObject *capsule)
{
PyThreadState *tstate = PyThreadState_Get();
schedule_gc_state *state = PyCapsule_GetPointer(capsule, NULL);
assert(state != NULL);

SCHEDULE_GC_WAIT_FOR(SCHEDULE_GC_THREAD1_READY);

*state = SCHEDULE_GC_THREAD2_READY;
SCHEDULE_GC_WAIT_FOR(SCHEDULE_GC_THREAD1_CLEARED);

if (_PyThreadState_IsSignalled(tstate, _PY_GC_SCHEDULED_BIT)) {
PyErr_SetString(PyExc_AssertionError,
"GC_SCHEDULED_BIT unexpectedly set");
return NULL;
}
*state = SCHEDULE_GC_THREAD2_VERIFIED;
SCHEDULE_GC_WAIT_FOR(SCHEDULE_GC_THREAD1_SCHEDULED);

if (!_PyThreadState_IsSignalled(tstate, _PY_GC_SCHEDULED_BIT)) {
PyErr_SetString(PyExc_AssertionError,
"GC_SCHEDULED_BIT not carried over from thread 1");
return NULL;
}
*state = SCHEDULE_GC_THREAD2_DONE;
// Let the GC run naturally once we've returned to Python.

Py_RETURN_NONE;
}


#ifdef Py_GIL_DISABLED
static PyObject *
Expand Down Expand Up @@ -1727,6 +1838,9 @@ static PyMethodDef module_functions[] = {
_TESTINTERNALCAPI_TEST_LONG_NUMBITS_METHODDEF
{"get_type_module_name", get_type_module_name, METH_O},
{"get_rare_event_counters", get_rare_event_counters, METH_NOARGS},
{"schedule_gc_new_state", schedule_gc_new_state, METH_NOARGS},
{"schedule_gc_do_schedule", schedule_gc_do_schedule, METH_O},
{"schedule_gc_do_wait", schedule_gc_do_wait, METH_O},
#ifdef Py_GIL_DISABLED
{"py_thread_id", get_py_thread_id, METH_NOARGS},
#endif
Expand Down
10 changes: 3 additions & 7 deletions Modules/signalmodule.c
Original file line number Diff line number Diff line change
Expand Up @@ -276,11 +276,7 @@ trip_signal(int sig_num)
cleared in PyErr_CheckSignals() before .tripped. */
_Py_atomic_store_int(&is_tripped, 1);

/* Signals are always handled by the main interpreter */
PyInterpreterState *interp = _PyInterpreterState_Main();

/* Notify ceval.c */
_PyEval_SignalReceived(interp);
_PyEval_SignalReceived();

/* And then write to the wakeup fd *after* setting all the globals and
doing the _PyEval_SignalReceived. We used to write to the wakeup fd
Expand All @@ -303,6 +299,7 @@ trip_signal(int sig_num)

int fd = wakeup.fd;
if (fd != INVALID_FD) {
PyInterpreterState *interp = _PyInterpreterState_Main();
unsigned char byte = (unsigned char)sig_num;
#ifdef MS_WINDOWS
if (wakeup.use_send) {
Expand Down Expand Up @@ -1770,8 +1767,7 @@ PyErr_CheckSignals(void)
Python code to ensure signals are handled. Checking for the GC here
allows long running native code to clean cycles created using the C-API
even if it doesn't run the evaluation loop */
if (_Py_eval_breaker_bit_is_set(tstate->interp, _PY_GC_SCHEDULED_BIT)) {
_Py_set_eval_breaker_bit(tstate->interp, _PY_GC_SCHEDULED_BIT, 0);
if (_PyThreadState_IsSignalled(tstate, _PY_GC_SCHEDULED_BIT)) {
_Py_RunGC(tstate);
}

Expand Down
Loading

0 comments on commit bc613e4

Please sign in to comment.