Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

python: Don't decrement a reference to a borrowed object #1420

Merged
merged 1 commit into from
Sep 16, 2024

Conversation

ac000
Copy link
Member

@ac000 ac000 commented Sep 12, 2024

On some Python 3.11 systems, 3.11.9 & 3.11.10, we were seeing a crash
triggered by Py_Finalize() in nxt_python_atexit() when running one of
our pytests, namely
test/test_python_factory.py::test_python_factory_invalid_callable_value

  2024/09/12 15:07:29 [alert] 5452#5452 factory "wsgi_invalid_callable" in module "wsgi" can not be called to fetch callable
  Fatal Python error: none_dealloc: deallocating None: bug likely caused by a refcount error in a C extension
  Python runtime state: finalizing (tstate=0x00007f560b88a718)

  Current thread 0x00007f560bde7ad0 (most recent call first):
    <no Python frame>
  2024/09/12 15:07:29 [alert] 5451#5451 app process 5452 exited on signal 6 (core dumped)

This was due to

obj = PyDict_GetItemString(PyModule_GetDict(module), callable);

in nxt_python_set_target() which returns a borrowed reference, then
due to the test meaning this is a None object we goto fail and call

Py_DECREF(obj);

which then causes Py_Finalize() to blow up.

The simple fix is to just increment its reference count before the goto fail.

Note: This problem only showed up under (the various versions of Python
we test on); 3.11.9 & 3.11.10. It doesn't show up under; 3.6, 3.7, 3.9,
3.10, 3.12

hongzhidao
hongzhidao previously approved these changes Sep 12, 2024
@callahad
Copy link
Collaborator

Now I have dumb questions about whether we're now leaking memory. Sounds like it should be balanced (given the decref) but if decref was only blowing up in some specific cases, that implies that in other cases the refcount was reasonable?

@ac000
Copy link
Member Author

ac000 commented Sep 12, 2024

Now I have dumb questions about whether we're now leaking memory. Sounds like it should be balanced (given the

The short answer is, no.

The object/variable is allocated once at module start. Any memory for it that may linger will be returned when the process exits, regardless of what the reference count may be, though we are exiting, we do try and clean up properly...

decref) but if decref was only blowing up in some specific cases, that implies that in other cases the refcount was reasonable?

Possibly for most Python versions the reference stuff could handle an imbalanced decref.

Let me poke at it with GDB and see if I can find what the reference count for this thing actually is at various points...

@ac000
Copy link
Member Author

ac000 commented Sep 13, 2024

Yes, apologies, this patch is just masking over the issue. In fact I've no idea why this even gets rid of the crash.

Looking with gdb(1) with the added incref in both cases (failure and non-failure) the reference count on the object goes to two, then to one after the decref. For whatever reason, the fact that it still has a reference when calling Py_Finalize() in the Python 3.11.9/10 case avoids the crash.

However, I believe the following is the real issue (though not 100% convinced it's all (or any of) our fault...)

Looking at the two important lines from the error output

2024/09/12 15:07:29 [alert] 5452#5452 factory "wsgi_invalid_callable" in module "wsgi" can not be called to fetch callable
Fatal Python error: none_dealloc: deallocating None: bug likely caused by a refcount error in a C extension

The first line shows we are tripping up in the following code in src/python/nxt_python.c::nxt_python_set_target()

463     if (is_factory) {                                                       
464         if (nxt_slow_path(PyCallable_Check(obj) == 0)) {                    
465             nxt_alert(task,                                                 
466                       "factory \"%s\" in module \"%s\" "                    
467                       "can not be called to fetch callable",                
468                       callable, module_name);                               
469             goto fail;                                                      
470         }

That is due to

    19	wsgi_invalid_callable = None

from test/python/factory/wsgi.py

The important thing here is that obj is a _PyNone_Type object and doing a decref on it is what seems to lead to the crash.

The fail label takes us to

500 fail:                                                                       
501                                                                             
502     Py_XDECREF(obj);                                                        
503     Py_XDECREF(module);
504 
505     return NXT_ERROR;
506 }

If we comment out the Py_XDECREF(obj);, no crash.

That marries up with the second error line

Fatal Python error: none_dealloc: deallocating None: bug likely caused by a refcount error in a C extension

This is also demonstrable with the following test program

/*
 * py-none-decref-test.c - Demonstrate decref on Py_None issue
 *
 * gcc $(python3-config --embed --cflags) -Wall -Wextra -O0 -g -o py-none-decref-test py-none-decref-test.c $(python3-config --embed --libs --ldflags)
 */

#define PY_SSIZE_T_CLEAN
#include <Python.h>                                                             

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	int opt;
	bool decref = false;
	PyObject *obj;

	while ((opt = getopt(argc, argv, "d")) != -1) {
		switch (opt) {
		case 'd':
			decref = true;
			break;
		default:
			fprintf(stderr, "Usage: %s [-d]\n", argv[0]);
			exit(EXIT_FAILURE);
		}
	}

	Py_InitializeEx(0);

	obj = Py_None;
	if (decref)
		Py_DECREF(obj);

	Py_Finalize();

	exit(EXIT_SUCCESS);
}

With Python 3.12.5

$ ./py-none-decref-test 
$ ./py-none-decref-test -d
$

No issues.

With Python 3.11.10

$ ./py-none-decref-test
$ ./py-none-decref-test -d
Fatal Python error: none_dealloc: deallocating None: bug likely caused by a refcount error in a C extension
Python runtime state: finalizing (tstate=0x00007f70a7e8a718)

Current thread 0x00007f70a80ddb08 (most recent call first):
  <no Python frame>
Aborted (core dumped)
$

And there's our error message.

(Just to be clear, the crash is happening from the call to Py_Finalize())

After

     34         obj = Py_None;                                                                                                    

obj looks like

(gdb) p *obj
$1 = {
  ob_refcnt = 3662,
  ob_type = 0x7ffff7e59cc0 <_PyNone_Type>
}

Notice the somewhat high reference count.

However not as high as it is under 3.12

(gdb) p *obj
$1 = {
  {
    ob_refcnt = 4294967295,
    ob_refcnt_split = {4294967295, 0}
  },
  ob_type = 0x7ffff7d56560 <_PyNone_Type>
}

3.12 introduced the idea of immortality for certain often used objects including None. We also have some extra refcnt tracking...

With Python 3.11.10 after

 36                 Py_DECREF(obj);                                             
(gdb) p *obj
$2 = {
  ob_refcnt = 3661,
  ob_type = 0x7ffff7e59cc0 <_PyNone_Type>
}

OK, it decremented the refcnt, but wasn't happy doing so...

With Python 3.12.5

(gdb) p *obj
$2 = {
  {
    ob_refcnt = 4294967295,
    ob_refcnt_split = {4294967295, 0}
  },
  ob_type = 0x7ffff7d56560 <_PyNone_Type>
}

No effect...

With Python 3.7.3

$ ./py-none-decref-test 
$ ./py-none-decref-test -d
$

Before the decref

(gdb) p *obj
$3 = {
  ob_refcnt = 4212, 
  ob_type = 0x7ffff7f3a500 <_PyNone_Type>
}

After...

$4 = {
  ob_refcnt = 4211, 
  ob_type = 0x7ffff7f3a500 <_PyNone_Type>
}

Decremented it, but didn't cause an issue...

I currently have two thoughts...

  1. Leave it, it's a bug in Python, it only effects two of the many versions we test on. It only happens at module startup and only if you're doing something obviously wrong.
  2. We perhaps shouldn't be trying to decref None objects, I'd like to find a definitive answer on that, (though most of the versions we test on handle it) and add a check for None objects.

This does then beg further questions

  1. Do we just add a check in the know problem location
  2. Do we add checks all over the place
  3. Do we wrap Py_XDECREF() and just do the check in there

@ac000
Copy link
Member Author

ac000 commented Sep 13, 2024

Oh, and the crash isn't due to a NULL pointer dereference or anything like that, it's simply due to Python calling abort(3)...

(gdb) bt
#0  __restore_sigs (set=set@entry=0x7ffef6ab8760)
    at ./arch/x86_64/syscall_arch.h:40
#1  0x00007fe47cd71702 in raise (sig=sig@entry=6) at src/signal/raise.c:11
#2  0x00007fe47cd40be8 in abort () at src/exit/abort.c:11
#3  0x00007fe47c8eb1b5 in fatal_error_exit (status=-1)
    at Python/pylifecycle.c:2626
#4  fatal_error (fd=fd@entry=2, header=header@entry=0, 
    prefix=prefix@entry=0x0, msg=msg@entry=0x0, status=status@entry=-1)
    at Python/pylifecycle.c:2737
#5  0x00007fe47c8eb9ed in _Py_FatalErrorFormat (
    func=0x7fe47cabcf08 <__func__.3.lto_priv.0> "none_dealloc", 
    format=0x7fe47ca628f0 "%s: bug likely caused by a refcount error in a C extension") at Python/pylifecycle.c:2857
...

@ac000 ac000 dismissed hongzhidao’s stale review September 13, 2024 13:15

Entirely my fault. This patch is plain wrong.

@ac000
Copy link
Member Author

ac000 commented Sep 13, 2024

After chatting on the #python IRC channel, it seems we should not be DECREFing without having done an INCREF, even though this None object has a way above 0 refcnt to start with.

Now to see the best way to deal with this...

@ac000
Copy link
Member Author

ac000 commented Sep 13, 2024

So in our actual case the problem is that we do

451     obj = PyDict_GetItemString(PyModule_GetDict(module), callable);

PyDict_GetItemString() returns a borrowed reference. On which we then end up doing

502     Py_XDECREF(obj);

Which then blows up in Py_Finalize().

@ac000
Copy link
Member Author

ac000 commented Sep 13, 2024

I think this is the simplest solution

diff --git a/src/python/nxt_python.c b/src/python/nxt_python.c
index 7bbf3d49..c7ba0a90 100644
--- a/src/python/nxt_python.c
+++ b/src/python/nxt_python.c
@@ -462,6 +462,7 @@ nxt_python_set_target(nxt_task_t *task, nxt_python_target_t *target,
                       "factory \"%s\" in module \"%s\" "
                       "can not be called to fetch callable",
                       callable, module_name);
+            Py_INCREF(obj);    /* borrowed reference */
             goto fail;
         }

On some Python 3.11 systems, 3.11.9 & 3.11.10, we were seeing a crash
triggered by Py_Finalize() in nxt_python_atexit() when running one of
our pytests, namely
test/test_python_factory.py::test_python_factory_invalid_callable_value

  2024/09/12 15:07:29 [alert] 5452#5452 factory "wsgi_invalid_callable" in module "wsgi" can not be called to fetch callable
  Fatal Python error: none_dealloc: deallocating None: bug likely caused by a refcount error in a C extension
  Python runtime state: finalizing (tstate=0x00007f560b88a718)

  Current thread 0x00007f560bde7ad0 (most recent call first):
    <no Python frame>
  2024/09/12 15:07:29 [alert] 5451#5451 app process 5452 exited on signal 6 (core dumped)

This was due to

  obj = PyDict_GetItemString(PyModule_GetDict(module), callable);

in nxt_python_set_target() which returns a *borrowed* reference, then
due to the test meaning this is a `None` object we `goto fail` and call

  Py_DECREF(obj);

which then causes `Py_Finalize()` to blow up.

The simple fix is to just increment its reference count before the `goto
fail`.

Note: This problem only showed up under (the various versions of Python
we test on); 3.11.9 & 3.11.10. It doesn't show up under; 3.6, 3.7, 3.9,
3.10, 3.12

Cc: Konstantin Pavlov <[email protected]>
Closes: nginx#1413
Fixes: a9aa9e7 ("python: Support application factories")
Signed-off-by: Andrew Clayton <[email protected]>
@ac000 ac000 changed the title python: Add a missing Py_INCREF() in nxt_python_start() python: Don't decrement a reference to a borrowed object Sep 13, 2024
@ac000
Copy link
Member Author

ac000 commented Sep 13, 2024

Re-work patch for proper fix

$ git range-diff --creation-factor=150 c9860dc2...50b1aca3
1:  c9860dc2 ! 1:  50b1aca3 python: Add a missing Py_INCREF() in nxt_python_start()
    @@ Metadata
     Author: Andrew Clayton <[email protected]>
     
      ## Commit message ##
    -    python: Add a missing Py_INCREF() in nxt_python_start()
    +    python: Don't decrement a reference to a borrowed object
     
         On some Python 3.11 systems, 3.11.9 & 3.11.10, we were seeing a crash
    -    triggered by Py_Finalize() in nxt_python_start() when running one of our
    -    pytests, namely
    +    triggered by Py_Finalize() in nxt_python_atexit() when running one of
    +    our pytests, namely
         test/test_python_factory.py::test_python_factory_invalid_callable_value
     
           2024/09/12 15:07:29 [alert] 5452#5452 factory "wsgi_invalid_callable" in module "wsgi" can not be called to fetch callable
1:  c9860dc2 ! 1:  50b1aca3 python: Add a missing Py_INCREF() in nxt_python_star
t()
    @@ Metadata
     Author: Andrew Clayton <[email protected]>
     
      ## Commit message ##
    -    python: Add a missing Py_INCREF() in nxt_python_start()
    +    python: Don't decrement a reference to a borrowed object
     
         On some Python 3.11 systems, 3.11.9 & 3.11.10, we were seeing a crash
    -    triggered by Py_Finalize() in nxt_python_start() when running one of ou
r
    -    pytests, namely
    +    triggered by Py_Finalize() in nxt_python_atexit() when running one of
    +    our pytests, namely
         test/test_python_factory.py::test_python_factory_invalid_callable_value
     
           2024/09/12 15:07:29 [alert] 5452#5452 factory "wsgi_invalid_callable"
 in module "wsgi" can not be called to fetch callable
    @@ Commit message
             <no Python frame>
           2024/09/12 15:07:29 [alert] 5451#5451 app process 5452 exited on signal 6 (core dumped)
     
    -    It seems this was caused by
    +    This was due to
     
    -      Py_XDECREF(nxt_py_stderr_flush);
    +      obj = PyDict_GetItemString(PyModule_GetDict(module), callable);
     
    -    in nxt_python_atexit().
    +    in nxt_python_set_target() which returns a *borrowed* reference, then
    +    due to the test meaning this is a `None` object we `goto fail` and call
     
    -    We never explicitly INCREF() this variable (though there are plenty of
    -    other cases of doing DECREF()'s with no previous INCREF()'s). However in
    -    this case adding an INCREF() for this variable does fix the crash.
    +      Py_DECREF(obj);
     
    -    I'm not sure we really need to reference count this variable, it's only
    -    used in nxt_python_print_exception(), but it's perhaps the safest course
    -    of action.
    +    which then causes `Py_Finalize()` to blow up.
    +
    +    The simple fix is to just increment its reference count before the `goto
    +    fail`.
    +
    +    Note: This problem only showed up under (the various versions of Python
    +    we test on); 3.11.9 & 3.11.10. It doesn't show up under; 3.6, 3.7, 3.9,
    +    3.10, 3.12
     
         Cc: Konstantin Pavlov <[email protected]>
         Closes: https://github.com/nginx/unit/issues/1413
    +    Fixes: a9aa9e76d ("python: Support application factories")
         Signed-off-by: Andrew Clayton <[email protected]>
     
      ## src/python/nxt_python.c ##
    -@@ src/python/nxt_python.c: nxt_python_start(nxt_task_t *task, nxt_process_data_t *data)
    -         goto fail;
    -     }
    +@@ src/python/nxt_python.c: nxt_python_set_target(nxt_task_t *task, nxt_python_target_t *target,
    +                       "factory \"%s\" in module \"%s\" "
    +                       "can not be called to fetch callable",
    +                       callable, module_name);
    ++            Py_INCREF(obj);     /* borrowed reference */
    +             goto fail;
    +         }
      
    -+    Py_INCREF(nxt_py_stderr_flush);
    -+
    -     if (nxt_slow_path(nxt_python_set_path(task, c->path) != NXT_OK)) {
    -         goto fail;
    -     }

@ac000 ac000 merged commit 50b1aca into nginx:master Sep 16, 2024
25 checks passed
@ac000 ac000 deleted the pyfix branch September 16, 2024 15:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Current git HEAD tests fail on FreeBSD 14 / Python 3.11.9: test_python_factory_invalid_callable_value
3 participants