Skip to content

Commit

Permalink
Fix hang on GC in Colorer._write()
Browse files Browse the repository at this point in the history
How to spot the problem visually:

 | [Main process] No output from workers. It seems that we hang. Send
 | SIGKILL to workers; exiting...

How to reproduce:

 | (Patch Python to trigger GC inside Colorer._write().)
 | $ diff -u /usr/lib/python3.9/multiprocessing/connection.py{.orig,}
 | --- /usr/lib/python3.9/multiprocessing/connection.py.orig
 | +++ /usr/lib/python3.9/multiprocessing/connection.py
 | @@ -202,6 +202,8 @@
 |              raise ValueError("size is negative")
 |          elif offset + size > n:
 |              raise ValueError("buffer length < offset + size")
 | +        import gc
 | +        gc.collect()
 |          self._send_bytes(m[offset:offset + size])
 |
 |      def send(self, obj):
 |
 | (Just in case, my tarantool version.)
 | $ ./src/tarantool --version | head -n 1
 | Tarantool 2.8.0-134-g81c663335
 |
 | (Add the reduced test case.)
 | $ cat test/xlog/test-run-hang-gh-qa-96.test.lua
 | test_run = require('test_run').new()
 | box.schema.user.grant('guest', 'replication')
 | test_run:cmd('create server replica with rpl_master=default, script="xlog/replica.lua"')
 | test_run:cmd('start server replica')
 | test_run:cmd('stop server replica')
 | test_run:cmd('cleanup server replica')
 | test_run:cmd('delete server replica')
 | box.schema.user.revoke('guest', 'replication')
 |
 | (Run the reduced test case.)
 | $ ./test/test-run.py xlog/test-run-hang-gh-qa-96.test.lua
 |
 | (Or run existing test with instance managing.)
 | $ ./test/test-run.py xlog/panic_on_broken_lsn.test.lua

The problem appears, when GC is triggered inside Colorer._write() (more
precisely, in multiprocessing.SimpleQueue#put()), and TarantoolServer
instance is collected. __del__() calls stop(), which calls color_log(),
which calls SimpleQueue#put(), which blocks on a lock. The process
stucks.

In fact, test-run should stop instances correctly without this __del__()
method. If it is not so, it is a bug in test-run, which should be fixed
anyway.

So, I just removed this __del__() method.

The problem looks related to [1], but it is unclear, whether it is the
only problem, so I'll leave the issue open for a while.

[1]: tarantool/tarantool-qa#96
  • Loading branch information
Totktonada committed Mar 19, 2021
1 parent 5941741 commit 53ce01c
Showing 1 changed file with 0 additions and 3 deletions.
3 changes: 0 additions & 3 deletions lib/tarantool_server.py
Original file line number Diff line number Diff line change
Expand Up @@ -658,9 +658,6 @@ def __init__(self, _ini=None, test_suite=None):
if 'test_run_current_test' in caller_globals.keys():
self.current_test = caller_globals['test_run_current_test']

def __del__(self):
self.stop()

@classmethod
def version(cls):
p = subprocess.Popen([cls.binary, "--version"], stdout=subprocess.PIPE)
Expand Down

0 comments on commit 53ce01c

Please sign in to comment.