From cb5adde9d399bdf478082a061739b78647752c85 Mon Sep 17 00:00:00 2001
From: Alexander Turenko
Date: Thu, 18 Mar 2021 22:00:55 +0300
Subject: [PATCH] Fix hang on GC in Colorer._write()

How to spot the problem visually:

| [Main process] No output from workers. It seems that we hang. Send
| SIGKILL to workers; exiting...

How to reproduce:

| (Patch Python to trigger GC inside Colorer._write().)
| $ diff -u /usr/lib/python3.9/multiprocessing/connection.py{.orig,}
| --- /usr/lib/python3.9/multiprocessing/connection.py.orig
| +++ /usr/lib/python3.9/multiprocessing/connection.py
| @@ -202,6 +202,8 @@
|              raise ValueError("size is negative")
|          elif offset + size > n:
|              raise ValueError("buffer length < offset + size")
| +        import gc
| +        gc.collect()
|          self._send_bytes(m[offset:offset + size])
|
|      def send(self, obj):
|
| (Just in case, my tarantool version.)
| $ ./src/tarantool --version | head -n 1
| Tarantool 2.8.0-134-g81c663335
|
| (Add the reduced test case.)
| $ cat test/xlog/test-run-hang-gh-qa-96.test.lua
| test_run = require('test_run').new()
| box.schema.user.grant('guest', 'replication')
| test_run:cmd('create server replica with rpl_master=default, script="xlog/replica.lua"')
| test_run:cmd('start server replica')
| test_run:cmd('stop server replica')
| test_run:cmd('cleanup server replica')
| test_run:cmd('delete server replica')
| box.schema.user.revoke('guest', 'replication')
|
| (Run the reduced test case.)
| $ ./test/test-run.py xlog/test-run-hang-gh-qa-96.test.lua
|
| (Or run an existing test that manages instances.)
| $ ./test/test-run.py xlog/panic_on_broken_lsn.test.lua

The problem appears when GC is triggered inside Colorer._write() (more
precisely, in multiprocessing.SimpleQueue#put()) and a TarantoolServer
instance is collected. __del__() calls stop(), which calls color_log(),
which calls SimpleQueue#put(), which blocks on a lock that the
interrupted put() call already holds. The process gets stuck.

In fact, test-run should stop instances correctly without this __del__()
method. If it does not, that is a bug in test-run, which should be fixed
anyway. So I just removed the __del__() method.

The problem looks related to [1], but it is unclear whether it is the
only problem, so I'll leave the issue open for a while.

[1]: https://github.com/tarantool/tarantool-qa/issues/96
---
 lib/tarantool_server.py | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/lib/tarantool_server.py b/lib/tarantool_server.py
index f611dbaf..45c5148c 100644
--- a/lib/tarantool_server.py
+++ b/lib/tarantool_server.py
@@ -658,9 +658,6 @@ def __init__(self, _ini=None, test_suite=None):
         if 'test_run_current_test' in caller_globals.keys():
             self.current_test = caller_globals['test_run_current_test']
 
-    def __del__(self):
-        self.stop()
-
     @classmethod
     def version(cls):
         p = subprocess.Popen([cls.binary, "--version"], stdout=subprocess.PIPE)
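
For illustration only, the re-entrancy described in the commit message can
be sketched in isolation. This is not test-run code: FakeQueue and
FakeServer are hypothetical stand-ins, and the sketch assumes that
SimpleQueue#put() guards its write with a non-reentrant lock. A
non-blocking acquire replaces the real blocking one, so the example reports
the would-be hang instead of hanging.

```python
import gc
import threading


class FakeQueue:
    """Hypothetical stand-in for a queue whose put() takes a plain lock."""

    def __init__(self):
        self._lock = threading.Lock()  # non-reentrant, as assumed above
        self.items = []
        self.blocked_puts = 0

    def put(self, item, collect=False):
        # A blocking acquire would hang forever on re-entry; a non-blocking
        # one lets the sketch count the event instead.
        if not self._lock.acquire(blocking=False):
            self.blocked_puts += 1
            return
        try:
            if collect:
                # Mimic the patched connection.py: run GC while the lock
                # is held.
                gc.collect()
            self.items.append(item)
        finally:
            self._lock.release()


class FakeServer:
    """Hypothetical stand-in for a server object that logs on stop()."""

    def __init__(self, log_queue):
        self.log_queue = log_queue
        self.cycle = self  # reference cycle: only the cyclic GC frees it

    def stop(self):
        self.log_queue.put('stopping')

    def __del__(self):
        self.stop()


gc.disable()  # keep automatic GC from collecting the server early
log_queue = FakeQueue()
FakeServer(log_queue)  # unreachable right away, kept alive by its cycle
log_queue.put('log line', collect=True)  # __del__() fires inside put()
gc.enable()

print(log_queue.items, log_queue.blocked_puts)
```

Here the finalizer's put() finds the lock already taken by the outer put(),
so blocked_puts ends up as 1 and only 'log line' is stored. With a blocking
acquire the same sequence would never return, which is why dropping
__del__() removes the hang.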