A lot of hung test-run workers in CI #96

Totktonada · 2021-03-17T00:00:05Z

We see a lot of situations in CI that look like hung workers. This looks like a problem in test-run or the testing infrastructure, not in the tests themselves.

I propose to start debugging from test-run: either run workers under strace or instrument them with color_stdout() prints, then perform runs in CI to catch the case. Could we be hitting some resource limit (say, the number of file descriptors) and spinning while trying to acquire it?
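One cheap way to check the file descriptor hypothesis is to log how many descriptors a worker holds against its soft limit. The sketch below is only an illustration: `fd_usage()` is a hypothetical helper, not part of test-run, and it assumes a system that exposes `/proc/self/fd` or `/dev/fd`.

```python
import os
import resource


def fd_usage():
    """Return (open_fds, soft_limit) for the current process.

    Hypothetical helper for illustration; not a test-run function.
    """
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    # /proc/self/fd is Linux-specific; /dev/fd is the closest equivalent elsewhere.
    fd_dir = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"
    # The count includes the descriptor that listdir() itself opens briefly.
    open_fds = len(os.listdir(fd_dir))
    return open_fds, soft_limit


if __name__ == "__main__":
    used, limit = fd_usage()
    print("open fds: {}, soft limit: {}".format(used, limit))
```

Printing this from a worker at regular intervals would show whether the count creeps toward the limit before a hang.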

Could it be related to the recent transition to Python 3?

Example: https://github.com/tarantool/tarantool/runs/2126385328

How to spot the problem visually:

| [Main process] No output from workers. It seems that we hang. Send
| SIGKILL to workers; exiting...

How to reproduce:

| (Patch Python to trigger GC inside Colorer._write().)
| $ diff -u /usr/lib/python3.9/multiprocessing/connection.py{.orig,}
| --- /usr/lib/python3.9/multiprocessing/connection.py.orig
| +++ /usr/lib/python3.9/multiprocessing/connection.py
| @@ -202,6 +202,8 @@
|              raise ValueError("size is negative")
|          elif offset + size > n:
|              raise ValueError("buffer length < offset + size")
| +        import gc
| +        gc.collect()
|          self._send_bytes(m[offset:offset + size])
|
|      def send(self, obj):
|
| (Just in case, my tarantool version.)
| $ ./src/tarantool --version | head -n 1
| Tarantool 2.8.0-134-g81c663335
|
| (Add the reduced test case.)
| $ cat test/xlog/test-run-hang-gh-qa-96.test.lua
| test_run = require('test_run').new()
| test_run:cmd('create server replica with rpl_master=default, script="xlog/replica.lua"')
| test_run:cmd('start server replica')
| test_run:cmd('cleanup server replica')
| test_run:cmd('delete server replica')
|
| (Run the reduced test case.)
| $ ./test/test-run.py xlog/test-run-hang-gh-qa-96.test.lua
|
| (Or run existing test with instance managing.)
| $ ./test/test-run.py xlog/panic_on_broken_lsn.test.lua

The problem appears when GC is triggered inside Colorer._write() (more precisely, in multiprocessing.SimpleQueue#put()) and a TarantoolServer instance is collected. __del__() calls stop(), which calls color_log(), which calls SimpleQueue#put(), which blocks on a lock that the interrupted put() call still holds. The process gets stuck.

In fact, test-run should stop instances correctly without this __del__() method. If it does not, that is a bug in test-run, which should be fixed anyway. So I just removed this __del__() method.

The problem looks related to [1], but it is unclear whether it is the only problem, so I'll leave the issue open for a while.

[1]: tarantool/tarantool-qa#96

test-run supports three types of tests:

- tarantool - Test-Suite for Functional Testing
- app - Another functional Test-Suite
- unittest - Unit-Testing Test Suite

Patch adds tests for two of supported test types:

- test-app for type 'app'
- test-tarantool for type 'tarantool'

How-to run:

$ make test_integration

- test-tarantool/panic_on_broken_lsn.test.lua [1]

1. tarantool/tarantool-qa#96

How to spot the problem visually:

| [Main process] No output from workers. It seems that we hang. Send
| SIGKILL to workers; exiting...

How to reproduce:

| (Patch Python to trigger GC inside Colorer._write().)
| $ diff -u /usr/lib/python3.9/multiprocessing/connection.py{.orig,}
| --- /usr/lib/python3.9/multiprocessing/connection.py.orig
| +++ /usr/lib/python3.9/multiprocessing/connection.py
| @@ -202,6 +202,8 @@
|              raise ValueError("size is negative")
|          elif offset + size > n:
|              raise ValueError("buffer length < offset + size")
| +        import gc
| +        gc.collect()
|          self._send_bytes(m[offset:offset + size])
|
|      def send(self, obj):
|
| (Just in case, my tarantool version.)
| $ ./src/tarantool --version | head -n 1
| Tarantool 2.8.0-134-g81c663335
|
| (Add the reduced test case.)
| $ cat test/xlog/test-run-hang-gh-qa-96.test.lua
| test_run = require('test_run').new()
| box.schema.user.grant('guest', 'replication')
| test_run:cmd('create server replica with rpl_master=default, script="xlog/replica.lua"')
| test_run:cmd('start server replica')
| test_run:cmd('stop server replica')
| test_run:cmd('cleanup server replica')
| test_run:cmd('delete server replica')
| box.schema.user.revoke('guest', 'replication')
|
| (Run the reduced test case.)
| $ ./test/test-run.py xlog/test-run-hang-gh-qa-96.test.lua
|
| (Or run existing test with instance managing.)
| $ ./test/test-run.py xlog/panic_on_broken_lsn.test.lua

The problem appears when GC is triggered inside Colorer._write() (more precisely, in multiprocessing.SimpleQueue#put()) and a TarantoolServer instance is collected. __del__() calls stop(), which calls color_log(), which calls SimpleQueue#put(), which blocks on a lock that the interrupted put() call still holds. The process gets stuck.

In fact, test-run should stop instances correctly without this __del__() method. If it does not, that is a bug in test-run, which should be fixed anyway. So I just removed this __del__() method.

The problem looks related to [1], but it is unclear whether it is the only problem, so I'll leave the issue open for a while.

[1]: tarantool/tarantool-qa#96
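The re-entrancy described in the commit message can be shown in isolation. This is a minimal sketch, not test-run code: `FakeQueue` and `FakeServer` are hypothetical stand-ins for multiprocessing.SimpleQueue and TarantoolServer. A `threading.Lock` with a timeout replaces the real queue lock so that the script terminates instead of hanging.

```python
import gc
import threading


class FakeQueue:
    """Hypothetical stand-in for multiprocessing.SimpleQueue."""

    def __init__(self):
        self._lock = threading.Lock()  # non-reentrant, like the queue's write lock
        self.items = []
        self.would_hang = False

    def put(self, item, collect_inside=False):
        # The real put() waits forever; a timeout keeps this sketch finite.
        if not self._lock.acquire(timeout=0.2):
            self.would_hang = True
            return
        try:
            if collect_inside:
                gc.collect()  # simulate GC firing while the lock is held
            self.items.append(item)
        finally:
            self._lock.release()


class FakeServer:
    """Hypothetical stand-in for TarantoolServer with a logging finalizer."""

    def __init__(self, queue):
        self.queue = queue
        self.me = self  # reference cycle: only the cyclic GC can free it

    def __del__(self):
        self.queue.put("stopping server")  # re-enters put() from inside GC


def demo():
    gc.disable()  # make sure collection happens only where we ask for it
    try:
        queue = FakeQueue()
        FakeServer(queue)  # immediately unreachable, waits for the GC
        queue.put("log line", collect_inside=True)
        return queue.would_hang, queue.items
    finally:
        gc.enable()


if __name__ == "__main__":
    print(demo())
```

Here `demo()` returns `(True, ['log line'])`: the nested put() issued from `__del__()` could not take the lock held by the outer put(). Without the timeout, that nested call would wait forever, which matches the hang described above.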


Totktonada · 2021-03-22T22:00:19Z

The list of fixed and known problems around hanging workers is below.

This update fixes a sporadic problem with hanging test-run workers. The reason is an incorrect garbage collector handler. See [1] for details.

This is not the last test-run problem that leads to a hung worker: there is at least one more known problem [2].

[1]: tarantool/test-run#275
[2]: tarantool/test-run#276

Part of tarantool/tarantool-qa#96

This update fixes a sporadic problem with hanging test-run workers. The reason is an incorrect garbage collector handler. See [1] for details.

This is not the last test-run problem that leads to a hung worker: there is at least one more known problem [2].

[1]: tarantool/test-run#275
[2]: tarantool/test-run#276

Part of tarantool/tarantool-qa#96

(cherry picked from commit 680990a)

NickVolynkin · 2022-06-01T12:17:37Z

@Totktonada this issue was supposedly resolved by tarantool/test-run#302. There's also tarantool/test-run#333 that we're currently working on. Can we close this particular issue (#96)?

Totktonada · 2022-06-01T13:54:13Z

Okay. Ideally, you would just link to multivac results and show that everything is fine. However, multivac does not catch 'worker hang' situations AFAIR (and the results are not being updated now, AFAIS).

Totktonada added the test-run label Mar 17, 2021

Totktonada mentioned this issue Mar 18, 2021

Fix hang on GC in Colorer._write() tarantool/test-run#275

Merged

kyukhin added the teamQ label Apr 15, 2021

kyukhin added this to the wishlist milestone Oct 15, 2021

Totktonada closed this as completed Jun 1, 2022
