A lot of hung test-run workers in CI #96
How to spot the problem visually:

| [Main process] No output from workers. It seems that we hang. Send
| SIGKILL to workers; exiting...

How to reproduce:

| (Patch Python to trigger GC inside Colorer._write().)
| $ diff -u /usr/lib/python3.9/multiprocessing/connection.py{.orig,}
| --- /usr/lib/python3.9/multiprocessing/connection.py.orig
| +++ /usr/lib/python3.9/multiprocessing/connection.py
| @@ -202,6 +202,8 @@
|              raise ValueError("size is negative")
|          elif offset + size > n:
|              raise ValueError("buffer length < offset + size")
| +        import gc
| +        gc.collect()
|          self._send_bytes(m[offset:offset + size])
|
|      def send(self, obj):
|
| (Just in case, my tarantool version.)
| $ ./src/tarantool --version | head -n 1
| Tarantool 2.8.0-134-g81c663335
|
| (Add the reduced test case.)
| $ cat test/xlog/test-run-hang-gh-qa-96.test.lua
| test_run = require('test_run').new()
| test_run:cmd('create server replica with rpl_master=default, script="xlog/replica.lua"')
| test_run:cmd('start server replica')
| test_run:cmd('cleanup server replica')
| test_run:cmd('delete server replica')
|
| (Run the reduced test case.)
| $ ./test/test-run.py xlog/test-run-hang-gh-qa-96.test.lua
|
| (Or run an existing test that manages instances.)
| $ ./test/test-run.py xlog/panic_on_broken_lsn.test.lua

The problem appears when GC is triggered inside Colorer._write() (more precisely, in multiprocessing.SimpleQueue#put()) and a TarantoolServer instance is collected. __del__() calls stop(), which calls color_log(), which calls SimpleQueue#put(), which blocks on a lock that the interrupted put() call already holds. The process gets stuck.

In fact, test-run should stop instances correctly without this __del__() method. If it does not, that is a bug in test-run, which should be fixed anyway. So, I just removed this __del__() method.

The problem looks related to [1], but it is unclear whether it is the only problem, so I'll leave the issue open for a while.

[1]: tarantool/tarantool-qa#96
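The call chain described above can be reproduced in miniature. The sketch below is not test-run code: `ToyQueue` and `ToyServer` are hypothetical stand-ins, and the lock is assumed to be a plain non-reentrant one. It also uses a non-blocking `acquire()` so that the demo reports the re-entry instead of actually hanging.

```python
import gc
import threading


class ToyQueue:
    """Hypothetical stand-in for a queue whose put() holds a plain lock."""

    def __init__(self):
        self._lock = threading.Lock()  # assumed non-reentrant
        self.items = []
        self.blocked_reentry = False

    def put(self, item, collect_inside=False):
        # Non-blocking acquire: a blocking acquire would hang here forever,
        # because the lock holder is the very call that was interrupted.
        if not self._lock.acquire(blocking=False):
            self.blocked_reentry = True
            return
        try:
            if collect_inside:
                gc.collect()  # finalizers run right here, with the lock held
            self.items.append(item)
        finally:
            self._lock.release()


class ToyServer:
    """Hypothetical stand-in for an object that logs from __del__()."""

    def __init__(self, queue):
        self.queue = queue
        self.me = self  # reference cycle, so only the cyclic GC frees it

    def __del__(self):
        self.queue.put("stop server")


def demo():
    was_enabled = gc.isenabled()
    gc.disable()  # keep automatic GC from collecting the server early
    try:
        queue = ToyQueue()
        ToyServer(queue)  # unreachable at once, but kept alive by its cycle
        queue.put("log line", collect_inside=True)
        return queue.blocked_reentry, queue.items
    finally:
        if was_enabled:
            gc.enable()
```

Calling `demo()` shows that the finalizer tried to take the lock while the outer `put()` still held it. With a blocking acquire, that attempt would never return.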
test-run supports three types of tests:

- tarantool - Test-Suite for Functional Testing
- app - Another functional Test-Suite
- unittest - Unit-Testing Test Suite

Patch adds tests for two of the supported test types:

- test-app for type 'app'
- test-tarantool for type 'tarantool'

How-to run:

$ make test_integration

- test-tarantool/panic_on_broken_lsn.test.lua [1]

1. tarantool/tarantool-qa#96
How to spot the problem visually:

| [Main process] No output from workers. It seems that we hang. Send
| SIGKILL to workers; exiting...

How to reproduce:

| (Patch Python to trigger GC inside Colorer._write().)
| $ diff -u /usr/lib/python3.9/multiprocessing/connection.py{.orig,}
| --- /usr/lib/python3.9/multiprocessing/connection.py.orig
| +++ /usr/lib/python3.9/multiprocessing/connection.py
| @@ -202,6 +202,8 @@
|              raise ValueError("size is negative")
|          elif offset + size > n:
|              raise ValueError("buffer length < offset + size")
| +        import gc
| +        gc.collect()
|          self._send_bytes(m[offset:offset + size])
|
|      def send(self, obj):
|
| (Just in case, my tarantool version.)
| $ ./src/tarantool --version | head -n 1
| Tarantool 2.8.0-134-g81c663335
|
| (Add the reduced test case.)
| $ cat test/xlog/test-run-hang-gh-qa-96.test.lua
| test_run = require('test_run').new()
| box.schema.user.grant('guest', 'replication')
| test_run:cmd('create server replica with rpl_master=default, script="xlog/replica.lua"')
| test_run:cmd('start server replica')
| test_run:cmd('stop server replica')
| test_run:cmd('cleanup server replica')
| test_run:cmd('delete server replica')
| box.schema.user.revoke('guest', 'replication')
|
| (Run the reduced test case.)
| $ ./test/test-run.py xlog/test-run-hang-gh-qa-96.test.lua
|
| (Or run an existing test that manages instances.)
| $ ./test/test-run.py xlog/panic_on_broken_lsn.test.lua

The problem appears when GC is triggered inside Colorer._write() (more precisely, in multiprocessing.SimpleQueue#put()) and a TarantoolServer instance is collected. __del__() calls stop(), which calls color_log(), which calls SimpleQueue#put(), which blocks on a lock that the interrupted put() call already holds. The process gets stuck.

In fact, test-run should stop instances correctly without this __del__() method. If it does not, that is a bug in test-run, which should be fixed anyway. So, I just removed this __del__() method.

The problem looks related to [1], but it is unclear whether it is the only problem, so I'll leave the issue open for a while.

[1]: tarantool/tarantool-qa#96
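The commit removes `__del__()` and relies on instances being stopped explicitly. One common Python way to guarantee such an explicit stop is a context manager. The sketch below only illustrates that idea under assumptions: `ManagedServer` and its methods are hypothetical and are not how test-run manages instances.

```python
class ManagedServer:
    """Hypothetical server wrapper that is stopped explicitly, never in __del__()."""

    def __init__(self, name, log):
        self.name = name
        self.log = log  # any list-like sink; stands in for real logging
        self.running = False

    def start(self):
        self.log.append("start " + self.name)
        self.running = True

    def stop(self):
        # Idempotent: safe to call even if the server was already stopped.
        if self.running:
            self.log.append("stop " + self.name)
            self.running = False

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        # Runs at a well-defined point in normal control flow, so logging
        # here cannot re-enter a call that the garbage collector interrupted.
        self.stop()
        return False  # do not swallow exceptions from the with-body


def run_once():
    log = []
    with ManagedServer("replica", log) as server:
        assert server.running
    return log
```

Here `run_once()` returns the start and stop events in order, and the stop happens even if the body of the `with` block raises.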
The list of fixed and known problems around hanging workers is below.
This update fixes a sporadic problem with hanging test-run workers. The reason is an incorrect garbage collector handler. See [1] for details.

This is not the last test-run problem that leads to a hung worker: there is at least one more known problem [2].

[1]: tarantool/test-run#275
[2]: tarantool/test-run#276

Part of tarantool/tarantool-qa#96
This update fixes a sporadic problem with hanging test-run workers. The reason is an incorrect garbage collector handler. See [1] for details.

This is not the last test-run problem that leads to a hung worker: there is at least one more known problem [2].

[1]: tarantool/test-run#275
[2]: tarantool/test-run#276

Part of tarantool/tarantool-qa#96

(cherry picked from commit 680990a)
@Totktonada this issue was supposedly resolved by tarantool/test-run#302. There's also tarantool/test-run#333 that we're currently working on. Can we close this particular issue (#96)?
Okay. Ideally, you would just link to the multivac results and show that everything is fine. However, multivac does not catch 'worker hang' situations AFAIR (and the results are not being updated now AFAIS).
We see a lot of situations in CI that look like hung workers. This looks like a problem in test-run or in the testing infrastructure, not in the tests themselves.
I propose to start debugging from test-run: either run workers under strace or instrument them with color_stdout() prints, and perform runs in CI to spot the case. Could we be reaching some resource limit (say, the number of file descriptors) and spinning while trying to acquire it? Could it be related to the recent transition to Python 3?
Example: https://github.com/tarantool/tarantool/runs/2126385328
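Besides strace and extra prints, a worker could be instrumented with Python's standard `faulthandler` and `resource` modules. The sketch below is only a suggestion, not something test-run is known to do. The helper names and the choice of `SIGUSR1` are assumptions, and it targets Unix-like systems (the descriptor count reads `/proc/self/fd`, which exists on Linux).

```python
import faulthandler
import os
import resource
import signal
import tempfile


def enable_stack_dump(log_file, signum=signal.SIGUSR1):
    """On `signum`, dump the Python stacks of all threads into log_file.

    The file must stay open for as long as the handler is registered.
    """
    faulthandler.register(signum, file=log_file, all_threads=True)


def fd_usage():
    """Return (open descriptor count or None, soft RLIMIT_NOFILE)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        open_fds = len(os.listdir("/proc/self/fd"))
    except OSError:
        open_fds = None  # no procfs on this platform
    return open_fds, soft


def self_check():
    """Register the handler, signal ourselves, and return what was dumped."""
    with tempfile.TemporaryFile() as log_file:
        enable_stack_dump(log_file)
        try:
            signal.raise_signal(signal.SIGUSR1)  # Python 3.8+
        finally:
            faulthandler.unregister(signal.SIGUSR1)
        log_file.seek(0)
        dump = log_file.read().decode("utf-8", "replace")
    open_fds, soft_limit = fd_usage()
    return dump, open_fds, soft_limit
```

With such a handler installed in each worker, sending the signal to a worker that looks hung (for example with `kill -USR1 <pid>`) would show where it is blocked. Logging `fd_usage()` periodically would confirm or rule out the descriptor-limit theory.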