A lot of hung test-run workers in CI #96

Closed
Totktonada opened this issue Mar 17, 2021 · 3 comments

@Totktonada
Member

We see a lot of situations in CI that look like hung workers. This looks like a problem in test-run or the testing infrastructure, not in the tests themselves.

I propose to start debugging from test-run: either run workers under strace or instrument them with color_stdout() prints and perform runs in CI to spot the case. Can we reach some resource limit (say, number of file descriptors) and spin trying to acquire it?
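
One way to check the file descriptor hypothesis is to log how many descriptors a worker holds against its soft limit. The snippet below is a minimal sketch, not part of test-run: `fd_usage()` is a hypothetical helper, and counting entries in `/proc/self/fd` is a Linux-specific assumption (the helper returns `None` for the count elsewhere).

```python
import os
import resource


def fd_usage():
    """Return (open_fd_count, soft_limit) for the current process.

    open_fd_count is None when /proc/self/fd is unavailable
    (this path is a Linux-specific assumption).
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        # Each entry in /proc/self/fd is one open descriptor.
        open_fds = len(os.listdir("/proc/self/fd"))
    except OSError:
        open_fds = None
    return open_fds, soft


if __name__ == "__main__":
    open_fds, soft = fd_usage()
    print("open fds: {}, soft limit: {}".format(open_fds, soft))
```

Calling such a helper periodically from a worker and printing the result would show whether the count creeps toward the limit before a hang.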

Could it be related to the recent transition to Python 3?

Example: https://github.com/tarantool/tarantool/runs/2126385328

Totktonada added a commit to tarantool/test-run that referenced this issue Mar 18, 2021
How to spot the problem visually:

 | [Main process] No output from workers. It seems that we hang. Send
 | SIGKILL to workers; exiting...

How to reproduce:

 | (Patch Python to trigger GC inside Colorer._write().)
 | $ diff -u /usr/lib/python3.9/multiprocessing/connection.py{.orig,}
 | --- /usr/lib/python3.9/multiprocessing/connection.py.orig
 | +++ /usr/lib/python3.9/multiprocessing/connection.py
 | @@ -202,6 +202,8 @@
 |              raise ValueError("size is negative")
 |          elif offset + size > n:
 |              raise ValueError("buffer length < offset + size")
 | +        import gc
 | +        gc.collect()
 |          self._send_bytes(m[offset:offset + size])
 |
 |      def send(self, obj):
 |
 | (Just in case, my tarantool version.)
 | $ ./src/tarantool --version | head -n 1
 | Tarantool 2.8.0-134-g81c663335
 |
 | (Add the reduced test case.)
 | $ cat test/xlog/test-run-hang-gh-qa-96.test.lua
 | test_run = require('test_run').new()
 | test_run:cmd('create server replica with rpl_master=default, script="xlog/replica.lua"')
 | test_run:cmd('start server replica')
 | test_run:cmd('cleanup server replica')
 | test_run:cmd('delete server replica')
 |
 | (Run the reduced test case.)
 | $ ./test/test-run.py xlog/test-run-hang-gh-qa-96.test.lua
 |
 | (Or run existing test with instance managing.)
 | $ ./test/test-run.py xlog/panic_on_broken_lsn.test.lua

The problem appears when GC is triggered inside Colorer._write() (more
precisely, in multiprocessing.SimpleQueue#put()) and a TarantoolServer
instance is collected. __del__() calls stop(), which calls color_log(),
which calls SimpleQueue#put() again, which blocks on the lock that the
interrupted put() already holds. The process gets stuck.

In fact, test-run should stop instances correctly without this __del__()
method. If it is not so, it is a bug in test-run, which should be fixed
anyway.

So, I just removed this __del__() method.

The problem looks related to [1], but it is unclear whether it is the
only problem, so I'll leave the issue open for a while.

[1]: tarantool/tarantool-qa#96
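
The deadlock described in the commit message above can be reproduced in miniature without test-run or multiprocessing. This is a sketch under stated assumptions: `LogQueue` and `Server` are hypothetical stand-ins for `SimpleQueue` and `TarantoolServer`, a plain `threading.Lock` plays the role of the queue's write lock, and the acquire uses a timeout so the demo terminates instead of hanging as the real worker would.

```python
import gc
import threading


class LogQueue:
    """Stand-in for a queue whose put() takes a non-reentrant lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.items = []
        self.blocked_puts = 0

    def put(self, item, trigger_gc=False):
        # A blocking acquire would hang forever on re-entry from the
        # same thread; the timeout only exists to let the demo finish.
        if not self._lock.acquire(timeout=0.1):
            self.blocked_puts += 1
            return False
        try:
            if trigger_gc:
                gc.collect()  # mimics GC firing in the middle of put()
            self.items.append(item)
            return True
        finally:
            self._lock.release()


class Server:
    """Stand-in for an object that logs from its finalizer."""

    def __init__(self, log):
        self.log = log
        self.cycle = self  # reference cycle: only the cyclic GC frees it

    def __del__(self):
        # Runs inside gc.collect(), i.e. while put() holds the lock.
        self.log.put("stopping server")


def demo():
    was_enabled = gc.isenabled()
    gc.disable()  # keep the automatic collector from freeing the cycle early
    try:
        log = LogQueue()
        Server(log)  # unreachable right away, kept alive by its cycle
        log.put("test output", trigger_gc=True)
        return log.blocked_puts, log.items
    finally:
        if was_enabled:
            gc.enable()


if __name__ == "__main__":
    print(demo())
```

The nested `put()` issued from `__del__()` is the call that fails to take the lock; with a blocking acquire, as in the real queue, it would never return.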
ligurio added a commit to tarantool/test-run that referenced this issue Mar 19, 2021
test-run supports three types of tests:

- tarantool - Test-Suite for Functional Testing
- app - Another functional Test-Suite
- unittest - Unit-Testing Test Suite

The patch adds tests for two of the supported test types:

- test-app for type 'app'
- test-tarantool for type 'tarantool'

How to run:

$ make test_integration

- test-tarantool/panic_on_broken_lsn.test.lua [1]

1. tarantool/tarantool-qa#96
ligurio pushed a commit to tarantool/test-run that referenced this issue Mar 19, 2021
Totktonada added a commit to tarantool/test-run that referenced this issue Mar 19, 2021
How to spot the problem visually:

 | [Main process] No output from workers. It seems that we hang. Send
 | SIGKILL to workers; exiting...

How to reproduce:

 | (Patch Python to trigger GC inside Colorer._write().)
 | $ diff -u /usr/lib/python3.9/multiprocessing/connection.py{.orig,}
 | --- /usr/lib/python3.9/multiprocessing/connection.py.orig
 | +++ /usr/lib/python3.9/multiprocessing/connection.py
 | @@ -202,6 +202,8 @@
 |              raise ValueError("size is negative")
 |          elif offset + size > n:
 |              raise ValueError("buffer length < offset + size")
 | +        import gc
 | +        gc.collect()
 |          self._send_bytes(m[offset:offset + size])
 |
 |      def send(self, obj):
 |
 | (Just in case, my tarantool version.)
 | $ ./src/tarantool --version | head -n 1
 | Tarantool 2.8.0-134-g81c663335
 |
 | (Add the reduced test case.)
 | $ cat test/xlog/test-run-hang-gh-qa-96.test.lua
 | test_run = require('test_run').new()
 | box.schema.user.grant('guest', 'replication')
 | test_run:cmd('create server replica with rpl_master=default, script="xlog/replica.lua"')
 | test_run:cmd('start server replica')
 | test_run:cmd('stop server replica')
 | test_run:cmd('cleanup server replica')
 | test_run:cmd('delete server replica')
 | box.schema.user.revoke('guest', 'replication')
 |
 | (Run the reduced test case.)
 | $ ./test/test-run.py xlog/test-run-hang-gh-qa-96.test.lua
 |
 | (Or run existing test with instance managing.)
 | $ ./test/test-run.py xlog/panic_on_broken_lsn.test.lua

The problem appears when GC is triggered inside Colorer._write() (more
precisely, in multiprocessing.SimpleQueue#put()) and a TarantoolServer
instance is collected. __del__() calls stop(), which calls color_log(),
which calls SimpleQueue#put() again, which blocks on the lock that the
interrupted put() already holds. The process gets stuck.

In fact, test-run should stop instances correctly without this __del__()
method. If it is not so, it is a bug in test-run, which should be fixed
anyway.

So, I just removed this __del__() method.

The problem looks related to [1], but it is unclear whether it is the
only problem, so I'll leave the issue open for a while.

[1]: tarantool/tarantool-qa#96
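
The commit message above argues that instances should be stopped explicitly rather than from `__del__()`. A minimal sketch of that idea follows; the `Server` class and `run_test()` function are hypothetical illustrations and do not reflect how test-run is actually structured.

```python
class Server:
    """Hypothetical server handle with explicit start/stop, no __del__()."""

    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.running = False

    def start(self):
        self.running = True
        self.log.append("start " + self.name)

    def stop(self):
        # Idempotent: safe to call even if the server never started.
        if self.running:
            self.running = False
            self.log.append("stop " + self.name)


def run_test(log):
    servers = []
    try:
        replica = Server("replica", log)
        servers.append(replica)
        replica.start()
        raise RuntimeError("simulated test failure")
    finally:
        # Cleanup happens at a well-defined point in the control flow,
        # never at an arbitrary moment chosen by the garbage collector.
        for server in reversed(servers):
            server.stop()
```

Because the cleanup runs in `finally`, the server is stopped even when the test body raises, and no logging happens from inside a finalizer.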
ligurio added a commit to tarantool/test-run that referenced this issue Mar 22, 2021
ligurio added a commit to tarantool/test-run that referenced this issue Mar 22, 2021
ligurio added a commit to tarantool/test-run that referenced this issue Mar 22, 2021
Totktonada added a commit to tarantool/test-run that referenced this issue Mar 22, 2021
@Totktonada
Member Author

Totktonada commented Mar 22, 2021

Totktonada added a commit to tarantool/tarantool that referenced this issue Mar 22, 2021
This update fixes a sporadic problem with hanging test-run workers. The
cause is an incorrect garbage collection handler. See [1] for details.

This is not the only test-run problem that leads to a hung worker: there
is at least one more known problem [2].

[1]: tarantool/test-run#275
[2]: tarantool/test-run#276

Part of tarantool/tarantool-qa#96
Totktonada added a commit to tarantool/tarantool that referenced this issue Mar 22, 2021
(cherry picked from commit 680990a)
Totktonada added a commit to tarantool/tarantool that referenced this issue Mar 22, 2021
(cherry picked from commit 680990a)
Totktonada added a commit to tarantool/tarantool that referenced this issue Mar 22, 2021
(cherry picked from commit 680990a)
ligurio added a commit to tarantool/test-run that referenced this issue Mar 24, 2021
ligurio added a commit to tarantool/test-run that referenced this issue Mar 24, 2021
ligurio added a commit to tarantool/test-run that referenced this issue Mar 26, 2021
ligurio added a commit to tarantool/test-run that referenced this issue Mar 30, 2021
ligurio added a commit to tarantool/test-run that referenced this issue Mar 31, 2021
ligurio added a commit to tarantool/test-run that referenced this issue Mar 31, 2021
ligurio added a commit to tarantool/test-run that referenced this issue Mar 31, 2021
ligurio added a commit to tarantool/test-run that referenced this issue Mar 31, 2021
@kyukhin kyukhin added the teamQ label Apr 15, 2021
@kyukhin kyukhin added this to the wishlist milestone Oct 15, 2021
@NickVolynkin
Contributor

@Totktonada this issue was supposedly resolved by tarantool/test-run#302. There's also tarantool/test-run#333 that we're currently working on. Can we close this particular issue (#96)?

@Totktonada
Member Author

Okay. Ideally, you would just link to multivac results and show that everything is fine. However, multivac does not catch 'worker hang' situations AFAIR (and the results are not updated now AFAIS).
