Non-deterministic end-to-end test failures. #113

JasonMoho · 2022-09-15T18:51:02Z

Describe the bug
The end-to-end tests for training and evaluation occasionally fail or time out, especially when running on GitHub Actions. It's difficult to reproduce this behavior locally. The failures occur most often in tests that use async processing together with the buffer, which leads me to believe that a concurrency control bug (e.g. a deadlock) is the cause.

The workaround for this bug is to just re-run the tests.

To Reproduce
Can occasionally be reproduced by running the GitHub Actions workflow, e.g. https://github.com/marius-team/marius/actions/runs/3056399004/jobs/4930521831

I have not observed async processing bugs when running on large-scale datasets, only on the tiny-scale datasets used for testing.

The main challenge will be isolating and identifying the issue. My approach will be to run a highly asynchronous configuration on a small dataset, which will hopefully recreate the conditions needed for the concurrency bug to arise.
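One way to make the intermittent failure easier to catch is to rerun a single test many times with a per-run timeout, so that a hang shows up as a timeout instead of blocking forever. The following is a minimal sketch under assumptions: the `stress` helper and the test path are hypothetical and not part of the Marius repository, and the run count and timeout are arbitrary.

```python
import subprocess
import sys
from collections import Counter


def stress(cmd, runs=50, timeout_s=120):
    """Run cmd repeatedly and tally passes, failures, and timeouts."""
    outcomes = Counter()
    for _ in range(runs):
        try:
            result = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
            outcomes["passed" if result.returncode == 0 else "failed"] += 1
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child on timeout; a run that never
            # finishes is consistent with a deadlock.
            outcomes["timed_out"] += 1
    return outcomes


if __name__ == "__main__":
    # Hypothetical test path; substitute the real end-to-end test module.
    cmd = [sys.executable, "-m", "pytest", "-x", "path/to/async_buffer_test.py"]
    print(dict(stress(cmd)))
```

A nonzero `timed_out` count would point toward a hang rather than an ordinary assertion failure, which helps narrow the search to the async pipeline and buffer code.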

Environment
Occurs on both Linux and macOS.

JasonMoho added the bug label Sep 15, 2022

JasonMoho self-assigned this Sep 15, 2022
