Add tool for analyzing and reporting random CDash test failures #600

achauphan · 2024-01-17T16:21:21Z

Related issues

Description

Random failures can bring down an entire CI iteration on a regular basis and waste resources whenever a retest is requested in order to pass the various checks of a pull request.

Spotting a randomly failing test requires a lot of manual CDash querying and analysis by the developer. In most cases, however, a developer does not have the time to trace, identify, and report the randomly failing test and will instead ignore it and request a retest, which wastes resources as described above. This lack of reporting also leads to a bigger issue: it allows the randomly failing test to linger in the code base and affect other developers in the future.

Proposed Solution

This issue proposes a new tool (which for now would live inside of TriBITS under tribits/ci_support) that can run automatically to query, scrape, analyze, and report tests that are deemed to be "randomly failing" to an operations team via email or an automated issue creation in the repository.

The definition for a randomly failing test will be a test that intermittently reports as passing or failing without any changes made to the topic or target branch being tested (topic and target tip SHA1 are the same) between CI testing iterations.

Fortunately, a lot of existing Python work inside of tribits/ci_support can be leveraged to build this tool. Notably, there is the module CreateIssueTrackerFromCDashQuery.py, which is used in the template example example_test_failure_github_issue.py, along with the module CDashQueryAnalyzeReport.py, which contains most of the heavy CDash querying functionality. Thus, the core work remaining after reusing these modules will be to implement the algorithm that determines a random failure in a way that is customizable on a project basis.

The goal is for this tool to be able to look for randomly failing tests for any project that posts its test results to CDash. The specifics of how this tool gathers the version information of the builds in CDash will be unique to each project and will require implementation on a project basis.

Ideally, this tool can be extended to analyze and report randomly failing configures, builds, and tests. However, starting with randomly failing tests should lead to a framework that can be reused for those other cases.

Requirements

~~posts a github issue upon identifying a randomly failing test~~ (TRILFRAME-614 requirement for any post starting with an email first)
be able to query cdash results over a period of time
all functionality is tested
usage is documented


Added initial set of arguments for script to take in when run.

…ub#600) Helper module functions that construct the browser and query URLs to CDash that can be used for downloading the data.

…#600) This initial script implementation takes in several CDash arguments and queries CDash for an initial set of all failing tests over a certain number of days. For each test in that set, the script then gets the test's full testing history. The full testing history is used to build a set of (target, topic) sha1 pairs associated with failing testing iterations. This initial implementation currently lacks the check to see if a passing test's (target, topic) sha1s exist in the set of failing sha1s, which denotes an unstable test. This is a monolithic commit because it started from a lot of exploratory coding that eventually built up to this starting implementation.

…#600) Add checkIfTestUnstable() that takes in a tuple of passing sha1s and a set of tuples containing nonpassing sha1s. This requires testing.
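The check described in this commit message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the tool's actual code: the function names `is_test_unstable` and `has_random_failure`, the `(status, sha1 pair)` record layout, and the status string `"Passed"` are all hypothetical.

```python
from typing import Iterable, Set, Tuple

# (target SHA1, topic SHA1) for one testing iteration
Sha1Pair = Tuple[str, str]


def is_test_unstable(passing_sha1s: Sha1Pair,
                     nonpassing_sha1s: Set[Sha1Pair]) -> bool:
  """Return True if a passing run used the same (target, topic) SHA1
  pair as at least one nonpassing run of the same test."""
  return passing_sha1s in nonpassing_sha1s


def has_random_failure(history: Iterable[Tuple[str, Sha1Pair]]) -> bool:
  """Scan one test's history of (status, sha1 pair) records.

  Assumes the status of a passing run is the string "Passed"; every
  other status is treated as nonpassing.
  """
  history = list(history)
  nonpassing = {pair for status, pair in history if status != "Passed"}
  return any(
    is_test_unstable(pair, nonpassing)
    for status, pair in history
    if status == "Passed"
  )
```

A history with one failing and one passing run on identical SHA1s is flagged, while a history whose passing run used a different topic SHA1 is not.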

Add a set of unit tests for getBuildIdFromTest helper function from cdash_analyze_and_report_random_failure.py

Moved argument parsing into a function that gets called by main and changed getBuildIdFromTest to return the last item of the split string rather than a constant index.

Limited the build name to only 80 characters to shorten the cache file name.

Fix regex pattern to match a string literal rather than the raw JSON string output that was used previously during testing.

…#600) Moved random failure test files to their own separate folder inside test/ci_support so that they are not confused with the test files associated with other scripts.

Test layout was copied from another script. Renamed various functions and function calls to reflect the actual script name that is being tested.

Included summary output of analysis run and found randomly failing tests.

Initial test case for 1 passing and 1 failing test in a test history with identical sha1s between the two, signifying a randomly failing test

bartlettroscoe · 2024-01-30T18:43:45Z

CC: @sebrowne

@achauphan, one thing that occurred to me is that this tool will need to allow the use of a build-name modifier that takes in the build name from CDash and provides a name used to determine sequential builds for the Trilinos PR and nightly testing system. For example, all of the Trilinos build names have the prefix PR-<prID>-test- and the suffix -<jenkinsJobID> that must be removed from the build name to get the core build name. For example, the builds:

PR-12703-test-rhel7_sems-intel-2021.3-sems-openmpi-4.0.5_release-debug_shared_no-kokkos-arch_no-asan_no-complex_fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-off_no-package-enables-1731
PR-12703-test-rhel7_sems-intel-2021.3-sems-openmpi-4.0.5_release-debug_shared_no-kokkos-arch_no-asan_no-complex_fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-off_no-package-enables-1732

are a sequence of the same build, but CDash does not recognize that because the build names are different. To identify a related sequence of builds, you need to at least remove the suffix -<jenkinsJobID> to give:

PR-12703-test-rhel7_sems-intel-2021.3-sems-openmpi-4.0.5_release-debug_shared_no-kokkos-arch_no-asan_no-complex_fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-off_no-package-enables
PR-12703-test-rhel7_sems-intel-2021.3-sems-openmpi-4.0.5_release-debug_shared_no-kokkos-arch_no-asan_no-complex_fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-off_no-package-enables

Then, if the target and topic branches are the same, and if a test goes from passing to failing, then you can classify this as a random test failure.
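As a rough sketch of the build-name editing described above, the following assumes the Trilinos PR naming scheme quoted in this comment (`PR-<prID>-test-<core name>-<jenkinsJobID>`, with numeric IDs). The function names `strip_job_id` and `core_build_name` are hypothetical and not part of TriBITS.

```python
import re

# Assumed naming scheme: PR-<prID>-test-<core name>-<jenkinsJobID>
_JOB_ID_SUFFIX = re.compile(r"-\d+$")
_PR_PREFIX = re.compile(r"^PR-\d+-test-")


def strip_job_id(build_name: str) -> str:
  """Remove a trailing -<jenkinsJobID> so sequential builds share a name."""
  return _JOB_ID_SUFFIX.sub("", build_name)


def core_build_name(build_name: str) -> str:
  """Remove both the PR-<prID>-test- prefix and the -<jenkinsJobID> suffix."""
  return _PR_PREFIX.sub("", strip_job_id(build_name))
```

Applying `strip_job_id` to the two example build names above yields the same string for both, which is what allows them to be treated as one build sequence.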

You can provide the means for adjusting the build names using the Strategy Design Pattern.

So the two areas of variability for such a tool that will be project-specific (and therefore need to be abstracted out and pulled in as Strategy objects) are:

How to extract the version of the project for the purposes of comparing builds. (In the case of Trilinos with merge commits, you can do that by concatenating the target and topic branch SHA1s scraped from the configure output into a string like <sha1-target>-<sha1-topic>.) Then the Python code just needs to compare the string values for this "version" to determine if the versions are the same (and it does not matter how that "version" was constructed or even what it represents).
How to edit the build names so we can determine sequences of the same build configurations. (In the case of Trilinos, at least remove the suffix -<jenkinsJobID>.)

Those can be two separate strategy objects given to the Python class(es) that are doing the data processing and analysis.
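One way the two strategy objects might plug into generic analysis code is sketched below. All class names, method names, and dictionary keys here (`ShaPairVersionStrategy`, `StripJobIdBuildNameStrategy`, `RandomFailureAnalyzer`, `get_version`, `get_core_build_name`, `buildname`, `testname`, `status`, `target_sha1`, `topic_sha1`) are hypothetical, chosen only for illustration, and the status string `"Passed"` is an assumption.

```python
from collections import defaultdict


class ShaPairVersionStrategy:
  """Hypothetical strategy: build an opaque version string from two SHA1s."""

  def get_version(self, run):
    return "{}-{}".format(run["target_sha1"], run["topic_sha1"])


class StripJobIdBuildNameStrategy:
  """Hypothetical strategy: drop the trailing -<jenkinsJobID> segment."""

  def get_core_build_name(self, build_name):
    return build_name.rsplit("-", 1)[0]


class RandomFailureAnalyzer:
  """Generic analysis code; project specifics live in the two strategies."""

  def __init__(self, version_strategy, build_name_strategy):
    self.version_strategy = version_strategy
    self.build_name_strategy = build_name_strategy

  def find_random_failures(self, runs):
    # Group pass/fail outcomes by (core build name, test name, version).
    outcomes = defaultdict(set)
    for run in runs:
      key = (
        self.build_name_strategy.get_core_build_name(run["buildname"]),
        run["testname"],
        self.version_strategy.get_version(run),
      )
      outcomes[key].add(run["status"] == "Passed")
    # Both True and False seen for one key: same build configuration,
    # same version, different outcomes.
    return sorted(key for key, seen in outcomes.items() if len(seen) == 2)
```

The analyzer only compares the version strings and core build names for equality, so a different project could swap in its own strategies without touching the analysis loop.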

Each test in a failing test's history will have its own testname_buildname directory inside of the build_summary_cache directory. This was done to better group build summary cache files with their associated test and build names.

Added a not random failure system test case beginning from one failed test and 5 tests in its history where all tests contain merge commits with non-matching parents.

Renamed variables and cache files to be shorter in cases where the expected source or direction is related to CDash or CDash tests.

Use dummy strings that are more easily identifiable for test input files for parent commit hashes of the respective build summary output.

…TSPub#600) Normalize the groupName string before using it in a URL request rather than requiring the user to input an already normalized string. This has the added benefit of allowing the groupName string to be used without URL-encoded characters in the upcoming summary lines.
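A minimal sketch of this kind of normalization, using the standard library's `urllib.parse.quote`. The function name `normalize_for_url` and the example group name are hypothetical, and the tool itself may normalize the string differently.

```python
from urllib.parse import quote


def normalize_for_url(text: str) -> str:
  """Percent-encode a user-supplied string for use in a URL query value."""
  return quote(text, safe="")


group_name = "Pull Request"                   # as typed by the user
url_value = normalize_for_url(group_name)     # used only when building URLs
summary_line = "Group name: " + group_name    # readable form for reports
```

Keeping the raw string and encoding it only at the point of URL construction means the same value can be reused in human-readable output.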

Function used 4 spaces instead of 2.

Removed individual printing of RandomFailureSummary in-line and instead use a str() function for RandomFailureSummary object

Added functionality to build an html file containing analysis results and the ability to report the results via email.

…600) Added cdash_analyze_and_report_random_failures_UnitTests ctest test using tribits_add_advanced_test().

This check was left in from an initial starting script. After adding cdash_analyze_and_report_random_failures_UnitTests.py as a ctest test, this check would cause CI ctest runs to fail as TRIBITS_DIR is set on a project basis during testing.

At a glance, the names of test cases such as "rft_0_ift_2" are not understandable without knowing what the acronyms mean. The directories of the new test case names will continue to use the acronyms as that better depicts the contents of the test files present.

The context of the script is a cdash tool so most of the variable names do not need that additional context in their names.

#600) Create driver class CDashAnalyzeReportRandomFailuresDriver inside module file CDashAnalyzeReportRandomFailures.py that will contain the main general functionality of the random test failure tool. The driver class accepts two strategy classes passed from the example script. These strategy classes ExampleVersionInfoStrategy and ExampleBuildNameStrategy contain the project specific implementation that is generically used inside of the driver class.

#600) This large commit is copying over the main() function and its associated helper functions into CDashAnalyzeReportRandomFailures.py inside the CDashAnalyzeReportRandomFailuresDriver class. This is part of the effort to refactor cdash_analyze_and_report_random_failures.py to be more generic.

…600) There were mixed use cases of 'targetTopic' or 'topicTarget', this renames all cases to use 'targetTopic' approach.

Moved example_cdash_analyze_and_report_random_failures.py to test/ci_support

Trilinos specific driver `trilinos_cdash_analyze_and_report_random_failures.py` based on `example_cdash_analyze_and_report_random_failures.py` that contains the Trilinos specific implementations of `VersionInfoStrategy` and `ExtractBuildNameStrategy`.

Example class did not include the 'Example' prefix.

This is for testing the CDashAnalyzeReportRandomFailures.py runDriver().

Adjusted spacing between classes and added newline character at the end of file.

This reverts commit dbe94f4. Reverting this commit as this specific driver implementation shouldn't be existing inside of TriBITS. Rather it should be added to the Trilinos repo after snapshotting TriBITS in.

Deleted the original `cdash_analyze_and_report_random_failures.py` script after moving its main functionality into a separate class inside `CDashAnalyzeReportRandomFailures.py`. To run the script, one must start from `example_cdash_analyze_and_report_random_failures.py` located in `test/ci_support` and supply an implementation of the two strategy objects used by the `CDashAnalyzeReportRandomFailures.py` driver class.

Removed unit tests related to the old script, `cdash_analyze_and_report_random_failures.py`. These tests will be put back as unittests for the module file `CDashAnalyzeReportRandomFailures.py`. This change will keep `cdash_analyze_and_report_random_failures_UnitTests.py` focused on the system tests for how the class `CDashAnalyzeReportRandomFailuresDriver` is used.

Added tests for `CDashAnalyzeReportRandomFailuresDriver` member functions in `CDashAnalyzeReportRandomFailures.py`.

The previous filename compression technique always trimmed the build name to its first 80 characters to avoid "filename too long" errors. Cache file and directory names are built in the format `testName_buildName`. That method does not protect against the case where testName is very long. This implementation uses an existing function named `getCompressedFileNameIfTooLong` in the `CDashQueryAnalyzeReport.py` module, which forms a hash of the passed-in string if it is deemed too long. This also reduces the chance of a filename collision: previously, a trimmed buildName could result in the same `testName_buildName` filename for two different builds of the same test.
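The idea of hashing only names that are too long can be sketched as below. This is not the signature or implementation of `getCompressedFileNameIfTooLong`; the function `compress_name_if_too_long`, its `max_len` default, and the choice of SHA-1 are assumptions made for illustration.

```python
import hashlib


def compress_name_if_too_long(name: str, max_len: int = 255) -> str:
  """Return name unchanged if it fits, else a fixed-length hash of it."""
  if len(name) <= max_len:
    return name
  # Hash the full string so names that differ only after the length
  # limit still map to different file names.
  return hashlib.sha1(name.encode("utf-8")).hexdigest()
```

Unlike truncation, two long names that share a long common prefix are very unlikely to map to the same result, since the hash depends on the whole string.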

Optional usageHelp string that can be passed to `CDashAnalyzeReportRandomFailuresDriver` and is output when the main script is given the `--help` argument.

Used to specify the testing day start time unique to each CDash project.

CDash Report random test failure tool (#600)

…failures-tool (TriBITSPub/TriBITS#600)

bartlettroscoe · 2024-02-14T03:22:59Z

@achauphan and @sebrowne, the Trilinos PR that brings in TriBITS PR #603 is:

TriBITS snapshot 2024-02-13 trilinos/Trilinos#12741

We can work on further refactorings and feature enhancements later.

I can see where this may be useful for some metrics for other projects that submit to CDash, so I will do those refactorings as needed.

Added argument to specify a prefix string for the built html page title and the email subject. This can help with the tool's email searchability.

…-02-15 CDash Random failure tool patch 2024-02-15 (#600)

achauphan added type: enhancement component: ci_support labels Jan 17, 2024

achauphan self-assigned this Jan 17, 2024

achauphan added a commit to achauphan/TriBITS that referenced this issue Jan 22, 2024

Added argument parsing for cdash random failures script (TriBITSPub#600)

09a29a3

Added initial set of arguments for script to take in when run.

achauphan added a commit to achauphan/TriBITS that referenced this issue Jan 25, 2024

Added unit test for getBuildIdFromTest (TriBITSPub#600)

0930d75

Add a set of unit tests for getBuildIdFromTest helper function from cdash_analyze_and_report_random_failure.py

bartlettroscoe mentioned this issue Jan 26, 2024

Trilinos_PR_cuda-11.4.2-uvm-off PR build not running/submitted to CDash starting 2024-01-24 trilinos/Trilinos#12696

Open

achauphan added a commit to achauphan/TriBITS that referenced this issue Jan 30, 2024

Add limiting of build name strings for cache file name (TriBITSPub#600)

acc3934

Limited the build name to only 80 characters to shorten the cache file name.

achauphan added a commit to achauphan/TriBITS that referenced this issue Jan 30, 2024

Fix configure output regex pattern (TriBITSPub#600)

d86767f

Fix regex pattern to match a string literal rather than the raw JSON string output that was used previously during testing.

achauphan added a commit to achauphan/TriBITS that referenced this issue Jan 30, 2024

Added test failure summary (TriBITSPub#600)

c966b59

Included summary output of analysis run and found randomly failing tests.

achauphan added a commit to achauphan/TriBITS that referenced this issue Jan 30, 2024

Added test for 1 pass 1 fail random failure case (TriBITSPub#600)

77491e9

Initial test case for 1 passing and 1 failing test in a test history with identical sha1s between the two, signifying a randomly failing test

achauphan added a commit to achauphan/TriBITS that referenced this issue Jan 30, 2024

Added test for 1 pass 1 fail random failure case (TriBITSPub#600)

d1c94ab

Initial test case for 1 passing and 1 failing test in a test history with identical sha1s between the two, signifying a randomly failing test

achauphan added a commit to achauphan/TriBITS that referenced this issue Jan 30, 2024

Renamed variables and cache files (TriBITSPub#600)

4693f9f

Renamed variables and cache files to be shorter in cases where the expected source or direction is related to CDash or CDash tests.

achauphan added a commit to achauphan/TriBITS that referenced this issue Jan 30, 2024

Renamed variables and cache files (TriBITSPub#600)

fa761dc

Renamed variables and cache files to be shorter in cases where the expected source or direction is related to CDash or CDash tests.

achauphan added a commit to achauphan/TriBITS that referenced this issue Jan 30, 2024

Renamed variables and cache files (TriBITSPub#600)

4f72d3e

Renamed variables and cache files to be shorter in cases where the expected source or direction is related to CDash or CDash tests.

bartlettroscoe mentioned this issue Feb 2, 2024

Add CI for checking for broken links manually, weekly and in PRs betterscientificsoftware/bssw.io#1633

Open

6 tasks

achauphan added a commit to achauphan/TriBITS that referenced this issue Feb 3, 2024

Fix spacing to match the rest of the script (TriBITSPub#600)

7ebc52c

Function used 4 spaces instead of 2.

achauphan added a commit to achauphan/TriBITS that referenced this issue Feb 3, 2024

Added str() method to RandomFailureSummary (TriBITSPub#600)

c1a765a

Removed individual printing of RandomFailureSummary in-line and instead use a str() function for RandomFailureSummary object

achauphan added a commit to achauphan/TriBITS that referenced this issue Feb 3, 2024

Added email summary report option (TriBITSPub#600)

cf54c8e

Added functionality to build an html file containing analysis results and the ability to report the results via email.

achauphan added a commit to achauphan/TriBITS that referenced this issue Feb 6, 2024

Added email summary report option (TriBITSPub#600)

75de3d3

Added functionality to build an html file containing analysis results and the ability to report the results via email.

achauphan added a commit that referenced this issue Feb 13, 2024

Modified starting status string output to indicate date range (#600)

93ae521

achauphan added a commit that referenced this issue Feb 13, 2024

Added cdash_analyze_and_report_random_failures_UnitTests ctest test (#…

6c68dbd

…600) Added cdash_analyze_and_report_random_failures_UnitTests ctest test using tribits_add_advanced_test().

achauphan added a commit that referenced this issue Feb 13, 2024

Shortened names that referenced 'cdash' (#600)

31b4d90

The context of the script is a cdash tool so most of the variable names do not need that additional context in their names.

achauphan added a commit that referenced this issue Feb 13, 2024

Fix declaration position of defaultPageStyle var (#600)

1b2601b

achauphan added a commit that referenced this issue Feb 13, 2024

Renamed ExampleVersionInfoStrategy member function to be consistent (#…

f56356a

…600) There were mixed use cases of 'targetTopic' or 'topicTarget', this renames all cases to use 'targetTopic' approach.

achauphan added a commit that referenced this issue Feb 13, 2024

Moved example_cdash_analyze_and_report_random_failures.py (#600)

9e0983c

Moved example_cdash_analyze_and_report_random_failures.py to test/ci_support

achauphan added a commit that referenced this issue Feb 13, 2024

Fix type of ExampleExtractBuildNameStrategy class (#600)

6285f6c

Example class did not include the 'Example' prefix.

achauphan added a commit that referenced this issue Feb 13, 2024

Implemented Trilinos versions for Example driver (#600)

40d2d28

This is for testing the CDashAnalyzeReportRandomFailures.py runDriver().

achauphan added a commit that referenced this issue Feb 13, 2024

Cleaned up spacing (#600)

6ff3ab9

Adjusted spacing between classes and added newline character at the end of file.

achauphan added a commit that referenced this issue Feb 13, 2024

Add test for member functions (#600)

e757fc9

Added tests for `CDashAnalyzeReportRandomFailuresDriver` member functions in `CDashAnalyzeReportRandomFailures.py`.

achauphan added a commit that referenced this issue Feb 13, 2024

Added optional usageHelp as argument (#600)

693cfb5

Optional usageHelp string that can be passed to `CDashAnalyzeReportRandomFailuresDriver` and is output when the main script is given the `--help` argument.

achauphan added a commit that referenced this issue Feb 13, 2024

Added cdash-testing-day-start-time argument (#600)

e9b5f87

Used to specify the testing day start time unique to each CDash project.

achauphan added a commit that referenced this issue Feb 13, 2024

Added argument help strings (#600)

47bad7f

achauphan added a commit that referenced this issue Feb 13, 2024

Removed a debug print statement accidentally committed previously (#600)

b9b5bfc

achauphan added a commit that referenced this issue Feb 13, 2024

Added docs explaining each system test case (#600)

5e5bd79

bartlettroscoe added a commit that referenced this issue Feb 14, 2024

Merge pull request #603 from TriBITSPub/600-cdash-random-failure-tool

7b14c49

CDash Report random test failure tool (#600)

bartlettroscoe added a commit to trilinos/Trilinos that referenced this issue Feb 14, 2024

Merge branch 'tribits_github_snapshot' into tribits-600-cdash-random-…

103fc97

…failures-tool (TriBITSPub/TriBITS#600)

achauphan mentioned this issue Feb 15, 2024

CDash Random failure tool patch 2024-02-15 (#600) #605

Merged

bartlettroscoe added a commit that referenced this issue Feb 19, 2024

Merge pull request #605 from achauphan/random-failure-tool-patch-2024…

58c744c

…-02-15 CDash Random failure tool patch 2024-02-15 (#600)
