CDash Report random test failure tool (#600) #603

achauphan · 2024-02-06T16:25:20Z

WIP: This PR is the implementation of a CDash Random Test Failure analysis tool related to:

Add tool for analyzing and reporting random CDash test failures #600

achauphan · 2024-02-06T17:24:14Z

Force push to fix typo in d075467 commit message

sebrowne

This looks more or less acceptable to me.

Biggest note that I have is that this is VERY complex code that effectively mocks out the responses from the API query library, but does so in such a way that a lot of the code in the library is actually run. In general, I recommend against doing that in unit tests, because you're testing more than one module with the test. A more robust way to do that would be to use the Mock module in Python to directly replace the library functions under test (CDQAR in this specific case). Then again, using Python to run the script via the command line is also a bit uncomfortable to me, as opposed to importing the capability and then testing it directly in Python. I generally do not test command-line parsers. Same reason as above. I'd rather have good unit tests than system tests.

sebrowne · 2024-02-07T15:06:59Z

test/ci_support/cdash_analyze_and_report_random_failures_UnitTests.py

+#############################################################################
+
+
+class test_getBuildIdFromTest(unittest.TestCase):


Recommend standard Python class naming (TestGetBuildIdFromTest)

test/ci_support/cdash_analyze_and_report_random_failures_UnitTests.py

sebrowne · 2024-02-07T15:08:21Z

test/ci_support/cdash_analyze_and_report_random_failures_UnitTests.py

+        )
+
+
+    def test_no_random_failure(self):


Maybe add a docstring similar to the above one here (describe why it is not a random failure)

sebrowne · 2024-02-07T15:10:46Z

..._random_failures/rft_0_ift_2/test_queries_cache/test_history_cache/testname1_buildname1.json

+                 u'status': u'Passed',
+                 u'statusclass': u'error',


Here is one of the (in-theory) contradictions that I was warning you about. I don't think it's a big deal, and probably would leave it alone since you don't use those values. Just a warning about mocking the API responses this way.

bartlettroscoe · 2024-02-07T16:22:47Z

Commenting on @sebrowne review above, you need system-level tests for every major use case supported by the tool, just to make sure the command-line arguments are interpreted correctly and you get the desired behavior. But most of the detailed code coverage tests should be below system-level tests (i.e. unit tests). You need to be sparing with system-level tests because they are typically expensive and/or difficult to maintain compared to lower-level (unit-level) tests.

Updated: @sebrowne and @achauphan, do you want me to do a detailed code review of this PR? Just taking a quick look, there is really not that much production code but it is not very unit testable the way it is currently written. The entire file tribits/ci_support/cdash_analyze_and_report_random_failures.py is only 342 lines long, but it is mostly a small class and some small functions and then a large main() function, so it would be difficult unit test this. If it were me, I would break that long main() function up into small functions and/or classes and have unit tests on the pieces. Those unit tests are much easier to write and maintain and might allow fewer system-level tests. In fact, if you develop the code with TDD, the unit tests come for free and you have naturally modular code. I can suggest how to refactor this code and make it more modular and unit testable if there is interest in that. Just let me know.

achauphan · 2024-02-07T18:49:01Z

@bartlettroscoe I do agree that main() needs to broken down into smaller units and I would love and appreciate a more detailed review, however, I think this is @sebrowne's call for if I should put more time into this as I've already taken too long on the story.

From a "lets get some data now on the problem" perspective, could it be worth a general review and standing it up for Trilinos to get a better picture of the frequency of random test failures and then assess how much more work should go into this?

bartlettroscoe · 2024-02-07T19:00:37Z

@achauphan and @sebrowne,

I may be able to use some other funding related to DevOps research and metrics gathering to refactor this code some and generalize it so that it can work for projects other than Trilinos. As currently written, it will only work for Trilinos PR builds. Therefore, I think if we keep the code as is, it may be better (for now) to keep the tool and its tests on a branch or put this code into its own git repo.

With that said, I don't think it is a lot of work to do the minimal refactorings needed to make 95% of this generic and keep it in the TriBITS repo and then have a Trilinos-specific driver in a different git repo (could be the main Trilinos repo or a different git repo).

What do you think?

achauphan · 2024-02-08T00:44:11Z

@bartlettroscoe I can't speak for @sebrowne, but do agree that in its current state, this is not at all ready for general use so it should stay as a branch for now. If it's a good idea for me to keep working on this I can, but if you would like to and are able to take this on further, please feel free to do so. I do also agree that this should not too much to refactor and make it more generic.

sebrowne · 2024-02-08T14:35:07Z

@achauphan @bartlettroscoe

Looking over it again for the pieces that are Trilinos-specific, I do not feel they are significant enough to bother generalizing the code until there is a greater use case for doing so, primarily given the relative ease with which it should be possible to separate the general and the specific components. That being said, I think that the argument that splitting it is simple enough to just take care of now is also perfectly valid. I don't have a strong opinion on this matter, as long as this code is in Trilinos and usable very soon. I want to be able to start setting up a Jenkins job with it, and if it needs to target the branch in this repository to do so, that's fine too. Our sprint ends on Monday 02/19 and if this capability is not working and emailing by then, I do not feel that I can justify spending more of the team's time on it.

On the subject of main() splitting, I do not see the value if the cache-mock mechanism will still be used (because unit testing for the code that is currently in main() is prohibited much more by the lack of library-function mocking than it is by all being in one function).

achauphan · 2024-02-08T19:55:58Z

@sebrowne @bartlettroscoe

Is there a reasonable middle ground here for supporting cache-mock tests along with library-function mocking?

The argument to utilizing a cache-mock system test would be, as stated earlier by @bartlettroscoe, to test this program's command line args interpretation and check the overall desired behavior of the program at the cost of having to also run the library code as mentioned by @sebrowne.

Then utilizing library-function mocking would allowmain() to be broken down to its own smaller testable units, as mentioned by @sebrowne, which we all seem to think is a needed.

I believe this approach would promote another module file with something of the name of CDashRandomFailures.py that holds smaller units of main() and can further house the generic strategy objects mentioned here.

Coming into this, my understanding of testing has been limited, but I think this conversation has been very helpful in shaping the different perceptions around test.

What do ya'll think?

(Also @bartlettroscoe I would appreciate the refactor suggestions offer you made earlier if you have time)

bartlettroscoe · 2024-02-09T01:41:43Z

Is there a reasonable middle ground here for supporting cache-mock tests along with library-function mocking?

@achauphan, what you outline above is exactly what I would recommend and it is what I did for the tool cdash_analyze_and_report.py. You will see there are a lot of unit tests for code called by that tool that typically never create files or touch the filesystem.

Also @bartlettroscoe I would appreciate the refactor suggestions offer you made earlier if you have time.

Yes, I can suggest what to do. But it sounds like @sebrowne wants to start with the minimal work to get this merged and used by Trilinos. For a minimal and easy refactoring (that does not require writing any new tests), what I would suggest would be:

Create a new Python class (called CDashAnalyzeReportRandomFailures?) that accepts two Strategy objects (one to modify build names and the other to extract the version info for a given build).
Move the existing main() function to a member function CDashAnalyzeReportRandomFailures::runDriver().
Create a test/example tribits/ci_support/example_cdash_analyze_and_report_random_failures.py that uses CDashAnalyzeReportRandomFailures and simple Strategy objects to match the existing tests (that can be the ones for Trilinos, that is fine).
Change the system-level tests in TriBITS to call example_cdash_analyze_and_report_random_failures.py (with the same exact data)
Create a new driver function in Trilinos called trilinos_cdash_analyze_and_report_random_failures.py that uses CDashAnalyzeReportRandomFailures and strategy objects specific for Trilinos (which initially would just be a direct copy of example_cdash_analyze_and_report_random_failures.py).

That is a very straightforward refactoring that can be done in stages, using the existing tests to ensure that each refactoring step goes smoothly and with no risk.

What is outlined above is exactly what was done for:

The above refactoring can be done in just a few short hours (or less). We can pair-program this together if you like. I have some availability tomorrow. (My SNL Outlook calendar is currently shown to be full but that is just blocking off time for some writing I need to get done for a paper.)

achauphan · 2024-02-09T15:51:54Z

@bartlettroscoe thanks for the suggestions! I've started a bit on these yesterday before your comment and planned to do a stacked PR. How about I finish going through these suggestions and we review the stacked PR together?

bartlettroscoe · 2024-02-09T16:00:15Z

@bartlettroscoe thanks for the suggestions! I've started a bit on these yesterday before your comment and planned to do a stacked PR. How about I finish going through these suggestions and we review the stacked PR together?

@achauphan, sounds good. Let me know when you want me to review.

…stack Stack PR with generic implementation for PR #603 (#600)

achauphan · 2024-02-13T03:25:48Z

Force push to fix codespell

achauphan · 2024-02-13T03:55:41Z

@sebrowne @bartlettroscoe Requesting another review.

The primary changes done since the last request was moving away from the script file cdash_analyze_and_report_random_failures.py and into using a module file CDashAnalyzeReportRandomFailures.py with a driver class CDashAnalyzeReportRandomFailuresDriver containing the same contents of the original script. Running the script will be done from a project specific script based on example_cdash_analyze_and_report_random_failures.py. The example script currently contains placeholder strategy classes that are the ones that would be used for Trilinos.

These are the changes that I hope are enough to stand up. I did not make any changes related to breaking up main(), or in this case runDriver(), into smaller testable bits as it was previously deemed good enough for now. That can perhaps be scheduled later down the road.

bartlettroscoe · 2024-02-13T14:57:42Z

@achauphan, doing a detailed review now ...

bartlettroscoe · 2024-02-13T15:14:57Z

@achauphan, while I am doing this review, could you please rebase this branch on top of the upstream 'master' target branch? GitHub is reporting the branch is out of date showing:

?

If not, I can take care of that after my reveiw.

NOTE: This is GitHub being super careful testing the exact version of the repo that will be merged

Helper module functions that construct the browser and query URLs to cdash that can be used for downloaded the data from.

This initial script implementation takes in several cdash arguments and filters cdash for an initial set of all failing tests for a certain number of days. With that set of all failing tests, the script will then get all of that test's testing history. The test's full testing history is used to build a set of target,topic sha1 associated with failing testing iterations. This initial implementation current lacks the check to see if a passing test's target,topic sha1s exist in the set of failing sha1s, which denotes an unstable test. Monolithic commit as this started from a lot of exploratory coding that eventually built to this starting implementation.

Add checkIfTestUnstable() that takes in a tuple of passing sha1s and a set of tuples containing nonpassing sha1s. This requires testing.

Should help with development. Very rough and Just Got Stuff Working.

Add a set of unit tests for getBuildIdFromTest helper function from cdash_analyze_and_report_random_failure.py

Moved argument parsing into a function that gets called by main and changed getBuildIdFromTest to return the last item of the split string rather than a constant index.

Limited the build name to only 80 characters to shorten the cache file name.

Fix regex pattern to match a string literal rather than a raw json string output which was used prior for during testing.

Moved random failure test files to its own separate folder inside test/ci_support as to not be confused with the test files associated with other script.

At a glance, the names of test cases such as "rft_0_ift_2" are not understandable without knowing what the acronyms mean. The directories of the new test case names will continue to use the acronyms as that better depicts the contents of the test files present.

The context of the script is a cdash tool so most of the variable names do not need that additional context in their names.

#600) Create driver class CDashAnalyzeReportRandomFailuresDriver inside module file CDashAnalyzeReportRandomFailures.py that will contain the main general functionality of the random test failure tool. The driver class accepts two strategy classes passed from the example script. These strategy classes ExampleVersionInfoStrategy and ExampleBuildNameStrategy contain the project specific implementation that is generically used inside of the driver class.

#600) This large commit is copying over the main() function and its associated helper functions into CDashAnalyzeReportRandomFailures.py inside the CDashAnalyzeReportRandomFailuresDriver class. This is part of the effort to refactor cdash_analyze_and_report_random_failures.py to be more generic.

…600) There were mixed use cases of 'targetTopic' or 'topicTarget', this renames all cases to use 'targetTopic' approach.

Moved example_cdash_analyze_and_report_random_failures.py to test/ci_support

Trilinos specific driver `trilinos_cdash_analyze_and_report_random_failures.py` based on `example_cdash_analyze_and_report_random_failures.py` that contains the Trilinos specific implementations of `VersionInfoStrategy` and `ExtractBuildNameStrategy`.

Example class did not include the 'Example' prefix.

This is for testing the CDashAnalyzeReportRandomFailures.py runDriver().

Adjusted spacing between classes and added newline character at the end of file.

This reverts commit dbe94f4. Reverting this commit as this specific driver implementation shouldn't be existing inside of TriBITS. Rather it should be added to the Trilinos repo after snapshotting TriBITS in.

Deleted the original `cdash_analyze_and_report_random_failures.py` script after moving its main functionality into a separate class inside `CDashAnalyzeReportRandomFailures.py`. To run the script, one must start from `example_cdash_analyze_and_report_random_failures.py` located in `test/ci_support` and supply an implementation of the two strategy objects used by the `CDashAnalyzeReportRandomFailures.py` driver class.

Removed unit tests related to the old script, `cdash_analyze_and_report_random_failures.py`. These tests will be put back as unittests for the module file `CDashAnalyzeReportRandomFailures.py`. This change will keep `cdash_analyze_and_report_random_failures_UnitTests.py` focused on the system tests for how the class `CDashAnalyzeReportRandomFailuresDriver` is used.

Added tests for `CDashAnalyzeReportRandomFailuresDriver` member functions in `CDashAnalyzeReportRandomFailures.py`.

Previous filename compression technique was to always trim the buildname to only the first 80 characters as to avoid "filename too long" errors. Cache file or directory names are built in the format of `testName_buildName` The above method does not protect against the case where testName may be very long. This implementation uses an existing function named `getCompressedFileNameIfTooLong` in `CDashQueryAnalyzeReport.py` module file which will form a hash of the passed in string if it is deamed too long. This will also help mitigate the chances of a filename collision as previously it was possible for a trimmed buildName to result in the same `testName_buildName` filename if testName was the same test and had the correct length.

Optional usageHelp string that can be passed to `CDashAnalyzeReportRandomFailuresDriver` that is outputted with when the main script is given the `--help` argument.

Used to specify the testing day start time unique to each CDash project.

bartlettroscoe

@achauphan, I did not have time for a line-by-line review of the test driver test/ci_support/CDashAnalyzeReportRandomFailures_UnitTests.py but I looked over the production code in the module CDashAnalyzeReportRandomFailures.py and I did not see any big problems. I only have suggestions for improvements that can be done in future PRs during future work.

I think this is good to merge as is and get used in Trilinos.

tribits/ci_support/cdash_analyze_and_report_random_failures.py

bartlettroscoe · 2024-02-10T01:14:13Z

test/ci_support/cdash_analyze_and_report_random_failures_UnitTests.py

@@ -0,0 +1,288 @@
+# @HEADER


The name of this file should be changed to example_cdash_analyze_and_report_random_failures_UnitTests.py? The file/module cdash_analyze_and_report_random_failures.py does not exist anymore. Or, you could call this system-level driver something like CDashAnalyzeReportRandomFailures_SystemLevelTests.py? (But the name example_cdash_analyze_and_report_random_failures_UnitTests.py might be more obvious?)

IMHO, the more detailed unit tests for CDashAnalyzeReportRandomFailures.py should be put in the driver CDashAnalyzeReportRandomFailures_UnitTests.py. It is nice to have very fast unit tests that are easer to debug in their own driver separate from the more expensive system-level tests driver. So when you are making changes, you run the faster CDashAnalyzeReportRandomFailures_UnitTests.py and resolve any errors in those tests first before digging into the failures in the system-level tests. That is just my opinion and I find this separation works well from experience.

test/ci_support/cdash_analyze_and_report_random_failures_UnitTests.py

bartlettroscoe · 2024-02-13T19:39:13Z

test/ci_support/example_cdash_analyze_and_report_random_failures.py

+  def getTargetTopicSha1s(self, buildData):
+    pattern = r"Parent [12]:\n\s+(\w+)"
+    matchedList = regex.findall(pattern, buildData)
+
+    if len(matchedList) != 2: return None
+    return tuple(matchedList)


In the future, we may need to refactor this Strategy API to allow more flexibility in how the version info is extracted from CDash. For example, for a raw CMake project, the CMake configure output will not contain the version info and you will need to get that info in some other way (either directly or indirectly through other CDash REST API calls). Therefore, for the more general use case, we would need to pass in an object containing enough info to call CDash and get the needed data. But we can wait on that.

But note that even once Trilinos moves to GitHub Actions, this current implementation will still correctly extract the topic and target branch SHA1 info since the top commit will still be a merge commit. (But in the case of GHA, the merge commit is created by GitHub and that merge commit is just checked out locally instead of created locally.)

bartlettroscoe · 2024-02-13T19:45:43Z

test/ci_support/example_cdash_analyze_and_report_random_failures.py

+  def checkTargetTopicRandomFailure(self, targetTopicPair, knownTargetTopicPairs):
+    return targetTopicPair in knownTargetTopicPairs


I don't think this function is needed in the strategy object. I think as long as the object returned from the getTargetTopicSha1s() function is comparable (which you can specify as a requirement for the strategy interface), then this is perfectly general code for all implementations that I can image.

Therefore, I would remove this function from here and put it back as a free function called by the CDashAnalyzeReportRandomFailuresDriver.runDriver() function.

bartlettroscoe · 2024-02-13T23:27:33Z