Often random failures of Audit_Spec on the CI #11278
Comments
Radosław Waśko reports a new STANDUP for the provided date (2024-10-09): Progress: Continued investigating the recent CI problems with the randomly failing audit log test. It is difficult because it does not reproduce locally. Finally found that logs are sometimes sent to the cloud instead of the mock. Why? Not sure yet, but the hypothesis is that I'm using the first entry in a batch to determine the URI, and a single batch can contain messages from multiple tests, which makes this a problem. Some investigation is still needed. Some further work on asset id in audit logs. It should be finished by 2024-10-10. Next Day: Next day I will be working on the #9869 task. Continue investigation. Hopefully splitting the batch by URI should fix the issue - let's see. Continue work on asset id in logs.
Finally I have found the root cause:
So the problem was that I was sending the whole batch to the endpoint associated with the first message in the batch. In prod that's OK - the endpoint is always the same. But in tests, messages from previous tests sometimes stay in the queue and change the endpoint unexpectedly.
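The fix idea described here (splitting a batch by endpoint instead of trusting the first message) can be illustrated with a minimal Python sketch. This is not the project's actual code: `LogMessage`, `split_batch_by_endpoint`, `send_batch`, and the example URLs are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class LogMessage:
    """Hypothetical queued audit log message; field names are assumptions."""
    endpoint: str  # URI the message was scheduled for
    payload: str


def split_batch_by_endpoint(batch: List[LogMessage]) -> Dict[str, List[LogMessage]]:
    """Group messages by their own endpoint, preserving order within each group."""
    groups: Dict[str, List[LogMessage]] = {}
    for message in batch:
        groups.setdefault(message.endpoint, []).append(message)
    return groups


def send_batch(batch: List[LogMessage],
               send: Callable[[str, List[LogMessage]], None]) -> None:
    """Send one request per endpoint rather than using batch[0].endpoint for all."""
    for endpoint, messages in split_batch_by_endpoint(batch).items():
        send(endpoint, messages)


if __name__ == "__main__":
    # A leftover 'real cloud' message from an earlier test sits at the front of the queue.
    queue = [
        LogMessage("https://cloud.example/logs", "stale message"),
        LogMessage("http://localhost:8080/logs", "mock test message"),
    ]
    send_batch(queue, lambda uri, msgs: print(uri, [m.payload for m in msgs]))
```

With this grouping, a stale message at the front of the queue can no longer redirect the messages meant for the mock, because every message goes to the endpoint it was scheduled for.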
The problem appeared because I did not expect that messages scheduled for the 'real cloud endpoint' could be stuck in the queue while executing the next tests, which are supposed to talk to the 'mock cloud'. Because these messages were in the queue, a whole batch, part of which was supposed to go to the mock, would be sent to the 'real cloud'. This was not happening before, because we were not authenticated to the cloud, so no tests were talking to the real cloud - they were instead falling back to writing audit logs to our local logger. Something changed recently (1-2 weeks ago) so that our CI is now running authenticated to our Cloud. Thus the tests were now logged in and sending audit logs to the cloud instead of stderr, and interfering with the mock tests. This change triggered the bad logic that had previously lain dormant. I suspect it was most likely #11198 - when we are running the GUI E2E tests on our CI, this is most likely creating an
Radosław Waśko reports a new STANDUP for yesterday (2024-10-09): Progress: Figured out everything that was happening with the audit log failures. Fixed the problem with batching in #11255, and also added a PR #11285 that cleans up after E2E tests to avoid accidentally running our tests as a logged-in user. Continuing work on asset id in audit logs - got most of the logic done and added tests. It should be finished by 2024-10-10. Next Day: Next day I will be working on the #9869 task. Make sure tests are passing.
Radosław Waśko reports a new STANDUP for yesterday (2024-10-10): Progress: Fixed a bug caused by my typo in the pending metadata PR containing the audit log fix. Finished and merged the asset id in audit logs PR. Created tickets for next tasks regarding datalinks and DBs. Started work on enabling audit logs for Snowflake. It should be finished by 2024-10-10. Next Day: Next day I will be working on the #11292 task. Continue - extracting common code, ensure tests are set up correctly.
There have been numerous reports of spontaneous failures of Audit_Spec on CI: