
Often random failures of Audit_Spec on the CI #11278

Closed
radeusgd opened this issue Oct 8, 2024 · 5 comments · Fixed by #11255
Assignees: radeusgd
Labels: --bug (Type: bug), -ci, -libs (Libraries: New libraries to be implemented), l-cloud-integration (Enso Cloud integration work)

Comments

@radeusgd
Member

radeusgd commented Oct 8, 2024

There have been numerous reports of spontaneous failures of Audit_Spec on the CI:

 INFO ide_ci::program::command: bash ℹ️ [FAILED] [PostgreSQL] Audit Logs: [2/3, 112884ms]
 INFO ide_ci::program::command: bash ℹ️ 
 INFO ide_ci::program::command: bash ℹ️     - [FAILED] should see Database operations performed on a table through the standard workflow [112422ms]
 INFO ide_ci::program::command: bash ℹ️         Reason: An unexpected dataflow error (Not_Found) has been matched (at /runner/_work/enso/enso/test/Table_Tests/src/Database/Common/Audit_Spec.enso:44:17-37).
 INFO ide_ci::program::command: bash ℹ️         at <enso> <default::Vector.find::if_missing>(/runner/_work/enso/enso/built-distribution/enso-engine-2024.5.1-dev-linux-amd64/enso-2024.5.1-dev/lib/Standard/Base/2024.5.1-dev/src/Data/Vector.enso:407:100-120)
@radeusgd radeusgd self-assigned this Oct 8, 2024
@radeusgd radeusgd added the -ci, --bug (Type: bug), -libs (Libraries: New libraries to be implemented), and l-cloud-integration (Enso Cloud integration work) labels Oct 8, 2024
@enso-bot

enso-bot bot commented Oct 8, 2024

Radosław Waśko reports a new STANDUP for the provided date (2024-10-09):

Progress: Continuing to investigate the recent CI problems with the randomly failing audit log test. It is difficult because the failure does not reproduce locally. Finally found that sometimes logs are sent to the cloud instead of the mock. Why? Not sure yet, but the hypothesis is that the first entry in a batch determines the URI, while a single batch can contain messages from multiple tests, which makes this a problem. Still some investigation needed. Some further work on asset id in audit logs. It should be finished by 2024-10-10.

Next Day: I will be working on the #9869 task. Continue the investigation - hopefully splitting the batch by URI will fix the issue, let's see. Continue work on asset id in logs.

@radeusgd
Member Author

radeusgd commented Oct 9, 2024

Finally I have found the root cause:

 INFO ide_ci::program::command: bash ℹ️ Audit logger cache reset.
 INFO ide_ci::program::command: bash ℹ️ STARTING AUDIT LOG TEST
 INFO ide_ci::program::command: bash ℹ️ Audit logger configured to URI http://localhost:23756/enso-cloud-mock/logs
 INFO ide_ci::program::command: bash ℹ️ DB OPS COMPLETE
 INFO ide_ci::program::command: bash ℹ️ Request (1 messages) sent successfully.
 INFO ide_ci::program::command: bash ℹ️ Log message sent successfully: {"message":"SHOW server_encoding","projectId":null,"metadata":{"connectionUri":"jdbc:postgresql://localhost:5432/enso_test_db","sequenceNumber":1,"operation":"query","projectName":"enso_dev.Table_Tests"},"kind":"Lib"}
 INFO ide_ci::program::command: bash ℹ️ Batch of 13 log messages to send: [https://7aqkn3tnbc.execute-api.eu-west-1.amazonaws.com/logs, https://7aqkn3tnbc.execute-api.eu-west-1.amazonaws.com/logs, https://7aqkn3tnbc.execute-api.eu-west-1.amazonaws.com/logs, https://7aqkn3tnbc.execute-api.eu-west-1.amazonaws.com/logs, http://localhost:23756/enso-cloud-mock/logs, http://localhost:23756/enso-cloud-mock/logs, http://localhost:23756/enso-cloud-mock/logs, http://localhost:23756/enso-cloud-mock/logs, http://localhost:23756/enso-cloud-mock/logs, http://localhost:23756/enso-cloud-mock/logs, http://localhost:23756/enso-cloud-mock/logs, http://localhost:23756/enso-cloud-mock/logs, http://localhost:23756/enso-cloud-mock/logs]
 INFO ide_ci::program::command: bash ℹ️ Sending request with 13 log messages to https://7aqkn3tnbc.execute-api.eu-west-1.amazonaws.com/logs
 INFO ide_ci::program::command: bash ℹ️ Request (13 messages) sent successfully.

So the problem is that the whole batch is sent to the endpoint associated with the first message in the batch. In production that's OK - the endpoint is always the same. But in tests, messages from previous tests can stay in the queue and change the endpoint unexpectedly.
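The fix suggested in the standup above is to split each batch by its target URI before sending, so that a single request only ever contains messages destined for one endpoint. Below is a minimal sketch of that grouping step in Java; the names (`LogMessage`, `AuditLogBatcher`) are hypothetical and this is not the actual Enso implementation:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical message type: each pending log message remembers the endpoint
// it was scheduled for (mock cloud vs. real cloud) plus its JSON payload.
record LogMessage(URI endpoint, String payload) {}

class AuditLogBatcher {
    /** Instead of sending the whole batch to the endpoint of its first
     *  message, split the batch into one group per endpoint; each group is
     *  then sent as a separate request to its own URI. */
    static Map<URI, List<LogMessage>> groupByEndpoint(List<LogMessage> batch) {
        Map<URI, List<LogMessage>> groups = new LinkedHashMap<>();
        for (LogMessage message : batch) {
            groups.computeIfAbsent(message.endpoint(), k -> new ArrayList<>())
                  .add(message);
        }
        return groups;
    }
}
```

With each group sent separately, messages left over from a previous test can no longer redirect a later test's logs to the wrong endpoint.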

@radeusgd
Member Author

radeusgd commented Oct 9, 2024

The problem appeared because I did not expect that messages scheduled for the 'real cloud endpoint' could remain stuck in the queue while the next tests, which are supposed to talk to the 'mock cloud', are executing. Because these messages were in the queue, a whole batch, part of which was supposed to go to the mock, would be sent to the 'real cloud'.

This was not happening before, because we were not authenticated to the cloud, so no tests were talking to the real cloud - they were instead falling back to logging audit messages to our local logger.

Something changed recently (1-2 weeks ago) so that our CI is now running authenticated to our Cloud. Thus the tests were logged in and were sending audit logs to the cloud instead of stderr, interfering with the mock tests. This change triggered the faulty logic that had previously lain dormant.

I suspect the trigger was most likely #11198 - when we run the GUI E2E tests on our CI, they most likely create an ~/.enso/credentials file on the runners that is not cleaned up afterwards - thus on subsequent runs our runners are authenticated.

mergify bot pushed a commit that referenced this issue Oct 9, 2024
- While #11255 fixes the root cause of #11278, this PR fixes what triggered it - since #11198 our runners were keeping the `~/.enso/credentials` file between runs, meaning that some tests were unexpectedly running in 'authenticated' mode. This PR cleans up that file to avoid the problem.
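For illustration only, a minimal sketch of that cleanup idea in Java; the actual fix is a cleanup step in the CI workflow, and this helper class is hypothetical:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical cleanup helper: remove a stale credentials file left behind by
// an earlier run, so subsequent test runs start out unauthenticated.
class CredentialsCleanup {
    static void removeStaleCredentials() throws IOException {
        Path credentials =
            Path.of(System.getProperty("user.home"), ".enso", "credentials");
        Files.deleteIfExists(credentials);
    }
}
```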
@enso-bot

enso-bot bot commented Oct 10, 2024

Radosław Waśko reports a new STANDUP for yesterday (2024-10-09):

Progress: Figured out everything that was happening with the audit log failures. Fixed the batching problem in #11255 and also added PR #11285, which cleans up after E2E tests to avoid accidentally running our tests as a logged-in user. Continuing work on asset id in audit logs - got most of the logic done and added tests. It should be finished by 2024-10-10.

Next Day: I will be working on the #9869 task. Make sure tests are passing.

@mergify mergify bot closed this as completed in #11255 Oct 10, 2024
@mergify mergify bot closed this as completed in 3458fe4 Oct 10, 2024
@enso-bot

enso-bot bot commented Oct 11, 2024

Radosław Waśko reports a new STANDUP for yesterday (2024-10-10):

Progress: Fixed a bug caused by my typo in the pending metadata PR containing the audit log fix. Finished and merged the asset id in audit logs PR. Created tickets for next tasks regarding datalinks and DBs. Started work on enabling audit logs for Snowflake. It should be finished by 2024-10-10.

Next Day: I will be working on the #11292 task. Continue - extract common code and ensure tests are set up correctly.
