
Often random failures of Audit_Spec on the CI #11278

Closed
radeusgd opened this issue Oct 8, 2024 · 5 comments · Fixed by #11255
Assignees: radeusgd
Labels: --bug (Type: bug), -ci, -libs (Libraries: New libraries to be implemented), l-cloud-integration (Enso Cloud integration work)

Comments

@radeusgd
Member

radeusgd commented Oct 8, 2024

There have been numerous reports of spontaneous failures of Audit_Spec on the CI:

 INFO ide_ci::program::command: bash ℹ️ [FAILED] [PostgreSQL] Audit Logs: [2/3, 112884ms]
 INFO ide_ci::program::command: bash ℹ️ 
 INFO ide_ci::program::command: bash ℹ️     - [FAILED] should see Database operations performed on a table through the standard workflow [112422ms]
 INFO ide_ci::program::command: bash ℹ️         Reason: An unexpected dataflow error (Not_Found) has been matched (at /runner/_work/enso/enso/test/Table_Tests/src/Database/Common/Audit_Spec.enso:44:17-37).
 INFO ide_ci::program::command: bash ℹ️         at <enso> <default::Vector.find::if_missing>(/runner/_work/enso/enso/built-distribution/enso-engine-2024.5.1-dev-linux-amd64/enso-2024.5.1-dev/lib/Standard/Base/2024.5.1-dev/src/Data/Vector.enso:407:100-120)
@radeusgd radeusgd self-assigned this Oct 8, 2024
@radeusgd radeusgd added the -ci, --bug (Type: bug), -libs (Libraries: New libraries to be implemented), and l-cloud-integration (Enso Cloud integration work) labels Oct 8, 2024
@enso-bot

enso-bot bot commented Oct 8, 2024

Radosław Waśko reports a new STANDUP for the provided date (2024-10-09):

Progress: Continuing to investigate the recent CI problems with the randomly failing audit log test. It is difficult because the failure does not reproduce locally. Finally found that sometimes logs are sent to the cloud instead of the mock. Why? Not sure yet, but the hypothesis is that the first entry in a batch determines the URI, while a single batch can contain messages from multiple tests, which makes this a problem. Still some investigation needed. Some further work on asset id in audit logs. It should be finished by 2024-10-10.

Next Day: I will be working on the #9869 task. Continue the investigation - hopefully splitting the batch by URI will fix the issue, let's see. Continue work on asset id in logs.

@radeusgd
Member Author

radeusgd commented Oct 9, 2024

Finally I have found the root cause:

 INFO ide_ci::program::command: bash ℹ️ Audit logger cache reset.
 INFO ide_ci::program::command: bash ℹ️ STARTING AUDIT LOG TEST
 INFO ide_ci::program::command: bash ℹ️ Audit logger configured to URI http://localhost:23756/enso-cloud-mock/logs
 INFO ide_ci::program::command: bash ℹ️ DB OPS COMPLETE
 INFO ide_ci::program::command: bash ℹ️ Request (1 messages) sent successfully.
 INFO ide_ci::program::command: bash ℹ️ Log message sent successfully: {"message":"SHOW server_encoding","projectId":null,"metadata":{"connectionUri":"jdbc:postgresql://localhost:5432/enso_test_db","sequenceNumber":1,"operation":"query","projectName":"enso_dev.Table_Tests"},"kind":"Lib"}
 INFO ide_ci::program::command: bash ℹ️ Batch of 13 log messages to send: [https://7aqkn3tnbc.execute-api.eu-west-1.amazonaws.com/logs, https://7aqkn3tnbc.execute-api.eu-west-1.amazonaws.com/logs, https://7aqkn3tnbc.execute-api.eu-west-1.amazonaws.com/logs, https://7aqkn3tnbc.execute-api.eu-west-1.amazonaws.com/logs, http://localhost:23756/enso-cloud-mock/logs, http://localhost:23756/enso-cloud-mock/logs, http://localhost:23756/enso-cloud-mock/logs, http://localhost:23756/enso-cloud-mock/logs, http://localhost:23756/enso-cloud-mock/logs, http://localhost:23756/enso-cloud-mock/logs, http://localhost:23756/enso-cloud-mock/logs, http://localhost:23756/enso-cloud-mock/logs, http://localhost:23756/enso-cloud-mock/logs]
 INFO ide_ci::program::command: bash ℹ️ Sending request with 13 log messages to https://7aqkn3tnbc.execute-api.eu-west-1.amazonaws.com/logs
 INFO ide_ci::program::command: bash ℹ️ Request (13 messages) sent successfully.

So the problem is that the whole batch is sent to the endpoint associated with the first message in the batch. In production that's OK - the endpoint is always the same. But in tests, messages from previous tests can stay in the queue and change the endpoint unexpectedly.
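The fix suggested in the standup above is to split each batch by its target URI before sending, so that a single request only ever contains messages destined for one endpoint. Below is a minimal sketch of that grouping step in Java; the names (`LogMessage`, `AuditLogBatcher`) are hypothetical and this is not the actual Enso implementation:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical message type: each pending log message remembers the endpoint
// it was scheduled for (mock cloud vs. real cloud) plus its JSON payload.
record LogMessage(URI endpoint, String payload) {}

class AuditLogBatcher {
    /** Instead of sending the whole batch to the endpoint of its first
     *  message, split the batch into one group per endpoint; each group is
     *  then sent as a separate request to its own URI. */
    static Map<URI, List<LogMessage>> groupByEndpoint(List<LogMessage> batch) {
        Map<URI, List<LogMessage>> groups = new LinkedHashMap<>();
        for (LogMessage message : batch) {
            groups.computeIfAbsent(message.endpoint(), k -> new ArrayList<>())
                  .add(message);
        }
        return groups;
    }
}
```

With each group sent separately, messages left over from a previous test can no longer redirect a later test's logs to the wrong endpoint.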

@radeusgd
Member Author

radeusgd commented Oct 9, 2024

The problem appeared because I did not expect that messages scheduled for the 'real cloud endpoint' could remain stuck in the queue while the next tests, which are supposed to talk to the 'mock cloud', are executing. Because these messages were in the queue, a whole batch, part of which was supposed to go to the mock, would be sent to the 'real cloud'.

This was not happening before, because we were not authenticated to the cloud, so no tests were talking to the real cloud - they were instead falling back to logging audit messages to our local logger.

Something changed recently (1-2 weeks ago) so that our CI is now running authenticated to our Cloud. Thus the tests were logged in and were sending audit logs to the cloud instead of stderr, interfering with the mock tests. This change triggered the faulty logic that had previously lain dormant.

I suspect the trigger was most likely #11198 - when we run the GUI E2E tests on our CI, they most likely create an ~/.enso/credentials file on the runners that is not cleaned up afterwards - thus on subsequent runs our runners are authenticated.

mergify bot pushed a commit that referenced this issue Oct 9, 2024
- While #11255 fixes the root cause of #11278, this PR fixes what triggered it - since #11198 our runners were keeping the `~/.enso/credentials` file between runs, meaning that some tests were unexpectedly running in 'authenticated' mode. This PR cleans up that file to avoid the problem.
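For illustration only, a minimal sketch of that cleanup idea in Java; the actual fix is a cleanup step in the CI workflow, and this helper class is hypothetical:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical cleanup helper: remove a stale credentials file left behind by
// an earlier run, so subsequent test runs start out unauthenticated.
class CredentialsCleanup {
    static void removeStaleCredentials() throws IOException {
        Path credentials =
            Path.of(System.getProperty("user.home"), ".enso", "credentials");
        Files.deleteIfExists(credentials);
    }
}
```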
@enso-bot

enso-bot bot commented Oct 10, 2024

Radosław Waśko reports a new STANDUP for yesterday (2024-10-09):

Progress: Figured out everything that was happening with the audit log failures. Fixed the batching problem in #11255 and also added PR #11285, which cleans up after E2E tests to avoid accidentally running our tests as a logged-in user. Continuing work on asset id in audit logs - got most of the logic done and added tests. It should be finished by 2024-10-10.

Next Day: I will be working on the #9869 task. Make sure tests are passing.

@mergify mergify bot closed this as completed in #11255 Oct 10, 2024
@mergify mergify bot closed this as completed in 3458fe4 Oct 10, 2024
@enso-bot

enso-bot bot commented Oct 11, 2024

Radosław Waśko reports a new STANDUP for yesterday (2024-10-10):

Progress: Fixed a bug caused by my typo in the pending metadata PR containing the audit log fix. Finished and merged the asset id in audit logs PR. Created tickets for next tasks regarding datalinks and DBs. Started work on enabling audit logs for Snowflake. It should be finished by 2024-10-10.

Next Day: I will be working on the #11292 task. Continue - extract common code and ensure tests are set up correctly.
