Jeli 75 #79
Conversation
rkanson commented Jul 15, 2024 (edited)
- try to handle messages asynchronously, letting the main loop receive as fast as possible
- this could spawn an ever-increasing number of goroutines; we'll see whether it can keep up
- use a new buffer for each receive call to fetch/read messages from
- handle large messages
- dynamically fetch the max buffer size from rmem_max (mounted in)
- fall back to the config value if not found

To investigate: is the 8 million buffer being set, and is it enough?
To address this, I've added some code that tries to fetch the value from disk. Needs https://github.com/pantheon-systems/cos-daemonsets/pull/1186 to mount the file in from the host.
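Fetching the kernel's max receive buffer size from the mounted file might look like the sketch below. The mount path and the 8 MB fallback constant are assumptions for illustration; the real values come from the DaemonSet spec and the daemon's config, not from this snippet.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readRmemMax returns the kernel's maximum receive buffer size read from
// the mounted rmem_max file, falling back to the configured default when
// the file is missing or its contents can't be parsed.
func readRmemMax(path string, fallback int) int {
	data, err := os.ReadFile(path)
	if err != nil {
		return fallback
	}
	v, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil || v <= 0 {
		return fallback
	}
	return v
}

func main() {
	// Hypothetical mount point; the host's /proc/sys/net/core/rmem_max
	// would be mounted here by cos-daemonsets.
	size := readRmemMax("/host/proc/sys/net/core/rmem_max", 8*1024*1024)
	fmt.Println(size)
}
```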
Nice work!! This looks like some much-needed improvement to the way we're consuming data off the wire. My only finding is the need for a recover block on that goroutine. If we don't protect it, a panic will crash the whole application.
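The suggested recover guard could be sketched as below. The handler signature and function names are placeholders, not pauditd's actual API; the point is that the deferred recover keeps a panic inside the spawned goroutine from taking down the daemon.

```go
package main

import (
	"fmt"
	"log"
)

// safeHandle runs the message handler in its own goroutine and converts
// any panic into a logged error instead of crashing the whole process.
func safeHandle(msg []byte, handle func([]byte)) {
	go func() {
		defer func() {
			if r := recover(); r != nil {
				log.Printf("message handler panicked: %v", r)
			}
		}()
		handle(msg)
	}()
}

func main() {
	done := make(chan struct{})
	safeHandle([]byte("event"), func(b []byte) {
		defer close(done)
		fmt.Println(string(b))
		panic("boom") // demonstrates the panic is contained, not fatal
	})
	<-done
}
```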
Thanks so much for putting this together!
@rkanson I had another question on this: can we reproduce this issue in sandbox, and are we able to test these changes there? I'm not sure how we'd explicitly reproduce it if it's not already happening, though I imagine it could require scripting a flood of changes to stress the system.
I'm not sure how we could emulate the same level of events. We can verify that the daemon works in sandbox to make sure we're not deploying a broken change, and then test on a single pod with an updated image in prod to limit the blast radius.
I have some concerns regarding these code changes: the use of a goroutine to receive messages, and the dynamic buffer size.
Correct, but I don't think we care about the order. pauditd is just sending pubsub messages to diff-watcher saying "Hey, files updated. [bid=$BID, endpoint=$EP, trace=$TRACE]. Generate a diff and post it to the dashboard."
There is no "lag" so to speak -- we're unable to process messages as fast as the system is generating them, so we're losing them and never publishing them to pubsub.
The changes will definitely increase overhead -- that's a tradeoff I'm willing to accept to improve message receiving/processing performance.
Previously, we had a fixed-size buffer.
We don't log message size, but we do still see messages in the logs about possible missed sequences.
@rkanson I have no issues with the proposed changes as long as they are thoroughly tested and we can ensure they don't introduce new problems. However, my primary concern remains the detection of missed sequences. With this change, the current detection mechanism will no longer work reliably, meaning we won't be able to confirm whether a sequence was missed. Therefore, we cannot guarantee that this error is fully resolved.
I don't think the order of messages matters, unless you have reason to believe otherwise? If you inspect the message being parsed by diff-watcher, it only cares about binding, hostname, and trace id. It uses those to make a request to endpoint-rest /diffstat to generate the latest diff. If messages get processed out of order, it will still fetch the diff from the latest changes on disk.
I agree that the order of messages doesn't matter when processing them; however, when I say the logic to calculate missing sequences might break, I mean the function in pauditd/pkg/marshaller/marshaller.go (line 179 at 4f223bc).
Unless there are wildly different execution times for each set of messages at the beginning of the Consume function, each goroutine should be created and tracked sequentially. Even if they get held up later in the routine during the network call, I believe they should reach at least the detectMissing call in sequential order. If they don't, though, I don't think it will cause anything more than log messages. Frankly, we could probably drop that part of the code altogether -- to my knowledge it's just a holdover from forking Slack's go-audit project, and we've never cared about message order. We just wanted something we controlled that notifies us when paths we care about change so we can publish events.
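For context, a missed-sequence check of this kind amounts to comparing each arriving sequence number against the last one seen. The sketch below is a simplified illustration, not the actual detectMissing implementation from marshaller.go; it also shows why out-of-order arrival (as with concurrent goroutines) would produce false positives.

```go
package main

import "fmt"

// detectGaps reports sequence numbers skipped between the last seen
// sequence and the newly arrived one. It assumes sequences are fed in
// arrival order; if goroutines reorder delivery, a later sequence
// arriving first makes earlier ones look "missed".
func detectGaps(last, seq uint64) []uint64 {
	var missed []uint64
	for s := last + 1; s < seq; s++ {
		missed = append(missed, s)
	}
	return missed
}

func main() {
	// Last seen 10, now 13 arrives: 11 and 12 appear missed.
	fmt.Println(detectGaps(10, 13))
}
```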
Approving it; however, let's test before deploying it to prod.
Force-pushed from f62726d to 44a35a6.
Closing to preserve for history. Feel free to reopen if necessary.