
Jeli 75 #79

Closed. rkanson wants to merge 7 commits from the jeli-75 branch.

Conversation

@rkanson (Member) commented Jul 15, 2024

  • Try to handle messages asynchronously, letting the main loop receive as fast as possible.
    • This could spawn an ever-increasing number of goroutines; we'll see whether it can keep up.
    • Use a new buffer for each receive call to fetch/read messages from.
  • Handle large messages.
  • Dynamically fetch the max buffer size from rmem_max (mounted in); see the sketch below.
    • Falls back to the config value if not found.
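
The overall shape of these changes, as a minimal sketch with hypothetical names (receiveBufferSize, receiveLoop, handle) rather than the actual pauditd code:

package main

import (
	"fmt"
	"io"
	"os"
	"strconv"
	"strings"
	"time"
)

// receiveBufferSize returns net.core.rmem_max if the file is mounted in,
// falling back to the configured size otherwise.
func receiveBufferSize(path string, fallback int) int {
	data, err := os.ReadFile(path)
	if err != nil {
		return fallback
	}
	v, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil || v <= 0 {
		return fallback
	}
	return v
}

// receiveLoop allocates a fresh buffer for every receive and hands each
// message off to its own goroutine so the loop can keep receiving.
func receiveLoop(sock io.Reader, bufSize int, handle func([]byte)) {
	for {
		buf := make([]byte, bufSize)
		n, err := sock.Read(buf)
		if err != nil {
			return
		}
		go handle(buf[:n])
	}
}

func main() {
	size := receiveBufferSize("/proc/sys/net/core/rmem_max", 212992)
	fmt.Println("receive buffer size:", size)

	// A fake "socket" stands in for the netlink socket to show the handoff shape.
	fake := strings.NewReader("audit message")
	receiveLoop(fake, size, func(msg []byte) { fmt.Printf("handled %d bytes\n", len(msg)) })
	time.Sleep(100 * time.Millisecond) // let the goroutine finish in this toy example
}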

@rkanson (Member, Author) commented Jul 15, 2024

To investigate: is the 8-million-byte buffer actually being set, and is it enough?
From an appserver:

rjkanson@appserver-0be56056 ~ $ sudo ls /proc/sys/net/core | grep rmem
rmem_default
rmem_max
rjkanson@appserver-0be56056 ~ $ sudo cat /proc/sys/net/core/rmem_max
4000000
rjkanson@appserver-0be56056 ~ $ sudo cat /proc/sys/net/core/rmem_default
212992
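
For reference, an unprivileged setsockopt(SO_RCVBUF) is silently clamped by the kernel to net.core.rmem_max (and the accepted value is doubled for bookkeeping before getsockopt reports it back), so a request for 8 million bytes on a host with rmem_max=4000000 will not yield the full amount. A quick way to check what was actually applied, sketched here against a plain UDP socket rather than the audit netlink socket:

package main

import (
	"fmt"
	"syscall"
)

func main() {
	// A plain UDP socket stands in for the audit netlink socket.
	fd, err := syscall.Socket(syscall.AF_INET, syscall.SOCK_DGRAM, 0)
	if err != nil {
		panic(err)
	}
	defer syscall.Close(fd)

	// Ask for the 8-million-byte buffer mentioned above.
	if err := syscall.SetsockoptInt(fd, syscall.SOL_SOCKET, syscall.SO_RCVBUF, 8_000_000); err != nil {
		panic(err)
	}

	// getsockopt reports what the kernel actually applied: the requested size,
	// clamped by rmem_max for unprivileged callers, then doubled.
	got, err := syscall.GetsockoptInt(fd, syscall.SOL_SOCKET, syscall.SO_RCVBUF)
	if err != nil {
		panic(err)
	}
	fmt.Println("effective SO_RCVBUF:", got)
}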

@rkanson (Member, Author) commented Jul 16, 2024

To address this, I've added code that tries to fetch the value from disk. It needs https://github.com/pantheon-systems/cos-daemonsets/pull/1186 to mount the file in from the host.

rkanson marked this pull request as ready for review July 16, 2024 20:27
rkanson requested a review from a team as a code owner July 16, 2024 20:27
rkanson requested a review from a team July 16, 2024 20:27
@mfb2 (Member) requested changes and left a comment Jul 17, 2024 (review later dismissed)

Nice work!! This looks like some much-needed improvement to the way we're consuming the data off the wire. The only finding is the need for a recover block on that goroutine; if we don't protect it, a panic will crash the whole application.

Thanks so much for putting this together!

audit.go (2 resolved review comments)
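
The recover protection mfb2 is asking for is roughly the following shape: wrap the body of each spawned goroutine so a panic in one message handler becomes a log line instead of crashing the daemon. This is a sketch with hypothetical names (handleAsync, process), not the exact code that was added:

package main

import (
	"log"
	"time"
)

// handleAsync processes one message in its own goroutine; the deferred recover
// turns a panic inside process into a log entry instead of a process-wide crash.
func handleAsync(msg []byte, process func([]byte)) {
	go func() {
		defer func() {
			if r := recover(); r != nil {
				log.Printf("recovered from panic while handling message: %v", r)
			}
		}()
		process(msg)
	}()
}

func main() {
	handleAsync([]byte("boom"), func(b []byte) { panic(string(b)) })
	time.Sleep(50 * time.Millisecond) // give the goroutine time to run in this toy example
	log.Println("still running")
}
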
@mfb2 (Member) commented Jul 17, 2024

@rkanson I had another question on this:

Can we reproduce this issue in sandbox, and are we able to test these changes there by chance? Not sure how we'd explicitly reproduce this if it's not already happening, though I imagine it could require scripting a bunch of changes to flood the system.

@rkanson (Member, Author) commented Jul 17, 2024

I'm not sure how we could emulate the same level of events. We can verify that the daemon works in sandbox to make sure we're not deploying a broken change, and then test on a single pod w/ an updated image in prod to limit the blast radius.

rkanson requested a review from mfb2 July 17, 2024 19:29
@aamirtiwari (Contributor) commented:

I have some concerns regarding these code changes:

Use of goroutines to receive messages:

  • Currently, messages are mostly processed in the order they are received. With this change, they may not be processed sequentially, making the code that detects missing sequences much more unreliable.
  • Additionally, not processing messages sequentially could introduce issues that need thorough testing in a sandbox environment with multiple messages from different operations to see how it behaves.
  • The primary benefit of this change is improving performance. However, I don't see any performance lags causing issues for customers that need immediate attention. If we are seeing "no buffer space available," I think it would be better to increase the socket_buffer.receive size in the config, as you have already done by using rmemMax.

Dynamic buffer size:

  • I am concerned that this change could cause additional overhead due to repeated peeking and buffer resizing.
  • Repeated peeking could also affect performance. While the first change aims to improve performance, this change might introduce potential issues we are overlooking.
  • There could be higher memory consumption, since messages larger than MAX_AUDIT_MESSAGE_LENGTH would now be processed. Is there any evidence of larger messages being missed that this change would fix?

@rkanson (Member, Author) commented Jul 18, 2024

@aamirtiwari

  • Currently, messages are mostly processed in the order they are received. With this change, they may not be processed sequentially, making the code that detects missing sequences much more unreliable.

Correct, but I don't think we care about the order. pauditd is just sending pubsub messages to diff-watcher saying "Hey, files updated. [bid=$BID, endpoint=$EP, trace=$TRACE]. Generate a diff and post it to the dashboard."

  • The primary benefit of this change is improving performance. However, I don't see any performance lags causing issues for customers that need immediate attention. If we are seeing "no buffer space available," I think it would be better to increase the socket_buffer.receive size in the config, as you have already done by using rmemMax.

There is no "lag" so to speak -- we're unable to process messages as fast as the system is generating them, so we're losing them and not publishing messages to pubsub.
The socket_buffer.receive value is already 2x the maximum size an appserver reports it can support. We can try increasing it, but I don't have high confidence in that change.

  • I am concerned that this change could cause additional overhead due to repeated peeking and buffer resizing.

The changes will definitely increase overhead -- that's a tradeoff I'm willing to accept to increase the message receiving/processing performance.

  • Repeated peeking could also affect performance. While the first change aims to improve performance, this change might introduce potential issues we are overlooking.

Previously, we had a fixed-size buffer. MAX_AUDIT_MESSAGE_LENGTH only works for us if, every time we call Receive, there is exactly one message to be processed, or the combined messages total fewer bytes than that value. The socket is continually writing messages, and we were previously only processing them as fast as we could receive, consume, and post a message to the notification service. In the time it takes to do that, the socket can write multiple messages and exceed that byte limit.
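
For reference, the "peeking" being discussed usually means asking the kernel for the size of the pending datagram before reading it, for example via MSG_PEEK|MSG_TRUNC, and sizing the read buffer accordingly. A rough sketch of that pattern, demonstrated on a local datagram socket pair rather than the audit netlink socket, and not necessarily the exact approach in this PR:

package main

import (
	"fmt"
	"syscall"
)

// nextMessageSize peeks at the pending datagram on fd and returns its full
// length without consuming it: MSG_TRUNC makes the kernel report the real
// size even though the probe buffer is a single byte.
func nextMessageSize(fd int) (int, error) {
	probe := make([]byte, 1)
	n, _, err := syscall.Recvfrom(fd, probe, syscall.MSG_PEEK|syscall.MSG_TRUNC)
	if err != nil {
		return 0, err
	}
	return n, nil
}

func main() {
	// A local datagram socket pair stands in for the netlink socket.
	fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_DGRAM, 0)
	if err != nil {
		panic(err)
	}
	msg := []byte("a message larger than the one-byte probe buffer")
	if _, err := syscall.Write(fds[0], msg); err != nil {
		panic(err)
	}

	size, err := nextMessageSize(fds[1])
	if err != nil {
		panic(err)
	}
	buf := make([]byte, size) // size the real read buffer to the pending message
	n, _, err := syscall.Recvfrom(fds[1], buf, 0)
	if err != nil {
		panic(err)
	}
	fmt.Printf("peeked size %d, read %d bytes\n", size, n)
}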

  • There could be higher memory consumption due to allowing message processing with message sizes higher than MAX_AUDIT_MESSAGE_LENGTH. Is there any evidence of missed messages of higher size that this change would fix?

We don't log message size, but we do still see messages in the logs about possible missed sequences.

Likely missed sequence 10325845, current 10326346, worst message delay 58

@aamirtiwari (Contributor) commented:

@rkanson I have no issues with the proposed changes as long as they are thoroughly tested and we can ensure that they do not introduce any new problems. However, my primary concern remains regarding the detection of missed sequences. With this change, the current detection mechanism will no longer work reliably, meaning we won't be able to confirm if a sequence was missed. Therefore, we cannot guarantee that this error is fully resolved.

@rkanson (Member, Author) commented Jul 19, 2024

@aamirtiwari

I don't think the order of messages matters, unless you have reason to believe otherwise? If you inspect the message being parsed by diff-watcher, it only cares about binding, hostname, and trace id. It uses those to make a request to endpoint-rest /diffstat to generate the latest diff. If messages get processed out of order, it will still fetch the diff from the latest changes on disk.

@aamirtiwari (Contributor) commented Jul 19, 2024

I agree that the order of messages doesn't matter when processing them. However, when I say the logic that calculates missing sequences might break, I mean this function:

func (a *AuditMarshaller) detectMissing(seq int) {

Please check it, and if you feel it should not be affected, we can proceed with testing the changes.
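
For context, a sequence-gap detector of this kind is typically just "remember the last sequence seen and flag any jump". The hedged sketch below (not the actual pauditd implementation) shows why out-of-order handoff matters: a later sequence arriving first makes the skipped ones look missed, even though they are merely delayed.

package main

import "log"

// marshaller stands in for the relevant state of AuditMarshaller.
type marshaller struct {
	lastSeq int
}

// detectMissing flags gaps between the sequence just received and the last
// one seen. If messages reach it out of order, it emits spurious warnings.
func (m *marshaller) detectMissing(seq int) {
	if m.lastSeq != 0 && seq > m.lastSeq+1 {
		for missed := m.lastSeq + 1; missed < seq; missed++ {
			log.Printf("Likely missed sequence %d, current %d", missed, seq)
		}
	}
	if seq > m.lastSeq {
		m.lastSeq = seq
	}
}

func main() {
	m := &marshaller{}
	for _, seq := range []int{1, 2, 5, 3, 4, 6} { // out-of-order delivery
		m.detectMissing(seq)
	}
}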

@rkanson (Member, Author) commented Jul 19, 2024

@aamirtiwari

Unless there are some wildly different execution times for each set of messages at the beginning of the Consume function, each goroutine should be created and tracked sequentially. Even if one of them gets held up later in the routine during the network call, I believe they should make it at least to the detectMissing call in sequential order.

If they don't, though, I don't think it will cause anything more than log messages. Frankly, we can probably drop that part of the code altogether; to my knowledge it's just a holdover from forking the Slack go-audit project, and we've never cared about message order. We just wanted something we controlled that notifies us when paths we care about change, so we can publish events.

@aamirtiwari (Contributor) commented:

Approving it; however, let's test before deploying it to prod.

rkanson dismissed mfb2’s stale review July 19, 2024 19:11

Addressed the comment about panics

rkanson force-pushed the jeli-75 branch 3 times, most recently from f62726d to 44a35a6, on July 19, 2024 20:08
@rkanson (Member, Author) commented Jul 29, 2024

Closing to preserve for history. Feel free to reopen if necessary.

rkanson closed this Jul 29, 2024