
Jeli 75 #79

Closed. rkanson wants to merge 7 commits from the jeli-75 branch.

Conversation

@rkanson (Member) commented Jul 15, 2024

  • Try to handle messages asynchronously, letting the main loop receive as fast as possible.
    • This could spawn an ever-increasing number of goroutines; we'll see whether it can keep up.
    • Use a new buffer for each receive call to fetch/read messages from.
  • Handle large messages.
  • Dynamically fetch the max buffer size from rmem_max (mounted in); see the sketch below.
    • Falls back to the config value if not found.
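
The overall shape of these changes, as a minimal sketch with hypothetical names (receiveBufferSize, receiveLoop, handle) rather than the actual pauditd code:

package main

import (
	"fmt"
	"io"
	"os"
	"strconv"
	"strings"
	"time"
)

// receiveBufferSize returns net.core.rmem_max if the file is mounted in,
// falling back to the configured size otherwise.
func receiveBufferSize(path string, fallback int) int {
	data, err := os.ReadFile(path)
	if err != nil {
		return fallback
	}
	v, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil || v <= 0 {
		return fallback
	}
	return v
}

// receiveLoop allocates a fresh buffer for every receive and hands each
// message off to its own goroutine so the loop can keep receiving.
func receiveLoop(sock io.Reader, bufSize int, handle func([]byte)) {
	for {
		buf := make([]byte, bufSize)
		n, err := sock.Read(buf)
		if err != nil {
			return
		}
		go handle(buf[:n])
	}
}

func main() {
	size := receiveBufferSize("/proc/sys/net/core/rmem_max", 212992)
	fmt.Println("receive buffer size:", size)

	// A fake "socket" stands in for the netlink socket to show the handoff shape.
	fake := strings.NewReader("audit message")
	receiveLoop(fake, size, func(msg []byte) { fmt.Printf("handled %d bytes\n", len(msg)) })
	time.Sleep(100 * time.Millisecond) // let the goroutine finish in this toy example
}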

@rkanson (Member, Author) commented Jul 15, 2024

To investigate: is the 8-million-byte buffer actually being set, and is it enough?
From an appserver:

rjkanson@appserver-0be56056 ~ $ sudo ls /proc/sys/net/core | grep rmem
rmem_default
rmem_max
rjkanson@appserver-0be56056 ~ $ sudo cat /proc/sys/net/core/rmem_max
4000000
rjkanson@appserver-0be56056 ~ $ sudo cat /proc/sys/net/core/rmem_default
212992
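
For reference, an unprivileged setsockopt(SO_RCVBUF) is silently clamped by the kernel to net.core.rmem_max (and the accepted value is doubled for bookkeeping before getsockopt reports it back), so a request for 8 million bytes on a host with rmem_max=4000000 will not yield the full amount. A quick way to check what was actually applied, sketched here against a plain UDP socket rather than the audit netlink socket:

package main

import (
	"fmt"
	"syscall"
)

func main() {
	// A plain UDP socket stands in for the audit netlink socket.
	fd, err := syscall.Socket(syscall.AF_INET, syscall.SOCK_DGRAM, 0)
	if err != nil {
		panic(err)
	}
	defer syscall.Close(fd)

	// Ask for the 8-million-byte buffer mentioned above.
	if err := syscall.SetsockoptInt(fd, syscall.SOL_SOCKET, syscall.SO_RCVBUF, 8_000_000); err != nil {
		panic(err)
	}

	// getsockopt reports what the kernel actually applied: the requested size,
	// clamped by rmem_max for unprivileged callers, then doubled.
	got, err := syscall.GetsockoptInt(fd, syscall.SOL_SOCKET, syscall.SO_RCVBUF)
	if err != nil {
		panic(err)
	}
	fmt.Println("effective SO_RCVBUF:", got)
}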

@rkanson (Member, Author) commented Jul 16, 2024

To address this, I've added code that tries to fetch the value from disk. It needs https://github.com/pantheon-systems/cos-daemonsets/pull/1186 to mount the file in from the host.

rkanson marked this pull request as ready for review July 16, 2024 20:27
rkanson requested a review from a team as a code owner July 16, 2024 20:27
rkanson requested a review from a team July 16, 2024 20:27
@mfb2 (Member) requested changes and left a comment Jul 17, 2024 (review later dismissed)

Nice work!! This looks like some much-needed improvement to the way we're consuming the data off the wire. The only finding is the need for a recover block on that goroutine; if we don't protect it, a panic will crash the whole application.

Thanks so much for putting this together!

audit.go (2 resolved review comments)
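
The recover protection mfb2 is asking for is roughly the following shape: wrap the body of each spawned goroutine so a panic in one message handler becomes a log line instead of crashing the daemon. This is a sketch with hypothetical names (handleAsync, process), not the exact code that was added:

package main

import (
	"log"
	"time"
)

// handleAsync processes one message in its own goroutine; the deferred recover
// turns a panic inside process into a log entry instead of a process-wide crash.
func handleAsync(msg []byte, process func([]byte)) {
	go func() {
		defer func() {
			if r := recover(); r != nil {
				log.Printf("recovered from panic while handling message: %v", r)
			}
		}()
		process(msg)
	}()
}

func main() {
	handleAsync([]byte("boom"), func(b []byte) { panic(string(b)) })
	time.Sleep(50 * time.Millisecond) // give the goroutine time to run in this toy example
	log.Println("still running")
}
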
@mfb2 (Member) commented Jul 17, 2024

@rkanson I had another question on this:

Can we reproduce this issue in sandbox, and are we able to test these changes there by chance? Not sure how we'd explicitly reproduce this if it's not already happening, though I imagine it could require scripting a bunch of changes to flood the system.

@rkanson (Member, Author) commented Jul 17, 2024

I'm not sure how we could emulate the same level of events. We can verify that the daemon works in sandbox to make sure we're not deploying a broken change, and then test on a single pod w/ an updated image in prod to limit the blast radius.

rkanson requested a review from mfb2 July 17, 2024 19:29
@aamirtiwari (Contributor) commented:

I have some concerns regarding these code changes:

Use of goroutines to receive messages:

  • Currently, messages are mostly processed in the order they are received. With this change, they may not be processed sequentially, making the code that detects missing sequences much more unreliable.
  • Additionally, not processing messages sequentially could introduce issues that need thorough testing in a sandbox environment with multiple messages from different operations to see how it behaves.
  • The primary benefit of this change is improving performance. However, I don't see any performance lags causing issues for customers that need immediate attention. If we are seeing "no buffer space available," I think it would be better to increase the socket_buffer.receive size in the config, as you have already done by using rmemMax.

Dynamic buffer size:

  • I am concerned that this change could cause additional overhead due to repeated peeking and buffer resizing.
  • Repeated peeking could also affect performance. While the first change aims to improve performance, this change might introduce potential issues we are overlooking.
  • There could be higher memory consumption, since messages larger than MAX_AUDIT_MESSAGE_LENGTH would now be processed. Is there any evidence of larger messages being missed that this change would fix?

@rkanson (Member, Author) commented Jul 18, 2024

@aamirtiwari

  • Currently, messages are mostly processed in the order they are received. With this change, they may not be processed sequentially, making the code that detects missing sequences much more unreliable.

Correct, but I don't think we care about the order. pauditd is just sending pubsub messages to diff-watcher saying "Hey, files updated. [bid=$BID, endpoint=$EP, trace=$TRACE]. Generate a diff and post it to the dashboard."

  • The primary benefit of this change is improving performance. However, I don't see any performance lags causing issues for customers that need immediate attention. If we are seeing "no buffer space available," I think it would be better to increase the socket_buffer.receive size in the config, as you have already done by using rmemMax.

There is no "lag" so to speak -- we're unable to process messages as fast as the system is generating them, so we're losing them and not publishing messages to pubsub.
The socket_buffer.receive value is already 2x the maximum size an appserver reports it can support. We can try increasing it, but I don't have high confidence in that change.

  • I am concerned that this change could cause additional overhead due to repeated peeking and buffer resizing.

The changes will definitely increase overhead -- that's a tradeoff I'm willing to accept to increase the message receiving/processing performance.

  • Repeated peeking could also affect performance. While the first change aims to improve performance, this change might introduce potential issues we are overlooking.

Previously, we had a fixed-size buffer. MAX_AUDIT_MESSAGE_LENGTH only works for us if, every time we call Receive, there is exactly one message to be processed, or the combined messages total fewer bytes than that value. The socket is continually writing messages, and we were previously only processing them as fast as we could receive, consume, and post a message to the notification service. In the time it takes to do that, the socket can write multiple messages and exceed that byte limit.
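
For reference, the "peeking" being discussed usually means asking the kernel for the size of the pending datagram before reading it, for example via MSG_PEEK|MSG_TRUNC, and sizing the read buffer accordingly. A rough sketch of that pattern, demonstrated on a local datagram socket pair rather than the audit netlink socket, and not necessarily the exact approach in this PR:

package main

import (
	"fmt"
	"syscall"
)

// nextMessageSize peeks at the pending datagram on fd and returns its full
// length without consuming it: MSG_TRUNC makes the kernel report the real
// size even though the probe buffer is a single byte.
func nextMessageSize(fd int) (int, error) {
	probe := make([]byte, 1)
	n, _, err := syscall.Recvfrom(fd, probe, syscall.MSG_PEEK|syscall.MSG_TRUNC)
	if err != nil {
		return 0, err
	}
	return n, nil
}

func main() {
	// A local datagram socket pair stands in for the netlink socket.
	fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_DGRAM, 0)
	if err != nil {
		panic(err)
	}
	msg := []byte("a message larger than the one-byte probe buffer")
	if _, err := syscall.Write(fds[0], msg); err != nil {
		panic(err)
	}

	size, err := nextMessageSize(fds[1])
	if err != nil {
		panic(err)
	}
	buf := make([]byte, size) // size the real read buffer to the pending message
	n, _, err := syscall.Recvfrom(fds[1], buf, 0)
	if err != nil {
		panic(err)
	}
	fmt.Printf("peeked size %d, read %d bytes\n", size, n)
}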

  • There could be higher memory consumption due to allowing message processing with message sizes higher than MAX_AUDIT_MESSAGE_LENGTH. Is there any evidence of missed messages of higher size that this change would fix?

We don't log message size, but we do still see messages in the logs about possible missed sequences.

Likely missed sequence 10325845, current 10326346, worst message delay 58

@aamirtiwari (Contributor) commented:

@rkanson I have no issues with the proposed changes as long as they are thoroughly tested and we can ensure that they do not introduce any new problems. However, my primary concern remains regarding the detection of missed sequences. With this change, the current detection mechanism will no longer work reliably, meaning we won't be able to confirm if a sequence was missed. Therefore, we cannot guarantee that this error is fully resolved.

@rkanson (Member, Author) commented Jul 19, 2024

@aamirtiwari

I don't think the order of messages matters, unless you have reason to believe otherwise? If you inspect the message being parsed by diff-watcher, it only cares about binding, hostname, and trace id. It uses those to make a request to endpoint-rest /diffstat to generate the latest diff. If messages get processed out of order, it will still fetch the diff from the latest changes on disk.

@aamirtiwari (Contributor) commented Jul 19, 2024

I agree that the order of messages doesn't matter when processing them. However, when I say the logic that calculates missing sequences might break, I mean this function:

func (a *AuditMarshaller) detectMissing(seq int) {

Please check it, and if you feel it should not be affected, we can proceed with testing the changes.
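
For context, a sequence-gap detector of this kind is typically just "remember the last sequence seen and flag any jump". The hedged sketch below (not the actual pauditd implementation) shows why out-of-order handoff matters: a later sequence arriving first makes the skipped ones look missed, even though they are merely delayed.

package main

import "log"

// marshaller stands in for the relevant state of AuditMarshaller.
type marshaller struct {
	lastSeq int
}

// detectMissing flags gaps between the sequence just received and the last
// one seen. If messages reach it out of order, it emits spurious warnings.
func (m *marshaller) detectMissing(seq int) {
	if m.lastSeq != 0 && seq > m.lastSeq+1 {
		for missed := m.lastSeq + 1; missed < seq; missed++ {
			log.Printf("Likely missed sequence %d, current %d", missed, seq)
		}
	}
	if seq > m.lastSeq {
		m.lastSeq = seq
	}
}

func main() {
	m := &marshaller{}
	for _, seq := range []int{1, 2, 5, 3, 4, 6} { // out-of-order delivery
		m.detectMissing(seq)
	}
}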

@rkanson (Member, Author) commented Jul 19, 2024

@aamirtiwari

Unless there are some wildly different execution times for each set of messages at the beginning of the Consume function, each goroutine should be created and tracked sequentially. Even if one of them gets held up later in the routine during the network call, I believe they should make it at least to the detectMissing call in sequential order.

If they don't, though, I don't think it will cause anything more than log messages. Frankly, we can probably drop that part of the code altogether; to my knowledge it's just a holdover from forking the Slack go-audit project, and we've never cared about message order. We just wanted something we controlled that notifies us when paths we care about change, so we can publish events.

@aamirtiwari (Contributor) commented:

Approving it; however, let's test before deploying it to prod.

rkanson dismissed mfb2’s stale review July 19, 2024 19:11

Addressed the comment about panics

rkanson force-pushed the jeli-75 branch 3 times, most recently from f62726d to 44a35a6, on July 19, 2024 20:08
@rkanson (Member, Author) commented Jul 29, 2024

Closing to preserve for history. Feel free to reopen if necessary.

rkanson closed this Jul 29, 2024