
Fix: message-id in postprocessor/gelf-chunking #2662

Merged
8 commits merged into tremor-rs:main from fix-gelf-post-processor on Oct 21, 2024

Conversation

BharatKJain
Contributor

Pull request

Description

Changed message-id from auto-increment ID to randomized ID in postprocessor/gelf_chunking.rs

HELP NEEDED: I have avoided adding the hostname while producing the message-id; I am not sure how we can handle adding the hostname, please suggest.

Related

Checklist

  • The RFC, if required, has been submitted and approved
  • Any user-facing impact of the changes is reflected in docs.tremor.rs
  • The code is tested
  • Use of unsafe code is reasoned about in a comment
  • Update CHANGELOG.md appropriately, recording any changes, bug fixes, or other observable changes in behavior
  • The performance impact of the change is measured (see below)

Performance


codecov bot commented Sep 29, 2024

Codecov Report

Attention: Patch coverage is 97.54098% with 3 lines in your changes missing coverage. Please review.

Project coverage is 91.34%. Comparing base (f1f2b7f) to head (116a48f).
Report is 8 commits behind head on main.

Files with missing lines | Patch % | Lines
...mor-interceptor/src/postprocessor/gelf_chunking.rs | 97.54% | 3 Missing ⚠️

@@            Coverage Diff             @@
##             main    #2662      +/-   ##
==========================================
+ Coverage   91.27%   91.34%   +0.07%     
==========================================
  Files         308      308              
  Lines       60116    60229     +113     
==========================================
+ Hits        54868    55016     +148     
+ Misses       5248     5213      -35     
Flag Coverage Δ
e2e-command 11.27% <0.00%> (-0.02%) ⬇️
e2e-integration 50.52% <0.00%> (+0.09%) ⬆️
e2e-unit 12.56% <0.00%> (-0.02%) ⬇️
e2etests 52.85% <0.00%> (+0.09%) ⬆️
tremorapi 14.50% <0.00%> (-0.03%) ⬇️
tremorcodec 63.11% <ø> (ø)
tremorcommon 63.04% <ø> (ø)
tremorconnectors 28.81% <0.00%> (-0.05%) ⬇️
tremorconnectorsaws 11.25% <0.00%> (-0.03%) ⬇️
tremorconnectorsazure 4.68% <0.00%> (-0.02%) ⬇️
tremorconnectorsgcp 25.36% <0.00%> (+0.05%) ⬆️
tremorconnectorsobjectstorage 0.06% <ø> (ø)
tremorconnectorsotel 12.55% <0.00%> (-0.03%) ⬇️
tremorconnectorstesthelpers 68.25% <ø> (ø)
tremorinflux 87.71% <ø> (ø)
tremorinterceptor 55.34% <97.54%> (+0.98%) ⬆️
tremorpipeline 31.15% <ø> (ø)
tremorruntime 47.20% <0.00%> (-0.05%) ⬇️
tremorscript 55.06% <ø> (ø)
tremorsystem 5.78% <ø> (ø)
tremorvalue 69.52% <ø> (-0.04%) ⬇️
unittests 89.20% <97.54%> (+0.07%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines | Coverage Δ
...mor-interceptor/src/postprocessor/gelf_chunking.rs | 96.00% <97.54%> (+2.89%) ⬆️

... and 10 files with indirect coverage changes



Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f1f2b7f...116a48f.

Licenser previously approved these changes Sep 29, 2024
Member

@Licenser Licenser left a comment


👍 Looks reasonable; nothing I can see that prevents using this, and great documentation too :)

Two things I notice:

  1. it would be nice to mention how message IDs are generated in the docs (the `//!` section)

  2. if you are up for a challenge: we try to keep all time and randomness out of tremor to allow for deterministic replays. We do this by using ingest_ns as a random seed and as the time source; that way, an event logged with its ingest_ns can be replayed and will generate the exact same output. The random function is a good example. It'd be interesting to see the same concept reused here to allow repeatable yet still random message IDs. One way would be to use ingest_ns instead of the current epoch (which would be nice anyway, as looking up the time isn't fast), and then seed the RNG somehow (probably not with the ingest_ns, as that would make it useless) but perhaps with the first n bytes of the message, or with some bytes of a hash of the message?
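To make point 2 concrete, here is a toy sketch of deriving a deterministic id from ingest_ns plus a hash of the message bytes. The function names and the FNV-1a mixing are my own illustration, not tremor's actual code:

```rust
// Toy sketch of deterministic "randomness": derive the seed from the
// message bytes so a replay of the same event yields the same id.
// All names here are illustrative, not from the tremor codebase.

fn seed_from_message(msg: &[u8]) -> u64 {
    // simple FNV-1a hash over the first 8 bytes (or fewer) of the message
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in msg.iter().take(8) {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

fn message_id(ingest_ns: u64, msg: &[u8]) -> u64 {
    // mix the replayable timestamp with the message-derived seed
    ingest_ns ^ seed_from_message(msg)
}

fn main() {
    let msg = b"hello gelf";
    // same event replayed -> same id
    assert_eq!(message_id(123, msg), message_id(123, msg));
    println!("deterministic id: {}", message_id(123, msg));
}
```

As the thread below discusses, a pure content hash risks collisions for repeated messages, which is why the mix with ingest_ns (or a counter) matters.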

@BharatKJain
Contributor Author

  1. Okay, will do.

  2. Trying to think out loud: if we create a hash of a message, repetitive logs will produce the same message-id, which is a problem because message-ids have to be unique; collisions can cause problems when we're decoding the message on the server side. I am not completely sure, but ideally we would want uniqueness in the message-id to make sure that we are not breaking the server's GELF decoding.

(How would it break server-side due to a collision? The message-id is how the server determines whether a UDP packet belongs to an already existing log or to a new one; when we send the same message-id for multiple logs, the server will merge the data together, which ends up breaking the log.)

TBH I am also trying to figure this out, so please share any suggestions. Am I thinking about this right? 😅

@Licenser
Member

Licenser commented Oct 3, 2024

Ja, just the message content would not work. I'm still considering whether message content + ingest_ns (the nanosecond when the message was registered at tremor) would be enough: if a server produces the same log twice in the same nanosecond, that'd be very odd (but not impossible). OTOH, having two randomly generated numbers come out the same is also odd (but not impossible). It would also give a more deterministic failure case: "when messages with the same content arrive at exactly the same time, they will get duplicated message ids" instead of "if the RNG hates you, you'll get duplicated message ids".

@Licenser
Member

Licenser commented Oct 5, 2024

Sorry for all the back and forth. I've been thinking about a good way to make this:

  1. fast
  2. deterministic
  3. non-colliding

and wanted to throw out a suggestion: use ingest_ns + an incremental id as the message ID. This is:

  1. fast, since we just have to look at two integers: no RNG, no hashing
  2. deterministic, since ingest_ns is settable and the incremental counter is, well, incremental, so deterministic as well
  3. it avoids duplicates on the same system thanks to the incremental ids, and makes them extremely unlikely across multiple systems thanks to the nanosecond timestamps involved.
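A minimal sketch of that suggestion, with illustrative names (the exact bit packing was worked out later in the thread):

```rust
// Toy sketch of the ingest_ns + incremental-counter idea. `IdGen` and the
// mixing below are illustrative, not the actual tremor implementation.

struct IdGen {
    auto_increment_id: u64,
}

impl IdGen {
    fn new() -> Self {
        IdGen { auto_increment_id: 0 }
    }

    // Pure function of (ingest_ns, counter state): replaying the same event
    // stream reproduces the same ids, while the counter guarantees distinct
    // ids even within a single nanosecond on the same system.
    fn next(&mut self, ingest_ns: u64) -> u64 {
        let id = ingest_ns ^ self.auto_increment_id.rotate_left(32);
        self.auto_increment_id = self.auto_increment_id.wrapping_add(1);
        id
    }
}

fn main() {
    let mut gen = IdGen::new();
    // same nanosecond, different counter values -> different ids
    assert_ne!(gen.next(42), gen.next(42));
    // replay with a fresh generator -> identical first id
    assert_eq!(IdGen::new().next(42), 42);
    println!("ok");
}
```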

@BharatKJain
Contributor Author

> (quoting the three-point suggestion above)

Can I keep thread-id just to reduce the collision probability more? 😅

@BharatKJain BharatKJain force-pushed the fix-gelf-post-processor branch 4 times, most recently from 7870889 to 026f39c Compare October 5, 2024 19:02
@BharatKJain
Contributor Author

BharatKJain commented Oct 5, 2024

Okay, I have made the changes as suggested in the latest comment.

`(epoch_timestamp & BITS_13) | (auto_increment_id & !BITS_13) | (thread_id_u64 & BITS_13)`

(epoch_timestamp is ingest_ns when process() is called; in finish() there's no ingest_ns, so I am using the current timestamp.)

  • Added a unit-test for it.
  • Renamed id to auto_increment_id

Let me know your thoughts!

@Licenser
Member

Licenser commented Oct 6, 2024

I would remove the thread ID. It breaks the distribution and makes the data less unique, and it also prevents event ids from being replayable. Basically, it will result in the lower 13 bits having three times as many 1s as 0s, making them more likely to collide, while also making the id non-deterministic.

The second problem I spot is that by masking away the lower 13 bits of the increment, that part will have no effect for the first 8192 messages and then only change every 8192 messages. My suggestion would be to shift it by 13 bits instead of truncating them.

Lastly, I'd probably pull in a few more bits from the timestamp: 13 bits are only 0.0000082 secs, and even if you bump it to 16, you get a window of 0.000066 secs plus a 48-bit counter (or 281,474,976,710,656 events to count).

You'd end up with something like this:

```rust
(epoch_timestamp & 0xFF_FF) | (auto_increment_id << 16)
```

Do you think that would solve the original problem?
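A runnable sketch of that layout, assuming the low 16 bits carry the timestamp and the counter sits above them (names are illustrative, not the merged code):

```rust
// Sketch of the proposed id layout: low 16 bits come from the timestamp,
// the counter is shifted above them, so the counter changes the id on
// every message instead of only every 8192 messages (the problem with
// masking the counter's low bits away).

fn message_id(epoch_timestamp: u64, auto_increment_id: u64) -> u64 {
    (epoch_timestamp & 0xFF_FF) | (auto_increment_id << 16)
}

fn main() {
    let t = 1_700_000_000_000_000_123;
    // consecutive messages in the same nanosecond still differ:
    assert_ne!(message_id(t, 0), message_id(t, 1));
    // the low 16 bits always reflect the timestamp:
    assert_eq!(message_id(t, 5) & 0xFF_FF, t & 0xFF_FF);
    println!("ok");
}
```

Note the trade-off: shifting (rather than masking) the counter means every message changes the id, while the timestamp bits disambiguate across restarts and hosts.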

Licenser previously approved these changes Oct 14, 2024
Member

@Licenser Licenser left a comment


wooh wooh

@BharatKJain
Contributor Author

BharatKJain commented Oct 16, 2024

I will fix the clippy checks and DCO, please allow me some time.


Honestly, I didn't anticipate this, but I also created an otel gelf-exporter. :)

Opentelemetry PR for gelf-exporter (WIP)

@BharatKJain BharatKJain force-pushed the fix-gelf-post-processor branch 3 times, most recently from d63e383 to b42443d Compare October 16, 2024 19:41
@BharatKJain
Contributor Author

Line 190:

`let current_epoch_timestamp = u64::try_from(SystemTime::now().duration_since(UNIX_EPOCH).expect("SystemTime before UNIX EPOCH!").as_nanos())?;`

Not sure how to handle this better, is this okay?

@Licenser
Member

We'd want to avoid the expect. Let me explain why.

expect and unwrap cause a crash. In some applications that's okay: say you have a CLI; if something goes wrong, you might just want to error out and restart from scratch. Tremor is a bit more complex, though. There might be more than one pipeline running, and we don't want one pipeline to affect the progress of another; if we crashed in pipeline A, we'd take pipeline B down with it, and that's not desirable. So we handle errors and let them bubble up, so that only the pipeline causing the issue is affected.

To do that there are a few ways: `?` works in many places, but if the error type can't be mapped, something like `map_err(...)?` is a good alternative.
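A hedged sketch of that pattern, using `Box<dyn Error>` as a stand-in for tremor's actual error type:

```rust
// Sketch: bubble the error up with `?` instead of crashing with `expect`.
// The error type here is a placeholder, not tremor's real one.
use std::time::{SystemTime, UNIX_EPOCH};

fn current_epoch_timestamp() -> Result<u64, Box<dyn std::error::Error>> {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        // SystemTimeError doesn't auto-convert here, so map it explicitly
        .map_err(|e| format!("system time before UNIX epoch: {e}"))?
        .as_nanos();
    // u128 -> u64 can also fail; `?` converts the TryFromIntError for us
    Ok(u64::try_from(nanos)?)
}

fn main() {
    match current_epoch_timestamp() {
        Ok(ts) => println!("timestamp: {ts}"),
        Err(e) => eprintln!("error: {e}"),
    }
}
```

The caller decides what to do with the failure, so only the offending pipeline is affected rather than the whole process.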

@BharatKJain BharatKJain force-pushed the fix-gelf-post-processor branch 2 times, most recently from 8f0422a to c340296 Compare October 18, 2024 11:43
@BharatKJain
Contributor Author

@Licenser @darach Does it look good now?

(I have fixed the DCO, format, clippy-check, code quality checks)

@darach
Member

darach commented Oct 21, 2024

@BharatKJain LGTM now. I think you just need to click on resolve conversation for the open review comment and we're all good!

Member

@Licenser Licenser left a comment


<3

Member

@darach darach left a comment


LGTM 🚀

@Licenser Licenser merged commit db9a434 into tremor-rs:main Oct 21, 2024
57 checks passed