
Fix: message-id in postprocessor/gelf-chunking #2662

Merged
8 commits merged into tremor-rs:main from fix-gelf-post-processor on Oct 21, 2024

Conversation

BharatKJain
Contributor

Pull request

Description

Changed message-id from auto-increment ID to randomized ID in postprocessor/gelf_chunking.rs

HELP NEEDED: I have avoided adding the hostname while producing the message-id; I am not sure how we can handle adding the hostname, please suggest.

Related

Checklist

  • The RFC, if required, has been submitted and approved
  • Any user-facing impact of the changes is reflected in docs.tremor.rs
  • The code is tested
  • Use of unsafe code is reasoned about in a comment
  • Update CHANGELOG.md appropriately, recording any changes, bug fixes, or other observable changes in behavior
  • The performance impact of the change is measured (see below)

Performance


codecov bot commented Sep 29, 2024

Codecov Report

Attention: Patch coverage is 97.54098% with 3 lines in your changes missing coverage. Please review.

Project coverage is 91.34%. Comparing base (f1f2b7f) to head (116a48f).
Report is 8 commits behind head on main.

Files with missing lines | Patch % | Lines
...mor-interceptor/src/postprocessor/gelf_chunking.rs | 97.54% | 3 Missing ⚠️

@@            Coverage Diff             @@
##             main    #2662      +/-   ##
==========================================
+ Coverage   91.27%   91.34%   +0.07%     
==========================================
  Files         308      308              
  Lines       60116    60229     +113     
==========================================
+ Hits        54868    55016     +148     
+ Misses       5248     5213      -35     
Flag Coverage Δ
e2e-command 11.27% <0.00%> (-0.02%) ⬇️
e2e-integration 50.52% <0.00%> (+0.09%) ⬆️
e2e-unit 12.56% <0.00%> (-0.02%) ⬇️
e2etests 52.85% <0.00%> (+0.09%) ⬆️
tremorapi 14.50% <0.00%> (-0.03%) ⬇️
tremorcodec 63.11% <ø> (ø)
tremorcommon 63.04% <ø> (ø)
tremorconnectors 28.81% <0.00%> (-0.05%) ⬇️
tremorconnectorsaws 11.25% <0.00%> (-0.03%) ⬇️
tremorconnectorsazure 4.68% <0.00%> (-0.02%) ⬇️
tremorconnectorsgcp 25.36% <0.00%> (+0.05%) ⬆️
tremorconnectorsobjectstorage 0.06% <ø> (ø)
tremorconnectorsotel 12.55% <0.00%> (-0.03%) ⬇️
tremorconnectorstesthelpers 68.25% <ø> (ø)
tremorinflux 87.71% <ø> (ø)
tremorinterceptor 55.34% <97.54%> (+0.98%) ⬆️
tremorpipeline 31.15% <ø> (ø)
tremorruntime 47.20% <0.00%> (-0.05%) ⬇️
tremorscript 55.06% <ø> (ø)
tremorsystem 5.78% <ø> (ø)
tremorvalue 69.52% <ø> (-0.04%) ⬇️
unittests 89.20% <97.54%> (+0.07%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines | Coverage Δ
...mor-interceptor/src/postprocessor/gelf_chunking.rs | 96.00% <97.54%> (+2.89%) ⬆️

... and 10 files with indirect coverage changes



Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f1f2b7f...116a48f.

Licenser previously approved these changes Sep 29, 2024
Member

@Licenser Licenser left a comment


👍 Looks reasonable; nothing I can see that prevents using this, and great documentation too :)

Two things I notice:

  1. it would be nice to mention how message IDs are generated in the docs (the `//!` section)

  2. if you are up for a challenge: we try to keep all time and randomness out of tremor to allow for deterministic replays. We do this by using ingest_ns as a random seed and as the time source; that way, an event logged with its ingest_ns can be replayed and will generate the exact same output. The random function is a good example. It'd be interesting to see the same concept reused here to allow repeatable yet still random message IDs. One way would be to use ingest_ns instead of the current epoch (which would be nice anyway, as looking up the time isn't fast), and then seed the RNG somehow (probably not with the ingest_ns, as that would make it useless) but perhaps with the first n bytes of the message, or with some bytes of a hash of the message?
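To make point 2 concrete, here is a toy sketch of deriving a deterministic id from ingest_ns plus a hash of the message bytes. The function names and the FNV-1a mixing are my own illustration, not tremor's actual code:

```rust
// Toy sketch of deterministic "randomness": derive the seed from the
// message bytes so a replay of the same event yields the same id.
// All names here are illustrative, not from the tremor codebase.

fn seed_from_message(msg: &[u8]) -> u64 {
    // simple FNV-1a hash over the first 8 bytes (or fewer) of the message
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in msg.iter().take(8) {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

fn message_id(ingest_ns: u64, msg: &[u8]) -> u64 {
    // mix the replayable timestamp with the message-derived seed
    ingest_ns ^ seed_from_message(msg)
}

fn main() {
    let msg = b"hello gelf";
    // same event replayed -> same id
    assert_eq!(message_id(123, msg), message_id(123, msg));
    println!("deterministic id: {}", message_id(123, msg));
}
```

As the thread below discusses, a pure content hash risks collisions for repeated messages, which is why the mix with ingest_ns (or a counter) matters.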

@BharatKJain
Contributor Author

  1. Okay, will do.

  2. Trying to think out loud: if we create a hash of a message, repetitive logs will produce the same message-id, which is a problem because message-ids have to be unique; collisions can cause problems when we're decoding the message on the server side. I am not completely sure, but ideally we would want uniqueness in the message-id to make sure that we are not breaking the server's GELF decoding.

(How would it break server-side due to a collision? The message-id is how the server determines whether a UDP packet belongs to an already existing log or to a new one; when we send the same message-id for multiple logs, the server will merge the data together, which ends up breaking the log.)

TBH I am also trying to figure this out, so please share any suggestions. Am I thinking about this right? 😅

@Licenser
Member

Licenser commented Oct 3, 2024

Ja, just the message content would not work. I'm still considering whether message content + ingest_ns (the nanosecond when the message was registered at tremor) would be enough: if a server produces the same log twice in the same nanosecond, that'd be very odd (but not impossible). OTOH, having two randomly generated numbers come out the same is also odd (but not impossible). It would also give a more deterministic failure case: "when messages with the same content arrive at exactly the same time, they will get duplicated message ids" instead of "if the RNG hates you, you'll get duplicated message ids".

@Licenser
Member

Licenser commented Oct 5, 2024

Sorry for all the back and forth. I've been thinking about a good way to make this:

  1. fast
  2. deterministic
  3. non-colliding

and wanted to throw out a suggestion: use ingest_ns + an incremental id as the message ID. This is:

  1. fast, since we just have to look at two integers: no RNG, no hashing
  2. deterministic, since ingest_ns is settable and the incremental counter is, well, incremental, so deterministic as well
  3. it avoids duplicates on the same system thanks to the incremental ids, and makes them extremely unlikely across multiple systems thanks to the nanosecond timestamps involved.
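A minimal sketch of that suggestion, with illustrative names (the exact bit packing was worked out later in the thread):

```rust
// Toy sketch of the ingest_ns + incremental-counter idea. `IdGen` and the
// mixing below are illustrative, not the actual tremor implementation.

struct IdGen {
    auto_increment_id: u64,
}

impl IdGen {
    fn new() -> Self {
        IdGen { auto_increment_id: 0 }
    }

    // Pure function of (ingest_ns, counter state): replaying the same event
    // stream reproduces the same ids, while the counter guarantees distinct
    // ids even within a single nanosecond on the same system.
    fn next(&mut self, ingest_ns: u64) -> u64 {
        let id = ingest_ns ^ self.auto_increment_id.rotate_left(32);
        self.auto_increment_id = self.auto_increment_id.wrapping_add(1);
        id
    }
}

fn main() {
    let mut gen = IdGen::new();
    // same nanosecond, different counter values -> different ids
    assert_ne!(gen.next(42), gen.next(42));
    // replay with a fresh generator -> identical first id
    assert_eq!(IdGen::new().next(42), 42);
    println!("ok");
}
```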

@BharatKJain
Contributor Author

> (quoting the three-point suggestion above)

Can I keep thread-id just to reduce the collision probability more? 😅

@BharatKJain BharatKJain force-pushed the fix-gelf-post-processor branch 4 times, most recently from 7870889 to 026f39c Compare October 5, 2024 19:02
@BharatKJain
Contributor Author

BharatKJain commented Oct 5, 2024

Okay, I have made the changes as suggested in the latest comment.

`(epoch_timestamp & BITS_13) | (auto_increment_id & !BITS_13) | (thread_id_u64 & BITS_13)`

(epoch_timestamp is ingest_ns when process() is called; in finish() there's no ingest_ns, so I am using the current timestamp.)

  • Added a unit-test for it.
  • Renamed id to auto_increment_id

Let me know your thoughts!

@Licenser
Member

Licenser commented Oct 6, 2024

I would remove the thread ID. It breaks the distribution and makes the data less unique, and it also prevents event ids from being replayable. Basically, it will result in the lower 13 bits having three times as many 1s as 0s, making them more likely to collide, while also making the id non-deterministic.

The second problem I spot is that by masking away the lower 13 bits of the increment, that part will have no effect for the first 8192 messages and then only change every 8192 messages. My suggestion would be to shift it by 13 bits instead of truncating them.

Lastly, I'd probably pull in a few more bits from the timestamp: 13 bits are only 0.0000082 secs, and even if you bump it to 16, you get a window of 0.000066 secs plus a 48-bit counter (or 281,474,976,710,656 events to count).

You'd end up with something like this:

```rust
(epoch_timestamp & 0xFF_FF) | (auto_increment_id << 16)
```

Do you think that would solve the original problem?
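A runnable sketch of that layout, assuming the low 16 bits carry the timestamp and the counter sits above them (names are illustrative, not the merged code):

```rust
// Sketch of the proposed id layout: low 16 bits come from the timestamp,
// the counter is shifted above them, so the counter changes the id on
// every message instead of only every 8192 messages (the problem with
// masking the counter's low bits away).

fn message_id(epoch_timestamp: u64, auto_increment_id: u64) -> u64 {
    (epoch_timestamp & 0xFF_FF) | (auto_increment_id << 16)
}

fn main() {
    let t = 1_700_000_000_000_000_123;
    // consecutive messages in the same nanosecond still differ:
    assert_ne!(message_id(t, 0), message_id(t, 1));
    // the low 16 bits always reflect the timestamp:
    assert_eq!(message_id(t, 5) & 0xFF_FF, t & 0xFF_FF);
    println!("ok");
}
```

Note the trade-off: shifting (rather than masking) the counter means every message changes the id, while the timestamp bits disambiguate across restarts and hosts.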

Licenser previously approved these changes Oct 14, 2024
Member

@Licenser Licenser left a comment


wooh wooh

@BharatKJain
Contributor Author

BharatKJain commented Oct 16, 2024

I will fix the clippy checks and DCO, please allow me some time.


Honestly, I didn't anticipate this, but I also created an otel gelf-exporter. :)

Opentelemetry PR for gelf-exporter (WIP)

@BharatKJain BharatKJain force-pushed the fix-gelf-post-processor branch 3 times, most recently from d63e383 to b42443d Compare October 16, 2024 19:41
@BharatKJain
Contributor Author

Line 190:

`let current_epoch_timestamp = u64::try_from(SystemTime::now().duration_since(UNIX_EPOCH).expect("SystemTime before UNIX EPOCH!").as_nanos())?;`

Not sure how to handle this better, is this okay?

@Licenser
Member

We'd want to avoid the expect. Let me explain why.

expect and unwrap cause a crash. In some applications that's okay: say you have a CLI; if something goes wrong, you might just want to error out and restart from scratch. Tremor is a bit more complex, though. There might be more than one pipeline running, and we don't want one pipeline to affect the progress of another; if we crashed in pipeline A, we'd take pipeline B down with it, and that's not desirable. So we handle errors and let them bubble up, so that only the pipeline causing the issue is affected.

To do that there are a few ways: `?` works in many places, but if the error type can't be mapped, something like `map_err(...)?` is a good alternative.
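A hedged sketch of that pattern, using `Box<dyn Error>` as a stand-in for tremor's actual error type:

```rust
// Sketch: bubble the error up with `?` instead of crashing with `expect`.
// The error type here is a placeholder, not tremor's real one.
use std::time::{SystemTime, UNIX_EPOCH};

fn current_epoch_timestamp() -> Result<u64, Box<dyn std::error::Error>> {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        // SystemTimeError doesn't auto-convert here, so map it explicitly
        .map_err(|e| format!("system time before UNIX epoch: {e}"))?
        .as_nanos();
    // u128 -> u64 can also fail; `?` converts the TryFromIntError for us
    Ok(u64::try_from(nanos)?)
}

fn main() {
    match current_epoch_timestamp() {
        Ok(ts) => println!("timestamp: {ts}"),
        Err(e) => eprintln!("error: {e}"),
    }
}
```

The caller decides what to do with the failure, so only the offending pipeline is affected rather than the whole process.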

@BharatKJain BharatKJain force-pushed the fix-gelf-post-processor branch 2 times, most recently from 8f0422a to c340296 Compare October 18, 2024 11:43
@BharatKJain
Contributor Author

@Licenser @darach Does it look good now?

(I have fixed the DCO, format, clippy-check, code quality checks)

@darach
Member

darach commented Oct 21, 2024

@BharatKJain LGTM now. I think you just need to click on resolve conversation for the open review comment and we're all good!

Member

@Licenser Licenser left a comment


<3

Member

@darach darach left a comment


LGTM 🚀

@Licenser Licenser merged commit db9a434 into tremor-rs:main Oct 21, 2024
57 checks passed