Optimized record size for concatenable events like JSON #57

Open · jeisinge opened this issue Nov 10, 2016 · 6 comments

Comments


jeisinge commented Nov 10, 2016

Background

AWS Kinesis Firehose has a pricing policy that is great for large records. However, for small records of around 1 KB, the customer ends up paying 5x the price for the same amount of data, because Firehose rounds each record up to the nearest 5 KB.

Further, certain records can be concatenated together and still be processed separately downstream. For example, many downstream big data products like Redshift, Hadoop, and Spark support multiple JSON documents per line.

So, the following JSON events are equivalent in the above products:

{"event_id": 123, "doc_id": 9372}
{"event_id": 124, "doc_id": 7642}
{"event_id": 125, "doc_id": 1298}

and

{"event_id": 123, "doc_id": 9372}{"event_id": 124, "doc_id": 7642}{"event_id": 125, "doc_id": 1298}

Solution

It would be great if the Kinesis Agent automatically combined events that can be combined into records that are as large as possible.

Other Similar Items

@innonagle

👍 The workaround given for #5 would be, at best, hard to implement for, say, Apache access logs.

In addition, compressing the data prior to sending it to Kinesis Firehose would also be appreciated. I have seen clickstream data in JSON format compress by up to 94%. Leaving it uncompressed causes Firehose to write files to S3 much more frequently than is really necessary. While compressing in the agent before sending to the delivery stream is not as efficient overall, it does allow producers to scale even more.


innonagle commented Dec 8, 2016

@jeisinge I've forked this repo (https://github.com/lennynyktyk/amazon-kinesis-agent) and added the ability to put multiple rows from a file into one Firehose record. In the flow definition where you specify filePattern and DeliveryStreamName, specify maxDataBytes. You can set it to any integer value; multiples of 5120 work best for Firehose.
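
For anyone trying the fork, a minimal agent.json flow entry might look like the sketch below. The file path and delivery stream name are placeholders, the key names assume the standard agent flow settings (filePattern, deliveryStream), and maxDataBytes is the setting added by the fork, not part of the stock agent:

{
  "flows": [
    {
      "filePattern": "/var/log/myapp/events.log*",
      "deliveryStream": "my-delivery-stream",
      "maxDataBytes": 5120
    }
  ]
}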

I've been running this patch in a production environment for 2 weeks, and it has reduced our daily Firehose costs by over 80%.

Installation

I didn't have time to create an RPM. The approach I've taken is to build a jar with Maven and Ant, install the original RPM via yum, and then overwrite the installed jar with the patched one.

Java is not my forte, so I am unsure how to run the test suite.


jeisinge commented Dec 8, 2016

@lennynyktyk , fantastic! I'll take a look.

After more testing, it seems I misstated downstream compatibility. Multiple records per line do not work for Spark. So, we have gone to an external process that adds a tab (\t) character after X bytes, combined with the multiLineStartPattern option set to "\\t". Obviously, this is not ideal because it adds complexity and reduces our ability to stream data to Kinesis. So, I am very much looking forward to your fork!
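
For reference, a flow entry for this workaround might look roughly like the following sketch; the file path and delivery stream name are placeholders, the key names assume the standard agent flow settings, and multiLineStartPattern matches the injected tab:

{
  "flows": [
    {
      "filePattern": "/var/log/myapp/events.log*",
      "deliveryStream": "my-delivery-stream",
      "multiLineStartPattern": "\\t"
    }
  ]
}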

@innonagle

@jeisinge FYI, I noticed an issue today with files whose size never exceeds the maxDataBytes parameter.

If the file size never exceeds maxDataBytes, that data never generates a record. This may be a larger problem with the Kinesis Agent. I'm not sure when/if I'll have time to fix it, but I'm just putting it out there.

@chaochenq

@lennynyktyk I'm looking at the PR (#60) now, sorry about the delay! Did you get a chance to fix the issue you mentioned above?

@innonagle

@chaochenq I don't think this is an issue with my PR but rather with how the Kinesis Agent handles data which does not contain the delimiter. E.g., in a normal Kinesis Agent configuration using the normal line parser, if there is no newline character in the file, the contents of the file will never be sent to Kinesis. This is understandable because the agent is looking for the delimiter under the assumption that there is more data to be added to the record.

There are ways of fixing this, though I believe they are out of the scope of #60. For example, the agent could keep the last record fragment in a buffer and then, when the file under observation is rotated, submit the fragment to Kinesis.
