Optimized record size for concatenable events like JSON #57
👍 The workaround given for #5 would be, at best, hard to implement for, say, Apache access logs. In addition, compressing the data prior to sending it to Kinesis Firehose would also be appreciated. I have seen click-stream data in JSON format that compresses by up to 94%. The uncompressed volume causes Firehose to write files to S3 much more frequently than is really necessary. While compressing in the agent before sending to the delivery stream is not as efficient overall, it does allow producers to scale even further.
@jeisinge I've forked this repo (https://github.com/lennynyktyk/amazon-kinesis-agent) and added the ability to put multiple rows from a file into one Firehose record. The behavior is enabled via a setting in the flow definition. I've been running this patch in a production environment for two weeks, and it has reduced our daily Firehose costs by over 80%.

Installation

I didn't have time to create an RPM. The approach I've taken is to build a jar with Maven and Ant, install the original RPM via yum, and then overwrite the installed jar with the patched one. Java is not my forté, so I am unsure how to run the test suite.
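For readers unfamiliar with the agent's configuration, here is a rough sketch of what such a flow definition might look like in `/etc/aws-kinesis/agent.json`. The `aggregatedRecordSizeBytes` key is hypothetical, standing in for whatever option the fork actually introduces; `filePattern` and `deliveryStream` are the agent's standard flow settings. The `//` comment is for annotation only — the real agent config is plain JSON and does not permit comments:

```
{
  "flows": [
    {
      "filePattern": "/var/log/app/events.json*",
      "deliveryStream": "my-delivery-stream",
      // hypothetical option: pack multiple newline-delimited rows
      // into one Firehose record up to roughly this many bytes
      "aggregatedRecordSizeBytes": 512000
    }
  ]
}
```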
@lennynyktyk , fantastic! I'll take a look. After more testing, it seems I misstated compatibility down stream. Multiple records per line do not work for Spark. So, we have gone to an external process of adding a tab ( |
@jeisinge FYI, I noticed an issue today with files whose size never exceeds the configured aggregated record size. If the file never exceeds that size, its contents are never flushed to the delivery stream.
@lennynyktyk I'm looking at the PR (#60) now; sorry about the delay! Did you get a chance to fix the issue you mentioned above?
@chaochenq I don't think this is an issue with my PR, but rather with how the Kinesis Agent handles lines that do not contain the delimiter. E.g., in a normal Kinesis Agent configuration using the standard line parser, if there is no newline character in the file, the contents of the file will never be sent to Kinesis. This is understandable: the agent is looking for the delimiter and operating under the assumption that more data will be appended to the record. There are ways of fixing this, though I believe they are out of the scope of #60. For example, keep the last record fragment in a buffer, and when the file under observation is rotated, have the agent submit the fragment to Kinesis.
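A minimal sketch of that buffer-and-flush-on-rotation idea; none of these names come from the agent's actual codebase, and the hooks into the file tailer are assumed:

```java
// Hypothetical sketch: buffer the trailing fragment that has no
// delimiter yet, and flush it only when the source file rotates.
public final class TrailingFragmentBuffer {
    private final StringBuilder fragment = new StringBuilder();

    // Called with each chunk read past the last delimiter in the file.
    public void append(String chunk) {
        fragment.append(chunk);
    }

    // Called when the agent detects that the file has been rotated;
    // whatever remains is a complete record and can be shipped.
    public String flushOnRotation() {
        String record = fragment.toString();
        fragment.setLength(0);
        return record;
    }
}
```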
Background
AWS Kinesis Firehose has a pricing policy that is great for large records. However, for small records of around 1 KB, the customer ends up paying 5x the price for the same amount of data, because Firehose rounds each record up to the nearest 5 KB for billing. For example, 1,000 records of 1 KB each (1 MB of actual data) are billed as 1,000 × 5 KB = 5 MB.
Further, certain records can be concatenated together and still be processed separately downstream. For example, many downstream big-data products like Redshift, Hadoop, and Spark support multiple JSON documents per line.
So, JSON events like the following are equivalent in the above products, whether written one per line or concatenated onto a single line.
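For illustration, assume two simple events (these stand-ins are not from the original issue). One document per line:

```
{"type": "click", "ts": 1}
{"type": "view", "ts": 2}
```

and concatenated into a single line:

```
{"type": "click", "ts": 1}{"type": "view", "ts": 2}
```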
Solution
It would be great if the Kinesis Agent automatically combined such concatenable events into records that are as large as possible.
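A minimal sketch of the aggregation this requests, assuming a hypothetical helper outside the agent; all names are illustrative, and the 1,000 KiB ceiling reflects Firehose's per-record size limit before base64 encoding:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: greedily pack JSON events into buffers no larger
// than the Firehose per-record limit, concatenating them back to back.
public final class EventPacker {
    private static final int MAX_RECORD_BYTES = 1_000 * 1024; // Firehose limit

    public static List<byte[]> pack(List<String> events) {
        List<byte[]> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int currentBytes = 0;
        for (String event : events) {
            int eventBytes = event.getBytes(StandardCharsets.UTF_8).length;
            // Flush the buffer if adding this event would exceed the limit.
            if (currentBytes > 0 && currentBytes + eventBytes > MAX_RECORD_BYTES) {
                records.add(current.toString().getBytes(StandardCharsets.UTF_8));
                current.setLength(0);
                currentBytes = 0;
            }
            current.append(event); // concatenated documents, no separator
            currentBytes += eventBytes;
        }
        if (currentBytes > 0) {
            records.add(current.toString().getBytes(StandardCharsets.UTF_8));
        }
        return records;
    }
}
```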
Other Similar Items