support for parquet compression #221
The s3 plugin's format and compression are pluggable.
Looks like there is a plugin that supports Parquet:
Thanks for sharing. Is there someone who uses that plugin (fluent-plugin-arrow) somewhere? I haven't checked the code yet, so thanks for any feedback.
@freelancerlucas did you end up trying the https://github.com/joker1007/fluent-plugin-arrow plugin? I also want to have Parquet events in S3 and am looking at how I can achieve that.
@dobesv have you tried the plugin? I just found an issue: tl;dr it doesn't work with the current version of
I tried to get it working, but it seems to crash if I give it data that doesn't match the schema. It's still a work in progress for me, but now I'm thinking I'll output JSON and run another process to periodically convert the JSON to Parquet instead.
Thanks for the input, everyone; I just started working on this topic today. I will let you know about the progress, but I was also thinking about running another process to convert JSON to Parquet.
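The "write JSON now, convert to Parquet later" approach discussed above benefits from validating records against the target schema before conversion, since a strict converter (like the arrow plugin mentioned earlier) can crash on mismatched data. A minimal stdlib sketch of that pre-filtering step, assuming a made-up schema and field names for illustration (the actual Parquet write, e.g. via pyarrow, is left out):

```python
import json

# Hypothetical target schema: field name -> expected Python type.
SCHEMA = {"host": str, "status": int, "bytes": int}

def matches_schema(record, schema):
    """Return True if the record has every schema field with the right type."""
    return all(
        field in record and isinstance(record[field], expected)
        for field, expected in schema.items()
    )

def filter_json_lines(lines, schema):
    """Parse JSON lines, splitting records into schema-conforming and rejected."""
    good, bad = [], []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            bad.append(line)
            continue
        (good if matches_schema(record, schema) else bad).append(record)
    return good, bad

lines = [
    '{"host": "a", "status": 200, "bytes": 123}',
    '{"host": "b", "status": "NAN", "bytes": 0}',   # wrong type: rejected
]
good, bad = filter_json_lines(lines, SCHEMA)
print(len(good), len(bad))  # 1 1
```

Only the `good` batch would then be handed to the periodic JSON-to-Parquet converter; rejected records can be logged or routed elsewhere instead of crashing the pipeline.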
We are developing a compressor for the Parquet format. Which do you prefer: a PR or a separate gem?
Awesome! @okkez, could you update this issue when you have something?
Parquet is now popular and it doesn't depend on external gems.
@repeatedly what do you mean by "it doesn't depend on external gems"?
OK, I will create a new PR to add a Parquet compressor after we test it in our environment.
@okkez any progress to report on this PR? Parquet formatting would be very useful, so I'd be happy to contribute if there is still work to be done.
We would GREATLY appreciate Parquet support for fluent. |
@shevisjohnson @realknorke Could you try #338?
Sorry, not feasible for us, because columnify is not mature enough. For a test run I tried to convert a simple TSV file to Parquet (all columns in the schema set to type "string"). The process failed because in some columns the value is NAN or INF (valid locodes for regions in France). Also, the columnify tool eats up a lot of memory because the columnization is done in memory. This may not be a problem in most cases, but it's a huge problem as part of a (relatively) small fluent logging container writing a lot of data per day. You don't want large (as in 4+ GB) peaks in memory consumption (CPU peaks may be okay to some degree).
@realknorke Thank you for testing columnify. Please create a new issue on https://github.com/reproio/columnify/issues if you run into problems using columnify.
Columnify v0.1.0 has been released.
If you want some examples, the repository https://github.com/abicky/docker-log-and-fluent-plugin-s3-with-columnify-example might be helpful.
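For anyone wanting to try columnify standalone before wiring it into fluentd, its CLI converts JSON lines to Parquet using an Avro schema. A sketch of the invocation (flag names are from the columnify README at the time of writing; verify against `columnify -h`, and the schema/input file names here are placeholders):

```
# Convert newline-delimited JSON records to Parquet,
# using an Avro schema to define the column types.
columnify -schemaType avro -schemaFile schema.avsc -recordType jsonl records.jsonl > records.parquet
```

Note that, as reported above, the conversion is done in memory, so input size should be kept bounded (e.g. by fluentd's chunk size) to avoid large memory peaks.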
@okkez sorry, it took me some time to get priority on this issue.
It seems such a plugin could also act as a kind of schema validator, allowing only "good" messages through to S3, because in our case Athena breaks later if the schema has changed.
FYI, fluent-plugin-s3 1.6.0, including the changes from #338, has been released.
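With that release, Parquet output is configured via `store_as parquet` plus a `<compress>` section that points at the schema columnify should use. A hedged sketch of the relevant fluentd config (bucket name, paths, and schema file are placeholders; parameter names should be checked against the fluent-plugin-s3 README for your version):

```
<match logs.**>
  @type s3
  s3_bucket my-bucket        # placeholder bucket name
  s3_region us-east-1
  store_as parquet           # delegates conversion to columnify
  <compress>
    schema_type avro
    schema_file /path/to/schema.avsc
  </compress>
  <buffer time>
    @type file
    path /var/log/fluent/s3
    timekey 3600             # flush (and columnify) once per hour
  </buffer>
</match>
```

Note that this requires the `columnify` binary to be installed and on the PATH of the fluentd process.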
Thanks for the notification.
We're looking into using fluentd to write logs to S3, where they'll be read by AWS Redshift Spectrum. Spectrum charges by read throughput, but it also supports the columnar compression format of Parquet, which can greatly reduce the amount of data you have to scan through. It would be great if the s3 plugin could support this compression format.