
support for parquet compression #221

Closed · mattyb opened this issue Jan 18, 2018 · 23 comments
@mattyb commented Jan 18, 2018

We're looking into using fluentd to write logs to S3, where they'll be read by AWS Redshift Spectrum. Spectrum charges by read throughput, but it also supports the columnar Parquet format, which can greatly reduce the amount of data you have to scan. It would be great if the s3 plugin could support this format.

@repeatedly (Member)

The s3 plugin's format and compression are pluggable, so if you write a plugin, you can use the Parquet format in the s3 output. But I'm not sure how Parquet would handle event streams...
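
For illustration, a custom compressor registers itself with the s3 output and implements ext, content_type, and compress. This is a minimal sketch assuming the register_compressor API that the built-in compressors (lzo, lzma2, gzip_command) follow; the "parquet" key, the class name, and the conversion helper are hypothetical placeholders, not an actual implementation:

```ruby
require "fluent/plugin/out_s3"

module Fluent::Plugin
  class S3Output
    # Hypothetical sketch of a pluggable compressor; the built-in
    # compressors follow this same shape.
    class ParquetCompressor < Compressor
      S3Output.register_compressor("parquet", self)

      # File extension appended to the uploaded object key.
      def ext
        "parquet".freeze
      end

      def content_type
        "application/octet-stream".freeze
      end

      # chunk is the buffered event stream; tmp is the file whose
      # contents get uploaded to S3.
      def compress(chunk, tmp)
        convert_to_parquet(chunk, tmp) # hypothetical helper, not part of the plugin
      end
    end
  end
end
```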

@dobesv commented Oct 22, 2019

Looks like there is a plugin that supports Parquet:

https://github.com/joker1007/fluent-plugin-arrow

@lsudo commented Oct 30, 2019

Thanks for sharing. Is anyone using that plugin (fluent-plugin-arrow) somewhere? I haven't checked the code yet, so thanks in advance for any feedback.

@eredi93 commented Feb 14, 2020

@freelancerlucas did you end up trying the https://github.com/joker1007/fluent-plugin-arrow plugin?

I also want to have Parquet events in S3 and am looking at how I can achieve that.

@eredi93 commented Feb 20, 2020

@dobesv have you tried the plugin? I just found an issue: tl;dr it doesn't work with the current version of td-agent.

@dobesv commented Feb 21, 2020

I tried to get it working, but it seems to crash if I give it data that doesn't match the schema. It's still a work in progress for me, but now I'm thinking I'll output JSON and run another process to periodically convert the JSON to Parquet instead.
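
For reference, that side conversion could look roughly like this with the red-arrow and red-parquet gems — a minimal sketch, assuming red-arrow's JSON reader handles the newline-delimited JSON that fluentd writes; the file names are placeholders:

```ruby
# gem install red-arrow red-parquet
require "arrow"
require "parquet"

# Load newline-delimited JSON logs (assumes red-arrow's JSON reader;
# "logs.jsonl" is a placeholder path).
table = Arrow::Table.load("logs.jsonl", format: :json)

# red-parquet extends Arrow::Table#save to write Parquet based on the
# output file's extension.
table.save("logs.parquet")
```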

@lsudo commented Mar 10, 2020

Thanks for the input, everyone; I just started working on this topic today. I'll let you know about my progress, but I was also thinking about running a separate process to convert JSON to Parquet.

@okkez (Contributor) commented May 11, 2020

We are developing a compressor for the Parquet format. We can either create a PR against this plugin or publish the compressor as a new gem. The compressor depends on an external command we are also developing.

Which would you prefer, a PR or a gem?

@eredi93 commented May 11, 2020

Awesome @okkez, could you update this issue when you have something? Also happy to help if you need a hand.

@repeatedly (Member)

Parquet is popular now, and this doesn't depend on external gems, so supporting it in the s3 plugin seems better. Could you send a PR?

@eredi93 commented May 12, 2020

@repeatedly what do you mean, it doesn't depend on external gems? It would require https://github.com/apache/arrow/tree/master/ruby/red-parquet, no?

@okkez (Contributor) commented May 30, 2020

OK, I will create a new PR to add the Parquet compressor after we test it in our environment.

@shevisj commented Jun 16, 2020

@okkez any progress to report on this PR? Parquet formatting would be very useful, so I'd be happy to contribute if there is still work to be done.

@realknorke

We would GREATLY appreciate Parquet support in fluentd.

@okkez (Contributor) commented Jul 3, 2020

@shevisjohnson @realknorke Could you try #338?

@realknorke commented Jul 3, 2020

> @shevisjohnson @realknorke Could you try #338?

Sorry, that's not feasible for us because columnify is not mature enough. For a test run I tried to convert a simple TSV file to Parquet (all columns in the schema set to type "string"). The process failed because some columns contain the value NAN or INF (valid locodes for regions in France). Also, columnify eats up a lot of memory because the columnization is done in memory. That may not be a problem in most cases, but it's a huge problem inside a (relatively) small fluentd logging container that writes a lot of data per day. You don't want large (4+ GB) peaks in memory consumption (CPU peaks may be okay to some degree).

@okkez (Contributor) commented Aug 13, 2020

@realknorke thank you for testing columnify. We are trying to improve its memory usage in reproio/columnify#52.

Please create a new issue at https://github.com/reproio/columnify/issues if you run into problems using columnify.

@okkez (Contributor) commented Aug 24, 2020

Columnify v0.1.0 has been released:
https://github.com/reproio/columnify/blob/v0.1.0/CHANGELOG.md

> Implement stream-based input record decoders to reduce memory consumption.
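
For reference, a typical columnify invocation looks roughly like this — a sketch based on the columnify README at the time (flag names may differ between versions, and the schema and file names are placeholders; check columnify -h for the authoritative list):

```sh
# Convert newline-delimited JSON to Parquet using an Avro schema
# (logs.avsc and logs.jsonl are placeholder file names).
columnify -schemaType avro -schemaFile logs.avsc -recordType jsonl logs.jsonl > logs.parquet
```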

@abicky commented Sep 11, 2020

If you want some examples, the repository https://github.com/abicky/docker-log-and-fluent-plugin-s3-with-columnify-example might be helpful.

@realknorke

@okkez sorry, it took me some time to get priority for this issue.
I successfully integrated columnify as a compression flavor for s3 in fluentd (basically your Ruby example).
I'm going to test columnify with fluentd in more depth, under high load, starting Monday. :)
It looks very promising so far; I'll keep you informed.

@123BLiN commented Mar 31, 2021

It seems such a plugin could also act as a kind of schema validator, only allowing "good" messages into S3; in our case Athena breaks later if the schema has changed.

@abicky commented May 12, 2021

FYI, fluent-plugin-s3 1.6.0, which includes the changes from #338, has been released.
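
For anyone landing here later, the new compressor is selected with store_as parquet and shells out to the columnify command — a rough config sketch, assuming the <compress> options described in the fluent-plugin-s3 README; the match pattern, bucket name, and schema path are placeholders:

```
<match pattern>
  @type s3
  s3_bucket my-log-bucket          # placeholder
  store_as parquet                 # requires the columnify command on PATH
  <compress>
    schema_type avro               # schema format understood by columnify
    schema_file /path/to/schema.avsc
  </compress>
  <format>
    @type json                     # records serialized as JSON, then columnified
  </format>
</match>
```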

@ashie (Member) commented May 12, 2021

Thanks for the notification. Yes, it's already released, so I'm closing this issue.

@ashie closed this as completed May 12, 2021