
support for parquet compression #221

Closed · mattyb opened this issue Jan 18, 2018 · 23 comments
@mattyb commented Jan 18, 2018

We're looking into using fluentd to write logs to S3, where they'll be read by AWS Redshift Spectrum. Spectrum charges by read throughput, but it also supports the columnar Parquet format, which can greatly reduce the amount of data you have to scan. It would be great if the s3 plugin could support this format.

@repeatedly (Member)

The s3 plugin's format and compression are pluggable, so if you write a plugin, you can use the Parquet format in the s3 output. But I'm not sure how Parquet would handle event streams...
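
For illustration, a custom compressor registers itself with the s3 output and implements ext, content_type, and compress. This is a minimal sketch assuming the register_compressor API that the built-in compressors (lzo, lzma2, gzip_command) follow; the "parquet" key, the class name, and the conversion helper are hypothetical placeholders, not an actual implementation:

```ruby
require "fluent/plugin/out_s3"

module Fluent::Plugin
  class S3Output
    # Hypothetical sketch of a pluggable compressor; the built-in
    # compressors follow this same shape.
    class ParquetCompressor < Compressor
      S3Output.register_compressor("parquet", self)

      # File extension appended to the uploaded object key.
      def ext
        "parquet".freeze
      end

      def content_type
        "application/octet-stream".freeze
      end

      # chunk is the buffered event stream; tmp is the file whose
      # contents get uploaded to S3.
      def compress(chunk, tmp)
        convert_to_parquet(chunk, tmp) # hypothetical helper, not part of the plugin
      end
    end
  end
end
```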

@dobesv commented Oct 22, 2019

Looks like there is a plugin that supports Parquet:

https://github.com/joker1007/fluent-plugin-arrow

@lsudo commented Oct 30, 2019

Thanks for sharing. Is anyone using that plugin (fluent-plugin-arrow) somewhere? I haven't checked the code yet, so thanks in advance for any feedback.

@eredi93 commented Feb 14, 2020

@freelancerlucas did you end up trying the https://github.com/joker1007/fluent-plugin-arrow plugin?

I also want to have Parquet events in S3 and am looking at how I can achieve that.

@eredi93 commented Feb 20, 2020

@dobesv have you tried the plugin? I just found an issue: tl;dr it doesn't work with the current version of td-agent.

@dobesv commented Feb 21, 2020

I tried to get it working, but it seems to crash if I give it data that doesn't match the schema. It's still a work in progress for me, but now I'm thinking I'll output JSON and run another process to periodically convert the JSON to Parquet instead.
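
For reference, that side conversion could look roughly like this with the red-arrow and red-parquet gems — a minimal sketch, assuming red-arrow's JSON reader handles the newline-delimited JSON that fluentd writes; the file names are placeholders:

```ruby
# gem install red-arrow red-parquet
require "arrow"
require "parquet"

# Load newline-delimited JSON logs (assumes red-arrow's JSON reader;
# "logs.jsonl" is a placeholder path).
table = Arrow::Table.load("logs.jsonl", format: :json)

# red-parquet extends Arrow::Table#save to write Parquet based on the
# output file's extension.
table.save("logs.parquet")
```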

@lsudo commented Mar 10, 2020

Thanks for the input, everyone; I just started working on this topic today. I'll let you know about my progress, but I was also thinking about running a separate process to convert JSON to Parquet.

@okkez (Contributor) commented May 11, 2020

We are developing a compressor for the Parquet format. We can either create a PR against this plugin or publish the compressor as a new gem. The compressor depends on an external command we are also developing.

Which would you prefer, a PR or a gem?

@eredi93 commented May 11, 2020

Awesome @okkez, could you update this issue when you have something? Also happy to help if you need a hand.

@repeatedly (Member)

Parquet is popular now, and this doesn't depend on external gems, so supporting it in the s3 plugin seems better. Could you send a PR?

@eredi93 commented May 12, 2020

@repeatedly what do you mean, it doesn't depend on external gems? It would require https://github.com/apache/arrow/tree/master/ruby/red-parquet, no?

@okkez (Contributor) commented May 30, 2020

OK, I will create a new PR to add the Parquet compressor after we test it in our environment.

@shevisj commented Jun 16, 2020

@okkez any progress to report on this PR? Parquet formatting would be very useful, so I'd be happy to contribute if there is still work to be done.

@realknorke

We would GREATLY appreciate Parquet support in fluentd.

@okkez (Contributor) commented Jul 3, 2020

@shevisjohnson @realknorke Could you try #338?

@realknorke commented Jul 3, 2020

> @shevisjohnson @realknorke Could you try #338?

Sorry, that's not feasible for us because columnify is not mature enough. For a test run I tried to convert a simple TSV file to Parquet (all columns in the schema set to type "string"). The process failed because some columns contain the value NAN or INF (valid locodes for regions in France). Also, columnify eats up a lot of memory because the columnization is done in memory. That may not be a problem in most cases, but it's a huge problem inside a (relatively) small fluentd logging container that writes a lot of data per day. You don't want large (4+ GB) peaks in memory consumption (CPU peaks may be okay to some degree).

@okkez (Contributor) commented Aug 13, 2020

@realknorke thank you for testing columnify. We are trying to improve its memory usage in reproio/columnify#52.

Please create a new issue at https://github.com/reproio/columnify/issues if you run into problems using columnify.

@okkez (Contributor) commented Aug 24, 2020

Columnify v0.1.0 has been released:
https://github.com/reproio/columnify/blob/v0.1.0/CHANGELOG.md

> Implement stream-based input record decoders to reduce memory consumption.
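
For reference, a typical columnify invocation looks roughly like this — a sketch based on the columnify README at the time (flag names may differ between versions, and the schema and file names are placeholders; check columnify -h for the authoritative list):

```sh
# Convert newline-delimited JSON to Parquet using an Avro schema
# (logs.avsc and logs.jsonl are placeholder file names).
columnify -schemaType avro -schemaFile logs.avsc -recordType jsonl logs.jsonl > logs.parquet
```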

@abicky commented Sep 11, 2020

If you want some examples, the repository https://github.com/abicky/docker-log-and-fluent-plugin-s3-with-columnify-example might be helpful.

@realknorke

@okkez sorry, it took me some time to get priority for this issue.
I successfully integrated columnify as a compression flavor for s3 in fluentd (basically your Ruby example).
I'm going to test columnify with fluentd in more depth, under high load, starting Monday. :)
It looks very promising so far; I'll keep you informed.

@123BLiN commented Mar 31, 2021

It seems such a plugin could also act as a kind of schema validator, only allowing "good" messages into S3; in our case Athena breaks later if the schema has changed.

@abicky commented May 12, 2021

FYI, fluent-plugin-s3 1.6.0, which includes the changes from #338, has been released.
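
For anyone landing here later, the new compressor is selected with store_as parquet and shells out to the columnify command — a rough config sketch, assuming the <compress> options described in the fluent-plugin-s3 README; the match pattern, bucket name, and schema path are placeholders:

```
<match pattern>
  @type s3
  s3_bucket my-log-bucket          # placeholder
  store_as parquet                 # requires the columnify command on PATH
  <compress>
    schema_type avro               # schema format understood by columnify
    schema_file /path/to/schema.avsc
  </compress>
  <format>
    @type json                     # records serialized as JSON, then columnified
  </format>
</match>
```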

@ashie (Member) commented May 12, 2021

Thanks for the notification. Yes, it's already released, so I'm closing this issue.

@ashie closed this as completed May 12, 2021