Inefficient Parquet Conversion with columnify compared to pyarrow #93

harishoss · 2023-09-23T04:57:10Z

I am working on converting JSONL log files to Parquet format to improve log search capabilities.
To achieve this, I've been exploring tools compatible with Fluentd, and I came across the s3-plugin, which uses the columnify tool for conversion.

In my quest to find the most efficient conversion method, I conducted tests using two different approaches:

I created a custom Python script utilizing the pandas and pyarrow libraries for JSONL to Parquet conversion.
I used the columnify tool for the same purpose.

I used a JSONL file containing approximately 27,000 log lines, all structured similarly to the following example:

{ "stdouttype": "stdout", "letter": "F", "level": "info", "f_t": "2023-09-21T16:35:46.608Z", "ist_timestamp": "21 Sept 2023, 22:05:46 GMT+5:30", "f_s": "service-name", "f_l": "module_name", "apiName": "<name_of_api>", "workflow": "some-workflow-qwewqe-0", "step": "somestepid0", "sender": "234567854321345670", "traceId": "23456785432134567_wertjlwqkjrtljjjwelfe0", "sid": "", "request": "<stringified-request-body>", "response": "<stringified-request-body>"}
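
To make the test repeatable without sharing real logs, a synthetic input file can be generated. The sketch below builds records with the same keys as the example above; which fields vary from line to line, and how, is an assumption, since the real data is not shown.

```python
import json


def make_log_record(i: int) -> dict:
    """Build one synthetic record shaped like the example log line."""
    return {
        "stdouttype": "stdout",
        "letter": "F",
        "level": "info",
        "f_t": "2023-09-21T16:35:46.608Z",
        "ist_timestamp": "21 Sept 2023, 22:05:46 GMT+5:30",
        "f_s": "service-name",
        "f_l": "module_name",
        "apiName": "<name_of_api>",
        # Assumption: only a handful of workflows and steps occur in practice.
        "workflow": f"some-workflow-qwewqe-{i % 10}",
        "step": f"somestepid{i % 10}",
        "sender": "234567854321345670",
        # Assumption: the trace ID is different on every line.
        "traceId": f"23456785432134567_wertjlwqkjrtljjjwelfe{i}",
        "sid": "",
        "request": "<stringified-request-body>",
        "response": "<stringified-request-body>",
    }


def write_sample_jsonl(path: str, rows: int = 27_000) -> None:
    """Write `rows` synthetic records to `path`, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for i in range(rows):
            f.write(json.dumps(make_log_record(i)) + "\n")
```

The resulting file can be fed to both converters so that size differences come from the writers rather than from the input.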

For both methods, I generated GZIP-compressed JSON and Parquet files. The image below shows the 3 Parquet files that were generated:

main_file.log.gz.parquet (101KB) is generated by python script (pandas+pyarrow)
main_file1.columnify.parquet (8.7MB) is generated by columnify

As shown, the Parquet file generated by columnify is significantly larger than the one created by the Python script.

Upon further investigation, I discovered that the default row_group_size and page_size settings differ between pyarrow (used in the Python script) and columnify (utilizing parquet-go):

In pyarrow:

Default row_group_size: 1Mi rows, i.e. 1024 * 1024 (maximum of 64Mi rows). Note that pyarrow counts this in rows, not bytes.
Default page_size (data_page_size): 1MB

In columnify (parquet-go):
Default row_group_size: 128MB
Default page_size: 8KB

So, I adjusted the page_size for columnify to 1MB (-parquetPageSize 1048576), which reduced the file size from 8.7MB to 438KB. However, modifying the row_group_size option did not result in further size reduction.
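
A likely reason the page size matters so much is that Parquet compresses each page independently, so a small page gives the compressor little repeated context to exploit. The toy sketch below illustrates that effect with plain zlib on a made-up column; it does not write Parquet, and the column contents are an assumption.

```python
import zlib


def paged_compressed_size(data: bytes, page_size: int) -> int:
    """Sum of compressed sizes when every page is compressed on its own."""
    total = 0
    for start in range(0, len(data), page_size):
        total += len(zlib.compress(data[start:start + page_size]))
    return total


# Hypothetical column: 27,000 values drawn from 10 distinct workflow names.
column = "".join(f"some-workflow-qwewqe-{i % 10}" for i in range(27_000)).encode()

small_pages = paged_compressed_size(column, 8 * 1024)     # 8KB pages
large_pages = paged_compressed_size(column, 1024 * 1024)  # 1MB pages
print(f"8KB pages: {small_pages} bytes, 1MB pages: {large_pages} bytes")
```

On repetitive log data like this, the 8KB variant comes out larger because every page pays the compressor's start-up cost again. Real Parquet files also carry a header per page, which adds to the difference.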

I'm seeking help in understanding why the columnify-generated Parquet file remains larger than the one generated by the Python script using pyarrow. Is this due to limitations in the parquet-go library, or am I missing something in my configuration?

I would appreciate any insights, advice, or recommendations on optimizing the Parquet conversion process with columnify.

LINKS
pyarrow doc ref. for page_size and row_group_size
pyarrow default row group size value
pyarrow default page_size
parquet-go row_group_size and page_size


okkez · 2023-09-25T01:34:44Z

In general, the optimal settings will vary depending on the workload. The default settings for converting to Parquet using fluent-plugin-s3 are not optimized, so you will need to experiment to find the best settings for your workload.

Here are a few quick tips:

Increase the chunk_limit_size setting in Fluentd. This will reduce the number of times that data needs to be buffered and flushed to disk.
Experiment with different values for the parquet_row_group_size setting. This setting controls the size of the chunks that are written to Parquet files.
Try different compression codecs for Parquet files. The default compression codec is gzip, but other codecs may be more efficient for your workload.

It is important to note that the Python script (pandas + pyarrow) is not a streaming process. This means that it will not be affected by the same performance bottlenecks as Fluentd + fluent-plugin-s3. Therefore, it is not a fair comparison.

harishoss · 2023-09-25T12:37:55Z

Thanks for the reply.

All the tests were run on my local system using the columnify command line, not inside a Fluentd pipeline.
I tried different parquet_row_group_size values, but that made little difference. Changing page_size gave better results, as I mentioned above.
I also ran tests on 1GB data files, not just the 33MB file mentioned above.
The default compression is actually snappy. And yes, I've tried both the snappy and gzip compression codecs.

"It is important to note that the Python script (pandas + pyarrow) is not a streaming process. This means that it will not be affected by the same performance bottlenecks as Fluentd + fluent-plugin-s3. Therefore, it is not a fair comparison."

Could you please elaborate on this? I don't understand why you consider it an unfair comparison.

okkez · 2023-09-26T00:55:35Z

Comparing the columnify command with the Python script (pandas + pyarrow) is fair if both run on a local machine without a Fluentd pipeline. I misunderstood.

I am not familiar with Python (pandas, pyarrow), but I assume that the default settings are optimized for maximum compression.

On the other hand, Columnify uses the default settings of the library it uses (parquet-go), so it is not specifically tuned. Therefore, users need to tune it themselves.

If the Parquet file generated by Columnify is larger than the Parquet file generated by Python (pandas, pyarrow) even when using the same compression codec, page size, and row group size, then it is a problem with parquet-go.

harishoss mentioned this issue Sep 23, 2023

Inefficient Parquet Conversion with columnify (parquet-go) compared to pyarrow #93 xitongsys/parquet-go#565

Open
