I'm trying to determine whether I should create a version of the 50 GB dataset that has all of its string columns saved as string[pyarrow].
I did a quick comparison: write a parquet dataset to disk using string[python] and then using string[pyarrow], and check the sizes of the files on disk. Things get a bit weird when we repartition based on partition_size.
```python
import dask
import dask.dataframe as dd
from dask.distributed import Client

client = Client()

ddf = dask.datasets.timeseries(start='2000-01-01', end='2000-01-31')

ddf_python = ddf.astype({"name": "string[python]"})
ddf_python.to_parquet("temp/python/")  # this gives 30 files of 2.5 MB
ddf_python.repartition(partition_size="100MB").to_parquet("temp/python/")  # 3 files: 2 of 24 MB and 1 of 13 MB

ddf_pyarrow = ddf.astype({"name": "string[pyarrow]"})
ddf_pyarrow.to_parquet("temp/pyarrow/")  # this gives 30 files of 2.5 MB
ddf_pyarrow.repartition(partition_size="100MB").to_parquet("temp/pyarrow/")  # 2 files: one 54 MB and one 6.7 MB
```
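Rather than eyeballing the output directories, a small helper like this (a sketch, assuming the `temp/python/` and `temp/pyarrow/` paths written above) sums up the on-disk sizes per directory:

```python
from pathlib import Path

def dir_size_mb(path):
    """Total size of all parquet files under `path`, in MB."""
    return sum(f.stat().st_size for f in Path(path).glob("*.parquet")) / 1e6

# Compare the two output directories written above
for p in ["temp/python/", "temp/pyarrow/"]:
    print(p, round(dir_size_mb(p), 1), "MB")
```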
It looks like if we aim for ~100 MB partitions in memory, it might be worth creating the string[pyarrow] version of the dataset and comparing.
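To sanity-check how close each variant actually is to the ~100 MB in-memory target, something like the following should work (a sketch, assuming a dask version that provides `memory_usage_per_partition`):

```python
# Per-partition in-memory footprint, in MB
# deep=True is what makes the difference visible: it counts the Python string
# objects for string[python], versus the arrow buffers for string[pyarrow].
mem_python = ddf_python.memory_usage_per_partition(deep=True).compute() / 1e6
mem_pyarrow = ddf_pyarrow.memory_usage_per_partition(deep=True).compute() / 1e6

print(mem_python.describe())
print(mem_pyarrow.describe())
```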
Benchmarks to run on the 50 GB data:

- string[pyarrow] by casting the object columns of the existing dataset to pyarrow strings
- string[pyarrow] by reading directly from the new dataset
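Roughly, the two workflows to compare would look like this (a sketch; the paths are placeholders, and it assumes the pandas string[pyarrow] dtype round-trips through the parquet metadata of the new dataset):

```python
import dask.dataframe as dd

# Option 1: read the existing 50 GB dataset (object-dtype strings) and cast on the fly
ddf_cast = dd.read_parquet("path/to/existing-dataset/")
ddf_cast = ddf_cast.astype({"name": "string[pyarrow]"})

# Option 2: read the new dataset that was written with string[pyarrow] columns
ddf_direct = dd.read_parquet("path/to/new-pyarrow-dataset/")
```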