Investigate string[pyarrow] performance on h2o groupby queries at 50 GB #18

Closed
ncclementi opened this issue Oct 17, 2022 · 2 comments · Fixed by #21
ncclementi commented Oct 17, 2022

ncclementi self-assigned this Oct 17, 2022
ncclementi (Contributor, Author) commented:

I'm trying to determine whether I should create a version of the 50 GB dataset that has all its string columns saved as string[pyarrow].

I did a quick comparison of writing a parquet dataset to disk using string[python] and string[pyarrow], and then checked the sizes of the files on disk. Things get a bit odd when we repartition based on partition_size.

import dask
import dask.dataframe as dd
from dask.distributed import Client

client = Client()

ddf = dask.datasets.timeseries(start='2000-01-01', end='2000-01-31')

ddf_python = ddf.astype({"name": "string[python]"})

ddf_python.to_parquet("temp/python/")  # 30 files of 2.5 MB each
ddf_python.repartition(partition_size="100MB").to_parquet("temp/python/")  # 3 files: two of 24 MB, one of 13 MB

ddf_pyarrow = ddf.astype({"name": "string[pyarrow]"})
ddf_pyarrow.to_parquet("temp/pyarrow/")  # 30 files of 2.5 MB each
ddf_pyarrow.repartition(partition_size="100MB").to_parquet("temp/pyarrow/")  # 2 files: one of 54 MB, one of 6.7 MB
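
For reference, a quick way to compare the on-disk footprints is to sum the file sizes under each output directory. A minimal sketch (the paths are the ones written above; dir_size_mb is just a helper for this example):

import os

def dir_size_mb(path):
    # Sum the sizes of all files under `path`, in MB.
    return sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, files in os.walk(path)
        for name in files
    ) / 1e6

print(f"string[python]:  {dir_size_mb('temp/python/'):.1f} MB")
print(f"string[pyarrow]: {dir_size_mb('temp/pyarrow/'):.1f} MB")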

It looks like if we aim for ~100 MB partitions in memory, it might be worth creating the string[pyarrow] version of the dataset and comparing.
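
One way to sanity-check the in-memory sizes before picking a partition_size is Dask's memory_usage_per_partition. A minimal sketch, assuming a recent-enough Dask; deep=True is needed so the string payloads are counted, not just the pointers:

# Per-partition memory usage, converted from bytes to MB.
print(ddf_python.memory_usage_per_partition(deep=True).compute() / 1e6)
print(ddf_pyarrow.memory_usage_per_partition(deep=True).compute() / 1e6)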

@ian-r-rose @jrbourbeau any thoughts?

ncclementi (Contributor, Author) commented:

See #14 (comment) for summary results
