I'm trying to determine whether I should create a version of the 50 GB dataset that has all of its string columns saved as string[pyarrow].
I did a quick comparison: write a parquet dataset to disk using string[python] and then using string[pyarrow], and check the sizes of the files on disk. Things get a bit weird when we repartition based on partition_size.
```python
import dask
import dask.dataframe as dd
from dask.distributed import Client

client = Client()

ddf = dask.datasets.timeseries(start='2000-01-01', end='2000-01-31')

ddf_python = ddf.astype({"name": "string[python]"})
ddf_python.to_parquet("temp/python/")  # this gives 30 files of 2.5 MB
ddf_python.repartition(partition_size="100MB").to_parquet("temp/python/")  # 3 files: 2 of 24 MB and 1 of 13 MB

ddf_pyarrow = ddf.astype({"name": "string[pyarrow]"})
ddf_pyarrow.to_parquet("temp/pyarrow/")  # this gives 30 files of 2.5 MB
ddf_pyarrow.repartition(partition_size="100MB").to_parquet("temp/pyarrow/")  # 2 files: one 54 MB and one 6.7 MB
```
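Rather than eyeballing the output directories, a small helper like this (a sketch, assuming the `temp/python/` and `temp/pyarrow/` paths written above) sums up the on-disk sizes per directory:

```python
from pathlib import Path

def dir_size_mb(path):
    """Total size of all parquet files under `path`, in MB."""
    return sum(f.stat().st_size for f in Path(path).glob("*.parquet")) / 1e6

# Compare the two output directories written above
for p in ["temp/python/", "temp/pyarrow/"]:
    print(p, round(dir_size_mb(p), 1), "MB")
```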
It looks like if we aim for ~100 MB partitions in memory, it might be worth creating the string[pyarrow] version of the dataset and comparing.
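To sanity-check how close each variant actually is to the ~100 MB in-memory target, something like the following should work (a sketch, assuming a dask version that provides `memory_usage_per_partition`):

```python
# Per-partition in-memory footprint, in MB
# deep=True is what makes the difference visible: it counts the Python string
# objects for string[python], versus the arrow buffers for string[pyarrow].
mem_python = ddf_python.memory_usage_per_partition(deep=True).compute() / 1e6
mem_pyarrow = ddf_pyarrow.memory_usage_per_partition(deep=True).compute() / 1e6

print(mem_python.describe())
print(mem_pyarrow.describe())
```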
Benchmarks to run on the 50 GB data:

- string[pyarrow] by casting the object columns of the existing dataset to pyarrow strings
- string[pyarrow] by reading directly from the new dataset
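Roughly, the two workflows to compare would look like this (a sketch; the paths are placeholders, and it assumes the pandas string[pyarrow] dtype round-trips through the parquet metadata of the new dataset):

```python
import dask.dataframe as dd

# Option 1: read the existing 50 GB dataset (object-dtype strings) and cast on the fly
ddf_cast = dd.read_parquet("path/to/existing-dataset/")
ddf_cast = ddf_cast.astype({"name": "string[pyarrow]"})

# Option 2: read the new dataset that was written with string[pyarrow] columns
ddf_direct = dd.read_parquet("path/to/new-pyarrow-dataset/")
```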