
saving a model into HDF5 #9

Open

sergsb opened this issue Jul 3, 2018 · 7 comments
@sergsb

sergsb commented Jul 3, 2018

Hi,

Is it possible to dump a model into an HDF5 file? The problem is that when I try to dump 1500 samples with your save_prefix option, the total disk space required is more than 10 GB, because CSV is not well suited to storing numerical values.

@jaak-s
Owner

jaak-s commented Jul 3, 2018

Hi,

Currently it is not possible. However, it would not be too difficult to implement. I could implement HDF5 support where each sample is dumped into a separate HDF5 file. Would that be useful for you?
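
For illustration, a per-sample dump could look roughly like this (just a sketch assuming h5py; the `save_sample_hdf5` helper and the dataset names are hypothetical, not existing macau API):

```python
import h5py
import numpy as np

def save_sample_hdf5(filename, U, V):
    """Write one posterior sample's latent matrices to its own HDF5 file."""
    with h5py.File(filename, "w") as f:
        # float32 with gzip compression keeps the files small
        f.create_dataset("U", data=U.astype(np.float32), compression="gzip")
        f.create_dataset("V", data=V.astype(np.float32), compression="gzip")
```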

@sergsb
Author

sergsb commented Jul 3, 2018

I think that is not a very good idea, because the number of files would be huge and the write overhead would be significant. Would it be possible to collect all the samples and dump them together?

@jaak-s
Owner

jaak-s commented Jul 3, 2018

Unfortunately, that would only work for small models where all samples fit into memory. For larger cases, where a single sample takes a GB or more, it is not feasible to keep everything in memory.

@gabora

gabora commented Jul 1, 2019

Hi,
I have a similar problem, but instead of a huge model size, I get too many CSV files per run (600 samples over 40 matrices produces ~100k CSV files, and I need to repeat this about 200 times with other matrices). I want to do some cross-validation outside of macau, so I need the model matrices.

Would it be possible to return the model matrices directly from the macau function?
Then the user could decide how to store them.

@jaak-s
Owner

jaak-s commented Jul 1, 2019

Hi,
One key question is whether all the samples will fit into memory. The main reason I did not provide that option was to avoid running out of memory.
As a quick solution, I would propose writing a script that loads all the CSV files (for one matrix), stacks them, and saves them into a single file, e.g., with numpy.save or HDF5. This can be run right after the sampling finishes, and then you can delete the CSV files.
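
Something along these lines (a minimal sketch; the CSV file-name pattern is an assumption, so adjust the glob to whatever your save_prefix run actually produced):

```python
import glob
import os

import numpy as np

def stack_samples(pattern, out_file):
    """Load every CSV sample matching `pattern`, stack them along a new
    leading axis, and save them as one binary .npy file."""
    files = sorted(glob.glob(pattern))
    samples = [np.loadtxt(f, delimiter=",") for f in files]
    np.save(out_file, np.stack(samples).astype(np.float32))
    return files

# Hypothetical naming -- adjust to your actual save_prefix output:
csv_files = stack_samples("myrun-sample*-U1-latents.csv", "U1-samples.npy")
for f in csv_files:  # delete the CSVs once the stacked file is saved
    os.remove(f)
```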

@gabora

gabora commented Jul 2, 2019

Thanks for the quick response.
I see your point; the memory usage of these matrices can probably escalate very quickly...

I took a pipeline from someone that followed approximately your procedure (read the CSV files, use the model matrices, then delete the CSV files). I then modified the pipeline not to delete the model files (so I could use them later). The result was hilarious: after 10 minutes of running the code in parallel across 80 jobs, I exceeded my hard-disk quota on the cluster, and then spent more than an hour deleting all the files it had created :)

Thanks anyway, I will convert the CSV files to binary data and save them compressed; hopefully that will be smaller.
If someone else wants to follow this approach: the feather package can save the files in a binary format that is also compatible with R.
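
For example (a small sketch, assuming pandas with pyarrow installed; the file names are just placeholders):

```python
import pandas as pd

# Hypothetical file name -- use whatever your run actually produced.
df = pd.read_csv("sample1-U1-latents.csv", header=None)
df.columns = df.columns.astype(str)  # Feather requires string column names
df = df.astype("float32")            # float32 roughly halves the size
df.to_feather("sample1-U1-latents.feather", compression="zstd")
```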
Thanks again, and have a nice day.

@jaak-s
Owner

jaak-s commented Jul 2, 2019

Using a binary format should reduce the needed disk space, and float32 should have sufficient precision for storing the matrices. I hope this solves the issue :).

The feather package you linked looks attractive for that purpose.
