
saving a model into HDF5 #9

Open

sergsb opened this issue Jul 3, 2018 · 7 comments
@sergsb

sergsb commented Jul 3, 2018

Hi,

Is it possible to dump a model into an HDF5 file? The problem is that when I try to dump 1500 samples with your save_prefix option, the total disk space required is more than 10 GB, because CSV is not well suited to storing numerical values.

@jaak-s
Owner

jaak-s commented Jul 3, 2018

Hi,

Currently it is not possible. However, it would not be too difficult to implement. I could implement HDF5 support where each sample is dumped into a separate HDF5 file. Would that be useful for you?
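
For illustration, a per-sample dump could look roughly like this (just a sketch assuming h5py; the `save_sample_hdf5` helper and the dataset names are hypothetical, not existing macau API):

```python
import h5py
import numpy as np

def save_sample_hdf5(filename, U, V):
    """Write one posterior sample's latent matrices to its own HDF5 file."""
    with h5py.File(filename, "w") as f:
        # float32 with gzip compression keeps the files small
        f.create_dataset("U", data=U.astype(np.float32), compression="gzip")
        f.create_dataset("V", data=V.astype(np.float32), compression="gzip")
```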

@sergsb
Author

sergsb commented Jul 3, 2018

I think that is not a very good idea, because the number of files would be huge and the write overhead would be significant. Would it be possible to collect all the samples and dump them together?

@jaak-s
Owner

jaak-s commented Jul 3, 2018

Unfortunately, that would only work for small models where all samples fit into memory. For larger cases, where a single sample takes a GB or more, it is not feasible to keep everything in memory.

@gabora

gabora commented Jul 1, 2019

Hi,
I have a similar problem, but instead of a huge model size, I get too many CSV files per run (600 samples over 40 matrices produces ~100k CSV files, and I need to repeat this about 200 times with other matrices). I want to do some cross-validation outside of macau, so I need the model matrices.

Would it be possible to return the model matrices directly from the macau function?
Then the user could decide how to store them.

@jaak-s
Owner

jaak-s commented Jul 1, 2019

Hi,
One key question is whether all the samples will fit into memory. The main reason I did not provide that option was to avoid running out of memory.
As a quick solution, I would propose writing a script that loads all the CSV files (for one matrix), stacks them, and saves them into a single file, e.g., with numpy.save or HDF5. This can be run right after the sampling finishes, and then you can delete the CSV files.
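
Something along these lines (a minimal sketch; the CSV file-name pattern is an assumption, so adjust the glob to whatever your save_prefix run actually produced):

```python
import glob
import os

import numpy as np

def stack_samples(pattern, out_file):
    """Load every CSV sample matching `pattern`, stack them along a new
    leading axis, and save them as one binary .npy file."""
    files = sorted(glob.glob(pattern))
    samples = [np.loadtxt(f, delimiter=",") for f in files]
    np.save(out_file, np.stack(samples).astype(np.float32))
    return files

# Hypothetical naming -- adjust to your actual save_prefix output:
csv_files = stack_samples("myrun-sample*-U1-latents.csv", "U1-samples.npy")
for f in csv_files:  # delete the CSVs once the stacked file is saved
    os.remove(f)
```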

@gabora

gabora commented Jul 2, 2019

Thanks for the quick response.
I see your point; the memory usage of these matrices can probably escalate very quickly...

I took a pipeline from someone that followed approximately your procedure (read the CSV files, use the model matrices, then delete the CSV files). I then modified the pipeline not to delete the model files (so I could use them later). The result was hilarious: after 10 minutes of running the code in parallel across 80 jobs, I exceeded my hard-disk quota on the cluster, and then spent more than an hour deleting all the files it had created :)

Thanks anyway, I will convert the CSV files to binary data and save them compressed; hopefully that will be smaller.
If someone else wants to follow this approach: the feather package can save the files in a binary format that is also compatible with R.
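
For example (a small sketch, assuming pandas with pyarrow installed; the file names are just placeholders):

```python
import pandas as pd

# Hypothetical file name -- use whatever your run actually produced.
df = pd.read_csv("sample1-U1-latents.csv", header=None)
df.columns = df.columns.astype(str)  # Feather requires string column names
df = df.astype("float32")            # float32 roughly halves the size
df.to_feather("sample1-U1-latents.feather", compression="zstd")
```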
Thanks again, and have a nice day.

@jaak-s
Owner

jaak-s commented Jul 2, 2019

Using a binary format should reduce the needed disk space, and float32 should have sufficient precision for storing the matrices. I hope this solves the issue :).

The feather package you linked looks attractive for that purpose.
