
unusually high memory footprint #202

Closed
cschwem2er opened this issue Dec 6, 2020 · 4 comments

Comments

@cschwem2er

Hi,

I'm using the latest CRAN version of spacyr and its spaCy installer. For a dataset of 300k documents (roughly email length, about 3 GB uncompressed) I am using spacyr for lemmatization, and the result is an extremely high memory footprint:

[screenshot of system memory usage during parsing]

Unfortunately I can't share the dataset for reproduction, but if I can help to find out what's going on in any other way please let me know :)

@kbenoit
Collaborator

kbenoit commented Dec 6, 2020

I've noticed similar patterns too. It would be interesting to compare this to the memory usage when parsing the same texts with spaCy directly in Python. @amatsuo, want to run some tests? I wonder whether this is spaCy or reticulate.

@cschwem2er maybe batching the parsing would solve this?
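Something like this minimal sketch is what I have in mind (assuming a character vector `texts`; the model and batch size are placeholders and would need tuning to your setup):

```r
library(spacyr)
spacy_initialize(model = "en_core_web_sm")

# Split the corpus into batches so each intermediate result can be
# garbage-collected before the next batch is parsed.
batch_size <- 10000
batches <- split(texts, ceiling(seq_along(texts) / batch_size))

parsed <- lapply(batches, function(batch) {
  spacy_parse(batch, lemma = TRUE, pos = FALSE, entity = FALSE)
})
result <- do.call(rbind, parsed)

spacy_finalize()
```

One caveat, if I remember the doc_id behaviour right: for an unnamed character vector, `spacy_parse()` generates doc_ids like `text1`, `text2`, ... per call, so they would repeat across batches; naming the vector avoids the collision.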

@cschwem2er
Author

Thanks, Ken, for the fast response. Yes, I used batching as a workaround before and it did the trick (so did buying more RAM =D).

@SeanFobbe

This issue may be due to `multithread = TRUE`. I've used spacyr a lot over the past couple of months, and whenever `multithread = TRUE` (regardless of corpus), memory usage increases drastically compared with `multithread = FALSE`. The built-in multithreading also doesn't spawn any additional processes that would be detectable via top on a Linux machine.
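To reproduce the pattern, the only toggle involved is the `multithread` argument (sketch; `texts` stands for any sizeable character vector, and `multithread = TRUE` is, as far as I know, the default):

```r
library(spacyr)
spacy_initialize()

# With multithread = TRUE, memory climbs drastically during parsing:
parsed_mt <- spacy_parse(texts, multithread = TRUE)

# ...while the single-threaded run stays flat:
parsed_st <- spacy_parse(texts, multithread = FALSE)

spacy_finalize()
```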

I'm fairly certain this is somehow related to #206 and that multithreading is not working as intended, eating up massive amounts of RAM instead of parallelizing the calculations...

My setup is Fedora 34, running on an AMD Ryzen 7 3700X, using spacyr_1.2.1. I'm happy to supply smaller and larger corpora to test this, but I believe this is a spacyr issue, not a data issue. A good testing corpus (not too large) might be this one (it's in German, though): https://doi.org/10.5281/zenodo.3902658

I did succeed in building a parallelized workaround by setting `multithread = FALSE` and adding a doParallel/foreach framework on top: https://github.com/SeanFobbe/R-fobbe-proto-package/blob/main/f.dopar.spacyparse.R The same approach with a future front-/backend fails because of non-exportable objects; I'm not sure why this doesn't affect the doParallel approach.
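Roughly, the structure is this simplified sketch of the linked script (cluster size, model name, and `texts` are placeholders, not the script's actual values):

```r
library(spacyr)
library(doParallel)
library(foreach)

cl <- makeCluster(4)
registerDoParallel(cl)

# Split the corpus into one chunk per worker.
batches <- split(texts, cut(seq_along(texts), 4, labels = FALSE))

# Each worker initializes its own spaCy instance inside %dopar%, so no
# non-exportable Python objects have to cross process boundaries.
parsed <- foreach(batch = batches, .combine = rbind,
                  .packages = "spacyr") %dopar% {
  spacy_initialize(model = "de_core_news_sm")
  out <- spacy_parse(batch, lemma = TRUE, multithread = FALSE)
  spacy_finalize()
  out
}

stopCluster(cl)
```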

@kbenoit
Collaborator

kbenoit commented Sep 1, 2022

We are aware of these issues and are (finally!) getting around to addressing them in #185. spaCy itself has also improved in this regard. Hope to have solutions soon.

kbenoit closed this as completed on Sep 1, 2022