
unusually high memory footprint #202

Closed
cschwem2er opened this issue Dec 6, 2020 · 4 comments

Comments

@cschwem2er

Hi,

I'm using the latest CRAN version of spacyr and its spaCy installer. For a dataset of 300k documents (roughly email length, about 3 GB uncompressed) I am using spacyr for lemmatization, and the result is an extremely high memory footprint:

[screenshot of system memory usage during parsing]

Unfortunately I can't share the dataset for reproduction, but if I can help to find out what's going on in any other way please let me know :)

@kbenoit
Collaborator

kbenoit commented Dec 6, 2020

I've noticed similar patterns too. It would be interesting to compare this to the memory usage when parsing the same texts with spaCy directly in Python. @amatsuo, want to run some tests? I wonder whether this is spaCy or reticulate.

@cschwem2er maybe batching the parsing would solve this?
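Something like this minimal sketch is what I have in mind (assuming a character vector `texts`; the model and batch size are placeholders and would need tuning to your setup):

```r
library(spacyr)
spacy_initialize(model = "en_core_web_sm")

# Split the corpus into batches so each intermediate result can be
# garbage-collected before the next batch is parsed.
batch_size <- 10000
batches <- split(texts, ceiling(seq_along(texts) / batch_size))

parsed <- lapply(batches, function(batch) {
  spacy_parse(batch, lemma = TRUE, pos = FALSE, entity = FALSE)
})
result <- do.call(rbind, parsed)

spacy_finalize()
```

One caveat, if I remember the doc_id behaviour right: for an unnamed character vector, `spacy_parse()` generates doc_ids like `text1`, `text2`, ... per call, so they would repeat across batches; naming the vector avoids the collision.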

@cschwem2er
Author

Thanks, Ken, for the fast response. Yes, I used batching as a workaround before and it did the trick (so did buying more RAM =D).

@SeanFobbe

This issue may be due to `multithread = TRUE`. I've used spacyr a lot over the past couple of months, and whenever `multithread = TRUE` (regardless of corpus), memory usage increases drastically compared with `multithread = FALSE`. The built-in multithreading also doesn't spawn any additional processes that would be detectable via top on a Linux machine.
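To reproduce the pattern, the only toggle involved is the `multithread` argument (sketch; `texts` stands for any sizeable character vector, and `multithread = TRUE` is, as far as I know, the default):

```r
library(spacyr)
spacy_initialize()

# With multithread = TRUE, memory climbs drastically during parsing:
parsed_mt <- spacy_parse(texts, multithread = TRUE)

# ...while the single-threaded run stays flat:
parsed_st <- spacy_parse(texts, multithread = FALSE)

spacy_finalize()
```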

I'm fairly certain this is somehow related to #206 and that multithreading is not working as intended, eating up massive amounts of RAM instead of parallelizing the calculations...

My setup is Fedora 34, running on an AMD Ryzen 7 3700X, using spacyr_1.2.1. I'm happy to supply smaller and larger corpora to test this, but I believe this is a spacyr issue, not a data issue. A good testing corpus (not too large) might be this one (it's in German, though): https://doi.org/10.5281/zenodo.3902658

I did succeed in building a parallelized workaround by setting `multithread = FALSE` and adding a doParallel/foreach framework on top: https://github.com/SeanFobbe/R-fobbe-proto-package/blob/main/f.dopar.spacyparse.R The same approach with a future front-/backend fails because of non-exportable objects; I'm not sure why this doesn't affect the doParallel approach.
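Roughly, the structure is this simplified sketch of the linked script (cluster size, model name, and `texts` are placeholders, not the script's actual values):

```r
library(spacyr)
library(doParallel)
library(foreach)

cl <- makeCluster(4)
registerDoParallel(cl)

# Split the corpus into one chunk per worker.
batches <- split(texts, cut(seq_along(texts), 4, labels = FALSE))

# Each worker initializes its own spaCy instance inside %dopar%, so no
# non-exportable Python objects have to cross process boundaries.
parsed <- foreach(batch = batches, .combine = rbind,
                  .packages = "spacyr") %dopar% {
  spacy_initialize(model = "de_core_news_sm")
  out <- spacy_parse(batch, lemma = TRUE, multithread = FALSE)
  spacy_finalize()
  out
}

stopCluster(cl)
```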

@kbenoit
Collaborator

kbenoit commented Sep 1, 2022

We are aware of these issues and are (finally!) getting around to addressing them in #185. spaCy itself has also improved in this regard. Hope to have solutions soon.

kbenoit closed this as completed on Sep 1, 2022