Unusually high memory footprint #202
Comments
I've noticed similar patterns too. It would also be very interesting to compare this to the memory usage when parsing these texts in spaCy in Python. @amatsuo, want to run some tests? I wonder whether this is spaCy or whether it's reticulate. @cschwem2er maybe batching the parsing would solve this?
Thanks, Ken, for the fast response. Yes, I used batching as a workaround before and it did the trick (so does buying more RAM =D).
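For anyone hitting the same problem, a minimal sketch of the batching workaround might look like the following. The input `txt` (a character vector of documents), the model name, and the batch size are illustrative assumptions, not values taken from this thread.

```r
# Sketch: parse in batches so only one batch's output sits in memory at a time.
library(spacyr)
spacy_initialize(model = "en_core_web_sm")

batch_size <- 5000                       # hypothetical batch size; tune to available RAM
batches <- split(txt, ceiling(seq_along(txt) / batch_size))

results <- lapply(batches, function(chunk) {
  out <- spacy_parse(chunk, lemma = TRUE, pos = FALSE, entity = FALSE)
  gc()                                   # encourage R to release memory between batches
  out
})
parsed <- do.call(rbind, results)        # spacy_parse() returns data frames, so rbind works

spacy_finalize()
```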
I'm fairly certain this is somehow related to #206: multithreading is not working as intended and eats up massive amounts of RAM instead of parallelizing the calculations. My setup is Fedora 34 on an AMD Ryzen 7 3700X, using spacyr 1.2.1. I'm happy to supply smaller and larger corpora for testing, but I believe this is a spacyr issue, not a data issue. A good testing corpus (not too large, though it is in German) might be this one: https://doi.org/10.5281/zenodo.3902658. I did succeed in building a parallelized workaround.
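The exact setting behind that workaround isn't preserved above, but one hypothetical way to parallelize while keeping per-process memory bounded is to run each chunk in a separate R worker, each with its own spaCy instance and with spacyr's own multithreading turned off. Everything here (worker count, German model, object names) is assumed for illustration, not taken from the comment.

```r
# Sketch: separate R worker processes, each initializing spaCy independently,
# so no single session accumulates the entire parsed output in RAM.
library(parallel)

chunks <- split(txt, cut(seq_along(txt), 4, labels = FALSE))  # 4 chunks, adjust as needed

cl <- makeCluster(4)                     # PSOCK cluster: fresh R sessions, safer with reticulate
parsed_list <- parLapply(cl, chunks, function(chunk) {
  library(spacyr)                        # workers need spacyr and a working spaCy environment
  spacy_initialize(model = "de_core_news_sm")   # German model, matching the linked corpus
  out <- spacy_parse(chunk, lemma = TRUE, multithread = FALSE)
  spacy_finalize()
  out
})
stopCluster(cl)

parsed <- do.call(rbind, parsed_list)
```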
We are aware of these issues and are (finally!) getting around to addressing them in #185. spaCy has also improved in this regard. We hope to have solutions soon.
Hi,
I'm using the latest CRAN version of spacyr and the spaCy installer. For a dataset of 300k documents (roughly email length, about 3 GB uncompressed), I am using spacyr for lemmatization. The result is an insanely high memory footprint.
Unfortunately I can't share the dataset for reproduction, but if I can help to find out what's going on in any other way please let me know :)
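For context, the kind of call described in this report would look roughly like the sketch below; the object `docs` and the model choice are assumptions on my part, since the actual script isn't shown in the issue.

```r
# Rough sketch of a single-pass lemmatization run over a large character vector `docs`.
library(spacyr)
spacy_initialize(model = "en_core_web_sm")

parsed <- spacy_parse(docs,
                      lemma = TRUE,      # lemmatization is the goal here
                      pos = FALSE,       # skip extra annotations to keep the output smaller
                      entity = FALSE)

spacy_finalize()
```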