Fix memory leaks #977
I added a basic memory benchmark for Node.js here which recognizes 30 images 10 times each, and prints the result after each of the 10 sets. The results with the current version of Tesseract.js can be seen below.
[Benchmark results: memory usage over 10 sets with the current version of Tesseract.js]
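For context, a benchmark along these lines could be sketched as below. This is not the linked benchmark script itself; the image paths, the reporting format, and the v5-style `createWorker('eng')` call are assumptions for illustration.

```js
// Minimal sketch of a Node.js memory benchmark: recognize 30 images 10 times
// each and report memory after every set. Image paths are hypothetical.
const { createWorker } = require('tesseract.js');

(async () => {
  // v5-style API; older versions create/initialize the worker differently.
  const worker = await createWorker('eng');

  const images = Array.from({ length: 30 }, (_, i) => `./images/${i}.png`);

  for (let set = 1; set <= 10; set++) {
    for (const image of images) {
      await worker.recognize(image);
    }
    // Resident set size captures WebAssembly memory as well as the JS heap.
    const rssMB = Math.round(process.memoryUsage().rss / 1024 / 1024);
    console.log(`Set ${set}: rss ${rssMB} MB`);
  }

  await worker.terminate();
})();
```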
Upon investigation, it appears that there are several causes behind the memory leak. Two interesting through-lines are that (1) the memory issues appear to be worse in Node.js than in the browser, and (2) the apparent memory leaks are often caused by garbage collection not working correctly in difficult cases, rather than by a memory leak in the traditional sense (i.e. memory in WebAssembly that is allocated but never freed). As a result, I do not have a good conceptual explanation for why changing all of the factors below fixes the leak. The proximate causes behind the increase in memory over time appear to be:
Additionally, memory issues are made worse by the fact that so many output formats are enabled by default. This significantly increases the amount of memory allocation/deallocation that occurs, as well as making the issues with […] worse.
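As a rough illustration of one mitigation, the sketch below restricts recognition output to only the text format. The output-format argument and its keys (`text`, `hocr`, `tsv`) are assumptions based on the Tesseract.js documentation and may differ between versions.

```js
// Sketch: disable unused output formats to reduce per-recognition allocation.
const { createWorker } = require('tesseract.js');

(async () => {
  const worker = await createWorker('eng');

  // Third argument (assumed) selects which outputs Tesseract.js generates.
  const { data } = await worker.recognize('./example.png', {}, {
    text: true,
    hocr: false,
    tsv: false,
  });

  console.log(data.text);
  await worker.terminate();
})();
```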
#977 made a significant dent, as seen below when re-running the benchmark.
[Benchmark results: memory usage over 10 sets after #977]
After all changes are implemented, this is the final result. The memory leak appears to be resolved.
[Benchmark results: memory usage over 10 sets after all changes]
Several Node.js users have reported that using a single worker with hundreds of images increases memory usage linearly over time, which indicates the presence of a memory leak. The recommended solution has been to periodically terminate workers and create new ones. While this is good advice for other reasons (see note below), we should still attempt to resolve the memory leak.
Based on user reports, the leak is small enough that it only impacts Node.js users recognizing many images on a server, so it is likely relatively small on a per-image basis. The most likely explanation is that there is some issue with how we export results from Tesseract. This is based purely on process of elimination: if the issue were with the input (images), the leak would be much larger in magnitude, and if the leak occurred within Tesseract itself, presumably it would be reported and (hopefully) patched within the main Tesseract repo.
Note for users: the advice not to reuse the same workers in perpetuity on a server is good, even if the memory leak gets fixed. This is because Tesseract workers "learn" over time by default. While this learning generally improves results, it assumes that (1) previous results are generally correct and (2) the image being recognized closely resembles previous images. As a result, if the same worker is used with hundreds of different documents from different users, it is common for Tesseract to "learn" something incorrect or inapplicable, making results worse than if a fresh worker had been used.
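For reference, a minimal sketch of the worker-recycling approach is below. The batch size and the v5-style `createWorker('eng')` call are assumptions; adjust both to your workload and Tesseract.js version.

```js
// Sketch: recreate the worker periodically so accumulated state
// (adapted "learning" and any leaked memory) is discarded.
const { createWorker } = require('tesseract.js');

const BATCH_SIZE = 100; // hypothetical; tune for your workload

async function recognizeAll(imagePaths) {
  let worker = await createWorker('eng');
  const texts = [];

  for (let i = 0; i < imagePaths.length; i++) {
    // Replace the worker after every BATCH_SIZE images.
    if (i > 0 && i % BATCH_SIZE === 0) {
      await worker.terminate();
      worker = await createWorker('eng');
    }
    const { data } = await worker.recognize(imagePaths[i]);
    texts.push(data.text);
  }

  await worker.terminate();
  return texts;
}
```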