Large images cause excessive memory usage #900
Comments
@Balearica
@rohitsahu-bstack Good suggestion, I added a new section explaining this case. https://github.com/naptha/tesseract.js/blob/master/docs/workers_vs_schedulers.md#reusing-workers-in-nodejs-server-code
I'm closing this issue as I am no longer sure it is correct. I believe this was motivated primarily by memory leaks which have since been resolved. Furthermore, to the extent that large images are problematic, it is unclear that anything can be done as the image will need to be loaded regardless.
Overview
Tesseract.js currently accepts any valid image and does not downsize large images. Additionally, while the memory allocated for the WebAssembly "heap" can grow if needed, it cannot shrink. Taken together, these behaviors can cause issues for applications that run recognition on arbitrary user inputs. A single excessively large image can cause the allocated memory to expand, and the worker will keep using that larger amount of memory for the rest of its lifespan. This is especially problematic when schedulers are used with 4+ workers.
Solutions
Individual Projects
Individual projects can mitigate this by checking the size of images before sending them to Tesseract.js. If an image is excessively large, it can be rejected or downsized.
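A minimal sketch of such a check, assuming the image dimensions are already known (for example from an image library or from the file header). The `planImageSize` helper and the `MAX_PIXELS` limit are hypothetical names for illustration and are not part of Tesseract.js; the right limit depends on your memory budget.

```javascript
// Hypothetical pixel budget; not a Tesseract.js setting. Tune for your application.
const MAX_PIXELS = 12000000;

/**
 * Decide whether an image can be passed through as-is or should be downsized.
 * Returns { action: 'accept' | 'downsize', width, height }, where width and
 * height are the dimensions the image should have before recognition.
 */
function planImageSize(width, height, maxPixels = MAX_PIXELS) {
  if (!Number.isInteger(width) || !Number.isInteger(height) || width <= 0 || height <= 0) {
    throw new RangeError('width and height must be positive integers');
  }

  const pixels = width * height;
  if (pixels <= maxPixels) {
    return { action: 'accept', width, height };
  }

  // Scale both sides by the same factor so the aspect ratio is preserved
  // and the resulting pixel count does not exceed maxPixels.
  const scale = Math.sqrt(maxPixels / pixels);
  return {
    action: 'downsize',
    width: Math.max(1, Math.floor(width * scale)),
    height: Math.max(1, Math.floor(height * scale)),
  };
}
```

A caller that prefers to reject oversized images can treat the `'downsize'` action as a rejection instead of resizing. The resize itself would be done with whatever image library the project already uses.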
Additionally, if Tesseract.js runs on Node.js for hours on end within server code, the workers should be killed and recreated every so often. While workers are reusable, and should not be created and killed for every image recognized, there are disadvantages to using them forever. As noted above, memory use can only expand over time, so a single large image will permanently increase the memory footprint of a worker. Workers also "learn" over time by default, editing their internal dictionaries based on words recognized in documents. This is useful within the context of a single document or a group of similar documents, but it is not necessarily desirable when recognizing hundreds of unrelated documents. Recreating the worker resets the dictionary.
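One way to recycle workers is to wrap a worker and replace it after a fixed number of jobs. The sketch below is illustrative only: `RecyclingWorker` and `maxJobs` are hypothetical names, and the wrapper assumes only that the factory function returns (a promise of) an object with `recognize` and `terminate` methods, as Tesseract.js workers have.

```javascript
// Wraps a worker and replaces it after `maxJobs` recognitions.
// `createWorkerFn` is any async factory that resolves to an object with
// `recognize(image)` and `terminate()` methods.
class RecyclingWorker {
  constructor(createWorkerFn, maxJobs = 50) {
    this.createWorkerFn = createWorkerFn;
    this.maxJobs = maxJobs;
    this.worker = null;
    this.jobs = 0;
    this.queue = Promise.resolve(); // serializes jobs so recycling never races
  }

  recognize(image) {
    const run = this.queue.then(() => this._recognize(image));
    this.queue = run.catch(() => {}); // keep the chain usable after a failed job
    return run;
  }

  async _recognize(image) {
    if (this.worker === null) {
      this.worker = await this.createWorkerFn();
      this.jobs = 0;
    }
    try {
      return await this.worker.recognize(image);
    } finally {
      this.jobs += 1;
      if (this.jobs >= this.maxJobs) {
        // Terminating frees the worker's memory and discards its learned
        // dictionary; the next job creates a fresh worker lazily.
        const old = this.worker;
        this.worker = null;
        await old.terminate();
      }
    }
  }

  async terminate() {
    await this.queue; // wait for in-flight jobs to finish
    if (this.worker !== null) {
      const old = this.worker;
      this.worker = null;
      await old.terminate();
    }
  }
}
```

With Tesseract.js, the factory could be something like `() => createWorker('eng')`, though the exact `createWorker` arguments depend on the library version. A job count is the simplest trigger; elapsed time or an oversized input could be used instead.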
Tesseract.js
Eventually, Tesseract.js should automatically downsize images that are over a certain size. This size should be configurable by the user.