Fix memory leaks #977

Balearica · 2024-12-18T03:04:38Z

Several Node.js users have reported that using a single worker with hundreds of images increases memory usage linearly over time, which indicates the presence of a memory leak. The recommended solution has been to periodically terminate workers and create new ones. While this is good advice for other reasons (see note below), we should still attempt to resolve the memory leak.

Based on user reports, the leak only affects Node.js users recognizing many images on a server, so it is likely small on a per-image basis. The most likely explanation is an issue with how we export results from Tesseract. This is based purely on process of elimination: if the issue were with the input (images), the leak would be much larger in magnitude, and if the leak occurred within Tesseract itself, it would presumably have been reported and (hopefully) patched in the main Tesseract repo.

Note for users: the advice to not reuse the same workers in perpetuity on a server is good, even if the memory leak gets fixed. This is because Tesseract workers "learn" over time by default. While this learning generally improves results, it assumes that (1) previous results are generally correct and (2) the image being recognized closely resembles previous images. As a result, if the same worker is used with hundreds of different documents from different users, it is common for Tesseract to "learn" something incorrect or inapplicable, making results worse than if a fresh worker had been used.
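For users who want to follow the advice above, here is a minimal sketch of recycling a worker after a fixed number of jobs. `createRecyclingWorker`, `createWorkerFn`, and `maxJobs` are hypothetical names introduced for illustration and are not part of Tesseract.js. The sketch assumes jobs are submitted one at a time rather than concurrently.

```javascript
// Hypothetical helper (not part of Tesseract.js): wraps a worker factory and
// replaces the worker after `maxJobs` recognitions.
// `createWorkerFn` is any async function that returns an object exposing
// `recognize(image)` and `terminate()`.
// Assumption: callers await each `recognize` call before starting the next.
function createRecyclingWorker(createWorkerFn, maxJobs) {
  let worker = null;
  let jobsDone = 0;

  // Terminate the current worker (if any) and reset the job counter.
  async function retire() {
    if (worker !== null) {
      const old = worker;
      worker = null;
      jobsDone = 0;
      await old.terminate();
    }
  }

  async function recognize(image) {
    // Lazily create a worker on first use or after the previous one was retired.
    if (worker === null) {
      worker = await createWorkerFn();
    }
    try {
      return await worker.recognize(image);
    } finally {
      // Count the job even if it failed, then retire the worker at the limit.
      jobsDone += 1;
      if (jobsDone >= maxJobs) {
        await retire();
      }
    }
  }

  return { recognize, terminate: retire };
}
```

With Tesseract.js this might be called as `createRecyclingWorker(() => createWorker('eng'), 50)`, where the limit of 50 is an arbitrary placeholder rather than a recommended value.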

Balearica · 2024-12-21T07:24:14Z

I added a basic memory benchmark for Node.js that recognizes 30 images 10 times each and prints the memory statistics after each of the 10 sets. The results with the current version of Tesseract.js can be seen below.

Iteration	Time	Heap Used	Heap Total	RSS Non-Heap	RSS Total	External
1	9.70s	39 MB	73 MB	697 MB	770 MB	1 MB
2	9.76s	71 MB	107 MB	781 MB	888 MB	1 MB
3	10.33s	104 MB	142 MB	803 MB	945 MB	1 MB
4	9.79s	137 MB	174 MB	818 MB	992 MB	1 MB
5	9.59s	169 MB	208 MB	850 MB	1058 MB	1 MB
6	9.61s	202 MB	244 MB	894 MB	1138 MB	1 MB
7	9.98s	234 MB	277 MB	887 MB	1163 MB	1 MB
8	10.11s	267 MB	308 MB	908 MB	1216 MB	1 MB
9	9.65s	300 MB	342 MB	899 MB	1240 MB	1 MB
10	10.50s	332 MB	378 MB	924 MB	1303 MB	1 MB
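For readers who want to collect similar numbers, here is a rough sketch of how such a table could be produced with `process.memoryUsage()`. This is not the benchmark script referenced above. The function names are hypothetical, and `recognizeSet` stands in for whatever code recognizes one full set of images. The "RSS Non-Heap" column is assumed to be `rss - heapTotal`, which appears consistent with the rows above up to rounding (for example, 770 MB - 73 MB = 697 MB in the first row).

```javascript
// Convert a byte count to whole megabytes.
function toMB(bytes) {
  return Math.round(bytes / (1024 * 1024));
}

// Take one snapshot of process memory, in bytes.
// Assumption: "RSS Non-Heap" is resident set size minus total V8 heap.
function memorySnapshot() {
  const m = process.memoryUsage();
  return {
    heapUsed: m.heapUsed,
    heapTotal: m.heapTotal,
    rssNonHeap: m.rss - m.heapTotal,
    rssTotal: m.rss,
    external: m.external,
  };
}

// Format one tab-separated row in the same column order as the table.
function formatRow(iteration, seconds, snap) {
  return [
    iteration,
    `${seconds.toFixed(2)}s`,
    `${toMB(snap.heapUsed)} MB`,
    `${toMB(snap.heapTotal)} MB`,
    `${toMB(snap.rssNonHeap)} MB`,
    `${toMB(snap.rssTotal)} MB`,
    `${toMB(snap.external)} MB`,
  ].join('\t');
}

// `recognizeSet` is a hypothetical async callback that recognizes one set of
// images (for example, 30 images with a single worker).
async function runBenchmark(recognizeSet, iterations) {
  const rows = [];
  for (let i = 1; i <= iterations; i += 1) {
    const start = process.hrtime.bigint();
    await recognizeSet();
    const seconds = Number(process.hrtime.bigint() - start) / 1e9;
    // global.gc only exists when Node.js is started with --expose-gc.
    // Forcing a collection helps separate a real leak from lazy collection.
    if (typeof global.gc === 'function') global.gc();
    const row = formatRow(i, seconds, memorySnapshot());
    console.log(row);
    rows.push(row);
  }
  return rows;
}
```

Heap usage that keeps growing across iterations even after a forced collection suggests that objects are still reachable from JavaScript, rather than merely not yet collected.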

Balearica · 2024-12-24T06:15:18Z

Upon investigation, it appears that there are several causes behind the memory leak. Interesting through lines are (1) the memory issues appear to be worse on Node.js than in the browser, and (2) apparent memory leaks are often caused by garbage collection not working correctly in difficult cases, rather than there being a memory leak in a traditional sense (i.e. where memory in WebAssembly is allocated but never freed). As a result, I do not have a good conceptual explanation for why changing all of the factors below fixes the leak.

The proximate causes behind the increase in memory over time appear to be:

The part of the dump function that generates the blocks output using Tesseract iterators.

tesseract.js/src/worker-script/utils/dump.js

Lines 85 to 216 in 9827c2e

    
    if (output.blocks || output.layoutBlocks) {
      ri.Begin();
      do {
        if (ri.IsAtBeginningOf(RIL_BLOCK)) {
          const poly = ri.BlockPolygon();
          let polygon = null;
          // BlockPolygon() returns null when automatic page segmentation is off
          if (TessModule.getPointer(poly) > 0) {
            const n = poly.get_n();
            const px = poly.get_x();
            const py = poly.get_y();
            polygon = [];
            for (let i = 0; i < n; i += 1) {
              polygon.push([px.getValue(i), py.getValue(i)]);
            }
            /*
             * TODO: find out why _ptaDestroy doesn't work
             */
            // TessModule._ptaDestroy(TessModule.getPointer(poly));
          }
          block = {
            paragraphs: [],
            text: !options.skipRecognition ? ri.GetUTF8Text(RIL_BLOCK) : null,
            confidence: !options.skipRecognition ? ri.Confidence(RIL_BLOCK) : null,
            baseline: ri.getBaseline(RIL_BLOCK),
            bbox: ri.getBoundingBox(RIL_BLOCK),
            blocktype: enumToString(ri.BlockType(), 'PT'),
            polygon,
          };
          blocks.push(block);
        }
        if (ri.IsAtBeginningOf(RIL_PARA)) {
          para = {
            lines: [],
            text: !options.skipRecognition ? ri.GetUTF8Text(RIL_PARA) : null,
            confidence: !options.skipRecognition ? ri.Confidence(RIL_PARA) : null,
            baseline: ri.getBaseline(RIL_PARA),
            bbox: ri.getBoundingBox(RIL_PARA),
            is_ltr: !!ri.ParagraphIsLtr(),
          };
          block.paragraphs.push(para);
        }
        if (ri.IsAtBeginningOf(RIL_TEXTLINE)) {
          // getRowAttributes was added in a recent minor version of Tesseract.js-core,
          // so we need to check if it exists before calling it.
          // This can be removed in the next major version (v6).
          let rowAttributes;
          if (ri.getRowAttributes) {
            rowAttributes = ri.getRowAttributes();
            // Descenders is reported as a negative within Tesseract internally so we need to flip it.
            // The positive version is intuitive, and matches what is reported in the hOCR output.
            rowAttributes.descenders *= -1;
          }
          textline = {
            words: [],
            text: !options.skipRecognition ? ri.GetUTF8Text(RIL_TEXTLINE) : null,
            confidence: !options.skipRecognition ? ri.Confidence(RIL_TEXTLINE) : null,
            baseline: ri.getBaseline(RIL_TEXTLINE),
            rowAttributes,
            bbox: ri.getBoundingBox(RIL_TEXTLINE),
          };
          para.lines.push(textline);
        }
        if (ri.IsAtBeginningOf(RIL_WORD)) {
          const fontInfo = ri.getWordFontAttributes();
          const wordDir = ri.WordDirection();
          word = {
            symbols: [],
            choices: [],
            text: !options.skipRecognition ? ri.GetUTF8Text(RIL_WORD) : null,
            confidence: !options.skipRecognition ? ri.Confidence(RIL_WORD) : null,
            baseline: ri.getBaseline(RIL_WORD),
            bbox: ri.getBoundingBox(RIL_WORD),
            is_numeric: !!ri.WordIsNumeric(),
            in_dictionary: !!ri.WordIsFromDictionary(),
            direction: enumToString(wordDir, 'DIR'),
            language: ri.WordRecognitionLanguage(),
            is_bold: fontInfo.is_bold,
            is_italic: fontInfo.is_italic,
            is_underlined: fontInfo.is_underlined,
            is_monospace: fontInfo.is_monospace,
            is_serif: fontInfo.is_serif,
            is_smallcaps: fontInfo.is_smallcaps,
            font_size: fontInfo.pointsize,
            font_id: fontInfo.font_id,
            font_name: fontInfo.font_name,
          };
          const wc = new TessModule.WordChoiceIterator(ri);
          do {
            word.choices.push({
              text: !options.skipRecognition ? wc.GetUTF8Text() : null,
              confidence: !options.skipRecognition ? wc.Confidence() : null,
            });
          } while (wc.Next());
          TessModule.destroy(wc);
          textline.words.push(word);
        }
        // let image = null;
        // var pix = ri.GetBinaryImage(TessModule.RIL_SYMBOL)
        // var image = pix2array(pix);
        // // for some reason it seems that things stop working if you destroy pics
        // TessModule._pixDestroy(TessModule.getPointer(pix));
        if (ri.IsAtBeginningOf(RIL_SYMBOL)) {
          symbol = {
            choices: [],
            image: null,
            text: !options.skipRecognition ? ri.GetUTF8Text(RIL_SYMBOL) : null,
            confidence: !options.skipRecognition ? ri.Confidence(RIL_SYMBOL) : null,
            baseline: ri.getBaseline(RIL_SYMBOL),
            bbox: ri.getBoundingBox(RIL_SYMBOL),
            is_superscript: !!ri.SymbolIsSuperscript(),
            is_subscript: !!ri.SymbolIsSubscript(),
            is_dropcap: !!ri.SymbolIsDropcap(),
          };
          word.symbols.push(symbol);
          const ci = new TessModule.ChoiceIterator(ri);
          do {
            symbol.choices.push({
              text: !options.skipRecognition ? ci.GetUTF8Text() : null,
              confidence: !options.skipRecognition ? ci.Confidence() : null,
            });
          } while (ci.Next());
          // TessModule.destroy(i);
        }
      } while (ri.Next(RIL_SYMBOL));
      TessModule.destroy(ri);
    }

Something about calling the Tesseract (WebAssembly) iterators from JavaScript seems to cause issues that do not impact other formats (which are constructed in Tesseract and sent to JavaScript using a single function).

Promises are not explicitly deleted when they are resolved.

tesseract.js/src/createWorker.js

Lines 228 to 252 in 9827c2e

    
    onMessage(worker, ({
      workerId, jobId, status, action, data,
    }) => {
      const promiseId = `${action}-${jobId}`;
      if (status === 'resolve') {
        log(`[${workerId}]: Complete ${jobId}`);
        let d = data;
        if (action === 'recognize') {
          d = circularize(data);
        } else if (action === 'getPDF') {
          d = Array.from({ ...data, length: Object.keys(data).length });
        }
        resolves[promiseId]({ jobId, data: d });
      } else if (status === 'reject') {
        rejects[promiseId](data);
        if (action === 'load') workerResReject(data);
        if (errorHandler) {
          errorHandler(data);
        } else {
          throw Error(data);
        }
      } else if (status === 'progress') {
        logger({ ...data, userJobId: jobId });
      }
    });

Because entries in resolves and rejects are never removed once a job settles, the garbage collector in Node.js appears to be unable to free the internal data they reference. Interestingly, this does not appear to impact the browser.
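To illustrate the idea of explicitly deleting handlers once a promise settles, here is a minimal standalone sketch. It is not the code from the actual fix. `createPromiseRegistry`, `register`, `settle`, and `pendingCount` are hypothetical names, and only the `resolves`/`rejects` lookups and the `'resolve'`/`'reject'` statuses mirror the snippet above.

```javascript
// Hypothetical stand-in for the worker's promise bookkeeping.
function createPromiseRegistry() {
  const resolves = {};
  const rejects = {};

  // Create a promise for a job and remember its handlers under `promiseId`.
  function register(promiseId) {
    return new Promise((resolve, reject) => {
      resolves[promiseId] = resolve;
      rejects[promiseId] = reject;
    });
  }

  // Settle the job, removing both handlers first so that nothing keeps
  // referencing the promise (or the data it resolved with) afterwards.
  function settle(promiseId, status, data) {
    const handler = status === 'resolve' ? resolves[promiseId] : rejects[promiseId];
    delete resolves[promiseId];
    delete rejects[promiseId];
    if (handler) handler(data);
  }

  // Number of jobs that have been registered but not yet settled.
  function pendingCount() {
    return Object.keys(resolves).length;
  }

  return { register, settle, pendingCount };
}
```

Both entries are deleted regardless of which handler fires, since a job that resolves would otherwise leave its unused reject handler behind, and vice versa.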

Additionally, memory issues are made worse by the fact that so many output formats are enabled by default. This significantly increases the amount of memory allocation/deallocation that occurs, and it means the issues with blocks apply to all users rather than only those who request that output. This would be resolved by #916.

Balearica · 2024-12-24T08:14:24Z

#977 made a significant dent, as seen below after re-running the benchmark.

Iteration	Time	Heap Used	Heap Total	RSS Non-Heap	RSS Total	External
1	10.35s	6 MB	39 MB	717 MB	755 MB	2 MB
2	9.70s	6 MB	39 MB	783 MB	822 MB	2 MB
3	9.40s	6 MB	40 MB	805 MB	846 MB	2 MB
4	9.67s	6 MB	40 MB	852 MB	892 MB	2 MB
5	9.58s	6 MB	40 MB	878 MB	918 MB	2 MB
6	9.73s	6 MB	40 MB	884 MB	924 MB	2 MB
7	9.38s	6 MB	39 MB	911 MB	951 MB	2 MB
8	9.48s	6 MB	39 MB	952 MB	992 MB	2 MB
9	9.66s	6 MB	40 MB	937 MB	977 MB	2 MB
10	9.66s	6 MB	40 MB	948 MB	988 MB	2 MB

Balearica · 2024-12-25T09:34:20Z

After all changes are implemented, this is the final result. The memory leak appears to be resolved.

Iteration	Time	Heap Used	Heap Total	RSS Non-Heap	RSS Total	External
1	10.21s	6 MB	39 MB	622 MB	661 MB	2 MB
2	9.44s	6 MB	39 MB	656 MB	695 MB	2 MB
3	9.47s	6 MB	39 MB	667 MB	706 MB	2 MB
4	9.49s	6 MB	40 MB	666 MB	706 MB	2 MB
5	9.55s	6 MB	40 MB	674 MB	714 MB	2 MB
6	9.66s	6 MB	40 MB	675 MB	715 MB	2 MB
7	9.79s	6 MB	41 MB	675 MB	716 MB	2 MB
8	9.45s	6 MB	39 MB	677 MB	716 MB	2 MB
9	9.63s	6 MB	40 MB	676 MB	715 MB	2 MB
10	9.46s	6 MB	41 MB	675 MB	715 MB	2 MB

Balearica added a commit that referenced this issue Dec 24, 2024

Updated internal storing of promises to fix memory leak per #977

483131b

Balearica mentioned this issue Dec 24, 2024

Updated internal storing of promises to fix memory leak per #977 #980

Merged

Balearica added a commit that referenced this issue Dec 24, 2024

Updated internal storing of promises to fix memory leak per #977 (#980)

2f2b5e3

Balearica added this to the v6.0 milestone Dec 24, 2024

Balearica mentioned this issue Dec 25, 2024

Moved JSON export code from JavaScript to C++ per #977 #984

Merged

Balearica added a commit that referenced this issue Dec 25, 2024

Updated Tesseract.js-core to fix memory leak per #977

564ec5a

Balearica closed this as completed Dec 25, 2024

Balearica mentioned this issue Jan 7, 2025

Version 6 Changes #993

Open
