Fix memory leaks #977

Closed
Balearica opened this issue Dec 18, 2024 · 4 comments

@Balearica

Balearica commented Dec 18, 2024

Several Node.js users have reported that using a single worker with hundreds of images increases memory usage linearly over time, which indicates the presence of a memory leak. The recommended solution has been to periodically terminate workers and create new ones. While this is good advice for other reasons (see note below), we should still attempt to resolve the memory leak.

Based on user reports, the leak only affects Node.js users recognizing many images on a server, so it is likely small on a per-image basis. The most likely explanation is an issue with how we export results from Tesseract. This is based purely on process of elimination: if the issue were with the input (images), the leak would be much larger in magnitude, and if the leak occurred within Tesseract, it would presumably have been reported and (hopefully) patched within the main Tesseract repo.

Note for users: the advice to not reuse the same workers in perpetuity on a server is good, even if the memory leak gets fixed. This is because Tesseract workers "learn" over time by default. While this learning generally improves results, it assumes that (1) previous results are generally correct and (2) the image that is being recognized closely resembles previous images. As a result, if the same worker is used with hundreds of different documents from different users, it is common for Tesseract to "learn" something incorrect or inapplicable, making results worse than if a fresh worker had been used.
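
For readers who want to follow the advice above, here is a minimal sketch of a wrapper that replaces the underlying worker after a fixed number of jobs. The names `createRecyclingWorker`, `createWorkerFn`, and `maxJobs` are hypothetical and not part of Tesseract.js; the sketch only assumes the worker object exposes async `recognize` and `terminate` methods, and that jobs are submitted one at a time rather than concurrently.

```javascript
// Hypothetical helper: wraps a worker factory so that the underlying worker
// is terminated and recreated after `maxJobs` recognitions.
// Assumes sequential use (one recognize call in flight at a time).
function createRecyclingWorker(createWorkerFn, maxJobs = 50) {
  let worker = null;
  let jobsDone = 0;

  // Terminates the current worker (if any) and resets the job counter.
  async function terminate() {
    if (worker !== null) {
      const old = worker;
      worker = null;
      jobsDone = 0;
      await old.terminate();
    }
  }

  async function recognize(image) {
    if (worker === null) {
      // Lazily create a fresh worker on first use or after recycling.
      worker = await createWorkerFn();
    }
    try {
      return await worker.recognize(image);
    } finally {
      jobsDone += 1;
      if (jobsDone >= maxJobs) {
        // Discards both the accumulated memory and anything the worker "learned".
        await terminate();
      }
    }
  }

  return { recognize, terminate };
}
```

With Tesseract.js this would presumably be used as `createRecyclingWorker(() => createWorker('eng'), 50)`; the threshold of 50 is an arbitrary placeholder, not a recommendation from the maintainers.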

@Balearica

I added a basic memory benchmark for Node.js here which recognizes 30 images 10 times each, and prints the result after each of the 10 sets. The results with the current version of Tesseract.js can be seen below.

| Iteration | Time | Heap Used | Heap Total | RSS Non-Heap | RSS Total | External |
|---|---|---|---|---|---|---|
| 1 | 9.70s | 39 MB | 73 MB | 697 MB | 770 MB | 1 MB |
| 2 | 9.76s | 71 MB | 107 MB | 781 MB | 888 MB | 1 MB |
| 3 | 10.33s | 104 MB | 142 MB | 803 MB | 945 MB | 1 MB |
| 4 | 9.79s | 137 MB | 174 MB | 818 MB | 992 MB | 1 MB |
| 5 | 9.59s | 169 MB | 208 MB | 850 MB | 1058 MB | 1 MB |
| 6 | 9.61s | 202 MB | 244 MB | 894 MB | 1138 MB | 1 MB |
| 7 | 9.98s | 234 MB | 277 MB | 887 MB | 1163 MB | 1 MB |
| 8 | 10.11s | 267 MB | 308 MB | 908 MB | 1216 MB | 1 MB |
| 9 | 9.65s | 300 MB | 342 MB | 899 MB | 1240 MB | 1 MB |
| 10 | 10.50s | 332 MB | 378 MB | 924 MB | 1303 MB | 1 MB |
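
For anyone wanting to collect similar numbers in their own application, the following is a rough sketch of how such rows could be gathered with Node's built-in `process.memoryUsage()`. It is not the actual benchmark script. The helper names (`snapshotMemory`, `runBenchmark`) are hypothetical, and the definition of "RSS Non-Heap" as `rss - heapTotal` is an assumption inferred from the table, where Heap Total plus RSS Non-Heap equals RSS Total.

```javascript
// Converts bytes to whole megabytes (1 MB = 1024 * 1024 bytes, an assumption).
const toMB = (bytes) => Math.round(bytes / 1024 / 1024);

// Takes one snapshot of the current process memory, mirroring the table columns.
function snapshotMemory() {
  const { rss, heapTotal, heapUsed, external } = process.memoryUsage();
  return {
    heapUsedMB: toMB(heapUsed),
    heapTotalMB: toMB(heapTotal),
    // Assumed definition: resident memory not accounted for by the V8 heap.
    rssNonHeapMB: toMB(rss - heapTotal),
    rssTotalMB: toMB(rss),
    externalMB: toMB(external),
  };
}

// Runs `task` (an async function) `iterations` times and records
// the elapsed time and a memory snapshot after each run.
async function runBenchmark(task, iterations = 10) {
  const rows = [];
  for (let i = 1; i <= iterations; i += 1) {
    const start = process.hrtime.bigint();
    await task(i);
    const seconds = Number(process.hrtime.bigint() - start) / 1e9;
    rows.push({ iteration: i, timeS: seconds.toFixed(2), ...snapshotMemory() });
  }
  return rows;
}
```

Passing the resulting rows to `console.table` gives output comparable to the table above. A steadily growing Heap Used column across iterations is the signal to look for.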

@Balearica

Balearica commented Dec 24, 2024

Upon investigation, it appears that there are several causes behind the memory leak. Two recurring themes are that (1) the memory issues appear to be worse on Node.js than in the browser, and (2) apparent memory leaks are often caused by garbage collection not working correctly in difficult cases, rather than there being a memory leak in the traditional sense (i.e. where memory in WebAssembly is allocated but never freed). As a result, I do not have a good conceptual explanation for why changing all of the factors below fixes the leak.

The proximate causes behind the increase in memory over time appear to be:

  1. The part of the dump function that generates the blocks output using Tesseract iterators.
    • if (output.blocks || output.layoutBlocks) {
        ri.Begin();
        do {
          if (ri.IsAtBeginningOf(RIL_BLOCK)) {
            const poly = ri.BlockPolygon();
            let polygon = null;
            // BlockPolygon() returns null when automatic page segmentation is off
            if (TessModule.getPointer(poly) > 0) {
              const n = poly.get_n();
              const px = poly.get_x();
              const py = poly.get_y();
              polygon = [];
              for (let i = 0; i < n; i += 1) {
                polygon.push([px.getValue(i), py.getValue(i)]);
              }
              /*
               * TODO: find out why _ptaDestroy doesn't work
               */
              // TessModule._ptaDestroy(TessModule.getPointer(poly));
            }
            block = {
              paragraphs: [],
              text: !options.skipRecognition ? ri.GetUTF8Text(RIL_BLOCK) : null,
              confidence: !options.skipRecognition ? ri.Confidence(RIL_BLOCK) : null,
              baseline: ri.getBaseline(RIL_BLOCK),
              bbox: ri.getBoundingBox(RIL_BLOCK),
              blocktype: enumToString(ri.BlockType(), 'PT'),
              polygon,
            };
            blocks.push(block);
          }
          if (ri.IsAtBeginningOf(RIL_PARA)) {
            para = {
              lines: [],
              text: !options.skipRecognition ? ri.GetUTF8Text(RIL_PARA) : null,
              confidence: !options.skipRecognition ? ri.Confidence(RIL_PARA) : null,
              baseline: ri.getBaseline(RIL_PARA),
              bbox: ri.getBoundingBox(RIL_PARA),
              is_ltr: !!ri.ParagraphIsLtr(),
            };
            block.paragraphs.push(para);
          }
          if (ri.IsAtBeginningOf(RIL_TEXTLINE)) {
            // getRowAttributes was added in a recent minor version of Tesseract.js-core,
            // so we need to check if it exists before calling it.
            // This can be removed in the next major version (v6).
            let rowAttributes;
            if (ri.getRowAttributes) {
              rowAttributes = ri.getRowAttributes();
              // Descenders is reported as a negative within Tesseract internally so we need to flip it.
              // The positive version is intuitive, and matches what is reported in the hOCR output.
              rowAttributes.descenders *= -1;
            }
            textline = {
              words: [],
              text: !options.skipRecognition ? ri.GetUTF8Text(RIL_TEXTLINE) : null,
              confidence: !options.skipRecognition ? ri.Confidence(RIL_TEXTLINE) : null,
              baseline: ri.getBaseline(RIL_TEXTLINE),
              rowAttributes,
              bbox: ri.getBoundingBox(RIL_TEXTLINE),
            };
            para.lines.push(textline);
          }
          if (ri.IsAtBeginningOf(RIL_WORD)) {
            const fontInfo = ri.getWordFontAttributes();
            const wordDir = ri.WordDirection();
            word = {
              symbols: [],
              choices: [],
              text: !options.skipRecognition ? ri.GetUTF8Text(RIL_WORD) : null,
              confidence: !options.skipRecognition ? ri.Confidence(RIL_WORD) : null,
              baseline: ri.getBaseline(RIL_WORD),
              bbox: ri.getBoundingBox(RIL_WORD),
              is_numeric: !!ri.WordIsNumeric(),
              in_dictionary: !!ri.WordIsFromDictionary(),
              direction: enumToString(wordDir, 'DIR'),
              language: ri.WordRecognitionLanguage(),
              is_bold: fontInfo.is_bold,
              is_italic: fontInfo.is_italic,
              is_underlined: fontInfo.is_underlined,
              is_monospace: fontInfo.is_monospace,
              is_serif: fontInfo.is_serif,
              is_smallcaps: fontInfo.is_smallcaps,
              font_size: fontInfo.pointsize,
              font_id: fontInfo.font_id,
              font_name: fontInfo.font_name,
            };
            const wc = new TessModule.WordChoiceIterator(ri);
            do {
              word.choices.push({
                text: !options.skipRecognition ? wc.GetUTF8Text() : null,
                confidence: !options.skipRecognition ? wc.Confidence() : null,
              });
            } while (wc.Next());
            TessModule.destroy(wc);
            textline.words.push(word);
          }
          // let image = null;
          // var pix = ri.GetBinaryImage(TessModule.RIL_SYMBOL)
          // var image = pix2array(pix);
          // // for some reason it seems that things stop working if you destroy pics
          // TessModule._pixDestroy(TessModule.getPointer(pix));
          if (ri.IsAtBeginningOf(RIL_SYMBOL)) {
            symbol = {
              choices: [],
              image: null,
              text: !options.skipRecognition ? ri.GetUTF8Text(RIL_SYMBOL) : null,
              confidence: !options.skipRecognition ? ri.Confidence(RIL_SYMBOL) : null,
              baseline: ri.getBaseline(RIL_SYMBOL),
              bbox: ri.getBoundingBox(RIL_SYMBOL),
              is_superscript: !!ri.SymbolIsSuperscript(),
              is_subscript: !!ri.SymbolIsSubscript(),
              is_dropcap: !!ri.SymbolIsDropcap(),
            };
            word.symbols.push(symbol);
            const ci = new TessModule.ChoiceIterator(ri);
            do {
              symbol.choices.push({
                text: !options.skipRecognition ? ci.GetUTF8Text() : null,
                confidence: !options.skipRecognition ? ci.Confidence() : null,
              });
            } while (ci.Next());
            // TessModule.destroy(i);
          }
        } while (ri.Next(RIL_SYMBOL));
        TessModule.destroy(ri);
      }
    • Something about calling the Tesseract (WebAssembly) iterators from JavaScript seems to cause issues that do not impact other formats (which are constructed in Tesseract and sent to JavaScript using a single function).
  2. Promises are not explicitly deleted when they are resolved.
    • onMessage(worker, ({
        workerId, jobId, status, action, data,
      }) => {
        const promiseId = `${action}-${jobId}`;
        if (status === 'resolve') {
          log(`[${workerId}]: Complete ${jobId}`);
          let d = data;
          if (action === 'recognize') {
            d = circularize(data);
          } else if (action === 'getPDF') {
            d = Array.from({ ...data, length: Object.keys(data).length });
          }
          resolves[promiseId]({ jobId, data: d });
        } else if (status === 'reject') {
          rejects[promiseId](data);
          if (action === 'load') workerResReject(data);
          if (errorHandler) {
            errorHandler(data);
          } else {
            throw Error(data);
          }
        } else if (status === 'progress') {
          logger({ ...data, userJobId: jobId });
        }
      });
    • The fact that the resolves and rejects arrays are never cleared of old elements appears to prevent the garbage collector from clearing internal data in Node.js. Interestingly, this does not appear to impact the browser.
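
To illustrate the second cause, here is a small self-contained sketch of the general pattern of removing resolve/reject callbacks once a job settles. It is not the actual patch; `createJobRegistry`, `register`, `settle`, and `pendingCount` are hypothetical names that stand in for the `resolves` / `rejects` bookkeeping shown in the handler above.

```javascript
// Hypothetical stand-in for the `resolves` / `rejects` bookkeeping.
function createJobRegistry() {
  const resolves = {};
  const rejects = {};

  // Creates a pending promise and stores its callbacks under `promiseId`.
  function register(promiseId) {
    return new Promise((resolve, reject) => {
      resolves[promiseId] = resolve;
      rejects[promiseId] = reject;
    });
  }

  // Settles the promise for `promiseId` and removes its callbacks so the
  // registry no longer holds references to them (or to anything they close over).
  function settle(promiseId, status, data) {
    // Other statuses (e.g. 'progress') do not finish the job, so keep the entry.
    if (status !== 'resolve' && status !== 'reject') return;
    const resolve = resolves[promiseId];
    const reject = rejects[promiseId];
    if (!resolve || !reject) return; // unknown or already-settled job
    delete resolves[promiseId];
    delete rejects[promiseId];
    if (status === 'resolve') resolve(data);
    else reject(data);
  }

  // Number of jobs that are still waiting to be settled.
  const pendingCount = () => Object.keys(resolves).length;

  return { register, settle, pendingCount };
}
```

After a long run of jobs, `pendingCount()` should return to zero. If it keeps growing, old callbacks are being retained in the same way described above.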

Additionally, memory issues are made worse by the fact that so many formats are enabled by default. This significantly increases the amount of memory allocation/deallocation that occurs and makes the issues with blocks apply to all users. This would be resolved by #916.

@Balearica

#977 made a significant dent, as seen below after re-running the benchmark.

| Iteration | Time | Heap Used | Heap Total | RSS Non-Heap | RSS Total | External |
|---|---|---|---|---|---|---|
| 1 | 10.35s | 6 MB | 39 MB | 717 MB | 755 MB | 2 MB |
| 2 | 9.70s | 6 MB | 39 MB | 783 MB | 822 MB | 2 MB |
| 3 | 9.40s | 6 MB | 40 MB | 805 MB | 846 MB | 2 MB |
| 4 | 9.67s | 6 MB | 40 MB | 852 MB | 892 MB | 2 MB |
| 5 | 9.58s | 6 MB | 40 MB | 878 MB | 918 MB | 2 MB |
| 6 | 9.73s | 6 MB | 40 MB | 884 MB | 924 MB | 2 MB |
| 7 | 9.38s | 6 MB | 39 MB | 911 MB | 951 MB | 2 MB |
| 8 | 9.48s | 6 MB | 39 MB | 952 MB | 992 MB | 2 MB |
| 9 | 9.66s | 6 MB | 40 MB | 937 MB | 977 MB | 2 MB |
| 10 | 9.66s | 6 MB | 40 MB | 948 MB | 988 MB | 2 MB |

@Balearica

After all changes are implemented, this is the final result. The memory leak appears to be resolved.

| Iteration | Time | Heap Used | Heap Total | RSS Non-Heap | RSS Total | External |
|---|---|---|---|---|---|---|
| 1 | 10.21s | 6 MB | 39 MB | 622 MB | 661 MB | 2 MB |
| 2 | 9.44s | 6 MB | 39 MB | 656 MB | 695 MB | 2 MB |
| 3 | 9.47s | 6 MB | 39 MB | 667 MB | 706 MB | 2 MB |
| 4 | 9.49s | 6 MB | 40 MB | 666 MB | 706 MB | 2 MB |
| 5 | 9.55s | 6 MB | 40 MB | 674 MB | 714 MB | 2 MB |
| 6 | 9.66s | 6 MB | 40 MB | 675 MB | 715 MB | 2 MB |
| 7 | 9.79s | 6 MB | 41 MB | 675 MB | 716 MB | 2 MB |
| 8 | 9.45s | 6 MB | 39 MB | 677 MB | 716 MB | 2 MB |
| 9 | 9.63s | 6 MB | 40 MB | 676 MB | 715 MB | 2 MB |
| 10 | 9.46s | 6 MB | 41 MB | 675 MB | 715 MB | 2 MB |
