Memory Leak when Converting Long PDFs to Markdown #205
Comments
Are you converting one PDF or multiple? From the worker count, etc., I'm guessing multiple. If multiple, how many files, and how many pages each?
I encountered the same issue. With larger files, or even just a 7 MB/8 MB file, memory gets exhausted. If possible, could a configuration option be provided to control memory usage?
I've built a Docker image and run the tool within a container with limited memory, which should work well.
I ran into a similar problem. When I set the number of workers to 2 per GPU on an 8x A100 machine, the machine gets restarted by the cluster management system after a while.
I noticed the same thing with a 200-page PDF. Memory usage hit 75 GB in Google Colab even though VRAM usage stayed low, and I ran only the most basic command. Marker did run to the end, so I got some good output, but this leak is a bit of a bottleneck. It would be great to know why this is happening! Thank you @VikParuchuri for these amazing libraries!
I also encountered this problem, and my server froze as a result. I had to restart the server.
I'm planning to look into this soon - working on some improved models first, but this is high priority for me.
I am experiencing similar issues; let me elaborate. I am running the code in Google Colab and experimenting with the workers option. With an A100 GPU, I set more than 10 workers. This worked well yesterday, but today it sometimes overloaded the CPU, causing me to lose the connection and GPU access. After (luckily) reconnecting to an A100, I used the default setting (2 workers) without problems, but resources were not fully used. I increased the workers again and achieved a good processing rate of ~3 PDFs/min for around 10 minutes, but then the CPU started to overload again. I really do not know why, but I guess it is connected to the files I am transferring (?). I do not have any logs to show. I would be thankful if someone could share their optimal number of workers for the Google Colab GPU options and their experience. Right now I run it with an L4 GPU. This makes the tool a bit unstable for me: I would prefer to use it on a compute cluster, but due to the instability, I do not dare to use up resources there. I hope this comment can help. I also want to thank you for this great package 💯
Found the problem: it's surya that causes the memory leak. In surya/recognition.py, within the batch_recognition() function, there's this line:

`processed_batches = processor(text=[""] * len(images), images=images, lang=languages)`

This line processes all of the images at once, consuming too much memory. To address the issue, I modified the code to process a smaller number of images (one batch) at a time, along the lines of the sketch below.
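A minimal, hypothetical reconstruction of that chunking change (the helper name `process_in_chunks`, the argument layout, and the default `chunk_size` are illustrative; the exact diff merged upstream may differ):

```python
# Hypothetical sketch: hand the processor small chunks of images rather than the
# whole list, so only `chunk_size` images are preprocessed in memory at a time.
def process_in_chunks(processor, images, languages, chunk_size=32):
    processed_chunks = []
    for i in range(0, len(images), chunk_size):
        image_chunk = images[i:i + chunk_size]
        lang_chunk = languages[i:i + chunk_size]
        processed_chunks.append(
            processor(
                text=[""] * len(image_chunk),
                images=image_chunk,
                lang=lang_chunk,
            )
        )
    # Downstream code that consumed the single `processed_batches` object would
    # then iterate over (or merge) these per-chunk results.
    return processed_chunks
```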
By processing fewer images at a time, this approach reduces peak memory consumption.
Thanks for finding this! I was looking at this thread to ask whether anyone had noticed these issues while running OCR, since that's what some of my testing showed, but you beat me to it :) I'll work on a fix and release it shortly.
It looks like we might have more memory leaks. I made changes to recognition.py along those lines, but the process still gets OOM-killed.
I have a fix that appears to work here: VikParuchuri/surya@04d8a32. Note that it is on a branch that I'm still working on, so I won't be merging for a few days.
@VikParuchuri Hello, thanks a lot for your work. Is there any update on the merge?
This was merged a few days ago, and marker + surya have been updated.
Description:
I’m encountering a significant memory leak when using Marker to convert long PDFs to Markdown. During conversion, memory usage increases substantially, eventually consuming up to 256 GB of RAM and 256 GB of swap. This issue occurs consistently with larger PDF files and does not resolve until the process is forcibly terminated.
Steps to Reproduce:
Environment: