
Memory Leak when Converting Long PDFs to Markdown #205

Closed · cpa2001 opened this issue Jun 24, 2024 · 16 comments
@cpa2001 commented Jun 24, 2024

Description:

I’m encountering a significant memory leak when using Marker to convert long PDFs to Markdown. During the conversion process, the memory usage increases substantially, eventually consuming up to 256GB of RAM and 256GB of SWAP space. This issue occurs consistently with larger PDF files and does not resolve until the process is forcibly terminated.

Steps to Reproduce:

1. Use Marker to convert long PDF documents to Markdown:
   OCR_ALL_PAGES=True TORCH_DEVICE=cuda marker ./input/folder ./output/folder --workers 32 --min_length 10000
2. Monitor memory usage during the conversion process (see the monitoring sketch below).
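
A minimal monitoring sketch for step 2, assuming psutil is installed and using a placeholder PID for the running marker process:

import time
import psutil  # third-party: pip install psutil

MARKER_PID = 12345  # placeholder: PID of the running marker process

proc = psutil.Process(MARKER_PID)
while proc.is_running():
    rss_gb = proc.memory_info().rss / 1024 ** 3  # resident set size in GB
    print(f"marker RSS: {rss_gb:.1f} GB")
    time.sleep(5)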

Environment:

•	Marker Version: 0.2.14
•	Operating System: Ubuntu 20.04.6 LTS, CUDA 12.3
•	PDF Size: 75.1MB PDF
•	Command Used: OCR_ALL_PAGES=True TORCH_DEVICE=cuda marker ./input/folder ./output/folder --workers 32 --min_length 10000


@VikParuchuri (Owner)

Are you converting one PDF or multiple? From the worker count I'm guessing multiple. If multiple, how many files, and roughly how many pages each?

@Degfy commented Jun 30, 2024

I had the same problem. Marker used all my memory, so I lost SSH access to the server. 😭


@xbloom commented Jul 4, 2024

I encountered the same issue. Larger files, and even 7–8 MB files, can exhaust memory. If possible, could a configuration option be provided to control memory usage?

@Degfy commented Jul 12, 2024

I've built a Docker image and run the tool within a container with limited memory, which should work well.
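
For anyone who wants to reproduce that setup, a rough sketch of the run command (the image name marker-image is a placeholder; --memory and --memory-swap are standard docker run flags, and --gpus all requires the NVIDIA container toolkit):

docker run --rm --gpus all \
    --memory=32g --memory-swap=32g \
    -v "$PWD/input:/input" -v "$PWD/output:/output" \
    marker-image \
    marker /input /output --workers 4

Setting --memory-swap equal to --memory keeps the container from dipping into swap, so a runaway conversion gets OOM-killed inside the container instead of taking the host down.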

@JarvisUSTC commented Jul 17, 2024

I ran into a similar problem. With the number of workers set to 2 per GPU on an 8×A100 machine, the machine gets restarted by the cluster management system after a while.

@dldx commented Jul 22, 2024

I noticed the same thing with this 200-page PDF. Memory usage hit 75 GB within Google Colab despite VRAM usage staying low. I ran the most basic command:
marker_single AES_FY23_AR.pdf ./ --langs English


AES_FY23_AR.pdf

Marker did run until the end so I got some good output, but this leak is a bit of a bottleneck. Would be great to know why this is happening!

Thank you @VikParuchuri for these amazing libraries!

@zqqian commented Jul 24, 2024

I also encountered this problem, and my server froze as a result. I had to restart the server.

@VikParuchuri (Owner)

I'm planning to look into this soon - working on some improved models first, but this is high priority for me

@Marco-Almbauer

I am experiencing similar issues, so let me elaborate. I am running the code in Google Colab and experimenting with the workers option. With an A100 GPU I set more than 10 workers. This worked well yesterday, but today it sometimes overloaded the CPU, causing me to lose the connection and GPU access. After (luckily) reconnecting to an A100, I used the default setting (2 workers) without problems, but resources were not fully used. I increased the workers again and got a good processing rate of ~3 PDFs/min for around 10 minutes, but then the CPU started to overload again. I really do not know why; I suspect it is related to the specific files I am converting. I do not have any logs to show. I would be grateful if someone could share their optimal number of workers for the Google Colab GPU options and their experience. Right now I am running it with an L4 GPU.


This makes the code a bit unstable for me. I would prefer to use it on a computer cluster, but due to the instability, I do not dare to use up resources.

I hope this comment can help. I also want to thank you for this great package 💯

@JY9087 commented Aug 2, 2024

Found the problem. It's surya that causes the memory leak.

In the surya/recognition.py file, within the batch_recognition() function, there's a line:

processed_batches = processor(text=[""] * len(images), images=images, lang=languages)

This line processes all images at once, consuming too much memory.

To address this issue, I modified the code to process a smaller number of images (batch size) at a time, as follows:

# Process a fixed number of images per iteration instead of all pages at once.
for i in tqdm(range(0, len(images), batch_size), desc="Recognizing Text"):
    batch_images = images[i:i+batch_size]
    batch_langs = languages[i:i+batch_size]
    has_math = ["_math" in lang for lang in batch_langs]
    # Only this slice is preprocessed, so peak memory is bounded by batch_size.
    processed_batches = processor(text=[""] * len(batch_images), images=batch_images, lang=batch_langs)
    batch_pixel_values = processed_batches["pixel_values"][:batch_size]
    batch_langs = processed_batches["langs"][:batch_size]

By processing fewer images at a time, this approach reduces memory consumption.

@VikParuchuri (Owner)

Thanks for finding this! I was looking at this thread to ask if anyone has noticed these issues while running OCR, since that's what some of my testing showed, but you beat me to it :)

I'll work on a fix to release shortly

@FireMasterK

[Profiler screenshot]
I can confirm this was the issue!

@FireMasterK

It looks like we might have more memory leaks. I made the following changes to recognition.py, but the process still gets OOM-killed:

    # Initialize processed batches
    processed_batches = {
        "pixel_values": [],
        "langs": [],
    }

    # Preprocess images
    for i in tqdm(range(0, len(images), batch_size), desc="Preprocessing Images"):
        batch_images = images[i:i+batch_size]
        batch_langs = languages[i:i+batch_size]

        processed_batch = processor(text=[""] * len(batch_images), images=batch_images, lang=batch_langs)

        processed_batches["pixel_values"].extend(processed_batch["pixel_values"])
        processed_batches["langs"].extend(processed_batch["langs"])

[Profiler screenshots]
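
One possible reason the change above still hits OOM is that processed_batches accumulates every preprocessed tensor before recognition starts, so peak memory still grows with the total page count. A rough sketch of a lazier variant, assuming the downstream recognition loop can consume one batch at a time (the helper name iter_processed_batches is hypothetical, not part of surya):

def iter_processed_batches(processor, images, languages, batch_size):
    # Yield one preprocessed batch at a time instead of building a
    # single dict that holds the pixel tensors for every page.
    for i in range(0, len(images), batch_size):
        batch_images = images[i:i + batch_size]
        batch_langs = languages[i:i + batch_size]
        yield processor(text=[""] * len(batch_images),
                        images=batch_images,
                        lang=batch_langs)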

@VikParuchuri (Owner)

I have a fix that appears to work here: VikParuchuri/surya@04d8a32. Note that it is on a branch that I'm still working on, so I won't be merging for a few days.

@aprozo commented Aug 22, 2024

@VikParuchuri Hello, thanks a lot for your work. Is there any update on the merge?

@VikParuchuri (Owner)

This was merged a few days ago, and both marker and surya have been updated.
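
To pick up the fix, upgrading both packages should be enough (assuming they were installed from PyPI under the names marker-pdf and surya-ocr):

pip install -U marker-pdf surya-ocr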
