
Memory Leak when Converting Long PDFs to Markdown #205

Closed · cpa2001 opened this issue Jun 24, 2024 · 16 comments
@cpa2001 commented Jun 24, 2024

Description:

I’m encountering a significant memory leak when using Marker to convert long PDFs to Markdown. During the conversion process, the memory usage increases substantially, eventually consuming up to 256GB of RAM and 256GB of SWAP space. This issue occurs consistently with larger PDF files and does not resolve until the process is forcibly terminated.

Steps to Reproduce:

1. Use Marker to convert long PDF documents to Markdown:
   OCR_ALL_PAGES=True TORCH_DEVICE=cuda marker ./input/folder ./output/folder --workers 32 --min_length 10000
2. Monitor memory usage during the conversion process (see the monitoring sketch below).
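
A minimal monitoring sketch for step 2, assuming psutil is installed and using a placeholder PID for the running marker process:

import time
import psutil  # third-party: pip install psutil

MARKER_PID = 12345  # placeholder: PID of the running marker process

proc = psutil.Process(MARKER_PID)
while proc.is_running():
    rss_gb = proc.memory_info().rss / 1024 ** 3  # resident set size in GB
    print(f"marker RSS: {rss_gb:.1f} GB")
    time.sleep(5)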

Environment:

•	Marker Version: 0.2.14
•	Operating System: Ubuntu 20.04.6 LTS, CUDA 12.3
•	PDF Size: 75.1MB PDF
•	Command Used: OCR_ALL_PAGES=True TORCH_DEVICE=cuda marker ./input/folder ./output/folder --workers 32 --min_length 10000


@VikParuchuri (Owner)

Are you converting one PDF or multiple? From the worker count I'm guessing multiple. If multiple, how many files, and roughly how many pages each?

@Degfy commented Jun 30, 2024

I had the same problem. Marker used all my memory, so I lost SSH access to the server. 😭


@xbloom commented Jul 4, 2024

I encountered the same issue. Larger files, and even 7–8 MB files, can exhaust memory. If possible, could a configuration option be provided to control memory usage?

@Degfy commented Jul 12, 2024

I've built a Docker image and run the tool within a container with limited memory, which should work well.
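
For anyone who wants to reproduce that setup, a rough sketch of the run command (the image name marker-image is a placeholder; --memory and --memory-swap are standard docker run flags, and --gpus all requires the NVIDIA container toolkit):

docker run --rm --gpus all \
    --memory=32g --memory-swap=32g \
    -v "$PWD/input:/input" -v "$PWD/output:/output" \
    marker-image \
    marker /input /output --workers 4

Setting --memory-swap equal to --memory keeps the container from dipping into swap, so a runaway conversion gets OOM-killed inside the container instead of taking the host down.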

@JarvisUSTC commented Jul 17, 2024

I ran into a similar problem. With the number of workers set to 2 per GPU on an 8×A100 machine, the machine gets restarted by the cluster management system after a while.

@dldx commented Jul 22, 2024

I noticed the same thing with this 200-page PDF. Memory usage hit 75 GB within Google Colab despite VRAM usage staying low. I ran the most basic command:
marker_single AES_FY23_AR.pdf ./ --langs English


AES_FY23_AR.pdf

Marker did run until the end so I got some good output, but this leak is a bit of a bottleneck. Would be great to know why this is happening!

Thank you @VikParuchuri for these amazing libraries!

@zqqian commented Jul 24, 2024

I also encountered this problem, and my server froze as a result. I had to restart the server.

@VikParuchuri (Owner)

I'm planning to look into this soon - working on some improved models first, but this is high priority for me

@Marco-Almbauer

I am experiencing similar issues, so let me elaborate. I am running the code in Google Colab and experimenting with the workers option. With an A100 GPU I set more than 10 workers. This worked well yesterday, but today it sometimes overloaded the CPU, causing me to lose the connection and GPU access. After (luckily) reconnecting to an A100, I used the default setting (2 workers) without problems, but resources were not fully used. I increased the workers again and got a good processing rate of ~3 PDFs/min for around 10 minutes, but then the CPU started to overload again. I really do not know why; I suspect it is related to the specific files I am converting. I do not have any logs to show. I would be grateful if someone could share their optimal number of workers for the Google Colab GPU options and their experience. Right now I am running it with an L4 GPU.


This makes the code a bit unstable for me. I would prefer to use it on a computer cluster, but due to the instability, I do not dare to use up resources.

I hope this comment can help. I also want to thank you for this great package 💯

@JY9087 commented Aug 2, 2024

Found the problem. It's surya that causes the memory leak.

In the surya/recognition.py file, within the batch_recognition() function, there's a line:

processed_batches = processor(text=[""] * len(images), images=images, lang=languages)

This line processes all images at once, consuming too much memory.

To address this issue, I modified the code to process a smaller number of images (batch size) at a time, as follows:

# Process a fixed number of images per iteration instead of all pages at once.
for i in tqdm(range(0, len(images), batch_size), desc="Recognizing Text"):
    batch_images = images[i:i+batch_size]
    batch_langs = languages[i:i+batch_size]
    has_math = ["_math" in lang for lang in batch_langs]
    # Only this slice is preprocessed, so peak memory is bounded by batch_size.
    processed_batches = processor(text=[""] * len(batch_images), images=batch_images, lang=batch_langs)
    batch_pixel_values = processed_batches["pixel_values"][:batch_size]
    batch_langs = processed_batches["langs"][:batch_size]

By processing fewer images at a time, this approach reduces memory consumption.

@VikParuchuri (Owner)

Thanks for finding this! I was looking at this thread to ask if anyone has noticed these issues while running OCR, since that's what some of my testing showed, but you beat me to it :)

I'll work on a fix to release shortly

@FireMasterK

[Profiler screenshot]
I can confirm this was the issue!

@FireMasterK

It looks like we might have more memory leaks. I made the following changes to recognition.py, but the process still gets OOM-killed:

    # Initialize processed batches
    processed_batches = {
        "pixel_values": [],
        "langs": [],
    }

    # Preprocess images
    for i in tqdm(range(0, len(images), batch_size), desc="Preprocessing Images"):
        batch_images = images[i:i+batch_size]
        batch_langs = languages[i:i+batch_size]

        processed_batch = processor(text=[""] * len(batch_images), images=batch_images, lang=batch_langs)

        processed_batches["pixel_values"].extend(processed_batch["pixel_values"])
        processed_batches["langs"].extend(processed_batch["langs"])

[Profiler screenshots]
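
One possible reason the change above still hits OOM is that processed_batches accumulates every preprocessed tensor before recognition starts, so peak memory still grows with the total page count. A rough sketch of a lazier variant, assuming the downstream recognition loop can consume one batch at a time (the helper name iter_processed_batches is hypothetical, not part of surya):

def iter_processed_batches(processor, images, languages, batch_size):
    # Yield one preprocessed batch at a time instead of building a
    # single dict that holds the pixel tensors for every page.
    for i in range(0, len(images), batch_size):
        batch_images = images[i:i + batch_size]
        batch_langs = languages[i:i + batch_size]
        yield processor(text=[""] * len(batch_images),
                        images=batch_images,
                        lang=batch_langs)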

@VikParuchuri (Owner)

I have a fix that appears to work here: VikParuchuri/surya@04d8a32. Note that it is on a branch that I'm still working on, so I won't be merging for a few days.

@aprozo commented Aug 22, 2024

@VikParuchuri Hello, thanks a lot for your work. Is there any update on the merge?

@VikParuchuri (Owner)

This was merged a few days ago, and both marker and surya have been updated.
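
To pick up the fix, upgrading both packages should be enough (assuming they were installed from PyPI under the names marker-pdf and surya-ocr):

pip install -U marker-pdf surya-ocr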
