what is causing the bottleneck in large files? #418

Ali-Razmjoo · 2024-12-06T19:40:11Z

Hi guys,

Thanks for developing this awesome project. The Markdown conversion works great!

While working with it, I noticed that some of the large files among my PDFs take a very long time to convert. I am not talking about 2x or 3x; it's at least 100x longer than the others. For example, a 300-page PDF can take 3-7 days, while a smaller file takes just 2-3 minutes.

Since GPU servers are brutally expensive, I used `pdftk test/test.pdf burst output test/test_page_%02d.pdf` to split my PDF into single-page files. Then I used `NUM_DEVICES=1 NUM_WORKERS=3 marker_chunk_convert test test_mds` to convert the pages individually and merged the results with a script.

1. Does splitting the PDF this way affect the output quality in any way?

This PDF was around 300 pages long and took 7 days to process. Now, it only takes 3-4 minutes, and it's done.
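
For anyone trying the same workaround, the merge step could look roughly like the sketch below. This is not the script used above; it is a minimal illustration under a few assumptions: each page's Markdown ends up somewhere under the output folder in a `.md` file whose name still contains `_page_<number>`, and the helper names `page_number` and `merge_pages` are made up for this example. The one detail worth noting is the numeric sort: `%02d` pads to at least two digits, so with 300 pages a plain alphabetical sort would place `test_page_100` before `test_page_11`.

```python
import re
from pathlib import Path

# Assumption: per-page output files keep the "_page_<number>" part of the
# name produced by the pdftk burst pattern (e.g. test_page_07.md).
PAGE_RE = re.compile(r"_page_(\d+)")


def page_number(path: Path) -> int:
    """Extract the page number from a file name such as test_page_07.md."""
    match = PAGE_RE.search(path.stem)
    if match is None:
        raise ValueError(f"no page number in {path.name}")
    return int(match.group(1))


def merge_pages(md_root: Path, out_file: Path) -> int:
    """Concatenate per-page Markdown files in numeric page order.

    Returns the number of pages that were merged.
    """
    # Search recursively, since the converter may place each page's output
    # in its own subfolder (an assumption about the layout).
    pages = sorted(
        (p for p in md_root.rglob("*.md") if PAGE_RE.search(p.stem)),
        key=page_number,  # numeric, so page 11 comes before page 100
    )
    parts = [p.read_text(encoding="utf-8").strip() for p in pages]
    out_file.write_text("\n\n".join(parts) + "\n", encoding="utf-8")
    return len(pages)
```

With the folder names from the commands above, it would be called as `merge_pages(Path("test_mds"), Path("test.md"))`.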

2. Is there a way to optimize the core engine to do the same? Maybe this functionality already exists, and I missed it.

In addition, is it known what takes so long with big files? I tried your paid API, and after a while it kept returning 500 errors for some files; others I couldn't submit at all because the PDFs were bigger than 2500 pages. I didn't keep the request_id values to follow up on those API requests, and it seems there is no way to get them back.

3. Or is there?

4. Would you please share some insights?

Bests, Ali.


VikParuchuri · 2024-12-08T19:35:26Z

Hi Ali - this is not behavior we've seen - are you converting on CPU? Do you mind sharing an example file?

Ali-Razmjoo · 2024-12-13T14:56:20Z

Hi,

Here are some samples that take too long:

I tried on different machines:

CPU: AMD, between 250 and 370 cores
GPUs: 18x RTX 4090, or 8x H200, or 8x H100
RAM: 1 TB to 2 TB
Storage: a dedicated NVMe SSD every time, sometimes in RAID 0, which is usually faster than a single drive.

Bests, Ali.

Ali-Razmjoo changed the title ~~what is causing the bottleneck?~~ what is causing the bottleneck in large files? Dec 6, 2024
