what is causing the bottleneck in large files? #418

Open
Ali-Razmjoo opened this issue Dec 6, 2024 · 2 comments


@Ali-Razmjoo

Hi guys,

Thanks for developing this awesome project. The Markdown conversion works great!

While working with this project, I noticed that some of my larger PDFs take a very long time to convert. I am not talking about 2x or 3x; it's at least 100x compared to the others. For example, a 300-page PDF can take 3-7 days, while a smaller file takes just 2-3 minutes.

Since GPU servers are brutally expensive, I used `pdftk test/test.pdf burst output test/test_page_%02d.pdf` to split my PDF into single-page files. Then I used `NUM_DEVICES=1 NUM_WORKERS=3 marker_chunk_convert test test_mds` to convert them all individually, and merged the results with a script.

1. Does converting page by page affect the quality in any way?

This PDF was around 300 pages long and took 7 days to process. Now, it only takes 3-4 minutes, and it's done.
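
The merge script itself is not shown in this thread. A minimal sketch of what such a step could look like, assuming the converter leaves one `.md` file per input page somewhere under the output folder and that each file name ends in its page number (both are assumptions about the output layout, and `page_number` and `merge_pages` are hypothetical helper names):

```python
import re
from pathlib import Path


def page_number(path: Path) -> int:
    """Return the last run of digits in the file name, e.g. test_page_012 -> 12."""
    digits = re.findall(r"\d+", path.stem)
    if not digits:
        raise ValueError(f"no page number in {path.name}")
    return int(digits[-1])


def merge_pages(output_dir: Path, merged_path: Path) -> int:
    """Concatenate per-page Markdown files in numeric page order.

    Assumption: every .md file under output_dir is one converted page.
    Returns the number of pages merged.
    """
    merged_path = merged_path.resolve()
    pages = sorted(
        # Skip the merged file itself in case it lives inside output_dir.
        (p for p in output_dir.rglob("*.md") if p.resolve() != merged_path),
        key=page_number,
    )
    text = "\n\n".join(p.read_text(encoding="utf-8").strip() for p in pages)
    merged_path.write_text(text + "\n", encoding="utf-8")
    return len(pages)
```

It could be called as `merge_pages(Path("test_mds"), Path("test.md"))`. Sorting by the parsed integer matters here: with `%02d`, page 100 gets a three-digit suffix, so a plain string sort would place `test_page_100` before `test_page_11`. Note that a plain concatenation like this cannot rejoin a paragraph or table that was split across a page boundary.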

2. Is there a way to optimize the core engine to do the same? Maybe this functionality already exists, and I missed it.

In addition, is it known what's taking so long in big files? I tried your paid API, and after a while it kept returning 500 errors for some files; others I couldn't submit at all because the PDFs were longer than 2500 pages. I didn't keep the request_id values to follow up on the API, and it seems there is no way to get them back.

3. Or is there?

4. Would you please share some insights?
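
One way to narrow down where the time goes is to time the conversion of each single-page file separately and look at the slowest ones. The sketch below is only a generic timing harness: `time_command_per_file` is a hypothetical helper, and the command passed to it (for example a single-file converter from this project) is an assumption that has to match the installed version's command line.

```python
import subprocess
import time
from pathlib import Path
from typing import Sequence


def time_command_per_file(
    files: Sequence[Path], command: Sequence[str]
) -> list[tuple[Path, float]]:
    """Run `command <file>` once per file and return (file, seconds), slowest first."""
    timings = []
    for f in files:
        start = time.perf_counter()
        # check=True raises CalledProcessError if the command fails on a file.
        subprocess.run([*command, str(f)], check=True)
        timings.append((f, time.perf_counter() - start))
    return sorted(timings, key=lambda item: item[1], reverse=True)
```

Each run includes the process start-up cost (such as loading models), so only the relative differences between pages are meaningful. Pages that stand out can then be shared as small reproducible examples.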

Bests, Ali.

Ali-Razmjoo changed the title from "what is causing the bottleneck?" to "what is causing the bottleneck in large files?" on Dec 6, 2024
@VikParuchuri
Owner

Hi Ali - this is not behavior we've seen - are you converting on CPU? Do you mind sharing an example file?

@Ali-Razmjoo
Author

Hi,

Here are some samples that take too long:

  1. https://www.nist.gov/publications/report-technical-investigation-station-nightclub-fire-appendices-nist-ncstar-2-volume-2
  2. https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nbsspecialpublication340.pdf
  3. https://www.etsi.org/deliver/etsi_ts/138100_138199/13810103/17.09.00_60/ts_13810103v170900p.pdf
  4. https://www.etsi.org/deliver/etsi_TS/136500_136599/13652303/17.05.00_60/ts_13652303v170500p.pdf

I tried on different machines:

  1. CPU: AMD, between 250 and 370 cores
  2. GPUs: 18x RTX 4090, 8x H200, or 8x H100
  3. RAM: 1 TB to 2 TB
  4. Storage: dedicated NVMe SSDs every time, sometimes in RAID 0, which is usually faster than a single drive.

Bests, Ali.
