Hi guys,

Thanks for developing this awesome project. The Markdown conversion works great!
While working with it, I noticed that some of my larger PDFs take an enormous amount of time to convert. I am not talking about 2x or 3x slower; it's at least 100x compared to the others. For example, a 300-page PDF can take 3-7 days, while a smaller file takes just 2-3 minutes.
Since GPU servers are brutally expensive, I used `pdftk test/test.pdf burst output test/test_page_%02d.pdf` to split my PDFs into single pages. Then I ran `NUM_DEVICES=1 NUM_WORKERS=3 marker_chunk_convert test test_mds` to convert the pages individually and merged the results with a script (see the sketch below).

1. I was wondering: does splitting the PDF like this affect the output quality in any way?
The PDF in question was around 300 pages and originally took 7 days to process; with this approach it finishes in 3-4 minutes.
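For reference, here is a minimal sketch of that split-convert-merge flow. It assumes pdftk and marker are installed and on the PATH, and that marker_chunk_convert writes one markdown file per page somewhere under the output directory; the exact output layout and the file names are assumptions, and the merge step simply concatenates the pages in order:

```python
# Sketch of the split / convert / merge workflow described above.
# Paths, file names, and the output layout of marker_chunk_convert are assumptions.
import glob
import os
import re
import subprocess

# 1. Split the source PDF into single-page PDFs (same pdftk call as above).
subprocess.run(
    ["pdftk", "test/test.pdf", "burst", "output", "test/test_page_%02d.pdf"],
    check=True,
)

# 2. Convert all pages with marker's chunk converter.
subprocess.run(
    ["marker_chunk_convert", "test", "test_mds"],
    env={**os.environ, "NUM_DEVICES": "1", "NUM_WORKERS": "3"},
    check=True,
)

# 3. Merge the per-page markdown files back into one document, in page order.
def page_number(path: str) -> int:
    match = re.search(r"test_page_(\d+)", path)
    return int(match.group(1)) if match else 0

pages = sorted(glob.glob("test_mds/**/*.md", recursive=True), key=page_number)
with open("test_merged.md", "w", encoding="utf-8") as merged:
    for page in pages:
        with open(page, encoding="utf-8") as f:
            merged.write(f.read().rstrip() + "\n\n")
```

The numeric sort on the page index keeps the merged output in the right order even once pdftk's page numbers grow past two digits.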
2. Is there a way to get the same speedup from the core engine itself? Maybe this functionality already exists and I missed it.
In addition, is it known what takes so long on big files? I also tried your paid API: after a while it kept returning 500 errors for some files, and others I couldn't submit at all because the PDFs were larger than 2500 pages. I didn't keep the `request_id` values to follow up with the API, and it seems there is no way to get them back.

3. Or is there?
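For next time, one way to avoid losing them would be to log each `request_id` locally at submission time. This is only a sketch; the endpoint URL, header, and response field names below are placeholders for illustration, not the documented API:

```python
# Hypothetical sketch: persist request IDs locally when submitting to the hosted API.
# The endpoint, header, and response fields are assumptions, not the real API shape.
import json
import requests

API_URL = "https://example.com/api/v1/convert"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"

def submit(pdf_path: str, log_path: str = "requests.jsonl") -> str:
    with open(pdf_path, "rb") as f:
        response = requests.post(
            API_URL,
            headers={"X-Api-Key": API_KEY},
            files={"file": f},
        )
    response.raise_for_status()
    request_id = response.json()["request_id"]  # assumed response field
    # Append the id to a local log so it can be looked up later,
    # even if the terminal session is lost.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps({"file": pdf_path, "request_id": request_id}) + "\n")
    return request_id
```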
4. Would you please share some insights?
Bests, Ali.
Ali-Razmjoo changed the title from "what is causing the bottleneck?" to "what is causing the bottleneck in large files?" on Dec 6, 2024.