
Abrupt Termination (Without any error) on Google Colab, AWS EC2 #270

Closed
G999n opened this issue Aug 25, 2024 · 4 comments
G999n commented Aug 25, 2024

The conversion process abruptly terminates, without any error message, at random points during the Detecting Boxes stage on both Google Colab and an AWS EC2 instance (Windows). The percentage at which it stops varies from run to run.

AWS EC2: (screenshot of the run terminating mid-progress)

Google Colab: (screenshot of the run terminating mid-progress)

Document size is 198 pages with a mixture of selectable text, scanned text, screenshots of certificates, tables, scanned images of printed tables, etc.

@frankbaele

What was the file size? How many pages? It could be that the instance runs out of memory.


G999n commented Aug 26, 2024

> What was the file size? How many pages? It could be that the instance runs out of memory.

Document size is 198 pages with a mixture of selectable text, scanned text, screenshots of certificates, tables, scanned images of printed tables, etc.

The file size is 15.6 MB.
As per the instructions, I was using freeRAM // 3 as the batch_multiplier:
--batch_multiplier 3 on Colab (which had 11 GB of free RAM)
--batch_multiplier 2 (and then 1) on AWS EC2 (which had 8 GB of RAM)
However, both of the above were CPU instances; I wasn't using a GPU on either Colab or EC2.
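The freeRAM // 3 rule of thumb above can be written as a tiny helper (`pick_batch_multiplier` is a hypothetical name; the roughly-3-GB-per-batch figure is the one quoted later in this thread):

```python
def pick_batch_multiplier(free_ram_gb: int) -> int:
    """Rule of thumb from this thread: each batch needs roughly 3 GB
    of RAM, so use free RAM (in GB) // 3, floored at 1."""
    return max(1, free_ram_gb // 3)

print(pick_batch_multiplier(11))  # Colab with 11 GB free -> 3
print(pick_batch_multiplier(8))   # EC2 with 8 GB of RAM  -> 2
```

These match the values used on Colab and EC2 above.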

The conversion worked fine on a vast.ai JupyterLab instance with an RTX 4090 (24 GB VRAM) and 32 GB RAM. I used --batch_multiplier 7 there.

Apart from the memory required for the batches (roughly 3 GB per batch), I had assumed the program would need only a small, constant amount of memory regardless of the PDF's size. Is that not the case?


frankbaele commented Aug 27, 2024

VRAM usage is bounded and will not grow with page count, but RAM usage will.

A workaround would be to slice your PDF into smaller chunks with PyMuPDF and merge the results.
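A minimal sketch of that workaround, assuming PyMuPDF (`pip install pymupdf`) is available; `split_pdf`, `page_ranges`, and the chunk size are illustrative names, not part of the converter or PyMuPDF:

```python
def page_ranges(total_pages, chunk_size):
    """Yield inclusive 0-based (start, end) page ranges of up to chunk_size pages."""
    for start in range(0, total_pages, chunk_size):
        yield start, min(start + chunk_size, total_pages) - 1

def split_pdf(path, chunk_size=50, out_prefix="chunk"):
    """Split a PDF into smaller PDFs; convert each separately,
    then concatenate the resulting outputs in order."""
    import fitz  # PyMuPDF
    src = fitz.open(path)
    out_paths = []
    for i, (first, last) in enumerate(page_ranges(src.page_count, chunk_size)):
        part = fitz.open()  # new empty PDF
        part.insert_pdf(src, from_page=first, to_page=last)
        out = f"{out_prefix}_{i:03d}.pdf"
        part.save(out)
        part.close()
        out_paths.append(out)
    src.close()
    return out_paths
```

For the 198-page document above, a chunk size of 50 would yield four files; run the converter on each and join the outputs in page order.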


G999n commented Aug 27, 2024

All right, thanks a lot.

@G999n G999n closed this as completed Aug 27, 2024