
Memory Leak on inference #1418

Closed
TomekPro opened this issue Jan 3, 2024 · 12 comments
Labels
type: bug Something isn't working

Comments


TomekPro commented Jan 3, 2024

Bug description

Running doctr on multiple images in a loop causes a massive memory leak.
[mprof plot: memory usage growing steadily across the loop]

Code snippet to reproduce the bug

import os
import tqdm
from pathlib import Path
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True)

path = Path("/path/with/jpgs")
for file in tqdm.tqdm(os.listdir(path)[0:20]):
    file_path = path / file
    doc = DocumentFile.from_images(file_path)
    result = model(doc)

Run in the following way:
mprof run python test.py
mprof plot

The problem was still present when I modified the loop so that the model was also initialized inside it.

Diving into the code, it seems that the problem is caused by the actual PyTorch inference, for example here:
[screenshot of the PyTorch inference call in the doctr source]

Error traceback

As shown in the plot above.

Environment

Tested in a clean poetry environment with just 2 packages installed:
pip install "python-doctr[torch]"
pip install memory_profiler

python 3.8.10
python-doctr 0.7.0

Ubuntu 20.04
Running on cpu

Deep Learning backend

is_tf_available: False
is_torch_available: True

@TomekPro TomekPro added the type: bug Something isn't working label Jan 3, 2024
felixdittrich92 (Contributor) commented Jan 3, 2024

Hi @TomekPro 👋,
Thanks for reporting this 👍
It should already be fixed on the main branch (v0.8.0a) -> #1357


TomekPro commented Jan 3, 2024

Hi @felixdittrich92, unfortunately this problem still occurs. I reinstalled from the main branch:

pip uninstall python_doctr
git clone https://github.com/mindee/doctr.git
pip install -e doctr/.

Then pip list | grep doctr shows:
python-doctr 0.8.0a0

The plot still looks the same:
[mprof plot: same memory growth as before]


TomekPro commented Jan 3, 2024

Moving to torch 2.1 CPU-only (pip install torch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 --index-url https://download.pytorch.org/whl/cpu) helps slightly, but the leak is still clear:
[mprof plot: slower but still steady memory growth]

felixdittrich92 (Contributor) commented Jan 3, 2024

Mh yeah, I see. Does this leak only exist for the CRNN models? Could you also test it with vitstr_small / parseq or the master models?
This would be helpful to narrow down the bug.
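
For reference, swapping the recognition architecture in the profiling script would look roughly like this (a sketch only; the architecture names are the ones mentioned above, passed via the reco_arch argument of ocr_predictor):

from doctr.models import ocr_predictor

# Sketch: rebuild the predictor with a different recognition architecture
# and re-run the same profiling loop as in the original snippet.
for reco_arch in ("crnn_vgg16_bn", "parseq", "vitstr_small", "master"):
    model = ocr_predictor(reco_arch=reco_arch, pretrained=True)
    # ... feed the same images as in the original loop and record with mprof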


TomekPro commented Jan 3, 2024

I just tested it: the problem occurs for other recognition architectures as well, which makes me think it is something about how PyTorch is used in doctr. I'm looking for a solution as well.
Parseq - just slightly better than crnn:
[mprof plot]

vitstr_small - the same:
[mprof plot]

master - the same:
[mprof plot]

torch 2.1.1+cpu


felixdittrich92 commented Jan 3, 2024

@TomekPro Have you tried passing the paths as a list, doc = DocumentFile.from_images([os.path.join(root, file) for file in os.listdir(root)]),
and specifying the batch sizes depending on your hardware, for example
model = ocr_predictor(pretrained=True, det_bs=4, reco_bs=512)?

After seeing your plots I agree that this is still a bug (maybe in PyTorch), so this is only an idea you could try in the meantime.
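
Put together, that suggestion would read roughly as follows (a sketch only; the batch sizes are the example values from above and should be tuned to the available hardware):

import os

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

root = "/path/with/jpgs"

# Build the predictor once, with explicit detection/recognition batch sizes.
model = ocr_predictor(pretrained=True, det_bs=4, reco_bs=512)

# Hand all images to the predictor in one call and let it batch them internally,
# instead of invoking the model once per image.
doc = DocumentFile.from_images([os.path.join(root, file) for file in os.listdir(root)])
result = model(doc)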


felixdittrich92 commented Jan 3, 2024

And another thing you can try: #1356 (comment)


TomekPro commented Jan 3, 2024

Yes, I tried env variables like ONEDNN_PRIMITIVE_CACHE_CAPACITY and the like, without success :/
Regarding the approach you suggested:

# imports and path as in the original snippet
model = ocr_predictor(pretrained=True)
doc = DocumentFile.from_images([os.path.join(path, file) for file in os.listdir(path)[0:20]])
result = model(doc)

It produces a really strange result: overall memory consumption is the same, it just hits the maximum very quickly and then stays stable. Maybe this could be a clue? When increasing the batch size it gets even higher. Overall, I agree that this is probably a PyTorch thing, but it still makes doctr really hard to use in real life, as the application would need to be restarted very often to avoid crashing from exceeding memory.
[mprof plot: memory jumps to its maximum early and then stays flat]

felixdittrich92 (Contributor) commented Jan 3, 2024

You can also disable multiprocessing, which should lower the RAM usage a bit.
See point 1: https://mindee.github.io/doctr/using_doctr/running_on_aws.html

felixdittrich92 (Contributor) commented Jan 3, 2024

But yeah, I think we need to profile it in more detail again to find the real bottleneck.


TomekPro commented Jan 3, 2024

@felixdittrich92 finally, three things are needed to fix this memory leak (a combined sketch follows below):

  1. export DOCTR_MULTIPROCESSING_DISABLE=TRUE
  2. export ONEDNN_PRIMITIVE_CACHE_CAPACITY=1
  3. Upgrade torch to 2.1 (in my case the cpu-only version): pip install torch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 --index-url https://download.pytorch.org/whl/cpu

[mprof plot: memory usage stays stable after applying the three steps]
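
For completeness, a minimal sketch of how steps 1 and 2 can be set from inside the profiling script instead of the shell (assumption: the variables must be set before torch and doctr are imported so that oneDNN and doctr pick them up at initialization; exporting them in the shell as listed above remains the safer option):

import os

# Workaround env vars from steps 1 and 2 above; set them before importing
# torch/doctr so they are read at library initialization time (assumption).
os.environ["DOCTR_MULTIPROCESSING_DISABLE"] = "TRUE"
os.environ["ONEDNN_PRIMITIVE_CACHE_CAPACITY"] = "1"

from pathlib import Path

from doctr.io import DocumentFile
from doctr.models import ocr_predictor  # torch 2.1.1+cpu installed as in step 3

model = ocr_predictor(pretrained=True)
path = Path("/path/with/jpgs")
for file_path in sorted(path.glob("*.jpg")):
    doc = DocumentFile.from_images(file_path)
    result = model(doc)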

Thanks for your help :)


felixdittrich92 commented Jan 3, 2024

Nice 👍 Btw. using a smaller detection model, for example db_mobilenet_v3_large, will reduce the memory usage further if it still works well enough for your use case.
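
For reference, that swap is a one-line change (a sketch; db_mobilenet_v3_large is the detection architecture named above, the recognition model stays at its default):

from doctr.models import ocr_predictor

# Lighter MobileNetV3-based detection backbone to reduce memory usage;
# the accuracy trade-off depends on the documents being processed.
model = ocr_predictor(det_arch="db_mobilenet_v3_large", pretrained=True)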

@mindee mindee locked and limited conversation to collaborators Jan 4, 2024
@felixdittrich92 felixdittrich92 converted this issue into discussion #1422 Jan 4, 2024
