Inference stuck in ...d2.evaluation.evaluator]: Start inference on X batches #90

Open
marcelogdeandrade opened this issue Oct 2, 2024 · 8 comments


@marcelogdeandrade

Hello, I'm trying to run the project locally using Docker on a 5-page PDF.

I basically ran:

$ git clone https://github.com/huridocs/pdf-document-layout-analysis
$ cd pdf-document-layout-analysis
$ docker run --rm --name pdf-document-layout-analysis -p 5060:5060 --entrypoint ./start.sh huridocs/pdf-document-layout-analysis:v0.0.14.1

The Docker container started normally, but after making a simple request:

$ curl -X POST -F 'file=@./my_file.pdf' localhost:5060

the container gets stuck at this step:

[10/02 02:50:12 d2.evaluation.evaluator]: Start inference on 5 batches

Here are some more logs:

[10/02 02:49:11 detectron2]: Full config saved to /app/model_output_doclaynet/config.yaml
[10/02 02:49:43 detectron2]: Merge using: Sum
/app/src/ditod/Wordnn_embedding.py:48: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state_dict = torch.load(bros_embedding_path + "pytorch_model.bin", map_location="cpu")
use_pretrain_weight: load model from: ../models/layoutlm-base-uncased/
[10/02 02:49:51 detectron2]: Model: Trainable network params num : 243,296,319
[10/02 02:49:51 d2.checkpoint.detection_checkpoint]: [DetectionCheckpointer] Loading from /app/models/doclaynet_VGT_model.pth ...
[10/02 02:49:51 fvcore.common.checkpoint]: [Checkpointer] Loading from /app/models/doclaynet_VGT_model.pth ...
/app/.venv/lib/python3.11/site-packages/fvcore-0.1.5.post20221221-py3.11.egg/fvcore/common/checkpoint.py:252: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
2024-10-02 02:49:57,081 [INFO] Is PyTorch using GPU: False
[2024-10-02 02:49:57 +0000] [11] [INFO] Started server process [11]
[2024-10-02 02:49:57 +0000] [11] [INFO] Waiting for application startup.
[2024-10-02 02:49:57 +0000] [11] [INFO] Application startup complete.
2024-10-02 02:49:57,224 [INFO] Calling endpoint: run
2024-10-02 02:49:57,224 [INFO] Processing file: prova_a1_split.pdf
2024-10-02 02:49:57,239 [INFO] Creating PDF images
Page-1
Page-2
Page-3
Page-4
Page-5
2024-10-02 02:50:12,361 [INFO] Full TransformGens used in training: [ResizeShortestEdge(short_edge_length=(800, 800), max_size=1333, sample_style='choice')], crop: None
WARNING [10/02 02:50:12 d2.data.datasets.coco]: /app/jsons/test.json contains 5165 annotations, but only 0 of them match to images in the file.
[10/02 02:50:12 d2.data.datasets.coco]: Loaded 5 images in COCO format from /app/jsons/test.json
[10/02 02:50:12 d2.data.build]: Distribution of instances among all 11 categories:
|  category  | #instances   |   category    | #instances   |  category   | #instances   |
|:----------:|:-------------|:-------------:|:-------------|:-----------:|:-------------|
|  Caption   | 0            |   Footnote    | 0            |   Formula   | 0            |
| List_Item  | 0            |  Page_Footer  | 0            | Page_Header | 0            |
|  Picture   | 0            | Section_Hea.. | 0            |    Table    | 0            |
|    Text    | 0            |     Title     | 0            |             |              |
|   total    | 0            |               |              |             |              |
[10/02 02:50:12 d2.data.common]: Serializing the dataset using: <class 'detectron2.data.common._TorchSerializedList'>
[10/02 02:50:12 d2.data.common]: Serializing 5 elements to byte tensors and concatenating them all ...
[10/02 02:50:12 d2.data.common]: Serialized dataset takes 0.00 MiB
[10/02 02:50:12 d2.evaluation.evaluator]: Start inference on 5 batches

I'm not seeing any issues with container resources (memory or CPU). Can you help me debug this? Thanks!

@ali6parmak
Collaborator

Hi, the commands you used look fine. Can you try curling test_pdfs/regular.pdf so we can see whether that works?

@marcelogdeandrade
Author

Yes, it also gets stuck at the [10/02 13:12:27 d2.evaluation.evaluator]: Start inference on 2 batches step; it has been running for more than 30 minutes.

When I run docker stats, I see that the container's CPU utilization is really high, but it shouldn't take this long to run:

[docker stats screenshot showing very high container CPU usage]

@ali6parmak
Collaborator

ali6parmak commented Oct 3, 2024

I tried to reproduce what you are experiencing, but unfortunately I wasn't able to; the service works fine on my end.

One thing you can try is using the "fast" models to run a non-visual analysis:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' -F "fast=true" localhost:5060

You should get the response within a few seconds. Let's try that and see if it works.

Also, can you tell us which OS you are using?

Thanks

@marcelogdeandrade
Author

Running it with fast=true works fine.

I'm running it on macOS on an M2 (ARM architecture):

OS: macOS Sonoma 14.4.1 arm64
Host: MacBook Pro (16-inch, 2023)
CPU: Apple M2 Pro (12) @ 3.50 GHz
GPU: Apple M2 Pro (19) @ 1.40 GHz [Integrated]

@marcelogdeandrade
Author

I also ran colima (a Docker runtime for macOS) in Rosetta mode to run the image on the x86 architecture. The result is the same; however, it uses a lot more CPU for quite some time:

CONTAINER ID   NAME                           CPU %     MEM USAGE / LIMIT     MEM %     NET I/O         BLOCK I/O        PIDS
efa6aacd511e   pdf-document-layout-analysis   730.65%   2.929GiB / 15.61GiB   18.76%    104MB / 612kB   77.3MB / 108MB   31
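A quick, hypothetical debugging aid (not part of the project): checking which architecture the Python process inside the container actually sees. Under Rosetta emulation, an arm64 host reports x86_64 here, meaning every instruction is emulated, which is a plausible contributor to the large CPU numbers above.

```python
import platform

# What architecture does this Python process think it is running on?
# Inside an emulated x86 container on Apple Silicon this prints "x86_64".
arch = platform.machine()
print("architecture as seen by Python:", arch)
```

Running this with `docker exec` inside the container (versus on the host) makes an emulation mismatch immediately visible.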

@marcelogdeandrade
Author

marcelogdeandrade commented Oct 4, 2024

Update: after 20 minutes of running at ~1000% CPU, it seems that one inference iteration completed:

[10/04 01:24:41 d2.evaluation.evaluator]: Inference done 1/2. Dataloading: 1.7084 s/iter. Inference: 1345.5779 s/iter. Eval: 0.0077 s/iter. Total: 1347.3115 s/iter. ETA=0:22:27

It seems weird that it is this slow on the regular.pdf test file, right?
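As a sanity check on that log line, the per-iteration time it reports can be extrapolated to a total wall-time estimate with a toy helper (not from the project, just stdlib regex over the detectron2 evaluator output):

```python
import re

# The evaluator log line quoted above.
LOG = ("[10/04 01:24:41 d2.evaluation.evaluator]: Inference done 1/2. "
       "Dataloading: 1.7084 s/iter. Inference: 1345.5779 s/iter. "
       "Eval: 0.0077 s/iter. Total: 1347.3115 s/iter. ETA=0:22:27")

# Pull out the per-batch inference time and project it over all batches.
match = re.search(r"Inference: ([\d.]+) s/iter", LOG)
seconds_per_batch = float(match.group(1))
total_batches = 2
print(f"estimated total: {seconds_per_batch * total_batches / 60:.1f} minutes")
```

That projection comes out to roughly 45 minutes for the 2 batches, which lines up with the total runtime reported later in the thread.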

@marcelogdeandrade
Author

It finished running after 45 minutes:

[10/04 01:48:01 d2.evaluation.evaluator]: Inference done 2/2. Dataloading: 0.0000 s/iter. Inference: 1399.7729 s/iter. Eval: 0.0030 s/iter. Total: 1399.7759 s/iter. ETA=0:00:00
[10/04 01:48:01 d2.evaluation.evaluator]: Total inference time: 0:23:20.084540 (1400.084540 s / iter per device, on 1 devices)
[10/04 01:48:01 d2.evaluation.evaluator]: Total inference pure compute time: 0:23:19 (1399.772861 s / iter per device, on 1 devices)

@ali6parmak
Collaborator

The visual model can be quite slow when it runs on CPU, but yes, for a simple two-page document 45 minutes is definitely too much. On my setup with an Intel® Core™ i7-8700 CPU @ 3.20GHz, it takes around 36 seconds to finish.

We haven't tested the service on macOS yet, so I'm not sure what might be causing this issue. Since the service runs but takes a long time, it could be related to hardware or system-specific optimizations. If we discover any solutions, we'll be sure to update you. Similarly, if you find anything on your end, please let us know, and maybe we can make some improvements.

Thanks again
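One generic knob worth ruling out for slow CPU inference (this is a standard PyTorch setting, not something specific to this project): whether torch is actually using all available cores for intra-op parallelism inside the container.

```python
import os

# Generic PyTorch CPU tuning sketch. If the container restricts the thread
# pool, pinning it to the visible core count can help; if torch is not
# installed in the current environment, this is a no-op.
try:
    import torch
    torch.set_num_threads(os.cpu_count() or 1)
    print("torch intra-op threads:", torch.get_num_threads())
except ImportError:
    print("torch not installed in this environment")
```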
