Facing error while trying to run inference #14

Closed
SriRamGovardhanam opened this issue Apr 25, 2020 · 3 comments

Comments

@SriRamGovardhanam

Hello,

I am facing an OSError:

```
Traceback (most recent call last):
  File "DensePose/detectron/tests/test_zero_even_op.py", line 117, in <module>
    c2_utils.import_custom_ops()
  File "/home/sriram/DensePose/detectron/utils/c2.py", line 40, in import_custom_ops
    dyndep.InitOpsLibrary(custom_ops_lib)
  File "/home/sriram/anaconda2/lib/python2.7/site-packages/caffe2/python/dyndep.py", line 35, in InitOpsLibrary
    _init_impl(name)
  File "/home/sriram/anaconda2/lib/python2.7/site-packages/caffe2/python/dyndep.py", line 48, in _init_impl
    ctypes.CDLL(path)
  File "/home/sriram/anaconda2/lib/python2.7/ctypes/__init__.py", line 366, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /home/sriram/DensePose/build/libcaffe2_detectron_custom_ops_gpu.so: undefined symbol: _ZN6caffe219CPUOperatorRegistryB5cxx11Ev
```

I tried to install gcc-4.9.2 and ran into errors many times. When I ran it on Colab, gcc-4.9.1 got installed, but the default gcc there is 7.5.0.

So I skipped test_zero_even_op.py and tried to run in inference mode, and there too I hit errors like:

```
RuntimeError: [enforce fail at context_gpu.cu:415] error == cudaSuccess. 2 vs 0. Error at: /opt/conda/conda-bld/pytorch_1549617926868/work/caffe2/core/context_gpu.cu:415: out of memory
```

Please check this thread: https://github.com/facebookresearch/DensePose/issues/269

I have been trying for the last 7 days, and it is really painful.
Please help me out.

@trrahul
Owner

trrahul commented Apr 25, 2020

Was the build successful? `undefined symbol: _ZN6caffe219CPUOperatorRegistryB5cxx11Ev` could mean there is something wrong with the Caffe2 build or one of its dependency builds. Undefined-symbol errors from shared objects (.so) usually mean the linking process did not complete.
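One quick way to check for this kind of mismatch is to demangle the unresolved symbol; a sketch, assuming GNU binutils (`c++filt`, `nm`) is installed:

```shell
# Demangle the symbol the dynamic loader could not resolve
# (taken from the OSError above; c++filt ships with binutils):
c++filt _ZN6caffe219CPUOperatorRegistryB5cxx11Ev
# The demangled name carries an [abi:cxx11] tag, which means the custom-ops
# library was built against the gcc>=5 libstdc++ ABI while the Caffe2 it loads
# was built against the old one (or vice versa): a _GLIBCXX_USE_CXX11_ABI mismatch.

# To list every symbol the library references but does not define,
# using the path from the error message (adjust to your build):
# nm -D --undefined-only /home/sriram/DensePose/build/libcaffe2_detectron_custom_ops_gpu.so | c++filt | grep -i registry
```

If the `[abi:cxx11]` tag appears, rebuilding the custom ops and Caffe2 with the same compiler (or the same `-D_GLIBCXX_USE_CXX11_ABI` setting) is the usual fix.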

@SriRamGovardhanam
Author

SriRamGovardhanam commented Apr 25, 2020

> Was the build successful? undefined symbol: _ZN6caffe219CPUOperatorRegistryB5cxx11Ev could mean there is something wrong with the Caffe build or some of its dependency builds. Undefined errors from shared objects (.so) usually mean the linking process was not completely done.

Hello Rahul (@trrahul), thank you so much for responding.
I am now able to run test_zero_even_op.py; it needs gcc-4.9.1 to compile, and I managed to install gcc-4.9.1 successfully.

```
python2 DensePose/detectron/tests/test_zero_even_op.py
[E init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[E init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[E init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
............

Ran 12 tests in 1.687s

OK
```
But now the problem is running inference. I tried:

```
cd DensePose && python2 tools/infer_simple.py \
    --cfg configs/DensePose_ResNet50_FPN_s1x-e2e.yaml \
    --output-dir DensePoseData/infer_out/ \
    --image-ext jpg \
    --wts https://dl.fbaipublicfiles.com/densepose/DensePose_ResNet50_FPN_s1x-e2e.pkl \
    DensePoseData/demo_data/demo_im2.jpg
```

where demo_im2.jpg is a 2.3 KB image, about 10 px in height and width.
My GPU is an NVIDIA 940MX with 2 GB VRAM, and the machine has 12 GB of RAM.
CUDA 10.2, cuDNN 7.6.5 (also using the ResNet50 backbone).

Now I get the same RuntimeError output:

```
Found Detectron ops lib: /home/sriram/anaconda2/lib/libcaffe2_detectron_ops_gpu.so
[E init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[E init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[E init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
WARNING cnn.py: 25: [====DEPRECATE WARNING====]: you are creating an object from CNNModelHelper class which will be deprecated soon. Please use ModelHelper object with brew module. For more information, please refer to caffe2.ai and python/brew.py, python/brew_test.py for more information.
INFO net.py: 51: Loading weights from: /tmp/detectron-download-cache/DensePose_ResNet50_FPN_s1x-e2e.pkl
[I net_dag_utils.cc:102] Operator graph pruning prior to chain compute took: 5.428e-05 secs
[I net_dag_utils.cc:102] Operator graph pruning prior to chain compute took: 4.2869e-05 secs
[I net_dag_utils.cc:102] Operator graph pruning prior to chain compute took: 1.1066e-05 secs
INFO infer_simple.py: 103: Processing DensePoseData/demo_data/demo_im2.jpg -> DensePoseData/infer_out/demo_im2.jpg.pdf
[I net_async_base.h:211] Using specified CPU pool size: 4; device id: -1
[I net_async_base.h:216] Created new CPU pool, size: 4; device id: -1
[E net_async_base.cc:377] [enforce fail at context_gpu.cu:415] error == cudaSuccess. 2 vs 0. Error at: /opt/conda/conda-bld/pytorch_1549617926868/work/caffe2/core/context_gpu.cu:415: out of memory

Error from operator:
input: "gpu_0/res2_2_branch2c_bn" input: "gpu_0/res2_1_branch2c_bn" output: "gpu_0/res2_2_sum" name: "" type: "Sum" device_option { device_type: 1 device_id: 0 }
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x59 (0x7f049565f339 in /home/sriram/anaconda2/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #1: + 0x29581fc (0x7f04984451fc in /home/sriram/anaconda2/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2_gpu.so)
frame #2: + 0x1321c55 (0x7f04baf4ac55 in /home/sriram/anaconda2/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #3: float* at::TensorImpl::mutable_data() + 0x2a (0x7f04bb1fc8aa in /home/sriram/anaconda2/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #4: + 0x2cecfd3 (0x7f04987d9fd3 in /home/sriram/anaconda2/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2_gpu.so)
frame #5: caffe2::SumOp<caffe2::CUDAContext>::RunOnDevice() + 0x65 (0x7f04987d5b65 in /home/sriram/anaconda2/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2_gpu.so)
frame #6: + 0x13ae835 (0x7f0496e9b835 in /home/sriram/anaconda2/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2_gpu.so)
frame #7: caffe2::AsyncNetBase::run(int, int) + 0x144 (0x7f04bb0b5a24 in /home/sriram/anaconda2/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #8: + 0x1493dc2 (0x7f04bb0bcdc2 in /home/sriram/anaconda2/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #9: c10::ThreadPool::main_loop(unsigned long) + 0x258 (0x7f04ba2046f8 in /home/sriram/anaconda2/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #10: + 0xb8678 (0x7f04cba7d678 in /home/sriram/anaconda2/bin/../lib/libstdc++.so.6)
frame #11: + 0x76db (0x7f04d2bbf6db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #12: clone + 0x3f (0x7f04d214388f in /lib/x86_64-linux-gnu/libc.so.6)
, op Sum
[E net_async_base.cc:129] Rethrowing exception from the run of 'generalized_rcnn'
WARNING workspace.py: 204: Original python traceback for operator 34 in network generalized_rcnn in exception above (most recent call last):
WARNING workspace.py: 209: File "tools/infer_simple.py", line 140, in
WARNING workspace.py: 209: File "tools/infer_simple.py", line 91, in main
WARNING workspace.py: 209: File "/home/sriram/DensePose/detectron/core/test_engine.py", line 334, in initialize_model_from_cfg
WARNING workspace.py: 209: File "/home/sriram/DensePose/detectron/modeling/model_builder.py", line 119, in create
WARNING workspace.py: 209: File "/home/sriram/DensePose/detectron/modeling/model_builder.py", line 84, in generalized_rcnn
WARNING workspace.py: 209: File "/home/sriram/DensePose/detectron/modeling/model_builder.py", line 233, in build_generic_detection_model
WARNING workspace.py: 209: File "/home/sriram/DensePose/detectron/modeling/optimizer.py", line 46, in build_data_parallel_model
WARNING workspace.py: 209: File "/home/sriram/DensePose/detectron/modeling/model_builder.py", line 165, in _single_gpu_build_func
WARNING workspace.py: 209: File "/home/sriram/DensePose/detectron/modeling/FPN.py", line 40, in add_fpn_ResNet50_conv5_body
WARNING workspace.py: 209: File "/home/sriram/DensePose/detectron/modeling/FPN.py", line 96, in add_fpn_onto_conv_body
WARNING workspace.py: 209: File "/home/sriram/DensePose/detectron/modeling/ResNet.py", line 32, in add_ResNet50_conv5_body
WARNING workspace.py: 209: File "/home/sriram/DensePose/detectron/modeling/ResNet.py", line 94, in add_ResNet_convX_body
WARNING workspace.py: 209: File "/home/sriram/DensePose/detectron/modeling/ResNet.py", line 77, in add_stage
WARNING workspace.py: 209: File "/home/sriram/DensePose/detectron/modeling/ResNet.py", line 184, in add_residual_block
Traceback (most recent call last):
  File "tools/infer_simple.py", line 140, in <module>
    main(args)
  File "tools/infer_simple.py", line 109, in main
    model, im, None, timers=timers
  File "/home/sriram/DensePose/detectron/core/test.py", line 58, in im_detect_all
    model, im, cfg.TEST.SCALE, cfg.TEST.MAX_SIZE, boxes=box_proposals
  File "/home/sriram/DensePose/detectron/core/test.py", line 158, in im_detect_bbox
    workspace.RunNet(model.net.Proto().name)
  File "/home/sriram/anaconda2/lib/python2.7/site-packages/caffe2/python/workspace.py", line 236, in RunNet
    StringifyNetName(name), num_iter, allow_fail,
  File "/home/sriram/anaconda2/lib/python2.7/site-packages/caffe2/python/workspace.py", line 197, in CallWithExceptionIntercept
    return func(*args, **kwargs)
RuntimeError: [enforce fail at context_gpu.cu:415] error == cudaSuccess. 2 vs 0. Error at: /opt/conda/conda-bld/pytorch_1549617926868/work/caffe2/core/context_gpu.cu:415: out of memory
Error from operator:
input: "gpu_0/res2_2_branch2c_bn" input: "gpu_0/res2_1_branch2c_bn" output: "gpu_0/res2_2_sum" name: "" type: "Sum" device_option { device_type: 1 device_id: 0 }
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x59 (0x7f049565f339 in /home/sriram/anaconda2/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #1: + 0x29581fc (0x7f04984451fc in /home/sriram/anaconda2/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2_gpu.so)
frame #2: + 0x1321c55 (0x7f04baf4ac55 in /home/sriram/anaconda2/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #3: float* at::TensorImpl::mutable_data() + 0x2a (0x7f04bb1fc8aa in /home/sriram/anaconda2/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #4: + 0x2cecfd3 (0x7f04987d9fd3 in /home/sriram/anaconda2/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2_gpu.so)
frame #5: caffe2::SumOp<caffe2::CUDAContext>::RunOnDevice() + 0x65 (0x7f04987d5b65 in /home/sriram/anaconda2/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2_gpu.so)
frame #6: + 0x13ae835 (0x7f0496e9b835 in /home/sriram/anaconda2/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2_gpu.so)
frame #7: caffe2::AsyncNetBase::run(int, int) + 0x144 (0x7f04bb0b5a24 in /home/sriram/anaconda2/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #8: + 0x1493dc2 (0x7f04bb0bcdc2 in /home/sriram/anaconda2/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #9: c10::ThreadPool::main_loop(unsigned long) + 0x258 (0x7f04ba2046f8 in /home/sriram/anaconda2/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #10: + 0xb8678 (0x7f04cba7d678 in /home/sriram/anaconda2/bin/../lib/libstdc++.so.6)
frame #11: + 0x76db (0x7f04d2bbf6db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #12: clone + 0x3f (0x7f04d214388f in /lib/x86_64-linux-gnu/libc.so.6)
```

Seriously, this is a real pain; I have been trying to run this for the last 7 days.
I don't understand how a 2.3 KB image cannot get through inference.
Previously I ran Mask R-CNN on this system. Although it took a day for training, it finally showed some results 😂😂; I guess that is its training optimisation.
Is there any way to resolve this apart from adding more VRAM?
Please help me with this!

@trrahul
Owner

trrahul commented May 1, 2020

> my GPU is nvidia-940mx 2GB VRAM

I really recommend using Paperspace or another cloud service to run your tests if you do not have enough GPU memory. 2 GB is not enough.

@trrahul trrahul closed this as completed Aug 27, 2020