
assert torch.cuda.is_available(): will it work without a GPU? #20

Open

RedOne88 opened this issue Apr 30, 2021 · 19 comments

Comments

@RedOne88 commented Apr 30, 2021

Hi, and thanks for your code.
I have a question: can your code work without a GPU, i.e. on the CPU?
I can't seem to use the code in its current version; I always get this error:

    assert torch.cuda.is_available()
AssertionError

Regarding the image dataset: is your code able to run on fisheye images in grayscale rather than in color?
Thank you in advance for your reply!

duanzhiihao added a commit that referenced this issue Apr 30, 2021
@duanzhiihao (Owner)

Hi, thank you for your interest.
Actually, CUDA is not required to run RAPiD. I have updated the api.py file; could you check whether you can run on the CPU by passing the use_cuda=False argument to the Detector class?
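
For reference, a minimal usage sketch along these lines; the weights filename and image path are placeholders, and the keyword names follow the description above rather than a verified API:

from api import Detector

# use_cuda=False keeps everything on the CPU, so the
# torch.cuda.is_available() assertion is never hit
detector = Detector(model_name='rapid',
                    weights_path='./weights/rapid_coco.ckpt',  # placeholder filename
                    use_cuda=False)
detector.detect_one(img_path='./images/test.jpg',  # placeholder path
                    visualize=True)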

For training, I do not intend to add CPU support, since training on the CPU would be extremely slow and I doubt anyone will try it.

Regarding the image dataset: is your code able to run on fisheye images in grayscale rather than in color?

I'm afraid I don't understand. I believe our code runs on color images. Could you share the error message, if you are facing one?

@RedOne88 commented May 3, 2021

Thank you for your reply.
I was able to reproduce your results on your test images. Thank you so much.
However, when I tried it on grayscale images, it didn't work. Here is the error:

File "example.py", line 8, in <module>
  detector.detect_one (img_path = '. / images / image1-002.jpg',
File "/home/redmou/Téléchargements/rapid/api.py", line 69, in detect_one
  detections = self._predict_pil (img, ** kwargs)
File "/home/redmou/Téléchargements/rapid/api.py", line 136, in _predict_pil
  dts = self.model (input _). cpu ()
File "/home/redmou/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
  result = self.forward (* input, ** kwargs)
File "/home/redmou/Téléchargements/rapid/models/rapid.py", line 71, in forward
  small, medium, large = self.backbone (x)
File "/home/redmou/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
  result = self.forward (* input, ** kwargs)
File "/home/redmou/Téléchargements/rapid/models/backbones.py", line 80, in forward
  x = self.netlist [i] (x)
File "/home/redmou/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
  result = self.forward (* input, ** kwargs)
File "/home/redmou/.local/lib/python3.8/site-packages/torch/nn/modules/container.py", line 119, in forward
  input = module (input)
File "/home/redmou/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
  result = self.forward (* input, ** kwargs)
File "/home/redmou/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 399, in forward
  return self._conv_forward (input, self.weight, self.bias)
File "/home/redmou/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
  return F.conv2d (input, weight, bias, self.stride,
RuntimeError: Given groups = 1, weight of size [32, 3, 3, 3], expected input [1, 1, 1024, 1024] to have 3 channels, but got 1 channels instead 

Do you have any idea about this type of error?
Thank you

@duanzhiihao (Owner)

Hello,
Our method is not designed for grayscale images, but here is a workaround: expand (repeat) the grayscale image into an RGB image before feeding it into the CNN:

# im is a torch tensor, and im.shape is (1, 1, h, w)
im = im.expand(-1, 3, -1, -1)  # view the single channel as 3 identical channels -> (1, 3, h, w)
pred = model(im)

@RedOne88 commented Jun 4, 2021

Hello,
and thank you for all your answers.
Could you help me, please? I managed to install a graphics card with 1 GB of GPU memory. I started training, but given the size of my GPU it refused to start, because part of the memory is already taken by PyTorch itself.

Could you help me train either on the CPU (even though it will take a lot of time) or with my current GPU?
Thank you very much.

@duanzhiihao
Copy link
Owner

Hi,

Given that your GPU has only 1 GB of memory, it would be challenging to fit the model into it. I recommend you try Google Colab, which gives you a free 4 GB GPU; please check the tutorial here for Google Colab notebooks.

Alternatively, you can use Kaggle notebooks, which sometimes provide a free P100 GPU, which is powerful enough to train RAPiD.

Training on the CPU could take more than 20 days on the COCO and fisheye datasets. If you want to do it anyway, please let me know and I can provide a CPU training script within several days.
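
For reference, porting a CUDA-only training script to the CPU usually comes down to changes like the following (a generic sketch, not the repo's actual train.py):

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)                         # instead of an unconditional model.cuda()
imgs, targets = imgs.to(device), targets.to(device)
# and load checkpoints with map_location=device instead of the default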

@RedOne88 commented Jun 4, 2021

Thank you so much for your response.
I will try what you suggested.
I have one last question; considering the incredible work you have done, I have asked so many questions, I am sorry.
To train on grayscale images (not color), is it enough to just convert the images to grayscale, or does the code itself need to change?
My idea is to train on grayscale images and then optimize the final model as much as possible (after training), because the one provided is very large (246 MB); I think training with grayscale images alone could reduce the model size. Could you tell me which piece of code to modify, if it is not complicated?
Thank you so much

@duanzhiihao (Owner)

No problem at all.

If you want to train on grayscale images, you need to modify the code here

self.netlist.append(ConvBnLeaky(3, 32, k=3, s=1))

to ConvBnLeaky(1, 32, k=3, s=1).

However, using grayscale instead of RGB barely reduces the model size, because it only affects the very first layer, which is very lightweight compared to the whole model.
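
For scale, a quick back-of-the-envelope check of that claim, using the ConvBnLeaky(3, 32, k=3, s=1) layer above and the backbone size reported later in this thread:

# weights in the first conv = out_channels * in_channels * k * k
first_conv_rgb  = 32 * 3 * 3 * 3  # 864 weights with RGB input
first_conv_gray = 32 * 1 * 3 * 3  # 288 weights with grayscale input
print(first_conv_rgb - first_conv_gray)  # 576 weights saved, out of ~40.6M in the backbone alone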

An effective way to reduce the model size (while sacrificing some accuracy) is to use half precision (i.e., float16) instead of float32. To do this, please try:

model = model.half()  # convert all weights from float32 to float16
x = x.half()          # the input dtype must match the model's
y = model(x)
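
To verify the effect on disk, one can save the converted weights; a minimal sketch, assuming model is the loaded RAPiD network (the output filename is a placeholder):

import torch

model = model.half()  # float32 -> float16
torch.save(model.state_dict(), 'rapid_fp16.pt')  # placeholder filename
# the saved state dict should be roughly half of the original ~246 MB checkpoint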

@RedOne88 commented Jun 4, 2021

Thank you!
:)

@RedOne88 commented Jun 9, 2021 via email

@RedOne88 commented Jun 10, 2021 via email

@RedOne88 commented Jun 11, 2021

Meanwhile, I ran the program on Kaggle and enabled the GPU. It worked at the beginning, then crashed with this error:

effective batch size = 8 * 16
initialing dataloader...
Only train on person images and objects
Loading annotations /kaggle/input/cocods/annotations_trainval2017/annotations/instances_train2017.json into memory...
Training on perspective images; adding angle to BBs
Using backbone Darknet-53. Loading ImageNet weights....
Warning: no ImageNet-pretrained weights found. Please check https://github.com/duanzhiihao/RAPiD for it.
Number of parameters in backbone: 40584928
2021-06-11 12:13:48.188692: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
/opt/conda/conda-bld/pytorch_1603729047590/work/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
  File "/kaggle/input/rapid-training/train.py", line 257, in <module>
    loss = model(imgs, targets, labels_cats=cats)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/kaggle/input/rapid-training/rapid/rapid/models/rapid.py", line 80, in forward
    boxes_M, loss_M = self.pred_M(detect_M, self.img_size, labels)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/kaggle/input/rapid-training/rapid/rapid/models/rapid.py", line 282, in forward
    target[b,best_n,truth_j,truth_i,0] = tx_all[b,:n][valid_mask] - tx_all[b,:n][valid_mask].floor()
RuntimeError: CUDA error: device-side assert triggered

I believe a box index is going out of bounds, but I don't know where it comes from.

@duanzhiihao (Owner)

It seems to be related to the COCO dataset format. Please check #11 to see if that solves your problem.

@RedOne88 commented Jun 15, 2021

Indeed, that worked, but after 2 hours of training it crashed. Here is the error message:

Total time: 1:52:05.342283, iter: 0:00:13.397096, epoch: 3:26:20.508855
[Iteration 500] [learning rate 0.001] [Total loss 209.47] [img size 512]
level_16 total 8 objects: xy/gt 1.385, wh/gt 0.143, angle/gt 0.627, conf 44.588
level_32 total 1 objects: xy/gt 1.340, wh/gt 0.016, angle/gt 0.638, conf 13.681
level_64 total 12 objects: xy/gt 1.384, wh/gt 0.215, angle/gt 0.748, conf 105.670
Max GPU memory usage: 6.040322303771973 GigaBytes

Traceback (most recent call last):
  File "/kaggle/input/rapid-training/train.py", line 303, in <module>
    dts = api.detect_once(model, eval_img, conf_thres=0.1, input_size=target_size)
  File "/kaggle/input/rapid-training/rapid/rapid/api.py", line 175, in detect_once
    dts = model(input_img[None]).cpu().squeeze()
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/kaggle/input/rapid-training/rapid/rapid/models/rapid.py", line 71, in forward
    small, medium, large = self.backbone(x)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/kaggle/input/rapid-training/rapid/rapid/models/backbones.py", line 80, in forward
    x = self.netlist[i](x)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 423, in forward
    return self._conv_forward(input, self.weight)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 420, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Given groups=1, weight of size [32, 3, 3, 3], expected input[1, 1, 608, 608] to have 3 channels, but got 1 channels instead

Do you have any idea about the source of the error?
Thank you very much.

@duanzhiihao (Owner)

The error says that the input is a grayscale image, but the network expects an RGB image. Did you make any changes to the datasets.py script?

if img.mode == 'L':
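
For context, a hedged sketch of the kind of image-loading code that check belongs to; the surrounding lines are assumptions, not the repo's exact datasets.py:

from PIL import Image

img = Image.open(img_path)
if img.mode == 'L':           # single-channel grayscale image
    img = img.convert('RGB')  # replicate the channel so the first conv sees 3 channels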

@RedOne88 (Author)

Regarding training on COCO, I did not manage to finish it; on Kaggle, the site crashes every time after 25 hours of execution. My idea is to build my own model with your algorithm.
First, I will start from the existing one. I downloaded HABBOF, CEPDOF, and MW-R, and I want to build a dataset that combines all of them by taking, for example, 1000 images from each. Do you think that we could get good results while training the model on a dataset that is not very large?

@duanzhiihao (Owner)

Do you think that we could get good results while training the model on a dataset that is not very large?

Yes, as long as you start from the pre-trained model and use a small learning rate.
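
A generic PyTorch sketch of that advice; the checkpoint filename and state-dict key are assumptions, and this is not the repo's train.py:

import torch

# start from the released pre-trained weights instead of random initialization
checkpoint = torch.load('rapid_pretrained.ckpt', map_location='cpu')  # placeholder filename
model.load_state_dict(checkpoint['model'])  # key layout is an assumption

# a small learning rate fine-tunes the pre-trained weights gently, which matters
# when the combined dataset is only a few thousand images
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)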

@RedOne88 commented Jul 5, 2021

Hello,
I still can't finish training. On Kaggle, I am only allowed 9 hours of continuous use.
I chose the MW-R dataset and reduced the size of the images and annotations, hoping to reduce the training time.
Otherwise, I have a 32-core machine with an old graphics card (Quadro K4200) that is not supported by PyTorch, so I plan to run the training on the CPU. Could you provide a version of your code compatible with the CPU, please?

@RedOne88 commented Jul 7, 2021

Please keep me informed if you have any ideas.

@RedOne88 (Author)

Hello,
can you provide me with a CPU training script?
Thanks
