train with caffe Vitis-AI GPU fail #691
Comments
Hi there, After changing the GPU parameter to 0, such as
Caffe training was able to start, but later I got another error down the line, as shown below
Does anyone have any idea? |
Hi @mhanuel26 Could you provide your GPU model? It seems that this error is related to the GPU model and CUDA version. Besides that, did you change the GPU id to ‘0’ in train.sh? You could give that a try. |
Hi @wangxd-xlnx Yes, I have changed the GPU to only 0 and it actually starts working (before, I was getting an error very early); it is only after some time that it throws that error. I am working with an Nvidia RTX 3060. I am not in front of the PC now to give you the driver version (I will post it later), but I am using CUDA 11.6 on Ubuntu 20.04. I have been training with ssd_mobilenet_v2 and it hasn't failed, though it is really slow compared to the example. What else can I do to debug? Thanks, |
Hi @wanghy-xlnx , Here is the output of nvidia-smi, where you can see the driver version and CUDA version,
The command I am running in train.sh is
The first part of the log looks normal
How can I debug this further? Thanks, |
Hi @mhanuel26 Please note that if you can use the Vitis-AI docker, you only need to activate the conda env ‘vitis-ai-caffe'. You don't need to compile the caffe-xilinx source code manually; that is only for environments where using docker is inconvenient. caffe-xilinx is precompiled in the conda env ‘vitis-ai-caffe' and you can use it directly. So you could try again: exit, run docker, and don't compile caffe-xilinx. PATH: /opt/vitis_ai/conda/envs/vitis-ai-caffe/bin/ |
Hi @wangxd-xlnx , This is not the same problem. I was able to follow the SSD example for mobilenet_v2 with the VOC dataset and it works correctly under caffe; the only problem is that it runs very slowly on my RTX 3060 GPU.
This issue #691 is more related to some mathematical operation of the model under the xilinx-caffe branch. I was wondering whether there might be some improvement if caffe-xilinx is built on the host. I should probably open another issue to discuss this topic, but let me know what you think. Thanks, |
Hi @wanghy-xlnx , I got a similar error when working with the dogs vs cats design example; here is the console output (first part omitted).
It looks very reproducible. Do you have any suggestions on how to debug this further? The command I ran was
There was in fact another error before
|
Hi @mhanuel26 OK, thanks for your feedback. |
Hi @wanghy-xlnx , As you can see in my last comment, I am getting errors trying to get caffe working on my box. I wish this were only about efficiency; I have been trying different caffe examples and none of them works at all. Is there any way to debug this? If you were to do it, how would you approach it? |
Hi @wanghy-xlnx , I found something that might be related to the issues. I checked the caffe environment, and cudnn and cudatoolkit seem to be rather old versions, at least compared with tensorflow2. The TensorFlow environment has the same versions as the caffe one. Look below
Coincidentally, I haven't been able to successfully run a single caffe or TensorFlow example, but I was able to successfully run a TensorFlow2 example. Look at the same output for TensorFlow2
Do you know how I can build a docker image that uses CUDA 11.5 or 11.6 instead of 10, and the latest or a newer cudnn version? |
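Editor's sketch, not part of the original comment: one way to make the comparison above repeatable is to extract only the `cudatoolkit` and `cudnn` rows from saved `conda list` output for each environment. The helper names (`parse_conda_list`, `cuda_stack`) and the sample text are hypothetical, and the version strings are placeholders rather than values reported in this thread.

```python
def parse_conda_list(text):
    """Map package name -> version from the text printed by `conda list`."""
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '#' header lines conda prints.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 2:
            versions[fields[0]] = fields[1]
    return versions


def cuda_stack(text, names=("cudatoolkit", "cudnn")):
    """Return only the CUDA-related package versions (None if not installed)."""
    versions = parse_conda_list(text)
    return {name: versions.get(name) for name in names}


# Placeholder input: X.Y.Z and A.B.C stand in for real version strings.
SAMPLE = """\
# packages in environment at /path/to/env:
#
# Name                    Version                   Build  Channel
cudatoolkit               X.Y.Z                     0
cudnn                     A.B.C                     cuda_0
"""

print(cuda_stack(SAMPLE))
```

Feeding it the output of `conda list -n vitis-ai-caffe` and of the TensorFlow2 environment would give two small dictionaries that can be compared directly.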
Hi @mhanuel26 , nvidia-smi output does not necessarily reflect the actual NVIDIA driver that is installed on the host machine. On your host machine, can you please list your NVIDIA driver
You can also use this command to check the NVIDIA driver version
On your host, run
|
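Editor's sketch, not the commands from the comment above (those were lost in extraction): the GPU name and driver version can also be read programmatically through nvidia-smi's CSV query mode. `parse_gpu_query` and `query_gpus` are hypothetical helper names, and `query_gpus` assumes `nvidia-smi` is on the PATH.

```python
import subprocess


def parse_gpu_query(text):
    """Parse 'name, driver_version' CSV rows into (name, driver) tuples."""
    rows = []
    for line in text.splitlines():
        # Ignore blank or malformed lines instead of failing on them.
        if "," not in line:
            continue
        name, driver = (part.strip() for part in line.rsplit(",", 1))
        rows.append((name, driver))
    return rows


def query_gpus():
    """Ask nvidia-smi for each GPU's name and driver version."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_gpu_query(result.stdout)
```

Running `query_gpus()` both on the host and inside the container shows whether the two report the same driver.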
If your docker container does not have the |
Hi @hanxue , @wanghy-xlnx , My container did not have the nvidia runtime, so I installed it. Here are the outputs,
After installing it and restarting docker, it shows the runtime
But the Vitis-AI model from the Zoo still does not work; here is the console output (a few lines from the start, then the last few lines)
How can I debug this? |
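Editor's sketch, assuming the Docker CLI is installed: the check for the nvidia runtime described above can be scripted by asking `docker info` for its `Runtimes` field as JSON. The helper names are hypothetical; the exact commands used in the thread were not preserved.

```python
import json
import subprocess


def has_nvidia_runtime(runtimes_json):
    """Return True if the JSON runtimes mapping contains an 'nvidia' entry."""
    runtimes = json.loads(runtimes_json)
    return isinstance(runtimes, dict) and "nvidia" in runtimes


def docker_runtimes_json():
    """Return the registered runtimes of the local Docker daemon as JSON text."""
    result = subprocess.run(
        ["docker", "info", "--format", "{{json .Runtimes}}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

`has_nvidia_runtime(docker_runtimes_json())` should be False before the runtime is installed and True after installing it and restarting docker. It only confirms that the runtime is registered, not that a given model will train.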
Hi @mhanuel26 We have re-analyzed your steps and can provide a solution for the train.sh problems. Please follow our steps exactly.
Then it will run successfully. |
Hi @wangxd-xlnx , @hanxue , That did NOT work. The data generation is OK and I have followed your steps exactly. I haven't compiled caffe-xilinx; in fact, I renamed the caffe-xilinx folder, and you can see that it is using the pre-built docker binary, as shown below ../../../caffe-xilinx/build/tools/caffe.bin does not exist, try use path in pre-build docker
I have renamed the caffe-xilinx folder in the meantime. The data directory after running those scripts looks like this
train.sh still fails. Do you suggest creating a new docker image? Is there a way I can get the cudnn and CUDA runtime versions updated to match the tensorflow2 conda environment? Thanks, |
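Editor's sketch of the fallback that the log message above describes, under stated assumptions: the local build path is the one printed in the message, the pre-built directory is the PATH given earlier in the thread, and the binary name `caffe` inside that directory is an assumption. `pick_caffe_binary` is a hypothetical helper, not code from train.sh.

```python
from pathlib import Path

# Paths quoted in this thread; the trailing "caffe" file name is an assumption.
LOCAL_BUILD = "../../../caffe-xilinx/build/tools/caffe.bin"
PREBUILT = "/opt/vitis_ai/conda/envs/vitis-ai-caffe/bin/caffe"


def pick_caffe_binary(candidates, exists=Path.is_file):
    """Return the first candidate path that exists, or None if none do."""
    for candidate in candidates:
        path = Path(candidate)
        if exists(path):
            return path
    return None


print(pick_caffe_binary([LOCAL_BUILD, PREBUILT]))
```

With the caffe-xilinx folder renamed, the first candidate is missing, so a script following this pattern uses the pre-built binary from the conda env.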
Hi @mhanuel26 , I noticed that you are using Is there a chance that you can try with another NVIDIA GPU? |
Hi @hanxue , @wangxd-xlnx , That probably explains everything. Thanks, |
Hi @hanxue , @wangxd-xlnx , I was looking for the documentation about the compatibility you are mentioning, but cannot easily find it. Could you point me to it before closing this, please? Thanks, |
I found this patch, it looks possible to use with modifications. I know we can disable cudnn for caffe, so maybe it is a matter of looking at cuda 11 support for caffe. |
Hi @mhanuel26 , This page shows that the RTX 3060 requires at least CUDA 11.1. https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/ It is not that simple to figure this out directly from the Ampere GPU Architecture Compatibility Guide and the CUDA Compatibility index. |
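Editor's sketch of the compatibility rule stated above. The table lists the first CUDA toolkit release that supports a few compute capabilities, and it is deliberately partial. The RTX 3060 is compute capability 8.6. `MIN_CUDA` and `toolkit_supports` are hypothetical names, and a toolkit passing this check does not guarantee that a particular Caffe build was compiled for that architecture.

```python
# First CUDA toolkit release supporting each compute capability (partial table).
MIN_CUDA = {
    (7, 0): (9, 0),   # Volta
    (7, 5): (10, 0),  # Turing
    (8, 0): (11, 0),  # Ampere, e.g. A100
    (8, 6): (11, 1),  # Ampere, e.g. RTX 30 series
}


def toolkit_supports(compute_capability, cuda_version):
    """Return True if cuda_version is new enough for the given GPU architecture."""
    required = MIN_CUDA.get(compute_capability)
    if required is None:
        raise KeyError(f"compute capability {compute_capability} not in table")
    # Tuples compare element by element, so (11, 6) >= (11, 1) is True.
    return cuda_version >= required


print(toolkit_supports((8, 6), (10, 0)))
print(toolkit_supports((8, 6), (11, 6)))
```

This matches the symptom in the thread: the host's CUDA 11.6 is new enough for the card, while an environment built on CUDA 10 is not.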
Hi,
I am getting the following issue while training cf_refinedet_coco_360_480_0.96_5.08G_2.0
Here is the output of nvidia-smi
What could I be missing?