
Something wrong about the training process #10

Open
CRLqinliang opened this issue Apr 29, 2024 · 9 comments
CRLqinliang commented Apr 29, 2024

Hi, when I used the droid_100 dataset to train the model, everything went well at the beginning, until this error came out:

Traceback (most recent call last):
  File "/home/wrs/miniconda3/envs/droid_policy_learning/lib/python3.10/tkinter/__init__.py", line 4046, in __del__
    self.tk.call('image', 'delete', self.name)
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f062dea7a30>
Traceback (most recent call last):
  File "/home/wrs/miniconda3/envs/droid_policy_learning/lib/python3.10/tkinter/__init__.py", line 388, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f062dea7a30>

and then it got stuck here:
[screenshot]

The whole process was then killed, so I went searching for the problem and found this suggested fix:

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

However, this code is already present in the visualization file (visualization.lib), and I still hit this error during training.
Looking forward to your reply, thanks. @kpertsch

@kpertsch (Collaborator)

This seems like it may be an issue with plotting on your server. Could you try running just the three lines you posted in a fresh Python shell:

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

Does this throw an error? You may need to use a different backend instead of Agg.

Also, a longer traceback would be helpful to understand where in the droid_policy_learning code this was triggered.
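
As an aside, a minimal sketch of forcing a non-interactive backend for the entire run via matplotlib's standard MPLBACKEND environment variable, in case some other module imports pyplot before the backend is set (the placement below is illustrative, not specific to droid_policy_learning):

import os
os.environ.setdefault("MPLBACKEND", "Agg")  # must run before matplotlib is imported anywhere

import matplotlib
import matplotlib.pyplot as plt
print(matplotlib.get_backend())  # expect 'agg' (capitalization varies by matplotlib version)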

CRLqinliang (Author) commented Apr 30, 2024

@kpertsch Thank you for your reply!
I tried running the three lines in my Python shell, and it went well:
[screenshot]

When I reran the training code, it got stuck here this time:
[screenshot]

It seems to get stuck at random epochs, regardless of whether plotting is done or not. (Confused)
It even stayed stuck there the whole night, without any error message!

Update: more information:
[screenshot]
One of the CPU cores is at 100% utilization.

@kpertsch (Collaborator)

Mmh that's odd -- could you try killing the stuck program with Ctrl+C to see where in the program it got stuck? Such silent freezing issues are very hard to debug.
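
One way to see where a hung run is stuck without killing it is Python's standard faulthandler module; this is just a sketch, assuming the lines below are added near the top of the training script (the choice of signal is arbitrary):

import faulthandler
import signal

# Dump the stack of every thread when the process receives SIGUSR1
# (trigger it from another shell with: kill -USR1 <pid>).
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Alternative: dump all stacks automatically if the process is still running after 10 minutes.
# faulthandler.dump_traceback_later(timeout=600, exit=False)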

CRLqinliang (Author) commented May 1, 2024

@kpertsch Okay, I pressed Ctrl+C to see what happened, and got this:
[screenshot]
The program is waiting to acquire a thread lock and is stuck there.

This is my OS (WSL2) configuration:
[screenshot]
GPU configuration:
[screenshot]

I changed the training parameters to these:
"traj_transform_threads": 48, "traj_read_threads": 48, "shuffle_buffer_size": 100000
so there is enough RAM to run this code.
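
For clarity, those data-loading settings written out as a plain dict (the surrounding config structure is assumed here purely for illustration, not copied from the repo):

data_loading_cfg = {
    "traj_transform_threads": 48,   # parallel workers applying trajectory transforms
    "traj_read_threads": 48,        # parallel tfrecord readers
    "shuffle_buffer_size": 100000,  # samples kept in RAM for shuffling
}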

kpertsch (Collaborator) commented May 1, 2024

@ashwin-balakrishna96 have you seen such behavior before?
It seems that the data parallel wrapper around the visual encoder locks up. Can you try running with only a single GPU visible via export CUDA_VISIBLE_DEVICES=0?

CRLqinliang (Author) commented May 1, 2024

@kpertsch Okay, I will try it and let you know the result.
Thank you so much.


Update:
Well, something even weirder happened. I added export CUDA_VISIBLE_DEVICES=0 to my .bashrc file and sourced it. In the beginning it went well, but then the shell and the program suddenly shut down, without any error message.

By the way, when I ran the code in this environment, I ran into this problem:

CUDA backend failed to initialize: Found CUDA version 11070, but JAX was built against version 11080, which is newer. The copy of CUDA that is installed must be at least as new as the version against which JAX was built. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)

If I follow the installation procedure of this project, I get these NVIDIA-related packages:

nvidia-cublas-cu11 11.10.3.66 pypi_0 pypi
nvidia-cuda-cupti-cu11 11.7.101 pypi_0 pypi
nvidia-cuda-nvcc-cu11 11.8.89 pypi_0 pypi
nvidia-cuda-nvrtc-cu11 11.7.99 pypi_0 pypi
nvidia-cuda-runtime-cu11 11.7.99 pypi_0 pypi
nvidia-cudnn-cu11 8.5.0.96 pypi_0 pypi
nvidia-cufft-cu11 10.9.0.58 pypi_0 pypi
nvidia-curand-cu11 10.2.10.91 pypi_0 pypi
nvidia-cusolver-cu11 11.4.0.1 pypi_0 pypi
nvidia-cusparse-cu11 11.7.4.91 pypi_0 pypi
nvidia-nccl-cu11 2.14.3 pypi_0 pypi
nvidia-nvtx-cu11 11.7.91 pypi_0 pypi
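
A quick, illustrative way to confirm which versions JAX actually picks up in this environment (not a command from the project's docs):

import jax
import jaxlib

print(jax.__version__, jaxlib.__version__)
print(jax.devices())  # GPU devices only appear once the installed CUDA runtime is at least as new as the one jaxlib was built against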

When I went back to run the Octo fine-tuning code, it did not work in this environment.
But if I create another env following this guide: https://github.com/erikbr01/octo_experiments, it works.
So I am wondering: is there currently a problem with the installation steps of this project?

ldddddddl commented Jun 11, 2024

Hello, I tried to train a policy with droid_100, but I always get the following error:

Did you mean: droid_100 -> coil100 ?
The builder directory D:\droid-main\dataset\droid_100\droid_100 doesn't contain any versions.
No builder could be found in the directory: D:\droid-main\dataset\droid_100 for the builder: droid_100.
My droid_runs_language_conditioned_rlds config is set like this:

DATA_PATH = r"D:\droid-main\dataset"      # UPDATE WITH PATH TO RLDS DATASETS
EXP_LOG_PATH = r"D:\droid-main\temp_log"  # UPDATE WITH PATH TO DESIRED LOGGING DIRECTORY
EXP_NAMES = OrderedDict(
    [
        # Note: you can add co-training datasets here by appending
        # a new dataset to "datasets" and adjusting "sample_weights"
        # accordingly
        ("droid_100", {"datasets": ["droid_100"],
                       "sample_weights": [1]})
    ]
)
Is there a problem with that? If possible, could you share how you set it up?

@ashwin-balakrishna96 (Contributor)

That should work; perhaps something went wrong with preserving the correct directory structure when downloading the dataset. Can you confirm that the directory structure is as follows:

Within DATA_PATH there should be a folder called droid_100, and within that there should be a folder called 1.0.0. Inside the 1.0.0 folder you'll find a number of json and tfrecord files.
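
A small illustrative check of that layout (the paths mirror the config above; exact filenames will differ):

import pathlib

DATA_PATH = pathlib.Path(r"D:\droid-main\dataset")
version_dir = DATA_PATH / "droid_100" / "1.0.0"
print(version_dir.exists())                                # expect True
print(sorted(p.name for p in version_dir.glob("*.json")))  # dataset_info / features json files
print(len(list(version_dir.glob("*.tfrecord*"))))          # number of tfrecord shards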

@ldddddddl
Yes, it works now, but the error still comes up sometimes. It's so weird.
