
Fix broken Docker recipe and update dependencies #7

Merged: 2 commits merged into main on Dec 13, 2023
Conversation

@tuncK (Contributor) commented Nov 23, 2023

The jax version dictated by colabfold is too old, resulting in a Docker build error.

Also updated TensorFlow to 2.15 and CUDA to 12.2, along with other packages.

seaborn==0.12.2 \
voila==0.4.1 \
"colabfold[alphafold] @ git+https://github.com/sokrypton/ColabFold" && \
&& \
# As of Nov 2023, colabfold requires 0.3.25 <= jax < 0.4.0, which leads to build errors.
Member

@anuprulez you need to decide if we can remove that.

Contributor Author

Strictly speaking, the TensorFlow errors have been an issue since the previous release as well, and we now have another IT with a separate container (v0.2) that provides the colabfold service.

Option 2: if we want to keep it, I could try to provide colabfold from a conda environment installed inside this container, so that TensorFlow et al. still work.

@anuprulez ?

Member

Somehow I did not get this email even though I am subscribed to this repo. It is not even in my spam. Sorry for that!

I will have a look at it today :)

Member

I agree that Colabfold has issues with the latest versions of TensorFlow, maybe with CUDA as well. We can remove it from v0.3 and later versions of this Docker container. We already have it on v0.2 in case it is needed for my defense.
@tuncK @bgruening

Contributor Author

@bgruening, shall we merge this?

Member

Sure.

bgruening merged commit d8cebdf into main on Dec 13, 2023 (2 checks passed).
bgruening deleted the jax branch on December 13, 2023, 09:02.
@bgruening (Member)

@anuprulez do you want to create a new release?

@tuncK (Contributor, Author) commented Dec 13, 2023

@anuprulez is the IT on .eu still broken? Because that is why I started this.

@anuprulez (Member) commented Dec 14, 2023

@tuncK yes, it's still broken, I think, with the following error message when I run a TensorFlow notebook:

2023-12-14 14:09:26.827782: W tensorflow/tsl/framework/cpu_allocator_impl.cc:82] Allocation of 188160000 exceeds 10% of free system memory.
2023-12-14 14:09:28.328554: W tensorflow/compiler/xla/service/gpu/nvptx_helper.cc:56] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
  ./cuda_sdk_lib
  /usr/local/cuda-11.2
  /usr/local/cuda
  .
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions.  For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
2023-12-14 14:09:28.339603: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-12-14 14:09:28.340107: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc

libdevice not found at ./libdevice.10.bc
	 [[{{node StatefulPartitionedCall_81}}]] [Op:__inference_train_function_14105]
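
The warning above already names the usual workaround: tell XLA where the CUDA tree lives via XLA_FLAGS. A minimal sketch, assuming the flag is set from the notebook before TensorFlow is imported; the path is a placeholder and must be a directory that actually contains nvvm/libdevice inside the container:

import os

# Point XLA at the CUDA installation before TensorFlow is imported, so the
# flag is seen when the first XLA computation is compiled. The path below is
# a placeholder; use whichever directory holds nvvm/libdevice in the container.
os.environ["XLA_FLAGS"] = "--xla_gpu_cuda_data_dir=/usr/local/cuda"

import tensorflow as tf  # imported only after the flag is set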

@anuprulez (Member)

I can create a new release, but only after verifying all the notebooks (except those using Colabfold) by running this container on my VM. Probably next week.

@anuprulez (Member) commented Dec 21, 2023

The newly released v0.4 of this tool throws the same error as above, as does v0.3:

libdevice not found at ./libdevice.10.bc

It is not possible to train models using TensorFlow: TF recognises the GPU but fails while training any model.
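
A minimal check along those lines, assuming TensorFlow 2.x inside the container; the toy model and the jit_compile flag are illustrative, not taken from the affected notebooks:

import numpy as np
import tensorflow as tf

# The GPU is detected ...
print(tf.config.list_physical_devices("GPU"))

# ... but an XLA-compiled training step is what triggers the libdevice lookup.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss="mse", jit_compile=True)

x = np.random.rand(32, 4).astype("float32")
y = np.random.rand(32, 1).astype("float32")
model.fit(x, y, epochs=1, verbose=0)  # fails with "libdevice not found" when XLA cannot locate CUDA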

bgruening added a commit to usegalaxy-eu/infrastructure-playbook that referenced this pull request Dec 21, 2023
bgruening added a commit to usegalaxy-eu/infrastructure-playbook that referenced this pull request Dec 21, 2023
@bgruening (Member)

Can you please try this again tomorrow? I tried usegalaxy-eu/infrastructure-playbook#1067

@anuprulez (Member)

I tried it, but unfortunately it did not work. I had already tried this solution directly in the Docker container as well. It still does not find the libdevice file, even though it is present at /opt/conda/nvvm/. I will look into it.
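
For what it is worth, the warning earlier in the thread searches for ${CUDA_DIR}/nvvm/libdevice, i.e. a libdevice subdirectory, so a file sitting directly under /opt/conda/nvvm/ would not be picked up even with the data-dir flag set. A hypothetical check, with all paths assumed rather than verified against the container:

from pathlib import Path

cuda_dir = Path("/opt/conda")                                   # candidate value for --xla_gpu_cuda_data_dir
expected = cuda_dir / "nvvm" / "libdevice" / "libdevice.10.bc"  # layout XLA searches for
found = cuda_dir / "nvvm" / "libdevice.10.bc"                   # where the file was reportedly seen

print("expected location exists:", expected.exists())
if not expected.exists() and found.exists():
    expected.parent.mkdir(parents=True, exist_ok=True)
    expected.symlink_to(found)  # make the on-disk layout match what XLA expects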

@tuncK (Contributor, Author) commented Dec 22, 2023

The last time I dealt with this, it had something to do with:

  • the paths it looks in by default
  • versions hardcoded in the filenames (/some/path/xlib.v123.so)
