Unable to train: TensorFlow version problems and cuBLAS issue #3

Open

necrashter opened this issue Mar 12, 2024 · 0 comments

First of all, thank you for making this open source. I've been trying to reproduce your paper in order to build on it, but I haven't been able to train the model by following the instructions in the README.

The py-aiger-sat dependency fails to install due to this issue regardless of the environment, but it doesn't seem to be necessary for training the model.

Conda Install

In a new conda environment, I installed TensorFlow using pip. It pulled version 2.16.1, which satisfies but is far newer than the constraint in setup.py (`tensorflow>=2.1.0`). Training then fails with:

```
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/ilker/deepltl/normal/deepltl/train/train_transformer.py", line 187, in <module>
    run()
  File "/home/ilker/deepltl/normal/deepltl/train/train_transformer.py", line 142, in run
    model = transformer.create_model(vars(params), training=True, custom_pos_enc=params.tree_pos_enc)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ilker/deepltl/normal/deepltl/models/transformer.py", line 31, in create_model
    predictions, _ = transformer(transformer_inputs, training)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ilker/miniconda3/envs/spotltl/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 123, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/ilker/miniconda3/envs/spotltl/lib/python3.11/site-packages/keras/src/layers/layer.py", line 723, in __call__
    raise ValueError(
ValueError: Only input tensors may be passed as positional arguments. The following argument value should be passed as a keyword argument: True (of type <class 'bool'>)
```

I fixed this issue by passing all training and cache arguments as keyword arguments, roughly as in the sketch below.
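
A minimal sketch of the change, based on the call shown in the traceback (other call sites that pass `training` or `cache` positionally need the same treatment):

```python
# deepltl/models/transformer.py, create_model()
# before (rejected by Keras 3, which only accepts input tensors positionally):
#   predictions, _ = transformer(transformer_inputs, training)
# after:
predictions, _ = transformer(transformer_inputs, training=training)
```

After that change, I encountered this issue: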

```
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/ilker/deepltl/normal/deepltl/train/train_transformer.py", line 187, in <module>
    run()
  File "/home/ilker/deepltl/normal/deepltl/train/train_transformer.py", line 142, in run
    model = transformer.create_model(vars(params), training=True, custom_pos_enc=params.tree_pos_enc)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ilker/deepltl/normal/deepltl/models/transformer.py", line 32, in create_model
    predictions = TransformerMetricsLayer(params)([predictions, target])
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ilker/miniconda3/envs/spotltl/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 123, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/ilker/deepltl/normal/deepltl/models/transformer.py", line 80, in call
    self.add_metric(accuracy)
TypeError: Exception encountered when calling TransformerMetricsLayer.call().

Layer.add_metric() takes 1 positional argument but 2 were given

Arguments received by TransformerMetricsLayer.call():
  • args=(['<KerasTensor shape=(None, None, 16), dtype=float32, sparse=False, name=keras_tensor_67>', '<KerasTensor shape=(None, None), dtype=int32, sparse=None, name=target>'],)
  • kwargs=<class 'inspect._empty'>
```
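
For reference, my understanding is that Keras 3 has effectively removed `Layer.add_metric()`, and the intended replacement is to track metric objects on the model and update them explicitly, e.g. in a custom `train_step`. A generic, untested sketch of that pattern (the class name and the plain sparse-categorical accuracy below are placeholders, not the repository's masked accuracy computation):

```python
import keras
import tensorflow as tf

class TransformerWithMetrics(keras.Model):
    """Keras 3 style: the model owns its metric objects instead of a layer
    calling add_metric() inside call()."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.loss_tracker = keras.metrics.Mean(name="loss")
        self.acc_tracker = keras.metrics.Mean(name="accuracy")

    def train_step(self, data):
        x, y = data
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compute_loss(y=y, y_pred=y_pred)
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        # Placeholder metric; the repo computes a masked per-token accuracy instead.
        acc = keras.metrics.sparse_categorical_accuracy(y, y_pred)
        self.loss_tracker.update_state(loss)
        self.acc_tracker.update_state(acc)
        return {"loss": self.loss_tracker.result(), "accuracy": self.acc_tracker.result()}

    @property
    def metrics(self):
        # Listing the trackers here lets Keras reset them at the start of each epoch.
        return [self.loss_tracker, self.acc_tracker]
```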

At this point, I gave up trying to run the model with the new TensorFlow version. However, TensorFlow 2.1.0 is no longer available on pip, so I installed the official Docker image of that TensorFlow version instead.
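
(As an aside, my understanding is that the last pre-Keras-3 release can still be installed from PyPI on Python 3.9–3.11; I have not checked whether the code actually runs on it, so the pin below is only a guess.)

```bash
# Untested: last TensorFlow release that still bundles Keras 2.
# On Linux, "tensorflow[and-cuda]==2.15.1" additionally pulls in the CUDA libraries.
pip install "tensorflow==2.15.1"
```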

Docker

Here are the commands I executed:

```bash
docker pull tensorflow/tensorflow:2.1.0-gpu-py3-jupyter
# For using GPU in docker:
sudo apt install nvidia-container-toolkit
docker run -u $(id -u):$(id -g) -it --mount type=bind,source=.,target=/tf/deepltl --gpus=all tensorflow/tensorflow:2.1.0-gpu-py3-jupyter bash
# Run inside docker:
python -m deepltl.train.train_transformer --problem='ltl' --ds-name='ltl-35' --epochs=5
```

It started training, but at the end of the first epoch, it gave an error:

```
2024-03-12 07:50:20.147556: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_NOT_SUPPORTED
2024-03-12 07:50:20.147617: E tensorflow/stream_executor/cuda/cuda_blas.cc:2301] Internal: failed BLAS call, see log for details
2024-03-12 07:50:20.147665: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Internal: Blas xGEMMBatched launch failed : a.shape=[400,35,32], b.shape=[400,35,32], m=35, n=35, k=32, batch_size=400
         [[{{node model/transformer/transformer_encoder/transformer_encoder_layer/multi_head_attention/MatMul}}]]
         [[Reshape_640/_568]]
2024-03-12 07:50:20.147711: F tensorflow/core/common_runtime/gpu/gpu_util.cc:291] GPU->CPU Memcpy failed
2024-03-12 07:50:20.147792: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Internal: Blas xGEMMBatched launch failed : a.shape=[400,35,32], b.shape=[400,35,32], m=35, n=35, k=32, batch_size=400
         [[{{node model/transformer/transformer_encoder/transformer_encoder_layer/multi_head_attention/MatMul}}]]
Aborted (core dumped)
```

I couldn't find anything useful about this error on the internet.
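
My only guess (unverified) is a GPU/toolkit mismatch: the 2.1.0 image ships CUDA 10.1, which predates Ampere-class GPUs, so the batched GEMM kernel may simply not be supported on newer cards. The card's compute capability can be checked with the command below (the `compute_cap` query field needs a reasonably recent driver):

```bash
nvidia-smi --query-gpu=name,compute_cap --format=csv
```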


I don't know whether you plan to maintain this repository, but I would appreciate it if you could either update it for the new TensorFlow version or provide working instructions for running it with the old one.

If you don't want to do that, the TensorBoard logs of your training runs would also be useful to me. I'm trying to port the code to PyTorch (UPDATE: my PyTorch port is available here), and I would like to compare loss and accuracy values to make sure my implementation is correct.

Thanks in advance.
