Unable to train: TensorFlow version problems and cuBLAS issue #3

Open

necrashter opened this issue Mar 12, 2024 · 0 comments

First of all, thank you for making this open source. I've been trying to reproduce your paper in order to build on it, but I haven't been able to train the model by following the instructions in the README.

The py-aiger-sat dependency fails to install due to this issue regardless of the environment, but it doesn't seem to be necessary for training the model.

Conda Install

In a new conda environment, I installed TensorFlow using pip. It pulled version 2.16.1, which satisfies but is far newer than the constraint in setup.py (`tensorflow>=2.1.0`). Training then fails with:

```
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/ilker/deepltl/normal/deepltl/train/train_transformer.py", line 187, in <module>
    run()
  File "/home/ilker/deepltl/normal/deepltl/train/train_transformer.py", line 142, in run
    model = transformer.create_model(vars(params), training=True, custom_pos_enc=params.tree_pos_enc)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ilker/deepltl/normal/deepltl/models/transformer.py", line 31, in create_model
    predictions, _ = transformer(transformer_inputs, training)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ilker/miniconda3/envs/spotltl/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 123, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/ilker/miniconda3/envs/spotltl/lib/python3.11/site-packages/keras/src/layers/layer.py", line 723, in __call__
    raise ValueError(
ValueError: Only input tensors may be passed as positional arguments. The following argument value should be passed as a keyword argument: True (of type <class 'bool'>)
```

I fixed this issue by passing all training and cache arguments as keyword arguments, roughly as in the sketch below.
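
A minimal sketch of the change, based on the call shown in the traceback (other call sites that pass `training` or `cache` positionally need the same treatment):

```python
# deepltl/models/transformer.py, create_model()
# before (rejected by Keras 3, which only accepts input tensors positionally):
#   predictions, _ = transformer(transformer_inputs, training)
# after:
predictions, _ = transformer(transformer_inputs, training=training)
```

After that change, I encountered this issue: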

```
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/ilker/deepltl/normal/deepltl/train/train_transformer.py", line 187, in <module>
    run()
  File "/home/ilker/deepltl/normal/deepltl/train/train_transformer.py", line 142, in run
    model = transformer.create_model(vars(params), training=True, custom_pos_enc=params.tree_pos_enc)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ilker/deepltl/normal/deepltl/models/transformer.py", line 32, in create_model
    predictions = TransformerMetricsLayer(params)([predictions, target])
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ilker/miniconda3/envs/spotltl/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 123, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/ilker/deepltl/normal/deepltl/models/transformer.py", line 80, in call
    self.add_metric(accuracy)
TypeError: Exception encountered when calling TransformerMetricsLayer.call().

Layer.add_metric() takes 1 positional argument but 2 were given

Arguments received by TransformerMetricsLayer.call():
  • args=(['<KerasTensor shape=(None, None, 16), dtype=float32, sparse=False, name=keras_tensor_67>', '<KerasTensor shape=(None, None), dtype=int32, sparse=None, name=target>'],)
  • kwargs=<class 'inspect._empty'>
```
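
For reference, my understanding is that Keras 3 has effectively removed `Layer.add_metric()`, and the intended replacement is to track metric objects on the model and update them explicitly, e.g. in a custom `train_step`. A generic, untested sketch of that pattern (the class name and the plain sparse-categorical accuracy below are placeholders, not the repository's masked accuracy computation):

```python
import keras
import tensorflow as tf

class TransformerWithMetrics(keras.Model):
    """Keras 3 style: the model owns its metric objects instead of a layer
    calling add_metric() inside call()."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.loss_tracker = keras.metrics.Mean(name="loss")
        self.acc_tracker = keras.metrics.Mean(name="accuracy")

    def train_step(self, data):
        x, y = data
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compute_loss(y=y, y_pred=y_pred)
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        # Placeholder metric; the repo computes a masked per-token accuracy instead.
        acc = keras.metrics.sparse_categorical_accuracy(y, y_pred)
        self.loss_tracker.update_state(loss)
        self.acc_tracker.update_state(acc)
        return {"loss": self.loss_tracker.result(), "accuracy": self.acc_tracker.result()}

    @property
    def metrics(self):
        # Listing the trackers here lets Keras reset them at the start of each epoch.
        return [self.loss_tracker, self.acc_tracker]
```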

At this point, I gave up trying to run the model with the new TensorFlow version. However, TensorFlow 2.1.0 is no longer available on pip, so I installed the official Docker image of that TensorFlow version instead.
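
(As an aside, my understanding is that the last pre-Keras-3 release can still be installed from PyPI on Python 3.9–3.11; I have not checked whether the code actually runs on it, so the pin below is only a guess.)

```bash
# Untested: last TensorFlow release that still bundles Keras 2.
# On Linux, "tensorflow[and-cuda]==2.15.1" additionally pulls in the CUDA libraries.
pip install "tensorflow==2.15.1"
```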

Docker

Here are the commands I executed:

```bash
docker pull tensorflow/tensorflow:2.1.0-gpu-py3-jupyter
# For using GPU in docker:
sudo apt install nvidia-container-toolkit
docker run -u $(id -u):$(id -g) -it --mount type=bind,source=.,target=/tf/deepltl --gpus=all tensorflow/tensorflow:2.1.0-gpu-py3-jupyter bash
# Run inside docker:
python -m deepltl.train.train_transformer --problem='ltl' --ds-name='ltl-35' --epochs=5
```

It started training, but at the end of the first epoch, it gave an error:

```
2024-03-12 07:50:20.147556: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_NOT_SUPPORTED
2024-03-12 07:50:20.147617: E tensorflow/stream_executor/cuda/cuda_blas.cc:2301] Internal: failed BLAS call, see log for details
2024-03-12 07:50:20.147665: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Internal: Blas xGEMMBatched launch failed : a.shape=[400,35,32], b.shape=[400,35,32], m=35, n=35, k=32, batch_size=400
         [[{{node model/transformer/transformer_encoder/transformer_encoder_layer/multi_head_attention/MatMul}}]]
         [[Reshape_640/_568]]
2024-03-12 07:50:20.147711: F tensorflow/core/common_runtime/gpu/gpu_util.cc:291] GPU->CPU Memcpy failed
2024-03-12 07:50:20.147792: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Internal: Blas xGEMMBatched launch failed : a.shape=[400,35,32], b.shape=[400,35,32], m=35, n=35, k=32, batch_size=400
         [[{{node model/transformer/transformer_encoder/transformer_encoder_layer/multi_head_attention/MatMul}}]]
Aborted (core dumped)
```

I couldn't find anything useful about this error on the internet.
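
My only guess (unverified) is a GPU/toolkit mismatch: the 2.1.0 image ships CUDA 10.1, which predates Ampere-class GPUs, so the batched GEMM kernel may simply not be supported on newer cards. The card's compute capability can be checked with the command below (the `compute_cap` query field needs a reasonably recent driver):

```bash
nvidia-smi --query-gpu=name,compute_cap --format=csv
```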


I don't know whether you plan to maintain this repository, but I would appreciate it if you could either update it for the new TensorFlow version or provide working instructions for running it with the old one.

If you don't want to do that, the TensorBoard logs of your training runs would also be useful to me. I'm trying to port the code to PyTorch (UPDATE: my PyTorch port is available here), and I would like to compare loss and accuracy values to make sure my implementation is correct.

Thanks in advance.
