Tensorflow 2.0 - Segmentation fault #168

Closed
ffent opened this issue Nov 29, 2019 · 9 comments

ffent commented Nov 29, 2019

I receive a segmentation fault when running any of the tf_ops functions as part of a model.

Docker container

  • Ubuntu 18.04.2 LTS
  • Kernel 4.15.0-66-generic
  • cuDNN 7.4.1.5-1
  • CUDA 10.0
  • Python 3.6.8
  • tensorflow-gpu 2.0.0
  • gcc/g++ 4.8.5

Host System

  • Ubuntu 16.04.6 LTS
  • Kernel 4.15.0-66-generic
  • NVIDIA driver 418.87.01

Issue details
I successfully compiled the tf_ops functions inside the Docker container, following the instructions and the comment in pull request #154. I also tried to follow the instructions in post #152, but I was not able to compile the files with gcc 7.4.0 and could not install gcc 7.3.1, so I downgraded to gcc/g++ 4.8.

After compiling the tf_ops functions, I was able to call them from Python and could even integrate them into a custom keras.layers.Layer class and run it successfully. But if I try to use this layer or function in a tf.keras.Model, I get a segmentation fault.

See code samples below.

Works with a single layer

# Imports
from __future__ import absolute_import, division, print_function, unicode_literals
import sys
import os
import tensorflow as tf
from tensorflow.python.framework import ops
import numpy as np
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(BASE_DIR)
sampling_module = tf.load_op_library(os.path.join(BASE_DIR, 'tf_sampling_so.so'))

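# Custom Keras layer wrapping the compiled farthest point sampling op
class Farthest_Point_Sample(tf.keras.layers.Layer):
    def __init__(self,
                 npoint,
                 trainable=True,
                 name=None,
                 dtype=None,
                 **kwargs):
        # Forward trainable and dtype too, so they are not silently dropped.
        super(Farthest_Point_Sample, self).__init__(
            trainable=trainable, name=name, dtype=dtype, **kwargs)
        self.npoint = npoint  # number of points to sample per cloud

    def call(self, inputs):
        # Returns the indices of the sampled points, shape (batch_size, npoint).
        return sampling_module.farthest_point_sample(inputs, self.npoint)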

if __name__ == "__main__":
    xyz = tf.constant(np.random.random((1,264,3)).astype('float32'))
    npoint = 16
    out = Farthest_Point_Sample(npoint=npoint)(xyz)
    print(out)

Output

root@668553f390b7:/RadarDeepLearning# /usr/bin/python3 /RadarDeepLearning/tf_ops/sampling/temp.py
2019-11-29 09:50:43.205934: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2019-11-29 09:50:43.263470: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.62
pciBusID: 0000:21:00.0
2019-11-29 09:50:43.263510: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-11-29 09:50:43.264522: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-11-29 09:50:43.265440: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2019-11-29 09:50:43.265679: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2019-11-29 09:50:43.266814: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2019-11-29 09:50:43.267705: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2019-11-29 09:50:43.270468: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-11-29 09:50:43.271931: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-11-29 09:50:43.272236: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-11-29 09:50:43.294688: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
2019-11-29 09:50:43.295488: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4b0c430 executing computations on platform Host. Devices:
2019-11-29 09:50:43.295544: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2019-11-29 09:50:43.388524: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4b0e2c0 executing computations on platform CUDA. Devices:
2019-11-29 09:50:43.388573: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce RTX 2070, Compute Capability 7.5
2019-11-29 09:50:43.390315: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.62
pciBusID: 0000:21:00.0
2019-11-29 09:50:43.390383: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-11-29 09:50:43.390410: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-11-29 09:50:43.390432: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2019-11-29 09:50:43.390483: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2019-11-29 09:50:43.390506: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2019-11-29 09:50:43.390529: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2019-11-29 09:50:43.390552: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-11-29 09:50:43.393793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-11-29 09:50:43.393848: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-11-29 09:50:43.397133: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-29 09:50:43.397167: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2019-11-29 09:50:43.397183: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2019-11-29 09:50:43.400539: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6791 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:21:00.0, compute capability: 7.5)
tf.Tensor([[  0  41  45 210 160 157 129 221 126 177 242 107   6 146 216 256]], shape=(1, 16), dtype=int32)

Fails with a model:

# Imports
from __future__ import absolute_import, division, print_function, unicode_literals
import sys
import os
import tensorflow as tf
from tensorflow.python.framework import ops
import numpy as np
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(BASE_DIR)
sampling_module = tf.load_op_library(os.path.join(BASE_DIR, 'tf_sampling_so.so'))

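# Custom Keras layer wrapping the compiled farthest point sampling op
class Farthest_Point_Sample(tf.keras.layers.Layer):
    def __init__(self,
                 npoint,
                 trainable=True,
                 name=None,
                 dtype=None,
                 **kwargs):
        # Forward trainable and dtype too, so they are not silently dropped.
        super(Farthest_Point_Sample, self).__init__(
            trainable=trainable, name=name, dtype=dtype, **kwargs)
        self.npoint = npoint  # number of points to sample per cloud

    def call(self, inputs):
        # Returns the indices of the sampled points, shape (batch_size, npoint).
        return sampling_module.farthest_point_sample(inputs, self.npoint)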

if __name__ == "__main__":
    xyz = tf.keras.Input(shape=(1,264,3), name='xyz', dtype= tf.float32)
    out = Farthest_Point_Sample(npoint=16)(xyz)
    model = tf.keras.Model(inputs=[xyz], outputs=[out])
    model.summary()

Output:

root@668553f390b7:/RadarDeepLearning# /usr/bin/python3 /RadarDeepLearning/tf_ops/sampling/temp.py
Segmentation fault (core dumped)

Does anyone have an idea how to fix this in TensorFlow 2.0.0?

Thanks

ffent commented Nov 29, 2019

Further information from the core dump file:

#0  0x00007f5c3efbac62 in tensorflow::shape_inference::InferenceContext::WithRank(tensorflow::shape_inference::ShapeHandle, long long, tensorflow::shape_inference::ShapeHandle*) ()
   from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.2
#1  0x00007f5c2b252e51 in {lambda(tensorflow::shape_inference::InferenceContext*)#2}::_FUN(tensorflow::shape_inference::InferenceContext*) () from /RadarDeepLearning/tf_ops/sampling/tf_sampling_so.so
#2  0x00007f5c2b25374c in std::_Function_handler<tensorflow::Status (tensorflow::shape_inference::InferenceContext*), tensorflow::Status (*)(tensorflow::shape_inference::InferenceContext*)>::_M_invoke(std::_Any_data const&, tensorflow::shape_inference::InferenceContext*) () from /RadarDeepLearning/tf_ops/sampling/tf_sampling_so.so
#3  0x00007f5c3efb6822 in tensorflow::shape_inference::InferenceContext::Run(std::function<tensorflow::Status (tensorflow::shape_inference::InferenceContext*)> const&) ()
   from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.2
#4  0x00007f5c483af974 in tensorflow::ShapeRefiner::RunShapeFn(tensorflow::Node const*, tensorflow::OpRegistrationData const*, tensorflow::ExtendedInferenceContext*) ()
   from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so
#5  0x00007f5c483b1685 in tensorflow::ShapeRefiner::AddNode(tensorflow::Node const*) () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so
#6  0x00007f5c425a7572 in TF_FinishOperation () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so
#7  0x00007f5c421d4936 in _wrap_TF_FinishOperation () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so
#8  0x00000000005097cf in ?? ()
#9  0x000000000050b4a9 in _PyEval_EvalFrameDefault ()
#10 0x0000000000508c69 in ?? ()
#11 0x000000000050999d in ?? ()
#12 0x000000000050b4a9 in _PyEval_EvalFrameDefault ()
#13 0x0000000000507125 in ?? ()
#14 0x0000000000508794 in _PyFunction_FastCallDict ()
#15 0x00000000005940d1 in ?? ()
#16 0x000000000054945f in ?? ()
#17 0x0000000000550b91 in ?? ()
#18 0x00000000005a95fc in _PyObject_FastCallKeywords ()
#19 0x0000000000509ad3 in ?? ()
#20 0x000000000050c36e in _PyEval_EvalFrameDefault ()
#21 0x0000000000507125 in ?? ()
#22 0x0000000000508fa0 in ?? ()
#23 0x000000000050999d in ?? ()
#24 0x000000000050b4a9 in _PyEval_EvalFrameDefault ()
#25 0x0000000000507125 in ?? ()
#26 0x0000000000508fa0 in ?? ()
#27 0x000000000050999d in ?? ()
#28 0x000000000050c36e in _PyEval_EvalFrameDefault ()
#29 0x0000000000507125 in ?? ()
#30 0x0000000000508fa0 in ?? ()
#31 0x000000000050999d in ?? ()
#32 0x000000000050c36e in _PyEval_EvalFrameDefault ()
#33 0x0000000000507125 in ?? ()
#34 0x000000000058822d in ?? ()
#35 0x000000000059f50e in PyObject_Call ()
#36 0x000000000050c854 in _PyEval_EvalFrameDefault ()
#37 0x0000000000507125 in ?? ()
#38 0x0000000000508fa0 in ?? ()
#39 0x000000000050999d in ?? ()
#40 0x000000000050b4a9 in _PyEval_EvalFrameDefault ()
#41 0x0000000000507125 in ?? ()
#42 0x0000000000508fa0 in ?? ()
#43 0x000000000050999d in ?? ()
#44 0x000000000050b4a9 in _PyEval_EvalFrameDefault ()
#45 0x0000000000507125 in ?? ()
#46 0x00000000005881aa in ?? ()
#47 0x000000000059f50e in PyObject_Call ()
#48 0x000000000050c854 in _PyEval_EvalFrameDefault ()
#49 0x0000000000507125 in ?? ()
#50 0x0000000000508fa0 in ?? ()
#51 0x000000000050999d in ?? ()
#52 0x000000000050b4a9 in _PyEval_EvalFrameDefault ()
#53 0x0000000000507125 in ?? ()
#54 0x00000000005881aa in ?? ()
#55 0x000000000059f50e in PyObject_Call ()
#56 0x000000000050c854 in _PyEval_EvalFrameDefault ()
#57 0x0000000000507125 in ?? ()
#58 0x0000000000508537 in _PyFunction_FastCallDict ()
#59 0x00000000005940d1 in ?? ()
#60 0x0000000000549f41 in ?? ()
#61 0x00000000005a95fc in _PyObject_FastCallKeywords ()
#62 0x0000000000509ad3 in ?? ()
#63 0x000000000050b4a9 in _PyEval_EvalFrameDefault ()
#64 0x0000000000507125 in ?? ()
#65 0x000000000050a3b3 in PyEval_EvalCode ()
#66 0x00000000006349e2 in ?? ()
#67 0x0000000000634a97 in PyRun_FileExFlags ()
#68 0x000000000063824f in PyRun_SimpleFileExFlags ()
#69 0x0000000000638df1 in Py_Main ()
#70 0x00000000004b0de0 in main ()
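
A native backtrace like the one above shows where the C++ code crashed, but not which Python line triggered it. As a complementary sketch (not part of the original report), Python's standard faulthandler module can dump the Python stack when the process receives a fatal signal such as SIGSEGV. The helper name `enable_crash_traces` and the log file name are hypothetical and only serve as an illustration:

```python
import faulthandler
import os
import tempfile


def enable_crash_traces(log_path):
    """Write the Python traceback of every thread to log_path on a fatal signal.

    `enable_crash_traces` and `log_path` are illustrative names; they are not
    part of TensorFlow or the tf_ops code.
    """
    # The file object must stay open for as long as faulthandler is enabled,
    # so it is returned to the caller instead of being closed here.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Call this before tf.load_op_library(...) and before building the model.
crash_log = enable_crash_traces(
    os.path.join(tempfile.gettempdir(), "tf_ops_crash.log"))
```

Running the script with `python3 -X faulthandler` should give a similar result on stderr without changing the code.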

@dgriffiths3

If it helps, I have written a wrapper for the pointnet++ layers to work as tf.keras.layer layers. https://github.com/dgriffiths3/pointnet2-tensorflow2

ffent commented Dec 10, 2019

@dgriffiths3 Thank you very much for sharing your repository, great work. Could you give me some more information about your system setup? I think the problem lies in the compiled tf_ops functions.

Could you provide the following information:

  • Operating system
  • cuDNN version
  • Python version
  • gcc version
  • bazel version

Many thanks

@dgriffiths3

@ffent I am using:
Ubuntu 18.04
cuDNN 7.6.4
CUDA 10.0
Python 3.7
GCC 7.4.0

I installed TensorFlow from the precompiled binaries through pip. If you run into an error with the compile script, feel free to raise an issue on the repository.
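
Since the ops are compiled against a pip-installed TensorFlow, it may help to check which compiler and linker flags that build expects. The following is a minimal sketch, not the repository's compile script. `tf.sysconfig.get_compile_flags()` and `tf.sysconfig.get_link_flags()` are the TensorFlow helpers intended for building custom ops, while `find_cxx11_abi_flag` and `print_tf_build_flags` are hypothetical names introduced here. Whether a flag or ABI mismatch caused the crash in this issue is not confirmed anywhere in the thread.

```python
def find_cxx11_abi_flag(compile_flags):
    """Return the value of -D_GLIBCXX_USE_CXX11_ABI from a list of flags, or None."""
    prefix = "-D_GLIBCXX_USE_CXX11_ABI="
    for flag in compile_flags:
        if flag.startswith(prefix):
            return flag[len(prefix):]
    return None


def print_tf_build_flags():
    """Print the flags the installed TensorFlow expects custom ops to use."""
    # Imported lazily so that find_cxx11_abi_flag works without TensorFlow.
    import tensorflow as tf

    compile_flags = tf.sysconfig.get_compile_flags()
    link_flags = tf.sysconfig.get_link_flags()
    print("compile flags:", " ".join(compile_flags))
    print("link flags:   ", " ".join(link_flags))
    print("CXX11 ABI:    ", find_cxx11_abi_flag(compile_flags))
```

Calling `print_tf_build_flags()` inside the container and comparing the output with the flags used in the compile script is one way to rule out a mismatch between the op library and the TensorFlow binary.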

akloss commented Dec 13, 2019

Not sure if this is related or helpful, but I found that passing tensors with the wrong shape to the tf_ops can cause segmentation faults (e.g. if the higher features you pass into the fp_module have shape [batch_size, npoints] instead of [batch_size, npoints, nchannels]). So it might be worth checking whether the shapes of your tensors are correct.
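
A shape check of this kind could be sketched as follows. This is only an illustration: `check_rank` is a hypothetical helper, and the expected rank of 3 assumes the op takes points as [batch_size, npoints, 3], as in the working single-layer example earlier in the thread. Note that tf.keras.Input(shape=(1, 264, 3)) prepends a batch dimension, so the resulting symbolic tensor has rank 4.

```python
def check_rank(shape, expected_rank, name="tensor"):
    """Raise ValueError if a shape does not have the expected number of dimensions.

    `shape` is any sequence of dimensions, e.g. a tensor's static shape.
    Unknown dimensions (None) are allowed; only the rank is checked.
    """
    shape = tuple(shape)
    if len(shape) != expected_rank:
        raise ValueError(
            "{} has shape {} (rank {}), expected rank {}".format(
                name, shape, len(shape), expected_rank))
    return shape


# A point cloud batch of shape [batch_size, npoints, 3] passes the check.
check_rank((1, 264, 3), 3, name="xyz")

# The shape produced by tf.keras.Input(shape=(1, 264, 3)) has rank 4 and fails.
try:
    check_rank((None, 1, 264, 3), 3, name="xyz")
except ValueError as err:
    print(err)
```

Inside a layer, the helper could be called as `check_rank(inputs.shape, 3)` at the top of `call`, so that a wrong shape raises a Python exception before the custom op is reached.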

Also, for me, the compiled tf_ops worked under TensorFlow 2 and TensorFlow 1.13, but not with 1.14. However, I was not using the Model API.

@dgriffiths3

Usually a 'segmentation fault' error when running a C++ op in Python just means an unspecified error in the C++ code, so it could be anything. As @akloss says, a wrong shape or a wrong data type is a likely cause.

ffent commented Dec 27, 2019

Thanks to the suggestions from @dgriffiths3, I was able to compile the tf_ops and integrate them under TensorFlow 2.0. Please have a look at his repository.

My final docker setup is the following:

  • Ubuntu 18.04.3 LTS
  • Kernel 5.0.0-37-generic
  • cuDNN 7.6.2.24-1
  • CUDA 10.0 (V10.0.130)
  • Python 3.6.8
  • Tensorflow 2.0
  • gcc 7.4.0

Host System

  • Ubuntu 18.04.3 LTS
  • Kernel 5.0.0-37-generic
  • NVIDIA driver 435.21

Therefore I will close this issue.

@ffent ffent closed this as completed Dec 27, 2019

chongma commented Aug 3, 2021

Could you provide a copy of a working docker configuration?

ffent commented Aug 8, 2021

@chongma you can have a look at the RadarSeg repository. There you can find a working Dockerfile as well as some further information on how to compile the tf_ops on different environments (here).

This repository also provides a complete implementation of all PointNet++ layers as TensorFlow 2.x compatible Keras layers.
