Tensorflow 2.0 - Segmentation fault #168

ffent · 2019-11-29T10:05:58Z

I receive a segmentation fault when running any of the tf_ops functions as part of a model.

Docker container

Ubuntu 18.04.2 LTS
Kernel 4.15.0-66-generic
cuDNN 7.4.1.5-1
CUDA 10.0
Python 3.6.8
tensorflow-gpu 2.0.0
gcc/g++ 4.8.5

Host System

Ubuntu 16.04.6 LTS
Kernel 4.15.0-66-generic
NVIDIA driver 418.87.01

Issue details
I successfully compiled the tf_ops functions inside the Docker container, following the instructions and comments in pull request #154. I also tried to follow the instructions in #152, but I could not compile the files with gcc 7.4.0 and could not install gcc 7.3.1, so I downgraded to gcc/g++ 4.8.

After compiling the tf_ops functions, I was able to run them in Python, and I could even integrate them into a custom keras.layers.Layer class and run it successfully. However, as soon as I use this layer or function inside a tf.keras.Model, I get a segmentation fault.

See code samples below.

Works with a single layer

# Imports
from __future__ import absolute_import, division, print_function, unicode_literals
import sys
import os
import tensorflow as tf
from tensorflow.python.framework import ops
import numpy as np
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(BASE_DIR)
sampling_module = tf.load_op_library(os.path.join(BASE_DIR, 'tf_sampling_so.so'))

# Custom keras layer
class Farthest_Point_Sample(tf.keras.layers.Layer):
    def __init__(self,
                 npoint,
                 trainable=True,
                 name=None,
                 dtype=None,
                 **kwargs):
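        # Forward trainable and dtype as well, so they are not silently dropped.
        super(Farthest_Point_Sample, self).__init__(trainable=trainable, name=name, dtype=dtype, **kwargs)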
        self.npoint = npoint
    
    def call(self,inputs):
        return sampling_module.farthest_point_sample(inputs, self.npoint)

if __name__ == "__main__":
    xyz = tf.constant(np.random.random((1,264,3)).astype('float32'))
    npoint = 16
    out = Farthest_Point_Sample(npoint=npoint)(xyz)
    print(out)

Output

root@668553f390b7:/RadarDeepLearning# /usr/bin/python3 /RadarDeepLearning/tf_ops/sampling/temp.py
2019-11-29 09:50:43.205934: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2019-11-29 09:50:43.263470: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.62
pciBusID: 0000:21:00.0
2019-11-29 09:50:43.263510: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-11-29 09:50:43.264522: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-11-29 09:50:43.265440: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2019-11-29 09:50:43.265679: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2019-11-29 09:50:43.266814: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2019-11-29 09:50:43.267705: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2019-11-29 09:50:43.270468: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-11-29 09:50:43.271931: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-11-29 09:50:43.272236: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-11-29 09:50:43.294688: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
2019-11-29 09:50:43.295488: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4b0c430 executing computations on platform Host. Devices:
2019-11-29 09:50:43.295544: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2019-11-29 09:50:43.388524: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4b0e2c0 executing computations on platform CUDA. Devices:
2019-11-29 09:50:43.388573: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce RTX 2070, Compute Capability 7.5
2019-11-29 09:50:43.390315: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.62
pciBusID: 0000:21:00.0
2019-11-29 09:50:43.390383: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-11-29 09:50:43.390410: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-11-29 09:50:43.390432: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2019-11-29 09:50:43.390483: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2019-11-29 09:50:43.390506: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2019-11-29 09:50:43.390529: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2019-11-29 09:50:43.390552: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-11-29 09:50:43.393793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-11-29 09:50:43.393848: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-11-29 09:50:43.397133: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-29 09:50:43.397167: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2019-11-29 09:50:43.397183: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2019-11-29 09:50:43.400539: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6791 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:21:00.0, compute capability: 7.5)
tf.Tensor([[  0  41  45 210 160 157 129 221 126 177 242 107   6 146 216 256]], shape=(1, 16), dtype=int32)
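To sanity-check indices like the ones printed above, the op's output can be compared against a slow reference implementation. The following is a minimal pure-Python sketch, assuming the compiled op performs greedy farthest point sampling with squared Euclidean distance and starts from index 0 (the output above begins with 0, which is consistent with that, but the thread does not confirm it). The function name and signature are hypothetical and are not part of tf_ops.

```python
def farthest_point_sample_reference(points, npoint):
    """Greedy farthest point sampling over a list of (x, y, z) tuples.

    Returns the indices of the selected points, in selection order.
    """
    if not 0 < npoint <= len(points):
        raise ValueError("npoint must be between 1 and len(points)")

    selected = [0]  # assumption: sampling starts from the first point
    # Squared distance from each point to its nearest already-selected point.
    min_dist = [float("inf")] * len(points)

    for _ in range(npoint - 1):
        last = points[selected[-1]]
        for i, p in enumerate(points):
            d = sum((a - b) ** 2 for a, b in zip(p, last))
            if d < min_dist[i]:
                min_dist[i] = d
        # Pick the point that is farthest from everything selected so far.
        selected.append(max(range(len(points)), key=min_dist.__getitem__))
    return selected
```

For a single batch entry, passing `xyz[0].numpy().tolist()` and the same `npoint` should give indices comparable to the op's output, provided the assumptions above hold.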

Fails with a model:

# Imports
from __future__ import absolute_import, division, print_function, unicode_literals
import sys
import os
import tensorflow as tf
from tensorflow.python.framework import ops
import numpy as np
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(BASE_DIR)
sampling_module = tf.load_op_library(os.path.join(BASE_DIR, 'tf_sampling_so.so'))

# Custom keras layer
class Farthest_Point_Sample(tf.keras.layers.Layer):
    def __init__(self,
                 npoint,
                 trainable=True,
                 name=None,
                 dtype=None,
                 **kwargs):
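        # Forward trainable and dtype as well, so they are not silently dropped.
        super(Farthest_Point_Sample, self).__init__(trainable=trainable, name=name, dtype=dtype, **kwargs)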
        self.npoint = npoint
    
    def call(self,inputs):
        return sampling_module.farthest_point_sample(inputs, self.npoint)

if __name__ == "__main__":
    xyz = tf.keras.Input(shape=(1,264,3), name='xyz', dtype= tf.float32)
    out = Farthest_Point_Sample(npoint=16)(xyz)
    model = tf.keras.Model(inputs=[xyz], outputs=[out])
    model.summary()

Output:

root@668553f390b7:/RadarDeepLearning# /usr/bin/python3 /RadarDeepLearning/tf_ops/sampling/temp.py
Segmentation fault (core dumped)

Does anyone have an idea how to fix this in TensorFlow 2.0.0?

Thanks


ffent · 2019-11-29T10:41:43Z

Further information from the core dump file:

#0  0x00007f5c3efbac62 in tensorflow::shape_inference::InferenceContext::WithRank(tensorflow::shape_inference::ShapeHandle, long long, tensorflow::shape_inference::ShapeHandle*) ()
   from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.2
#1  0x00007f5c2b252e51 in {lambda(tensorflow::shape_inference::InferenceContext*)#2}::_FUN(tensorflow::shape_inference::InferenceContext*) () from /RadarDeepLearning/tf_ops/sampling/tf_sampling_so.so
#2  0x00007f5c2b25374c in std::_Function_handler<tensorflow::Status (tensorflow::shape_inference::InferenceContext*), tensorflow::Status (*)(tensorflow::shape_inference::InferenceContext*)>::_M_invoke(std::_Any_data const&, tensorflow::shape_inference::InferenceContext*) () from /RadarDeepLearning/tf_ops/sampling/tf_sampling_so.so
#3  0x00007f5c3efb6822 in tensorflow::shape_inference::InferenceContext::Run(std::function<tensorflow::Status (tensorflow::shape_inference::InferenceContext*)> const&) ()
   from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.2
#4  0x00007f5c483af974 in tensorflow::ShapeRefiner::RunShapeFn(tensorflow::Node const*, tensorflow::OpRegistrationData const*, tensorflow::ExtendedInferenceContext*) ()
   from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so
#5  0x00007f5c483b1685 in tensorflow::ShapeRefiner::AddNode(tensorflow::Node const*) () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so
#6  0x00007f5c425a7572 in TF_FinishOperation () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so
#7  0x00007f5c421d4936 in _wrap_TF_FinishOperation () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so
#8  0x00000000005097cf in ?? ()
#9  0x000000000050b4a9 in _PyEval_EvalFrameDefault ()
#10 0x0000000000508c69 in ?? ()
#11 0x000000000050999d in ?? ()
#12 0x000000000050b4a9 in _PyEval_EvalFrameDefault ()
#13 0x0000000000507125 in ?? ()
#14 0x0000000000508794 in _PyFunction_FastCallDict ()
#15 0x00000000005940d1 in ?? ()
#16 0x000000000054945f in ?? ()
#17 0x0000000000550b91 in ?? ()
#18 0x00000000005a95fc in _PyObject_FastCallKeywords ()
#19 0x0000000000509ad3 in ?? ()
#20 0x000000000050c36e in _PyEval_EvalFrameDefault ()
#21 0x0000000000507125 in ?? ()
#22 0x0000000000508fa0 in ?? ()
#23 0x000000000050999d in ?? ()
#24 0x000000000050b4a9 in _PyEval_EvalFrameDefault ()
#25 0x0000000000507125 in ?? ()
#26 0x0000000000508fa0 in ?? ()
#27 0x000000000050999d in ?? ()
#28 0x000000000050c36e in _PyEval_EvalFrameDefault ()
#29 0x0000000000507125 in ?? ()
#30 0x0000000000508fa0 in ?? ()
#31 0x000000000050999d in ?? ()
#32 0x000000000050c36e in _PyEval_EvalFrameDefault ()
#33 0x0000000000507125 in ?? ()
#34 0x000000000058822d in ?? ()
#35 0x000000000059f50e in PyObject_Call ()
#36 0x000000000050c854 in _PyEval_EvalFrameDefault ()
#37 0x0000000000507125 in ?? ()
#38 0x0000000000508fa0 in ?? ()
#39 0x000000000050999d in ?? ()
#40 0x000000000050b4a9 in _PyEval_EvalFrameDefault ()
#41 0x0000000000507125 in ?? ()
#42 0x0000000000508fa0 in ?? ()
#43 0x000000000050999d in ?? ()
#44 0x000000000050b4a9 in _PyEval_EvalFrameDefault ()
#45 0x0000000000507125 in ?? ()
#46 0x00000000005881aa in ?? ()
#47 0x000000000059f50e in PyObject_Call ()
#48 0x000000000050c854 in _PyEval_EvalFrameDefault ()
#49 0x0000000000507125 in ?? ()
#50 0x0000000000508fa0 in ?? ()
#51 0x000000000050999d in ?? ()
#52 0x000000000050b4a9 in _PyEval_EvalFrameDefault ()
#53 0x0000000000507125 in ?? ()
#54 0x00000000005881aa in ?? ()
#55 0x000000000059f50e in PyObject_Call ()
#56 0x000000000050c854 in _PyEval_EvalFrameDefault ()
#57 0x0000000000507125 in ?? ()
#58 0x0000000000508537 in _PyFunction_FastCallDict ()
#59 0x00000000005940d1 in ?? ()
#60 0x0000000000549f41 in ?? ()
#61 0x00000000005a95fc in _PyObject_FastCallKeywords ()
#62 0x0000000000509ad3 in ?? ()
#63 0x000000000050b4a9 in _PyEval_EvalFrameDefault ()
#64 0x0000000000507125 in ?? ()
#65 0x000000000050a3b3 in PyEval_EvalCode ()
#66 0x00000000006349e2 in ?? ()
#67 0x0000000000634a97 in PyRun_FileExFlags ()
#68 0x000000000063824f in PyRun_SimpleFileExFlags ()
#69 0x0000000000638df1 in Py_Main ()
#70 0x00000000004b0de0 in main ()
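Frames #8 and below are unresolved interpreter frames, so the dump does not show which Python call led into the op's shape function. One way to recover that is the standard-library `faulthandler` module, which was not used in this thread. The helper below is a hypothetical sketch: it makes the interpreter write the Python-level traceback when the process receives a fatal signal such as SIGSEGV.

```python
import faulthandler
import sys


def enable_crash_traceback(stream=None):
    """Dump Python tracebacks of all threads to `stream` on fatal signals.

    `stream` must be backed by a real file descriptor and has to stay open
    for as long as the handler is enabled. Defaults to sys.stderr.
    """
    target = stream if stream is not None else sys.stderr
    faulthandler.enable(file=target, all_threads=True)
    return faulthandler.is_enabled()
```

Calling `enable_crash_traceback()` at the top of the script, before `tf.load_op_library`, should be enough. The same handler can be enabled without code changes by running `python3 -X faulthandler temp.py`.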

dgriffiths3 · 2019-12-09T20:29:34Z

If it helps, I have written a wrapper that makes the PointNet++ layers work as tf.keras layers: https://github.com/dgriffiths3/pointnet2-tensorflow2

ffent · 2019-12-10T09:21:54Z

@dgriffiths3 Thank you very much for sharing your repository, great work. Could you give me some more information about your system setup? I think the problem lies with the compiled tf_ops functions.

Could you provide the following information:

Operating system
cuDNN version
Python version
gcc version
bazel version

Many thanks
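Most of the details requested above can be collected with a short script. This is a sketch using only the Python standard library; the helper names are hypothetical. It assumes `gcc` and `bazel` accept a `--version` flag (older Bazel releases may only support `bazel version`), and it does not cover cuDNN, which has to be looked up through the package manager or the cuDNN header.

```python
import platform
import shutil
import subprocess


def tool_version(command):
    """Return the first line of `<command> --version`, or None if unavailable."""
    if shutil.which(command) is None:
        return None
    try:
        result = subprocess.run(
            [command, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


def collect_environment():
    """Gather OS, Python, gcc and bazel versions into a dict."""
    return {
        "os": platform.platform(),
        "python": platform.python_version(),
        "gcc": tool_version("gcc"),
        "bazel": tool_version("bazel"),
    }


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```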

dgriffiths3 · 2019-12-10T12:09:23Z

@ffent I am using:
Ubuntu 18.04
Cudnn 7.6.4
Cuda 10.0
Python 3.7
GCC 7.4.0

I installed TensorFlow from the precompiled binaries through pip. If you run into an error with the compile script, feel free to raise an issue on the repository.

akloss · 2019-12-13T15:00:47Z

Not sure if this is related or helpful, but I found that passing tensors with the wrong shape to the tf_ops can cause segmentation faults (e.g. if the higher-level features you pass into the fp_module have shape [batch_size, npoints] instead of [batch_size, npoints, nchannels]). So it might be worth checking that the shapes of your tensors are correct.

Also, for me, the compiled tf_ops worked under TensorFlow 2 and TensorFlow 1.13, but not under 1.14. However, I was not using the Model API.
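The shape check suggested above can be done in Python before the op is ever reached. The sketch below uses hypothetical helpers (`check_rank` and `keras_input_full_shape` are not part of tf_ops) and assumes the sampling op expects a rank-3 input of shape [batch_size, npoints, 3], as in the working eager example. It also illustrates one thing worth checking in the failing script: `tf.keras.Input(shape=...)` does not include the batch dimension, so `shape=(1, 264, 3)` yields a rank-4 symbolic tensor `(None, 1, 264, 3)`, whereas the eager example passed a rank-3 tensor. Whether that mismatch is what triggers the crash is not confirmed in this thread.

```python
def check_rank(shape, expected_rank, name="input"):
    """Raise ValueError unless `shape` has exactly `expected_rank` dimensions.

    `shape` is any sequence of dimension sizes; None marks an unknown size.
    """
    shape = tuple(shape)
    if len(shape) != expected_rank:
        raise ValueError(
            f"{name}: expected rank {expected_rank}, got rank {len(shape)} {shape}"
        )
    return shape


def keras_input_full_shape(shape):
    """Full shape of a Keras Input declared with `shape` (batch dim prepended)."""
    return (None,) + tuple(shape)


# Eager example from the issue: rank 3, so the check passes.
check_rank((1, 264, 3), 3, name="xyz")

# Model example: Input(shape=(1, 264, 3)) has rank 4 once the batch dim is added.
try:
    check_rank(keras_input_full_shape((1, 264, 3)), 3, name="xyz")
except ValueError as err:
    print(err)
```

Inside the layer's `call`, something like `check_rank(inputs.shape, 3)` would turn a rank mismatch into a Python exception. Declaring the input as `tf.keras.Input(shape=(264, 3))` would give the rank-3 shape `(None, 264, 3)`.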

dgriffiths3 · 2019-12-13T15:03:48Z

Usually a 'segmentation fault' when running a C++ op from Python just means an unhandled error somewhere in the C++ code, so the cause could be anything. As @akloss says, a wrong shape or a wrong data type is a likely cause.

ffent · 2019-12-27T08:48:24Z

Thanks to the suggestions from @dgriffiths3, I was able to compile the tf_ops and integrate them under TensorFlow 2.0. Please have a look at his repository.

My final Docker setup is the following:

Ubuntu 18.04.3 LTS
Kernel 5.0.0-37-generic
cuDNN 7.6.2.24-1
CUDA 10.0 (V10.0.130)
Python 3.6.8
Tensorflow 2.0
gcc 7.4.0

Host System

Ubuntu 18.04.3 LTS
Kernel 5.0.0-37-generic
Driver NVIDIA 435.21

Therefore I will close this issue.

chongma · 2021-08-03T08:06:57Z

Could you provide a copy of a working docker configuration?

ffent · 2021-08-08T08:46:55Z

@chongma You can have a look at the RadarSeg repository. There you can find a working Dockerfile as well as some further information on how to compile the tf_ops in different environments.

This repository also provides a complete implementation of all PointNet++ layers as TensorFlow 2.x compatible Keras layers.

ffent closed this as completed Dec 27, 2019
