[VL] Exit spark-sql will cause core dump in libhdfs.so #8072

Open
liujiayi771 opened this issue Nov 28, 2024 · 11 comments
Labels
bug Something isn't working triage

Comments

@liujiayi771
Contributor

liujiayi771 commented Nov 28, 2024

Backend

VL (Velox)

Bug description

After executing some SQL, exiting the spark-sql command line with Ctrl+C or the quit command causes a core dump. #6172

Spark version

Spark-3.4.x

Spark configurations

No response

System information

No response

Relevant logs

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000154777e0dcb6, pid=1258650, tid=0x00001547a437f640
#
# JRE version: OpenJDK Runtime Environment (8.0_432-b06) (build 1.8.0_432-b06)
# Java VM: OpenJDK 64-Bit Server VM (25.432-b06 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libhdfs.so+0x2cb6]  globalClassReference+0xb6
#
# Core dump written. Default location: /root/core or core.1258650
#
# An error report file with more information is saved as:
# /root/hs_err_pid1258650.log
#
# If you would like to submit a bug report, please visit:
#   https://access.redhat.com/support/cases/
#
#0  0x00001544daa78005 in raise () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x1544c157f640 (LWP 3935772))]
(gdb) bt
#0  0x00001544daa78005 in raise () from /lib64/libc.so.6
#1  0x00001544daa4a894 in abort () from /lib64/libc.so.6
#2  0x00001544d8c144d7 in os::abort(bool) [clone .cold] () from /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.432.b06-2.0.2.1.al8.x86_64/jre/lib/amd64/server/libjvm.so
#3  0x00001544d95dceca in VMError::report_and_die() () from /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.432.b06-2.0.2.1.al8.x86_64/jre/lib/amd64/server/libjvm.so
#4  0x00001544d93c839a in JVM_handle_linux_signal () from /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.432.b06-2.0.2.1.al8.x86_64/jre/lib/amd64/server/libjvm.so
#5  0x00001544d93bb49c in signalHandler(int, siginfo_t*, void*) () from /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.432.b06-2.0.2.1.al8.x86_64/jre/lib/amd64/server/libjvm.so
#6  <signal handler called>
#7  0x0000154495624cb6 in globalClassReference (className=className@entry=0x15449562d5c8 "org/apache/hadoop/fs/FileSystem", env=env@entry=0x1544d87b72e8, out=out@entry=0x1544c157cbf8)
    at xxx/hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfs/jni_helper.c:299
#8  0x0000154495624ef4 in invokeMethod (env=0x1544d87b72e8, retval=0x0, methType=INSTANCE, instObj=0x1544d86c1bd0, className=0x15449562d5c8 "org/apache/hadoop/fs/FileSystem", methName=0x15449562d9da "close",
    methSignature=0x15449562d41a "()V") at xxx/hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfs/jni_helper.c:123
#9  0x0000154495627e80 in hdfsDisconnect (fs=0x1544d86c1bd0) at xxx/hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfs/hdfs.c:880
@liujiayi771 liujiayi771 added bug Something isn't working triage labels Nov 28, 2024
@liujiayi771
Contributor Author

cc @zhouyuan.

@zhouyuan
Contributor

@liujiayi771
thanks for reporting, it looks like unloading libhdfs.so is not working properly in your testing env.
Would it be convenient to also paste the detailed log in /root/hs_err_pid1258650.log?

thanks, -yuan

@liujiayi771
Contributor Author

@zhouyuan I have added the error stack in the description.

@zhouyuan
Contributor

@liujiayi771 is the libhdfs.so from a vanilla HDFS build, or has it been customized?
Based on the stack, it looks like HDFS is trying to invoke the close method
https://github.com/apache/hadoop/blob/branch-3.0/hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfs/jni_helper.c#L123
but cannot find the right symbol(?) and then calls the cleanup function
https://github.com/apache/hadoop/blob/branch-3.0/hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfs/jni_helper.c#L299

thanks,
-yuan

@liujiayi771
Contributor Author

liujiayi771 commented Nov 29, 2024

@zhouyuan We have a customized HDFS, but our hdfs.c and jni_helper.c are the same as in branch-3.0 of the Hadoop repo; we have never modified the libhdfs code. I will also test with vanilla HDFS. Can you reproduce this issue?

@liujiayi771
Contributor Author

I checked the FileSystem class in our codebase, and the close() method, which is a basic interface, definitely hasn't been modified. It's strange that JNI couldn't find this method.

@zhouyuan
Contributor

Hi @liujiayi771, I tried locally but was not able to trigger it. I think we may need to add more guards in the Velox filesystem close().
CC @JkSelf for her comments.

Thanks,
-yuan

@JkSelf
Contributor

JkSelf commented Nov 29, 2024

@liujiayi771 Can you help test by adding this command before running your application? export CLASSPATH=`$HADOOP_HOME/bin/hdfs classpath --glob`
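As a side note on the command above (an assumption about its intent, since GitHub rendering may have eaten the backticks): the classpath has to be captured with command substitution, otherwise the shell never runs `hdfs classpath --glob` and CLASSPATH ends up wrong. A minimal sketch, using a stand-in function with hypothetical jar paths in place of $HADOOP_HOME/bin/hdfs classpath --glob:

```shell
# fake_classpath stands in for `$HADOOP_HOME/bin/hdfs classpath --glob`,
# which is not available here; the jar paths are hypothetical.
fake_classpath() { echo "/opt/jars/a.jar:/opt/jars/b.jar"; }

# Capture the command's stdout with command substitution and export it,
# so the JVM embedded by libhdfs can locate the Hadoop classes.
export CLASSPATH="$(fake_classpath)"
echo "$CLASSPATH"
```

Backticks and $(...) are equivalent forms of command substitution; $(...) is generally preferred because it nests cleanly.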

@liujiayi771
Contributor Author

liujiayi771 commented Nov 29, 2024

@JkSelf I have tested it, and it still results in a core dump. I will investigate this issue further in the next few days.

@liujiayi771
Contributor Author

@JkSelf
It is likely related to the CLASSPATH environment variable. I also encountered the following error message.

Environment variable CLASSPATH not set!
getJNIEnv: getGlobalJNIEnv failed
W20241202 09:39:05.753589 279510 HdfsFileSystem.cpp:58] hdfs disconnect failure in HdfsReadFile close: 255

However, I tried setting the CLASSPATH using the following methods, but none of them worked.

  • Just export CLASSPATH=$(hdfs classpath --glob) before running spark-sql.
  • Add export CLASSPATH=$(hdfs classpath --glob) in spark-env.sh.
  • Set spark.driver/executorEnv.CLASSPATH=$(hdfs classpath --glob) via spark-sql --conf.
  • Set export CLASSPATH=$(hdfs classpath --glob) in /etc/profile.
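For what it's worth, all four approaches above rely on ordinary environment inheritance: an exported variable should reach any child process the shell forks. A minimal sketch with a placeholder jar path:

```shell
# Placeholder value; a real run would use $(hdfs classpath --glob).
export CLASSPATH="/opt/jars/a.jar:/opt/jars/b.jar"

# Exported variables are inherited by child processes, so a forked shell
# (standing in for a spark-sql child process) sees the same value.
sh -c 'echo "$CLASSPATH"'
```

Since that inheritance is guaranteed, the "Environment variable CLASSPATH not set!" message suggests (this is speculation, not confirmed in the thread) that the libhdfs call runs in a process that was not launched from the shell where CLASSPATH was exported, e.g. an executor spawned by the cluster manager.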

@JkSelf
Contributor

JkSelf commented Dec 2, 2024

@liujiayi771 Can you try this command?
[screenshot of a command; not captured in the text]

And can you show me the classpath after setting the above command?

Development

No branches or pull requests

3 participants