[VL]Why is Gluten significantly slower than Spark with the same configuration? #3576

Closed
bbaiyhaor opened this issue Oct 31, 2023 · 11 comments
Labels
bug (Something isn't working), performance, triage

Comments

@bbaiyhaor

Backend

VL (Velox)

Bug description

I've set up a Spark-3.4.1 cluster in the cloud and ran a 100GB TPC-DS benchmark. I used Databricks' tpcds-perf tool for testing, without using Gluten's built-in test code. I expected that Gluten+Velox would outperform Spark, but in reality, it's five times slower than Spark. The top image shows Spark, and the bottom image is Gluten. It's evident that the 'input' column in Gluten has 5 times the amount of data compared to Spark. Could this be the reason why Gluten is five times slower than Spark? Is it due to not using libhdfs?

[Screenshots: Spark (top) and Gluten (bottom) stage metrics, including the 'input' column]

Spark version

None

Spark configurations

version 3.4.1
Using Scala version 2.12.17, OpenJDK 64-Bit Server VM, 1.8.0_382
Branch HEAD
Compiled by user centos on 2023-06-19T23:01:01Z
Revision 6b1ff22dde1ead51cbf370be6e48a802daae58b6
Url https://github.com/apache/spark
Type --help for more information.

System information

Velox System Info v0.0.2
Commit: Not in a git repo.
CMake Version: 3.16.3
System: Linux-5.4.0-164-generic
Arch: x86_64
C++ Compiler: /usr/bin/c++
C++ Compiler Version: 9.4.0
C Compiler: /usr/bin/cc
C Compiler Version: 9.4.0
CMake Prefix Path: /usr/local;/usr;/;/usr;/usr/local;/usr/X11R6;/usr/pkg;/opt

Relevant logs

No response

@bbaiyhaor bbaiyhaor added bug Something isn't working triage labels Oct 31, 2023
@bbaiyhaor bbaiyhaor changed the title from "Why is Gluten significantly slower than Spark with the same configuration?" to "[VL]Why is Gluten significantly slower than Spark with the same configuration?" Oct 31, 2023
@Yohahaha
Contributor

Yohahaha commented Nov 1, 2023

You need to check and compare the scan operator metrics; the total input bytes look abnormal.
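One way to make this comparison concrete is to total the input bytes per run and look at the ratio. The sketch below is only an illustration: it assumes stage records shaped like the entries returned by Spark's monitoring REST API (`/api/v1/applications/<app-id>/stages`), which carry an `inputBytes` field, and the sample values are invented, not taken from this issue.

```python
def total_input_bytes(stages):
    """Sum the inputBytes field over a list of stage records."""
    return sum(stage.get("inputBytes", 0) for stage in stages)


def input_ratio(baseline_stages, candidate_stages):
    """Return candidate input bytes divided by baseline input bytes."""
    baseline = total_input_bytes(baseline_stages)
    if baseline == 0:
        raise ValueError("baseline run read no input bytes")
    return total_input_bytes(candidate_stages) / baseline


# Hypothetical stage records (values invented for illustration only).
spark_stages = [
    {"stageId": 0, "inputBytes": 1_000},
    {"stageId": 1, "inputBytes": 3_000},
]
gluten_stages = [
    {"stageId": 0, "inputBytes": 5_000},
    {"stageId": 1, "inputBytes": 15_000},
]

ratio = input_ratio(spark_stages, gluten_stages)
print(f"Gluten read {ratio:.1f}x the input bytes of Spark")
```

A ratio well above 1 for the same query and data set suggests the scan is reading more than it should, which is worth checking before looking at compute time.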

@bbaiyhaor
Author

After updating the Hadoop configurations, the performance of Gluten+Velox improved significantly. On the 1 TB TPC-DS benchmark, Gluten+Velox finished in 3891 seconds, whereas Spark took 5134 seconds, a 1.32x speed-up for Gluten+Velox without using libhdfs3 and hdfs-client.xml. However, when I tried to configure libhdfs3 and hdfs-client.xml with Gluten+Velox, I hit the following error:
Problematic frame:

# C  [libhdfs3.so.1+0x17de09]  hdfsBuilderSetNameNode+0x19
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /tmp/byrhs_err_pid2509801.log
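
As a quick check on the 1.32x figure reported in this comment, the speed-up is the Spark wall-clock time divided by the Gluten+Velox wall-clock time. The snippet uses only the two timings quoted above; the variable names are illustrative.

```python
# Wall-clock times for the 1 TB TPC-DS run, as reported in this comment.
spark_seconds = 5134
gluten_seconds = 3891

# Speed-up: how many times faster Gluten+Velox finished than Spark.
speedup = spark_seconds / gluten_seconds

# Equivalent view: fraction of Spark's wall-clock time that was saved.
time_saved = 1 - gluten_seconds / spark_seconds

print(f"speed-up: {speedup:.2f}x, wall-clock reduction: {time_saved:.1%}")
```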

@bbaiyhaor
Author

> You need to check and compare the scan operator metrics; the total input bytes look abnormal.

I have checked it and updated the configurations. A new error was encountered. Could you please help me solve it? Thank you!

@bbaiyhaor
Author

It's fixed by the latest commit.

@github-project-automation github-project-automation bot moved this from Todo to Done in Gluten 1.1.0 Nov 10, 2023
@hailiang9615

Hello, could you share the Gluten configuration you use to launch your jobs? On my side, TPC-DS runs significantly slower with Gluten than with Apache Spark. Thank you so much.

@hailiang9615

If so, could you also share some information about your environment, for example whether it uses a 10 Gigabit network card? Thank you!

@bbaiyhaor
Author

> If so, could you also share some information about your environment, for example whether it uses a 10 Gigabit network card? Thank you!

My cluster consists of four ecs.g6.4xlarge instances on Alibaba Cloud. You can view the instance details in the figure. I am using Gluten's default configuration. With the latest commit, Gluten is 1.32 times as fast as Spark.

[Screenshot: ecs.g6.4xlarge instance details]

@Yohahaha
Contributor

@bbaiyhaor Curious, are you using EMR?

@bbaiyhaor
Author

> @bbaiyhaor Curious, are you using EMR?

No, I set up a Hadoop 3.3.3 + Hive 4.0.0 + Spark 3.4.1 + Gluten cluster myself. The TPC-DS tests run in YARN client mode, with one driver on one instance and 15 executors spread across the other three instances.
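
For readers who want to reproduce a similar layout, here is a rough sketch of how such a launch command could be assembled. It is not the configuration used in this issue. Only the YARN client mode and the 15 executors come from the thread; the executor cores, off-heap size, and jar path are hypothetical placeholders, and the plugin class name shown is the one used by Gluten releases from around that time, so check it against your Gluten version.

```python
def gluten_submit_args(num_executors, executor_cores, offheap_size, gluten_jar):
    """Build a spark-submit argument list for a Gluten+Velox run on YARN."""
    conf = {
        # Plugin class name is release-dependent (assumption: 2023-era Gluten).
        "spark.plugins": "io.glutenproject.GlutenPlugin",
        # The native backend allocates from Spark's off-heap memory pool.
        "spark.memory.offHeap.enabled": "true",
        "spark.memory.offHeap.size": offheap_size,
        "spark.shuffle.manager": "org.apache.spark.shuffle.sort.ColumnarShuffleManager",
        "spark.executor.instances": str(num_executors),
        "spark.executor.cores": str(executor_cores),
        # Make the Gluten bundle jar visible to the driver and executors.
        "spark.driver.extraClassPath": gluten_jar,
        "spark.executor.extraClassPath": gluten_jar,
    }
    args = ["spark-submit", "--master", "yarn", "--deploy-mode", "client"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# 15 executors as described above; the other values are placeholders.
args = gluten_submit_args(
    num_executors=15,
    executor_cores=4,                   # hypothetical
    offheap_size="8g",                  # hypothetical
    gluten_jar="/path/to/gluten.jar",   # hypothetical path
)
print(" ".join(args))
```

The application jar or script and its arguments would still need to be appended to this list before running it.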

@Yohahaha
Contributor

> > @bbaiyhaor Curious, are you using EMR?
>
> No, I set up a Hadoop 3.3.3 + Hive 4.0.0 + Spark 3.4.1 + Gluten cluster myself. The TPC-DS tests run in YARN client mode, with one driver on one instance and 15 executors spread across the other three instances.

Thank you. We support the EMR product; if you are interested in that, please contact us.

@bbaiyhaor
Author

> > > @bbaiyhaor Curious, are you using EMR?
> >
> > No, I set up a Hadoop 3.3.3 + Hive 4.0.0 + Spark 3.4.1 + Gluten cluster myself. The TPC-DS tests run in YARN client mode, with one driver on one instance and 15 executors spread across the other three instances.
>
> Thank you. We support the EMR product; if you are interested in that, please contact us.

Sure, I'm very interested in that. We plan to test EMR + Gluten in the next stage.
