[VL]Why is Gluten significantly slower than Spark with the same configuration? #3576

bbaiyhaor · 2023-10-31T10:53:29Z

Backend

VL (Velox)

Bug description

I've set up a Spark-3.4.1 cluster in the cloud and ran a 100GB TPC-DS benchmark. I used Databricks' tpcds-perf tool for testing, without using Gluten's built-in test code. I expected that Gluten+Velox would outperform Spark, but in reality, it's five times slower than Spark. The top image shows Spark, and the bottom image is Gluten. It's evident that the 'input' column in Gluten has 5 times the amount of data compared to Spark. Could this be the reason why Gluten is five times slower than Spark? Is it due to not using libhdfs?

Spark version

None

Spark configurations

version 3.4.1
Using Scala version 2.12.17, OpenJDK 64-Bit Server VM, 1.8.0_382
Branch HEAD
Compiled by user centos on 2023-06-19T23:01:01Z
Revision 6b1ff22dde1ead51cbf370be6e48a802daae58b6
Url https://github.com/apache/spark
Type --help for more information.

System information

Velox System Info v0.0.2
Commit: Not in a git repo.
CMake Version: 3.16.3
System: Linux-5.4.0-164-generic
Arch: x86_64
C++ Compiler: /usr/bin/c++
C++ Compiler Version: 9.4.0
C Compiler: /usr/bin/cc
C Compiler Version: 9.4.0
CMake Prefix Path: /usr/local;/usr;/;/usr;/usr/local;/usr/X11R6;/usr/pkg;/opt

Relevant logs

No response

Yohahaha · 2023-11-01T01:42:57Z

You need to check and compare the metrics of the scan operator; the total input bytes is abnormal.
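
One way to make that comparison concrete is to total the input bytes per run and look at the ratio. The sketch below is a minimal illustration, not the method used in this thread: it assumes the per-stage records have already been exported (for example from the `inputBytes` field of the stage list in Spark's monitoring REST API), and the stage lists and helper names are hypothetical.

```python
from typing import Iterable, Mapping


def total_input_bytes(stages: Iterable[Mapping[str, int]]) -> int:
    """Sum the inputBytes field over a list of stage records."""
    return sum(int(stage.get("inputBytes", 0)) for stage in stages)


def input_ratio(gluten_stages, spark_stages) -> float:
    """Return how many times more input bytes the Gluten run read."""
    spark_total = total_input_bytes(spark_stages)
    if spark_total == 0:
        raise ValueError("vanilla Spark run reports zero input bytes")
    return total_input_bytes(gluten_stages) / spark_total


# Hypothetical stage records, made up for illustration only.
spark_stages = [
    {"stageId": 0, "inputBytes": 2_000_000_000},
    {"stageId": 1, "inputBytes": 500_000_000},
]
gluten_stages = [
    {"stageId": 0, "inputBytes": 10_000_000_000},
    {"stageId": 1, "inputBytes": 2_500_000_000},
]

ratio = input_ratio(gluten_stages, spark_stages)
print(f"Gluten read {ratio:.1f}x the input bytes of vanilla Spark")
```

A ratio well above 1 for the same query and data set suggests the scan is reading more than it should, which points at the file system or reader configuration before the compute path.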

bbaiyhaor · 2023-11-03T03:36:10Z

After updating the Hadoop configurations, the performance of Gluten+Velox improved significantly. On the 1 TB TPC-DS benchmark, Gluten+Velox finished in 3891 seconds, whereas Spark took 5134 seconds. That is a speed-up of about 1.32x over Spark, without using libhdfs3 and hdfs-client.xml. However, when I tried to configure libhdfs3 and hdfs-client.xml with Gluten+Velox, I encountered the following error:

# Problematic frame:
# C  [libhdfs3.so.1+0x17de09]  hdfsBuilderSetNameNode+0x19
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /tmp/byrhs_err_pid2509801.log
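
As a quick check of the arithmetic behind the 1.32x figure, here is a minimal sketch using only the two run times reported above; the helper name is illustrative.

```python
def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    if candidate_seconds <= 0:
        raise ValueError("candidate run time must be positive")
    return baseline_seconds / candidate_seconds


spark_seconds = 5134   # vanilla Spark, 1 TB TPC-DS (from the comment above)
gluten_seconds = 3891  # Gluten+Velox, 1 TB TPC-DS (from the comment above)

ratio = speedup(spark_seconds, gluten_seconds)
print(f"speed-up: {ratio:.2f}x")
```

Note that a 1.32x speed-up means the total run time dropped by roughly a quarter, not by 32%.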

bbaiyhaor · 2023-11-06T02:16:03Z

> You need to check and compare the metrics of the scan operator; the total input bytes is abnormal.

I have checked it and updated the configurations. A new error was encountered. Could you please help me solve it? Thank you!

bbaiyhaor · 2023-11-10T03:43:26Z

It's fixed by the latest commit.

hailiang9615 · 2023-11-15T06:16:58Z

Hello, could you share the Gluten configuration you use to launch your jobs? On my side, TPC-DS with Gluten is significantly slower than with Apache Spark. Thank you so much.

hailiang9615 · 2023-11-15T06:18:45Z

If so, would you mind sharing information about your environment? For example, whether it uses a 10 Gigabit network card, and so on. Thank you!

bbaiyhaor · 2023-11-20T03:02:20Z

> If so, would you mind sharing information about your environment? For example, whether it uses a 10 Gigabit network card, and so on. Thank you!

My cluster consists of four ecs.g6.4xlarge instances on Alibaba Cloud. You can view the instance details in the figure. I am using Gluten's default configuration. With the latest commit, Gluten is 1.32 times as fast as Spark.

Yohahaha · 2023-11-20T03:13:32Z

@bbaiyhaor Curious, are you using EMR?

bbaiyhaor · 2023-11-20T03:24:47Z

> @bbaiyhaor Curious, are you using EMR?

No, I set up a Hadoop 3.3.3 + Hive 4.0.0 + Spark 3.4.1 + Gluten cluster myself. The TPC-DS tests run in YARN client mode, with one driver on one instance and 15 executors across the other three instances.

Yohahaha · 2023-11-20T04:04:51Z

> > @bbaiyhaor Curious, are you using EMR?
>
> No, I set up a Hadoop 3.3.3 + Hive 4.0.0 + Spark 3.4.1 + Gluten cluster myself. The TPC-DS tests run in YARN client mode, with one driver on one instance and 15 executors across the other three instances.

Thank you. We support the EMR product; if you are interested in that, please contact us.

bbaiyhaor · 2023-11-20T06:20:17Z

> > > @bbaiyhaor Curious, are you using EMR?
> >
> > No, I set up a Hadoop 3.3.3 + Hive 4.0.0 + Spark 3.4.1 + Gluten cluster myself. The TPC-DS tests run in YARN client mode, with one driver on one instance and 15 executors across the other three instances.
>
> Thank you. We support the EMR product; if you are interested in that, please contact us.

Sure, I'm very interested in that. We plan to test EMR + Gluten in the next stage.

bbaiyhaor added the bug and triage labels Oct 31, 2023

github-project-automation bot added this to Gluten 1.1.0 Oct 31, 2023

github-project-automation bot moved this to Todo in Gluten 1.1.0 Oct 31, 2023

bbaiyhaor changed the title ~~Why is Gluten significantly slower than Spark with the same configuration?~~ [VL]Why is Gluten significantly slower than Spark with the same configuration? Oct 31, 2023

bbaiyhaor closed this as completed Nov 10, 2023

github-project-automation bot moved this from Todo to Done in Gluten 1.1.0 Nov 10, 2023

zhouyuan added the performance label Nov 10, 2023
