[VL] Why is Gluten significantly slower than Spark with the same configuration? #3576
You need to check and compare the metrics of the scan operator; the total input bytes is abnormal.
I have checked it and updated the configurations. A new error was encountered. Could you please help me solve it? Thank you!
It's fixed by the latest commit.
Hello, could you share the Gluten configuration you use when launching your job? On my side, TPC-DS runs significantly slower with Gluten than with Apache Spark. Thank you so much!
If so, would you like to share information about your environment as well? For example, whether you are using a 10-gigabit network card, and so on. Thank you!
@bbaiyhaor Curious, are you using EMR?
No, I set up a Hadoop 3.3.3 + Hive 4.0.0 + Spark 3.4.1 + Gluten cluster myself. The TPC-DS tests run in YARN client mode, with one driver on one instance and 15 executors across three instances.
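For reference, a launch command matching the setup described above (YARN client mode, 15 executors) might look roughly like the sketch below. The bundle jar path, core counts, and memory sizes are illustrative assumptions, not the reporter's actual values, and the plugin class name may differ across Gluten versions:

```shell
spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 15 \
  --executor-cores 4 \
  --executor-memory 8g \
  --conf spark.plugins=io.glutenproject.GlutenPlugin \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=8g \
  --jars /path/to/gluten-velox-bundle.jar \
  ...
```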
Thank you. We support an EMR product; if you are interested in it, please contact us.
Sure, I'm very interested in that. We plan to test EMR + Gluten in the next stage. |
Backend
VL (Velox)
Bug description
I've set up a Spark 3.4.1 cluster in the cloud and ran a 100 GB TPC-DS benchmark. I used Databricks' tpcds-perf tool for the tests rather than Gluten's built-in test code. I expected Gluten+Velox to outperform Spark, but in reality it is five times slower. The top image shows Spark and the bottom image shows Gluten; the 'Input' column for Gluten shows five times as much data as for Spark. Could this be why Gluten is five times slower than Spark? Is it due to not using libhdfs?
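For context, a minimal Gluten-on-Velox configuration of the kind typically placed in `spark-defaults.conf` is sketched below. The values are illustrative assumptions only (the off-heap size in particular must be tuned to the executor memory budget), and the plugin class name shown is the one used by Gluten 1.x:

```properties
# Load the Gluten plugin so Velox operators replace vanilla Spark operators
spark.plugins                 io.glutenproject.GlutenPlugin
# Use Gluten's columnar shuffle manager
spark.shuffle.manager         org.apache.spark.shuffle.sort.ColumnarShuffleManager
# Velox allocates off-heap memory; it must be enabled and explicitly sized
spark.memory.offHeap.enabled  true
spark.memory.offHeap.size     8g
```

If off-heap memory is disabled or undersized, Gluten can fall back to row-based Spark operators or fail, which is one common cause of the kind of slowdown described above.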
Spark version
None
Spark configurations
version 3.4.1
Using Scala version 2.12.17, OpenJDK 64-Bit Server VM, 1.8.0_382
Branch HEAD
Compiled by user centos on 2023-06-19T23:01:01Z
Revision 6b1ff22dde1ead51cbf370be6e48a802daae58b6
Url https://github.com/apache/spark
Type --help for more information.
System information
Velox System Info v0.0.2
Commit: Not in a git repo.
CMake Version: 3.16.3
System: Linux-5.4.0-164-generic
Arch: x86_64
C++ Compiler: /usr/bin/c++
C++ Compiler Version: 9.4.0
C Compiler: /usr/bin/cc
C Compiler Version: 9.4.0
CMake Prefix Path: /usr/local;/usr;/;/usr;/usr/local;/usr/X11R6;/usr/pkg;/opt
Relevant logs
No response