
[VL] Avoid using WriteFilesSpec which is not serializable #6144

Merged
merged 1 commit · Jun 20, 2024

Conversation

jackylee-ch
Copy link
Contributor

@jackylee-ch jackylee-ch commented Jun 19, 2024

What changes were proposed in this pull request?

The concurrentOutputWriterSpecFunc in WriteFilesSpec is a function; when its closure is serialized it captures the enclosing class while looking for a Serializable owner, which can cause serialization failures for users. Since we don't use it in the RDD, we can remove it.

Below is the problem we hit before this PR fixed it.

```
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@31998d54)
- field (class: org.apache.spark.sql.hivexxt)
- object (class org.apache.spark.sql.hive.xx)
- element of array (index: 2)
- array (class [Ljava.lang.Object;, size 4)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class orxx)
- object (class org.apache.spark.sql.execution.datasources.UserFileFormatWriter$$$Lambda$2964/719020114, org.apache.spark.sql.execution.datasources.UserFileFormatWriter$$$Lambda$2964/719020114@31839df7)
- field (class: org.apache.spark.sql.execution.datasources.WriteFilesSpec, name: concurrentOutputWriterSpecFunc, type: interface scala.Function1)
- object (class org.apache.spark.sql.execution.datasources.WriteFilesSpec, WriteFilesSpec(org.apache.spark.sql.execution.datasources.WriteJobDescription@30c77b5a,org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol@3f3c5251,org.apache.spark.sql.execution.datasources.UserFileFormatWriter$$$Lambda$2964/719020114@31839df7))
- field (class: org.apache.spark.sql.execution.VeloxColumnarWriteFilesRDD, name: writeFilesSpec, type: class org.apache.spark.sql.execution.datasources.WriteFilesSpec)
- object (class org.apache.spark.sql.execution.VeloxColumnarWriteFilesRDD, VeloxColumnarWriteFilesRDD[18] at saveAsTable at SparkJobRunner.scala:342)
- field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
- object (class scala.Tuple2, (VeloxColumnarWriteFilesRDD[18] at saveAsTable at SparkJobRunner.scala:342,org.apache.spark.sql.execution.datasources.UserFileFormatWriter$$$Lambda$3087/1542323941@7df0da43))
```
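The failure above can be reproduced in miniature: a lambda stored on an otherwise-serializable case class drags its enclosing (non-serializable) class into the serialized object graph. A minimal sketch with hypothetical stand-in names (DriverContext, SpecWithFunc, Driver are illustrations, not the real Spark classes):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Stand-in for SparkContext: not Serializable.
class DriverContext

// Stand-in for WriteFilesSpec: the case class itself is Serializable,
// but its Function1 field may not be.
case class SpecWithFunc(f: Int => Int)

class Driver {
  val ctx = new DriverContext
  // Referencing `ctx` makes the lambda capture the enclosing Driver instance,
  // pulling the non-serializable context into the closure's capturedArgs.
  def buildSpec(): SpecWithFunc = SpecWithFunc(x => { val _ = ctx; x + 1 })
}

object ClosureCaptureDemo {
  // Returns true iff `obj` survives Java serialization.
  def serializes(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

Serializing `new Driver().buildSpec()` fails exactly like the stack trace above: the SerializedLambda's capturedArgs contain the Driver, which holds the context.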

How was this patch tested?

GA

@apache apache deleted a comment from github-actions bot Jun 20, 2024
@ulysses-you
Copy link
Contributor

Why is WriteFilesSpec not serializable? Doesn't it extend a case class?

@jackylee-ch
Copy link
Contributor Author

Why is WriteFilesSpec not serializable? Doesn't it extend a case class?

@ulysses-you It's the concurrentOutputWriterSpecFunc field in WriteFilesSpec that is not serializable.

@jackylee-ch jackylee-ch merged commit f12dbef into apache:main Jun 20, 2024
36 checks passed
@jackylee-ch jackylee-ch deleted the mirror_branch branch June 20, 2024 03:04
@ulysses-you
Copy link
Contributor

Are you sure concurrentOutputWriterSpecFunc is not serializable?
(screenshot attached)

@jackylee-ch
Copy link
Contributor Author

Are you sure concurrentOutputWriterSpecFunc is not serializable?

concurrentOutputWriterSpecFunc is a function; when serialized, its closure captures the enclosing class while looking for a Serializable owner. Below is the problem we hit before this PR fixed it.

(same serialization stack trace as in the PR description above)

@GlutenPerfBot
Copy link
Contributor

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

| query | log/native_6144_time.csv | log/native_master_06_19_2024_27c32f1b15_time.csv | difference | percentage |
| --- | --- | --- | --- | --- |
| q1 | 37.54 | 34.61 | -2.923 | 92.21% |
| q2 | 23.63 | 25.46 | 1.834 | 107.76% |
| q3 | 38.64 | 38.59 | -0.043 | 99.89% |
| q4 | 31.07 | 33.98 | 2.908 | 109.36% |
| q5 | 72.32 | 70.97 | -1.344 | 98.14% |
| q6 | 6.36 | 6.54 | 0.185 | 102.90% |
| q7 | 84.74 | 85.08 | 0.343 | 100.41% |
| q8 | 86.02 | 87.11 | 1.090 | 101.27% |
| q9 | 124.66 | 123.67 | -0.983 | 99.21% |
| q10 | 47.34 | 44.85 | -2.498 | 94.72% |
| q11 | 20.48 | 20.56 | 0.079 | 100.39% |
| q12 | 25.70 | 27.28 | 1.585 | 106.17% |
| q13 | 40.56 | 39.58 | -0.983 | 97.58% |
| q14 | 18.86 | 18.81 | -0.047 | 99.75% |
| q15 | 33.65 | 33.35 | -0.298 | 99.12% |
| q16 | 13.87 | 14.28 | 0.410 | 102.96% |
| q17 | 106.02 | 105.60 | -0.424 | 99.60% |
| q18 | 148.30 | 144.60 | -3.700 | 97.51% |
| q19 | 13.94 | 13.77 | -0.173 | 98.76% |
| q20 | 29.78 | 29.04 | -0.737 | 97.53% |
| q21 | 262.71 | 265.23 | 2.515 | 100.96% |
| q22 | 12.34 | 12.88 | 0.531 | 104.30% |
| total | 1278.53 | 1275.86 | -2.671 | 99.79% |

@ulysses-you
Copy link
Contributor

@jackylee-ch I think it's caused by your internal changes... Vanilla Spark does not hold a SparkContext in concurrentOutputWriterSpecFunc. You should add @transient to the SparkContext field.
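For reference, the @transient suggestion could look like the following sketch (HeavyContext and WriterHelper are hypothetical stand-ins, not the real Spark code): marking the field @transient makes Java serialization skip it, so closures that capture the holder serialize cleanly.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Stand-in for SparkContext: not Serializable.
class HeavyContext

class WriterHelper extends Serializable {
  // @transient excludes the field from Java serialization, so the captured
  // WriterHelper no longer drags the context into the object graph.
  @transient val ctx: HeavyContext = new HeavyContext

  // The closure captures `this` (WriterHelper), which is now safe to serialize.
  def spec: Int => Int = x => { val _ = ctx; x + 1 }
}

object TransientDemo {
  // Returns true iff `obj` survives Java serialization.
  def serializes(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

Note the usual caveat: a @transient val is null after deserialization, so it must be re-initialized (or never dereferenced) on the executor side.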

@jackylee-ch
Copy link
Contributor Author

jackylee-ch commented Jun 20, 2024

@jackylee-ch I think it's caused by your internal changes... Vanilla Spark does not hold a SparkContext in concurrentOutputWriterSpecFunc. You should add @transient to the SparkContext field.

Yes, I agree with you. However, I think we should avoid using concurrentOutputWriterSpecFunc, since we don't use it in the RDD and it may cause other serialization problems for users.
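The approach described here — keep the function off the RDD entirely — can be sketched like so (all names hypothetical, not the actual Gluten code): the RDD copies out only the serializable pieces of the spec, and the Function1 field never leaves the driver.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Plain serializable payload, standing in for WriteJobDescription.
case class JobDescription(outputPath: String)

// Stand-in for SparkContext: not Serializable.
class Context

class DriverSide {
  val ctx = new Context
  // Driver-only closure; capturing `this` makes it non-serializable.
  def specFunc: Int => Int = x => { val _ = ctx; x + 1 }
}

// Stand-in for WriteFilesSpec, including its risky Function1 field.
case class WriteSpec(description: JobDescription, specFunc: Int => Int)

// The RDD stores only the description; the function stays on the driver.
class ColumnarWriteRDD(val description: JobDescription) extends Serializable

object SplitSpecDemo {
  def makeRdd(spec: WriteSpec): ColumnarWriteRDD =
    new ColumnarWriteRDD(spec.description)

  // Returns true iff `obj` survives Java serialization.
  def serializes(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

The whole spec still fails to serialize, but the RDD built from it no longer carries the dangerous field, which is the robustness point: user closures can't break task serialization through a field the executors never needed.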

@ulysses-you
Copy link
Contributor

I'm not against this change, as it is a code improvement. Just to make it clear: this PR's title and description should not be related to serialization...

@jackylee-ch
Copy link
Contributor Author

I'm not against this change, as it is a code improvement. Just to make it clear: this PR's title and description should not be related to serialization...

Okay, got it. I have updated the description with more details.
