
[VL] Avoid using WriteFilesSpec which is not serializable #6144

Merged
merged 1 commit · Jun 20, 2024

Conversation

jackylee-ch
Copy link
Contributor

@jackylee-ch jackylee-ch commented Jun 19, 2024

What changes were proposed in this pull request?

The concurrentOutputWriterSpecFunc in WriteFilesSpec is a function; when its closure is serialized it captures the enclosing class while looking for a Serializable owner, which can cause serialization failures for users. Since we don't use it in the RDD, we can remove it.

Below is the problem we hit before this PR fixed it.

```
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@31998d54)
- field (class: org.apache.spark.sql.hivexxt)
- object (class org.apache.spark.sql.hive.xx)
- element of array (index: 2)
- array (class [Ljava.lang.Object;, size 4)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class orxx)
- object (class org.apache.spark.sql.execution.datasources.UserFileFormatWriter$$$Lambda$2964/719020114, org.apache.spark.sql.execution.datasources.UserFileFormatWriter$$$Lambda$2964/719020114@31839df7)
- field (class: org.apache.spark.sql.execution.datasources.WriteFilesSpec, name: concurrentOutputWriterSpecFunc, type: interface scala.Function1)
- object (class org.apache.spark.sql.execution.datasources.WriteFilesSpec, WriteFilesSpec(org.apache.spark.sql.execution.datasources.WriteJobDescription@30c77b5a,org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol@3f3c5251,org.apache.spark.sql.execution.datasources.UserFileFormatWriter$$$Lambda$2964/719020114@31839df7))
- field (class: org.apache.spark.sql.execution.VeloxColumnarWriteFilesRDD, name: writeFilesSpec, type: class org.apache.spark.sql.execution.datasources.WriteFilesSpec)
- object (class org.apache.spark.sql.execution.VeloxColumnarWriteFilesRDD, VeloxColumnarWriteFilesRDD[18] at saveAsTable at SparkJobRunner.scala:342)
- field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
- object (class scala.Tuple2, (VeloxColumnarWriteFilesRDD[18] at saveAsTable at SparkJobRunner.scala:342,org.apache.spark.sql.execution.datasources.UserFileFormatWriter$$$Lambda$3087/1542323941@7df0da43))
```
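The failure above can be reproduced in miniature: a lambda stored on an otherwise-serializable case class drags its enclosing (non-serializable) class into the serialized object graph. A minimal sketch with hypothetical stand-in names (DriverContext, SpecWithFunc, Driver are illustrations, not the real Spark classes):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Stand-in for SparkContext: not Serializable.
class DriverContext

// Stand-in for WriteFilesSpec: the case class itself is Serializable,
// but its Function1 field may not be.
case class SpecWithFunc(f: Int => Int)

class Driver {
  val ctx = new DriverContext
  // Referencing `ctx` makes the lambda capture the enclosing Driver instance,
  // pulling the non-serializable context into the closure's capturedArgs.
  def buildSpec(): SpecWithFunc = SpecWithFunc(x => { val _ = ctx; x + 1 })
}

object ClosureCaptureDemo {
  // Returns true iff `obj` survives Java serialization.
  def serializes(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

Serializing `new Driver().buildSpec()` fails exactly like the stack trace above: the SerializedLambda's capturedArgs contain the Driver, which holds the context.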

How was this patch tested?

GA

@apache apache deleted a comment from github-actions bot Jun 20, 2024
@ulysses-you
Copy link
Contributor

Why is WriteFilesSpec not serializable? Doesn't it extend a case class?

@jackylee-ch
Copy link
Contributor Author

Why is WriteFilesSpec not serializable? Doesn't it extend a case class?

@ulysses-you It's the concurrentOutputWriterSpecFunc field in WriteFilesSpec that is not serializable.

@jackylee-ch jackylee-ch merged commit f12dbef into apache:main Jun 20, 2024
36 checks passed
@jackylee-ch jackylee-ch deleted the mirror_branch branch June 20, 2024 03:04
@ulysses-you
Copy link
Contributor

Are you sure concurrentOutputWriterSpecFunc is not serializable?
(screenshot attached)

@jackylee-ch
Copy link
Contributor Author

Are you sure concurrentOutputWriterSpecFunc is not serializable?

concurrentOutputWriterSpecFunc is a function; when serialized, its closure captures the enclosing class while looking for a Serializable owner. Below is the problem we hit before this PR fixed it.

(same serialization stack trace as in the PR description above)

@GlutenPerfBot
Copy link
Contributor

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

| query | log/native_6144_time.csv | log/native_master_06_19_2024_27c32f1b15_time.csv | difference | percentage |
| --- | --- | --- | --- | --- |
| q1 | 37.54 | 34.61 | -2.923 | 92.21% |
| q2 | 23.63 | 25.46 | 1.834 | 107.76% |
| q3 | 38.64 | 38.59 | -0.043 | 99.89% |
| q4 | 31.07 | 33.98 | 2.908 | 109.36% |
| q5 | 72.32 | 70.97 | -1.344 | 98.14% |
| q6 | 6.36 | 6.54 | 0.185 | 102.90% |
| q7 | 84.74 | 85.08 | 0.343 | 100.41% |
| q8 | 86.02 | 87.11 | 1.090 | 101.27% |
| q9 | 124.66 | 123.67 | -0.983 | 99.21% |
| q10 | 47.34 | 44.85 | -2.498 | 94.72% |
| q11 | 20.48 | 20.56 | 0.079 | 100.39% |
| q12 | 25.70 | 27.28 | 1.585 | 106.17% |
| q13 | 40.56 | 39.58 | -0.983 | 97.58% |
| q14 | 18.86 | 18.81 | -0.047 | 99.75% |
| q15 | 33.65 | 33.35 | -0.298 | 99.12% |
| q16 | 13.87 | 14.28 | 0.410 | 102.96% |
| q17 | 106.02 | 105.60 | -0.424 | 99.60% |
| q18 | 148.30 | 144.60 | -3.700 | 97.51% |
| q19 | 13.94 | 13.77 | -0.173 | 98.76% |
| q20 | 29.78 | 29.04 | -0.737 | 97.53% |
| q21 | 262.71 | 265.23 | 2.515 | 100.96% |
| q22 | 12.34 | 12.88 | 0.531 | 104.30% |
| total | 1278.53 | 1275.86 | -2.671 | 99.79% |

@ulysses-you
Copy link
Contributor

@jackylee-ch I think it's caused by your internal changes... Vanilla Spark does not hold a SparkContext in concurrentOutputWriterSpecFunc. You should add @transient to the SparkContext field.
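For reference, the @transient suggestion could look like the following sketch (HeavyContext and WriterHelper are hypothetical stand-ins, not the real Spark code): marking the field @transient makes Java serialization skip it, so closures that capture the holder serialize cleanly.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Stand-in for SparkContext: not Serializable.
class HeavyContext

class WriterHelper extends Serializable {
  // @transient excludes the field from Java serialization, so the captured
  // WriterHelper no longer drags the context into the object graph.
  @transient val ctx: HeavyContext = new HeavyContext

  // The closure captures `this` (WriterHelper), which is now safe to serialize.
  def spec: Int => Int = x => { val _ = ctx; x + 1 }
}

object TransientDemo {
  // Returns true iff `obj` survives Java serialization.
  def serializes(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

Note the usual caveat: a @transient val is null after deserialization, so it must be re-initialized (or never dereferenced) on the executor side.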

@jackylee-ch
Copy link
Contributor Author

jackylee-ch commented Jun 20, 2024

@jackylee-ch I think it's caused by your internal changes... Vanilla Spark does not hold a SparkContext in concurrentOutputWriterSpecFunc. You should add @transient to the SparkContext field.

Yes, I agree with you. However, I think we should avoid using concurrentOutputWriterSpecFunc, since we don't use it in the RDD and it may cause other serialization problems for users.
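The approach described here — keep the function off the RDD entirely — can be sketched like so (all names hypothetical, not the actual Gluten code): the RDD copies out only the serializable pieces of the spec, and the Function1 field never leaves the driver.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Plain serializable payload, standing in for WriteJobDescription.
case class JobDescription(outputPath: String)

// Stand-in for SparkContext: not Serializable.
class Context

class DriverSide {
  val ctx = new Context
  // Driver-only closure; capturing `this` makes it non-serializable.
  def specFunc: Int => Int = x => { val _ = ctx; x + 1 }
}

// Stand-in for WriteFilesSpec, including its risky Function1 field.
case class WriteSpec(description: JobDescription, specFunc: Int => Int)

// The RDD stores only the description; the function stays on the driver.
class ColumnarWriteRDD(val description: JobDescription) extends Serializable

object SplitSpecDemo {
  def makeRdd(spec: WriteSpec): ColumnarWriteRDD =
    new ColumnarWriteRDD(spec.description)

  // Returns true iff `obj` survives Java serialization.
  def serializes(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

The whole spec still fails to serialize, but the RDD built from it no longer carries the dangerous field, which is the robustness point: user closures can't break task serialization through a field the executors never needed.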

@ulysses-you
Copy link
Contributor

I'm not against this change, as it is a code improvement. Just to make it clear: this PR's title and description should not be related to serialization...

@jackylee-ch
Copy link
Contributor Author

I'm not against this change, as it is a code improvement. Just to make it clear: this PR's title and description should not be related to serialization...

Okay, got it. I have updated the description with more details.
