
fix: Support partition values in feature branch comet-parquet-exec #1106

Merged 5 commits into apache:comet-parquet-exec on Nov 22, 2024

Conversation

@viirya (Member) commented Nov 20, 2024

Which issue does this PR close?

Closes #1102.

Rationale for this change

What changes are included in this PR?

How are these changes tested?

@viirya changed the title from "Support partition values in feature branch comet-parquet-exec" to "fix: Support partition values in feature branch comet-parquet-exec" on Nov 20, 2024
@@ -2521,9 +2524,12 @@ object QueryPlanSerde extends Logging with ShimQueryPlanSerde with CometExprShim
new SparkToParquetSchemaConverter(conf).convert(scan.requiredSchema)
val dataSchemaParquet =
new SparkToParquetSchemaConverter(conf).convert(scan.relation.dataSchema)
val partitionSchemaParquet =
new SparkToParquetSchemaConverter(conf).convert(scan.relation.partitionSchema)
Contributor


#1103 discusses how the schemas have already lost necessary information at this point. Should we construct a new partition schema from the true Parquet schema rather than the partitionSchema that may have lost/converted type information already?

Member Author


This copies from existing code.

Actually, I can just convert the Spark schema to Arrow types in the JVM and serialize it to the native side. I did a similar thing in the shuffle writer. Then we won't lose any information.
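The idea sketched in the comment above can be illustrated roughly as follows. This is a hedged sketch, not the PR's actual implementation: all names here (`SchemaTransfer`, `toArrowTypeName`, `serialize`) are hypothetical, and real code would convert to actual Arrow `Field` objects and serialize via Arrow IPC or protobuf rather than strings.

```scala
// Hypothetical sketch: convert a Spark schema to Arrow type names on the
// JVM and serialize the result for the native side, so that no type
// information is lost in a Spark -> Parquet schema round trip.
object SchemaTransfer {
  // Minimal Spark-to-Arrow type-name mapping, for illustration only.
  def toArrowTypeName(sparkType: String): String = sparkType match {
    case "integer"   => "Int32"
    case "long"      => "Int64"
    case "string"    => "Utf8"
    case "timestamp" => "Timestamp(Microsecond, UTC)"
    case other => throw new IllegalArgumentException(s"unmapped type: $other")
  }

  // Serialize (field name, Spark type) pairs into a compact form the
  // native side could parse; a real implementation would use Arrow IPC
  // or a protobuf message instead of a delimited string.
  def serialize(fields: Seq[(String, String)]): String =
    fields.map { case (name, tpe) => s"$name:${toArrowTypeName(tpe)}" }
      .mkString(";")
}
```

The point of doing the conversion on the JVM side is that the original Spark types are still available there, before any lossy Parquet schema conversion happens.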

@@ -52,6 +52,7 @@ message SparkPartitionedFile {
int64 start = 2;
int64 length = 3;
int64 file_size = 4;
repeated spark.spark_expression.Expr partition_values = 5;
Contributor


Aren't the partition values just strings?

Member Author


No. For a Hive-partitioned table, partition values are directory names, which are strings, but once Spark reads those strings back, they are cast to the corresponding data types of the partition columns.

Contributor


Ah makes sense.
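The casting described in the exchange above can be sketched as follows. This is a hedged illustration, not Comet or Spark source: `castPartitionValue` and its string type tags are hypothetical, standing in for Spark's cast from partition-directory strings to the partition columns' declared types.

```scala
import java.time.LocalDate

// Hypothetical sketch: a Hive-partitioned table encodes partition values
// as directory-name strings (e.g. "date=2024-11-20/id=42"), and the reader
// casts each string back to the partition column's declared type.
object PartitionValues {
  def castPartitionValue(raw: String, dataType: String): Any = dataType match {
    case "int"    => raw.toInt            // "42" becomes an Int, not a String
    case "long"   => raw.toLong
    case "date"   => LocalDate.parse(raw) // "2024-11-20" becomes a LocalDate
    case "string" => raw
    case other => throw new IllegalArgumentException(s"unsupported: $other")
  }
}
```

This is why the `partition_values` field in the protobuf above carries typed expressions rather than plain strings: by the time Spark hands the values over, they are already typed.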

@viirya merged commit c3ad26e into apache:comet-parquet-exec on Nov 22, 2024
23 of 74 checks passed
@viirya deleted the partition_values branch on November 22, 2024 at 23:56