fix: Support partition values in feature branch comet-parquet-exec #1106
Conversation
@@ -2521,9 +2524,12 @@ object QueryPlanSerde extends Logging with ShimQueryPlanSerde with CometExprShim
new SparkToParquetSchemaConverter(conf).convert(scan.requiredSchema)
val dataSchemaParquet =
new SparkToParquetSchemaConverter(conf).convert(scan.relation.dataSchema)
val partitionSchemaParquet =
new SparkToParquetSchemaConverter(conf).convert(scan.relation.partitionSchema)
#1103 discusses how the schemas have already lost necessary information by this point. Should we construct a new partition schema from the true Parquet schema rather than from the partitionSchema, which may have already lost or converted type information?
This copies from existing code.
Actually, I can just convert the Spark schema to Arrow types in the JVM and serialize it to the native side. I did something similar in the shuffle writer. That way we won't lose any information.
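To illustrate the idea in the comment above (a minimal sketch only, not Comet's actual JVM-to-native serialization path; the field names and type strings here are invented for the example), the point is that serializing the schema itself, with its exact types, lets the receiving side reconstruct it without any lossy conversion:

```python
import json

# Hypothetical illustration: carry exact type information across the
# serialization boundary so the receiving side reconstructs the schema
# as-is. The field names and type strings are made up for this example.
schema = [
    {"name": "id", "type": "int64", "nullable": False},
    {"name": "ts", "type": "timestamp[us, tz=UTC]", "nullable": True},
]

# Serialize on the "JVM" side...
payload = json.dumps(schema)

# ...and deserialize on the "native" side: nothing is lost in transit.
restored = json.loads(payload)
assert restored == schema
```

The contrast is with converting through an intermediate Parquet schema, where type details (such as timestamp semantics) can be dropped before the native side ever sees them.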
@@ -52,6 +52,7 @@ message SparkPartitionedFile {
int64 start = 2;
int64 length = 3;
int64 file_size = 4;
repeated spark.spark_expression.Expr partition_values = 5;
Aren't the partition values just strings?
No. For a Hive-partitioned table the partition values are directory names, which are strings, but once Spark reads those strings back they are cast to the data types of the corresponding partition columns.
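The behavior described above can be sketched as follows (a simplified illustration, not Spark's actual partition-discovery code; `parse_partition_values` and the type mapping are invented for the example):

```python
# Simplified sketch: cast Hive-style partition directory values back to the
# declared types of the partition columns. The helper function and the
# CASTERS mapping are invented for this illustration.
from datetime import date

CASTERS = {
    "int": int,
    "date": date.fromisoformat,
    "string": str,
}

def parse_partition_values(path: str, schema: dict) -> dict:
    """Parse key=value segments from a partition path and cast each raw
    string to the declared type of its partition column."""
    values = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, raw = segment.split("=", 1)
            values[key] = CASTERS[schema[key]](raw)
    return values

parts = parse_partition_values(
    "/warehouse/t/dt=2021-01-01/bucket=42",
    {"dt": "date", "bucket": "int"},
)
assert parts == {"dt": date(2021, 1, 1), "bucket": 42}
```

This is why the proto field is a `repeated Expr` rather than a list of strings: by the time the values reach the scan, they are typed literals, not the raw directory names.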
Ah, makes sense.
Which issue does this PR close?
Closes #1102.
Rationale for this change
What changes are included in this PR?
How are these changes tested?