
fix: Support partition values in feature branch comet-parquet-exec #1106

Merged 5 commits into apache:comet-parquet-exec on Nov 22, 2024

Conversation

@viirya (Member) commented Nov 20, 2024

Which issue does this PR close?

Closes #1102.

Rationale for this change

What changes are included in this PR?

How are these changes tested?

@viirya changed the title from "Support partition values in feature branch comet-parquet-exec" to "fix: Support partition values in feature branch comet-parquet-exec" on Nov 20, 2024
@@ -2521,9 +2524,12 @@ object QueryPlanSerde extends Logging with ShimQueryPlanSerde with CometExprShim
new SparkToParquetSchemaConverter(conf).convert(scan.requiredSchema)
val dataSchemaParquet =
new SparkToParquetSchemaConverter(conf).convert(scan.relation.dataSchema)
val partitionSchemaParquet =
new SparkToParquetSchemaConverter(conf).convert(scan.relation.partitionSchema)
Contributor


#1103 discusses how the schemas have already lost necessary information at this point. Should we construct a new partition schema from the true Parquet schema rather than the partitionSchema that may have lost/converted type information already?

Member Author


This copies from existing code.

Actually, I can just convert the Spark schema to Arrow types in the JVM and serialize it to the native side. I did a similar thing in the shuffle writer. Then we won't lose any information.
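The idea sketched in the comment above can be illustrated roughly as follows. This is a hedged sketch, not the PR's actual implementation: all names here (`SchemaTransfer`, `toArrowTypeName`, `serialize`) are hypothetical, and real code would convert to actual Arrow `Field` objects and serialize via Arrow IPC or protobuf rather than strings.

```scala
// Hypothetical sketch: convert a Spark schema to Arrow type names on the
// JVM and serialize the result for the native side, so that no type
// information is lost in a Spark -> Parquet schema round trip.
object SchemaTransfer {
  // Minimal Spark-to-Arrow type-name mapping, for illustration only.
  def toArrowTypeName(sparkType: String): String = sparkType match {
    case "integer"   => "Int32"
    case "long"      => "Int64"
    case "string"    => "Utf8"
    case "timestamp" => "Timestamp(Microsecond, UTC)"
    case other => throw new IllegalArgumentException(s"unmapped type: $other")
  }

  // Serialize (field name, Spark type) pairs into a compact form the
  // native side could parse; a real implementation would use Arrow IPC
  // or a protobuf message instead of a delimited string.
  def serialize(fields: Seq[(String, String)]): String =
    fields.map { case (name, tpe) => s"$name:${toArrowTypeName(tpe)}" }
      .mkString(";")
}
```

The point of doing the conversion on the JVM side is that the original Spark types are still available there, before any lossy Parquet schema conversion happens.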

@@ -52,6 +52,7 @@ message SparkPartitionedFile {
int64 start = 2;
int64 length = 3;
int64 file_size = 4;
repeated spark.spark_expression.Expr partition_values = 5;
Contributor


Aren't the partition values just strings?

Member Author


No. For a Hive-partitioned table, partition values are directory names, which are strings, but once Spark reads those strings back, they are cast to the corresponding data types of the partition columns.

Contributor


Ah makes sense.
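The casting described in the exchange above can be sketched as follows. This is a hedged illustration, not Comet or Spark source: `castPartitionValue` and its string type tags are hypothetical, standing in for Spark's cast from partition-directory strings to the partition columns' declared types.

```scala
import java.time.LocalDate

// Hypothetical sketch: a Hive-partitioned table encodes partition values
// as directory-name strings (e.g. "date=2024-11-20/id=42"), and the reader
// casts each string back to the partition column's declared type.
object PartitionValues {
  def castPartitionValue(raw: String, dataType: String): Any = dataType match {
    case "int"    => raw.toInt            // "42" becomes an Int, not a String
    case "long"   => raw.toLong
    case "date"   => LocalDate.parse(raw) // "2024-11-20" becomes a LocalDate
    case "string" => raw
    case other => throw new IllegalArgumentException(s"unsupported: $other")
  }
}
```

This is why the `partition_values` field in the protobuf above carries typed expressions rather than plain strings: by the time Spark hands the values over, they are already typed.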

@viirya merged commit c3ad26e into apache:comet-parquet-exec on Nov 22, 2024
23 of 74 checks passed
@viirya deleted the partition_values branch on November 22, 2024 at 23:56