[GLUTEN-5471][VL]feat: Support read Hudi COW table #6049
Conversation
@@ -30,13 +30,13 @@ trait DataSourceScanTransformerRegister {
/**
 * The class name used to identify what kind of datasource this is.
 *
 * For DataSource V1, it should be the child class name of
 * [[org.apache.spark.sql.execution.datasources.FileIndex]].
 * For DataSource V1, it should be relation.fileFormat like
@YannByron `org.apache.spark.sql.execution.datasources.FileIndex` can be used to distinguish different datasources, but it is too general: all kinds of file reads would pass, including metadata/log files that are only used for query plan analysis. That is unnecessary and may trigger failures in some corner cases. So here we limit it to the parquet format. Are you okay with this change?
Not sure about this change: why is it necessary to update the docs in this PR, which is only meant for Hudi support?
`scanClassName` is a flag that decides which data source transformer should be triggered during registration, such as delta, iceberg, hudi, etc. It used to be determined by the value of `org.apache.spark.sql.execution.datasources.FileIndex`; in this PR we change it to `relation.fileFormat`.
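To make the matching concrete, here is a minimal sketch of how a Hudi-specific register could key off `relation.fileFormat`; the provider class name, the `createDataSourceTransformer` signature, and the `HoodieParquetFileFormat` value are assumptions for illustration, not necessarily the exact Gluten API.

```scala
import org.apache.spark.sql.execution.FileSourceScanExec

// Hypothetical sketch: a register selected by the class name of relation.fileFormat.
// All names except scanClassName are illustrative assumptions; the transformer
// types are Gluten-internal and shown here only to convey the shape of the API.
class HudiScanTransformerProvider extends DataSourceScanTransformerRegister {

  // Matched against relation.fileFormat, so this provider is picked for
  // Hudi COW tables, which are backed by parquet data files.
  override val scanClassName: String = "HoodieParquetFileFormat"

  override def createDataSourceTransformer(
      batchScan: FileSourceScanExec): FileSourceScanExecTransformerBase = {
    // Delegate to a Hudi-specific scan transformer (assumed name).
    HudiScanTransformer(batchScan)
  }
}
```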
override lazy val fileFormat: ReadFileFormat = ReadFileFormat.ParquetReadFormat

override protected def doValidateInternal(): ValidationResult = {
  if (requiredSchema.fields.exists(_.name.startsWith("_hoodie"))) {
Sorry, I don't fully understand the logic here. Why is a schema that contains `_hoodie` fields not supported?
Same question. Can you add a doc comment to the method to clarify under what condition the validation should fail?
From the discussion on the previous PR, we can support:
- COW table, regardless of `populateMetaField` being true or false
- MOR table for read-optimized queries

We can start by doing just COW support in this PR, and add the other pieces, such as MOR read-optimized query and ORC support, in later iterations.
We exclude this schema for safety. As I tested, meta fields like `_hoodie_commit_time`, `_hoodie_commit_seqno`, `_hoodie_record_key`, `_hoodie_partition_path`, and `_hoodie_file_name` work fine, but I am not sure whether other `_hoodie`-prefixed columns are okay. We used to have problems with this part in other formats. I'm okay with removing this check if you can confirm there is no problem with other schemas.
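To illustrate the doc comment requested above, here is a hedged sketch of how the validation method could be documented; the `ValidationResult.notOk` factory and the delegation to `super` are assumptions for illustration, not necessarily the exact helpers used in Gluten.

```scala
/**
 * Fail validation (i.e. fall back to vanilla Spark) when the required schema
 * selects any Hudi meta field, that is, a column whose name starts with
 * "_hoodie" (e.g. _hoodie_commit_time). Offloading has only been verified for
 * plain data columns, so meta-field scans are conservatively rejected.
 */
override protected def doValidateInternal(): ValidationResult = {
  if (requiredSchema.fields.exists(_.name.startsWith("_hoodie"))) {
    // Assumed factory name; the real API may differ.
    ValidationResult.notOk("Hudi meta fields (_hoodie*) are not offloaded yet")
  } else {
    super.doValidateInternal()
  }
}
```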
@xushiyan any more questions?
@yma11 Thanks for the PR. Did we also test MOR tables with read-optimized queries, or do we only support COW? Please refer to https://hudi.apache.org/docs/next/table_types/
@yma11 can you please address the comments? TY
pom.xml
@@ -155,7 +156,8 @@
<iceberg.version>1.3.1</iceberg.version>
<delta.package.name>delta-core</delta.package.name>
<delta.version>2.3.0</delta.version>
<delta.binary.version>23</delta.binary.version>
<hudi.version>0.14.1</hudi.version>
we can update this to 0.15.0
disableBucketedScan
) {

override lazy val fileFormat: ReadFileFormat = ReadFileFormat.ParquetReadFormat
Will it be easy to add ORC support too?
Not quite. As far as I know, Velox doesn't support ORC yet.
@Amar1404 Yes, we should land this soon. cc @leesf @YannByron, please also give input on the PR when you have a chance.
LGTM
* [GLUTEN-5471][VL]feat: Support read Hudi COW table
* Refine code and fix UTs

Co-authored-by: Shiyan Xu <[email protected]>
What changes were proposed in this pull request?
Support reading Hudi COW tables. This PR is updated based on a previous PR. Thanks @xushiyan for the contribution.
(Fixes: #5471)
How was this patch tested?
New UTs added.
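For context on what such a UT exercises, below is a minimal, hypothetical sketch of reading a Hudi COW table through Spark; the table path, selected columns, and session setup are illustrative assumptions, not the actual test code.

```scala
// Hypothetical sketch of a Hudi COW read; paths, columns, and configs are illustrative only.
import org.apache.spark.sql.SparkSession

object HudiCowReadExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hudi-cow-read")
      // Gluten/Velox offload settings would normally be configured here as well.
      .getOrCreate()

    // Load a COW table previously written with the Hudi datasource.
    val df = spark.read.format("hudi").load("/tmp/hudi_cow_table")

    // Select only plain data columns; Hudi meta fields (_hoodie*) are left out
    // because the scan offload is validated only for data columns.
    df.select("id", "name", "price").show()

    spark.stop()
  }
}
```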