[HUDI-8631] Support of `hoodie.populate.meta.fields` for Flink append mode (#12516)

Merged: danny0405 merged 2 commits into apache:master from geserdugarov:master-flink-populate-meta on Dec 21, 2024.
---

Isn't the flag `preserveHoodieMetadata` already controlling this behavior? BTW, there is another PR raised by @usberkeley to fix all the scenarios: #12404

---
Yes, I saw #12404 but was confused by the ticket name, which mentions the Flink table config `hoodie.populate.meta.fields`, and I didn't find any changes in `hudi-flink-client` or `hudi-flink-datasource` that would change the current behavior. I described my point of view in #12404 (comment). To support that comment, I've created this MR, which shows the lack of support for `hoodie.populate.meta.fields` in Flink.

---
`preserveHoodieMetadata` is actually used in the code as an indicator of whether to take the metadata from the row data or to generate it by calling the corresponding methods, which makes the naming a bit confusing. I believe the possible values of `preserveHoodieMetadata` can be described by this schema: [schema diagram omitted]. It looks like `preserveHoodieMetadata` can be true only for the clustering operator.

---
It looks like the flag `preserveHoodieMetadata` indicates whether the source row already includes the metadata fields. For a table service like clustering, it should be true by default, because clustering is just a rewrite; for a regular write, the metadata fields should be generated on the fly. Let's check in which cases the option `hoodie.populate.meta.fields` could be false.

---
The user sets a value for the `hoodie.populate.meta.fields` option, which is `true` by default. The description of this config mentions "append only/immutable data" as a use case (see `hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java`, lines 261 to 265 in 9da3221). For this reason, in this MR I supported `hoodie.populate.meta.fields` in Flink only for append mode.

For a quick check, I use SQL queries like the following ones for append mode. Expected results: there are no exceptions during the writes, and the corresponding parquet files in HDFS don't contain the metadata columns.
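The actual queries did not survive in this excerpt. A minimal Flink SQL sketch of such an append-mode check might look like the following; the table name, path, and schema are illustrative assumptions, not the queries from the PR:

```sql
-- Hypothetical append-mode (insert) COW table with metadata fields disabled;
-- 'hoodie.populate.meta.fields' = 'false' is the option under test.
CREATE TABLE t1 (
  id INT,
  name VARCHAR(10),
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///tmp/hudi/t1',
  'table.type' = 'COPY_ON_WRITE',
  'write.operation' = 'insert',            -- append mode in Flink
  'hoodie.populate.meta.fields' = 'false'  -- skip writing the _hoodie_* columns
);

INSERT INTO t1 VALUES (1, 'Danny', TIMESTAMP '2024-12-01 00:00:01');

-- The written parquet files should then contain only id, name, ts.
SELECT * FROM t1;
```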
---

I've also found that we can write a MOR table in upsert mode without metadata; the call stack in this case includes `HoodieAppendHandle`. But we can't read the resulting MOR table with Flink later, due to an exception thrown during the read. I've created a separate bug for reading MOR tables without meta columns, HUDI-8785, and will fix it in a separate MR.
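The MOR upsert case described above could be sketched like this (again, the table name, path, and schema are assumptions for illustration, not the actual statements used):

```sql
-- Hypothetical MOR table written in upsert mode with metadata fields disabled;
-- writes go through HoodieAppendHandle and succeed.
CREATE TABLE t2 (
  id INT,
  name VARCHAR(10),
  ts TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///tmp/hudi/t2',
  'table.type' = 'MERGE_ON_READ',
  'write.operation' = 'upsert',
  'hoodie.populate.meta.fields' = 'false'
);

INSERT INTO t2 VALUES (1, 'Danny', TIMESTAMP '2024-12-01 00:00:01');

-- Reading the table back with Flink then fails (tracked as HUDI-8785).
SELECT * FROM t2;
```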
---

Can we at least add an integration test in `ITTestHoodieDataSource`?
---

@danny0405 I've added `ITTestHoodieDataSource::testWriteWithoutMetaColumns`. But for proper checking, it would be great to write the data with Flink and then read it with Spark, because reading the table in Spark returns all columns, including the metadata ones. That would be a really useful check of engine interoperability, so I've created a corresponding task: HUDI-8788.
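The cross-engine check proposed here could be sketched in Spark SQL roughly as follows (the table name and location are illustrative assumptions, reusing the hypothetical Flink-written table path):

```sql
-- Hypothetical Spark SQL read of a table written by Flink.
CREATE TABLE t1_spark USING hudi LOCATION 'hdfs:///tmp/hudi/t1';

-- In Spark, a star select returns all columns, including the _hoodie_*
-- metadata columns whenever they are present in the files.
SELECT * FROM t1_spark;
```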