
[HUDI-8631] Fix the bug where the Flink table config hoodie.populate.meta.fields is not effective and optimize write performance #12404

Open · wants to merge 5 commits into base: master

Conversation

usberkeley (Contributor)
Change Logs

1. Fix the bug

  • Persist hoodie.populate.meta.fields into the table config (hoodie.properties), where it was previously missing
  • Strengthen the related validity checks

2. Optimize write performance

  • Skip unnecessary decoding overhead when meta fields are disabled (see the sketch after this list)
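To illustrate the idea behind the optimization (a hypothetical sketch in plain Java, not the PR's actual diff; HoodieAvroUtils.rewriteRecord is an existing Hudi helper, everything else here is illustrative): when meta fields are disabled, the incoming record can be passed through unchanged instead of being decoded and rewritten into the widened schema that carries the _hoodie_* columns.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hudi.avro.HoodieAvroUtils;

public final class MetaFieldRewriteSketch {
  // Fast path: with populateMetaFields=false there is nothing to patch in,
  // so the record never has to be rewritten into the meta-field schema.
  public static GenericRecord rewriteIfNeeded(GenericRecord record,
                                              Schema schemaWithMetaFields,
                                              boolean populateMetaFields) {
    if (!populateMetaFields) {
      return record; // skip the decode/re-encode entirely
    }
    // Meta fields enabled: rewrite into the schema containing the _hoodie_* columns.
    return HoodieAvroUtils.rewriteRecord(record, schemaWithMetaFields);
  }
}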

Impact

Improves write performance. After optimization, the same ingest with hoodie.populate.meta.fields=false takes 42.9% less wall-clock time than with hoodie.populate.meta.fields=true (see the benchmark below).

Testing method
Consume from the earliest offset in Kafka until all messages are consumed (Kafka lag = 0), and compare the time taken in both configurations.

1) populate meta fields
time taken: 21 hours 25 minutes (1,285 minutes)

2) no meta fields
time taken: 12 hours 14 minutes (734 minutes; (1285 - 734) / 1285 ≈ 42.9% less time)

Risk level (write none, low medium or high below)

medium

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@usberkeley usberkeley marked this pull request as draft December 3, 2024 06:28
usberkeley (Contributor Author) commented Dec 3, 2024

Dear Danny, could you please take a look at this PR? @danny0405

Summary

1. When populateMetaFields is disabled, why must the number of record key fields be exactly one?

The number of record key fields must be exactly one, i.e. KeyGeneratorType.SIMPLE. With meta fields disabled there is no _hoodie_record_key column to read, so the Log Scanner has to regenerate the record key from the HoodieRecord itself, which it can only do directly for a single key field.
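To make the constraint concrete (a hypothetical sketch; the class and method names are illustrative, not the PR's code): with a SIMPLE key the scanner can read the key straight out of the Avro payload, whereas a COMPLEX key would need the full KeyGenerator machinery to combine and encode multiple fields.

import org.apache.avro.generic.GenericRecord;

public final class SimpleKeySketch {
  // SIMPLE key generator case: the record key is the value of a single field,
  // so it can be rebuilt from the payload without any KeyGenerator class.
  public static String regenerateRecordKey(GenericRecord payload, String recordKeyField) {
    Object value = payload.get(recordKeyField);
    if (value == null) {
      throw new IllegalStateException("Record key field is null: " + recordKeyField);
    }
    return value.toString();
  }
}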

2. Why not support KeyGeneratorType.COMPLEX in this PR?

The KeyGenerator implementation classes are spread across the hudi-client-common and hudi-spark-client packages, while the Log Scanner lives in the hudi-common package, so the implementations cannot be imported into the Log Scanner directly.

Another approach would be to pass a KeyGenerator instance in when constructing the Log Scanner. However, that would require modifying every module that uses the Log Scanner, over 30 call sites. Some of them (such as the Hudi Reader, which does not depend on hudi-client-common) could not be changed this way at all, and it would also make the Log Scanner more complex to use.

3. When and how will KeyGeneratorType.COMPLEX be supported?

A follow-up PR will support it by consolidating the KeyGenerator implementation classes from hudi-client-common (Flink) and hudi-spark-client (Spark) into hudi-common. I think the KeyGenerator implementations are fundamental and belong in hudi-common; once they are there, the Log Scanner can also support KeyGeneratorType.COMPLEX when meta fields are disabled.

4. Why is allowOperationMetadataField honored only when populateMetaFields is enabled?

  1. Disabling populateMetaFields removes the performance overhead of decoding HoodieRecords. If allowOperationMetadataField were still honored, that decoding cost would remain even with populateMetaFields disabled, so the two settings are intertwined performance-wise.
  2. Both govern metadata fields: populateMetaFields is the master switch, while allowOperationMetadataField merely toggles one specific metadata field. When the master switch is off, the sub-switches should have no effect (see the sketch below).
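A minimal sketch of that guard (hypothetical check; ValidationUtils.checkArgument is an existing Hudi helper, but this exact validation and its message are illustrative):

import org.apache.hudi.common.util.ValidationUtils;

public final class MetaFieldConfigCheck {
  // The operation metadata field only takes effect when the meta-field master
  // switch is on, so reject the inconsistent combination up front.
  public static void validate(boolean populateMetaFields, boolean allowOperationMetadataField) {
    ValidationUtils.checkArgument(populateMetaFields || !allowOperationMetadataField,
        "allowOperationMetadataField requires populateMetaFields to be enabled");
  }
}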

5. In #11028 we already fixed the unnecessary rewrite when the schemas are exactly the same; is your benchmark based on that fix?

Yes.

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Dec 3, 2024
hudi-bot commented Dec 3, 2024

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@usberkeley usberkeley marked this pull request as ready for review December 3, 2024 11:11
@yihua yihua self-assigned this Dec 13, 2024
geserdugarov (Contributor) commented Dec 18, 2024

@usberkeley, @yihua, @danny0405
I would like to ask why we need any performance optimization here, when for Flink the first thing to do is to support hoodie.populate.meta.fields at all. Please correct me if I'm wrong.

The hoodie.populate.meta.fields option is meant to be used only in append mode, so we need to look at:

if (OptionsResolver.isAppendMode(conf)) {
  // close compaction for append mode
  conf.set(FlinkOptions.COMPACTION_SCHEDULE_ENABLED, false);
  DataStream<Object> pipeline = Pipelines.append(conf, rowType, dataStream);
  // ...
}

which leads into the append write path.

But at this stage the file writer has already been initialized with a row type that had the metadata fields patched in:

this.rowType = preserveHoodieMetadata ? rowType : addMetadataFields(rowType, writeConfig.allowOperationMetadataField()); // patch up with metadata fields

So we could use the row type without metadata columns and, in append mode, write only the original Flink row data (a sketch follows below).
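A minimal sketch of that idea (hypothetical; addMetadataFields and writeConfig come from the snippet above, and populateMetaFields stands for the resolved value of hoodie.populate.meta.fields):

// Choose the writer's row type based on the meta-fields switch instead of
// always widening it with the _hoodie_* columns.
RowType writerRowType = populateMetaFields
    ? addMetadataFields(rowType, writeConfig.allowOperationMetadataField())
    : rowType;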

geserdugarov (Contributor) commented Dec 18, 2024

@usberkeley, @yihua, @danny0405
Sorry to be difficult, but if I understand correctly, this PR proposes optimizing the Flink write path when hoodie.populate.meta.fields = false. If so, I think we should do only that and not mix it with other optimizations.

To support hoodie.populate.meta.fields = false in Flink, it's enough to implement changes like #12516.

Alternatively, could this PR become a refactoring and optimization of everything related to hoodie.populate.meta.fields in general, rather than the support of hoodie.populate.meta.fields in Flink?

geserdugarov (Contributor) commented Dec 19, 2024

@usberkeley, if you don't mind, it would be better to reopen and use the already created HUDI-8308 for the optimizations.

usberkeley (Contributor Author) commented:

> @usberkeley, if you don't mind, it would be better to reopen and use the already created HUDI-8308 for the optimizations.

Dear geserdugarov, that looks great.
Sure, as long as it helps the community, we'll choose the right solution.

I've been busy with a major feature recently, so I'll take a closer look at your PR a bit later.
