[HUDI-8680] Enable partition stats by default and add support for additional data types with col stats and partition stats #12511

Draft
wants to merge 34 commits into
base: master

nsivabalan (Contributor) commented Dec 18, 2024

Change Logs

  • Enables partition stats by default on the writer for the Spark engine.
  • Adds support for Timestamp, Date, LocalDate, and Decimal types in col stats and partition stats.
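As a rough illustration of why these types can take part in min/max statistics, the sketch below computes a value range over any mutually comparable values. It is not Hudi code: `ColumnRangeSketch`, `min`, and `max` are hypothetical names, and skipping nulls is an assumption made only for this example.

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hudi: min/max over comparable column values.
class ColumnRangeSketch {

  // The bound is Comparable<? super T> because LocalDate implements
  // Comparable<ChronoLocalDate>, not Comparable<LocalDate>.
  static <T extends Comparable<? super T>> T min(List<T> values) {
    T result = null;
    for (T value : values) {
      // Assumption for this sketch: nulls do not contribute to the range.
      if (value != null && (result == null || value.compareTo(result) < 0)) {
        result = value;
      }
    }
    return result;
  }

  static <T extends Comparable<? super T>> T max(List<T> values) {
    T result = null;
    for (T value : values) {
      if (value != null && (result == null || value.compareTo(result) > 0)) {
        result = value;
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<LocalDate> dates = Arrays.asList(LocalDate.of(2024, 12, 18), null, LocalDate.of(2025, 1, 7));
    List<BigDecimal> amounts = Arrays.asList(new BigDecimal("10.50"), new BigDecimal("2.5"));
    System.out.println(min(dates) + " .. " + max(dates));
    System.out.println(min(amounts) + " .. " + max(amounts));
  }
}
```

The same two methods work unchanged for `java.time.Instant` or `Long` values, since only `compareTo` is required.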

A few features/tests where the partition stats index (PSI) is disabled:

Impact

  • Enables partition stats by default on writer (SPARK Engine)

Risk level (write none, low medium or high below)

medium

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Dec 18, 2024
@nsivabalan nsivabalan force-pushed the enablePartitionStatsByDefault2 branch from 43c0a07 to 9ae8233 Compare December 18, 2024 15:02
    Option<HoodieRecordType> recordType) {
  List<String> columnsToIndex = metadataConfig.getColumnsEnabledForColumnStatsIndex();
  if (!columnsToIndex.isEmpty()) {
    if (freshTable) {
Contributor Author

For a very fresh table, we may not have a table schema to validate against. So, just honor the overridden columns to index. A subsequent commit will auto-correct the list of columns to index if need be.
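The behavior described in this comment could be sketched roughly as follows. This is a standalone illustration, not the code in this PR: `ColumnsToIndexSketch`, `resolveColumnsToIndex`, and the `schemaFields` parameter are hypothetical, and filtering against the schema for an existing table is an assumption about what "validate" means here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not Hudi code: decide which configured columns to index.
class ColumnsToIndexSketch {

  static List<String> resolveColumnsToIndex(List<String> configuredColumns,
                                            boolean freshTable,
                                            Set<String> schemaFields) {
    // Fresh table: no table schema to validate against yet, so honor the
    // configured list as-is. An empty list is also returned unchanged; this
    // sketch does not model what happens when nothing is configured.
    if (configuredColumns.isEmpty() || freshTable) {
      return new ArrayList<>(configuredColumns);
    }
    // Existing table (assumption): keep only columns present in the schema.
    List<String> validated = new ArrayList<>();
    for (String column : configuredColumns) {
      if (schemaFields.contains(column)) {
        validated.add(column);
      }
    }
    return validated;
  }

  public static void main(String[] args) {
    List<String> configured = Arrays.asList("ts", "missing_col");
    Set<String> schema = new HashSet<>(Arrays.asList("ts", "price"));
    System.out.println(resolveColumnsToIndex(configured, true, new HashSet<>()));
    System.out.println(resolveColumnsToIndex(configured, false, schema));
  }
}
```

In this sketch, the second call plays the role of the "subsequent commit": once a schema exists, the invalid column drops out of the list.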

    .map(colStatsOpt -> colStatsOpt.get())
    .filter(stats -> fileNames.contains(stats.getFileName()))
    .map(HoodieColumnRangeMetadata::fromColumnStats).collectAsList();
if (!partitionColumnMetadata.isEmpty()) {
Contributor Author

No changes as such; just added this if block.

@nsivabalan nsivabalan changed the title [HUDI-8680][DNM] Enable partition stats by default 2 [HUDI-8680][DNM] Enable partition stats by default and add support for additional data types with col stats and partition stats Dec 30, 2024
@nsivabalan nsivabalan changed the title [HUDI-8680][DNM] Enable partition stats by default and add support for additional data types with col stats and partition stats [HUDI-8680] Enable partition stats by default and add support for additional data types with col stats and partition stats Dec 30, 2024
@@ -1487,4 +1488,20 @@ public static Comparable<?> unwrapAvroValueWrapper(Object avroValueWrapper) {
      throw new UnsupportedOperationException(String.format("Unsupported type of the value (%s)", avroValueWrapper.getClass()));
    }
  }

  public static Comparable<?> unwrapAvroValueWrapper(Object avroValueWrapper, String wrapperClassName) {
Member

Let's add a unit test for this method.
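A minimal sketch of the kind of cases such a test might cover is shown below. It does not call the real method: `UnwrapByNameSketch.unwrap` is a hypothetical stand-in that dispatches on a name string, the wrapper names are illustrative labels only, and the raw-value encodings (epoch days, epoch microseconds) are assumptions for the example.

```java
import java.math.BigDecimal;
import java.time.Instant;
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical stand-in, not Hudi code: unwrap a raw value based on a type name.
class UnwrapByNameSketch {

  static Comparable<?> unwrap(Object rawValue, String wrapperName) {
    switch (wrapperName) {
      case "DateWrapper":
        // Assumed encoding: days since the epoch, stored as an Integer.
        return LocalDate.ofEpochDay(((Integer) rawValue).longValue());
      case "TimestampMicrosWrapper":
        // Assumed encoding: microseconds since the epoch, stored as a Long.
        return Instant.EPOCH.plus((Long) rawValue, ChronoUnit.MICROS);
      case "DecimalWrapper":
        return (BigDecimal) rawValue;
      case "LongWrapper":
        return (Long) rawValue;
      default:
        throw new UnsupportedOperationException(
            String.format("Unsupported wrapper type (%s)", wrapperName));
    }
  }

  public static void main(String[] args) {
    System.out.println(unwrap(19000, "DateWrapper"));
    System.out.println(unwrap(1_500_000L, "TimestampMicrosWrapper"));
  }
}
```

A real unit test would presumably exercise each supported wrapper type plus the unsupported-type path, which is the shape the checks in this sketch follow.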

@hudi-bot

hudi-bot commented Jan 7, 2025

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@nsivabalan
Contributor Author

Raised a separate patch for enabling col stats: #12595

PSI needs more tests to be added, and I don't want to drag this PR out. So, let's review and land col stats for now and follow up with PSI after that.

@nsivabalan nsivabalan marked this pull request as draft January 8, 2025 01:25