Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-8232] Adding table config to track list of cols indexed with col stats partition in MDT #12493

Closed
wants to merge 1 commit into from

Conversation

nsivabalan
Copy link
Contributor

Change Logs

  • Added a table config ("hoodie.table.col.stats.indexed.columns") to track the list of columns indexed with column stats partition in Metadata table. This will be later leveraged during pruning to only lookup in col stats partition for indexed cols (HUDI-8663)

Impact

This will help track the actual list of columns indexed with col stats partition in MDT.

Risk level (write none, low medium or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Dec 16, 2024
@hudi-bot
Copy link

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Copy link
Member

@codope codope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nsivabalan Wdyt about tracking the indexed columns in index definition file instead of table config? Tracking in table config increases the complexity of the table config, especially if there are frequent updates or large lists of columns.
It also fits well with separation of concern. We can just keep the available indexes in table config and any index specific metadata in index definition file.

There would be some work required though. Currently, we do not create index definition file upront. It gets created only when secondary or expression index gets created. Dev effort aside, from a design standpoint, do you think tracking in index definition file is a good idea?

@@ -284,4 +288,33 @@ protected void writeTableMetadata(HoodieTable table, String instantTime, HoodieC
}
}
}

public static void updateColsToIndex(HoodieTable dataTable, HoodieWriteConfig config, HoodieCommitMetadata commitMetadata) {
if (config.getMetadataConfig().isColumnStatsIndexEnabled()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Snce this gets called after writing the completed instant, the table config hoodie.table.metadata.partitions must have been updated. So, instead of using the metadata config, can we use metaClient().getTableConfig().getMetadataPartitions().contains(MetadataPartitionType.COLUMN_STATS.getPartitionPath()) ?

latestInstant,
mdtMetaClient.getActiveTimeline().getInstantDetails(latestInstant).get(),
HoodieCommitMetadata.class);
if (mdtCommitMetadata.getPartitionToWriteStats().containsKey(MetadataPartitionType.COLUMN_STATS.getPartitionPath())) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this check needed? I think the check on line 309 whether columns indexed has changed or not is sufficient.

latestInstant,
mdtMetaClient.getActiveTimeline().getInstantDetails(latestInstant).get(),
HoodieCommitMetadata.class);
if (mdtCommitMetadata.getPartitionToWriteStats().containsKey(MetadataPartitionType.COLUMN_STATS.getPartitionPath())) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another issue is that the check is costly. If not for this check, then we don't need mdtCommitMetadata and hence don't need to create mdtMetaClient. If we strictly need this check, please consider caching the meta client or reusing an existing instance if possible.

.setStorage(dataTable.getStorage())
.setBasePath(HoodieTableMetadata.getMetadataTableBasePath(dataTable.getMetaClient().getBasePath()))
.build();
HoodieInstant latestInstant = mdtMetaClient.getActiveTimeline().filterCompletedInstants().getInstantsOrderedByCompletionTime().reduce((a,b) -> b).get();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why call reduce? Wouldn't mdtMetaClient.getActiveTimeline().filterCompletedInstants().lastInstant() be sufficient?

// update data table's table config for list of columns indexed.
List<String> columnsToIndex = HoodieTableMetadataUtil.getColumnsToIndex(commitMetadata, dataTable.getMetaClient(), config.getMetadataConfig());
// if col stats is getting updated, lets also update list of columns indexed if changed.
if (!dataTable.getMetaClient().getTableConfig().getTableColStatsIndexedColumns().equals(columnsToIndex)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how are nested fields getting tracked? Is it like a.b? Asking bcoz if a nested field has same name as any other non-nested field, then equals check could return unexpected result.

@@ -284,4 +288,33 @@ protected void writeTableMetadata(HoodieTable table, String instantTime, HoodieC
}
}
}

public static void updateColsToIndex(HoodieTable dataTable, HoodieWriteConfig config, HoodieCommitMetadata commitMetadata) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though it is safe to update in multi-writer scenario as this is still called within a transaction, I am thinking should we do all table config updates in one shot instead opening and closing the file multiple times? Let's say if this was the first commit building the colstats, then table config would be update multiple times, isn't it?

@@ -1364,24 +1342,47 @@ public static HoodieData<HoodieRecord> convertMetadataToColumnStatsRecords(Hoodi
}
}

public static List<String> getColumnsToIndex(HoodieCommitMetadata commitMetadata,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though this is getting exercised in the functional tests in TestColumnStatsIndex, it would be nice to make it visible for testing and add a UT for this method in TestHoodieTableMetadataUtil if possible.

@@ -336,6 +338,12 @@ public class HoodieTableConfig extends HoodieConfig {
.withDocumentation("Comma-separated list of metadata partitions that have been completely built and in-sync with data table. "
+ "These partitions are ready for use by the readers");

public static final ConfigProperty<String> TABLE_COL_STATS_INDEXED_COLUMNS = ConfigProperty
.key("hoodie.table.col.stats.indexed.columns")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.key("hoodie.table.col.stats.indexed.columns")
.key("hoodie.table.metadata.column.stats.indexed.columns")

@@ -477,7 +482,14 @@ private void initializeFromFilesystem(String initializationTime, List<MetadataPa
HoodieData<HoodieRecord> records = fileGroupCountAndRecordsPair.getValue();
bulkCommit(instantTimeForPartition, partitionName, records, fileGroupCount);
metadataMetaClient.reloadActiveTimeline();

if (partitionType == COLUMN_STATS) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's use MetadataPartitionType.COLUMN_STATS enum as much as possible.

if (!dataMetaClient.getTableConfig().getTableColStatsIndexedColumns().equals(columnsToIndex)) {
LOG.info(String.format("List of columns to index is changing. Old value %s. New value %s", dataMetaClient.getTableConfig().getTableColStatsIndexedColumns(),
columnsToIndex));
dataMetaClient.getTableConfig().setColStatsIndexedColumns(dataMetaClient, columnsToIndex);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're anyway going to check and update post commit, why doing it here?

@nsivabalan
Copy link
Contributor Author

Re-worked on this patch here #12529

@nsivabalan nsivabalan marked this pull request as draft December 20, 2024 04:24
@nsivabalan nsivabalan closed this Jan 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size:L PR with lines of changes in (300, 1000]
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants