-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HUDI-8232] Adding table config to track list of cols indexed with col stats partition in MDT #12493
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nsivabalan Wdyt about tracking the indexed columns in index definition file instead of table config? Tracking in table config increases the complexity of the table config, especially if there are frequent updates or large lists of columns.
It also fits well with separation of concern. We can just keep the available indexes in table config and any index specific metadata in index definition file.
There would be some work required though. Currently, we do not create index definition file upront. It gets created only when secondary or expression index gets created. Dev effort aside, from a design standpoint, do you think tracking in index definition file is a good idea?
@@ -284,4 +288,33 @@ protected void writeTableMetadata(HoodieTable table, String instantTime, HoodieC | |||
} | |||
} | |||
} | |||
|
|||
public static void updateColsToIndex(HoodieTable dataTable, HoodieWriteConfig config, HoodieCommitMetadata commitMetadata) { | |||
if (config.getMetadataConfig().isColumnStatsIndexEnabled()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Snce this gets called after writing the completed instant, the table config hoodie.table.metadata.partitions
must have been updated. So, instead of using the metadata config, can we use metaClient().getTableConfig().getMetadataPartitions().contains(MetadataPartitionType.COLUMN_STATS.getPartitionPath())
?
latestInstant, | ||
mdtMetaClient.getActiveTimeline().getInstantDetails(latestInstant).get(), | ||
HoodieCommitMetadata.class); | ||
if (mdtCommitMetadata.getPartitionToWriteStats().containsKey(MetadataPartitionType.COLUMN_STATS.getPartitionPath())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this check needed? I think the check on line 309 whether columns indexed has changed or not is sufficient.
latestInstant, | ||
mdtMetaClient.getActiveTimeline().getInstantDetails(latestInstant).get(), | ||
HoodieCommitMetadata.class); | ||
if (mdtCommitMetadata.getPartitionToWriteStats().containsKey(MetadataPartitionType.COLUMN_STATS.getPartitionPath())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another issue is that the check is costly. If not for this check, then we don't need mdtCommitMetadata
and hence don't need to create mdtMetaClient
. If we strictly need this check, please consider caching the meta client or reusing an existing instance if possible.
.setStorage(dataTable.getStorage()) | ||
.setBasePath(HoodieTableMetadata.getMetadataTableBasePath(dataTable.getMetaClient().getBasePath())) | ||
.build(); | ||
HoodieInstant latestInstant = mdtMetaClient.getActiveTimeline().filterCompletedInstants().getInstantsOrderedByCompletionTime().reduce((a,b) -> b).get(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why call reduce
? Wouldn't mdtMetaClient.getActiveTimeline().filterCompletedInstants().lastInstant()
be sufficient?
// update data table's table config for list of columns indexed. | ||
List<String> columnsToIndex = HoodieTableMetadataUtil.getColumnsToIndex(commitMetadata, dataTable.getMetaClient(), config.getMetadataConfig()); | ||
// if col stats is getting updated, lets also update list of columns indexed if changed. | ||
if (!dataTable.getMetaClient().getTableConfig().getTableColStatsIndexedColumns().equals(columnsToIndex)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how are nested fields getting tracked? Is it like a.b
? Asking bcoz if a nested field has same name as any other non-nested field, then equals
check could return unexpected result.
@@ -284,4 +288,33 @@ protected void writeTableMetadata(HoodieTable table, String instantTime, HoodieC | |||
} | |||
} | |||
} | |||
|
|||
public static void updateColsToIndex(HoodieTable dataTable, HoodieWriteConfig config, HoodieCommitMetadata commitMetadata) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though it is safe to update in multi-writer scenario as this is still called within a transaction, I am thinking should we do all table config updates in one shot instead opening and closing the file multiple times? Let's say if this was the first commit building the colstats, then table config would be update multiple times, isn't it?
@@ -1364,24 +1342,47 @@ public static HoodieData<HoodieRecord> convertMetadataToColumnStatsRecords(Hoodi | |||
} | |||
} | |||
|
|||
public static List<String> getColumnsToIndex(HoodieCommitMetadata commitMetadata, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though this is getting exercised in the functional tests in TestColumnStatsIndex
, it would be nice to make it visible for testing and add a UT for this method in TestHoodieTableMetadataUtil
if possible.
@@ -336,6 +338,12 @@ public class HoodieTableConfig extends HoodieConfig { | |||
.withDocumentation("Comma-separated list of metadata partitions that have been completely built and in-sync with data table. " | |||
+ "These partitions are ready for use by the readers"); | |||
|
|||
public static final ConfigProperty<String> TABLE_COL_STATS_INDEXED_COLUMNS = ConfigProperty | |||
.key("hoodie.table.col.stats.indexed.columns") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.key("hoodie.table.col.stats.indexed.columns") | |
.key("hoodie.table.metadata.column.stats.indexed.columns") |
@@ -477,7 +482,14 @@ private void initializeFromFilesystem(String initializationTime, List<MetadataPa | |||
HoodieData<HoodieRecord> records = fileGroupCountAndRecordsPair.getValue(); | |||
bulkCommit(instantTimeForPartition, partitionName, records, fileGroupCount); | |||
metadataMetaClient.reloadActiveTimeline(); | |||
|
|||
if (partitionType == COLUMN_STATS) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's use MetadataPartitionType.COLUMN_STATS
enum as much as possible.
if (!dataMetaClient.getTableConfig().getTableColStatsIndexedColumns().equals(columnsToIndex)) { | ||
LOG.info(String.format("List of columns to index is changing. Old value %s. New value %s", dataMetaClient.getTableConfig().getTableColStatsIndexedColumns(), | ||
columnsToIndex)); | ||
dataMetaClient.getTableConfig().setColStatsIndexedColumns(dataMetaClient, columnsToIndex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're anyway going to check and update post commit, why doing it here?
Re-worked on this patch here #12529 |
Change Logs
Impact
This will help track the actual list of columns indexed with col stats partition in MDT.
Risk level (write none, low medium or high below)
low
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".
ticket number here and follow the instruction to make
changes to the website.
Contributor's checklist