Allow HiveSplit info columns like '$file_size' and '$file_modified_time' to be queried in SQL #8800

Closed

aditi-pandit
Collaborator

@aditi-pandit aditi-pandit commented Feb 19, 2024

$file_size and $file_modified_time are queryable synthesized columns for Hive tables in Presto. Spark also has a number of such queryable synthesized columns (#7880).

The coordinator passes these columns to the worker in the HiveSplit.

i) Velox HiveSplit needs to be enhanced to receive the file size and file modified time metadata from Prestissimo in a generic map data structure of (column name, value).
ii) SplitReader should populate these values into the TableScanOperator output buffers.

This also needs a Prestissimo change to populate the HiveSplit with this info, which is sent in the fragment: prestodb/presto#21965
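
As a rough illustration of points i) and ii), the sketch below models a split that carries a generic (column name, value) map and shows how a reader could turn one entry into a constant output column. It is not the Velox implementation: `SplitSketch`, `infoColumnAsBigint` and `makeConstantColumn` are hypothetical names, and serializing the values as strings is an assumption made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a Hive split; this is not the Velox class.
struct SplitSketch {
  std::string filePath;
  // Synthesized column name -> value. Values are assumed to be sent as
  // strings and converted by the reader.
  std::unordered_map<std::string, std::string> infoColumns;
};

// Looks up a synthesized BIGINT column for this split. Returns std::nullopt
// when the split does not carry a value for that column.
std::optional<int64_t> infoColumnAsBigint(
    const SplitSketch& split,
    const std::string& name) {
  auto it = split.infoColumns.find(name);
  if (it == split.infoColumns.end()) {
    return std::nullopt;
  }
  return static_cast<int64_t>(std::stoll(it->second));
}

// Every row read from one split sees the same value, so the output column
// is simply that value repeated numRows times.
std::vector<int64_t> makeConstantColumn(int64_t value, std::size_t numRows) {
  return std::vector<int64_t>(numRows, value);
}
```

With a split whose map holds `$file_size` set to `"1024"`, `infoColumnAsBigint(split, "$file_size")` yields 1024, and `makeConstantColumn(1024, n)` gives the n-row column a scan would emit for that split. A production reader would more likely use a constant-encoded vector than materialize every row.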

Fixes prestodb/presto#21867

@gaoyangxiaozhu will have a follow-up PR on the Spark integration.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 19, 2024

@gaoyangxiaozhu
Contributor

Hey @aditi-pandit, I also have a similar PR, #7880, that lets Velox query the file metadata columns the Spark engine supports for Hive tables (file_path, file_size, file_name, file_modify_time, file_block_start, file_block_end, etc.).

Maybe we can work together to see whether the change can support both engines, Presto and Spark?

@gaoyangxiaozhu
Contributor

Hey @aditi-pandit, maybe change the PR title to "Allow info columns for HiveSplits to be queried in SQL"?

@aditi-pandit changed the title from "Allow '$file_size' and '$file_modified_time' for HiveSplits to be queried in SQL" to "Allow HiveSplit info columns like '$file_size' and '$file_modified_time' to be queried in SQL" on Feb 27, 2024
@aditi-pandit
Collaborator Author

@Yuhta @majetideepak : PTAL.

@facebook-github-bot
Contributor

@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@aditi-pandit
Collaborator Author

@Yuhta: Do you need help with the linter error? Could you please give me more info about it?

@facebook-github-bot
Contributor

@Yuhta merged this pull request in b9afa14.

@aditi-pandit aditi-pandit deleted the hive_file_metadata branch March 5, 2024 21:22
PHILO-HE added a commit to PHILO-HE/velox that referenced this pull request Mar 7, 2024
PHILO-HE added a commit to PHILO-HE/velox that referenced this pull request Mar 7, 2024
…file_modified_time' to be queried in SQL (facebookincubator#8800)""

This reverts commit d3dc172.
@@ -364,11 +378,13 @@ std::shared_ptr<common::ScanSpec> makeScanSpec(
   // SelectiveColumnReader doesn't support constant columns with filters,
   // hence, we can't have a filter for a $path or $bucket column.
   //
-  // Unfortunately, Presto happens to specify a filter for $path or
-  // $bucket column. This filter is redundant and needs to be removed.
+  // Unfortunately, Presto happens to specify a filter for $path, $file_size,
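
The code comment in this hunk says that filters on constant (synthesized) columns cannot reach the column reader and have to be removed. A minimal sketch of that removal step follows. It does not reproduce `makeScanSpec`: `FilterMap` and `dropConstantColumnFilters` are hypothetical names, and keeping each predicate as plain text is an assumption made to keep the example self-contained.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical representation: filters keyed by column name, with the
// predicate kept as plain text for illustration only.
using FilterMap = std::unordered_map<std::string, std::string>;

// Erases every filter whose column is materialized as a per-split constant,
// so that no such filter is handed to the column reader. Returns the number
// of filters removed.
std::size_t dropConstantColumnFilters(
    FilterMap& filters,
    const std::unordered_set<std::string>& constantColumns) {
  std::size_t removed = 0;
  for (auto it = filters.begin(); it != filters.end();) {
    if (constantColumns.count(it->first) > 0) {
      it = filters.erase(it); // erase returns the next valid iterator
      ++removed;
    } else {
      ++it;
    }
  }
  return removed;
}
```

The caller supplies the set of constant column names, for example `$path`, `$bucket` and `$file_size` as named in the comment. Filters on regular data columns are left untouched.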

Just wondering if there is an issue for this on the Presto side?

Joe-Abraham pushed a commit to Joe-Abraham/velox that referenced this pull request Jun 7, 2024
…me' to be queried in SQL (facebookincubator#8800)

Pull Request resolved: facebookincubator#8800

Reviewed By: mbasmanova

Differential Revision: D54512245

Pulled By: Yuhta

fbshipit-source-id: 190a97f9fcb1e869fff82e0a2264d57f9915376e