Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add file metadata columns support for spark parquet #7880

Conversation

gaoyangxiaozhu
Copy link
Contributor

@gaoyangxiaozhu gaoyangxiaozhu commented Dec 5, 2023

Spark support people query file metdata as file_size, file_name, file_path, file_modified_time, file_block_start etc for Hive tables as seperated file metadata column. Checking this for all file metadata supported query by spark https://github.com/apache/spark/blob/081c7a7947a47bf0b2bfd478abdd4b78a1db3ddb/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala#L183C2-L193C56.

The PR extends the HiveSplit with a new parameter metadaColumns to let upsteram computing engine as spark to pass the initialized const file metadata columns (if have) to velox connector split when constructed, it use to fix file metadata columns null issue when intergrating with Velox using Spark.

  1. It let HiveSplit can get file metadata info from spark
  2. The values is populated by by SplitReader into TableScanOperator output.

Checking this issue #8173 for details context.

It is also a dependency of Gluten repository PR apache/incubator-gluten#3870

Copy link

netlify bot commented Dec 5, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 6d8b7d8
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/65be35fb7ccc460008a45ce9

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 5, 2023
@gaoyangxiaozhu gaoyangxiaozhu changed the title add metadatacolumns support for spark hive connector Add metadata columns support for spark parquet Dec 25, 2023
@gaoyangxiaozhu gaoyangxiaozhu changed the title Add metadata columns support for spark parquet Add file metadata columns support for spark parquet Dec 25, 2023
@yingsu00 yingsu00 requested review from yingsu00 and Yuhta January 18, 2024 15:27
@gaoyangxiaozhu gaoyangxiaozhu requested a review from Yuhta January 31, 2024 01:05
@gaoyangxiaozhu
Copy link
Contributor Author

gaoyangxiaozhu commented Feb 21, 2024

@majetideepak also help for review

and I saw looks @aditi-pandit also submit a similar PR #8800 for addressing presdo engine $file_path and $file_modify_time query issue. We can see if can make the PR cover both.

@aditi-pandit
Copy link
Collaborator

@gaoyangxiaozhu : Have couple of high-level comments. There are several missing pieces :

i) A TableScan output operator test like https://github.com/facebookincubator/velox/pull/8800/files#r1498654870. This would introduce the wiring for metadata columns in HiveConnector TestBase classes.
ii) The metadata columns can be used as filters in Presto queries as well. I would imagine Spark would allow this as well. The code should be enhanced for this.
iii) What is the difference between synthesized and metadata columns ? Its not clear to me.

It might be simpler to change #8800 to use metadata_columns parameters for HiveSplit instead. wdyt ?

@gaoyangxiaozhu
Copy link
Contributor Author

gaoyangxiaozhu commented Feb 22, 2024

@gaoyangxiaozhu : Have couple of high-level comments. There are several missing pieces :

i) A TableScan output operator test like https://github.com/facebookincubator/velox/pull/8800/files#r1498654870. This would introduce the wiring for metadata columns in HiveConnector TestBase classes. ii) The metadata columns can be used as filters in Presto queries as well. I would imagine Spark would allow this as well. The code should be enhanced for this. iii) What is the difference between synthesized and metadata columns ? Its not clear to me.

It might be simpler to change #8800 to use metadata_columns parameters for HiveSplit instead. wdyt ?

thanks @aditi-pandit

for i) Yes, a test is needed and I saw your PR already have it.
for ii) Actually I add filter part support and just remove it temporily with this commit df5e279, but we can added it back if it still need for making both presto and spark work.

image

for iii) I actually before don't know what's the synthesized means and also i don't found anywhere code path use it, so i just added a KMetadata to easily mark it is metadata column, but we can still use synthesized it should be ok, I'd like metadata naming due to I don't understand what synthesized means.

Sure @aditi-pandit , it's OK if you can update your PR to move using metadata_columns parameters and remove any hard code checking for know it is metadata or not. For filter part just reference change

image

if (auto name = subfield.toString();
        metadataColumns.count(name) == 0 && partitionKeys.count(name) == 0) {
      filterSubfields[getColumnName(subfield)].push_back(&subfield);
    }

facebook-github-bot pushed a commit that referenced this pull request Mar 5, 2024
…me' to be queried in SQL (#8800)

Summary:
$file_size and $file_modified_time are queryable synthesized columns for Hive tables in Presto. Spark also has bunch of such queryable synthesized columns (#7880).

The columns are passed by the co-ordinator to the worker in the HiveSplit.

i) Velox HiveSplit needed to be enhanced to get filesize and file_modified_time metadata in a generic map data-structure of (column name, value) from Prestissimo.
ii) These values should be populated by SplitReader into TableScanOperator output buffers.

This also needs a Prestissimo change to populate the HiveSplit with this info sent in the fragment prestodb/presto#21965

Fixes prestodb/presto#21867

gaoyangxiaozhu will have a follow up PR on the Spark integration.

Pull Request resolved: #8800

Reviewed By: mbasmanova

Differential Revision: D54512245

Pulled By: Yuhta

fbshipit-source-id: 190a97f9fcb1e869fff82e0a2264d57f9915376e
@gaoyangxiaozhu
Copy link
Contributor Author

closed with @aditi-pandit 's PR is merged do the same thing.

Joe-Abraham pushed a commit to Joe-Abraham/velox that referenced this pull request Jun 7, 2024
…me' to be queried in SQL (facebookincubator#8800)

Summary:
$file_size and $file_modified_time are queryable synthesized columns for Hive tables in Presto. Spark also has bunch of such queryable synthesized columns (facebookincubator#7880).

The columns are passed by the co-ordinator to the worker in the HiveSplit.

i) Velox HiveSplit needed to be enhanced to get filesize and file_modified_time metadata in a generic map data-structure of (column name, value) from Prestissimo.
ii) These values should be populated by SplitReader into TableScanOperator output buffers.

This also needs a Prestissimo change to populate the HiveSplit with this info sent in the fragment prestodb/presto#21965

Fixes prestodb/presto#21867

gaoyangxiaozhu will have a follow up PR on the Spark integration.

Pull Request resolved: facebookincubator#8800

Reviewed By: mbasmanova

Differential Revision: D54512245

Pulled By: Yuhta

fbshipit-source-id: 190a97f9fcb1e869fff82e0a2264d57f9915376e
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants