Add file metadata columns support for spark parquet #7880

gaoyangxiaozhu · 2023-12-05T12:50:41Z

Spark support people query file metdata as file_size, file_name, file_path, file_modified_time, file_block_start etc for Hive tables as seperated file metadata column. Checking this for all file metadata supported query by spark https://github.com/apache/spark/blob/081c7a7947a47bf0b2bfd478abdd4b78a1db3ddb/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala#L183C2-L193C56.

The PR extends the HiveSplit with a new parameter metadaColumns to let upsteram computing engine as spark to pass the initialized const file metadata columns (if have) to velox connector split when constructed, it use to fix file metadata columns null issue when intergrating with Velox using Spark.

It let HiveSplit can get file metadata info from spark
The values is populated by by SplitReader into TableScanOperator output.

Checking this issue #8173 for details context.

It is also a dependency of Gluten repository PR apache/incubator-gluten#3870

netlify · 2023-12-05T12:50:47Z

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`6d8b7d8`
🔍 Latest deploy log	https://app.netlify.com/sites/meta-velox/deploys/65be35fb7ccc460008a45ce9

velox/connectors/hive/HiveDataSource.cpp

…umns

gaoyangxiaozhu · 2024-02-21T05:48:34Z

@majetideepak also help for review

and I saw looks @aditi-pandit also submit a similar PR #8800 for addressing presdo engine $file_path and $file_modify_time query issue. We can see if can make the PR cover both.

aditi-pandit · 2024-02-22T05:35:55Z

@gaoyangxiaozhu : Have couple of high-level comments. There are several missing pieces :

i) A TableScan output operator test like https://github.com/facebookincubator/velox/pull/8800/files#r1498654870. This would introduce the wiring for metadata columns in HiveConnector TestBase classes.
ii) The metadata columns can be used as filters in Presto queries as well. I would imagine Spark would allow this as well. The code should be enhanced for this.
iii) What is the difference between synthesized and metadata columns ? Its not clear to me.

It might be simpler to change #8800 to use metadata_columns parameters for HiveSplit instead. wdyt ?

gaoyangxiaozhu · 2024-02-22T06:28:22Z

@gaoyangxiaozhu : Have couple of high-level comments. There are several missing pieces :

i) A TableScan output operator test like https://github.com/facebookincubator/velox/pull/8800/files#r1498654870. This would introduce the wiring for metadata columns in HiveConnector TestBase classes. ii) The metadata columns can be used as filters in Presto queries as well. I would imagine Spark would allow this as well. The code should be enhanced for this. iii) What is the difference between synthesized and metadata columns ? Its not clear to me.

It might be simpler to change #8800 to use metadata_columns parameters for HiveSplit instead. wdyt ?

thanks @aditi-pandit

for i) Yes, a test is needed and I saw your PR already have it.
for ii) Actually I add filter part support and just remove it temporily with this commit df5e279, but we can added it back if it still need for making both presto and spark work.

for iii) I actually before don't know what's the synthesized means and also i don't found anywhere code path use it, so i just added a KMetadata to easily mark it is metadata column, but we can still use synthesized it should be ok, I'd like metadata naming due to I don't understand what synthesized means.

Sure @aditi-pandit , it's OK if you can update your PR to move using metadata_columns parameters and remove any hard code checking for know it is metadata or not. For filter part just reference change

if (auto name = subfield.toString();
        metadataColumns.count(name) == 0 && partitionKeys.count(name) == 0) {
      filterSubfields[getColumnName(subfield)].push_back(&subfield);
    }

…me' to be queried in SQL (#8800) Summary: $file_size and $file_modified_time are queryable synthesized columns for Hive tables in Presto. Spark also has bunch of such queryable synthesized columns (#7880). The columns are passed by the co-ordinator to the worker in the HiveSplit. i) Velox HiveSplit needed to be enhanced to get filesize and file_modified_time metadata in a generic map data-structure of (column name, value) from Prestissimo. ii) These values should be populated by SplitReader into TableScanOperator output buffers. This also needs a Prestissimo change to populate the HiveSplit with this info sent in the fragment prestodb/presto#21965 Fixes prestodb/presto#21867 gaoyangxiaozhu will have a follow up PR on the Spark integration. Pull Request resolved: #8800 Reviewed By: mbasmanova Differential Revision: D54512245 Pulled By: Yuhta fbshipit-source-id: 190a97f9fcb1e869fff82e0a2264d57f9915376e

gaoyangxiaozhu · 2024-03-06T10:06:06Z

closed with @aditi-pandit 's PR is merged do the same thing.

…me' to be queried in SQL (facebookincubator#8800) Summary: $file_size and $file_modified_time are queryable synthesized columns for Hive tables in Presto. Spark also has bunch of such queryable synthesized columns (facebookincubator#7880). The columns are passed by the co-ordinator to the worker in the HiveSplit. i) Velox HiveSplit needed to be enhanced to get filesize and file_modified_time metadata in a generic map data-structure of (column name, value) from Prestissimo. ii) These values should be populated by SplitReader into TableScanOperator output buffers. This also needs a Prestissimo change to populate the HiveSplit with this info sent in the fragment prestodb/presto#21965 Fixes prestodb/presto#21867 gaoyangxiaozhu will have a follow up PR on the Spark integration. Pull Request resolved: facebookincubator#8800 Reviewed By: mbasmanova Differential Revision: D54512245 Pulled By: Yuhta fbshipit-source-id: 190a97f9fcb1e869fff82e0a2264d57f9915376e

add metadatacolumns support for spark hive connector

6761689

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 5, 2023

code format fix

93cce0d

gaoyangxiaozhu mentioned this pull request Dec 6, 2023

[VL] parquet file metadata columns support in velox apache/incubator-gluten#3870

Merged

gaoyangxiaozhu changed the title ~~add metadatacolumns support for spark hive connector~~ Add metadata columns support for spark parquet Dec 25, 2023

gaoyangxiaozhu changed the title ~~Add metadata columns support for spark parquet~~ Add file metadata columns support for spark parquet Dec 25, 2023

yingsu00 requested review from yingsu00 and Yuhta January 18, 2024 15:27

Yuhta reviewed Jan 18, 2024

View reviewed changes

velox/connectors/hive/HiveDataSource.cpp Outdated Show resolved Hide resolved

gayangya and others added 5 commits January 30, 2024 21:10

address comment

df5e279

Merge branch 'main' into gayangya/spark_metadatacolumns

2ec3a5c

fix build

e461357

format fix

84ae961

Merge branch 'facebookincubator:main' into gayangya/spark_metadatacol…

27915b2

…umns

gaoyangxiaozhu requested a review from Yuhta January 31, 2024 01:05

gaoyangxiaozhu and others added 3 commits January 31, 2024 10:19

Merge branch 'facebookincubator:main' into gayangya/spark_metadatacol…

f078ee2

…umns

Merge branch 'facebookincubator:main' into gayangya/spark_metadatacol…

141e79d

…umns

Merge branch 'main' into gayangya/spark_metadatacolumns

6d8b7d8

gaoyangxiaozhu mentioned this pull request Feb 21, 2024

Allow HiveSplit info columns like '$file_size' and '$file_modified_time' to be queried in SQL #8800

Closed

gaoyangxiaozhu closed this Mar 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add file metadata columns support for spark parquet #7880

Add file metadata columns support for spark parquet #7880

gaoyangxiaozhu commented Dec 5, 2023 •

edited

Loading

netlify bot commented Dec 5, 2023 •

edited

Loading

gaoyangxiaozhu commented Feb 21, 2024 •

edited

Loading

aditi-pandit commented Feb 22, 2024

gaoyangxiaozhu commented Feb 22, 2024 •

edited

Loading

gaoyangxiaozhu commented Mar 6, 2024

Add file metadata columns support for spark parquet #7880

Add file metadata columns support for spark parquet #7880

Conversation

gaoyangxiaozhu commented Dec 5, 2023 • edited Loading

netlify bot commented Dec 5, 2023 • edited Loading

✅ Deploy Preview for meta-velox canceled.

gaoyangxiaozhu commented Feb 21, 2024 • edited Loading

aditi-pandit commented Feb 22, 2024

gaoyangxiaozhu commented Feb 22, 2024 • edited Loading

gaoyangxiaozhu commented Mar 6, 2024

gaoyangxiaozhu commented Dec 5, 2023 •

edited

Loading

netlify bot commented Dec 5, 2023 •

edited

Loading

gaoyangxiaozhu commented Feb 21, 2024 •

edited

Loading

gaoyangxiaozhu commented Feb 22, 2024 •

edited

Loading