Introducing IcebergSplitReader #7362

Closed
yingsu00 wants to merge 6 commits (from the iceberg7 branch)

Conversation

yingsu00 (Collaborator) commented Nov 1, 2023:

In this PR we introduce IcebergSplitReader, which supports reading Iceberg splits with positional delete files.

This PR replaces #5897.

TODO:

  1. Delete file deserialization
  2. Add filter on base file path when reading the delete file

yingsu00 force-pushed the iceberg7 branch 2 times, most recently from 801e2ce to e8b35ab on November 2, 2023.
_length,
_partitionKeys,
_tableBucketNumber) {
// TODO: Deserialize _extraFileInfo to get deleteFiles;
Contributor commented:

Will the relevant information of the deleteFile be obtained through string deserialization? What are the benefits of doing it this way?

yingsu00 (Collaborator, Author) replied:

I think the best way is to deserialize the incoming split to an IcebergSplit with delete files directly, so we don't have to do another deserialization from _extraFileInfo. But this interface is what was decided by Meta.
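
For illustration, a minimal sketch of what such a split could look like; the type and member names here are hypothetical, not this PR's API:

    // Hypothetical: a split subtype carrying delete file metadata directly,
    // so the reader would not need to re-deserialize _extraFileInfo.
    struct HiveIcebergSplit : public HiveConnectorSplit {
      std::vector<IcebergDeleteFile> deleteFiles;
    };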

Contributor replied:

I see. Thanks.

Recently the Arrow lib was partially moved into the Parquet writer. Some of
the Parquet reader files still depend on some Arrow classes, which will be
removed in the near future. The namespaces are now moved from arrow:: to
::arrow because the reader still depends on the third_party arrow_ep
installation; this commit fixes them to avoid build failures.
RowContainerSortBenchmark depends on the Parquet reader by default, but
Parquet support is controlled by VELOX_ENABLE_PARQUET, which defaults to
OFF. This caused build failures because the benchmark tries to include some
Arrow headers that are not installed when VELOX_ENABLE_PARQUET is off.
This commit guards the benchmark by checking VELOX_ENABLE_PARQUET first.
S3InsertTest.cpp used to include ParquetReader.h, which caused build
failures if VELOX_ENABLE_PARQUET and VELOX_ENABLE_ARROW were not turned
on. However, the include of ParquetReader.h is not needed. This commit
cleans up the includes in S3InsertTest.cpp.
The upcoming IcebergSplitReader will need to use fileHandleFactory_,
executor_, connectorQueryCtx_, etc., because it needs to create another
HiveDataSource to read the delete files. This commit copies these required
fields to SplitReader. Moreover, since SplitReader already owns
baseReader_, the creation and configuration of ReaderOptions was also
moved to SplitReader, into a single function configureReaderOptions().
Previously the configuration of ReaderOptions was scattered across multiple
locations in HiveDataSource.
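
A structural sketch of that consolidation; the member names and the options set inside are placeholders for what HiveDataSource used to configure, not the exact Velox API surface:

    // Sketch: build the base ReaderOptions in one place instead of the
    // scattered setup previously spread across HiveDataSource.
    void SplitReader::configureReaderOptions() {
      baseReaderOpts_.setFileFormat(hiveSplit_->fileFormat);
      // ...apply the remaining session- and config-driven reader options
      // that HiveDataSource used to set at its various call sites...
    }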
yingsu00 marked this pull request as ready for review on November 13, 2023.
yingsu00 requested a review from Yuhta on November 13, 2023.
In this commit we introduce IcebergSplitReader, which supports reading
Iceberg splits with positional delete files.
ConnectorQueryCtx* const connectorQueryCtx_;
uint64_t splitOffset_;

std::unique_ptr<HiveDataSource> deleteDataSource_;
@Yuhta (Contributor) commented:

@yingsu00 Please use a dwio::common::Reader instead of the whole data source here. There is a lot of stuff you don't need in a delta file reader: column manipulation, remaining filter, split prefetch, etc.

yingsu00 (Collaborator, Author) replied:

@Yuhta Thanks for reviewing. Yes, I can do that. I'll move the util functions like makeScanSpec, getColumnName, etc. to a new HiveReaderUtil class, so that they can be shared by the HiveDataSource and the DeleteFileReader.

@Yuhta (Contributor) commented Nov 20, 2023:

That sounds good. Curious why you need makeScanSpec, though? You just need a scan spec to read all rows and all columns, no?

@yingsu00 (Collaborator, Author) commented Nov 25, 2023:

@Yuhta For positional deletes, we will need to build subfield filters on the path column, since an Iceberg positional delete file can contain deleted rows for multiple base files. We will also add a filter on the delete row position column: it is sorted and the lower/upper bounds are known, which improves performance. For equality deletes, sometimes not every column needs to be read.

@Yuhta (Contributor) commented Dec 1, 2023:

You don't need makeScanSpec for these; just construct the ScanSpec yourself (see the code example below), it is very small and controllable. I did that for metalake. makeScanSpec is more for translating the SubfieldFilters and prunings from the Presto representation.

    // Read all columns, but filter the "path" column down to one base file
    // and drop it from the output.
    auto spec = std::make_shared<common::ScanSpec>("<root>");
    spec->addAllChildFields(*rowType);
    auto* pathSpec = spec->childByName("path");
    pathSpec->setFilter(std::make_unique<common::BytesValues>(
        std::vector<std::string>{path}, false));
    pathSpec->setProjectOut(false);
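
To cover the row-position bounds mentioned in the comment above, the same spec could be extended with a range filter; a minimal sketch, assuming the position column is named "pos" (per the Iceberg spec) and that lowerBound/upperBound are already known:

    // Sketch: bound the sorted row-position column. The "pos" name and the
    // bound variables are assumptions from the discussion, not PR code.
    auto* posSpec = spec->childByName("pos");
    posSpec->setFilter(std::make_unique<common::BigintRange>(
        lowerBound, upperBound, /*nullAllowed=*/false));
    // "pos" stays projected: its values are what the delete reader consumes.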

uint64_t recordCount;
uint64_t fileSizeInBytes;
// The field ids for the delete columns for equality delete files
std::vector<int32_t> equalityFieldIds;
@yma11 (Contributor) commented:

Why mix up the contents of delete files with different types? Can we provide different DeleteFileReaders that return the corresponding data structure for deletes, like a bitmap for positional deletes or maybe a hash table for equality deletes? And maybe this could be made common for other data sources like Delta Lake?

@majetideepak (Collaborator) commented Nov 27, 2023:

I like this proposal. Looking at the code, I don't see why we need a separate IcebergSplitReader, etc. We need to add a DeleteFileReader to the SplitReader and implement an IcebergDeleteFileReader. The DeleteFileReader would return positional or equality deletes in a format independent of the table format.

Contributor commented:

I think this is a good idea, so only the delta file readers are implementation specific and the SplitReader logic is shared.
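
A hedged sketch of the abstraction proposed in this thread; all names here are hypothetical, not code from this PR:

    // Hypothetical: a table-format-agnostic interface. Each format
    // (Iceberg, Delta Lake, ...) supplies its own implementation.
    class DeleteFileReader {
     public:
      virtual ~DeleteFileReader() = default;

      // Appends the positions of rows deleted from the current base file,
      // i.e. the format-independent result for positional deletes.
      virtual void readDeletedRowPositions(std::vector<int64_t>& positions) = 0;
    };

    class IcebergDeleteFileReader : public DeleteFileReader { /*...*/ };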

yingsu00 (Collaborator, Author) replied:

> Why mix up the contents of delete files with different types?

@yma11 It's not mixing different types. This IcebergDeleteFile is just what Presto passes as the metadata for a delete file, not the file or the reader itself. It contains information for both types and is passed to Velox through the split. There will be separate delete file reader classes: the current DeleteFileReader is for positional deletes, and there will be an EqualityDeleteFileReader class that reads equality delete files. I named it DeleteFileReader because I want to subclass it into PositionalDeleteFileReader and EqualityDeleteFileReader in the next PR, in the hope of reusing the logic for creating the Reader and RowReaders. Or we can extract the common logic and make them two separate classes.
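
For reference, a sketch of that metadata struct pieced together from the fields visible in this PR's diff; the exact member set is illustrative:

    // Illustrative only: delete file metadata carried by the split, covering
    // both positional and equality delete files.
    struct IcebergDeleteFile {
      FileContent content; // positional or equality deletes
      std::string filePath;
      uint64_t recordCount;
      uint64_t fileSizeInBytes;
      // The field ids of the delete columns, for equality delete files.
      std::vector<int32_t> equalityFieldIds;
    };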

namespace facebook::velox::connector::hive::iceberg {

enum class FileContent {
kData,
Contributor commented:

Will there be any place where FileContent is used for a normal data file? I think we may not need it if it's only for delete file type identification.

yingsu00 (Collaborator, Author) replied:

Not yet. I kept it there for future use. I can remove it for now.

yingsu00 (Collaborator, Author) commented:

@Yuhta #7847 is the new PR that uses RowReader instead of HiveDataSource. Can you please review that one? I kept both open just for your reference. Thanks!

Yuhta closed this on Feb 29, 2024.