-
Notifications
You must be signed in to change notification settings - Fork 445
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GLUTEN-3378][VL] Feat: Support read iceberg mor table for Velox backend #4779
Conversation
Thanks for opening a pull request! Could you open an issue for this pull request on Github Issues? https://github.com/oap-project/gluten/issues Then could you also rename commit message and pull request title in the following format?
See also: |
Run Gluten Clickhouse CI |
cc @yma11 @YannByron, thanks. |
.setEnableRowGroupMaxminIndex( | ||
GlutenConfig.getConf().enableParquetRowGroupMaxMinIndex()) | ||
.build(); | ||
deleteFileBuilder.setParquet(parquetReadOptions); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the deletion files share same read options as the data file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Iceberg allows the format of delete file and data file to be different, but in most cases they are consistent.
|insert into table iceberg_mor_tb | ||
|values (1, 'a1', 'p1'), (2, 'a2', 'p1'), (3, 'a3', 'p2'); | ||
|""".stripMargin) | ||
// Delete row. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add cases for multi deletions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. At least, should cover UPDATE
operation. MergeInto
is nice to have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yma11 @YannByron While adding test cases, I discovered a bug in the Velox code. I will fix this bug first before updating the current PR. Later on, I will add test cases for MERGE INTO and UPDATE, as well as multiple DELETE operations.
// Set Iceberg split. | ||
std::unordered_map<std::string, std::string> customSplitInfo{{"table_format", "hive-iceberg"}}; | ||
auto deleteFilesFind = icebergSplitInfo->deleteFilesMap.find(paths[idx]); | ||
auto deleteFiles = deleteFilesFind != icebergSplitInfo->deleteFilesMap.end() ? deleteFilesFind->second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possible to pass deleteFiles without this map?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For normal SplitInfo
, file information is stored in lists, it is unable to obtain the mapping relationship between the data file and the delete file. I have actually considered placing the delete file map under protobuf's LocalFiles
. The current processing logic is to extract the map information into the IcebergReadOptions
of each FileOrFiles
in Java, and then combine them back into a map in C++. It is actually somewhat redundant, but this approach has the least impact on the current protobuf changes.
We can also add a new definition for table_format
in LocalFiles
, with the default being hive
. It can also be hive-iceberg
. When the format is hive-iceberg
, the delete file map contained within will be read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In current code, we put all the fields like paths
, starts
, lengths
, etc of all files for each task together, with each as a list. I think we can refactor it by using a single list of file
or split
which contains path
, start
, length
, a list of deleteFiles
so that we won't need this map. But how will it affect the protobuf part? any idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yma11 This is also a feasible solution, but not all data files have a delete file, and some data files may have multiple delete files. We would need a two-dimensional vector to maintain this relationship, and the two-dimensional vector would need to have empty vector for data files that do not have a corresponding delete file. Do you think this is a better solution? I'm OK to change to using this approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. scanInfo
will have a list of SplitInfo
and each SplitInfo
contains its own path
, start
, length
as well as a list of deleteFiles
which may be empty. I think it will be clear in conception and without using idx
anywhere. Spark-delta
has a similar structure called AddFile
which organizes like this. It's okay to merge this PR first and we do it as a follow up.
::substrait::ReadRel_LocalFiles_FileOrFiles::IcebergReadOptions::DeleteFile::FileFormatCase; | ||
auto icebergSplitInfo = std::dynamic_pointer_cast<IcebergSplitInfo>(splitInfo) | ||
? std::dynamic_pointer_cast<IcebergSplitInfo>(splitInfo) | ||
: std::make_shared<IcebergSplitInfo>(*splitInfo); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When will it happen that it's not a IcebergSplitInfo
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since substrait::ReadRel_LocalFiles_FileOrFiles
contains multiple files, each file will enter this function during iteration. The first time it enters as SplitInfo
, and subsequently, it is replaced by IcebergSplitInfo
.
message DeleteFile { | ||
FileContent fileContent = 1; | ||
string filePath = 2; | ||
uint64 fileSize = 5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe a stupid question: why it skips 3 and 4?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed some redundant fields and forgot to update the sequence numbers; it shouldn't skip 3 and 4.
4a48906
to
20c5b98
Compare
Run Gluten Clickhouse CI |
20c5b98
to
032bf3e
Compare
Run Gluten Clickhouse CI |
@liujiayi771 Seems code has scala style violations. Please update. |
Run Gluten Clickhouse CI |
@yma11 I have modified the map in |
@liujiayi771 There's a small conflict, could you please help to do a rebase? thanks, |
d2ebfb7
to
73c8eb5
Compare
Run Gluten Clickhouse CI |
Run Gluten Clickhouse CI |
Done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
===== Performance report for TPCH SF2000 with Velox backend, for reference only ====
|
…end (apache#4779) Velox add iceberg mor table read support in facebookincubator/velox#7847. This PR supports read iceberg mor table for Velox backend.
…end (apache#4779) Velox add iceberg mor table read support in facebookincubator/velox#7847. This PR supports read iceberg mor table for Velox backend.
…end (apache#4779) Velox add iceberg mor table read support in facebookincubator/velox#7847. This PR supports read iceberg mor table for Velox backend.
What changes were proposed in this pull request?
Velox add iceberg mor table read support in facebookincubator/velox#7847. This PR supports read iceberg mor table for Velox backend.
How was this patch tested?
Add mor table read test case "iceberg read mor table".