[SUPPORT][SPARK][NATIVE] make hudi integrate into gluten/velox #10252
Comments
@YannByron Great to hear from you. @rmahindra123 is actively exploring this as well, but a lot of work is going on to build a new vectorized read path for all queries. cc @jonvex @yihua @linliu-code. Can you check out some of their recent work?
We are actually switching to use HadoopFsRelation for all query types, so it sounds like this will make the integration easier.
Hey @YannByron, great that you brought this up. @jonvex, @linliu-code and I are actively working on improving Spark read and write performance, and one aspect is to return
We need to check whether our read and merging logic is compatible with Velox. cc @jonvex @linliu-code
After we support HadoopFsRelation for all query types, what else is left for the Gluten/Velox integration?
@YannByron Expected. To confirm, CoW snapshot queries should work after we support HadoopFsRelation for all queries, right? We will be happy to work with you on items 1 & 2 if you have time/interest. Let @linliu-code & team know.
@vinothchandar, I'm also glad to work with you guys. Honestly, item 1 (a native reader in Velox for MOR tables) is beyond my ability.
@linliu-code or @rmahindra123 can help here, being the C++ nerds here.
We are seeing great results with the new read path that the team is implementing, so something like this will help us make some decisions. I am also working on NVIDIA RAPIDS along similar lines; the snapshot queries are already accelerated there.
@YannByron Pinging on this again. Is there a WIP integration for CoW that we could build as a quick prototype? How hard is that?
Currently, the integration between Spark and Gluten/Velox achieves good performance on Parquet and lake formats. @vinothchandar also mentioned this in #8679, so I think Hudi should take part.
Here is a design I proposed in Gluten earlier, along with some discussion: apache/incubator-gluten#3378
Now, all the scan types that Gluten supports are file based, like `BatchScan` or `FileSourceScanExec`. The datasource provides the list of files during planning, then Gluten passes them to the native library, and the native reader (parquet/orc/...) loads them.

For a Hudi CoW table without `hoodie.schema.on.read.enable`, Hudi can return `HadoopFsRelation` (that's file based) when `createRelation` is called. So maybe we can make this integration easily if the native reader can load the Hudi files correctly.

But other Hudi tables return `HoodieBaseRelation` (with `BaseRelation`, `FileRelation`, `PrunedFilteredScan`), which is transformed to `RowDataSourceScanExec`, and that is not supported in Gluten. To solve this, maybe there are two ways: `BatchScan`, which can use a Hudi-defined reader to load data. In addition, a native C++ Hudi reader is required in Velox. With these two, Hudi MOR tables can be queried in a native environment.

Gluten: https://github.com/oap-project/gluten
Velox: https://github.com/facebookincubator/velox
@vinothchandar @xushiyan
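To make the distinction above concrete, here is a minimal sketch in plain Python of the planning path described in this issue. It is a simplified model, not Hudi, Spark, or Gluten code: the `HudiTable` class and the three helper functions are hypothetical names introduced for illustration, and the mapping only encodes what the issue text states (CoW without `hoodie.schema.on.read.enable` yields a file-based relation; everything else yields `HoodieBaseRelation` and ends up as `RowDataSourceScanExec`).

```python
from dataclasses import dataclass

# Scan node names as given in the issue; Gluten is described there as
# supporting only these file-based scan types.
FILE_BASED_SCANS = {"BatchScan", "FileSourceScanExec"}


@dataclass(frozen=True)
class HudiTable:
    """Hypothetical stand-in for the table properties that matter here."""
    table_type: str        # "cow" or "mor"
    schema_on_read: bool   # value of hoodie.schema.on.read.enable


def relation_for(table: HudiTable) -> str:
    """Relation type returned by createRelation, per the issue's description."""
    if table.table_type == "cow" and not table.schema_on_read:
        return "HadoopFsRelation"
    return "HoodieBaseRelation"


def scan_for(relation: str) -> str:
    """Physical scan node that Spark plans for each relation type."""
    return {
        "HadoopFsRelation": "FileSourceScanExec",
        "HoodieBaseRelation": "RowDataSourceScanExec",
    }[relation]


def gluten_can_offload(table: HudiTable) -> bool:
    """True when the resulting scan is one of the file-based types."""
    return scan_for(relation_for(table)) in FILE_BASED_SCANS


if __name__ == "__main__":
    for t in (HudiTable("cow", False), HudiTable("cow", True), HudiTable("mor", False)):
        print(t, relation_for(t), scan_for(relation_for(t)), gluten_can_offload(t))
```

Under this model only the first case (CoW, schema-on-read disabled) reaches a scan node that Gluten can hand to the native reader, which is why the other table types need the additional work discussed in the issue.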