Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SUPPORT][SPARK][NATIVE] make hudi integrate into gluten/velox #10252

Open
YannByron opened this issue Dec 6, 2023 · 10 comments
Open

[SUPPORT][SPARK][NATIVE] make hudi integrate into gluten/velox #10252

YannByron opened this issue Dec 6, 2023 · 10 comments
Assignees
Labels
feature-enquiry issue contains feature enquiries/requests or great improvement ideas performance spark-sql

Comments

@YannByron
Copy link
Contributor

YannByron commented Dec 6, 2023

Currently, The integration between spark and gluten/velox has made a good performance on parquet or lake format. And @vinothchandar also mentioned this in #8679. So I think Hudi should take part in.

Here is a design I proposed in gluten before and some discussion: apache/incubator-gluten#3378

Now, all the scan types that gluten has supported are file based, like BatchScan or FileSourceScanExec. Datasource provides the list of files during planning, then gluten pass them to the native library and the native reader (parquet/orc/...) loads them.

For hudi cow table without hoodie.schema.on.read.enable, it can return HadoopFSRelation (that's file based) when call createRelation. So maybe we can make this integration easily if the native reader can load the hudi files correctly.

But for other hudi tables, they return HoodieBaseRelation (with BaseRelation, FileRelation, PrunedFilteredScan) that will be transformed to RowDataSourceScanExec that's not supported in gluten. To solve this, maybe there are two ways:

  1. to make gluten support it. IMO, it's not easy, and not a high-priority thing in gluten.
  2. to make hudi be file-based scan. But mor table needs to merge data, and hudi use Spark DatasourceV1 interface that doesn't have the ability to merge data, I guess we have to migrate to DSV2 to use BatchScan which can use hudi-defined reader to load data. As well as, a native C++ Hudi Reader is required in velox. With these two, hudi mor tables can be queried in native env.

Gluten: https://github.com/oap-project/gluten
Velox: https://github.com/facebookincubator/velox

@vinothchandar @xushiyan

@danny0405 danny0405 added feature-enquiry issue contains feature enquiries/requests or great improvement ideas spark-sql performance labels Dec 7, 2023
@github-project-automation github-project-automation bot moved this to ⏳ Awaiting Triage in Hudi Issue Support Dec 7, 2023
@vinothchandar vinothchandar self-assigned this Dec 7, 2023
@vinothchandar
Copy link
Member

@YannByron Great to hear from you. @rmahindra123 is actively exploring this as well.

but a lot of work is going on to build a new vectorized read path for all queries. cc @jonvex @yihua @linliu-code . Can you check out some of their recent work?

@jonvex
Copy link
Contributor

jonvex commented Dec 7, 2023

We are actually switching to use HadoopFSRelation for all query types. So it sounds like this will make the integration easier

@yihua
Copy link
Contributor

yihua commented Dec 7, 2023

Hey @YannByron great that you brought this up.

@jonvex @linliu-code and I are actively working on improving Spark read and write performance and one aspect is to return HadoopFSRelation for all query types (including MOR snapshot queries with log merging). As of now, on the latest master, for snapshot, RO, and CDC queries on both COW and MOR tables in Spark the DefaultSource return HadoopFSRelation already.

@yihua
Copy link
Contributor

yihua commented Dec 7, 2023

We need to check if our read and merging logic is compatible with Velox cc @jonvex @linliu-code

@linliu-code
Copy link
Contributor

After we support HadoopFsRelation for all queries types, what else has been left for Gluten/Velox integration?

@YannByron
Copy link
Contributor Author

YannByron commented Dec 8, 2023

After we support HadoopFsRelation for all queries types, what else has been left for Gluten/Velox integration?

  1. A native reader, for cases where the existing parquet reader cannot directly load the data correctly, like iceberg v2(positional deletes) reader Support Iceberg positional deletes facebookincubator/velox#5897.
  2. A gluten-hudi module, as a bridge between spark-hudi and velox.

@vinothchandar
Copy link
Member

@YannByron Expected.

To confirm, CoW snapshot queries should work, after we support HadoopFsRelation for all queries right. We will be happy to work with you on 1 & 2 items, if you have time/interest. let @linliu-code & team know

@YannByron
Copy link
Contributor Author

@vinothchandar, i'm also glad to work with you guys.

Honestly, item 1 (a native reader in velox for mor table) is beyond my ability.
I can implement a gluten-hudi module in gluten and verify the Cow snapshot queries can work with them.

@vinothchandar
Copy link
Member

vinothchandar commented Dec 11, 2023

Honestly, item 1 (a native reader in velox for mor table) is beyond my ability.

@linliu-code or @rmahindra123 can help here. being the C++ nerds here.

I can implement a gluten-hudi module in gluten and verify the Cow snapshot queries can work with them.

we are seeing great results with the new read path that the team is implementing. So sth like this will help us make some decisions. I am also working on NVIDIA rapids along similar lines, the snapshot queries are already accelerated there

@vinothchandar
Copy link
Member

@YannByron Pinging on this again. Is there a WIP integration for CoW that we could build as a quick prototype? how hard is that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature-enquiry issue contains feature enquiries/requests or great improvement ideas performance spark-sql
Projects
Status: Awaiting Triage
Development

No branches or pull requests

6 participants