Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GLUTEN-6834][CORE] feat: add other join types from the official Substrait #6835

Merged
merged 6 commits into from
Sep 9, 2024

Conversation

EpsilonPrime
Copy link
Contributor

Adds the join types currently defined in Substrait to the Gluten copy. This is one of a vast set of changes aimed at reducing the differences from the official version in hope that one day there would be no differences.

@github-actions github-actions bot added the CORE works for Gluten Core label Aug 14, 2024
Copy link

#6834

Copy link

Run Gluten Clickhouse CI

@github-actions github-actions bot added the VELOX label Aug 14, 2024
Copy link

Run Gluten Clickhouse CI

@PHILO-HE
Copy link
Contributor

@FelixYBW, @baibaichen

@zhztheplayer
Copy link
Member

zhztheplayer commented Aug 15, 2024

I suspect how it actually helps if we make such changes.

For example both Spark and Velox don't have a join type Right Anti at the moment (despite that whether they will add it in future or not). So it may look confusing to have it defined in the interchange protocol.

As commented in #6833 (comment), physical plans would vary more than logical plans in Gluten. We are already having to follow both Spark's and Velox/CH's plan protocols. It could have chance to start messing things up if we strictly follow another one in the middle layer.

@EpsilonPrime
Copy link
Contributor Author

There are also physical relations in Substrait. The approach I am taking at unforking is to slowly chip away at the differences so we can swap the main instance in seamlessly. A swap of everything at once seems unlikely so I'm tackling what I can -- especially if there are minor code changes related to each small change.

@lgbo-ustc
Copy link
Contributor

lgbo-ustc commented Aug 15, 2024

May add a new type for existence join ? existence join is transformed to left semi join at present
existence join is supported both at ch and velox now.

@EpsilonPrime
Copy link
Contributor Author

May add a new type for existence join ? existence join is transformed to left semi join at present existence join is supported both at ch and velox now.

We are in the process of adding new join types to the specification at the moment so adding another is definitely a possibility. (Currently I have a pr out to add mark join.)

Copy link

Run Gluten Clickhouse CI

@zhztheplayer
Copy link
Member

zhztheplayer commented Aug 17, 2024

@EpsilonPrime Before we decide to take effort on migrating to mainstream Substrait, we may do some study on how much it could help us on next moves.

Thing is if we unfork Substrait, we should use it for supporting more backends in Gluten. Otherwise we get less benefit than cost.

Do you happen to know the progress of Substrait integration of some projects, for example, DuckDB, Arrow and Datafusion, if Gluten decide to add support for these libraries, are their Substrait consumer implementations reliable enough for us to use?

@FelixYBW
Copy link
Contributor

As commented in #6833 (comment), physical plans would vary more than logical plans in Gluten. We are already having to follow both Spark's and Velox/CH's plan protocols. It could have chance to start messing things up if we strictly follow another one in the middle layer.

we should have the same protocol in long term, substrait should be able to fit all requirements from different frameworks, libraries and accelerators. It's substrait's goal. But now for Gluten it's not urgent.

@FelixYBW
Copy link
Contributor

@JkSelf @rui-mo Do you happen to have the list of our modifications in Glute's substrait? I remember we have several pending PRs to upstream substrait but there is no active review, so we paused there.

@EpsilonPrime you may start to review the pending PRs in substrait if we have. Add the missing features Gluten needs to upstream.

@rui-mo
Copy link
Contributor

rui-mo commented Aug 19, 2024

@EpsilonPrime
Copy link
Contributor Author

@JkSelf @rui-mo Do you happen to have the list of our modifications in Glute's substrait? I remember we have several pending PRs to upstream substrait but there is no active review, so we paused there.

@EpsilonPrime you may start to review the pending PRs in substrait if we have. Add the missing features Gluten needs to upstream.

All of the pending changes were merged last year (about the time I became a reviewer). I have been also merging changes upstream (for instance the equivalent of TextReadOptions was added last week).

@EpsilonPrime
Copy link
Contributor Author

@EpsilonPrime Before we decide to take effort on migrating to mainstream Substrait, we may do some study on how much it could help us on next moves.

Thing is if we unfork Substrait, we should use it for supporting more backends in Gluten. Otherwise we get less benefit than cost.

Do you happen to know the progress of Substrait integration of some projects, for example, DuckDB, Arrow and Datafusion, if Gluten decide to add support for these libraries, are their Substrait consumer implementations reliable enough for us to use?

DuckDB has the best Substrait support of the three. Datafusion has a few issues which I'm hoping are addressed in their next release. Acero is in maintenance mode but has a working implementation but it's very strict about what it accepts.

Other benefits include tools which run on Substrait (like the validator and text plan format) which aren't really being used by Gluten at the moment.

The Spark proposal to move the Gluten communication logic (which may or may not have included Substrait) there was the reason I started looking into this effort.

Copy link

Run Gluten Clickhouse CI

2 similar comments
Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

Copy link

github-actions bot commented Sep 6, 2024

Run Gluten Clickhouse CI

@zhztheplayer zhztheplayer changed the title [GLUTEN-6834][core] feat: add other join types from the official Substrait [GLUTEN-6834][CORE] feat: add other join types from the official Substrait Sep 6, 2024
zhztheplayer
zhztheplayer previously approved these changes Sep 6, 2024
@zhztheplayer
Copy link
Member

It seems CH CI failing at

15:08:17  /home/jenkins/agent/workspace/gluten/gluten-ci/gluten/cpp-ch/local-engine/Parser/JoinRelParser.cpp:272:80: error: no member named 'JoinRel_JoinType_JOIN_TYPE_ANTI' in namespace 'substrait'
15:08:17    272 |         if (join_opt_info.is_null_aware_anti_join && join.type() == substrait::JoinRel_JoinType_JOIN_TYPE_ANTI)
15:08:17        |                                                                     ~~~~~~~~~~~^
15:08:17  1 error generated.

Copy link

github-actions bot commented Sep 7, 2024

Run Gluten Clickhouse CI

@zhztheplayer zhztheplayer merged commit a4571dd into apache:main Sep 9, 2024
44 of 45 checks passed
dcoliversun pushed a commit to dcoliversun/gluten that referenced this pull request Sep 11, 2024
sharkdtu pushed a commit to sharkdtu/gluten that referenced this pull request Nov 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLICKHOUSE CORE works for Gluten Core VELOX
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants