[CORE][VL] Support RewriteTransformer Rules and DeltaLake Scan #3646
Thanks for opening a pull request! Could you open an issue for this pull request on GitHub Issues? https://github.com/oap-project/gluten/issues Then could you also rename the commit message and pull request title in the following format?
Run Gluten Clickhouse CI
Moving my comment from old #3376: I think _metadata is not heavily used in 2.2, or not at all, but may be needed to replace the input_file_name UDF. In recent versions like 2.4 it is used for deletion vectors, as it needs the _metadata_row_index. I created this repro:
It fails because we try to replace every column, and _metadata fields are not in the mapping:
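The failure mode described above can be illustrated with a small standalone sketch. It is not the Gluten implementation and does not reproduce the actual error: `MetadataMappingSketch`, `rewriteStrict`, `rewriteLenient`, and the physical name `col-1` are all hypothetical, and the column mapping is modeled as a plain `Map[String, String]` from logical to physical names. A strict lookup fails as soon as it meets a column such as `_metadata` that has no entry, while a lenient lookup passes unmapped columns through unchanged.

```scala
object MetadataMappingSketch {
  // Strict rewrite: every column must be present in the mapping.
  // Map.apply throws NoSuchElementException for a missing key.
  def rewriteStrict(columns: Seq[String], mapping: Map[String, String]): Seq[String] =
    columns.map(c => mapping(c))

  // Lenient rewrite (one possible approach, an assumption of this sketch):
  // columns without a mapping entry keep their original name.
  def rewriteLenient(columns: Seq[String], mapping: Map[String, String]): Seq[String] =
    columns.map(c => mapping.getOrElse(c, c))

  def main(args: Array[String]): Unit = {
    val mapping = Map("id" -> "col-1")      // hypothetical logical -> physical mapping
    val columns = Seq("id", "_metadata")    // _metadata has no mapping entry

    println(rewriteLenient(columns, mapping))
    println(scala.util.Try(rewriteStrict(columns, mapping)).isFailure)
  }
}
```

Whether passing metadata columns through untouched is the right behavior for Delta is a separate question; the sketch only shows why replacing every column fails when some columns are absent from the mapping.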
@@ -0,0 +1,155 @@
<?xml version="1.0" encoding="UTF-8"?>
Do we need to make a change to include this project in gluten-<backend_type>-bundle-spark?
No. I hope these gluten-lakeformat modules can be used in both backends.
@felipepessoto I have tested this case and it works correctly. Please check whether the Gluten build you used includes patch #2563. Support for metadata columns is tracked by #2618, so let this PR focus on the common cases.
TreeNodeTag[String]("io.glutenproject.delta.column.mapping")

private def notAppliedColumnMappingRule(plan: SparkPlan): Boolean = {
  plan.getTagValue(COLUMN_MAPPING_RULE_TAG).isEmpty
What's the logic here? It seems COLUMN_MAPPING_RULE_TAG is not empty at initialization?
First, COLUMN_MAPPING_RULE_TAG is used to prevent this rule from being applied to a Delta scan multiple times.
At initialization, the original transformer is not tagged yet, so COLUMN_MAPPING_RULE_TAG is empty.
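The tagging idea explained above can be sketched without Spark on the classpath. This is an illustration under stated assumptions, not the code from this PR: `Tag` and `ScanNode` are hypothetical stand-ins for Spark's `TreeNodeTag` and `SparkPlan`, `applyRule` and the tag value `"applied"` are invented for the example, and only `COLUMN_MAPPING_RULE_TAG` and `notAppliedColumnMappingRule` mirror names from the diff. A fresh node carries no tag, so the rule runs once; the rewritten node is tagged, so a second pass leaves it alone.

```scala
import scala.collection.mutable

// Hypothetical stand-in for Spark's TreeNodeTag.
final case class Tag[T](name: String)

// Hypothetical stand-in for a scan node in a SparkPlan.
final class ScanNode(val columns: Seq[String]) {
  private val tags = mutable.Map.empty[Tag[_], Any]
  def setTagValue[T](tag: Tag[T], value: T): Unit = tags(tag) = value
  def getTagValue[T](tag: Tag[T]): Option[T] = tags.get(tag).map(_.asInstanceOf[T])
}

object ColumnMappingTagSketch {
  val COLUMN_MAPPING_RULE_TAG: Tag[String] =
    Tag[String]("io.glutenproject.delta.column.mapping")

  // True while the rule has not yet been applied to this node.
  def notAppliedColumnMappingRule(plan: ScanNode): Boolean =
    plan.getTagValue(COLUMN_MAPPING_RULE_TAG).isEmpty

  // Rewrite logical names to physical names once, then tag the result
  // so that later passes of the same rule skip it.
  def applyRule(plan: ScanNode, mapping: Map[String, String]): ScanNode =
    if (notAppliedColumnMappingRule(plan)) {
      val rewritten = new ScanNode(plan.columns.map(c => mapping.getOrElse(c, c)))
      rewritten.setTagValue(COLUMN_MAPPING_RULE_TAG, "applied")
      rewritten
    } else plan

  def main(args: Array[String]): Unit = {
    // "col-1" also appears as a key, so a second application would corrupt the name.
    val mapping = Map("id" -> "col-1", "col-1" -> "col-9")
    val once = applyRule(new ScanNode(Seq("id")), mapping)
    val twice = applyRule(once, mapping)
    println(once.columns)
    println(twice.columns)
  }
}
```

The mapping in `main` deliberately contains a physical name that is also a key, which is the kind of situation where applying the rewrite twice would produce a wrong column name without the tag check.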
@YannByron LGTM, except that documentation is needed for this support, such as any additional configurations.
@yma11 Thank you for your review. No configuration is needed. Users just put the additional gluten-delta jar on the classpath and can then query Delta tables in a Gluten/Velox environment.
Yeah. Then just document what you said in a location like here. You could also add a short introduction about which cases are supported.
…Scan docs for deltalake
80668a3 to bc36f8b
Run Gluten Clickhouse CI
@yma11 Doc is done. PTAL again.
===== Performance report for TPCH SF2000 with Velox backend, for reference only ====
What changes were proposed in this pull request?
RewriteTransformerRules to extend if needed. (Fixes: #2891)
How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)