[CORE][VL] Support RewriteTransformer Rules and DeltaLake Scan #3646

YannByron · 2023-11-08T03:57:08Z

What changes were proposed in this pull request?

support DeltaLake 2.2 scan
support the column mapping mode
provide RewriteTransformerRules to extend if needed.

(Fixes: #2891)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

github-actions · 2023-11-08T03:57:30Z

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

Other pull requests

github-actions · 2023-11-08T03:57:38Z

Run Gluten Clickhouse CI

felipepessoto · 2023-11-08T17:03:28Z

Moving my comment from old #3376:

I think _metadata is not heavily used in 2.2, or not at all, but may be needed to replace the input_file_name UDF. In recent versions like 2.4 it is used for deletion vectors, as it needs the _metadata_row_index.

I created this repro:

class TestGluten extends QueryTest
  with SharedSparkSession with DeltaSQLCommandTest {
  test("mytest") {
    withTempDir { inputDir =>
      val testPath = inputDir.getCanonicalPath
      spark.range(10)
        .write
        .format("delta")
        .save(testPath)

      spark.sql(s""" ALTER TABLE delta.`$testPath` SET TBLPROPERTIES (
    'delta.minReaderVersion' = '2',
    'delta.minWriterVersion' = '5',
    'delta.columnMapping.mode' = 'name'
  )""")

      spark.sql(s"""SELECT id, _metadata.file_name FROM delta.`$testPath`""").show(false)
    }
  }
}

It fails because we try to replace every column, and _metadata fields are not in the mapping:

[info] - mytest *** FAILED *** (7 seconds, 852 milliseconds)
[info] java.util.NoSuchElementException: key not found: file_name
[info] at scala.collection.MapLike.default(MapLike.scala:236)
[info] at scala.collection.MapLike.default$(MapLike.scala:235)
[info] at scala.collection.AbstractMap.default(Map.scala:65)
[info] at scala.collection.mutable.HashMap.apply(HashMap.scala:69)
[info] at io.glutenproject.extension.RewritePlanIfNeeded.$anonfun$transformColumnMappingPlan$4(ColumnarOverrides.scala:109)
[info] at scala.collection.immutable.List.map(List.scala:297)
[info] at io.glutenproject.extension.RewritePlanIfNeeded.io$glutenproject$extension$RewritePlanIfNeeded$$transformColumnMappingPlan(ColumnarOverrides.scala:107)
[info] at io.glutenproject.extension.RewritePlanIfNeeded$$anonfun$apply$1.applyOrElse(ColumnarOverrides.scala:64)
[info] at io.glutenproject.extension.RewritePlanIfNeeded$$anonfun$apply$1.applyOrElse(ColumnarOverrides.scala:59)

val newAttr = o.withName(columnNameMapping(o.name))

felipepessoto · 2023-11-08T20:10:17Z

gluten-delta/pom.xml

@@ -0,0 +1,155 @@
+<?xml version="1.0" encoding="UTF-8"?>


Do we need to make change to include this project in gluten-<backend_type>-bundle-spark?

No. I hope these gluten-lakeformat modules can be used in both backends.

YannByron · 2023-11-09T04:03:28Z

@felipepessoto I have tested this case, and can work correctly. Please check if there is this patch #2563 in gluten you used. And to support metadata column is tracked by #2618. So let this pr focus on the common cases.

yma11 · 2023-11-09T04:52:27Z

gluten-delta/src/main/scala/io/glutenproject/extension/DeltaRewriteTransformerRules.scala

+    TreeNodeTag[String]("io.glutenproject.delta.column.mapping")
+
+  private def notAppliedColumnMappingRule(plan: SparkPlan): Boolean = {
+    plan.getTagValue(COLUMN_MAPPING_RULE_TAG).isEmpty


What's the logic here? Seems COLUMN_MAPPING_RULE_TAG is not empty at initialization?

Firstly, this COLUMN_MAPPING_RULE_TAG is used to avoid a delta scan applies this rule multiple times.
At initialization, the original transformer can't be tagged, so COLUMN_MAPPING_RULE_TAG is empty.

yma11 · 2023-11-09T04:56:12Z

@YannByron LGTM except need to add documentation for this support, like additional configurations, etc.

YannByron · 2023-11-09T07:13:14Z

@YannByron LGTM except need to add documentation for this support, like additional configurations, etc.

@yma11 thank you for your review. There is no configuration needed. Users just put the additional gluten-delta jar into the class path, then can query delta table in gluten/velox env.

yma11 · 2023-11-09T07:21:46Z

@yma11 thank you for your review. There is no configuration needed. Users just put the additional gluten-delta jar into the class path, then can query delta table in gluten/velox env.

Yeah. Then just doc what you said in location like here. You may can also make a short introduction about what cases supported.

…Scan docs for deltalake

github-actions · 2023-11-09T08:44:16Z

Run Gluten Clickhouse CI

YannByron · 2023-11-09T09:35:19Z

@yma11 Doc is done. PTAL again.

GlutenPerfBot · 2023-11-13T00:45:11Z

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

query	log/native_3646_time.csv	log/native_master_11_12_2023_5b632f64e_time.csv	difference	percentage
q1	33.87	35.06	1.191	103.52%
q2	24.17	24.58	0.412	101.70%
q3	37.63	37.43	-0.200	99.47%
q4	36.33	36.90	0.571	101.57%
q5	70.32	71.39	1.068	101.52%
q6	7.11	7.07	-0.041	99.42%
q7	84.63	82.01	-2.625	96.90%
q8	86.26	86.35	0.097	100.11%
q9	122.48	121.45	-1.036	99.15%
q10	46.12	44.24	-1.876	95.93%
q11	19.75	20.06	0.311	101.57%
q12	26.64	25.25	-1.392	94.77%
q13	45.82	46.66	0.841	101.84%
q14	18.20	14.90	-3.294	81.90%
q15	27.32	27.08	-0.233	99.15%
q16	15.41	14.92	-0.498	96.77%
q17	100.53	99.68	-0.852	99.15%
q18	147.32	147.54	0.211	100.14%
q19	13.09	12.97	-0.115	99.12%
q20	27.35	28.14	0.786	102.87%
q21	219.02	220.95	1.925	100.88%
q22	13.07	13.04	-0.027	99.80%
total	1222.45	1217.68	-4.775	99.61%

felipepessoto reviewed Nov 8, 2023

View reviewed changes

yma11 reviewed Nov 9, 2023

View reviewed changes

zhztheplayer changed the title ~~[Gluten-core][VL] Support RewriteTransformer Rules and DeltaLake Scan~~ [CORE][VL] Support RewriteTransformer Rules and DeltaLake Scan Nov 9, 2023

[Gluten-core][VL] Support RewriteTransformer Rules and DeltaLake 2.2 …

bc36f8b

…Scan docs for deltalake

YannByron force-pushed the github/lake-format branch from 80668a3 to bc36f8b Compare November 9, 2023 08:43

yma11 approved these changes Nov 10, 2023

View reviewed changes

yma11 merged commit 19800f9 into apache:main Nov 13, 2023
17 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CORE][VL] Support RewriteTransformer Rules and DeltaLake Scan #3646

[CORE][VL] Support RewriteTransformer Rules and DeltaLake Scan #3646

YannByron commented Nov 8, 2023

github-actions bot commented Nov 8, 2023

github-actions bot commented Nov 8, 2023

felipepessoto commented Nov 8, 2023

felipepessoto Nov 8, 2023 •

edited

Loading

YannByron Nov 9, 2023

YannByron commented Nov 9, 2023

yma11 Nov 9, 2023

YannByron Nov 9, 2023

yma11 commented Nov 9, 2023

YannByron commented Nov 9, 2023

yma11 commented Nov 9, 2023

github-actions bot commented Nov 9, 2023

YannByron commented Nov 9, 2023

GlutenPerfBot commented Nov 13, 2023

[CORE][VL] Support RewriteTransformer Rules and DeltaLake Scan #3646

[CORE][VL] Support RewriteTransformer Rules and DeltaLake Scan #3646

Conversation

YannByron commented Nov 8, 2023

What changes were proposed in this pull request?

How was this patch tested?

github-actions bot commented Nov 8, 2023

github-actions bot commented Nov 8, 2023

felipepessoto commented Nov 8, 2023

felipepessoto Nov 8, 2023 • edited Loading

Choose a reason for hiding this comment

YannByron Nov 9, 2023

Choose a reason for hiding this comment

YannByron commented Nov 9, 2023

yma11 Nov 9, 2023

Choose a reason for hiding this comment

YannByron Nov 9, 2023

Choose a reason for hiding this comment

yma11 commented Nov 9, 2023

YannByron commented Nov 9, 2023

yma11 commented Nov 9, 2023

github-actions bot commented Nov 9, 2023

YannByron commented Nov 9, 2023

GlutenPerfBot commented Nov 13, 2023

felipepessoto Nov 8, 2023 •

edited

Loading