Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CORE][VL] Support RewriteTransformer Rules and DeltaLake Scan #3646

Merged
merged 1 commit into from
Nov 13, 2023

Conversation

YannByron
Copy link
Contributor

What changes were proposed in this pull request?

  1. support DeltaLake 2.2 scan
  2. support the column mapping mode
  3. provide RewriteTransformerRules to extend if needed.

(Fixes: #2891)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Copy link

github-actions bot commented Nov 8, 2023

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

Copy link

github-actions bot commented Nov 8, 2023

Run Gluten Clickhouse CI

@felipepessoto
Copy link
Contributor

Moving my comment from old #3376:

I think _metadata is not heavily used in 2.2, or not at all, but may be needed to replace the input_file_name UDF. In recent versions like 2.4 it is used for deletion vectors, as it needs the _metadata_row_index.

I created this repro:

class TestGluten extends QueryTest
  with SharedSparkSession with DeltaSQLCommandTest {
  test("mytest") {
    withTempDir { inputDir =>
      val testPath = inputDir.getCanonicalPath
      spark.range(10)
        .write
        .format("delta")
        .save(testPath)

      spark.sql(s""" ALTER TABLE delta.`$testPath` SET TBLPROPERTIES (
    'delta.minReaderVersion' = '2',
    'delta.minWriterVersion' = '5',
    'delta.columnMapping.mode' = 'name'
  )""")

      spark.sql(s"""SELECT id, _metadata.file_name FROM delta.`$testPath`""").show(false)
    }
  }
}

It fails because we try to replace every column, and _metadata fields are not in the mapping:

[info] - mytest *** FAILED *** (7 seconds, 852 milliseconds)
[info] java.util.NoSuchElementException: key not found: file_name
[info] at scala.collection.MapLike.default(MapLike.scala:236)
[info] at scala.collection.MapLike.default$(MapLike.scala:235)
[info] at scala.collection.AbstractMap.default(Map.scala:65)
[info] at scala.collection.mutable.HashMap.apply(HashMap.scala:69)
[info] at io.glutenproject.extension.RewritePlanIfNeeded.$anonfun$transformColumnMappingPlan$4(ColumnarOverrides.scala:109)
[info] at scala.collection.immutable.List.map(List.scala:297)
[info] at io.glutenproject.extension.RewritePlanIfNeeded.io$glutenproject$extension$RewritePlanIfNeeded$$transformColumnMappingPlan(ColumnarOverrides.scala:107)
[info] at io.glutenproject.extension.RewritePlanIfNeeded$$anonfun$apply$1.applyOrElse(ColumnarOverrides.scala:64)
[info] at io.glutenproject.extension.RewritePlanIfNeeded$$anonfun$apply$1.applyOrElse(ColumnarOverrides.scala:59)

val newAttr = o.withName(columnNameMapping(o.name))

@@ -0,0 +1,155 @@
<?xml version="1.0" encoding="UTF-8"?>
Copy link
Contributor

@felipepessoto felipepessoto Nov 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to make change to include this project in gluten-<backend_type>-bundle-spark?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. I hope these gluten-lakeformat modules can be used in both backends.

@YannByron
Copy link
Contributor Author

@felipepessoto I have tested this case, and can work correctly. Please check if there is this patch #2563 in gluten you used. And to support metadata column is tracked by #2618. So let this pr focus on the common cases.

TreeNodeTag[String]("io.glutenproject.delta.column.mapping")

private def notAppliedColumnMappingRule(plan: SparkPlan): Boolean = {
plan.getTagValue(COLUMN_MAPPING_RULE_TAG).isEmpty
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the logic here? Seems COLUMN_MAPPING_RULE_TAG is not empty at initialization?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Firstly, this COLUMN_MAPPING_RULE_TAG is used to avoid a delta scan applies this rule multiple times.
At initialization, the original transformer can't be tagged, so COLUMN_MAPPING_RULE_TAG is empty.

@yma11
Copy link
Contributor

yma11 commented Nov 9, 2023

@YannByron LGTM except need to add documentation for this support, like additional configurations, etc.

@zhztheplayer zhztheplayer changed the title [Gluten-core][VL] Support RewriteTransformer Rules and DeltaLake Scan [CORE][VL] Support RewriteTransformer Rules and DeltaLake Scan Nov 9, 2023
@YannByron
Copy link
Contributor Author

@YannByron LGTM except need to add documentation for this support, like additional configurations, etc.

@yma11 thank you for your review. There is no configuration needed. Users just put the additional gluten-delta jar into the class path, then can query delta table in gluten/velox env.

@yma11
Copy link
Contributor

yma11 commented Nov 9, 2023

@yma11 thank you for your review. There is no configuration needed. Users just put the additional gluten-delta jar into the class path, then can query delta table in gluten/velox env.

Yeah. Then just doc what you said in location like here. You may can also make a short introduction about what cases supported.

Copy link

github-actions bot commented Nov 9, 2023

Run Gluten Clickhouse CI

@YannByron
Copy link
Contributor Author

@yma11 Doc is done. PTAL again.

@yma11 yma11 merged commit 19800f9 into apache:main Nov 13, 2023
17 checks passed
@GlutenPerfBot
Copy link
Contributor

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

query log/native_3646_time.csv log/native_master_11_12_2023_5b632f64e_time.csv difference percentage
q1 33.87 35.06 1.191 103.52%
q2 24.17 24.58 0.412 101.70%
q3 37.63 37.43 -0.200 99.47%
q4 36.33 36.90 0.571 101.57%
q5 70.32 71.39 1.068 101.52%
q6 7.11 7.07 -0.041 99.42%
q7 84.63 82.01 -2.625 96.90%
q8 86.26 86.35 0.097 100.11%
q9 122.48 121.45 -1.036 99.15%
q10 46.12 44.24 -1.876 95.93%
q11 19.75 20.06 0.311 101.57%
q12 26.64 25.25 -1.392 94.77%
q13 45.82 46.66 0.841 101.84%
q14 18.20 14.90 -3.294 81.90%
q15 27.32 27.08 -0.233 99.15%
q16 15.41 14.92 -0.498 96.77%
q17 100.53 99.68 -0.852 99.15%
q18 147.32 147.54 0.211 100.14%
q19 13.09 12.97 -0.115 99.12%
q20 27.35 28.14 0.786 102.87%
q21 219.02 220.95 1.925 100.88%
q22 13.07 13.04 -0.027 99.80%
total 1222.45 1217.68 -4.775 99.61%

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Velox doesn't work with Spark Delta Lake
4 participants