[Gluten-core][VL] Supports DeltaLake 2.2 Read #3376

Closed
wants to merge 1 commit into from

YannByron
Contributor

What changes were proposed in this pull request?

  1. Supports Delta scans in Velox.
  2. Supports Column Mapping, which Delta 2.x provides.
  3. Does not support Deletion Vectors, a newer feature introduced in Delta 2.3.

(Fixes: #2891)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

@github-actions

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/oap-project/gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

@github-actions

Run Gluten Clickhouse CI

@YannByron
Contributor Author

YannByron commented Oct 11, 2023

This follows #2902; earlier discussion can be found there.

@YannByron YannByron changed the title [Gluten-core] support Delta2.2 read [Gluten-core][VL] Supports DeltaLake 2.2 Read Oct 11, 2023

@felipepessoto
Contributor

felipepessoto commented Oct 16, 2023

@YannByron do you have any suggestions for how we can test how broken Delta is when running on Gluten/Velox?

I started an experiment, but then realized it doesn't follow the pattern Gluten uses to test Spark. It seems we extend the existing test classes and mix in GlutenSQLTestsTrait; for example: https://github.com/oap-project/gluten/pull/1381/files.
Should we do the same with Delta?

The experiment I started was to actually change Delta code to run the tests using Gluten:

1-) Cherry-picked this commit to Delta 2.2 (the old workaround is no longer needed and was interfering with the extension setting): delta-io/delta@975daae
2-) Added Gluten to the session in DeltaSQLCommandTest.sparkConf:

  override protected def sparkConf: SparkConf = {
    super.sparkConf
      .set(StaticSQLConf.SPARK_SESSION_EXTENSIONS.key,
        classOf[DeltaSparkSessionExtension].getName)
      .set(SQLConf.V2_SESSION_CATALOG_IMPLEMENTATION.key,
        classOf[DeltaCatalog].getName)
      .set("spark.plugins", "io.glutenproject.GlutenPlugin")
      .set("spark.gluten.sql.columnar.backend.lib", "velox")
      .set("spark.memory.offHeap.enabled", "true")
      .set("spark.memory.offHeap.size", "10g")
      .set("spark.gluten.sql.columnar.forceshuffledhashjoin", "true")
      .set("spark.shuffle.manager", "org.apache.spark.shuffle.sort.ColumnarShuffleManager")
  }

3-) Added a reference to the Gluten jar in libraryDependencies for delta-core.

But it is not a complete solution: most of the tests are failing or not running. And for the ones that pass, I can't tell whether they pass because they really work with Gluten/Velox or because execution falls back to non-native.
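
One way to separate "really ran natively" from "silently fell back" is to inspect the operator classes in the executed plan. The sketch below is only an illustration: `OffloadCheck` and its methods are hypothetical names, and it assumes that Gluten's native operator classes end in `Transformer` while vanilla Spark operators end in `Exec`. It works on plain class-name strings so that it can run without Spark on the classpath.

```scala
// Hypothetical helper for illustration; not part of Gluten or Delta.
object OffloadCheck {
  // Assumption: Gluten's native operator classes end in "Transformer"
  // (for example FileSourceScanExecTransformer), while vanilla Spark
  // operators end in "Exec".
  def isNative(operatorClass: String): Boolean =
    operatorClass.endsWith("Transformer")

  // True only when the plan has at least one scan and every scan is native.
  def scansOffloaded(operatorClasses: Seq[String]): Boolean = {
    val scans = operatorClasses.filter(_.contains("Scan"))
    scans.nonEmpty && scans.forall(isNative)
  }

  // Operators that stayed on vanilla Spark, useful in a failure message.
  def fallbacks(operatorClasses: Seq[String]): Seq[String] =
    operatorClasses.filterNot(isNative)
}
```

In a real suite the class names could probably be gathered with something like `df.queryExecution.executedPlan.collect { case p => p.getClass.getSimpleName }`; with adaptive execution enabled, the final plan may need to be unwrapped first.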

@felipepessoto
Contributor

Found that _metadata is not working for Delta when column mapping is enabled.

_metadata was recently changed to fall back: #2618

@yma11
Contributor

yma11 commented Oct 30, 2023

@YannByron will you update this PR based on the discussion?

@YannByron
Contributor Author

YannByron commented Nov 8, 2023

Hi, @felipepessoto, sorry for the late reply.

I ran TPC-DS on Delta tables with Gluten/Velox on EMR. I used Delta 2.2 with only a few internal commits (they aren't related to Gluten/Velox) and added the Delta jar to $SPARK_HOME/jars. The configuration is the same as yours, except that I don't set spark.gluten.sql.columnar.backend.lib.

> Found that _metadata is not working for Delta when column mapping is enabled.

I'm not familiar with _metadata. Where is it used when querying Delta in a Gluten/Velox environment? If _metadata is only used when parsing the Delta log, I think we can ignore it for now, because other factors (such as UDFs) already make Delta log parsing unsupported, and the time spent in this phase is low.

@YannByron YannByron closed this Nov 8, 2023
@felipepessoto
Contributor

I think _metadata is used little or not at all in 2.2, but it may be needed to replace the input_file_name UDF. In recent versions such as 2.4 it is used for deletion vectors, which need _metadata_row_index.

I created this repro:

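import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.delta.test.DeltaSQLCommandTest
import org.apache.spark.sql.test.SharedSparkSession

class TestGluten extends QueryTest
  with SharedSparkSession with DeltaSQLCommandTest {
  test("mytest") {
    withTempDir { inputDir =>
      val testPath = inputDir.getCanonicalPath
      // Write a small Delta table with a single `id` column.
      spark.range(10)
        .write
        .format("delta")
        .save(testPath)

      // Enable column mapping by name on the table.
      spark.sql(s"""ALTER TABLE delta.`$testPath` SET TBLPROPERTIES (
        'delta.minReaderVersion' = '2',
        'delta.minWriterVersion' = '5',
        'delta.columnMapping.mode' = 'name'
      )""")

      // Selecting a _metadata field alongside a mapped column triggers the failure.
      spark.sql(s"""SELECT id, _metadata.file_name FROM delta.`$testPath`""").show(false)
    }
  }
}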

It fails because the rewrite tries to rename every column through the column mapping, and _metadata fields are not in that mapping:

[info] - mytest *** FAILED *** (7 seconds, 852 milliseconds)
[info] java.util.NoSuchElementException: key not found: file_name
[info] at scala.collection.MapLike.default(MapLike.scala:236)
[info] at scala.collection.MapLike.default$(MapLike.scala:235)
[info] at scala.collection.AbstractMap.default(Map.scala:65)
[info] at scala.collection.mutable.HashMap.apply(HashMap.scala:69)
[info] at io.glutenproject.extension.RewritePlanIfNeeded.$anonfun$transformColumnMappingPlan$4(ColumnarOverrides.scala:109)
[info] at scala.collection.immutable.List.map(List.scala:297)
[info] at io.glutenproject.extension.RewritePlanIfNeeded.io$glutenproject$extension$RewritePlanIfNeeded$$transformColumnMappingPlan(ColumnarOverrides.scala:107)
[info] at io.glutenproject.extension.RewritePlanIfNeeded$$anonfun$apply$1.applyOrElse(ColumnarOverrides.scala:64)
[info] at io.glutenproject.extension.RewritePlanIfNeeded$$anonfun$apply$1.applyOrElse(ColumnarOverrides.scala:59)

val newAttr = o.withName(columnNameMapping(o.name))
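
A possible direction, sketched here under stated assumptions rather than taken from the PR, is to rename only the attributes that have an entry in the mapping and leave everything else (such as `_metadata` fields) untouched. `Attr`, `ColumnMappingSketch`, `remap`, and the physical name `col-1234` are hypothetical stand-ins for illustration; only `columnNameMapping` and `withName` mirror names that appear in the line above.

```scala
import scala.collection.mutable

// Hypothetical, simplified model; not the actual Gluten code.
object ColumnMappingSketch {
  // Stand-in for a Spark attribute that can be renamed.
  final case class Attr(name: String) {
    def withName(newName: String): Attr = copy(name = newName)
  }

  // Rename attributes that have a physical-name mapping. Attributes without
  // an entry keep their name, because getOrElse falls back to the original
  // name where HashMap.apply would throw NoSuchElementException.
  def remap(
      attrs: Seq[Attr],
      columnNameMapping: mutable.HashMap[String, String]): Seq[Attr] =
    attrs.map(a => a.withName(columnNameMapping.getOrElse(a.name, a.name)))
}
```

Whether skipping unmapped attributes is sufficient for `_metadata` to work end to end is not established here; the sketch only shows how the `key not found` exception could be avoided.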

Development

Successfully merging this pull request may close these issues.

Velox doesn't work with Spark Delta Lake
3 participants