-
Notifications
You must be signed in to change notification settings - Fork 445
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Gluten-core][VL] Supports DeltaLake 2.2 Read #3376
Conversation
Thanks for opening a pull request! Could you open an issue for this pull request on Github Issues? https://github.com/oap-project/gluten/issues Then could you also rename commit message and pull request title in the following format?
See also: |
Run Gluten Clickhouse CI |
Follow #2902, more previous discussion can be found there. |
Run Gluten Clickhouse CI |
5c4dc20
to
403eb40
Compare
Run Gluten Clickhouse CI |
403eb40
to
3d6a949
Compare
Run Gluten Clickhouse CI |
3d6a949
to
bc6f0f4
Compare
Run Gluten Clickhouse CI |
@YannByron do you have any suggestion how we can test how broken is Delta when running on Gluten/Velox? I started an experiment, but then realized it doesn't follow the same pattern Gluten uses to test Spark. It seems we extends the existing classes and adds GlutenSQLTestsTrait, example: https://github.com/oap-project/gluten/pull/1381/files. The experiment I started was to actually change Delta code to run the tests using Gluten: 1-) Cherry picked this to Delta 2.2. (this old workaround is not needed anymore and was messing the extension setting): delta-io/delta@975daae
3-) Added a reference to gluten jar in libraryDependencies for delta-core. But it it not a complete solution, most of tests are failing or not running. And the ones which passes, I can't say if they are passing because it is really working fine with Gluten/Velox, or because it falls back to non-native. |
Found that _metadata is not working for Delta when column mapping is enabled. _metadata has been recently changed to fall back: #2618 |
@YannByron will you work on update this PR based on discussion? |
Hi, @felipepessoto, sorry for the late reply. I run TPCDS based on delta table with gluten/velox in EMR. I use Delta 2.2 only with some internal commits (there aren't related to gluten/velox), and add the delta jar in
I'm not really aware of this _metadata. Where to use this during querying delta in gluten/velox env? If _metadata is used in parsing delta log, i think we can ignore for now because there are other factors (like UDFs) which make parsing delta log unsupported and the time cost for this phase was low. |
I think _metadata is not heavily used in 2.2, or not at all, but may be needed to replace the input_file_name UDF. In recent versions like 2.4 it is used for deletion vectors, as it needs the _metadata_row_index. I created this repro:
It fails because we try to replace every column, and _metadata fields are not in the mapping:
|
What changes were proposed in this pull request?
(Fixes: #2891)
How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)