Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to disable filesystem caching of /_delta_log/ directory #23408

Merged
merged 1 commit into from
Sep 23, 2024

Conversation

sdaberdaku
Copy link
Member

@sdaberdaku sdaberdaku commented Sep 13, 2024

Description

Added a configuration option to disable object caching of files with /_delta_log/ in their path to avoid issues with Delta tables with mutable commits. This is useful in those scenarios when delta tables are deleted and re-created and the files inside the _delta_log folder cannot be considered immutable anymore, and thus are unsafe to cache.

Additional context and related issues

Fixes #21451

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Delta
* Add a configuration `delta.fs.cache.disable-transaction-log-caching` to allow disabling caching of delta transaction logs when filesystem caching is enabled. ({issue}`21451`)

Copy link

cla-bot bot commented Sep 13, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@github-actions github-actions bot added docs delta-lake Delta Lake connector labels Sep 13, 2024
Copy link

cla-bot bot commented Sep 13, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

Copy link

cla-bot bot commented Sep 13, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

Copy link

cla-bot bot commented Sep 13, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@sdaberdaku sdaberdaku marked this pull request as ready for review September 13, 2024 22:27
@wendigo
Copy link
Contributor

wendigo commented Sep 14, 2024

@cla-bot check

Copy link

cla-bot bot commented Sep 14, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

Copy link

cla-bot bot commented Sep 14, 2024

The cla-bot has been summoned, and re-checked this pull request!

Copy link
Contributor

@jkylling jkylling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR! It would be good if @raunaqmorarka and @wendigo could take a look as well.

@Praveen2112
Copy link
Member

scenarios when delta tables are deleted and re-created

Since the delta lake support CREATE OR REPLACE TABLE the files inside the _delta_log could be immutable right ?

@sdaberdaku
Copy link
Member Author

sdaberdaku commented Sep 16, 2024

Since the delta lake support CREATE OR REPLACE TABLE the files inside the _delta_log could be immutable right ?

@raunaqmorarka When overwriting with some Spark operations or externally deleting the delta table (i.e. manually deleting the files on the object storage) and recreating it makes the delta log mutable.

Copy link

cla-bot bot commented Sep 16, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@sdaberdaku
Copy link
Member Author

Dear @jkylling, @raunaqmorarka, and, @Praveen2112,

I implemented the suggested changes.
I believe the new implementation is more elegant, and the test should be more significant as well.
Looking forward to hearing from you!

Best,

Sebastian

PS: For some reason, the verification/cla-signed is still failing although I submitted the CLA on Friday. Maybe it needs more time to get processed.

Copy link

cla-bot bot commented Sep 16, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

Copy link

cla-bot bot commented Sep 16, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@raunaqmorarka
Copy link
Member

Since the delta lake support CREATE OR REPLACE TABLE the files inside the _delta_log could be immutable right ?

@raunaqmorarka When overwriting with some Spark operations or externally deleting the delta table (i.e. manually deleting the files on the object storage) and recreating it makes the delta log mutable.

Could you clarify what Spark operations we would need to run to encounter this situation ?
Is manually deleting files on object storage and then recreating it with the same a valid way of changing a delta lake table ? Why not use CREATE OR REPLACE TABLE which would preserve _delta_log immutability ?

For some reason, the verification/cla-signed is still failing although I submitted the CLA on Friday

CLA is processed manually, it will take time to process.

Copy link

cla-bot bot commented Sep 16, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@sdaberdaku
Copy link
Member Author

sdaberdaku commented Sep 16, 2024

Since the delta lake support CREATE OR REPLACE TABLE the files inside the _delta_log could be immutable right ?

@raunaqmorarka When overwriting with some Spark operations or externally deleting the delta table (i.e. manually deleting the files on the object storage) and recreating it makes the delta log mutable.

Could you clarify what Spark operations we would need to run to encounter this situation ? Is manually deleting files on object storage and then recreating it with the same a valid way of changing a delta lake table ? Why not use CREATE OR REPLACE TABLE which would preserve _delta_log immutability ?

For some reason, the verification/cla-signed is still failing although I submitted the CLA on Friday

CLA is processed manually, it will take time to process.

It has happened to me when running:

(df
 .write
 .format("delta")
 .mode("overwrite")
 .option("overwriteSchema", "true")
 .save(path))

With Spark 3.5.1 and Delta 3.1.0.

Copy link

cla-bot bot commented Sep 16, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@sdaberdaku
Copy link
Member Author

@cla-bot check

@cla-bot cla-bot bot added the cla-signed label Sep 18, 2024
Copy link

cla-bot bot commented Sep 18, 2024

The cla-bot has been summoned, and re-checked this pull request!

@sdaberdaku
Copy link
Member Author

Hello @jkylling, @Praveen2112, @raunaqmorarka, and @wendigo!

My CLA has finally been processed!
Whenever you can, please have another look at my PR.

Best,

S.

@sdaberdaku
Copy link
Member Author

Hello @jkylling and @raunaqmorarka,

I don't know if you guys had a chance to review the minimal test I wrote.
Whenever you find the time, please let me know!

Thanks again for your support!

Best,
S.

Copy link
Member

@raunaqmorarka raunaqmorarka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comment about test, lgtm otherwise

Copy link
Member

@raunaqmorarka raunaqmorarka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm % minor comment

Copy link
Member

@Praveen2112 Praveen2112 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@raunaqmorarka raunaqmorarka merged commit 99b456a into trinodb:master Sep 23, 2024
25 checks passed
@github-actions github-actions bot added this to the 459 milestone Sep 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging this pull request may close these issues.

Cached delta log gets corrupted when dropping and recreating delta table with Trino
5 participants