-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
support unprefixed config format options #9594
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working on this @tinfoil-knight! The overall approach looks good and is well tested. I have proposed a few changes. Let me know what you think.
datafusion/sql/src/statement.rs
Outdated
match &file_type { | ||
// Renames un-prefixed keys to support legacy format specific options | ||
FileType::CSV | FileType::JSON | FileType::PARQUET => { | ||
let prefix = format!("{}", file_type); | ||
let keys_to_rename: Vec<_> = options | ||
.keys() | ||
.filter(|key| !key.starts_with(&prefix)) | ||
.cloned() | ||
.collect(); | ||
|
||
for key in keys_to_rename { | ||
if let Some(value) = options.remove(&key) { | ||
let new_key = format!("{}.{}", prefix, key); | ||
options.insert(new_key, value); | ||
} | ||
} | ||
} | ||
_ => {} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
match &file_type { | |
// Renames un-prefixed keys to support legacy format specific options | |
FileType::CSV | FileType::JSON | FileType::PARQUET => { | |
let prefix = format!("{}", file_type); | |
let keys_to_rename: Vec<_> = options | |
.keys() | |
.filter(|key| !key.starts_with(&prefix)) | |
.cloned() | |
.collect(); | |
for key in keys_to_rename { | |
if let Some(value) = options.remove(&key) { | |
let new_key = format!("{}.{}", prefix, key); | |
options.insert(new_key, value); | |
} | |
} | |
} | |
_ => {} | |
} | |
let options = match &file_type { | |
// Renames un-prefixed keys to support legacy format specific options | |
FileType::CSV | FileType::JSON | FileType::PARQUET => { | |
options | |
.into_iter() | |
.map(|(k, v)| { | |
// If config does not belong to any namespace, assume it is | |
// a legacy option and apply file_type namespace for backwards | |
// compatibility. | |
if !k.contains('.') { | |
let new_key = format!("{}.{}", file_type, k); | |
(new_key, v) | |
} else { | |
(k, v) | |
} | |
}) | |
.collect() | |
} | |
_ => options, | |
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The overall strategy here looks good, but I propose two changes:
- We can make the logic more concise by modifying options in a single pass with into_iter, map, collect. This also avoids clones.
- I think we should be a bit more conservative about the keys we modify. If the key has a namespace at all (rather than specifically the file format namespace), we can leave it as is. Downstream users may for example add a custom namespace and we wouldn't wan't this code to modify it unexpectedly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented both changes.
Converted the PR to draft for the while being due to #9594 (comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @tinfoil-knight and @devinjdangelo
I think once @devinjdangelo 's comments are addressed this PR will be ready to go
cc @metesynnada @ozankabak and @mustafasrepo as you commented on the original PR
@@ -484,3 +484,35 @@ COPY (select col2, sum(col1) from source_table | |||
# Copy from table with non literal | |||
query error DataFusion error: SQL error: ParserError\("Expected ',' or '\)' after option definition, found: \+"\) | |||
COPY source_table to '/tmp/table.parquet' (row_group_size 55 + 102); | |||
|
|||
|
|||
# Legacy Format Options Support |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
Thank you @tinfoil-knight for this. We will do a quick follow-up finalizing the design in #9382. Let's get the follow-up in and then modify this/get this in afterwards for backwards compatibility. |
Okay. Sounds good. |
@tinfoil-knight I think this should be unblocked now that #9382 is closed. If you have a moment to rebase/resolve any conflicts and CI failures, I think we can get this merged after a final review. |
I think we should also consider holding the release for this PR (see #9682 ) - so if you need help finishing it up @tinfoil-knight please just let us know |
I'll fix the tests by tomorrow & will let you know if anything comes up. |
@devinjdangelo I'm not sure if the original issue is valid anymore. After #9604, you can use formatting options without prefixing them with the format name. See this for an example: I think this PR & the linked issue should be closed. @alamb noted an issue here: #9604 (review) with the I also found a CLI bug while working on this issue & I've filed: #9707 & pushed a PR for it separately. |
Update: #9716 is opened now for the format prefix issue. @alamb I'm closing this PR since it doesn't seem relevant anymore. See the previous comment (#9594 (comment)) for more details. Please close the issue: #9575 now as it's stale. |
Which issue does this PR close?
Closes #9575
Rationale for this change
#9382 standardised write configuration but broke backwards compatibility for the format specific options of the
COPY
command.Earlier users could have a query like this without prefixing all options:
But after #9382, they'd need to prefix every format-specific options in this verbose way:
This PR adds back the support for format-specific options that worked without providing the format prefix.
What changes are included in this PR?
This PR renames keys without the file_type/format prefix for file types supported by the
COPY
command for backwards compatibility.Are these changes tested?
copy.slt
confirm that the prefixed format options (eg.(format json, 'json.compression' gzip)
) added in Systematic Configuration in 'Create External Table' and 'Copy To' Options #9382 work & aren't broken.(format json, compression gzip)
).Are there any user-facing changes?
Yes. But the current documentation (https://arrow.apache.org/datafusion/user-guide/sql/write_options.html#available-options) lists the legacy format so this PR doesn't require a doc update.
This PR restores breaking changes so I'm not adding the
api change
label. Please let me know if you think we should add the label nonetheless.