-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
COPY TO allign with CREATE EXTERNAL TABLE #10
Conversation
@ozankabak If the syntax meets the requirements, I will submit a PR to the upstream repository. |
Looks good |
'parquet.bloom_filter_ndv' 100 | ||
STORED AS PARQUET | ||
OPTIONS ( | ||
'format.compression' snappy, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be awesome if we could avoid having to repeat out format.
for each option
Note this change may conflict with the changes from @tinfoil-knight in apache#9594
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The challenge here is an option may refer to a non-format thing too. Think about something like:
STORED AS PARQUET
OPTIONS (
format.foo bar
format.fizz buzz
credentials.username admin
)
With the aligned/unified syntax here, all format
entries refer to parquet
and there is no repetition of parquet
. But we repeat format
instead, and this prefix is necessary as it separates format-related and non-format-related options.
As a next step, a consistent and generalizable way to remove repetitive patterns would be a syntax like:
STORED AS PARQUET
OPTIONS format (
foo bar
hey ho
)
OPTIONS other.prefix (
fizz buzz
paul atreides
)
OPTIONS (
credentials.username admin
)
where multiple OPTIONS
are allowable, each with a possible prefix to avoid repetition. In case of an option is specified multiple times with different values, we'd generate an error. The above would be equivalent to
STORED AS PARQUET
OPTIONS (
format.foo bar
format.hey ho
other.prefix.fizz buzz
other.prefix.paul atreides
credentials.username admin
)
where the user is explicitly repeating prefixes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it -- I agree that being able to explicitly namespace out the options is valuable to disambiguate them
I think as a user it would be really nice to not have to worry about such collisions unless i actually had to disambiguate them
Like in my mind, this statement isn't ambiguous
STORED AS PARQUET
OPTIONS (
compression snappy
)
Because the only possible "compression" option is on the format.
I think the way @tinfoil-knight handles this in apache#9594 is to internally try and resolve compression
as though it were format.compression
(or parquet.compression
) if the compression
configuration isn't valid.
Even something like this is not ambiguous (though I realize actually implementing this may not be feasible without much more work):
STORED AS PARQUET
LOCATION 's3://....'
OPTIONS (
compression snappy
access_key paul
secrete_key atreides
)
As a next step, a consistent and generalizable way to remove repetitive patterns would be a syntax like:
That is definitely neat as well 🤔
So TLDR is I think it would be ok to introduce the syntax in this PR, but then I would likely try and propose (as a follow on PR) some smarter configuration namespace resolution that didn't require explicitly typing out the namespace each time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Let's get the "base" syntax in and then we can progressively optimize shortcuts for cases when there is no ambiguity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW I think this is a PR into the synnada fork, not the apache repo (not sure if that is the intent)
I have one question, about a test change, but otherwise this PR looks good to me.
I would like to track the "make the syntax require less repeitition" as discussed in https://github.com/synnada-ai/datafusion-upstream/pull/10/files#r1523875240 as a new ticket (or maybe add to the existing one in apache#9575)
@@ -299,7 +299,7 @@ select * from validate_parquet_with_options; | |||
|
|||
# Copy from table to single file | |||
query IT | |||
COPY source_table to 'test_files/scratch/copy/table.parquet' STORED AS PARQUET; | |||
COPY source_table to 'test_files/scratch/copy/table.parquet'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
@@ -152,7 +152,7 @@ FileSinkExec: sink=ParquetSink(file_groups=[]) | |||
--MemoryExec: partitions=1, partition_sizes=[1] | |||
|
|||
# Error case | |||
query error DataFusion error: SQL error: ParserError\("Missing STORED AS clause in COPY statement"\) | |||
query error DataFusion error: Invalid or Unsupported Configuration: Format not explicitly set and unable to get file extension! Use STORED AS to define file format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -23,7 +23,7 @@ | |||
# create.sql came from | |||
# https://github.com/ClickHouse/ClickBench/blob/8b9e3aa05ea18afa427f14909ddc678b8ef0d5e6/datafusion/create.sql | |||
# Data file made with DuckDB: | |||
# COPY (SELECT * FROM 'hits.parquet' LIMIT 10) TO 'clickbench_hits_10.parquet' (FORMAT PARQUET); | |||
# COPY (SELECT * FROM 'hits.parquet' LIMIT 10) TO 'clickbench_hits_10.parquet'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this (comment) should not be updated as it is duckdb comment
CREATE EXTERNAL TABLE validate_partitioned_escape_quote STORED AS CSV | ||
LOCATION 'test_files/scratch/copy/escape_quote/' PARTITIONED BY ("'test2'", "'test3'"); | ||
|
||
## Until the partition by parsing uses ColumnDef, this test is meaningless since it becomes an overfit. Even in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this comment or why this test doesn't work anymore. Maybe it just needs to be updated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before this pull request, there was a problem with how escape characters in the PARTITIONED BY clause were processed in both the CREATE EXTERNAL TABLE and COPY TO statements. The issue stemmed from converting the Value
to a string too early in the parsing process, leading to errors.
Although it appears that the test passes, in reality, during the execution of CREATE EXTERNAL TABLE, the specified field names do not match as expected, indicating a silent error. This issue should be addressed separately, as it falls outside the scope of the current PR.
I propose that we standardize the approach to handling column names in both the schema definitions and the PARTITIONED BY clauses. Once we establish a consistent method, the relevance of this test can be reassessed.
I will proceed to open an issue to address it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed apache#9714
We usually scan the PRs internally first, then submit an upstream PR. The upstream PR was already opened at apache#9604. You can review it on upstream as well. |
Got it. Sorry about my confusion. |
Merged upstream. |
… `interval` (apache#11466) * Unparser rule for datatime cast (#10) * use timestamp as the identifier for date64 * rename * implement CustomDialectBuilder * fix * dialect with interval style (#11) --------- Co-authored-by: Phillip LeBlanc <[email protected]> * fmt * clippy * doc * Update datafusion/sql/src/unparser/expr.rs Co-authored-by: Andrew Lamb <[email protected]> * update the doc for CustomDialectBuilder * fix doc test --------- Co-authored-by: Phillip LeBlanc <[email protected]> Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
Closes apache#9369.
Rationale for this change
Changing
into
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?