COPY TO align with CREATE EXTERNAL TABLE #10

metesynnada · 2024-03-13T14:36:03Z

Which issue does this PR close?

Rationale for this change

Changing

OPTIONS (
    format X,
    X.foo.bar baz
)

into

STORED AS X
OPTIONS (
    format.foo.bar baz
)
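
The mapping above can be sketched in a few lines of Python. This only models the key renaming on a plain dictionary; the function name `migrate_copy_options` and the dict representation are hypothetical and are not part of DataFusion's parser.

```python
from typing import Dict, Optional, Tuple


def migrate_copy_options(options: Dict[str, str]) -> Tuple[Optional[str], Dict[str, str]]:
    """Split old-style options into (STORED AS value, new-style options).

    Hypothetical helper: old style carries `format X` plus `X.`-prefixed keys,
    new style carries the format separately plus `format.`-prefixed keys.
    """
    remaining = dict(options)
    fmt = remaining.pop("format", None)
    if fmt is None:
        # No format entry: nothing to rename.
        return None, remaining

    old_prefix = fmt.lower() + "."
    new_options: Dict[str, str] = {}
    for key, value in remaining.items():
        if key.lower().startswith(old_prefix):
            # `X.foo.bar` becomes `format.foo.bar`.
            new_options["format." + key[len(old_prefix):]] = value
        else:
            # Non-format options (for example credentials) pass through unchanged.
            new_options[key] = value
    return fmt.upper(), new_options


stored_as, new_opts = migrate_copy_options({"format": "X", "X.foo.bar": "baz"})
```

Here `stored_as` plays the role of the `STORED AS X` clause and `new_opts` holds the entries that would go inside `OPTIONS ( ... )`.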

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

metesynnada · 2024-03-13T14:36:58Z

@ozankabak If the syntax meets the requirements, I will submit a PR to the upstream repository.

ozankabak · 2024-03-13T17:07:59Z

Looks good

alamb · 2024-03-13T17:21:46Z

datafusion/sqllogictest/test_files/copy.slt

-'parquet.bloom_filter_ndv' 100
+STORED AS PARQUET
+OPTIONS (
+'format.compression' snappy,


It would be awesome if we could avoid having to repeat the format. prefix for each option

Note this change may conflict with the changes from @tinfoil-knight in apache#9594

ozankabak · 2024-03-13

The challenge here is that an option may refer to a non-format thing too. Think about something like:

STORED AS PARQUET
OPTIONS (
    format.foo bar,
    format.fizz buzz,
    credentials.username admin
)

With the aligned/unified syntax here, all format entries refer to parquet and there is no repetition of parquet. But we repeat format instead, and this prefix is necessary as it separates format-related and non-format-related options.

As a next step, a consistent and generalizable way to remove repetitive patterns would be a syntax like:

STORED AS PARQUET
OPTIONS format (
    foo bar,
    hey ho
)
OPTIONS other.prefix (
    fizz buzz,
    paul atreides
)
OPTIONS (
    credentials.username admin
)

where multiple OPTIONS clauses are allowed, each with an optional prefix to avoid repetition. If an option is specified multiple times with different values, we'd generate an error. The above would be equivalent to

STORED AS PARQUET
OPTIONS (
    format.foo bar,
    format.hey ho,
    other.prefix.fizz buzz,
    other.prefix.paul atreides,
    credentials.username admin
)

where the user is explicitly repeating prefixes.
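
A minimal Python sketch of the expansion described above, assuming each OPTIONS clause is modelled as a `(prefix, entries)` pair. The name `flatten_option_groups` and this representation are hypothetical; the proposed syntax itself is not implemented in this PR.

```python
from typing import Dict, List, Optional, Tuple

# One OPTIONS clause: an optional prefix plus its key/value entries (hypothetical model).
OptionGroup = Tuple[Optional[str], Dict[str, str]]


def flatten_option_groups(groups: List[OptionGroup]) -> Dict[str, str]:
    """Expand prefixed OPTIONS clauses into one flat, fully qualified map."""
    flat: Dict[str, str] = {}
    for prefix, entries in groups:
        for key, value in entries.items():
            full_key = f"{prefix}.{key}" if prefix else key
            # The same option given twice with different values is an error.
            if full_key in flat and flat[full_key] != value:
                raise ValueError(f"option {full_key!r} specified with conflicting values")
            flat[full_key] = value
    return flat


flat_options = flatten_option_groups([
    ("format", {"foo": "bar", "hey": "ho"}),
    ("other.prefix", {"fizz": "buzz", "paul": "atreides"}),
    (None, {"credentials.username": "admin"}),
])
```

With this model, `flat_options` contains the same five fully qualified keys as the explicit form above.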

alamb · 2024-03-13

Got it -- I agree that being able to explicitly namespace the options is valuable to disambiguate them

I think as a user it would be really nice to not have to worry about such collisions unless I actually had to disambiguate them

Like in my mind, this statement isn't ambiguous

STORED AS PARQUET OPTIONS ( compression snappy )

Because the only possible "compression" option is on the format.

I think the way @tinfoil-knight handles this in apache#9594 is to internally try and resolve compression as though it were format.compression (or parquet.compression) if the compression configuration isn't valid.

Even something like this is not ambiguous (though I realize actually implementing this may not be feasible without much more work):

STORED AS PARQUET
LOCATION 's3://....'
OPTIONS (
    compression snappy,
    access_key paul,
    secret_key atreides
)
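
One possible way to model this kind of namespace resolution, sketched in Python. This is not the implementation in apache#9594; the function `resolve_option_key` and the fully qualified key names below are hypothetical and chosen only for illustration.

```python
from typing import Iterable


def resolve_option_key(key: str, known_keys: Iterable[str]) -> str:
    """Resolve a possibly unqualified option name to a fully qualified one."""
    known = set(known_keys)
    if key in known:
        # Already a valid, fully qualified option.
        return key
    # Otherwise look for known options whose last component(s) match the key.
    candidates = sorted(k for k in known if k.endswith("." + key))
    if len(candidates) == 1:
        return candidates[0]
    if not candidates:
        raise KeyError(f"unknown option {key!r}")
    # More than one namespace defines this name: the user must disambiguate.
    raise KeyError(f"ambiguous option {key!r}: could be any of {candidates}")


# Hypothetical set of options known for a Parquet sink on S3.
KNOWN_KEYS = {"format.compression", "credentials.access_key", "credentials.secret_key"}
resolved = resolve_option_key("compression", KNOWN_KEYS)
```

The explicit prefix is only needed when the same short name exists in more than one namespace, which is the case the sketch reports as an error.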

> As a next step, a consistent and generalizable way to remove repetitive patterns would be a syntax like:

That is definitely neat as well 🤔

So TLDR is I think it would be ok to introduce the syntax in this PR, but then I would likely try and propose (as a follow on PR) some smarter configuration namespace resolution that didn't require explicitly typing out the namespace each time

ozankabak · 2024-03-13

Sounds good. Let's get the "base" syntax in and then we can progressively optimize shortcuts for cases when there is no ambiguity.

alamb

FWIW I think this is a PR into the synnada fork, not the apache repo (not sure if that is the intent)

I have one question, about a test change, but otherwise this PR looks good to me.

I would like to track the "make the syntax require less repetition" work, as discussed in https://github.com/synnada-ai/datafusion-upstream/pull/10/files#r1523875240, as a new ticket (or maybe add it to the existing one in apache#9575)

alamb · 2024-03-15T14:29:49Z

datafusion/sqllogictest/test_files/copy.slt

@@ -299,7 +299,7 @@ select * from validate_parquet_with_options;

 # Copy from table to single file
 query IT
-COPY source_table to 'test_files/scratch/copy/table.parquet' STORED AS PARQUET;
+COPY source_table to 'test_files/scratch/copy/table.parquet';


alamb · 2024-03-15T14:30:06Z

datafusion/sqllogictest/test_files/copy.slt

@@ -152,7 +152,7 @@ FileSinkExec: sink=ParquetSink(file_groups=[])
 --MemoryExec: partitions=1, partition_sizes=[1]

 # Error case
-query error DataFusion error: SQL error: ParserError\("Missing STORED AS clause in COPY statement"\)
+query error DataFusion error: Invalid or Unsupported Configuration: Format not explicitly set and unable to get file extension! Use STORED AS to define file format.
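
The two hunks above suggest that, with STORED AS now optional, the format is taken from the target's file extension when it is not given explicitly. A rough Python sketch of that fallback follows; the function `infer_format` is hypothetical and the exact rules in DataFusion may differ.

```python
import os
from typing import Optional


def infer_format(target_path: str, stored_as: Optional[str] = None) -> str:
    """Pick the output format: explicit STORED AS first, then the file extension."""
    if stored_as:
        return stored_as.upper()
    _, ext = os.path.splitext(target_path)
    if not ext:
        # Mirrors the wording of the error message in the test above.
        raise ValueError(
            "Format not explicitly set and unable to get file extension! "
            "Use STORED AS to define file format."
        )
    return ext[1:].upper()


inferred = infer_format("test_files/scratch/copy/table.parquet")
```

A target without an extension, such as a directory path, has nothing to infer from, so the sketch raises unless `stored_as` is supplied.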


alamb · 2024-03-15T14:32:10Z

datafusion/sqllogictest/test_files/clickbench.slt

@@ -23,7 +23,7 @@
 # create.sql came from
 # https://github.com/ClickHouse/ClickBench/blob/8b9e3aa05ea18afa427f14909ddc678b8ef0d5e6/datafusion/create.sql
 # Data file made with DuckDB:
-# COPY (SELECT * FROM 'hits.parquet' LIMIT 10) TO 'clickbench_hits_10.parquet' (FORMAT PARQUET);
+# COPY (SELECT * FROM 'hits.parquet' LIMIT 10) TO 'clickbench_hits_10.parquet';


I think this (comment) should not be updated, as it is a DuckDB comment

alamb · 2024-03-15T14:33:38Z

datafusion/sqllogictest/test_files/copy.slt

-CREATE EXTERNAL TABLE validate_partitioned_escape_quote STORED AS CSV 
-LOCATION 'test_files/scratch/copy/escape_quote/' PARTITIONED BY ("'test2'", "'test3'");
-
+## Until the partition by parsing uses ColumnDef, this test is meaningless since it becomes an overfit. Even in


I don't understand this comment or why this test doesn't work anymore. Maybe it just needs to be updated?

metesynnada · 2024-03-15

Before this pull request, there was a problem with how escape characters in the PARTITIONED BY clause were processed in both the CREATE EXTERNAL TABLE and COPY TO statements. The issue stemmed from converting the Value to a string too early in the parsing process, leading to errors.

Although it appears that the test passes, in reality, during the execution of CREATE EXTERNAL TABLE, the specified field names do not match as expected, indicating a silent error. This issue should be addressed separately, as it falls outside the scope of the current PR.

I propose that we standardize the approach to handling column names in both the schema definitions and the PARTITIONED BY clauses. Once we establish a consistent method, the relevance of this test can be reassessed.

I will proceed to open an issue to address it.

alamb · 2024-03-20

Filed apache#9714

metesynnada · 2024-03-15T14:52:13Z

We usually scan the PRs internally first, then submit an upstream PR. The upstream PR was already opened at apache#9604. You can review it on upstream as well.

alamb · 2024-03-15T20:16:40Z

> We usually scan the PRs internally first, then submit an upstream PR. The upstream PR was already opened at apache#9604. You can review it on upstream as well.

Got it. Sorry about my confusion.

ozankabak · 2024-03-18T22:46:49Z

Merged upstream.


COPY TO align with CREATE EXTERNAL TABLE

7af77ce

github-actions bot added core sqllogictest sql labels Mar 13, 2024

alamb reviewed Mar 13, 2024

Resolve datafusion-cli error

536e291

metesynnada mentioned this pull request Mar 14, 2024

Make COPY TO align with CREATE EXTERNAL TABLE apache/datafusion#9604

Merged

Make STORED AS optional

68105cf

alamb reviewed Mar 15, 2024

ozankabak and others added 5 commits March 16, 2024 11:25

Review

8df4b0e

Review resolved

59e51b5

Merge remote-tracking branch 'upstream/main' into copy-to-parser

9efaae0

Merge resolve

8c82a94

Enhancing comments, solving some bugs

065295f

ozankabak closed this Mar 18, 2024

This was referenced Mar 20, 2024

Can not handle ' characters in PARTITIONED BY clause apache/datafusion#9714

Open

Regression: All formatting options in COPY commands require format. prefix, but did not in DataFusion 36.0.0 apache/datafusion#9716

Closed


