Document SQL dialect guidance #13706

Open
wants to merge 7 commits into
base: main
53 changes: 53 additions & 0 deletions docs/source/user-guide/sql/dialect.md
@@ -0,0 +1,53 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# SQL Dialect

Member

it would be good to say this section pertains to the SQL frontend:

  • sql parser and sql-to-rel
  • dataframe API
  • analyzer and the type coercions done at this stage
  • function semantics of functions bundled with datafusion

see also #13704 (comment)

it's also worth noting that we are not going to align with PostgreSQL's dialect fully

  • eg our type system is different (inherited from arrow)
  • we provide extended syntax for certain operations (like CREATE TABLE)
  • (...)

Contributor Author

Reworded and expanded in b674a42


The SQL supported in Apache DataFusion mostly follows the [PostgreSQL
SQL dialect], including:

- The SQL parser and [SQL planner]

Contributor

Suggested change
-- The SQL parser and [SQL planner]
+- SQL Syntax

I think "SQL syntax" is clearer for users. Also, I don't see why the SQL planner is involved in the SQL dialect.

Member

for user -- agreed.
but DF is also a library of reusable components.
The parser and sql-to-rel also follow some specific dialect's semantics, and this should be documented.
Not sure if it belongs under user-guide/ though.

- Type checking, analyzer, and type coercions
- Semantics of functions bundled with DataFusion

Notable exceptions:

- Array/List functions and semantics follow the [DuckDB SQL dialect].

Member

Not quite? DuckDB array seems to be fixed size. Is this saying DF array follows DuckDB's list semantics?

Contributor

yes DuckDB's "list" not "array". We prefer to convert fixed-size lists into regular lists, so we don’t make a distinction between the two

Member

is this going to be obvious to the reader of this document?

- DataFusion's type system is based on the [Apache Arrow type system], and the mapping to PostgreSQL types is not always 1:1.
- DataFusion has its own syntax (dialect) for certain operations (like [`CREATE EXTERNAL TABLE`])
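
The exceptions above can be illustrated with a few statements. This is only a sketch: it assumes a recent DataFusion release (for example via `datafusion-cli`), the table name `example` and the file path are hypothetical, and the exact output formatting may differ between versions.

```sql
-- Lists: make_array and the [...] literal build variable-length lists
-- (DuckDB-style "list" semantics rather than fixed-size arrays).
SELECT make_array(1, 2, 3) AS xs, array_append([1, 2], 3) AS appended;

-- Types: arrow_typeof reports the underlying Arrow type of an expression,
-- which does not always map 1:1 to a PostgreSQL type.
SELECT arrow_typeof(make_array(1, 2, 3)), arrow_typeof(CAST(1 AS INT));

-- DataFusion-specific DDL: register a file as a table.
-- The table name and path below are placeholders.
CREATE EXTERNAL TABLE example
STORED AS PARQUET
LOCATION '/path/to/example.parquet';
```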

As Apache DataFusion is designed to be fully customizable, systems built on
DataFusion can and do implement different SQL semantics. Using DataFusion's APIs,
you can provide alternate function definitions, type rules, and/or SQL syntax
that matches other systems such as Apache Spark or MySQL or your own custom
semantics.
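
As one small illustration of this customizability, the parser dialect can be switched through a configuration option. The sketch below assumes the `datafusion.sql_parser.dialect` setting and a session that accepts `SET` statements (such as `datafusion-cli`). It changes only which syntax the parser accepts; alternate function definitions and type rules would be registered separately through the Rust APIs.

```sql
-- Assumption: the session exposes the datafusion.sql_parser.dialect option.
-- Switch the parser to MySQL syntax for subsequent statements.
SET datafusion.sql_parser.dialect = 'mysql';

-- Switch back to the generic dialect.
SET datafusion.sql_parser.dialect = 'generic';
```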

[postgresql sql dialect]: https://www.postgresql.org/docs/current/sql.html
[sql planner]: https://docs.rs/datafusion/latest/datafusion/sql/planner/struct.SqlToRel.html
[duckdb sql dialect]: https://duckdb.org/docs/sql/functions/array
[apache arrow type system]: https://arrow.apache.org/docs/format/Columnar.html#data-types
[`create external table`]: ddl.md#create-external-table

## Rationale

SQL engines can either adopt an existing SQL dialect or define their
own. An existing dialect may not fit perfectly, as it is hard to match its
semantics exactly (doing so requires bug-for-bug compatibility), and is likely not what all

Contributor

More reasons for using Postgres specifically: it is

  • rather extensive, in contrast to e.g. sqlite
  • rather well aligned w/ the SQL standard (at least that's my personal impression, after having faced MySQL)

Member

> rather well aligned w/ the SQL standard (at least that's my personal impression, after having faced MySQL)

mostly true (but i know of some deviations)
if we wanted something "executable but also aligned with SQL std", I'd recommend Trino

i kind of assumed PostgreSQL ship has sailed and we're just retro-documenting. But if the ball (choice) is still in play, my vote goes to Trino as a good reference implementation.

Contributor Author

> i kind of assumed PostgreSQL ship has sailed and we're just retro-documenting.

That was my assumption too, but as others like @jayzhan211 (and yourself) have pointed out, I don't think DataFusion is (or can be) 100% consistent with this point

users want. However, it avoids the (very significant) effort of defining
semantics as well as documenting and teaching users about them.
1 change: 1 addition & 0 deletions docs/source/user-guide/sql/index.rst
@@ -21,6 +21,7 @@ SQL Reference
.. toctree::
   :maxdepth: 2

   dialect
   data_types
   select
   subqueries