Basic implementation for DuckDB as PG Engine #437

Merged

@goldmedal goldmedal commented Jan 19, 2024

Description

Previously, Accio executed all SQL on the data source side and served purely as a SQL conversion layer. This makes it hard to support new data source types: implementing a connector requires extensive SQL rewriting or a dedicated SQL dialect, and each data source has its own quirks, so adding new ones is cumbersome.

This PR introduces an architecture that uses DuckDB as the single PG engine: every PG-related query runs in DuckDB, and all other queries run in the data source.

How it works

DuckDB is a highly PG-compatible SQL engine

It implements many pg_catalog and information_schema views. We sync the MDL directly to DuckDB, and can then query those schemas.

3-Level Query Flow

The main purpose of implementing the PG Wire Protocol is to broaden our client ecosystem. We don't expose PG functionality for general use; we only care about the SQL that BI tools and PG drivers issue. In our experience, most PG-related SQL is used to fetch metadata, so we don't need to execute every query on the data source side.

Level 1 - Metastore, fully supported

The SQL is fully supported by DuckDB without any rewrite.

Level 2 - Metastore, semi-supported

The SQL isn't supported by DuckDB as written, but it relates to PG metadata. We rewrite it before running it in DuckDB.

Level 3 - Data source

The SQL queries real data. We execute it on the data source side.
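
As a rough illustration of the three levels, here is a minimal sketch of how a query might be classified. Everything in it is hypothetical: the names QueryLevel and QueryLevelClassifier, the substring-based detection of metadata queries, and the caller-supplied list of unsupported markers are assumptions for illustration, not the logic this PR implements.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical names for the three levels described above. */
enum QueryLevel { METASTORE_FULL, METASTORE_SEMI, DATA_SOURCE }

final class QueryLevelClassifier {
    // Assumption: a query counts as PG metadata if it touches these schemas.
    private static final List<String> METADATA_MARKERS =
            List.of("pg_catalog.", "information_schema.");

    // Assumption: the caller supplies lowercase substrings that mark SQL
    // DuckDB cannot run as-is. No real DuckDB limitation is implied here.
    private final List<String> unsupportedMarkers;

    QueryLevelClassifier(List<String> unsupportedMarkers) {
        this.unsupportedMarkers = List.copyOf(unsupportedMarkers);
    }

    QueryLevel classify(String sql) {
        String normalized = sql.toLowerCase(Locale.ROOT);
        boolean isMetadata = METADATA_MARKERS.stream().anyMatch(normalized::contains);
        if (!isMetadata) {
            // Level 3: real data, run on the data source side.
            return QueryLevel.DATA_SOURCE;
        }
        boolean needsRewrite = unsupportedMarkers.stream().anyMatch(normalized::contains);
        // Level 2 needs a rewrite first; level 1 runs in DuckDB unchanged.
        return needsRewrite ? QueryLevel.METASTORE_SEMI : QueryLevel.METASTORE_FULL;
    }
}
```

A real implementation would more likely inspect the parsed statement than match substrings; the sketch only shows the shape of the decision.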

New Configuration

duckdb.max-concurrent-metadata-queries

We use HikariCP as our connection pool to avoid creating connections repeatedly. This config controls the maximum number of concurrent metadata queries Accio will keep.
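
To make the effect of such a cap concrete, here is a minimal sketch of bounding concurrent metadata queries. It does not use HikariCP: a java.util.concurrent.Semaphore stands in for the pool, and the class name BoundedMetadataExecutor and its methods are hypothetical. How the PR actually maps duckdb.max-concurrent-metadata-queries onto the pool is not shown here.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

/** Hypothetical stand-in for a pool-backed executor; not part of Accio or HikariCP. */
final class BoundedMetadataExecutor {
    private final Semaphore permits;

    BoundedMetadataExecutor(int maxConcurrentMetadataQueries) {
        // One permit per query that is allowed to run at the same time.
        this.permits = new Semaphore(maxConcurrentMetadataQueries);
    }

    /** Runs the query, blocking first if the cap is already reached. */
    <T> T run(Callable<T> query) throws Exception {
        permits.acquire();
        try {
            return query.call();
        } finally {
            // Always give the slot back, even if the query throws.
            permits.release();
        }
    }

    /** Number of queries that could start right now without blocking. */
    int availableSlots() {
        return permits.availablePermits();
    }
}
```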

@goldmedal goldmedal marked this pull request as draft January 19, 2024 10:56
@goldmedal goldmedal force-pushed the feature/duckdb-as-pg-engine-basic branch 2 times, most recently from 83e9dd0 to 5192bcd Compare January 22, 2024 09:07
@grieve54706

We can use accioMDLDirectory.listFiles((dir, name) -> name.endsWith(".json")) to filter early and avoid creating too many unused files.
https://github.com/Canner/accio/blob/5192bcdccb838140fc66ff4ccb03244233edf563/accio-main/src/main/java/io/accio/main/AccioManager.java#L64
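
A minimal sketch of that suggestion, assuming the goal is to list only the .json files directly under the MDL directory. The class and method names (MdlFileLister, jsonFileNames) are hypothetical; File.listFiles(FilenameFilter) is the standard java.io API.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper; only the listFiles call comes from the review comment. */
final class MdlFileLister {
    /** Returns the sorted names of the .json files directly under the directory. */
    static List<String> jsonFileNames(File accioMDLDirectory) {
        // The two-argument lambda selects the FilenameFilter overload, so
        // non-JSON entries are skipped before any File object is built for them.
        File[] jsonFiles = accioMDLDirectory.listFiles((dir, name) -> name.endsWith(".json"));
        if (jsonFiles == null) {
            // listFiles returns null if the path is not a directory or an I/O error occurs.
            return List.of();
        }
        return Arrays.stream(jsonFiles)
                .map(File::getName)
                .sorted()
                .collect(Collectors.toList());
    }
}
```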

@goldmedal goldmedal force-pushed the feature/duckdb-as-pg-engine-basic branch from 5810eaf to 8835b0e Compare January 24, 2024 09:46

@grieve54706 grieve54706 left a comment


I appreciate your contribution. There are some points I don't understand. If you could explain them when you have time, I would be grateful.

accio-base/src/main/java/io/accio/base/AccioMDL.java (outdated, resolved)

import java.util.List;

public interface PgMetastore

This interface will be implemented by many data sources other than PG. Should it drop the Pg prefix to avoid confusion?

Reply from the PR author:

The Pg prefix refers to the PG Wire Protocol: this component is responsible for executing the metadata queries. Metastore means where we store the metadata.


public class DuckDBMetadata
-        implements Metadata
+        implements Metadata, PgMetastore

Could you help me understand the difference between Metadata and PgMetastore? I see they have the same methods.

Reply from the PR author:

Metadata is an interface for accessing a data source. PgMetastore is also used to access a data source, but one dedicated to executing PG metadata queries. In our design, level-1 and level-2 queries are executed by PgMetastore, and level-3 queries are executed by the normal Metadata.
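
The split described in this reply could be sketched roughly as follows. SqlRunner and LevelDispatcher are hypothetical stand-ins: the real Metadata and PgMetastore interfaces have their own methods, which are not reproduced here, and the integer level is only an illustrative way to name the three levels.

```java
/** Hypothetical single-method stand-in for both Metadata and PgMetastore. */
interface SqlRunner {
    String run(String sql);
}

/** Hypothetical dispatcher showing which backend handles which level. */
final class LevelDispatcher {
    private final SqlRunner pgMetastore;        // e.g. DuckDB, for PG metadata queries
    private final SqlRunner dataSourceMetadata; // the actual data source

    LevelDispatcher(SqlRunner pgMetastore, SqlRunner dataSourceMetadata) {
        this.pgMetastore = pgMetastore;
        this.dataSourceMetadata = dataSourceMetadata;
    }

    String execute(int level, String sql) {
        switch (level) {
            case 1: // fully supported by the metastore
            case 2: // supported after a rewrite (the rewrite itself is omitted here)
                return pgMetastore.run(sql);
            case 3: // real data
                return dataSourceMetadata.run(sql);
            default:
                throw new IllegalArgumentException("Unknown query level: " + level);
        }
    }
}
```

Under this reading, the two interfaces can share method shapes yet point at different backends, which is why one class such as DuckDBMetadata can implement both.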

pom.xml (resolved)
@goldmedal goldmedal force-pushed the feature/duckdb-as-pg-engine-basic branch 3 times, most recently from 2b440fd to f62127a Compare January 29, 2024 06:51
@goldmedal goldmedal marked this pull request as ready for review January 30, 2024 02:00
@goldmedal goldmedal requested a review from brandboat January 30, 2024 02:00
@goldmedal goldmedal force-pushed the feature/duckdb-as-pg-engine-basic branch from 45f47fe to 286531f Compare January 30, 2024 02:15
@goldmedal goldmedal force-pushed the feature/duckdb-as-pg-engine-basic branch from 17c925e to e6f1384 Compare January 31, 2024 08:12
@goldmedal goldmedal changed the base branch from feature/duckdb-as-pg-engine to main January 31, 2024 08:25
@goldmedal goldmedal changed the base branch from main to feature/duckdb-as-pg-engine January 31, 2024 08:25

@grieve54706 grieve54706 left a comment


LGTM

@brandboat brandboat merged commit 2ae5e07 into feature/duckdb-as-pg-engine Feb 1, 2024
4 checks passed