
feat(ibis): introduce Local file connector #1029

Merged — 13 commits, Jan 7, 2025

Conversation

@goldmedal (Contributor) commented Jan 7, 2025

Description

  • This PR adds support for using a local file dataset as a data source.
  • Supported file formats: csv, parquet, and json

Connection Info

{
    "url": "tests/resource/tpch/data",
    "format": "parquet"
}

The URL should point to a directory. Each file or directory under this path is treated as a table.
For example:

/path-to-url/
  orders.parquet
  customer/
    customer1.parquet
    customer2.parquet

If we point the URL at path-to-url, we get two tables: orders and customer.
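The directory-to-table mapping described above can be sketched in plain Python. This is an illustration only — the function name and return shape are not the connector's actual API:

```python
from pathlib import Path

def discover_tables(url: str, fmt: str) -> dict[str, list[str]]:
    """Map each entry under `url` to a table name (illustrative sketch).

    A top-level file such as `orders.parquet` becomes table `orders`;
    a sub-directory such as `customer/` becomes table `customer`,
    backed by every matching file inside it.
    """
    tables: dict[str, list[str]] = {}
    for entry in sorted(Path(url).iterdir()):
        if entry.is_file() and entry.suffix == f".{fmt}":
            tables[entry.stem] = [str(entry)]
        elif entry.is_dir():
            files = sorted(str(p) for p in entry.glob(f"*.{fmt}"))
            if files:
                tables[entry.name] = files
    return tables
```

Running this against the layout above with `fmt="parquet"` yields the two tables `orders` and `customer`, with `customer` backed by both files in its sub-directory.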

API URL

  • /v2/connector/local_file
  • /v3/connector/local_file
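A request to either endpoint would carry the connection info plus a manifest and SQL. The sketch below only shows the shape of such a payload; the manifest placeholder is not a real encoded manifest, and the exact field set is assumed from the examples in this PR:

```python
import json

# Hypothetical request body for POST /v2/connector/local_file/query.
# Field names mirror the connection info and model definition shown above.
payload = {
    "connectionInfo": {
        "url": "tests/resource/tpch/data",
        "format": "parquet",
    },
    "manifestStr": "<base64-encoded MDL manifest>",  # placeholder, not a real manifest
    "sql": 'SELECT orderkey, totalprice FROM "Orders" LIMIT 10',
}

body = json.dumps(payload)
```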

Sample Model Definition

{
    "name": "Orders",
    "tableReference": {
        "table": "tests/resource/tpch/data/orders.parquet"
    },
    "columns": [
        {"name": "orderkey", "expression": "o_orderkey", "type": "integer"},
        {"name": "custkey", "expression": "o_custkey", "type": "integer"},
        {"name": "orderstatus", "expression": "o_orderstatus", "type": "varchar"},
        {"name": "totalprice", "expression": "o_totalprice", "type": "float"},
        {"name": "orderdate", "expression": "o_orderdate", "type": "date"}
    ],
    "primaryKey": "orderkey"
}

Testing data

The testing data is generated by DuckDB with the following SQL:

  SELECT
      cast(100 as bigint) as c_bigint,
      cast(1 as bit) as c_bit,
      cast('hello' as blob) as c_blob,
      cast(true as boolean) as c_boolean,
      cast('2020-01-01' as date) as c_date,
      cast(1.1 as double) as c_double,
      cast(1.1 as float) as c_float,
      cast(1 as integer) as c_integer,
      cast(1 as hugeint) as c_hugeint,
      cast('1 day' as interval) as c_interval,
      cast('{"a": 1}' as json) as c_json,
      cast(1 as smallint) as c_smallint,
      cast('12:00:00' as time) as c_time,
      cast('2020-01-01 12:00:00' as timestamp) as c_timestamp,
      cast('2020-01-01 12:00:00' as timestamp with time zone) as c_timestamptz,
      cast(1 as tinyint) as c_tinyint,
      cast(1 as ubigint) as c_ubigint,
      cast(1 as uhugeint) as c_uhugeint,
      cast(1 as uinteger) as c_uinteger,
      cast(1 as usmallint) as c_usmallint,
      cast(1 as utinyint) as c_utinyint,
      cast('123e4567-e89b-12d3-a456-426614174000' as uuid) as c_uuid,
      cast('hello' as varchar) as c_varchar;

Then, the result is converted to each supported data format.
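For reference, the conversion can be done with DuckDB COPY statements along these lines (the table and file names are illustrative):

```sql
-- Materialize the SELECT above, then export it in each format.
CREATE TABLE type_test AS SELECT ...;  -- the type-test SELECT shown above
COPY type_test TO 'type-test.parquet' (FORMAT parquet);
COPY type_test TO 'type-test.csv' (FORMAT csv, HEADER);
COPY type_test TO 'type-test.json' (FORMAT json);
```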

Summary by CodeRabbit

  • New Features

    • Added support for local file data sources.
    • Introduced DuckDB connector for querying local files.
    • Implemented metadata handling for local file connections.
    • Added a new field for path in table properties.
  • Dependencies

    • Added DuckDB and OpenDAL libraries.
  • Improvements

    • Enhanced error handling for unimplemented features.
    • Updated SQL transformation logic to use source aliases.
  • Testing

    • Added comprehensive test suite for local file connector.
    • Introduced new test markers and fixtures for local file testing.
    • Expanded metadata tests to include new path attribute.


coderabbitai bot commented Jan 7, 2025

Walkthrough

The pull request introduces comprehensive support for local file data sources in the ibis-server project. This enhancement includes adding a new local_file data source type across multiple components of the system, such as metadata handling, connectors, and data models. The changes enable querying and metadata retrieval for local files using DuckDB as the underlying engine, with support for various file formats like CSV, Parquet, and JSON. The implementation spans multiple files and introduces new classes, methods, and test cases to fully integrate local file functionality.

Changes

File Change Summary
ibis-server/app/mdl/rewriter.py Added handling for DataSource.local_file in _get_write_dialect method.
ibis-server/app/model/__init__.py Introduced QueryLocalFileDTO and LocalFileConnectionInfo classes.
ibis-server/app/model/connector.py Added DuckDBConnector for local file querying.
ibis-server/app/model/data_source.py Added local_file to DataSource and DataSourceExtension enums.
ibis-server/app/model/metadata/factory.py Updated metadata factory to support local file metadata.
ibis-server/app/model/metadata/object_storage.py Added ObjectStorageMetadata and LocalFileMetadata classes.
ibis-server/pyproject.toml Added duckdb and opendal dependencies.
ibis-server/tests/routers/v2/connector/test_local_file.py Introduced tests for local file connector functionality.
ibis-server/tests/routers/v3/connector/local_file/conftest.py Added test configuration for local file tests.
ibis-server/tests/routers/v3/connector/local_file/test_query.py Added tests for querying local files.
wren-core/core/src/logical_plan/analyze/model_generation.rs Streamlined expression handling in model generation.
wren-core/core/src/mdl/mod.rs Updated SQL transformation to include source aliasing.
ibis-server/app/main.py Added specific exception handler for NotImplementedError.
wren-core-py/tests/test_modeling_core.py Updated tests to reflect new SQL aliasing logic.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Router
    participant Connector
    participant DuckDB
    participant Metadata

    Client->>Router: Send local file query
    Router->>Connector: Initialize DuckDBConnector
    Connector->>DuckDB: Establish connection
    Router->>Metadata: Retrieve table metadata
    Metadata->>DuckDB: Read file schema
    Router->>Connector: Execute query
    Connector->>DuckDB: Run SQL
    DuckDB-->>Connector: Return results
    Connector-->>Router: Provide query results
    Router-->>Client: Send response

Poem

🐰 Local files dance with glee,
DuckDB queries set data free!
CSV, Parquet, JSON delight,
Metadata shines so bright and light.
A rabbit's code, a file's embrace! 📂✨



@github-actions bot added the core, ibis, dependencies, rust, and python labels on Jan 7, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (5)
ibis-server/app/model/metadata/object_storage.py (2)

83-85: Offer assistance for supporting STRUCT data types

The TODO comment indicates that support for STRUCT data types is needed. Implementing this will enhance data type handling.

Would you like me to provide a solution to add support for STRUCT data types or open a new GitHub issue to track this task?


87-89: Offer assistance for supporting ARRAY data types

The TODO comment indicates that support for ARRAY data types is needed. Including this feature will improve the robustness of your data type mappings.

Would you like me to help implement support for ARRAY data types or create a GitHub issue to address this?

ibis-server/app/model/__init__.py (1)

140-144: Enhance field validation and documentation.

Consider the following improvements:

  1. Use an Enum for the format field to restrict values to supported formats
  2. Add description and examples for the url field to clarify expected values
 class LocalFileConnectionInfo(BaseModel):
-    url: SecretStr
+    url: SecretStr = Field(
+        description="Path to the local file or directory",
+        examples=["data/customers.csv", "data/orders/"]
+    )
-    format: str = Field(
+    format: FileFormat = Field(
         description="File format", default="csv", examples=["csv", "parquet", "json"]
     )

+class FileFormat(str, Enum):
+    CSV = "csv"
+    PARQUET = "parquet"
+    JSON = "json"
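A standard-library version of the suggested enum (independent of the pydantic model it would plug into) might look like this; `parse_format` is a hypothetical helper added here for illustration:

```python
from enum import Enum

class FileFormat(str, Enum):
    CSV = "csv"
    PARQUET = "parquet"
    JSON = "json"

def parse_format(value: str) -> FileFormat:
    """Normalize and validate a user-supplied format string (illustrative helper)."""
    try:
        return FileFormat(value.lower())
    except ValueError:
        supported = [f.value for f in FileFormat]
        raise ValueError(f"Unsupported format {value!r}; expected one of {supported}")
```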
ibis-server/tests/routers/v3/connector/local_file/test_query.py (1)

152-180: Add more comprehensive dry run tests.

The dry run tests should include:

  1. Syntax errors in SQL
  2. Invalid column references
  3. Type mismatches
ibis-server/pyproject.toml (1)

51-52: Consider deployment and testing implications of DuckDB.

DuckDB is a good choice for local file processing, but consider:

  1. Document memory requirements and limitations for production deployments
  2. Add integration tests with different file formats (CSV, Parquet, JSON)
  3. Consider adding examples in the README for common use cases
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8c93ec6 and df6a2e6.

⛔ Files ignored due to path filters (1)
  • ibis-server/poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (12)
  • ibis-server/app/mdl/rewriter.py (1 hunks)
  • ibis-server/app/model/__init__.py (3 hunks)
  • ibis-server/app/model/connector.py (2 hunks)
  • ibis-server/app/model/data_source.py (3 hunks)
  • ibis-server/app/model/metadata/factory.py (2 hunks)
  • ibis-server/app/model/metadata/object_storage.py (1 hunks)
  • ibis-server/pyproject.toml (2 hunks)
  • ibis-server/tests/routers/v2/connector/test_local_file.py (1 hunks)
  • ibis-server/tests/routers/v3/connector/local_file/conftest.py (1 hunks)
  • ibis-server/tests/routers/v3/connector/local_file/test_query.py (1 hunks)
  • wren-core/core/src/logical_plan/analyze/model_generation.rs (5 hunks)
  • wren-core/core/src/mdl/mod.rs (19 hunks)
👮 Files not reviewed due to content moderation or server errors (4)
  • ibis-server/app/model/connector.py
  • ibis-server/tests/routers/v2/connector/test_local_file.py
  • wren-core/core/src/logical_plan/analyze/model_generation.rs
  • wren-core/core/src/mdl/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: cargo test (win64)
  • GitHub Check: cargo test (macos-aarch64)
  • GitHub Check: cargo check
  • GitHub Check: cargo test (macos)
  • GitHub Check: ci
  • GitHub Check: test
🔇 Additional comments (5)
ibis-server/app/model/metadata/factory.py (1)

8-8: Integration of LocalFileMetadata into MetadataFactory

Adding LocalFileMetadata to the import statements and the mapping dictionary correctly integrates the new local file data source into the metadata factory. This allows for proper handling of local files throughout the application.

Also applies to: 22-22

ibis-server/app/mdl/rewriter.py (1)

75-76: Verify DuckDB connector implementation.

The dialect is correctly set to "duckdb" for local files. However, we should verify that the DuckDB connector is properly implemented to handle this dialect.

✅ Verification successful

DuckDB connector implementation verified for local files

The codebase shows proper implementation of DuckDB connector for local files:

  • Local file data source is correctly mapped to DuckDBConnector
  • LocalFileMetadata class handles metadata operations
  • Complete implementation chain from data source recognition to connector mapping is in place
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify DuckDB connector implementation
# Test: Search for DuckDB connector class and its implementation
ast-grep --pattern 'class $CONNECTOR {
  $$$
  def $METHOD($ARGS) {
    $$$
    duckdb$$$
    $$$
  }
  $$$
}'

Length of output: 112


Script:

#!/bin/bash
# Search for DuckDB-related code
echo "=== DuckDB References ==="
rg -i "duckdb" -A 3

echo -e "\n=== Local File Handling ==="
rg -i "local_file" -A 3

echo -e "\n=== Connector Classes ==="
ast-grep --pattern 'class $NAME($$$) {
  $$$
}'

Length of output: 70107


Script:

#!/bin/bash
# Search for local file handling in the codebase
echo "=== Local File DataSource References ==="
rg "DataSource\.local_file" -A 3

echo -e "\n=== File Operations ==="
rg "\.file|readFile|writeFile|FileUtil" -A 3

echo -e "\n=== DuckDB File Config ==="
rg "file.*config.*duckdb" -i -A 3

Length of output: 15989

ibis-server/tests/routers/v3/connector/local_file/test_query.py (1)

108-124: Verify limit parameter behavior.

The test asserts that the result has 1 row but doesn't verify that it's using the correct limit (URL parameter vs SQL LIMIT clause). Add assertions to verify which limit takes precedence.

ibis-server/pyproject.toml (2)

66-66: LGTM! Well-organized test categorization.

The new marker local_file follows the existing pattern and is appropriately placed in alphabetical order.


51-52: Verify DuckDB version and consider security implications.

The fixed version of DuckDB (1.1.3) might be outdated. Let's verify the latest stable version and any security advisories.

Additionally, consider using a version range (e.g., >=1.1.3,<2.0.0) to automatically get bug fixes while avoiding breaking changes.

✅ Verification successful

DuckDB version 1.1.3 is appropriate and secure

The specified version is the latest available on PyPI and has no known security vulnerabilities. Using a fixed version is good for reproducibility in this case.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for latest versions and security advisories for DuckDB

# Check PyPI for latest versions
curl -s https://pypi.org/pypi/duckdb/json | jq '.info.version'

# Check for security advisories
gh api graphql -f query='
{
  securityVulnerabilities(first: 5, ecosystem: PIP, package: "duckdb") {
    nodes {
      advisory {
        summary
        severity
        publishedAt
      }
      vulnerableVersionRange
      firstPatchedVersion {
        identifier
      }
    }
  }
}'

Length of output: 410

Resolved review threads:
  • ibis-server/app/model/metadata/object_storage.py (4, outdated)
  • ibis-server/app/model/data_source.py (1)

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (5)
wren-core-py/tests/test_modeling_core.py (2)

101-101: Consider simplifying the nested subqueries.

While the SQL transformation is correct, there might be room for optimization by reducing the number of nested subqueries while maintaining the same semantics.

Consider simplifying to:

SELECT add_two(customer.c_custkey) FROM (SELECT __source.c_custkey AS c_custkey FROM main.customer AS __source) AS customer

116-116: Consider consolidating duplicate SQL transformation tests.

This assertion duplicates the test from test_session_context. Consider extracting common test cases into a shared parametrized test to improve maintainability.

Example refactor:

@pytest.mark.parametrize(
    "sql,expected",
    [
        (
            "SELECT * FROM my_catalog.my_schema.customer",
            "SELECT customer.c_custkey, customer.c_name FROM (SELECT __source.c_custkey AS c_custkey, __source.c_name AS c_name FROM main.customer AS __source) AS customer"
        ),
        (
            "SELECT add_two(c_custkey) FROM my_catalog.my_schema.customer",
            "SELECT add_two(customer.c_custkey) FROM (SELECT __source.c_custkey AS c_custkey FROM main.customer AS __source) AS customer"
        )
    ]
)
def test_sql_transformations(sql, expected):
    session_context = SessionContext(manifest_str, "tests/functions.csv")
    assert session_context.transform_sql(sql) == expected
ibis-server/app/model/metadata/object_storage.py (1)

123-125: Consider connection pooling for better performance.

The current implementation creates a new connection for each operation. Consider implementing connection pooling to improve performance.

import duckdb
from contextlib import contextmanager

def _get_connection_pool(self):
    if not hasattr(self, '_pool'):
        self._pool = []  # lazily-created list of reusable connections
    return self._pool

@contextmanager
def _get_connection(self):
    pool = self._get_connection_pool()
    conn = pool.pop() if pool else duckdb.connect()
    try:
        yield conn
    finally:
        pool.append(conn)
ibis-server/tests/routers/v2/connector/test_local_file.py (2)

91-120: Add negative test cases for data validation.

The test suite should include negative test cases for:

  • Invalid file paths
  • Malformed data
  • Concurrent access scenarios
  • Large file handling

Here's an example of additional test cases:

async def test_invalid_file_path(client, manifest_str):
    response = await client.post(
        f"{base_url}/query",
        json={
            "manifestStr": manifest_str,
            "sql": 'SELECT * FROM "Orders" LIMIT 1',
            "connectionInfo": {
                "url": "nonexistent/path",
                "format": "parquet"
            },
        },
    )
    assert response.status_code == 404

async def test_concurrent_access(client, manifest_str, connection_info):
    import asyncio
    tasks = []
    for _ in range(10):
        tasks.append(
            client.post(
                f"{base_url}/query",
                json={
                    "manifestStr": manifest_str,
                    "sql": 'SELECT * FROM "Orders" LIMIT 1',
                    "connectionInfo": connection_info,
                },
            )
        )
    responses = await asyncio.gather(*tasks)
    assert all(r.status_code == 200 for r in responses)

235-246: Add more format-specific test cases.

The unsupported format test is good, but consider adding tests for:

  • CSV with different delimiters
  • JSON with nested structures
  • Parquet with compression

Example test case:

async def test_csv_with_custom_delimiter(client):
    response = await client.post(
        url=f"{base_url}/metadata/tables",
        json={
            "connectionInfo": {
                "url": "tests/resource/tpch/data",
                "format": "csv",
                "options": {"delimiter": "|"}
            },
        },
    )
    assert response.status_code == 200
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between df6a2e6 and 65559b2.

📒 Files selected for processing (6)
  • ibis-server/app/main.py (1 hunks)
  • ibis-server/app/model/data_source.py (4 hunks)
  • ibis-server/app/model/metadata/object_storage.py (1 hunks)
  • ibis-server/tests/routers/v2/connector/test_local_file.py (1 hunks)
  • ibis-server/tests/routers/v3/connector/local_file/conftest.py (1 hunks)
  • wren-core-py/tests/test_modeling_core.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • ibis-server/app/model/data_source.py
  • ibis-server/tests/routers/v3/connector/local_file/conftest.py
🧰 Additional context used
📓 Learnings (1)
ibis-server/app/model/metadata/object_storage.py (1)
Learnt from: goldmedal
PR: Canner/wren-engine#1029
File: ibis-server/app/model/metadata/object_storage.py:44-44
Timestamp: 2025-01-07T03:56:21.741Z
Learning: When working with DuckDB in Python, use `conn.execute("DESCRIBE SELECT * FROM table").fetchall()` to get column types instead of accessing DataFrame-style attributes like `dtype` or `dtypes`.
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: clippy
  • GitHub Check: cargo test (amd64)
  • GitHub Check: check Cargo.toml formatting
  • GitHub Check: cargo test (win64)
  • GitHub Check: cargo test (macos-aarch64)
  • GitHub Check: cargo test (macos)
  • GitHub Check: ci
  • GitHub Check: test
🔇 Additional comments (5)
wren-core-py/tests/test_modeling_core.py (1)

93-93: LGTM! Verify consistent usage of __source alias.

The introduction of the __source alias in the SQL transformation provides better clarity about the data source. This change aligns well with supporting multiple data sources, including local files.

Let's verify that this aliasing pattern is consistently applied across the codebase:

✅ Verification successful

__source alias pattern is consistently implemented

The __source alias is systematically applied across the codebase, following a consistent three-layer transformation pattern in both test assertions and core implementation. The change is well-integrated and maintains consistency across different SQL operations, data types, and character sets.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for SQL transformations to ensure consistent use of __source alias
rg -l "AS __source" --type rust --type python

Length of output: 78


Script:

#!/bin/bash
# Find all Python and Rust files and search for __source alias pattern
fd -e py -e rs | xargs rg "AS __source" -A 2 -B 2

Length of output: 12454

ibis-server/app/main.py (1)

81-83: LGTM! Appropriate error handling for unimplemented features.

The implementation correctly uses HTTP status code 501 for "Not Implemented" responses, following FastAPI's exception handling patterns.

ibis-server/app/model/metadata/object_storage.py (3)

46-46: ⚠️ Potential issue

Fix data type retrieval using DuckDB's DESCRIBE statement.

Based on the retrieved learning from @goldmedal, we should use DuckDB's DESCRIBE statement instead of accessing DataFrame attributes.

Apply this diff to fix the data type retrieval:

-                    duckdb_type = df[col].dtypes[0]
+                    query_result = conn.execute(f"DESCRIBE SELECT {col} FROM '{file.path}'")
+                    column_types = dict(query_result.fetchall())
+                    duckdb_type = column_types[col]

Likely invalid or redundant comment.


63-79: 🛠️ Refactor suggestion

Use DuckDB table functions instead of connection methods.

Replace the direct connection methods with DuckDB's table functions for better reliability.

Apply this diff:

     def _read_df(self, conn, path):
+        full_path = f"{self.connection_info.url.get_secret_value()}/{path}"
         if self.connection_info.format == "parquet":
-            return conn.read_parquet(
-                f"{self.connection_info.url.get_secret_value()}/{path}"
-            )
+            return conn.execute(f"SELECT * FROM read_parquet('{full_path}')").df()
         elif self.connection_info.format == "csv":
-            return conn.read_csv(
-                f"{self.connection_info.url.get_secret_value()}/{path}"
-            )
+            return conn.execute(f"SELECT * FROM read_csv('{full_path}')").df()
         elif self.connection_info.format == "json":
-            return conn.read_json(
-                f"{self.connection_info.url.get_secret_value()}/{path}"
-            )
+            return conn.execute(f"SELECT * FROM read_json('{full_path}')").df()
         else:
             raise NotImplementedError(
                 f"Unsupported format: {self.connection_info.format}"
             )

Likely invalid or redundant comment.


85-92: Consider implementing STRUCT and array type support.

The TODOs indicate missing support for STRUCT and array types. This could limit functionality for complex data types.

Let's check if there are any files using these types:

Would you like me to help implement support for these types or create GitHub issues to track them?

@goldmedal goldmedal marked this pull request as draft January 7, 2025 08:04
@goldmedal goldmedal marked this pull request as ready for review January 7, 2025 10:04

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (10)
ibis-server/app/model/metadata/dto.py (1)

69-71: LGTM! Consider enhancing the field documentation.

The path field is well-implemented with proper typing and default value. The comment clearly indicates its purpose for file-based tables.

Consider expanding the documentation to include supported file formats:

-    )  # the full path of the table for file-based table
+    )  # the full path of the table for file-based table (supports csv, parquet, json formats)
ibis-server/tests/routers/v2/connector/test_trino.py (1)

394-394: Consider reducing test duplication across connector tests.

The pattern of testing path=None is repeated across all connector tests. Consider extracting this into a shared test utility to improve maintainability.

Example approach:

# tests/utils/connector_test_utils.py
def assert_table_properties(table_props, expected_catalog, expected_schema, expected_table, expected_path=None):
    assert table_props == {
        "catalog": expected_catalog,
        "schema": expected_schema,
        "table": expected_table,
        "path": expected_path,
    }
ibis-server/tests/routers/v2/connector/test_mssql.py (1)

383-383: LGTM! Consider adding documentation.

The addition of the path property to table metadata is consistent with the new local file connector feature. Setting it to None for MSSQL tables is appropriate as they don't have filesystem paths.

Consider adding a comment in the test to explain why the path property exists and its significance in the context of different data sources (database tables vs. local files).

ibis-server/tests/routers/v2/connector/test_postgres.py (1)

569-569: LGTM! Consider extracting common test assertions.

The addition of the path property maintains consistency with other database connectors. Given that this metadata structure is now common across all connectors, consider extracting these assertions into a shared test utility.

Consider creating a shared test utility function to verify the common metadata structure:

def assert_table_metadata(result, expected_name, expected_catalog, expected_schema, expected_table):
    """Assert common table metadata structure across different connectors."""
    assert result["name"] == expected_name
    assert result["primaryKey"] is not None
    assert result["properties"] == {
        "catalog": expected_catalog,
        "schema": expected_schema,
        "table": expected_table,
        "path": None,
    }

This would reduce duplication and make it easier to maintain consistent assertions across all connector tests.

ibis-server/tests/routers/v2/connector/test_local_file.py (6)

11-11: Fix typo in "catalog" field.

The word "catalog" is misspelled as "calalog".

-    "catalog": "my_calalog",
+    "catalog": "my_catalog",

16-18: Consider using path fixtures for test file paths.

Using hardcoded relative paths in tests can be fragile if tests are run from different directories. Consider creating a fixture that provides the absolute path to test resources.

Example implementation:

@pytest.fixture(scope="module")
def test_resources_path():
    return os.path.join(os.path.dirname(__file__), "..", "..", "..", "resource", "tpch", "data")

Also applies to: 44-45


104-111: Consider using test data fixtures instead of hardcoded values.

Hardcoded test data makes tests brittle and harder to maintain. Consider creating fixtures or constants for test data.

Example implementation:

EXPECTED_ORDER_DATA = {
    "orderkey": 1,
    "custkey": 370,
    "orderstatus": "O",
    "totalprice": "172799.49",
    "orderdate": "1996-01-02 00:00:00.000000",
    "order_cust_key": "1_370"
}

def test_query(...):
    assert result["data"][0] == list(EXPECTED_ORDER_DATA.values())

181-182: Enhance error message validation in dry run test.

The test only verifies that the response text is not None. Consider asserting the specific error message to ensure proper error handling.

     assert response.status_code == 422
-    assert response.text is not None
+    assert "Table 'NotFound' not found" in response.text

194-194: Replace filter() with a more robust table lookup.

Using filter() could fail silently if the table is not found. Consider using a more explicit approach.

-    result = next(filter(lambda x: x["name"] == "orders", response.json()))
+    tables = response.json()
+    orders_table = next((x for x in tables if x["name"] == "orders"), None)
+    assert orders_table is not None, "Orders table not found in response"
+    result = orders_table
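
Generalizing the suggestion above, a small helper keeps the failure message informative across all lookups (the name `find_table` is illustrative, not from the PR):

```python
def find_table(tables: list[dict], name: str) -> dict:
    """Return the metadata entry for `name`, failing with a readable message."""
    match = next((t for t in tables if t["name"] == name), None)
    assert match is not None, (
        f"Table {name!r} not found in response; "
        f"available: {[t['name'] for t in tables]}"
    )
    return match
```

Usage in the test then reduces to `orders_table = find_table(response.json(), "orders")`.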

250-448: Reduce duplication in format type mapping tests.

The type mapping assertions are repeated across format tests. Consider extracting common type mapping logic into helper functions or fixtures.

Example implementation:

import pytest

@pytest.fixture
def type_mapping_assertions():
    def _assert_common_types(columns):
        assert columns[0]["name"] == "c_bigint"
        assert columns[0]["type"] == "INT64"
        # ... more common assertions
    return _assert_common_types

def test_list_parquet_files(client, type_mapping_assertions):
    # ... test setup ...
    type_mapping_assertions(columns)
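
The fixture above asserts positionally; comparing a full name-to-type mapping instead reports every mismatch at once. A minimal sketch, where only the c_bigint/INT64 pair comes from the example and the remaining entries are placeholders to fill from the real schema:

```python
EXPECTED_COMMON_TYPES = {
    "c_bigint": "INT64",
    # ... remaining expected name -> type pairs go here
}

def assert_common_types(columns: list[dict]) -> None:
    actual = {c["name"]: c["type"] for c in columns}
    mismatches = {
        name: (expected, actual.get(name))
        for name, expected in EXPECTED_COMMON_TYPES.items()
        if actual.get(name) != expected
    }
    assert not mismatches, f"type mismatches (expected, actual): {mismatches}"
```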
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 65559b2 and 12d1f1c.

⛔ Files ignored due to path filters (7)
  • ibis-server/poetry.lock is excluded by !**/*.lock
  • ibis-server/tests/resource/test_file_source/type-test-csv/t1.csv is excluded by !**/*.csv
  • ibis-server/tests/resource/test_file_source/type-test-csv/t2.csv is excluded by !**/*.csv
  • ibis-server/tests/resource/test_file_source/type-test-parquet/t1.parquet is excluded by !**/*.parquet
  • ibis-server/tests/resource/test_file_source/type-test-parquet/t2.parquet is excluded by !**/*.parquet
  • ibis-server/tests/resource/test_file_source/type-test.csv is excluded by !**/*.csv
  • ibis-server/tests/resource/test_file_source/type-test.parquet is excluded by !**/*.parquet
📒 Files selected for processing (14)
  • ibis-server/app/model/metadata/dto.py (1 hunks)
  • ibis-server/app/model/metadata/object_storage.py (1 hunks)
  • ibis-server/pyproject.toml (2 hunks)
  • ibis-server/tests/resource/test_file_source/invalid (1 hunks)
  • ibis-server/tests/resource/test_file_source/type-test-json/t1.json (1 hunks)
  • ibis-server/tests/resource/test_file_source/type-test-json/t2.json (1 hunks)
  • ibis-server/tests/resource/test_file_source/type-test.json (1 hunks)
  • ibis-server/tests/routers/v2/connector/test_clickhouse.py (1 hunks)
  • ibis-server/tests/routers/v2/connector/test_local_file.py (1 hunks)
  • ibis-server/tests/routers/v2/connector/test_mssql.py (1 hunks)
  • ibis-server/tests/routers/v2/connector/test_mysql.py (1 hunks)
  • ibis-server/tests/routers/v2/connector/test_postgres.py (1 hunks)
  • ibis-server/tests/routers/v2/connector/test_snowflake.py (1 hunks)
  • ibis-server/tests/routers/v2/connector/test_trino.py (1 hunks)
✅ Files skipped from review due to trivial changes (4)
  • ibis-server/tests/resource/test_file_source/invalid
  • ibis-server/tests/resource/test_file_source/type-test-json/t1.json
  • ibis-server/tests/resource/test_file_source/type-test.json
  • ibis-server/tests/resource/test_file_source/type-test-json/t2.json
🚧 Files skipped from review as they are similar to previous changes (2)
  • ibis-server/pyproject.toml
  • ibis-server/app/model/metadata/object_storage.py
🔇 Additional comments (5)
ibis-server/tests/routers/v2/connector/test_snowflake.py (1)

290-290: LGTM! Path field correctly included in metadata test.

The test properly verifies that non-file-based tables return None for the path field, maintaining consistency with the DTO changes.

ibis-server/tests/routers/v2/connector/test_mysql.py (1)

365-365: LGTM! Path field correctly included in metadata test.

The test properly verifies that non-file-based tables return None for the path field, maintaining consistency with other connectors.

ibis-server/tests/routers/v2/connector/test_trino.py (1)

394-394: LGTM! Path field correctly included in metadata test.

The test properly verifies that non-file-based tables return None for the path field, maintaining consistency with other connectors.

ibis-server/tests/routers/v2/connector/test_clickhouse.py (1)

530-530: LGTM! Consistent with other connectors.

The addition of the path property maintains consistency with other database connectors and aligns with the local file connector feature.

ibis-server/tests/routers/v2/connector/test_local_file.py (1)

332-333: Handle invalid CSV files more explicitly.

The comment suggests that invalid files are silently accepted as one-column CSV files. Consider adding explicit validation for CSV file format.
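
One way to make that validation explicit is the standard library's `csv.Sniffer`; a sketch, where the function name and the single-column heuristic are assumptions rather than anything in the PR (a legitimate one-column CSV would also be rejected here):

```python
import csv

def validate_csv(path: str, sample_size: int = 4096) -> None:
    """Raise ValueError when a file does not look like well-formed CSV."""
    with open(path, newline="") as f:
        sample = f.read(sample_size)
        try:
            dialect = csv.Sniffer().sniff(sample)
        except csv.Error as exc:
            raise ValueError(f"{path} is not a valid CSV file: {exc}") from exc
        f.seek(0)
        header = next(csv.reader(f, dialect), None)
        if not header or len(header) < 2:
            # A single-column result usually means a wrong delimiter or a
            # non-CSV file, which is exactly the silent-acceptance case.
            raise ValueError(f"{path} parsed as a single column; refusing to load")
```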

@goldmedal goldmedal merged commit 54fcec7 into Canner:main Jan 7, 2025
16 checks passed
@goldmedal goldmedal deleted the feature/file-connecotr-duckdb branch January 7, 2025 11:55
Labels
core · dependencies · ibis · python · rust