feat(ibis): introduce Local file connector #1029
Conversation
Walkthrough
The pull request introduces comprehensive support for local file data sources in the ibis-server project. This enhancement includes adding a new local file connector backed by DuckDB.
Changes
Sequence Diagram
```mermaid
sequenceDiagram
    participant Client
    participant Router
    participant Connector
    participant DuckDB
    participant Metadata
    Client->>Router: Send local file query
    Router->>Connector: Initialize DuckDBConnector
    Connector->>DuckDB: Establish connection
    Router->>Metadata: Retrieve table metadata
    Metadata->>DuckDB: Read file schema
    Router->>Connector: Execute query
    Connector->>DuckDB: Run SQL
    DuckDB-->>Connector: Return results
    Connector-->>Router: Provide query results
    Router-->>Client: Send response
```
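As a rough orientation for readers, the flow above can be reproduced with the duckdb Python package directly; the class name and paths below are illustrative placeholders, not the connector actually added in this PR:

```python
import duckdb
import pandas as pd


class LocalFileConnector:
    """Illustrative DuckDB-backed connector for files under a directory."""

    def __init__(self, base_path: str, file_format: str = "csv"):
        self.base_path = base_path
        self.file_format = file_format
        # An in-memory DuckDB connection is enough for reading local files.
        self.conn = duckdb.connect()

    def query(self, sql: str) -> pd.DataFrame:
        # DuckDB reads csv/parquet/json files referenced directly in SQL,
        # e.g. SELECT * FROM 'path/to/dir/*.csv'.
        return self.conn.execute(sql).df()


if __name__ == "__main__":
    connector = LocalFileConnector("tests/resource/tpch/data", "csv")
    print(connector.query("SELECT 1 AS ok"))
```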
Actionable comments posted: 7
🧹 Nitpick comments (5)
ibis-server/app/model/metadata/object_storage.py (2)
83-85: Offer assistance for supporting STRUCT data types
The TODO comment indicates that support for STRUCT data types is needed. Implementing this will enhance data type handling.
Would you like me to provide a solution to add support for STRUCT data types or open a new GitHub issue to track this task?
87-89: Offer assistance for supporting ARRAY data types
The TODO comment indicates that support for ARRAY data types is needed. Including this feature will improve the robustness of your data type mappings.
Would you like me to help implement support for ARRAY data types or create a GitHub issue to address this?
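As a rough sketch only (not the PR's actual mapping code), nested DuckDB types could be handled by inspecting the type string returned by DESCRIBE before falling back to the scalar mapping; the target type names used here (INT64, STRING, JSON, ARRAY) are assumptions:

```python
def map_duckdb_type(duckdb_type: str) -> str:
    """Map a DuckDB DESCRIBE type string to an engine column type (illustrative)."""
    t = duckdb_type.upper()
    if t.startswith("STRUCT"):
        # e.g. STRUCT(a INTEGER, b VARCHAR)
        return "JSON"
    if t.endswith("[]") or t.startswith("LIST"):
        # e.g. INTEGER[] -- DuckDB renders list/array types with a [] suffix
        return "ARRAY"
    scalar = {"BIGINT": "INT64", "VARCHAR": "STRING", "DOUBLE": "DOUBLE"}
    return scalar.get(t, "UNKNOWN")
```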
ibis-server/app/model/__init__.py (1)
140-144: Enhance field validation and documentation.
Consider the following improvements:
- Use an Enum for the format field to restrict values to supported formats
- Add description and examples for the url field to clarify expected values
```diff
 class LocalFileConnectionInfo(BaseModel):
-    url: SecretStr
+    url: SecretStr = Field(
+        description="Path to the local file or directory",
+        examples=["data/customers.csv", "data/orders/"]
+    )
-    format: str = Field(
+    format: FileFormat = Field(
         description="File format", default="csv", examples=["csv", "parquet", "json"]
     )
+
+class FileFormat(str, Enum):
+    CSV = "csv"
+    PARQUET = "parquet"
+    JSON = "json"
```
ibis-server/tests/routers/v3/connector/local_file/test_query.py (1)
152-180: Add more comprehensive dry run tests.
The dry run tests should include (a test sketch follows this list):
- Syntax errors in SQL
- Invalid column references
- Type mismatches
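A hedged sketch of such tests, modeled on the request shape used elsewhere in this PR's test suite; client, manifest_str and connection_info are assumed fixtures, and the dryRun parameter name is an assumption:

```python
import pytest

base_url = "/v3/connector/local_file"  # assumed to match the PR's routes


@pytest.mark.parametrize(
    "bad_sql",
    [
        "SELEC * FROM orders",             # syntax error
        "SELECT no_such_col FROM orders",  # invalid column reference
        "SELECT 'abc' + 1 FROM orders",    # type mismatch
    ],
)
async def test_dry_run_rejects_invalid_sql(client, manifest_str, connection_info, bad_sql):
    response = await client.post(
        f"{base_url}/query",
        params={"dryRun": True},  # assumed query-parameter name for dry runs
        json={
            "manifestStr": manifest_str,
            "sql": bad_sql,
            "connectionInfo": connection_info,
        },
    )
    # Planning should fail before any execution happens.
    assert response.status_code == 422
```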
ibis-server/pyproject.toml (1)
51-52: Consider deployment and testing implications of DuckDB.
DuckDB is a good choice for local file processing, but consider the points below (a memory-settings sketch follows this list):
- Document memory requirements and limitations for production deployments
- Add integration tests with different file formats (CSV, Parquet, JSON)
- Consider adding examples in the README for common use cases
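On the memory point, DuckDB exposes runtime settings that a deployment could pin; a minimal sketch (the 4GB and thread values are arbitrary examples, not recommendations from this PR):

```python
import duckdb

conn = duckdb.connect()
# Cap DuckDB's memory usage and parallelism so scanning large local files
# cannot exhaust the ibis-server process.
conn.execute("SET memory_limit = '4GB'")
conn.execute("SET threads = 4")
```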
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
- ibis-server/poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (12)
- ibis-server/app/mdl/rewriter.py (1 hunks)
- ibis-server/app/model/__init__.py (3 hunks)
- ibis-server/app/model/connector.py (2 hunks)
- ibis-server/app/model/data_source.py (3 hunks)
- ibis-server/app/model/metadata/factory.py (2 hunks)
- ibis-server/app/model/metadata/object_storage.py (1 hunks)
- ibis-server/pyproject.toml (2 hunks)
- ibis-server/tests/routers/v2/connector/test_local_file.py (1 hunks)
- ibis-server/tests/routers/v3/connector/local_file/conftest.py (1 hunks)
- ibis-server/tests/routers/v3/connector/local_file/test_query.py (1 hunks)
- wren-core/core/src/logical_plan/analyze/model_generation.rs (5 hunks)
- wren-core/core/src/mdl/mod.rs (19 hunks)
👮 Files not reviewed due to content moderation or server errors (4)
- ibis-server/app/model/connector.py
- ibis-server/tests/routers/v2/connector/test_local_file.py
- wren-core/core/src/logical_plan/analyze/model_generation.rs
- wren-core/core/src/mdl/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: cargo test (win64)
- GitHub Check: cargo test (macos-aarch64)
- GitHub Check: cargo check
- GitHub Check: cargo test (macos)
- GitHub Check: ci
- GitHub Check: test
🔇 Additional comments (5)
ibis-server/app/model/metadata/factory.py (1)
8-8: Integration of LocalFileMetadata into MetadataFactory
Adding LocalFileMetadata to the import statements and the mapping dictionary correctly integrates the new local file data source into the metadata factory. This allows for proper handling of local files throughout the application.
Also applies to: 22-22
ibis-server/app/mdl/rewriter.py (1)
75-76: Verify DuckDB connector implementation.
The dialect is correctly set to "duckdb" for local files. However, we should verify that the DuckDB connector is properly implemented to handle this dialect.
✅ Verification successful
DuckDB connector implementation verified for local files
The codebase shows proper implementation of DuckDB connector for local files:
- Local file data source is correctly mapped to DuckDBConnector
- LocalFileMetadata class handles metadata operations
- Complete implementation chain from data source recognition to connector mapping is in place
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Verify DuckDB connector implementation
# Test: Search for DuckDB connector class and its implementation
ast-grep --pattern 'class $CONNECTOR { $$$ def $METHOD($ARGS) { $$$ duckdb$$$ $$$ } $$$ }'
```
Length of output: 112
Script:
```bash
#!/bin/bash
# Search for DuckDB-related code
echo "=== DuckDB References ==="
rg -i "duckdb" -A 3
echo -e "\n=== Local File Handling ==="
rg -i "local_file" -A 3
echo -e "\n=== Connector Classes ==="
ast-grep --pattern 'class $NAME($$$) { $$$ }'
```
Length of output: 70107
Script:
```bash
#!/bin/bash
# Search for local file handling in the codebase
echo "=== Local File DataSource References ==="
rg "DataSource\.local_file" -A 3
echo -e "\n=== File Operations ==="
rg "\.file|readFile|writeFile|FileUtil" -A 3
echo -e "\n=== DuckDB File Config ==="
rg "file.*config.*duckdb" -i -A 3
```
Length of output: 15989
ibis-server/tests/routers/v3/connector/local_file/test_query.py (1)
108-124: Verify limit parameter behavior.
The test asserts that the result has 1 row but doesn't verify that it's using the correct limit (URL parameter vs SQL LIMIT clause). Add assertions to verify which limit takes precedence.
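A sketch of the suggested assertion, assuming the test module's base_url, client, manifest_str and connection_info fixtures and that the limit is passed as a URL query parameter; which limit actually wins must be confirmed against the implementation:

```python
async def test_limit_parameter_vs_sql_limit(client, manifest_str, connection_info):
    response = await client.post(
        f"{base_url}/query",
        params={"limit": 1},  # assumed URL parameter name
        json={
            "manifestStr": manifest_str,
            "sql": 'SELECT * FROM "Orders" LIMIT 10',
            "connectionInfo": connection_info,
        },
    )
    assert response.status_code == 200
    # Make the precedence explicit: here we assume the URL parameter is the cap.
    assert len(response.json()["data"]) == 1
```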
ibis-server/pyproject.toml (2)
66-66: LGTM! Well-organized test categorization.
The new marker local_file follows the existing pattern and is appropriately placed in alphabetical order.
51-52: Verify DuckDB version and consider security implications.
The fixed version of DuckDB (1.1.3) might be outdated. Let's verify the latest stable version and any security advisories.
Additionally, consider using a version range (e.g., >=1.1.3,<2.0.0) to automatically get bug fixes while avoiding breaking changes.
✅ Verification successful
DuckDB version 1.1.3 is appropriate and secure
The specified version is the latest available on PyPI and has no known security vulnerabilities. Using a fixed version is good for reproducibility in this case.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Check for latest versions and security advisories for DuckDB

# Check PyPI for latest versions
curl -s https://pypi.org/pypi/duckdb/json | jq '.info.version'

# Check for security advisories
gh api graphql -f query='
{
  securityVulnerabilities(first: 5, ecosystem: PIP, package: "duckdb") {
    nodes {
      advisory {
        summary
        severity
        publishedAt
      }
      vulnerableVersionRange
      firstPatchedVersion {
        identifier
      }
    }
  }
}'
```
Length of output: 410
Actionable comments posted: 0
🧹 Nitpick comments (5)
wren-core-py/tests/test_modeling_core.py (2)
101-101: Consider simplifying the nested subqueries.
While the SQL transformation is correct, there might be room for optimization by reducing the number of nested subqueries while maintaining the same semantics.
Consider simplifying to:
```sql
SELECT add_two(customer.c_custkey) FROM (SELECT __source.c_custkey AS c_custkey FROM main.customer AS __source) AS customer
```
116-116: Consider consolidating duplicate SQL transformation tests.
This assertion duplicates the test from test_session_context. Consider extracting common test cases into a shared parametrized test to improve maintainability.
Example refactor:
```python
@pytest.mark.parametrize(
    "sql,expected",
    [
        (
            "SELECT * FROM my_catalog.my_schema.customer",
            "SELECT customer.c_custkey, customer.c_name FROM (SELECT __source.c_custkey AS c_custkey, __source.c_name AS c_name FROM main.customer AS __source) AS customer",
        ),
        (
            "SELECT add_two(c_custkey) FROM my_catalog.my_schema.customer",
            "SELECT add_two(customer.c_custkey) FROM (SELECT __source.c_custkey AS c_custkey FROM main.customer AS __source) AS customer",
        ),
    ],
)
def test_sql_transformations(sql, expected):
    session_context = SessionContext(manifest_str, "tests/functions.csv")
    assert session_context.transform_sql(sql) == expected
```
ibis-server/app/model/metadata/object_storage.py (1)
123-125: Consider connection pooling for better performance.
The current implementation creates a new connection for each operation. Consider implementing connection pooling to improve performance.
```python
from contextlib import contextmanager

def _get_connection_pool(self):
    if not hasattr(self, "_pool"):
        self._pool = []  # Initialize with a few connections
    return self._pool

@contextmanager
def _get_connection(self):
    pool = self._get_connection_pool()
    try:
        conn = pool.pop() if pool else duckdb.connect()
        yield conn
    finally:
        pool.append(conn)
```
ibis-server/tests/routers/v2/connector/test_local_file.py (2)
91-120: Add negative test cases for data validation.
The test suite should include negative test cases for:
- Invalid file paths
- Malformed data
- Concurrent access scenarios
- Large file handling
Here's an example of additional test cases:
```python
async def test_invalid_file_path(client, manifest_str):
    response = await client.post(
        f"{base_url}/query",
        json={
            "manifestStr": manifest_str,
            "sql": 'SELECT * FROM "Orders" LIMIT 1',
            "connectionInfo": {"url": "nonexistent/path", "format": "parquet"},
        },
    )
    assert response.status_code == 404


async def test_concurrent_access(client, manifest_str, connection_info):
    import asyncio

    tasks = []
    for _ in range(10):
        tasks.append(
            client.post(
                f"{base_url}/query",
                json={
                    "manifestStr": manifest_str,
                    "sql": 'SELECT * FROM "Orders" LIMIT 1',
                    "connectionInfo": connection_info,
                },
            )
        )
    responses = await asyncio.gather(*tasks)
    assert all(r.status_code == 200 for r in responses)
```
235-246: Add more format-specific test cases.
The unsupported format test is good, but consider adding tests for:
- CSV with different delimiters
- JSON with nested structures
- Parquet with compression
Example test case:
```python
async def test_csv_with_custom_delimiter(client):
    response = await client.post(
        url=f"{base_url}/metadata/tables",
        json={
            "connectionInfo": {
                "url": "tests/resource/tpch/data",
                "format": "csv",
                "options": {"delimiter": "|"},
            },
        },
    )
    assert response.status_code == 200
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- ibis-server/app/main.py (1 hunks)
- ibis-server/app/model/data_source.py (4 hunks)
- ibis-server/app/model/metadata/object_storage.py (1 hunks)
- ibis-server/tests/routers/v2/connector/test_local_file.py (1 hunks)
- ibis-server/tests/routers/v3/connector/local_file/conftest.py (1 hunks)
- wren-core-py/tests/test_modeling_core.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- ibis-server/app/model/data_source.py
- ibis-server/tests/routers/v3/connector/local_file/conftest.py
🧰 Additional context used
📓 Learnings (1)
ibis-server/app/model/metadata/object_storage.py (1)
Learnt from: goldmedal
PR: Canner/wren-engine#1029
File: ibis-server/app/model/metadata/object_storage.py:44-44
Timestamp: 2025-01-07T03:56:21.741Z
Learning: When working with DuckDB in Python, use `conn.execute("DESCRIBE SELECT * FROM table").fetchall()` to get column types instead of accessing DataFrame-style attributes like `dtype` or `dtypes`.
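A minimal standalone illustration of that learning (the file path is hypothetical):

```python
import duckdb

conn = duckdb.connect()
# DESCRIBE yields one row per column:
# (column_name, column_type, null, key, default, extra)
rows = conn.execute(
    "DESCRIBE SELECT * FROM 'tests/resource/tpch/data/orders.parquet'"
).fetchall()
column_types = {name: col_type for name, col_type, *_ in rows}
print(column_types)  # e.g. {'o_orderkey': 'BIGINT', ...}
```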
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: clippy
- GitHub Check: cargo test (amd64)
- GitHub Check: check Cargo.toml formatting
- GitHub Check: cargo test (win64)
- GitHub Check: cargo test (macos-aarch64)
- GitHub Check: cargo test (macos)
- GitHub Check: ci
- GitHub Check: test
🔇 Additional comments (5)
wren-core-py/tests/test_modeling_core.py (1)
93-93: LGTM! Verify consistent usage of __source alias.
The introduction of the __source alias in the SQL transformation provides better clarity about the data source. This change aligns well with supporting multiple data sources, including local files.
Let's verify that this aliasing pattern is consistently applied across the codebase:
✅ Verification successful
__source alias pattern is consistently implemented
The __source alias is systematically applied across the codebase, following a consistent three-layer transformation pattern in both test assertions and core implementation. The change is well-integrated and maintains consistency across different SQL operations, data types, and character sets.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Search for SQL transformations to ensure consistent use of __source alias
rg -l "AS __source" --type rust --type python
```
Length of output: 78
Script:
```bash
#!/bin/bash
# Find all Python and Rust files and search for __source alias pattern
fd -e py -e rs | xargs rg "AS __source" -A 2 -B 2
```
Length of output: 12454
ibis-server/app/main.py (1)
81-83: LGTM! Appropriate error handling for unimplemented features.
The implementation correctly uses HTTP status code 501 for "Not Implemented" responses, following FastAPI's exception handling patterns.
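For reference, a minimal FastAPI sketch of returning 501 for an unimplemented path; the route and message are placeholders, not the handler added in this PR:

```python
from fastapi import FastAPI, HTTPException

app = FastAPI()


@app.post("/v3/connector/local_file/metadata")
async def metadata_not_implemented():
    # 501 tells clients the endpoint exists but the feature is not implemented yet.
    raise HTTPException(status_code=501, detail="Not implemented for local_file")
```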
ibis-server/app/model/metadata/object_storage.py (3)
46-46: ⚠️ Potential issue
Fix data type retrieval using DuckDB's DESCRIBE statement.
Based on the retrieved learning from @goldmedal, we should use DuckDB's DESCRIBE statement instead of accessing DataFrame attributes.
Apply this diff to fix the data type retrieval:
```diff
- duckdb_type = df[col].dtypes[0]
+ query_result = conn.execute(f"DESCRIBE SELECT {col} FROM '{file.path}'")
+ column_types = dict(query_result.fetchall())
+ duckdb_type = column_types[col]
```
Likely invalid or redundant comment.
63-79: 🛠️ Refactor suggestion
Use DuckDB table functions instead of connection methods.
Replace the direct connection methods with DuckDB's table functions for better reliability.
Apply this diff:
```diff
 def _read_df(self, conn, path):
+    full_path = f"{self.connection_info.url.get_secret_value()}/{path}"
     if self.connection_info.format == "parquet":
-        return conn.read_parquet(
-            f"{self.connection_info.url.get_secret_value()}/{path}"
-        )
+        return conn.execute(f"SELECT * FROM read_parquet('{full_path}')").df()
     elif self.connection_info.format == "csv":
-        return conn.read_csv(
-            f"{self.connection_info.url.get_secret_value()}/{path}"
-        )
+        return conn.execute(f"SELECT * FROM read_csv('{full_path}')").df()
     elif self.connection_info.format == "json":
-        return conn.read_json(
-            f"{self.connection_info.url.get_secret_value()}/{path}"
-        )
+        return conn.execute(f"SELECT * FROM read_json('{full_path}')").df()
     else:
         raise NotImplementedError(
             f"Unsupported format: {self.connection_info.format}"
         )
```
Likely invalid or redundant comment.
85-92: Consider implementing STRUCT and array type support.
The TODOs indicate missing support for STRUCT and array types. This could limit functionality for complex data types.
Let's check if there are any files using these types:
Would you like me to help implement support for these types or create GitHub issues to track them?
Actionable comments posted: 2
🧹 Nitpick comments (10)
ibis-server/app/model/metadata/dto.py (1)
69-71: LGTM! Consider enhancing the field documentation.
The path field is well-implemented with proper typing and default value. The comment clearly indicates its purpose for file-based tables.
Consider expanding the documentation to include supported file formats:
```diff
- ) # the full path of the table for file-based table
+ ) # the full path of the table for file-based table (supports csv, parquet, json formats)
```
ibis-server/tests/routers/v2/connector/test_trino.py (1)
394-394: Consider reducing test duplication across connector tests.
The pattern of testing path=None is repeated across all connector tests. Consider extracting this into a shared test utility to improve maintainability.
Example approach:
```python
# tests/utils/connector_test_utils.py
def assert_table_properties(table_props, expected_catalog, expected_schema, expected_table, expected_path=None):
    assert table_props == {
        "catalog": expected_catalog,
        "schema": expected_schema,
        "table": expected_table,
        "path": expected_path,
    }
```
ibis-server/tests/routers/v2/connector/test_mssql.py (1)
383-383: LGTM! Consider adding documentation.
The addition of the path property to table metadata is consistent with the new local file connector feature. Setting it to None for MSSQL tables is appropriate as they don't have filesystem paths.
Consider adding a comment in the test to explain why the path property exists and its significance in the context of different data sources (database tables vs. local files).
ibis-server/tests/routers/v2/connector/test_postgres.py (1)
569-569: LGTM! Consider extracting common test assertions.
The addition of the path property maintains consistency with other database connectors. Given that this metadata structure is now common across all connectors, consider extracting these assertions into a shared test utility.
Consider creating a shared test utility function to verify the common metadata structure:
```python
def assert_table_metadata(result, expected_name, expected_catalog, expected_schema, expected_table):
    """Assert common table metadata structure across different connectors."""
    assert result["name"] == expected_name
    assert result["primaryKey"] is not None
    assert result["properties"] == {
        "catalog": expected_catalog,
        "schema": expected_schema,
        "table": expected_table,
        "path": None,
    }
```
This would reduce duplication and make it easier to maintain consistent assertions across all connector tests.
ibis-server/tests/routers/v2/connector/test_local_file.py (6)
11-11: Fix typo in "catalog" field.
The word "catalog" is misspelled as "calalog".
```diff
- "catalog": "my_calalog",
+ "catalog": "my_catalog",
```
16-18: Consider using path fixtures for test file paths.
Using hardcoded relative paths in tests can be fragile if tests are run from different directories. Consider creating a fixture that provides the absolute path to test resources.
Example implementation:
```python
@pytest.fixture(scope="module")
def test_resources_path():
    return os.path.join(
        os.path.dirname(__file__), "..", "..", "..", "resource", "tpch", "data"
    )
```
Also applies to: 44-45
104-111: Consider using test data fixtures instead of hardcoded values.
Hardcoded test data makes tests brittle and harder to maintain. Consider creating fixtures or constants for test data.
Example implementation:
```python
EXPECTED_ORDER_DATA = {
    "orderkey": 1,
    "custkey": 370,
    "orderstatus": "O",
    "totalprice": "172799.49",
    "orderdate": "1996-01-02 00:00:00.000000",
    "order_cust_key": "1_370",
}


def test_query(...):
    assert result["data"][0] == list(EXPECTED_ORDER_DATA.values())
```
181-182: Enhance error message validation in dry run test.
The test only verifies that the response text is not None. Consider asserting the specific error message to ensure proper error handling.
```diff
  assert response.status_code == 422
- assert response.text is not None
+ assert "Table 'NotFound' not found" in response.text
```
194-194: Replace filter() with a more robust table lookup.
Using filter() could fail silently if the table is not found. Consider using a more explicit approach.
```diff
- result = next(filter(lambda x: x["name"] == "orders", response.json()))
+ tables = response.json()
+ orders_table = next((x for x in tables if x["name"] == "orders"), None)
+ assert orders_table is not None, "Orders table not found in response"
+ result = orders_table
```
250-448: Reduce duplication in format type mapping tests.
The type mapping assertions are repeated across format tests. Consider extracting common type mapping logic into helper functions or fixtures.
Example implementation:
```python
@pytest.fixture
def type_mapping_assertions():
    def _assert_common_types(columns):
        assert columns[0]["name"] == "c_bigint"
        assert columns[0]["type"] == "INT64"
        # ... more common assertions
    return _assert_common_types


def test_list_parquet_files(client, type_mapping_assertions):
    # ... test setup ...
    type_mapping_assertions(columns)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (7)
- ibis-server/poetry.lock is excluded by !**/*.lock
- ibis-server/tests/resource/test_file_source/type-test-csv/t1.csv is excluded by !**/*.csv
- ibis-server/tests/resource/test_file_source/type-test-csv/t2.csv is excluded by !**/*.csv
- ibis-server/tests/resource/test_file_source/type-test-parquet/t1.parquet is excluded by !**/*.parquet
- ibis-server/tests/resource/test_file_source/type-test-parquet/t2.parquet is excluded by !**/*.parquet
- ibis-server/tests/resource/test_file_source/type-test.csv is excluded by !**/*.csv
- ibis-server/tests/resource/test_file_source/type-test.parquet is excluded by !**/*.parquet
📒 Files selected for processing (14)
- ibis-server/app/model/metadata/dto.py (1 hunks)
- ibis-server/app/model/metadata/object_storage.py (1 hunks)
- ibis-server/pyproject.toml (2 hunks)
- ibis-server/tests/resource/test_file_source/invalid (1 hunks)
- ibis-server/tests/resource/test_file_source/type-test-json/t1.json (1 hunks)
- ibis-server/tests/resource/test_file_source/type-test-json/t2.json (1 hunks)
- ibis-server/tests/resource/test_file_source/type-test.json (1 hunks)
- ibis-server/tests/routers/v2/connector/test_clickhouse.py (1 hunks)
- ibis-server/tests/routers/v2/connector/test_local_file.py (1 hunks)
- ibis-server/tests/routers/v2/connector/test_mssql.py (1 hunks)
- ibis-server/tests/routers/v2/connector/test_mysql.py (1 hunks)
- ibis-server/tests/routers/v2/connector/test_postgres.py (1 hunks)
- ibis-server/tests/routers/v2/connector/test_snowflake.py (1 hunks)
- ibis-server/tests/routers/v2/connector/test_trino.py (1 hunks)
✅ Files skipped from review due to trivial changes (4)
- ibis-server/tests/resource/test_file_source/invalid
- ibis-server/tests/resource/test_file_source/type-test-json/t1.json
- ibis-server/tests/resource/test_file_source/type-test.json
- ibis-server/tests/resource/test_file_source/type-test-json/t2.json
🚧 Files skipped from review as they are similar to previous changes (2)
- ibis-server/pyproject.toml
- ibis-server/app/model/metadata/object_storage.py
🔇 Additional comments (5)
ibis-server/tests/routers/v2/connector/test_snowflake.py (1)
290-290: LGTM! Path field correctly included in metadata test.
The test properly verifies that non-file-based tables return None for the path field, maintaining consistency with the DTO changes.
ibis-server/tests/routers/v2/connector/test_mysql.py (1)
365-365: LGTM! Path field correctly included in metadata test.
The test properly verifies that non-file-based tables return None for the path field, maintaining consistency with other connectors.
ibis-server/tests/routers/v2/connector/test_trino.py (1)
394-394: LGTM! Path field correctly included in metadata test.
The test properly verifies that non-file-based tables return None for the path field, maintaining consistency with other connectors.
ibis-server/tests/routers/v2/connector/test_clickhouse.py (1)
530-530: LGTM! Consistent with other connectors.
The addition of the path property maintains consistency with other database connectors and aligns with the local file connector feature.
ibis-server/tests/routers/v2/connector/test_local_file.py (1)
332-333: Handle invalid CSV files more explicitly.
The comment suggests that invalid files are silently accepted as one-column CSV files. Consider adding explicit validation for CSV file format.
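One hedged way to fail fast on unparsable files is to scan them eagerly and wrap DuckDB's error; this sketch only catches hard parse errors and would not, by itself, reject a text file that happens to read as a single column:

```python
import duckdb


def validate_csv(conn: duckdb.DuckDBPyConnection, path: str) -> None:
    """Raise a clear error if DuckDB cannot parse the file as CSV (illustrative)."""
    try:
        # Force a full scan so parsing errors surface up front.
        conn.execute(f"SELECT * FROM read_csv_auto('{path}')").fetchall()
    except duckdb.Error as exc:
        raise ValueError(f"Invalid CSV file: {path}") from exc
```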
Description
This PR introduces a local file connector. Supported file formats: csv, parquet, or json.
Connection Info
The URL should point to a directory. Each file or directory in this path is considered a Table.
For example, if we point the URL to path-to-url, we get 2 tables: orders and customer.
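To make the directory-as-table convention concrete, a hedged request sketch in the shape used by the tests in this PR (the url value is the placeholder directory above, and the manifest encoding is whatever the server expects):

```python
connection_info = {
    "url": "path-to-url",  # directory whose files/sub-directories become tables
    "format": "csv",       # or "parquet" / "json"
}

payload = {
    "manifestStr": "<encoded MDL manifest>",
    "sql": 'SELECT * FROM "orders" LIMIT 10',
    "connectionInfo": connection_info,
}
# POST payload to /v2/connector/local_file/query or /v3/connector/local_file/query.
```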
API URL
/v2/connector/local_file
/v3/connector/local_file
Sample Model Definition
Testing data
The testing data is generated by DuckDB with the following SQL:
Then, the data is converted to the different file formats.
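The generation SQL itself is not shown above; purely as an illustration of the conversion step, DuckDB's COPY statement can export a table to each format (the table and file names are hypothetical):

```python
import duckdb

conn = duckdb.connect("type_test.duckdb")
# Assume a table `t1` already exists from the generation step.
conn.execute("COPY t1 TO 'type-test.csv' (FORMAT CSV, HEADER)")
conn.execute("COPY t1 TO 'type-test.parquet' (FORMAT PARQUET)")
conn.execute("COPY t1 TO 'type-test.json' (FORMAT JSON)")
```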
Summary by CodeRabbit
New Features
Dependencies
Improvements
Testing