feat(ibis): introduce Local file connector #1029
Conversation
Walkthrough
The pull request introduces comprehensive support for local file data sources in the ibis-server project. This enhancement includes adding a new local file connector backed by DuckDB.
Changes
Sequence Diagram
```mermaid
sequenceDiagram
    participant Client
    participant Router
    participant Connector
    participant DuckDB
    participant Metadata
    Client->>Router: Send local file query
    Router->>Connector: Initialize DuckDBConnector
    Connector->>DuckDB: Establish connection
    Router->>Metadata: Retrieve table metadata
    Metadata->>DuckDB: Read file schema
    Router->>Connector: Execute query
    Connector->>DuckDB: Run SQL
    DuckDB-->>Connector: Return results
    Connector-->>Router: Provide query results
    Router-->>Client: Send response
```
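As a rough orientation for readers, the flow above can be reproduced with the duckdb Python package directly; the class name and paths below are illustrative placeholders, not the connector actually added in this PR:

```python
import duckdb
import pandas as pd


class LocalFileConnector:
    """Illustrative DuckDB-backed connector for files under a directory."""

    def __init__(self, base_path: str, file_format: str = "csv"):
        self.base_path = base_path
        self.file_format = file_format
        # An in-memory DuckDB connection is enough for reading local files.
        self.conn = duckdb.connect()

    def query(self, sql: str) -> pd.DataFrame:
        # DuckDB reads csv/parquet/json files referenced directly in SQL,
        # e.g. SELECT * FROM 'path/to/dir/*.csv'.
        return self.conn.execute(sql).df()


if __name__ == "__main__":
    connector = LocalFileConnector("tests/resource/tpch/data", "csv")
    print(connector.query("SELECT 1 AS ok"))
```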
Actionable comments posted: 7
🧹 Nitpick comments (5)
ibis-server/app/model/metadata/object_storage.py (2)
83-85: Offer assistance for supporting STRUCT data types
The TODO comment indicates that support for STRUCT data types is needed. Implementing this will enhance data type handling.
Would you like me to provide a solution to add support for STRUCT data types or open a new GitHub issue to track this task?
87-89: Offer assistance for supporting ARRAY data types
The TODO comment indicates that support for ARRAY data types is needed. Including this feature will improve the robustness of your data type mappings.
Would you like me to help implement support for ARRAY data types or create a GitHub issue to address this?
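As a rough sketch only (not the PR's actual mapping code), nested DuckDB types could be handled by inspecting the type string returned by DESCRIBE before falling back to the scalar mapping; the target type names used here (INT64, STRING, JSON, ARRAY) are assumptions:

```python
def map_duckdb_type(duckdb_type: str) -> str:
    """Map a DuckDB DESCRIBE type string to an engine column type (illustrative)."""
    t = duckdb_type.upper()
    if t.startswith("STRUCT"):
        # e.g. STRUCT(a INTEGER, b VARCHAR)
        return "JSON"
    if t.endswith("[]") or t.startswith("LIST"):
        # e.g. INTEGER[] -- DuckDB renders list/array types with a [] suffix
        return "ARRAY"
    scalar = {"BIGINT": "INT64", "VARCHAR": "STRING", "DOUBLE": "DOUBLE"}
    return scalar.get(t, "UNKNOWN")
```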
ibis-server/app/model/__init__.py (1)
140-144: Enhance field validation and documentation.
Consider the following improvements:
- Use an Enum for the format field to restrict values to supported formats
- Add description and examples for the url field to clarify expected values
```diff
 class LocalFileConnectionInfo(BaseModel):
-    url: SecretStr
+    url: SecretStr = Field(
+        description="Path to the local file or directory",
+        examples=["data/customers.csv", "data/orders/"]
+    )
-    format: str = Field(
+    format: FileFormat = Field(
         description="File format", default="csv", examples=["csv", "parquet", "json"]
     )
+
+class FileFormat(str, Enum):
+    CSV = "csv"
+    PARQUET = "parquet"
+    JSON = "json"
```
ibis-server/tests/routers/v3/connector/local_file/test_query.py (1)
152-180: Add more comprehensive dry run tests.
The dry run tests should include (a test sketch follows this list):
- Syntax errors in SQL
- Invalid column references
- Type mismatches
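A hedged sketch of such tests, modeled on the request shape used elsewhere in this PR's test suite; client, manifest_str and connection_info are assumed fixtures, and the dryRun parameter name is an assumption:

```python
import pytest

base_url = "/v3/connector/local_file"  # assumed to match the PR's routes


@pytest.mark.parametrize(
    "bad_sql",
    [
        "SELEC * FROM orders",             # syntax error
        "SELECT no_such_col FROM orders",  # invalid column reference
        "SELECT 'abc' + 1 FROM orders",    # type mismatch
    ],
)
async def test_dry_run_rejects_invalid_sql(client, manifest_str, connection_info, bad_sql):
    response = await client.post(
        f"{base_url}/query",
        params={"dryRun": True},  # assumed query-parameter name for dry runs
        json={
            "manifestStr": manifest_str,
            "sql": bad_sql,
            "connectionInfo": connection_info,
        },
    )
    # Planning should fail before any execution happens.
    assert response.status_code == 422
```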
ibis-server/pyproject.toml (1)
51-52: Consider deployment and testing implications of DuckDB.
DuckDB is a good choice for local file processing, but consider the points below (a memory-settings sketch follows this list):
- Document memory requirements and limitations for production deployments
- Add integration tests with different file formats (CSV, Parquet, JSON)
- Consider adding examples in the README for common use cases
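On the memory point, DuckDB exposes runtime settings that a deployment could pin; a minimal sketch (the 4GB and thread values are arbitrary examples, not recommendations from this PR):

```python
import duckdb

conn = duckdb.connect()
# Cap DuckDB's memory usage and parallelism so scanning large local files
# cannot exhaust the ibis-server process.
conn.execute("SET memory_limit = '4GB'")
conn.execute("SET threads = 4")
```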
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
- ibis-server/poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (12)
- ibis-server/app/mdl/rewriter.py (1 hunks)
- ibis-server/app/model/__init__.py (3 hunks)
- ibis-server/app/model/connector.py (2 hunks)
- ibis-server/app/model/data_source.py (3 hunks)
- ibis-server/app/model/metadata/factory.py (2 hunks)
- ibis-server/app/model/metadata/object_storage.py (1 hunks)
- ibis-server/pyproject.toml (2 hunks)
- ibis-server/tests/routers/v2/connector/test_local_file.py (1 hunks)
- ibis-server/tests/routers/v3/connector/local_file/conftest.py (1 hunks)
- ibis-server/tests/routers/v3/connector/local_file/test_query.py (1 hunks)
- wren-core/core/src/logical_plan/analyze/model_generation.rs (5 hunks)
- wren-core/core/src/mdl/mod.rs (19 hunks)
👮 Files not reviewed due to content moderation or server errors (4)
- ibis-server/app/model/connector.py
- ibis-server/tests/routers/v2/connector/test_local_file.py
- wren-core/core/src/logical_plan/analyze/model_generation.rs
- wren-core/core/src/mdl/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: cargo test (win64)
- GitHub Check: cargo test (macos-aarch64)
- GitHub Check: cargo check
- GitHub Check: cargo test (macos)
- GitHub Check: ci
- GitHub Check: test
🔇 Additional comments (5)
ibis-server/app/model/metadata/factory.py (1)
8-8: Integration of LocalFileMetadata into MetadataFactory
Adding LocalFileMetadata to the import statements and the mapping dictionary correctly integrates the new local file data source into the metadata factory. This allows for proper handling of local files throughout the application.
Also applies to: 22-22
ibis-server/app/mdl/rewriter.py (1)
75-76: Verify DuckDB connector implementation.
The dialect is correctly set to "duckdb" for local files. However, we should verify that the DuckDB connector is properly implemented to handle this dialect.
✅ Verification successful
DuckDB connector implementation verified for local files
The codebase shows proper implementation of DuckDB connector for local files:
- Local file data source is correctly mapped to DuckDBConnector
- LocalFileMetadata class handles metadata operations
- Complete implementation chain from data source recognition to connector mapping is in place
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Verify DuckDB connector implementation
# Test: Search for DuckDB connector class and its implementation
ast-grep --pattern 'class $CONNECTOR { $$$ def $METHOD($ARGS) { $$$ duckdb$$$ $$$ } $$$ }'
```
Length of output: 112
Script:
```bash
#!/bin/bash
# Search for DuckDB-related code
echo "=== DuckDB References ==="
rg -i "duckdb" -A 3
echo -e "\n=== Local File Handling ==="
rg -i "local_file" -A 3
echo -e "\n=== Connector Classes ==="
ast-grep --pattern 'class $NAME($$$) { $$$ }'
```
Length of output: 70107
Script:
```bash
#!/bin/bash
# Search for local file handling in the codebase
echo "=== Local File DataSource References ==="
rg "DataSource\.local_file" -A 3
echo -e "\n=== File Operations ==="
rg "\.file|readFile|writeFile|FileUtil" -A 3
echo -e "\n=== DuckDB File Config ==="
rg "file.*config.*duckdb" -i -A 3
```
Length of output: 15989
ibis-server/tests/routers/v3/connector/local_file/test_query.py (1)
108-124: Verify limit parameter behavior.
The test asserts that the result has 1 row but doesn't verify that it's using the correct limit (URL parameter vs SQL LIMIT clause). Add assertions to verify which limit takes precedence.
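A sketch of the suggested assertion, assuming the test module's base_url, client, manifest_str and connection_info fixtures and that the limit is passed as a URL query parameter; which limit actually wins must be confirmed against the implementation:

```python
async def test_limit_parameter_vs_sql_limit(client, manifest_str, connection_info):
    response = await client.post(
        f"{base_url}/query",
        params={"limit": 1},  # assumed URL parameter name
        json={
            "manifestStr": manifest_str,
            "sql": 'SELECT * FROM "Orders" LIMIT 10',
            "connectionInfo": connection_info,
        },
    )
    assert response.status_code == 200
    # Make the precedence explicit: here we assume the URL parameter is the cap.
    assert len(response.json()["data"]) == 1
```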
ibis-server/pyproject.toml (2)
66-66: LGTM! Well-organized test categorization.
The new marker local_file follows the existing pattern and is appropriately placed in alphabetical order.
51-52: Verify DuckDB version and consider security implications.
The fixed version of DuckDB (1.1.3) might be outdated. Let's verify the latest stable version and any security advisories.
Additionally, consider using a version range (e.g., >=1.1.3,<2.0.0) to automatically get bug fixes while avoiding breaking changes.
✅ Verification successful
DuckDB version 1.1.3 is appropriate and secure
The specified version is the latest available on PyPI and has no known security vulnerabilities. Using a fixed version is good for reproducibility in this case.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Check for latest versions and security advisories for DuckDB

# Check PyPI for latest versions
curl -s https://pypi.org/pypi/duckdb/json | jq '.info.version'

# Check for security advisories
gh api graphql -f query='
{
  securityVulnerabilities(first: 5, ecosystem: PIP, package: "duckdb") {
    nodes {
      advisory {
        summary
        severity
        publishedAt
      }
      vulnerableVersionRange
      firstPatchedVersion {
        identifier
      }
    }
  }
}'
```
Length of output: 410
Actionable comments posted: 0
🧹 Nitpick comments (5)
wren-core-py/tests/test_modeling_core.py (2)
101-101: Consider simplifying the nested subqueries.
While the SQL transformation is correct, there might be room for optimization by reducing the number of nested subqueries while maintaining the same semantics.
Consider simplifying to:
```sql
SELECT add_two(customer.c_custkey) FROM (SELECT __source.c_custkey AS c_custkey FROM main.customer AS __source) AS customer
```
116-116: Consider consolidating duplicate SQL transformation tests.
This assertion duplicates the test from test_session_context. Consider extracting common test cases into a shared parametrized test to improve maintainability.
Example refactor:
```python
@pytest.mark.parametrize(
    "sql,expected",
    [
        (
            "SELECT * FROM my_catalog.my_schema.customer",
            "SELECT customer.c_custkey, customer.c_name FROM (SELECT __source.c_custkey AS c_custkey, __source.c_name AS c_name FROM main.customer AS __source) AS customer",
        ),
        (
            "SELECT add_two(c_custkey) FROM my_catalog.my_schema.customer",
            "SELECT add_two(customer.c_custkey) FROM (SELECT __source.c_custkey AS c_custkey FROM main.customer AS __source) AS customer",
        ),
    ],
)
def test_sql_transformations(sql, expected):
    session_context = SessionContext(manifest_str, "tests/functions.csv")
    assert session_context.transform_sql(sql) == expected
```
ibis-server/app/model/metadata/object_storage.py (1)
123-125: Consider connection pooling for better performance.
The current implementation creates a new connection for each operation. Consider implementing connection pooling to improve performance.
```python
from contextlib import contextmanager

def _get_connection_pool(self):
    if not hasattr(self, "_pool"):
        self._pool = []  # Initialize with a few connections
    return self._pool

@contextmanager
def _get_connection(self):
    pool = self._get_connection_pool()
    try:
        conn = pool.pop() if pool else duckdb.connect()
        yield conn
    finally:
        pool.append(conn)
```
ibis-server/tests/routers/v2/connector/test_local_file.py (2)
91-120: Add negative test cases for data validation.
The test suite should include negative test cases for:
- Invalid file paths
- Malformed data
- Concurrent access scenarios
- Large file handling
Here's an example of additional test cases:
```python
async def test_invalid_file_path(client, manifest_str):
    response = await client.post(
        f"{base_url}/query",
        json={
            "manifestStr": manifest_str,
            "sql": 'SELECT * FROM "Orders" LIMIT 1',
            "connectionInfo": {"url": "nonexistent/path", "format": "parquet"},
        },
    )
    assert response.status_code == 404


async def test_concurrent_access(client, manifest_str, connection_info):
    import asyncio

    tasks = []
    for _ in range(10):
        tasks.append(
            client.post(
                f"{base_url}/query",
                json={
                    "manifestStr": manifest_str,
                    "sql": 'SELECT * FROM "Orders" LIMIT 1',
                    "connectionInfo": connection_info,
                },
            )
        )
    responses = await asyncio.gather(*tasks)
    assert all(r.status_code == 200 for r in responses)
```
235-246: Add more format-specific test cases.
The unsupported format test is good, but consider adding tests for:
- CSV with different delimiters
- JSON with nested structures
- Parquet with compression
Example test case:
```python
async def test_csv_with_custom_delimiter(client):
    response = await client.post(
        url=f"{base_url}/metadata/tables",
        json={
            "connectionInfo": {
                "url": "tests/resource/tpch/data",
                "format": "csv",
                "options": {"delimiter": "|"},
            },
        },
    )
    assert response.status_code == 200
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- ibis-server/app/main.py (1 hunks)
- ibis-server/app/model/data_source.py (4 hunks)
- ibis-server/app/model/metadata/object_storage.py (1 hunks)
- ibis-server/tests/routers/v2/connector/test_local_file.py (1 hunks)
- ibis-server/tests/routers/v3/connector/local_file/conftest.py (1 hunks)
- wren-core-py/tests/test_modeling_core.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- ibis-server/app/model/data_source.py
- ibis-server/tests/routers/v3/connector/local_file/conftest.py
🧰 Additional context used
📓 Learnings (1)
ibis-server/app/model/metadata/object_storage.py (1)
Learnt from: goldmedal
PR: Canner/wren-engine#1029
File: ibis-server/app/model/metadata/object_storage.py:44-44
Timestamp: 2025-01-07T03:56:21.741Z
Learning: When working with DuckDB in Python, use `conn.execute("DESCRIBE SELECT * FROM table").fetchall()` to get column types instead of accessing DataFrame-style attributes like `dtype` or `dtypes`.
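A minimal standalone illustration of that learning (the file path is hypothetical):

```python
import duckdb

conn = duckdb.connect()
# DESCRIBE yields one row per column:
# (column_name, column_type, null, key, default, extra)
rows = conn.execute(
    "DESCRIBE SELECT * FROM 'tests/resource/tpch/data/orders.parquet'"
).fetchall()
column_types = {name: col_type for name, col_type, *_ in rows}
print(column_types)  # e.g. {'o_orderkey': 'BIGINT', ...}
```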
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: clippy
- GitHub Check: cargo test (amd64)
- GitHub Check: check Cargo.toml formatting
- GitHub Check: cargo test (win64)
- GitHub Check: cargo test (macos-aarch64)
- GitHub Check: cargo test (macos)
- GitHub Check: ci
- GitHub Check: test
🔇 Additional comments (5)
wren-core-py/tests/test_modeling_core.py (1)
93-93: LGTM! Verify consistent usage of __source alias.
The introduction of the __source alias in the SQL transformation provides better clarity about the data source. This change aligns well with supporting multiple data sources, including local files.
Let's verify that this aliasing pattern is consistently applied across the codebase:
✅ Verification successful
__source alias pattern is consistently implemented
The __source alias is systematically applied across the codebase, following a consistent three-layer transformation pattern in both test assertions and core implementation. The change is well-integrated and maintains consistency across different SQL operations, data types, and character sets.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Search for SQL transformations to ensure consistent use of __source alias
rg -l "AS __source" --type rust --type python
```
Length of output: 78
Script:
```bash
#!/bin/bash
# Find all Python and Rust files and search for __source alias pattern
fd -e py -e rs | xargs rg "AS __source" -A 2 -B 2
```
Length of output: 12454
ibis-server/app/main.py (1)
81-83: LGTM! Appropriate error handling for unimplemented features.
The implementation correctly uses HTTP status code 501 for "Not Implemented" responses, following FastAPI's exception handling patterns.
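For reference, a minimal FastAPI sketch of returning 501 for an unimplemented path; the route and message are placeholders, not the handler added in this PR:

```python
from fastapi import FastAPI, HTTPException

app = FastAPI()


@app.post("/v3/connector/local_file/metadata")
async def metadata_not_implemented():
    # 501 tells clients the endpoint exists but the feature is not implemented yet.
    raise HTTPException(status_code=501, detail="Not implemented for local_file")
```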
ibis-server/app/model/metadata/object_storage.py (3)
46-46: ⚠️ Potential issue
Fix data type retrieval using DuckDB's DESCRIBE statement.
Based on the retrieved learning from @goldmedal, we should use DuckDB's DESCRIBE statement instead of accessing DataFrame attributes.
Apply this diff to fix the data type retrieval:
```diff
- duckdb_type = df[col].dtypes[0]
+ query_result = conn.execute(f"DESCRIBE SELECT {col} FROM '{file.path}'")
+ column_types = dict(query_result.fetchall())
+ duckdb_type = column_types[col]
```
Likely invalid or redundant comment.
63-79: 🛠️ Refactor suggestion
Use DuckDB table functions instead of connection methods.
Replace the direct connection methods with DuckDB's table functions for better reliability.
Apply this diff:
```diff
 def _read_df(self, conn, path):
+    full_path = f"{self.connection_info.url.get_secret_value()}/{path}"
     if self.connection_info.format == "parquet":
-        return conn.read_parquet(
-            f"{self.connection_info.url.get_secret_value()}/{path}"
-        )
+        return conn.execute(f"SELECT * FROM read_parquet('{full_path}')").df()
     elif self.connection_info.format == "csv":
-        return conn.read_csv(
-            f"{self.connection_info.url.get_secret_value()}/{path}"
-        )
+        return conn.execute(f"SELECT * FROM read_csv('{full_path}')").df()
     elif self.connection_info.format == "json":
-        return conn.read_json(
-            f"{self.connection_info.url.get_secret_value()}/{path}"
-        )
+        return conn.execute(f"SELECT * FROM read_json('{full_path}')").df()
     else:
         raise NotImplementedError(
             f"Unsupported format: {self.connection_info.format}"
         )
```
Likely invalid or redundant comment.
85-92: Consider implementing STRUCT and array type support.
The TODOs indicate missing support for STRUCT and array types. This could limit functionality for complex data types.
Let's check if there are any files using these types:
Would you like me to help implement support for these types or create GitHub issues to track them?
Actionable comments posted: 2
🧹 Nitpick comments (10)
ibis-server/app/model/metadata/dto.py (1)
69-71: LGTM! Consider enhancing the field documentation.
The path field is well-implemented with proper typing and default value. The comment clearly indicates its purpose for file-based tables.
Consider expanding the documentation to include supported file formats:
```diff
- ) # the full path of the table for file-based table
+ ) # the full path of the table for file-based table (supports csv, parquet, json formats)
```
ibis-server/tests/routers/v2/connector/test_trino.py (1)
394-394: Consider reducing test duplication across connector tests.
The pattern of testing path=None is repeated across all connector tests. Consider extracting this into a shared test utility to improve maintainability.
Example approach:
```python
# tests/utils/connector_test_utils.py
def assert_table_properties(table_props, expected_catalog, expected_schema, expected_table, expected_path=None):
    assert table_props == {
        "catalog": expected_catalog,
        "schema": expected_schema,
        "table": expected_table,
        "path": expected_path,
    }
```
ibis-server/tests/routers/v2/connector/test_mssql.py (1)
383-383: LGTM! Consider adding documentation.
The addition of the path property to table metadata is consistent with the new local file connector feature. Setting it to None for MSSQL tables is appropriate as they don't have filesystem paths.
Consider adding a comment in the test to explain why the path property exists and its significance in the context of different data sources (database tables vs. local files).
ibis-server/tests/routers/v2/connector/test_postgres.py (1)
569-569: LGTM! Consider extracting common test assertions.
The addition of the path property maintains consistency with other database connectors. Given that this metadata structure is now common across all connectors, consider extracting these assertions into a shared test utility.
Consider creating a shared test utility function to verify the common metadata structure:
```python
def assert_table_metadata(result, expected_name, expected_catalog, expected_schema, expected_table):
    """Assert common table metadata structure across different connectors."""
    assert result["name"] == expected_name
    assert result["primaryKey"] is not None
    assert result["properties"] == {
        "catalog": expected_catalog,
        "schema": expected_schema,
        "table": expected_table,
        "path": None,
    }
```
This would reduce duplication and make it easier to maintain consistent assertions across all connector tests.
ibis-server/tests/routers/v2/connector/test_local_file.py (6)
11-11: Fix typo in "catalog" field.
The word "catalog" is misspelled as "calalog".
```diff
- "catalog": "my_calalog",
+ "catalog": "my_catalog",
```
16-18: Consider using path fixtures for test file paths.
Using hardcoded relative paths in tests can be fragile if tests are run from different directories. Consider creating a fixture that provides the absolute path to test resources.
Example implementation:
```python
@pytest.fixture(scope="module")
def test_resources_path():
    return os.path.join(
        os.path.dirname(__file__), "..", "..", "..", "resource", "tpch", "data"
    )
```
Also applies to: 44-45
104-111: Consider using test data fixtures instead of hardcoded values.
Hardcoded test data makes tests brittle and harder to maintain. Consider creating fixtures or constants for test data.
Example implementation:
```python
EXPECTED_ORDER_DATA = {
    "orderkey": 1,
    "custkey": 370,
    "orderstatus": "O",
    "totalprice": "172799.49",
    "orderdate": "1996-01-02 00:00:00.000000",
    "order_cust_key": "1_370",
}


def test_query(...):
    assert result["data"][0] == list(EXPECTED_ORDER_DATA.values())
```
181-182: Enhance error message validation in dry run test.
The test only verifies that the response text is not None. Consider asserting the specific error message to ensure proper error handling.
```diff
  assert response.status_code == 422
- assert response.text is not None
+ assert "Table 'NotFound' not found" in response.text
```
194-194: Replace filter() with a more robust table lookup.
Using filter() could fail silently if the table is not found. Consider using a more explicit approach.
```diff
- result = next(filter(lambda x: x["name"] == "orders", response.json()))
+ tables = response.json()
+ orders_table = next((x for x in tables if x["name"] == "orders"), None)
+ assert orders_table is not None, "Orders table not found in response"
+ result = orders_table
```
250-448: Reduce duplication in format type mapping tests.
The type mapping assertions are repeated across format tests. Consider extracting common type mapping logic into helper functions or fixtures.
Example implementation:
```python
@pytest.fixture
def type_mapping_assertions():
    def _assert_common_types(columns):
        assert columns[0]["name"] == "c_bigint"
        assert columns[0]["type"] == "INT64"
        # ... more common assertions
    return _assert_common_types


def test_list_parquet_files(client, type_mapping_assertions):
    # ... test setup ...
    type_mapping_assertions(columns)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (7)
- ibis-server/poetry.lock is excluded by !**/*.lock
- ibis-server/tests/resource/test_file_source/type-test-csv/t1.csv is excluded by !**/*.csv
- ibis-server/tests/resource/test_file_source/type-test-csv/t2.csv is excluded by !**/*.csv
- ibis-server/tests/resource/test_file_source/type-test-parquet/t1.parquet is excluded by !**/*.parquet
- ibis-server/tests/resource/test_file_source/type-test-parquet/t2.parquet is excluded by !**/*.parquet
- ibis-server/tests/resource/test_file_source/type-test.csv is excluded by !**/*.csv
- ibis-server/tests/resource/test_file_source/type-test.parquet is excluded by !**/*.parquet
📒 Files selected for processing (14)
- ibis-server/app/model/metadata/dto.py (1 hunks)
- ibis-server/app/model/metadata/object_storage.py (1 hunks)
- ibis-server/pyproject.toml (2 hunks)
- ibis-server/tests/resource/test_file_source/invalid (1 hunks)
- ibis-server/tests/resource/test_file_source/type-test-json/t1.json (1 hunks)
- ibis-server/tests/resource/test_file_source/type-test-json/t2.json (1 hunks)
- ibis-server/tests/resource/test_file_source/type-test.json (1 hunks)
- ibis-server/tests/routers/v2/connector/test_clickhouse.py (1 hunks)
- ibis-server/tests/routers/v2/connector/test_local_file.py (1 hunks)
- ibis-server/tests/routers/v2/connector/test_mssql.py (1 hunks)
- ibis-server/tests/routers/v2/connector/test_mysql.py (1 hunks)
- ibis-server/tests/routers/v2/connector/test_postgres.py (1 hunks)
- ibis-server/tests/routers/v2/connector/test_snowflake.py (1 hunks)
- ibis-server/tests/routers/v2/connector/test_trino.py (1 hunks)
✅ Files skipped from review due to trivial changes (4)
- ibis-server/tests/resource/test_file_source/invalid
- ibis-server/tests/resource/test_file_source/type-test-json/t1.json
- ibis-server/tests/resource/test_file_source/type-test.json
- ibis-server/tests/resource/test_file_source/type-test-json/t2.json
🚧 Files skipped from review as they are similar to previous changes (2)
- ibis-server/pyproject.toml
- ibis-server/app/model/metadata/object_storage.py
🔇 Additional comments (5)
ibis-server/tests/routers/v2/connector/test_snowflake.py (1)
290-290: LGTM! Path field correctly included in metadata test.
The test properly verifies that non-file-based tables return None for the path field, maintaining consistency with the DTO changes.
ibis-server/tests/routers/v2/connector/test_mysql.py (1)
365-365: LGTM! Path field correctly included in metadata test.
The test properly verifies that non-file-based tables return None for the path field, maintaining consistency with other connectors.
ibis-server/tests/routers/v2/connector/test_trino.py (1)
394-394: LGTM! Path field correctly included in metadata test.
The test properly verifies that non-file-based tables return None for the path field, maintaining consistency with other connectors.
ibis-server/tests/routers/v2/connector/test_clickhouse.py (1)
530-530: LGTM! Consistent with other connectors.
The addition of the path property maintains consistency with other database connectors and aligns with the local file connector feature.
ibis-server/tests/routers/v2/connector/test_local_file.py (1)
332-333: Handle invalid CSV files more explicitly.
The comment suggests that invalid files are silently accepted as one-column CSV files. Consider adding explicit validation for CSV file format.
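One hedged way to fail fast on unparsable files is to scan them eagerly and wrap DuckDB's error; this sketch only catches hard parse errors and would not, by itself, reject a text file that happens to read as a single column:

```python
import duckdb


def validate_csv(conn: duckdb.DuckDBPyConnection, path: str) -> None:
    """Raise a clear error if DuckDB cannot parse the file as CSV (illustrative)."""
    try:
        # Force a full scan so parsing errors surface up front.
        conn.execute(f"SELECT * FROM read_csv_auto('{path}')").fetchall()
    except duckdb.Error as exc:
        raise ValueError(f"Invalid CSV file: {path}") from exc
```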
Description
This PR introduces a local file connector. Supported file formats: csv, parquet, or json.
Connection Info
The URL should point to a directory. Each file or directory in this path is considered a Table.
For example, if we point the URL to path-to-url, we get 2 tables: orders and customer.
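To make the directory-as-table convention concrete, a hedged request sketch in the shape used by the tests in this PR (the url value is the placeholder directory above, and the manifest encoding is whatever the server expects):

```python
connection_info = {
    "url": "path-to-url",  # directory whose files/sub-directories become tables
    "format": "csv",       # or "parquet" / "json"
}

payload = {
    "manifestStr": "<encoded MDL manifest>",
    "sql": 'SELECT * FROM "orders" LIMIT 10',
    "connectionInfo": connection_info,
}
# POST payload to /v2/connector/local_file/query or /v3/connector/local_file/query.
```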
API URL
/v2/connector/local_file
/v3/connector/local_file
Sample Model Definition
Testing data
The testing data is generated by DuckDB with the following SQL:
Then, the data is converted to the different file formats.
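The generation SQL itself is not shown above; purely as an illustration of the conversion step, DuckDB's COPY statement can export a table to each format (the table and file names are hypothetical):

```python
import duckdb

conn = duckdb.connect("type_test.duckdb")
# Assume a table `t1` already exists from the generation step.
conn.execute("COPY t1 TO 'type-test.csv' (FORMAT CSV, HEADER)")
conn.execute("COPY t1 TO 'type-test.parquet' (FORMAT PARQUET)")
conn.execute("COPY t1 TO 'type-test.json' (FORMAT JSON)")
```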
Summary by CodeRabbit
New Features
Dependencies
Improvements
Testing