allow merge on expressions #388

Merged
merged 7 commits into from
Sep 9, 2024

Member

@mattseddon mattseddon commented Sep 4, 2024

Closes #299

This PR allows users to merge DataChains using Columns and/or expressions (e.g. path.file_stem(dc.c("source.path"))). Examples of the new syntax are shown in examples/multimodal/clip_inference.py & examples/multimodal/wds.py. Once this change has been merged I will make the appropriate changes in the datachain-examples repo.

The change is backwards compatible: merge continues to accept strings for the on and right_on parameters.
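
To make the idea of merging on an expression concrete, here is a minimal, library-independent sketch in plain Python. It is not DataChain's implementation or API: the record layout and the names `file_stem` and `merge_on_expression` are hypothetical, chosen only for this illustration. It joins two lists of records on a key computed from each row (the file stem of a path) instead of on a stored column.

```python
from pathlib import PurePosixPath


def file_stem(path: str) -> str:
    """Return the file name without its final extension ('a/cat.jpg' -> 'cat')."""
    return PurePosixPath(path).stem


def merge_on_expression(left, right, left_key, right_key, inner=False):
    """Join two lists of dicts on keys computed by the given callables.

    Hypothetical helper: unmatched left rows are kept with a None right side
    unless `inner` is True, in which case they are dropped.
    """
    # Index the right-hand rows by their computed key.
    index = {}
    for row in right:
        index.setdefault(right_key(row), []).append(row)

    merged = []
    for row in left:
        matches = index.get(left_key(row), [])
        if matches:
            merged.extend({"left": row, "right": match} for match in matches)
        elif not inner:
            merged.append({"left": row, "right": None})
    return merged


images = [{"path": "imgs/cat.jpg"}, {"path": "imgs/dog.jpg"}]
captions = [{"path": "meta/cat.json", "text": "a cat"}]

# Both sides are keyed by an expression over the path, not by the path itself.
result = merge_on_expression(
    images,
    captions,
    left_key=lambda r: file_stem(r["path"]),
    right_key=lambda r: file_stem(r["path"]),
)
```

Here the two paths differ in directory and extension, so a join on the raw path column would match nothing; the join on the computed stem pairs the cat image with its caption.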

@mattseddon mattseddon self-assigned this Sep 4, 2024

cloudflare-workers-and-pages bot commented Sep 4, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 04a425b
Status: ✅  Deploy successful!
Preview URL: https://c74f4386.datachain-documentation.pages.dev
Branch Preview URL: https://impl-299.datachain-documentation.pages.dev


col: sqlalchemy.ColumnClause = (
    sqlalchemy.column(column)
    if isinstance(column, str)
    else sqlalchemy.column(column.name, column.type)
Member Author

[F] It does not seem possible to overwrite the table of a column that already has its table set, i.e. test_merge_with_itself_column fails without this change.

Member Author

This change is no longer required for this PR to work. I'd be happy to split it out if required.

Contributor

Personally, I am OK with leaving it here. It feels natural to have this change here.


codecov bot commented Sep 4, 2024

Codecov Report

Attention: Patch coverage is 89.74359% with 4 lines in your changes missing coverage. Please review.

Project coverage is 87.31%. Comparing base (ae493e7) to head (04a425b).
Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/datachain/lib/dc.py | 89.18% | 2 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #388      +/-   ##
==========================================
+ Coverage   87.28%   87.31%   +0.02%     
==========================================
  Files          92       92              
  Lines        9960     9981      +21     
  Branches     2033     2041       +8     
==========================================
+ Hits         8694     8715      +21     
+ Misses        912      911       -1     
- Partials      354      355       +1     
Flag Coverage Δ
datachain 87.26% <89.74%> (+0.02%) ⬆️


@mattseddon mattseddon force-pushed the impl-299 branch 2 times, most recently from dfcda52 to 6b31c8a Compare September 5, 2024 03:19
@mattseddon mattseddon changed the title from "allow merge to accept a column element (merge on expressions)" to "allow merge on expressions" Sep 5, 2024
@mattseddon mattseddon force-pushed the impl-299 branch 2 times, most recently from 5b86816 to da07321 Compare September 6, 2024 01:47
@mattseddon mattseddon marked this pull request as ready for review September 6, 2024 04:29
@mattseddon mattseddon requested a review from a team September 6, 2024 04:29
Contributor

@dreadatour dreadatour left a comment

Looks good to me! 👍
A couple of comments below. Also, the Studio tests fail 😥, but that is not related to this PR, as far as I can see.

Comment on lines +275 to +278
if "." in name:
    name_path = name.split(".")
elif DEFAULT_DELIMITER in name:
    name_path = name.split(DEFAULT_DELIMITER)
Contributor

Just wondering: is it possible to have both a dot (.) and DEFAULT_DELIMITER (__) in a column name? 🤔

Member Author

I don't think so. Seems like one is always swapped for the other.
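
The branch order discussed in this thread can be sketched as a small standalone function. This is an illustration rather than the code from the PR: `split_name` is a hypothetical name, the `"__"` value of `DEFAULT_DELIMITER` is taken from the question above, and the single-segment fallback is an assumption, since the diff excerpt does not show what happens when neither separator is present.

```python
DEFAULT_DELIMITER = "__"  # value as quoted in the review question above


def split_name(name: str) -> list[str]:
    """Split a nested column name into its path segments.

    A dot takes priority over DEFAULT_DELIMITER, matching the if/elif order
    in the reviewed lines. Returning a single segment otherwise is an
    assumption of this sketch.
    """
    if "." in name:
        return name.split(".")
    if DEFAULT_DELIMITER in name:
        return name.split(DEFAULT_DELIMITER)
    return [name]
```

With this ordering, a name containing both separators would be split on the dot only, so `split_name("a.b__c")` yields `["a", "b__c"]`. That is the case the reviewer asks about and the author expects never to occur.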

*right_on
).db_signals() # type: ignore[assignment]

if len(right_on_columns) != len(on_columns):
Contributor

Are we losing this check after these changes, or am I missing something? 🤔

Member Author

The check is moved down. If any of the columns fail to resolve then we will have an entry in errors and we'll raise a DatasetMergeError.
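
The "collect errors, then raise once" pattern described in this reply can be sketched as follows. This is a generic illustration and not the code in src/datachain/lib/dc.py: `MergeError` is a hypothetical stand-in for DatasetMergeError, and `resolve_columns` with its dictionary lookup is an assumed simplification of column resolution.

```python
class MergeError(Exception):
    """Hypothetical stand-in for the DatasetMergeError mentioned above."""


def resolve_columns(names, available):
    """Resolve every requested name, collecting all failures before raising.

    `available` maps column names to resolved column objects (an assumption
    of this sketch). Reporting every unresolved name at once avoids the
    fix-one-rerun-fix-another loop of failing on the first error.
    """
    resolved, errors = [], []
    for name in names:
        if name in available:
            resolved.append(available[name])
        else:
            errors.append(f"cannot resolve column '{name}'")
    if errors:
        raise MergeError("; ".join(errors))
    return resolved
```

Calling `resolve_columns(["a", "x", "y"], {"a": 1})` raises a single `MergeError` whose message names both `x` and `y`.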


@mattseddon mattseddon merged commit 2444964 into main Sep 9, 2024
33 of 38 checks passed
@mattseddon mattseddon deleted the impl-299 branch September 9, 2024 00:10
Successfully merging this pull request may close these issues.

Introduce @ operator for File path manipulations
2 participants