Implement chain group_by #482

Merged
dreadatour merged 5 commits into main from 228-group-by
Oct 16, 2024

Contributor

@dreadatour dreadatour commented Sep 27, 2024

Implement chain group_by:

group_by.py:

from datachain import C, DataChain
from datachain.lib import func
from datachain.sql.functions.path import file_ext


res = (
    DataChain.from_storage("s3://dql-50k-laion-files/")
    .group_by(
        cnt=func.count(),
        total_size=func.sum("file.size"),
        avg_size=func.avg("file.size"),
        partition_by=file_ext(C("file__path")),
    )
)

res.show()

Run:

$ python group_by.py
Processed: 1 rows [00:00, 1085.76 rows/s]
Generated: 1 rows [00:00, 1162.82 rows/s]
Cleanup: 1 tables [00:00, 6615.62 tables/s]
Listing s3://dql-50k-laion-files: 129136 objects [05:07, 419.31 objects/s]
Processed: 1 rows [05:09, 309.83s/ rows]
Generated: 129136 rows [05:04, 423.69 rows/s]
Cleanup: 1 tables [00:00, 257.19 tables/s]
  file_ext    cnt  total_size      avg_size
0      jpg  43042  1079645149  2.508353e+04
1     json  43047    29743128  6.909454e+02
2  parquet      5    15378208  3.075642e+06
3      txt  43042     2927814  6.802226e+01
$

See also tests.
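
For readers unfamiliar with the aggregation, here is a minimal plain-Python sketch of what the grouping above computes: one row per file extension, with a count, a total size, and an average size. It does not use DataChain at all; the `files` list and the `group_by_ext` helper are hypothetical stand-ins for illustration only.

```python
import os
from collections import defaultdict

# Hypothetical (path, size) pairs standing in for listed storage objects.
files = [
    ("images/a.jpg", 100),
    ("images/b.jpg", 300),
    ("meta/a.json", 40),
    ("notes/readme.txt", 10),
]


def group_by_ext(rows):
    """Return {ext: {"cnt", "total_size", "avg_size"}} for (path, size) rows."""
    sizes = defaultdict(list)
    for path, size in rows:
        # Partition key: the file extension without the leading dot.
        ext = os.path.splitext(path)[1].lstrip(".")
        sizes[ext].append(size)
    return {
        ext: {
            "cnt": len(values),
            "total_size": sum(values),
            "avg_size": sum(values) / len(values),
        }
        for ext, values in sizes.items()
    }


result = group_by_ext(files)
for ext in sorted(result):
    print(ext, result[ext])
```

In the chain version, the same three aggregates are expressed with `func.count()`, `func.sum(...)`, and `func.avg(...)`, and the partition key with `partition_by`.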

@dreadatour dreadatour self-assigned this Sep 27, 2024
@dreadatour dreadatour marked this pull request as draft September 27, 2024 14:46

cloudflare-workers-and-pages bot commented Sep 27, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 56999e8
Status: ✅  Deploy successful!
Preview URL: https://027a0733.datachain-documentation.pages.dev
Branch Preview URL: https://228-group-by.datachain-documentation.pages.dev

@dreadatour dreadatour mentioned this pull request Sep 27, 2024
14 tasks

codecov bot commented Sep 30, 2024

Codecov Report

Attention: Patch coverage is 95.59748% with 7 lines in your changes missing coverage. Please review.

Project coverage is 87.25%. Comparing base (437898c) to head (56999e8).
Report is 4 commits behind head on main.

Files with missing lines          Patch %   Lines
src/datachain/query/dataset.py    69.23%    2 Missing and 2 partials ⚠️
src/datachain/lib/func/func.py    91.66%    1 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #482      +/-   ##
==========================================
+ Coverage   87.15%   87.25%   +0.10%     
==========================================
  Files          92       96       +4     
  Lines        9834     9943     +109     
  Branches     1348     1362      +14     
==========================================
+ Hits         8571     8676     +105     
- Misses        910      911       +1     
- Partials      353      356       +3     
Flag Coverage Δ
datachain 87.22% <95.59%> (+0.10%) ⬆️
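
The project figures in the diff above can be re-derived from the raw counts. The sketch below checks that hits, misses, and partials add up to the line totals and recomputes both percentages. Truncating to two decimals is an assumption: the report does not state its rounding rule, but truncation reproduces the displayed 87.15% and 87.25%.

```python
import math


def coverage_pct(hits, lines):
    """Coverage as a percentage of lines that were fully hit."""
    return 100.0 * hits / lines


def truncate2(value):
    """Truncate (not round) to two decimals; assumed display rule."""
    return math.floor(value * 100) / 100


# Counts copied from the coverage diff (base = main, head = #482).
base = {"lines": 9834, "hits": 8571, "misses": 910, "partials": 353}
head = {"lines": 9943, "hits": 8676, "misses": 911, "partials": 356}

for report in (base, head):
    # Every line is counted exactly once as a hit, a miss, or a partial.
    assert report["hits"] + report["misses"] + report["partials"] == report["lines"]

base_pct = coverage_pct(base["hits"], base["lines"])
head_pct = coverage_pct(head["hits"], head["lines"])
delta = round(head_pct - base_pct, 2)

print(truncate2(base_pct), truncate2(head_pct), delta)  # 87.15 87.25 0.1
```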

@dreadatour dreadatour force-pushed the 228-group-by branch 3 times, most recently from 9462406 to 89ec9d0 October 1, 2024 13:13
src/datachain/lib/func.py (outdated, resolved)
@dreadatour dreadatour marked this pull request as ready for review October 1, 2024 16:10
@dreadatour dreadatour requested a review from skshetry October 1, 2024 16:26
Contributor

@dtulga dtulga left a comment

LGTM, with just a couple of questions.

src/datachain/lib/dc.py (outdated, resolved)
def q(*columns):
    return grouped_query.with_only_columns(*columns)
cols = [
    subquery.c[str(c)] if isinstance(c, (str, C)) else c
Contributor

Any reason why query.selected_columns won't work here instead of a subquery?

Contributor Author

The reason is that if the query uses LIMIT, group_by would only operate on the limited set of rows. That's why I added a subquery here, to make sure the limit does not affect the aggregation.

Member

Are you sure that LIMIT is applied before GROUP BY in SQL? I thought it was applied after GROUP BY. For example:

-- create
CREATE TABLE EMPLOYEE_SALES (
  empId INTEGER NOT NULL,
  name TEXT NOT NULL,
  dept TEXT NOT NULL,
  sale_value FLOAT NOT NULL
);

-- insert
INSERT INTO EMPLOYEE_SALES VALUES (0001, 'Clark', 'Sales', 10);
INSERT INTO EMPLOYEE_SALES VALUES (0001, 'Clark', 'Sales', 10);
INSERT INTO EMPLOYEE_SALES VALUES (0001, 'Clark', 'Sales', 300);
INSERT INTO EMPLOYEE_SALES VALUES (0001, 'Clark', 'Sales',400);
INSERT INTO EMPLOYEE_SALES VALUES (0001, 'Clark', 'Sales',1);
INSERT INTO EMPLOYEE_SALES VALUES (0001, 'Clark', 'Sales',5);
INSERT INTO EMPLOYEE_SALES VALUES (0002, 'Dave', 'Accounting', 20);
INSERT INTO EMPLOYEE_SALES VALUES (0003, 'Ava', 'Sales',400);

-- fetch 
SELECT 
  count(*) as count, 
  sum(sale_value) as total_sales, 
  name 
FROM EMPLOYEE_SALES group by name limit 3;

yields:

1|400.0|Ava
6|726.0|Clark
1|20.0|Dave
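
The ordering question can be checked with a small self-contained sketch using Python's built-in `sqlite3` module and the same rows as above. It contrasts a LIMIT in the same SELECT as the GROUP BY (applied to the grouped result) with a LIMIT inside a subquery (applied to the rows before aggregation). The `ORDER BY` clauses are additions to make the output deterministic; whether DataChain's generated queries match either shape is not something this sketch shows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee_sales ("
    "emp_id INTEGER NOT NULL, name TEXT NOT NULL, "
    "dept TEXT NOT NULL, sale_value FLOAT NOT NULL)"
)
rows = [
    (1, "Clark", "Sales", 10),
    (1, "Clark", "Sales", 10),
    (1, "Clark", "Sales", 300),
    (1, "Clark", "Sales", 400),
    (1, "Clark", "Sales", 1),
    (1, "Clark", "Sales", 5),
    (2, "Dave", "Accounting", 20),
    (3, "Ava", "Sales", 400),
]
conn.executemany("INSERT INTO employee_sales VALUES (?, ?, ?, ?)", rows)

# LIMIT in the same SELECT: all rows are grouped first, then the
# grouped result is cut to three rows.
outer_limit = conn.execute(
    "SELECT count(*), sum(sale_value), name FROM employee_sales "
    "GROUP BY name ORDER BY name LIMIT 3"
).fetchall()

# LIMIT inside a subquery: only the first three source rows reach
# the aggregation, so the totals change.
inner_limit = conn.execute(
    "SELECT count(*), sum(sale_value), name FROM "
    "(SELECT * FROM employee_sales ORDER BY rowid LIMIT 3) "
    "GROUP BY name ORDER BY name"
).fetchall()

print(outer_limit)  # [(1, 400.0, 'Ava'), (6, 726.0, 'Clark'), (1, 20.0, 'Dave')]
print(inner_limit)  # [(3, 320.0, 'Clark')]
```

So a LIMIT only truncates the input of an aggregation when it sits in an inner query that the GROUP BY selects from.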

Contributor Author

Let me check again, but I added the subquery because of a limit issue (seen after using the .show() chain method). I'll get back to you with an answer.

Contributor Author

I cannot find a way to reproduce the limit issue in the current implementation, but removing the subquery causes other issues with SQL functions and labels.
My suggestion is to leave it as is for now, because getting rid of subqueries is a separate issue. Right now they are used heavily in all steps, even a basic select.

I have done some work on this and removed almost all subqueries, and in the end there is non performance boost without subqueries in both CLI and SaaS, but there were some issues with tests. @rlamy was also looking into this. I can create a follow-up issue to address the subqueries.

@dreadatour dreadatour merged commit c6ca542 into main Oct 16, 2024
38 checks passed
@dreadatour dreadatour deleted the 228-group-by branch October 16, 2024 07:38
@dreadatour dreadatour linked an issue Oct 20, 2024 that may be closed by this pull request
14 tasks
Successfully merging this pull request may close these issues.

Introduce group_by
4 participants