-
Notifications
You must be signed in to change notification settings - Fork 100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement chain group_by #482
Conversation
Deploying datachain-documentation with Cloudflare Pages
|
b2ded4b
to
bf6d2e9
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #482 +/- ##
==========================================
+ Coverage 87.15% 87.25% +0.10%
==========================================
Files 92 96 +4
Lines 9834 9943 +109
Branches 1348 1362 +14
==========================================
+ Hits 8571 8676 +105
- Misses 910 911 +1
- Partials 353 356 +3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
9462406
to
89ec9d0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, with just a couple of questions.
def q(*columns): | ||
return grouped_query.with_only_columns(*columns) | ||
cols = [ | ||
subquery.c[str(c)] if isinstance(c, (str, C)) else c |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason why query.selected_columns
won't work here instead of a subquery?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason is if we are using LIMIT
in the query, group_by
will only work within the limited amount of columns. That's why I've added subquery here, to be sure limit will not affect aggregation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure that limit is applied before group by in SQL? I thought it was applied after group by. I.e
-- create
CREATE TABLE EMPLOYEE_SALES (
empId INTEGER NOT NULL,
name TEXT NOT NULL,
dept TEXT NOT NULL,
sale_value FLOAT NOT NULL
);
-- insert
INSERT INTO EMPLOYEE_SALES VALUES (0001, 'Clark', 'Sales', 10);
INSERT INTO EMPLOYEE_SALES VALUES (0001, 'Clark', 'Sales', 10);
INSERT INTO EMPLOYEE_SALES VALUES (0001, 'Clark', 'Sales', 300);
INSERT INTO EMPLOYEE_SALES VALUES (0001, 'Clark', 'Sales',400);
INSERT INTO EMPLOYEE_SALES VALUES (0001, 'Clark', 'Sales',1);
INSERT INTO EMPLOYEE_SALES VALUES (0001, 'Clark', 'Sales',5);
INSERT INTO EMPLOYEE_SALES VALUES (0002, 'Dave', 'Accounting', 20);
INSERT INTO EMPLOYEE_SALES VALUES (0003, 'Ava', 'Sales',400);
-- fetch
SELECT
count(*) as count,
sum(sale_value) as total_sales,
name
FROM EMPLOYEE_SALES group by name limit 3;
yields:
Output:
1|400.0|Ava
6|726.0|Clark
1|20.0|Dave
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me check again, but I've added subquery because of the limit issue (after using .show()
chain method). I'll be back with the answer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can not find the way to reproduce issue with limit in current implementation, but removing subquery causes another issues with SQL functions and labels.
My suggestion is to leave it as is for now, because it is another issue — to get rid of subqueries. Right now they are actively used in all steps, even basic select.
I have some work done on this, removed almost all subqueries and in the end there is non performance boost without subqueries in both CLI and SaaS, but there was some issues with tests. Also @rlamy was looking into this. I can create a follow-up issue to solve subqueries issue.
60c5392
to
180bb4d
Compare
180bb4d
to
56999e8
Compare
Implement chain
group_by
:group_by.py
:Run:
See also tests.