feat: Use unified allocator for execution iterators #613
Conversation
Codecov Report

Additional details and impacted files:

@@ Coverage Diff @@
##               main     #613      +/-   ##
============================================
+ Coverage     33.42%   33.58%   +0.16%
- Complexity      805      828      +23
============================================
  Files           109      109
  Lines         42462    42531      +69
  Branches       9342     9344       +2
============================================
+ Hits          14191    14286      +95
+ Misses        25322    25296      -26
  Partials       2949     2949

View full report in Codecov by Sentry.
Force-pushed from 7d4899a to dd09e1a.
spark/src/test/scala/org/apache/spark/sql/CometTPCDSQuerySuite.scala
The OOM issue of some TPCDS queries in CI will be fixed by #639.
This only got failures on … but I don't see any details about the failure in the CI logs. I also cannot reproduce it locally.
Force-pushed from 2f64c7a to f5cac20.
"q70a",
// TODO: unknown failure (seems memory usage over Github action runner) in CI with q72-v2.7
// in https://github.com/apache/datafusion-comet/pull/613.
// "q72",
In the latest run, I saw Error: Process completed with exit code 143. Exit code 143 corresponds to SIGTERM (128 + 15), so it seems the memory usage exceeded the limit of the GitHub Actions runner and the process was terminated.
I found that a few particular queries (q72, q16) seem to use more memory than others. q72 cannot currently run with the sort merge join config in the CI runner due to its resource limits, but I can run it locally.
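For context, "the sort merge join config" here generally means steering Spark away from broadcast joins. A minimal sketch of the commonly used settings (standard Spark SQL configs, not necessarily the exact ones this suite sets):

```sql
-- Sketch: force Spark to plan sort merge joins.
-- Disable broadcast hash joins so shuffle-based joins are chosen instead.
SET spark.sql.autoBroadcastJoinThreshold=-1;
-- Prefer sort merge join over shuffled hash join (true by default).
SET spark.sql.join.preferSortMergeJoin=true;
```

With broadcast joins disabled, every join in a query like q72 becomes a shuffle-based join, which is why its memory profile changes so much under this config.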
I will investigate the two queries further, but they do not seem related to the changes here.
// TODO: unknown failure (seems memory usage over Github action runner) in CI with q72 in
// https://github.com/apache/datafusion-comet/pull/613.
// "q72",
I am +1 on skipping the official q72 query by default (because it is so ridiculous), especially in CI. However, maybe we should consider running an optimized version where the join order is sensible, which makes it at least 10x faster and uses far less memory. I will file a follow-on issue to discuss this.
The purpose of q72 is to test vendors' join-reordering rules, and that isn't really relevant to Spark or Comet since Spark queries typically don't have access to statistics.
This is the version I have been using locally. Since we are not aiming to run the official TPC-DS benchmarks, but just our derived benchmarks, and also given that we are comparing Spark to Comet on the same queries, I think this would be fine to use by default as long as it is well documented in our benchmarking guide.
I do think we should still test with the original q72 as a separate exercise though, because if Spark can run it then Comet should be able to as well (with the same memory configuration).
select i_item_desc
,w_warehouse_name
,d1.d_week_seq
,sum(case when p_promo_sk is null then 1 else 0 end) no_promo
,sum(case when p_promo_sk is not null then 1 else 0 end) promo
,count(*) total_cnt
from catalog_sales
join date_dim d1 on (cs_sold_date_sk = d1.d_date_sk)
join customer_demographics on (cs_bill_cdemo_sk = cd_demo_sk)
join household_demographics on (cs_bill_hdemo_sk = hd_demo_sk)
join item on (i_item_sk = cs_item_sk)
join inventory on (cs_item_sk = inv_item_sk)
join warehouse on (w_warehouse_sk = inv_warehouse_sk)
join date_dim d2 on (inv_date_sk = d2.d_date_sk)
join date_dim d3 on (cs_ship_date_sk = d3.d_date_sk)
left outer join promotion on (cs_promo_sk = p_promo_sk)
left outer join catalog_returns on (cr_item_sk = cs_item_sk and cr_order_number = cs_order_number)
where d1.d_week_seq = d2.d_week_seq
and inv_quantity_on_hand < cs_quantity
and d3.d_date > d1.d_date + 5
and hd_buy_potential = '501-1000'
and d1.d_year = 1999
and cd_marital_status = 'S'
group by i_item_desc,w_warehouse_name,d1.d_week_seq
order by total_cnt desc, i_item_desc, w_warehouse_name, d_week_seq
LIMIT 100;
> The purpose of q72 is to test vendors' join-reordering rules, and that isn't really relevant to Spark or Comet since Spark queries typically don't have access to statistics.

Btw, Spark does have the capacity to do join reordering if statistics are available, but it relies on enabling CBO features, which are disabled by default.
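As a sketch of what enabling that looks like (these are standard Spark SQL settings, not something this PR changes; `catalog_sales` is just used as an example table from the query above):

```sql
-- Sketch: enable Spark's cost-based optimizer and statistics-based join reordering.
SET spark.sql.cbo.enabled=true;
SET spark.sql.cbo.joinReorder.enabled=true;
-- CBO only kicks in once table and column statistics have been collected:
ANALYZE TABLE catalog_sales COMPUTE STATISTICS FOR ALL COLUMNS;
```

Without the ANALYZE step the optimizer has no statistics to work with, which is why join reordering effectively never applies in typical Spark deployments.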
> I am +1 on skipping running the official q72 query by default (because it is so ridiculous), especially in CI. However, maybe we should consider running an optimized version where the join order is sensible, which makes it at least 10x faster and uses far less memory. I will file a follow on issue to discuss this.
Sounds good to me.
> I do think we should still test with the original q72 as a separate exercise though, because if Spark can run it then Comet should be able to as well (with the same memory configuration).

Yeah. As I mentioned earlier, I will investigate q72 further to see why it requires extra memory in Comet. I'm just disabling it to unblock this PR.
conf.set(CometConf.COMET_SHUFFLE_ENFORCE_MODE_ENABLED.key, "true")
conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
Before we can close #387, we should either change the default for COMET_SHUFFLE_ENFORCE_MODE_ENABLED or remove it completely.
This can be a separate PR, but we should not close the issue when we merge this one.
Let me create another issue for this PR.
Created #648 for this PR.
LGTM. Thanks @viirya
Removing the debugging flags I added.
Merged. Thanks @andygrove
* feat: Use unified allocator for execution iterators
* Disable CometTakeOrderedAndProjectExec
* Add comment
* Increase heap memory
* Enable CometTakeOrderedAndProjectExec
* More
* More
* Reduce heap memory
* Run sort merge join TPCDS with -e for debugging
* Add -X flag
* Disable q72 and q72-v2.7
* Update .github/workflows/benchmark.yml
Which issue does this PR close?
Closes #648.
Relates to #387.
Rationale for this change
What changes are included in this PR?
How are these changes tested?