feat: Use unified allocator for execution iterators #613
Conversation
Codecov Report

Additional details and impacted files:

@@ Coverage Diff @@
##               main     #613      +/-   ##
============================================
+ Coverage     33.42%   33.58%   +0.16%
- Complexity      805      828      +23
============================================
  Files           109      109
  Lines         42462    42531      +69
  Branches       9342     9344       +2
============================================
+ Hits          14191    14286      +95
+ Misses        25322    25296      -26
  Partials       2949     2949

View full report in Codecov by Sentry.
Force-pushed from 7d4899a to dd09e1a.
spark/src/test/scala/org/apache/spark/sql/CometTPCDSQuerySuite.scala
The OOM issue of some TPCDS queries in CI will be fixed by #639.
This only got failures on … but I don't see any details about the failure in the CI logs. I also cannot reproduce it locally.
Force-pushed from 2f64c7a to f5cac20.
"q70a",
// TODO: unknown failure (seems memory usage over Github action runner) in CI with q72-v2.7
// in https://github.com/apache/datafusion-comet/pull/613.
// "q72",
In the latest run, I saw Error: Process completed with exit code 143. Exit code 143 corresponds to SIGTERM (128 + 15), so it seems the memory usage exceeded the limit of the GitHub Actions runner and the process was terminated.
I found that a few particular queries (q72, q16) seem to use more memory than others. q72 cannot currently run with the sort merge join config in the CI runner due to its resource limits, but I can run it locally.
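For context, "the sort merge join config" here generally means steering Spark away from broadcast joins. A minimal sketch of the commonly used settings (standard Spark SQL configs, not necessarily the exact ones this suite sets):

```sql
-- Sketch: force Spark to plan sort merge joins.
-- Disable broadcast hash joins so shuffle-based joins are chosen instead.
SET spark.sql.autoBroadcastJoinThreshold=-1;
-- Prefer sort merge join over shuffled hash join (true by default).
SET spark.sql.join.preferSortMergeJoin=true;
```

With broadcast joins disabled, every join in a query like q72 becomes a shuffle-based join, which is why its memory profile changes so much under this config.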
I will investigate the two queries further, but they do not seem related to the changes here.
// TODO: unknown failure (seems memory usage over Github action runner) in CI with q72 in
// https://github.com/apache/datafusion-comet/pull/613.
// "q72",
I am +1 on skipping the official q72 query by default (because it is so ridiculous), especially in CI. However, maybe we should consider running an optimized version where the join order is sensible, which makes it at least 10x faster and uses far less memory. I will file a follow-on issue to discuss this.
The purpose of q72 is to test vendors' join-reordering rules, and that isn't really relevant to Spark or Comet since Spark queries typically don't have access to statistics.
This is the version I have been using locally. Since we are not aiming to run the official TPC-DS benchmarks, but just our derived benchmarks, and also given that we are comparing Spark to Comet on the same queries, I think this would be fine to use by default as long as it is well documented in our benchmarking guide.
I do think we should still test with the original q72 as a separate exercise though, because if Spark can run it then Comet should be able to as well (with the same memory configuration).
select i_item_desc
,w_warehouse_name
,d1.d_week_seq
,sum(case when p_promo_sk is null then 1 else 0 end) no_promo
,sum(case when p_promo_sk is not null then 1 else 0 end) promo
,count(*) total_cnt
from catalog_sales
join date_dim d1 on (cs_sold_date_sk = d1.d_date_sk)
join customer_demographics on (cs_bill_cdemo_sk = cd_demo_sk)
join household_demographics on (cs_bill_hdemo_sk = hd_demo_sk)
join item on (i_item_sk = cs_item_sk)
join inventory on (cs_item_sk = inv_item_sk)
join warehouse on (w_warehouse_sk = inv_warehouse_sk)
join date_dim d2 on (inv_date_sk = d2.d_date_sk)
join date_dim d3 on (cs_ship_date_sk = d3.d_date_sk)
left outer join promotion on (cs_promo_sk = p_promo_sk)
left outer join catalog_returns on (cr_item_sk = cs_item_sk and cr_order_number = cs_order_number)
where d1.d_week_seq = d2.d_week_seq
and inv_quantity_on_hand < cs_quantity
and d3.d_date > d1.d_date + 5
and hd_buy_potential = '501-1000'
and d1.d_year = 1999
and cd_marital_status = 'S'
group by i_item_desc,w_warehouse_name,d1.d_week_seq
order by total_cnt desc, i_item_desc, w_warehouse_name, d_week_seq
LIMIT 100;
> The purpose of q72 is to test vendors' join-reordering rules, and that isn't really relevant to Spark or Comet since Spark queries typically don't have access to statistics.

Btw, Spark does have the capacity to do join reordering if statistics are available, but it relies on enabling CBO features, which are disabled by default.
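As a sketch of what enabling that looks like (these are standard Spark SQL settings, not something this PR changes; `catalog_sales` is just used as an example table from the query above):

```sql
-- Sketch: enable Spark's cost-based optimizer and statistics-based join reordering.
SET spark.sql.cbo.enabled=true;
SET spark.sql.cbo.joinReorder.enabled=true;
-- CBO only kicks in once table and column statistics have been collected:
ANALYZE TABLE catalog_sales COMPUTE STATISTICS FOR ALL COLUMNS;
```

Without the ANALYZE step the optimizer has no statistics to work with, which is why join reordering effectively never applies in typical Spark deployments.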
> I am +1 on skipping running the official q72 query by default (because it is so ridiculous), especially in CI. However, maybe we should consider running an optimized version where the join order is sensible, which makes it at least 10x faster and uses far less memory. I will file a follow on issue to discuss this.
Sounds good to me.
> I do think we should still test with the original q72 as a separate exercise though, because if Spark can run it then Comet should be able to as well (with the same memory configuration).

Yeah. As I mentioned earlier, I will investigate q72 further to see why it requires extra memory in Comet. I'm just disabling it to unblock this PR.
conf.set(CometConf.COMET_SHUFFLE_ENFORCE_MODE_ENABLED.key, "true")
conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
Before we can close #387, we should either change the default for COMET_SHUFFLE_ENFORCE_MODE_ENABLED or remove it completely.
This can be a separate PR, but we should not close the issue when we merge this one.
Let me create another issue for this PR.
Created #648 for this PR.
LGTM. Thanks @viirya
Removing the debugging flags I added.
Merged. Thanks @andygrove
* feat: Use unified allocator for execution iterators
* Disable CometTakeOrderedAndProjectExec
* Add comment
* Increase heap memory
* Enable CometTakeOrderedAndProjectExec
* More
* More
* Reduce heap memory
* Run sort merge join TPCDS with -e for debugging
* Add -X flag
* Disable q72 and q72-v2.7
* Update .github/workflows/benchmark.yml
Which issue does this PR close?
Closes #648.
Relates to #387.
Rationale for this change
What changes are included in this PR?
How are these changes tested?