
fix: Various metrics bug fixes and improvements #1111

Open
wants to merge 21 commits into main
Conversation

@andygrove (Member) commented Nov 22, 2024

Which issue does this PR close?

Closes #1109
Closes #1003
Closes #1110
Closes #935

Rationale for this change

We currently drop some native metrics due to a design flaw in the current metrics code: it assumes that the native plan maps 1:1 to the Spark plan, which is often not true. See the issue for more details.

Improvement 1: Fix bug where metrics were being dropped in some cases

Here are before and after images for a BuildRight hash join where we insert an extra projection on the native side, breaking the assumption that there is a 1:1 mapping between the Spark plan and the native plan:

Before: Screenshot from 2024-11-22 11-44-21

After: Screenshot from 2024-11-22 14-21-34

Improvement 2: Report Arrow FFI time for passing batches from JVM to native

We now report the ScanExec time spent transferring batches from the JVM to native code. The following example shows a total scan time of 16.4 seconds, plus an additional 17.7 seconds spent transferring those batches to the native side for the filter operation.

Screenshot from 2024-11-24 08-21-41
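Below is a minimal sketch of how this kind of transfer time can be captured with DataFusion's Time metric; the helper name and wiring are illustrative and not taken from the PR's actual code.

```rust
use arrow::record_batch::RecordBatch;
use datafusion::physical_plan::metrics::Time;

/// Run a JVM batch fetch under a scoped timer so that the elapsed
/// wall-clock time is accumulated into the `jvm_fetch_time` metric.
fn timed_jvm_fetch<F>(jvm_fetch_time: &Time, fetch: F) -> Option<RecordBatch>
where
    F: FnOnce() -> Option<RecordBatch>,
{
    // The guard adds the elapsed time to the metric when it is dropped
    // at the end of this scope.
    let _timer = jvm_fetch_time.timer();
    fetch()
}
```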

What changes are included in this PR?

The native planner now builds a tree of SparkPlan nodes that maps 1:1 to the original Spark plan. Each SparkPlan node can reference multiple native plans whose metrics should be collected and attributed to that Spark operator.
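As a rough sketch, the structure described above might look like the following; the field and method names here are inferred from the discussion and code snippets in this PR, not copied verbatim from the implementation.

```rust
use std::sync::Arc;
use datafusion::physical_plan::ExecutionPlan;

/// One node in a tree that mirrors the original Spark plan 1:1.
struct SparkPlan {
    /// Identifier of the corresponding Spark operator, used to route metrics
    plan_id: u32,
    /// The primary native plan for this Spark operator
    native_plan: Arc<dyn ExecutionPlan>,
    /// Children mirroring the Spark plan tree rather than the native tree
    children: Vec<Arc<SparkPlan>>,
    /// Extra native operators (e.g. an injected ProjectionExec or CopyExec)
    /// whose metrics should also be attributed to this Spark operator
    additional_native_plans: Vec<Arc<dyn ExecutionPlan>>,
}

impl SparkPlan {
    /// Constructor used when the translation injects extra native operators
    fn new_with_additional(
        plan_id: u32,
        native_plan: Arc<dyn ExecutionPlan>,
        children: Vec<Arc<SparkPlan>>,
        additional_native_plans: Vec<Arc<dyn ExecutionPlan>>,
    ) -> Self {
        Self {
            plan_id,
            native_plan,
            children,
            additional_native_plans,
        }
    }
}
```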

How are these changes tested?

Existing tests, plus new unit tests in the planner.

@andygrove (Member Author) commented:

@viirya @parthchandra @mbutrovich This is still WIP but let me know what you think of the overall approach here if you have time.

Current status is that we now log the metrics that we are dropping. Here are two examples from TPC-H q3.

We wrap an aggregate in a projection, causing:

Dropping the AggregateExec elapsed_compute time of 1820330 for plan ProjectionExec (#624)

The input to a SortExec is a ScanExec to fetch the input batches from the JVM, and we drop those metrics:

Dropping the ScanExec elapsed_compute time of 1151562087 for plan SortExec (#0)
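For example, with the structure described in the PR description, the ScanExec that feeds a SortExec could be attached as an additional native plan so its metrics are attributed to the Spark sort operator instead of being dropped. This is illustrative only and relies on the hypothetical SparkPlan sketch above:

```rust
use std::sync::Arc;
use datafusion::physical_plan::ExecutionPlan;

/// Build the SparkPlan node for a sort, keeping the JVM-feeding ScanExec
/// around purely for metrics collection (illustrative, not the PR's code).
fn wrap_sort(
    plan_id: u32,
    sort_exec: Arc<dyn ExecutionPlan>,
    scan_exec: Arc<dyn ExecutionPlan>,
    child: Arc<SparkPlan>,
) -> Arc<SparkPlan> {
    Arc::new(SparkPlan::new_with_additional(
        plan_id,
        sort_exec,
        vec![child],
        vec![scan_exec],
    ))
}
```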

Comment on lines +969 to +974
```rust
Arc::new(SparkPlan::new_with_additional(
    spark_plan.plan_id,
    projection,
    vec![child],
    vec![aggregate],
)),
```
@andygrove (Member Author) commented Nov 22, 2024

This is an example where we are currently dropping the aggregate metrics and only capturing the projection metrics.
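For illustration, here is a rough sketch of how metrics from those additional plans could be folded back in rather than dropped, assuming the hypothetical SparkPlan fields sketched in the PR description (the actual aggregation in this PR may differ):

```rust
use datafusion::physical_plan::ExecutionPlan;

/// Sum elapsed_compute across the primary native plan and any additional
/// native plans attached to this Spark operator.
fn total_elapsed_compute(spark_plan: &SparkPlan) -> usize {
    std::iter::once(&spark_plan.native_plan)
        .chain(spark_plan.additional_native_plans.iter())
        .filter_map(|plan| plan.metrics())
        .filter_map(|metrics| metrics.elapsed_compute())
        .sum()
}
```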

@mbutrovich (Contributor) commented:

My initial thoughts:

@andygrove (Member Author) commented Nov 22, 2024

Some progress!

Before: Screenshot from 2024-11-22 11-44-21

After: Screenshot from 2024-11-22 14-21-34

@andygrove (Member Author) commented:

We now have metrics for all operators showing the time spent fetching batches from the JVM.

Screenshot from 2024-11-22 15-35-34

@andygrove changed the title from "fix: [WIP] Stop dropping metrics" to "fix: Stop dropping metrics and expose CopyExec and ScanExec in Spark SQL Metrics" on Nov 22, 2024
@andygrove changed the title from "fix: Stop dropping metrics and expose CopyExec and ScanExec in Spark SQL Metrics" to "fix: Stop dropping metrics" on Nov 22, 2024
@parthchandra (Contributor) commented:

Approach looks good (though I cannot say I understand it completely). The results are definitely what we wanted!

@andygrove (Member Author) commented:

I can possibly break this down into some smaller PRs as well. I may do that.

```diff
@@ -365,28 +378,23 @@ struct ScanStream<'a> {
     scan: ScanExec,
     /// Schema representing the data
     schema: SchemaRef,
     /// Metrics
```
A reviewer (Contributor) commented:

Is it dropped because it repeats what we have on SparkPlan?

@andygrove (Member Author) replied:

I have reverted some of these changes now.

```diff
-        partition: usize,
-        baseline_metrics: BaselineMetrics,
-    ) -> Self {
+    pub fn new(scan: ScanExec, schema: SchemaRef, partition: usize, jvm_fetch_time: Time) -> Self {
```
A reviewer (Contributor) commented:

Perhaps jvm_fetch_time is enough for now, but if you want to expand the metrics in the future, it might be better to have a wrapper structure similar to BaselineMetrics?
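A sketch of the kind of wrapper being suggested, analogous to BaselineMetrics; the struct and metric names here are illustrative, while MetricBuilder and Time come from DataFusion's metrics module:

```rust
use datafusion::physical_plan::metrics::{ExecutionPlanMetricsSet, MetricBuilder, Time};

/// Groups the scan-side metrics in one place so that new metrics can be
/// added later without changing the ScanStream::new signature again.
struct ScanStreamMetrics {
    /// Time spent pulling batches from the JVM over Arrow FFI
    jvm_fetch_time: Time,
}

impl ScanStreamMetrics {
    fn new(metrics: &ExecutionPlanMetricsSet, partition: usize) -> Self {
        Self {
            jvm_fetch_time: MetricBuilder::new(metrics)
                .subset_time("jvm_fetch_time", partition),
        }
    }
}
```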

@andygrove changed the title from "fix: Stop dropping metrics" to "fix: Various metrics bug fixes and improvements" on Nov 24, 2024
@codecov-commenter commented Nov 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 34.43%. Comparing base (b74bfe4) to head (ff5076d).
Report is 12 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1111      +/-   ##
============================================
+ Coverage     34.33%   34.43%   +0.09%     
- Complexity      898      901       +3     
============================================
  Files           115      115              
  Lines         42986    43477     +491     
  Branches       9369     9506     +137     
============================================
+ Hits          14761    14971     +210     
- Misses        25361    25615     +254     
- Partials       2864     2891      +27     
