
fix: Various metrics bug fixes and improvements #1111

Open
wants to merge 21 commits into main
Conversation

@andygrove (Member) commented Nov 22, 2024

Which issue does this PR close?

Closes #1109
Closes #1003
Closes #1110
Closes #935

Rationale for this change

We currently drop some native metrics due to a design flaw in the current metrics code: it assumes that the native plan maps 1:1 to the Spark plan, which is often not true. See the issue for more details.

Improvement 1: Fix bug where metrics were being dropped in some cases

Here are before and after images for a BuildRight hash join where we insert an extra projection on the native side, breaking the assumption that there is a 1:1 mapping between the Spark plan and the native plan:

Before: Screenshot from 2024-11-22 11-44-21

After: Screenshot from 2024-11-22 14-21-34

Improvement 2: Report Arrow FFI time for passing batches from JVM to native

We now report the ScanExec time spent transferring batches from the JVM to native code. The following example shows a total scan time of 16.4 seconds, plus an additional 17.7 seconds spent transferring those batches to the native side for the filter operation.

Screenshot from 2024-11-24 08-21-41
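Below is a minimal sketch of how this kind of transfer time can be captured with DataFusion's Time metric; the helper name and wiring are illustrative and not taken from the PR's actual code.

```rust
use arrow::record_batch::RecordBatch;
use datafusion::physical_plan::metrics::Time;

/// Run a JVM batch fetch under a scoped timer so that the elapsed
/// wall-clock time is accumulated into the `jvm_fetch_time` metric.
fn timed_jvm_fetch<F>(jvm_fetch_time: &Time, fetch: F) -> Option<RecordBatch>
where
    F: FnOnce() -> Option<RecordBatch>,
{
    // The guard adds the elapsed time to the metric when it is dropped
    // at the end of this scope.
    let _timer = jvm_fetch_time.timer();
    fetch()
}
```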

What changes are included in this PR?

The native planner now builds a tree of SparkPlan nodes that maps 1:1 to the original Spark plan. Each SparkPlan node can reference multiple native plans whose metrics should be collected and attributed to that Spark operator.
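As a rough sketch, the structure described above might look like the following; the field and method names here are inferred from the discussion and code snippets in this PR, not copied verbatim from the implementation.

```rust
use std::sync::Arc;
use datafusion::physical_plan::ExecutionPlan;

/// One node in a tree that mirrors the original Spark plan 1:1.
struct SparkPlan {
    /// Identifier of the corresponding Spark operator, used to route metrics
    plan_id: u32,
    /// The primary native plan for this Spark operator
    native_plan: Arc<dyn ExecutionPlan>,
    /// Children mirroring the Spark plan tree rather than the native tree
    children: Vec<Arc<SparkPlan>>,
    /// Extra native operators (e.g. an injected ProjectionExec or CopyExec)
    /// whose metrics should also be attributed to this Spark operator
    additional_native_plans: Vec<Arc<dyn ExecutionPlan>>,
}

impl SparkPlan {
    /// Constructor used when the translation injects extra native operators
    fn new_with_additional(
        plan_id: u32,
        native_plan: Arc<dyn ExecutionPlan>,
        children: Vec<Arc<SparkPlan>>,
        additional_native_plans: Vec<Arc<dyn ExecutionPlan>>,
    ) -> Self {
        Self {
            plan_id,
            native_plan,
            children,
            additional_native_plans,
        }
    }
}
```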

How are these changes tested?

Existing tests, plus new unit tests in the planner.

@andygrove (Member Author) commented:

@viirya @parthchandra @mbutrovich This is still WIP but let me know what you think of the overall approach here if you have time.

Current status is that we now log the metrics that we are dropping. Here are two examples from TPC-H q3.

We wrap an aggregate in a projection, causing:

Dropping the AggregateExec elapsed_compute time of 1820330 for plan ProjectionExec (#624)

The input to a SortExec is a ScanExec to fetch the input batches from the JVM, and we drop those metrics:

Dropping the ScanExec elapsed_compute time of 1151562087 for plan SortExec (#0)
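For example, with the structure described in the PR description, the ScanExec that feeds a SortExec could be attached as an additional native plan so its metrics are attributed to the Spark sort operator instead of being dropped. This is illustrative only and relies on the hypothetical SparkPlan sketch above:

```rust
use std::sync::Arc;
use datafusion::physical_plan::ExecutionPlan;

/// Build the SparkPlan node for a sort, keeping the JVM-feeding ScanExec
/// around purely for metrics collection (illustrative, not the PR's code).
fn wrap_sort(
    plan_id: u32,
    sort_exec: Arc<dyn ExecutionPlan>,
    scan_exec: Arc<dyn ExecutionPlan>,
    child: Arc<SparkPlan>,
) -> Arc<SparkPlan> {
    Arc::new(SparkPlan::new_with_additional(
        plan_id,
        sort_exec,
        vec![child],
        vec![scan_exec],
    ))
}
```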

Comment on lines +969 to +974
```rust
Arc::new(SparkPlan::new_with_additional(
    spark_plan.plan_id,
    projection,
    vec![child],
    vec![aggregate],
)),
```
@andygrove (Member Author) commented Nov 22, 2024

This is an example where we are currently dropping the aggregate metrics and only capturing the projection metrics.
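For illustration, here is a rough sketch of how metrics from those additional plans could be folded back in rather than dropped, assuming the hypothetical SparkPlan fields sketched in the PR description (the actual aggregation in this PR may differ):

```rust
use datafusion::physical_plan::ExecutionPlan;

/// Sum elapsed_compute across the primary native plan and any additional
/// native plans attached to this Spark operator.
fn total_elapsed_compute(spark_plan: &SparkPlan) -> usize {
    std::iter::once(&spark_plan.native_plan)
        .chain(spark_plan.additional_native_plans.iter())
        .filter_map(|plan| plan.metrics())
        .filter_map(|metrics| metrics.elapsed_compute())
        .sum()
}
```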

@mbutrovich (Contributor) commented:

My initial thoughts:

@andygrove (Member Author) commented Nov 22, 2024

Some progress!

Before: Screenshot from 2024-11-22 11-44-21

After: Screenshot from 2024-11-22 14-21-34

@andygrove (Member Author) commented:

We now have metrics for all operators showing the time spent fetching batches from the JVM.

Screenshot from 2024-11-22 15-35-34

@andygrove changed the title from "fix: [WIP] Stop dropping metrics" to "fix: Stop dropping metrics and expose CopyExec and ScanExec in Spark SQL Metrics" on Nov 22, 2024
@andygrove changed the title from "fix: Stop dropping metrics and expose CopyExec and ScanExec in Spark SQL Metrics" to "fix: Stop dropping metrics" on Nov 22, 2024
@parthchandra (Contributor) commented:

Approach looks good (though I cannot say I understand it completely). The results are definitely what we wanted!

@andygrove (Member Author) commented:

I can possibly break this down into some smaller PRs as well. I may do that.

```diff
@@ -365,28 +378,23 @@ struct ScanStream<'a> {
     scan: ScanExec,
     /// Schema representing the data
     schema: SchemaRef,
     /// Metrics
```
A reviewer (Contributor) commented:

Is it dropped because it repeats what we have on SparkPlan?

@andygrove (Member Author) replied:

I have reverted some of these changes now.

```diff
-        partition: usize,
-        baseline_metrics: BaselineMetrics,
-    ) -> Self {
+    pub fn new(scan: ScanExec, schema: SchemaRef, partition: usize, jvm_fetch_time: Time) -> Self {
```
A reviewer (Contributor) commented:

Perhaps jvm_fetch_time is enough for now, but if you want to expand the metrics in the future, it might be better to have a wrapper structure similar to BaselineMetrics?
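A sketch of the kind of wrapper being suggested, analogous to BaselineMetrics; the struct and metric names here are illustrative, while MetricBuilder and Time come from DataFusion's metrics module:

```rust
use datafusion::physical_plan::metrics::{ExecutionPlanMetricsSet, MetricBuilder, Time};

/// Groups the scan-side metrics in one place so that new metrics can be
/// added later without changing the ScanStream::new signature again.
struct ScanStreamMetrics {
    /// Time spent pulling batches from the JVM over Arrow FFI
    jvm_fetch_time: Time,
}

impl ScanStreamMetrics {
    fn new(metrics: &ExecutionPlanMetricsSet, partition: usize) -> Self {
        Self {
            jvm_fetch_time: MetricBuilder::new(metrics)
                .subset_time("jvm_fetch_time", partition),
        }
    }
}
```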

@andygrove changed the title from "fix: Stop dropping metrics" to "fix: Various metrics bug fixes and improvements" on Nov 24, 2024
@codecov-commenter commented Nov 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 34.43%. Comparing base (b74bfe4) to head (ff5076d).
Report is 12 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1111      +/-   ##
============================================
+ Coverage     34.33%   34.43%   +0.09%     
- Complexity      898      901       +3     
============================================
  Files           115      115              
  Lines         42986    43477     +491     
  Branches       9369     9506     +137     
============================================
+ Hits          14761    14971     +210     
- Misses        25361    25615     +254     
- Partials       2864     2891      +27     
