Avoid extra copies in `CoalesceBatchesExec` to improve performance #7957

alamb · 2023-10-27T17:40:04Z

Is your feature request related to a problem or challenge?

While looking at TPCH query performance for #6782 I noticed several times that CoalesceBatchesExec takes non trivial amounts of time (like 5% of the overall query time)

Here is a specific examples

Create Data:

cd arrow-datafusion/benchmarks
./bench.sh  data tpch10

Run query with datafusion-cli:

cd arrow-datafusion/benchmarks/data/tpch_sf10
datafusion-cli -c "select	o_year,	sum(case	when nation = 'BRAZIL' then volume	else 0	end) / sum(volume) as mkt_share	from	(	select	extract(year from o_orderdate) as o_year,	l_extendedprice * (1 - l_discount) as volume,	n2.n_name as nation	from	part,	supplier,	lineitem,	orders,	customer,	nation n1,	nation n2,	region	where	p_partkey = l_partkey	and s_suppkey = l_suppkey	and l_orderkey = o_orderkey	and o_custkey = c_custkey	and c_nationkey = n1.n_nationkey	and n1.n_regionkey = r_regionkey	and r_name = 'AMERICA'	and s_nationkey = n2.n_nationkey	and o_orderdate between date '1995-01-01' and date '1996-12-31'	and p_type = 'ECONOMY ANODIZED STEEL'	) as all_nations	group by	o_year	order by	o_year;"

select	o_year,
  sum(case	when nation = 'BRAZIL' then volume	else 0	end) / sum(volume) as mkt_share
from (
  select
    extract(year from o_orderdate) as o_year,
    l_extendedprice * (1 - l_discount) as volume,
    n2.n_name as nation
  from
    part,
    supplier,
    lineitem,
    orders,
    customer,
    nation n1,
    nation n2,
    region
  where
    p_partkey = l_partkey
    and s_suppkey = l_suppkey
    and l_orderkey = o_orderkey
    and o_custkey = c_custkey
    and c_nationkey = n1.n_nationkey
    and n1.n_regionkey = r_regionkey
    and r_name = 'AMERICA'
    and s_nationkey = n2.n_nationkey
    and o_orderdate between date '1995-01-01' and date '1996-12-31'	and p_type = 'ECONOMY ANODIZED STEEL'	) as all_nations
  group by
 	o_year
  order by
  	o_year;"

Here is the full EXPLAIN ANLAYZE output:
explan-analyze-q8.txt

A small subset shows there is a single CoalesceBatchesExec that takes 3 seconds (elapsed_compute=3.066514072s):

CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=59986052, elapsed_compute=3.066514072s]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
  RepartitionExec: partitioning=Hash([l_partkey@1], 16), input_partitions=16, metrics=[fetch_time=16.321190026s, repart_time=10.382230411s, send_time=4.650058274s]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
     ParquetExec: file_groups={16 groups: [[Users/alamb/Software/arrow-datafusion/benchmarks/data/tpch_sf10

I profiled the query and confirmed that CoalesceBatchExec takes 5% of the overall time, as shown in this screen shot

In diagrams this looks like

┌────────────────────┐        Filter                                                                          
│                    │                    ┌────────────────────┐            Coalesce                          
│                    │    ─ ─ ─ ─ ─ ─ ▶   │    RecordBatch     │             Batches                          
│    RecordBatch     │                    │   num_rows = 234   │─ ─ ─ ─ ─ ┐                                   
│  num_rows = 8000   │                    └────────────────────┘                                              
│                    │                                                    │                                   
│                    │                                                                ┌────────────────────┐  
└────────────────────┘                                                    │           │                    │  
┌────────────────────┐                    ┌────────────────────┐                      │                    │  
│                    │        Filter      │                    │          │           │                    │  
│                    │                    │    RecordBatch     │           ─ ─ ─ ─ ─ ▶│                    │  
│    RecordBatch     │    ─ ─ ─ ─ ─ ─ ▶   │   num_rows = 500   │─ ─ ─ ─ ─ ┐           │                    │  
│  num_rows = 8000   │                    │                    │                      │    RecordBatch     │  
│                    │                    │                    │          └ ─ ─ ─ ─ ─▶│  num_rows = 8000   │  
│                    │                    └────────────────────┘                      │                    │  
└────────────────────┘                                                                │                    │  
                                                    ...                    ─ ─ ─ ─ ─ ▶│                    │  
          ...                   ...                                       │           │                    │  
                                                                                      │                    │  
┌────────────────────┐                                                    │           └────────────────────┘  
│                    │                    ┌────────────────────┐                                              
│                    │       Filter       │                    │          │                                   
│    RecordBatch     │                    │    RecordBatch     │                                              
│  num_rows = 8000   │   ─ ─ ─ ─ ─ ─ ▶    │   num_rows = 333   │─ ─ ─ ─ ─ ┘                                   
│                    │                    │                    │                                              
│                    │                    └────────────────────┘                                              
└────────────────────┘                                                                                        
                                                                                                              
                      FilterExec                                          RepartitonExec copies the data      
                      creates output batches with copies                  *again* to form final large         
                      of  the matching rows (calls take()                 RecordBatches                       
                      to make a copy)

Describe the solution you'd like

I think we can avoid this overhead by combining the behavior of CoalesceBatchesExec into the operators that make small batches (FilterExec, JoinExec, and RepartitionExec). Something like

┌────────────────────┐        Filter                                                                        
│                    │                    ┌────────────────────┐          Filter output                     
│                    │    ─ ─ ─ ─ ─ ─ ▶   │        mask        │                                            
│    RecordBatch     │                    │   (BooleanArray)   │─ ─ ─ ─ ─ ┐                                 
│  num_rows = 8000   │                    └────────────────────┘                                            
│                    │                                                    │                                 
│                    │                                                                ┌────────────────────┐
└────────────────────┘                                                    │           │                    │
┌────────────────────┐                    ┌────────────────────┐                      │                    │
│                    │        Filter      │                    │          │           │                    │
│                    │                    │        mask        │           ─ ─ ─ ─ ─ ▶│                    │
│    RecordBatch     │    ─ ─ ─ ─ ─ ─ ▶   │   (BooleanArray)   │─ ─ ─ ─ ─ ┐           │                    │
│  num_rows = 8000   │                    │                    │                      │    RecordBatch     │
│                    │                    │                    │          └ ─ ─ ─ ─ ─▶│  num_rows = 8000   │
│                    │                    └────────────────────┘                      │                    │
└────────────────────┘                                                                │                    │
                                                    ...                    ─ ─ ─ ─ ─ ▶│                    │
          ...                   ...                                       │           │                    │
                                                                                      │                    │
┌────────────────────┐                                                    │           └────────────────────┘
│                    │                    ┌────────────────────┐                                            
│                    │       Filter       │                    │          │                                 
│    RecordBatch     │                    │        mask        │                                            
│  num_rows = 8000   │   ─ ─ ─ ─ ─ ─ ▶    │   (BooleanArray)   │─ ─ ─ ─ ─ ┘                                 
│                    │                    │                    │                                            
│                    │                    └────────────────────┘                                            
└────────────────────┘                                                                                      
                                                                                                            
                      FilterExec et all                             The Exec then copies rows from          
                      internally buffers RecordBatches and          multiple input batches (based on the    
                      filter results until enough rows are          masks) into a single new output         
                      ready                                         RecordBatch avoiding the second data    
                                                                    copy

The idea would be to take the core coalesce logic from CoalesceBatchesExec that calls concat_batches And instead of creating new small record batches in FilterExec, HashJoinExec, and RepartitionExec buffer the inputs until there are at least target_batch_size rows available, and then call interleave instead

Here is the code in CoalesceBatchesExec that could be adapted:
https://github.com/apache/arrow-datafusion/blob/a9d66e2b492843c2fb335a7dfe27fed073629b09/datafusion/physical-plan/src/coalesce_batches.rs#L215-L283

Here is where FilterExec makes the potentially small record batches
https://github.com/apache/arrow-datafusion/blob/a9d66e2b492843c2fb335a7dfe27fed073629b09/datafusion/physical-plan/src/filter.rs#L294-L306

The same think would be done in RepartitionExec: https://github.com/apache/arrow-datafusion/blob/a9d66e2b492843c2fb335a7dfe27fed073629b09/datafusion/physical-plan/src/repartition/mod.rs#L193-L218

Describe alternatives you've considered

No response

Additional context

No response

The text was updated successfully, but these errors were encountered:

tustvold · 2023-10-27T17:56:45Z

Is the majority of the concat_batches overhead string concatenation? Just wondering if this is something that StringView might help with?

alamb · 2023-10-27T18:04:40Z

From an internal conversation, @tustvold points out that interleave is fair bit slower than the filter kernel, especially for primitives

The current code uses filter then concat_batches

The proposal would use interleave

Thus we may have to potentially special case how the columns are handled depending on type (e.g. what kernels are called for what operations)

Dandandan · 2023-11-30T08:22:34Z

I tried (https://github.com/Dandandan/arrow-datafusion/tree/buffer_batches) to convert the code to use interleave instead of concat (CoalesceBatches) (for hash repartitioning), but this leads to no improvement whatsoever (even a small slowdown):

┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃     main ┃ buffer_batches ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  93.31ms │        90.88ms │     no change │
│ QQuery 2     │  26.18ms │        28.39ms │  1.08x slower │
│ QQuery 3     │  53.96ms │        57.86ms │  1.07x slower │
│ QQuery 4     │  52.33ms │        51.77ms │     no change │
│ QQuery 5     │ 114.93ms │       125.69ms │  1.09x slower │
│ QQuery 6     │   9.38ms │         9.20ms │     no change │
│ QQuery 7     │ 211.60ms │       227.30ms │  1.07x slower │
│ QQuery 8     │  61.54ms │        59.56ms │     no change │
│ QQuery 9     │  60.66ms │        63.80ms │  1.05x slower │
│ QQuery 10    │ 114.31ms │       114.95ms │     no change │
│ QQuery 11    │  20.12ms │        20.54ms │     no change │
│ QQuery 12    │  59.98ms │        59.26ms │     no change │
│ QQuery 13    │  53.37ms │        56.42ms │  1.06x slower │
│ QQuery 14    │  18.47ms │        18.63ms │     no change │
│ QQuery 15    │  58.99ms │        59.31ms │     no change │
│ QQuery 16    │  22.79ms │        22.27ms │     no change │
│ QQuery 17    │  54.41ms │        53.59ms │     no change │
│ QQuery 18    │ 158.61ms │       162.68ms │     no change │
│ QQuery 19    │  36.98ms │        34.78ms │ +1.06x faster │
│ QQuery 20    │  62.49ms │        62.04ms │     no change │
│ QQuery 21    │ 251.42ms │       256.99ms │     no change │
│ QQuery 22    │  14.35ms │        14.14ms │     no change │
└──────────────┴──────────┴────────────────┴───────────────┘

Dandandan · 2023-11-30T16:47:27Z

Another finding to add to the list is that changing RepartitionExec to do a filter (with hash repartitioning) is a bit slower than take or interleave on tpc-h benchmarks.

alamb · 2023-11-30T21:35:07Z

Thanks for the info @Dandandan -- those are some interesting results

alamb · 2024-07-23T20:31:16Z

I made some pictures for what I had in mind

This is what happens today (FilterExec + RepartitionExec):

┌────────────────────┐        Filter                                                                          
│                    │                    ┌────────────────────┐            Coalesce                          
│                    │    ─ ─ ─ ─ ─ ─ ▶   │    RecordBatch     │             Batches                          
│    RecordBatch     │                    │   num_rows = 234   │─ ─ ─ ─ ─ ┐                                   
│  num_rows = 8000   │                    └────────────────────┘                                              
│                    │                                                    │                                   
│                    │                                                                ┌────────────────────┐  
└────────────────────┘                                                    │           │                    │  
┌────────────────────┐                    ┌────────────────────┐                      │                    │  
│                    │        Filter      │                    │          │           │                    │  
│                    │                    │    RecordBatch     │           ─ ─ ─ ─ ─ ▶│                    │  
│    RecordBatch     │    ─ ─ ─ ─ ─ ─ ▶   │   num_rows = 500   │─ ─ ─ ─ ─ ┐           │                    │  
│  num_rows = 8000   │                    │                    │                      │    RecordBatch     │  
│                    │                    │                    │          └ ─ ─ ─ ─ ─▶│  num_rows = 8000   │  
│                    │                    └────────────────────┘                      │                    │  
└────────────────────┘                                                                │                    │  
                                                    ...                    ─ ─ ─ ─ ─ ▶│                    │  
          ...                   ...                                       │           │                    │  
                                                                                      │                    │  
┌────────────────────┐                                                    │           └────────────────────┘  
│                    │                    ┌────────────────────┐                                              
│                    │       Filter       │                    │          │                                   
│    RecordBatch     │                    │    RecordBatch     │                                              
│  num_rows = 8000   │   ─ ─ ─ ─ ─ ─ ▶    │   num_rows = 333   │─ ─ ─ ─ ─ ┘                                   
│                    │                    │                    │                                              
│                    │                    └────────────────────┘                                              
└────────────────────┘                                                                                        
                                                                                                              
                      FilterExec                                          RepartitonExec copies the data      
                      creates output batches with copies                  *again* to form final large         
                      of  the matching rows (calls take()                 RecordBatches                       
                      to make a copy)

Instead do it like this (only buffer the filter mask results and create the final output in one go):

┌────────────────────┐        Filter                                                                        
│                    │                    ┌────────────────────┐          Filter output                     
│                    │    ─ ─ ─ ─ ─ ─ ▶   │        mask        │                                            
│    RecordBatch     │                    │   (BooleanArray)   │─ ─ ─ ─ ─ ┐                                 
│  num_rows = 8000   │                    └────────────────────┘                                            
│                    │                                                    │                                 
│                    │                                                                ┌────────────────────┐
└────────────────────┘                                                    │           │                    │
┌────────────────────┐                    ┌────────────────────┐                      │                    │
│                    │        Filter      │                    │          │           │                    │
│                    │                    │        mask        │           ─ ─ ─ ─ ─ ▶│                    │
│    RecordBatch     │    ─ ─ ─ ─ ─ ─ ▶   │   (BooleanArray)   │─ ─ ─ ─ ─ ┐           │                    │
│  num_rows = 8000   │                    │                    │                      │    RecordBatch     │
│                    │                    │                    │          └ ─ ─ ─ ─ ─▶│  num_rows = 8000   │
│                    │                    └────────────────────┘                      │                    │
└────────────────────┘                                                                │                    │
                                                    ...                    ─ ─ ─ ─ ─ ▶│                    │
          ...                   ...                                       │           │                    │
                                                                                      │                    │
┌────────────────────┐                                                    │           └────────────────────┘
│                    │                    ┌────────────────────┐                                            
│                    │       Filter       │                    │          │                                 
│    RecordBatch     │                    │        mask        │                                            
│  num_rows = 8000   │   ─ ─ ─ ─ ─ ─ ▶    │   (BooleanArray)   │─ ─ ─ ─ ─ ┘                                 
│                    │                    │                    │                                            
│                    │                    └────────────────────┘                                            
└────────────────────┘                                                                                      
                                                                                                            
                      FilterExec et all                             The Exec then copies rows from          
                      internally buffers RecordBatches and          multiple input batches (based on the    
                      filter results until enough rows are          masks) into a single new output         
                      ready                                         RecordBatch avoiding the second data    
                                                                    copy

alamb · 2024-07-23T20:33:22Z

Maybe we could add a filter_multi kernel along side https://docs.rs/arrow/latest/arrow/compute/kernels/filter/fn.filter.html

The signature would be like

/// filters paris of arrays / predicates into a single output arrauy
pub fn filter_multi(
    input: impl IntoIterator<Item = (&dyn Array, &BooleanArray>)
) -> Result<Arc<dyn Array>, ArrowError>

And then the internal machinery could be the same

2010YOUY01 · 2024-07-25T09:57:31Z

TPCH-Q1's expensive CoalesceBatches might be easier to solve 🤔
Unlike TPCH-Q8 (looks like with more selective predicate), Q1's filter only throws away a small number of rows:

input_batch(4096 rows) -> [FILTER] -> output_batch(~4000 rows)

And the following CoalesceBatches condition will be triggered every time to copy large output batches

datafusion/datafusion/physical-plan/src/coalesce_batches.rs

Line 228 in 49d9d45

if self.buffered_rows >= self.target_batch_size {

But if it didn't do coalescing, the output batch still benefits from vectorization, so maybe this coalescing threshold can be better tuned to like if self.buffered_rows >= 0.6 * self.target_batch_size {

I remember I tried before, but the overall performance on Q1 is like 2% improvement, I think it's possible to set a better threshold for triggering coalescing.

alamb · 2024-07-25T10:15:49Z

Thanks @2010YOUY01 -- that sounds neat. Also thanks for the tip related to Q1 as a good candidate query this could improve.

In general I agree there are likely easier ways to improve specific queries. I am still bullish on the more general idea off trying to avoid the copy (I think it would be good for several percentage points across many queries), though it will be a fairly substantial taks

I am going to spend some more time prototyping and see how it goes

alamb · 2024-07-29T16:22:50Z

I am pretty happy with how this is headed.

I have a PR to start refactoring the code here #11610

I am also feeling good about my WIP prototype

alamb added the enhancement New feature or request label Oct 27, 2023

alamb mentioned this issue Oct 27, 2023

[EPIC] A list of performance improvement tickets #5546

Open

29 tasks

Dandandan changed the title ~~Avoid extra copies in CoalesceBatchesExec to improve performace~~ Avoid extra copies in CoalesceBatchesExec to improve performance Nov 18, 2023

alamb mentioned this issue Jul 23, 2024

Reduce copying in CoalesceBatchesExec for StringViews #11628

Open

alamb mentioned this issue Jul 23, 2024

Extract CoalesceBatchesStream to a struct #11610

Merged

This was referenced Jul 25, 2024

Prototype combined Repartition/Filter + Coalesce (WIP) #11647

Closed

[Epic] High cardinality aggregation performance wishlist #11679

Open

alamb self-assigned this Jul 29, 2024

This was referenced Jul 29, 2024

DataFusion weekly project plan (Andrew Lamb) - July 29, 2024 #11710

Closed

Reuse hash #11708

Closed

alamb mentioned this issue Aug 5, 2024

DataFusion weekly project plan (Andrew Lamb) - Aug 5, 2024 #11826

Closed

6 tasks

This was referenced Aug 14, 2024

DataFusion weekly project plan (Andrew Lamb) - Aug 12, 2024 #11986

Closed

Minor: Extract BatchCoalescer to its own module #12047

Merged

jayzhan211 mentioned this issue Oct 9, 2024

[DISCUSSION] Make DataFusion the fastest engine for querying parquet data in ClickBench #12821

Closed

alamb mentioned this issue Nov 5, 2024

Optimize take/filter from multiple input arrays to a single large output array apache/arrow-rs#6692

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid extra copies in `CoalesceBatchesExec` to improve performance #7957

Avoid extra copies in `CoalesceBatchesExec` to improve performance #7957

alamb commented Oct 27, 2023 •

edited

Loading

tustvold commented Oct 27, 2023

alamb commented Oct 27, 2023

Dandandan commented Nov 30, 2023 •

edited

Loading

Dandandan commented Nov 30, 2023 •

edited

Loading

alamb commented Nov 30, 2023

alamb commented Jul 23, 2024

alamb commented Jul 23, 2024

2010YOUY01 commented Jul 25, 2024

alamb commented Jul 25, 2024

alamb commented Jul 29, 2024

Avoid extra copies in CoalesceBatchesExec to improve performance #7957

Avoid extra copies in CoalesceBatchesExec to improve performance #7957

Comments

alamb commented Oct 27, 2023 • edited Loading

Is your feature request related to a problem or challenge?

Create Data:

Run query with datafusion-cli:

Describe the solution you'd like

Describe alternatives you've considered

Additional context

tustvold commented Oct 27, 2023

alamb commented Oct 27, 2023

Dandandan commented Nov 30, 2023 • edited Loading

Dandandan commented Nov 30, 2023 • edited Loading

alamb commented Nov 30, 2023

alamb commented Jul 23, 2024

alamb commented Jul 23, 2024

2010YOUY01 commented Jul 25, 2024

alamb commented Jul 25, 2024

alamb commented Jul 29, 2024

Avoid extra copies in `CoalesceBatchesExec` to improve performance #7957

Avoid extra copies in `CoalesceBatchesExec` to improve performance #7957

alamb commented Oct 27, 2023 •

edited

Loading

Dandandan commented Nov 30, 2023 •

edited

Loading

Dandandan commented Nov 30, 2023 •

edited

Loading