
feat(7181): cascading loser tree merges #7379

Closed
wants to merge 30 commits into from

Conversation


@wiedld wiedld commented Aug 22, 2023

Which issue does this PR close?

External sorting (cascading merges) of the internal-sorted (in-memory) SortPreservingMergeStream.

Closes #7181

Rationale for this change

The current loser tree sort is O(n log k), handling all incoming record batches in a single loser tree merge.
The planned change is a cascaded merge with a fan-in of up to 10 streams per merge, which does not change the overall asymptotic upper bound -- but does open the door to additional performance improvements (such as multi-core execution).
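As a back-of-envelope check (a sketch, not code from this PR): with a fan-in of F, a cascade over k input streams needs ceil(log_F(k)) merge levels, and since each level passes over all n rows once, the total work stays O(n log k).

```rust
// Back-of-envelope sketch (not the PR's code): count how many merge
// levels a cascade needs for `num_streams` inputs with a given fan-in.
fn cascade_levels(num_streams: usize, fan_in: usize) -> usize {
    assert!(fan_in >= 2, "a merge needs at least two inputs");
    let mut streams = num_streams;
    let mut levels = 0;
    while streams > 1 {
        // Each level merges groups of `fan_in` streams into one stream each.
        streams = (streams + fan_in - 1) / fan_in; // ceiling division
        levels += 1;
    }
    levels
}
```

For example, 100 input streams with fan-in 10 collapse in two levels (100 → 10 → 1), and 1000 streams in three.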

Performance change

UPDATED
(caveat: on local machine)

Benchmarking -- both branches with 10 threads on M1 Pro chip.
[Screenshot: benchmark comparison, 2023-09-02]

What changes are included in this PR?

How it works:

  • Each merge uses the same code (in-memory sorting using the sort-preserving-merge loser tree).
  • The sort_order from previous merges is yielded/streamed into the next merge in the cascade tree.
  • The root node of the cascade tree (the final merge) constructs and yields the output record batches.

Performance considerations:

  • Focused on not adding too much code to the inner loop of the merge.
  • Each merge sorts only the normalized keys (accessed via the cursors).
  • The distribution function (i.e., how streams are distributed between merges) is order preserving.
    • At each merge, the loser tree (a min-heap) holds at most one cursor per input stream at any given moment.
    • Merge stream outputs are inputs to the next merge stream.
    • Therefore, any downstream (cascaded) merges are also ordered and cannot advance beyond each other.
  • Record batch slicing is expensive. Therefore, only the cursors are sliced (and passed around in merges).
    • Cursors have the normalized key for comparison.
    • Batches are tracked (write once, read multiple).
  • Streams are buffered and multithreaded (when possible).
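To make the order-preserving cascade concrete, here is a minimal toy sketch (not the PR's code): sorted vectors of integers stand in for cursor streams, and a binary min-heap stands in for the loser tree. At any moment the heap holds at most one entry per input run, mirroring the invariant described above.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Toy k-way merge: a binary min-heap stands in for the loser tree used by
// the real SortPreservingMergeStream. The heap holds at most one entry
// per input run at any given moment.
fn merge_runs(runs: Vec<Vec<i32>>) -> Vec<i32> {
    let mut heap = BinaryHeap::new();
    for (run_idx, run) in runs.iter().enumerate() {
        if let Some(&first) = run.first() {
            heap.push(Reverse((first, run_idx, 0usize)));
        }
    }
    let mut out = Vec::with_capacity(runs.iter().map(Vec::len).sum());
    while let Some(Reverse((val, run_idx, pos))) = heap.pop() {
        out.push(val);
        if let Some(&next) = runs[run_idx].get(pos + 1) {
            heap.push(Reverse((next, run_idx, pos + 1)));
        }
    }
    out
}

// The cascade: merge groups of at most `fan_in` runs per level, feeding
// each level's outputs into the next, until a single sorted run remains.
fn cascade_merge(mut runs: Vec<Vec<i32>>, fan_in: usize) -> Vec<i32> {
    while runs.len() > 1 {
        runs = runs
            .chunks(fan_in)
            .map(|group| merge_runs(group.to_vec()))
            .collect();
    }
    runs.pop().unwrap_or_default()
}
```

Because each intermediate merge emits its rows in sorted order, feeding those outputs into the next level preserves order end-to-end, which is the property the real cascade relies on.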

Are these changes tested?

Current tests pass if, and only if, the corresponding change in arrow-rs is also linked.
See the temporary commit 50c8636.

Updated: removed the need for additional changes in arrow-rs.

Are there any user-facing changes?

Changes are to internal APIs only.

@github-actions github-actions bot added the core Core DataFusion crate label Aug 22, 2023
//! Merge that deals with an arbitrary size of streaming inputs.
//! This is an order-preserving merge.

use crate::physical_plan::metrics::BaselineMetrics;
@wiedld wiedld Aug 22, 2023

This mod is moved code; the only change is to use the cascaded SortPreservingCascadeStream.


/// Perform a streaming merge of [`SendableRecordBatchStream`] based on provided sort expressions
/// while preserving order.
pub fn streaming_merge(
@wiedld (Contributor Author)

Since we added a layer between streaming_merge() and SortPreservingMergeStream (a.k.a. the cascading stream layer), we decided to move this function to its own mod.

@Dandandan

@wiedld are you sure of the benchmark results? TPC-H 17 doesn't contain any sorting, and other queries do not have very expensive sorts.

@alamb

alamb commented Aug 23, 2023

BTW I think the TPCH benchmark is not one that will likely show the power of this improvement, because as @Dandandan says the TPCH queries don't actually have any large sorts that I know of. I think we need to try the Sort benchmarks. I can help with this.

@alamb

alamb commented Aug 23, 2023

In addition to the sort benchmark (which might have its own issues) here is a suggested testing methodology:

  1. Use this dataset: traces_nd_random.zip (220MB):

  2. Build a release build of datafusion-cli (both on main and on this branch, via `cargo build --release`)

  3. Compare performance of this command (which will resort the input data randomly and write it to an output file)

datafusion-cli -c "copy (select * from 'traces_nd_random.parquet' order by time desc) to '/tmp/test.parquet'"

@alamb

alamb commented Aug 23, 2023

@wiedld and I spoke a bit this afternoon and I think the next steps for this PR are to get a query that shows significant performance improvements. I think the one in #7379 (comment) is a good candidate

I don't really understand the code in this PR yet, but the way I suggest adding more parallelism is by "buffering" the streams: rather than computing everything on demand with poll_next, spawn an explicit tokio::task for each input stream that will try to pull the next input while the current task is merging.

Maybe @crepererum or @tustvold can help with a suggestion on how to do the "add buffering/new tasks" in a reasonable rust way
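The buffering idea described above can be sketched with std threads and a bounded channel standing in for tokio tasks (illustrative only, not what the PR would literally use): each producer pre-fetches its next batch while the consumer is busy merging the previous one.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

// Illustrative sketch of the buffering idea: a producer thread fetches
// batches into a bounded channel, so the next batch is being prepared
// while the consumer is merging the current one. (std threads and
// channels stand in for tokio tasks here.)
fn buffered_stream(batches: Vec<Vec<i32>>) -> Receiver<Vec<i32>> {
    // Capacity 1: one batch can be produced ahead of the consumer.
    let (tx, rx) = sync_channel(1);
    thread::spawn(move || {
        for batch in batches {
            // In the real PR this would poll the underlying input stream.
            if tx.send(batch).is_err() {
                break; // consumer hung up
            }
        }
    });
    rx
}
```

A merge loop would then iterate over one receiver per input stream, consuming batches that were fetched concurrently instead of pulled on demand.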

@wiedld wiedld force-pushed the 7181/cascading-loser-tree-merges branch from 601d321 to 76fee2d Compare August 29, 2023 22:00
wiedld added 11 commits August 29, 2023 15:11
…o the same structure.

During the later stages of the cascade merge, we will no longer be sorting based on each streaming batch (one cursor at a time). Instead, merges of previous steps in the cascade will reference a previous sort_order per [batch_idx][row_idx]. Therefore, in order to keep the same set of Cursors, we are moving the input and output structures more closely together.

Later optimizations may be able to decouple these again.
…ge mod.

The merge mod has the SortPreservingMergeStream, containing the loser tree. This SortPreservingMergeStream struct will be used repeatedly as part of the cascading merge; in turn, the cascading merge will be implemented for the streaming_merge() method.
SortPreservingCascadeStream currently has a single root node of SortPreservingMergeStream.
TODO: build out the tree of SortPreservingMergeStream.
…ortPreservingCascadeStream doing the final interleave().

This commit knowingly fails for tests which are utilizing multiple polls to return all record batches. Specifically:
* dataframe::tests::with_column_renamed_join
* physical_plan::sorts::sort_preserving_merge::tests::test_partition_sort_streaming_input_output

TODO: splicing the RecordBatch and Cursor per merge yield.
This requires slicing the batches and cursors, when yielded in parts.

These two tests, reliant on multiple-polls of streaming data, now pass:
* dataframe::tests::with_column_renamed_join
* physical_plan::sorts::sort_preserving_merge::tests::test_partition_sort_streaming_input_output
* OffsetCursorStream enables the same RowCursorStream (with the same RowConverter) to be used across multiple leaf nodes.
* Each tree node is a merge (a.k.a. SortPreservingMergeStream).
* YieldedCursorStream enables the output from the previous merge, to be provided as input to the next merge.
…tches.

This also enabled simplification of code and cursor handling in the SortOrderBuilder.
@wiedld wiedld force-pushed the 7181/cascading-loser-tree-merges branch from 76fee2d to 173577b Compare August 29, 2023 22:16
@wiedld

wiedld commented Aug 29, 2023

Pushed another 2 commits to handle the planned improvements (reducing known expensive operations). Updated the PR description to show that the current sort benchmark results are (roughly) unchanged -- which was our hope.

Moving on to the next step, which is moving towards buffering and multi-core.

@wiedld

wiedld commented Sep 13, 2023

Updated branch and resolved conflicts.

Note that CI will fail because we have a temporary commit pointing to the arrow-rs branch changes, and it therefore fails the cargo package audit. But at least it's easier to pull the branch and run it locally.

Note: we are intentionally delaying code review, due to higher priorities.


@alamb alamb left a comment


Wow @wiedld -- I am very impressed by this PR. I found the code easy to read, which is saying something given how complicated the entire area is. That being said, I don't yet fully understand the changes here and am still working through them.

I also locally tested using this #7379 (comment) but that workload was dominated by parquet reading / writing so I couldn't see a difference

I am in the process of trying to reproduce your reported benchmark results

Thank you very much.

datafusion/core/src/physical_plan/sorts/builder.rs Outdated Show resolved Hide resolved
datafusion/core/src/physical_plan/sorts/cascade.rs Outdated Show resolved Hide resolved
@alamb

alamb commented Sep 14, 2023

Here are the results of one of my benchmark runs. Very impressive @wiedld

--------------------
Benchmark sort.json
--------------------
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ Query               ┃  main_base ┃ 7181_cascading-loser-tree-merges ┃         Change ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ Qsort utf8          │ 72379.31ms │                        5969.19ms │ +12.13x faster │
│ Qsort int           │ 89964.04ms │                        6578.05ms │ +13.68x faster │
│ Qsort decimal       │ 73612.54ms │                       10893.99ms │  +6.76x faster │
│ Qsort integer tuple │ 98054.80ms │                       98251.46ms │      no change │
│ Qsort utf8 tuple    │ 73188.81ms │                        6900.60ms │ +10.61x faster │
│ Qsort mixed tuple   │ 81693.99ms │                       30029.59ms │  +2.72x faster │
└─────────────────────┴────────────┴──────────────────────────────────┴────────────────┘

@alamb

alamb commented Sep 14, 2023

Given what I have seen with this PR, I think we should proceed with this PR -- @tustvold do you have time to give it a review as well?

There is likely to be an arrow release in the next few days -- @wiedld can you prepare a PR with whatever changes you need in arrow, so we can make this PR mergeable (by relying on a released version of arrow)?

@wiedld wiedld force-pushed the 7181/cascading-loser-tree-merges branch from 4720f19 to 3786021 Compare September 15, 2023 07:17
/// 1. [`BatchCursorStream`] yields the initial cursors and batches. (e.g. a RowCursorStream)
/// 2. [`BatchTrackingStream`] collects the batches, to avoid passing those around. Yields a [`CursorStream`](super::stream::CursorStream).
/// 3. This initial CursorStream is for a number of partitions (e.g. 100).
/// 4. The initial single CursorStream is shared across multiple leaf nodes, using [`OffsetCursorStream`].

@wiedld wiedld Sep 15, 2023


Not in the notes: the reason a single input CursorStream is shared across leaves is so that they share the same RowConverter. See the updated comment.

After the arrow-rs version bump, I will try to slightly change this design. The goal is to remove the mutex around the BatchTrackingStream, and instead have the lock only on a BatchTracker consumed by the OffsetCursorStream (and of course also consumed in the final interleave at the cascade stream root).
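The batch-tracker idea can be sketched as follows (type and method names are hypothetical stand-ins, not the PR's actual code; plain vectors stand in for RecordBatch): the lock lives only on the tracker, which is written once per batch and read by id during the final interleave.

```rust
use std::sync::Mutex;

// Hypothetical sketch of a batch tracker with interior mutability:
// batches are stored once under a lock and later fetched by id, so the
// lock is scoped to the tracker rather than wrapping the whole stream.
#[derive(Default)]
struct BatchTracker {
    batches: Mutex<Vec<Vec<i32>>>, // stand-in for Vec<RecordBatch>
}

impl BatchTracker {
    // Write once: store a batch and hand back its id.
    fn add(&self, batch: Vec<i32>) -> usize {
        let mut guard = self.batches.lock().unwrap();
        guard.push(batch);
        guard.len() - 1
    }

    // Read many: fetch the batch (here by clone) for the final interleave.
    fn get(&self, id: usize) -> Vec<i32> {
        self.batches.lock().unwrap()[id].clone()
    }
}
```

Because `add` and `get` take `&self`, the tracker can be shared (e.g. behind an `Arc`) between the cursor streams and the cascade root without a lock around the streams themselves.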

@wiedld (Contributor Author)

Mutex removed. This explainer is also updated to reflect the latest design.

@wiedld wiedld marked this pull request as ready for review September 15, 2023 17:59
///
/// Unique representation of sliced cursor is denoted by the [`SlicedBatchCursorIdentifier`].
#[derive(Debug)]
pub struct BatchCursor<C: Cursor> {

@wiedld wiedld Sep 19, 2023


This is used in the CursorStream (not the BatchCursorStream, which includes the actual record batches). Therefore, I think this could have a better name.

It wraps the cursor, and maps it to the original (tracked) batch -- as well as tracking the sliced offset. Naming ideas?

wiedld added 2 commits October 5, 2023 14:28
  * Move record batch tracking into its own abstraction with interior mutability
  * Split streams instead of locking, which removes the need to poll per offset subset.
  * As a reflection of this reduced responsibility, rename OffsetCursorStream to BatchTrackerStream.
@wiedld wiedld force-pushed the 7181/cascading-loser-tree-merges branch from 79e80a6 to f97cc4d Compare October 6, 2023 05:33
RecordBatchReceiverStream::builder(self.schema(), input_partitions);
ReceiverStream::builder(self.schema(), input_partitions);
let input =
Arc::new(RecordBatchReceiverStreamAdaptor::new(self.input.clone()));

@wiedld wiedld Oct 10, 2023


RecordBatchReceiverStream was made generic in this commit, such that it can handle either a buffered stream of record batches or the sort_orders (yielded per merge node).

To make it generic, the following was done:

  • create a StreamAdapter trait, whose StreamAdapter::call() is used by ReceiverStream::run_input().
  • impl a RecordBatchReceiverStreamAdaptor that is used for record batches.

Please let me know if I should have structured this differently.
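A rough sketch of the adapter pattern described above (the trait shape and all names here are assumptions for illustration, not the PR's exact API; plain vectors stand in for record batches and std threads for tokio tasks):

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::thread;

// Hypothetical sketch: a generic receiver stream buffers items of any
// type, and an adapter trait supplies the logic that drives one input
// and feeds the buffer.
trait StreamAdapter: Send + 'static {
    type Item: Send + 'static;

    // Drive the input to completion, sending each item into the buffer.
    fn call(self, tx: SyncSender<Self::Item>);
}

// Adapter for plain vectors, standing in for record batches (or the
// sort_orders yielded per merge node).
struct VecAdapter(Vec<Vec<i32>>);

impl StreamAdapter for VecAdapter {
    type Item = Vec<i32>;

    fn call(self, tx: SyncSender<Self::Item>) {
        for item in self.0 {
            if tx.send(item).is_err() {
                break; // receiver dropped
            }
        }
    }
}

// Generic "receiver stream": spawn the adapter and return the buffered end.
fn receiver_stream<A: StreamAdapter>(adapter: A) -> Receiver<A::Item> {
    let (tx, rx) = sync_channel(2);
    thread::spawn(move || adapter.call(tx));
    rx
}
```

The point of the indirection is that the receiver/buffering machinery is written once, while each item type only needs its own small adapter impl.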

@alamb

alamb commented Oct 30, 2023

Since we are working this PR through in pieces, marking this PR as draft so it is clear it is not waiting on feedback

@alamb alamb marked this pull request as draft October 30, 2023 20:38

github-actions bot commented May 2, 2024

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label May 2, 2024
@wiedld

wiedld commented May 2, 2024

Working on other things. If/when we circle back, we'll be recreating differently.

@wiedld wiedld closed this May 2, 2024
Labels
core Core DataFusion crate Stale PR has not had any activity for some time

Successfully merging this pull request may close these issues.

Improve performance of large sorts with Cascaded merge / tree
3 participants