Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-34698: [C++][Acero] Add node that emits explicit ordering after asserting order #44738

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

EnricoMi
Copy link
Contributor

@EnricoMi EnricoMi commented Nov 15, 2024

Rationale for this change

An acero node that turns an implicit ordering into an explicit ordering (rows sorted by some columns) is useful to re-use order that already exists in the data.

What changes are included in this PR?

This PR adds the AssertOrderNode that implements this logic. The Scanner employs that node to turn the implicit ordering of the ScanNode into an explicit order as defined by user code via ScanBuilder.Ordering.

Are these changes tested?

There are unit tests for the AssertOrderNode as well as for the ScanNode and ScanBuilder.

Are there any user-facing changes?

The following options has been added:

  • the ordering option added to ScanOptions
  • the Ordering option added to ScannerBuilder
  • the AssertNodeOptions class

@EnricoMi EnricoMi requested a review from westonpace as a code owner November 15, 2024 09:29
Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@EnricoMi EnricoMi changed the title GH-#34698: [C++] Add node that emits explicit ordering after asserting order GH-#34698: [C++][Acero] Add node that emits explicit ordering after asserting order Nov 15, 2024
@EnricoMi EnricoMi changed the title GH-#34698: [C++][Acero] Add node that emits explicit ordering after asserting order GH-34698: [C++][Acero] Add node that emits explicit ordering after asserting order Nov 15, 2024
Copy link

⚠️ GitHub issue #34698 has been automatically assigned in GitHub to PR creator.

Copy link
Member

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am probably not qualified to review this thoroughly but the CI failure is related:


[  FAILED  ] TestScannerThreading/TestScanner.ScanOrderingFail/0Serial1d1b1024r, where GetParam() = serial-1d-1b-1024i (4 ms)
[ RUN      ] TestScannerThreading/TestScanner.ScanOrderingFail/1Serial2d16b1024r
/arrow/cpp/src/arrow/dataset/scanner_test.cc:1040: Failure
Expected equality of these values:
  next.status().message()
    Which is: "Data is not ordered\n/arrow/cpp/src/arrow/acero/assert_order_node.cc:190  AssertInBatchOrder(*record_batch, ordering_, comparators_)\n/arrow/cpp/src/arrow/acero/source_node.cc:158  output_->InputReceived(this, std::move(batch))"
  "Data is not ordered"
With diff:
@@ -1,3 @@
 Data is not ordered
-/arrow/cpp/src/arrow/acero/assert_order_node.cc:190  AssertInBatchOrder(*record_batch, ordering_, comparators_)
-/arrow/cpp/src/arrow/acero/source_node.cc:158  output_->InputReceived(this, std::move(batch))

There seems to be some stray output on that specific job.

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting review Awaiting review awaiting changes Awaiting changes labels Nov 19, 2024
@EnricoMi EnricoMi force-pushed the assert-order-node branch 2 times, most recently from 7750094 to 2e30308 Compare November 20, 2024 10:54
@EnricoMi
Copy link
Contributor Author

@raulcd thanks, assertions adjusted

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Conceptually this makes sense to me. I like that it verifies the ordering and it's good use of a sequencing queue. I can't speak much to the code in compare.h / compare.cc as I'm not very familiar with it.

It's a brand new node and I don't see much change to existing paths so I'm not too concerned about a regression.

cpp/src/arrow/acero/exec_plan.h Outdated Show resolved Hide resolved
Comment on lines 879 to 883
Visit(const T& left_) {
Visit(const T& left) {
const auto& right = checked_cast<const T&>(right_);
result_ = right.value == left_.value;
result_ = left.value == right.value;
return Status::OK();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change related to the PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverted

Ordering ordering = assert_options.ordering;

// check output ordering
if (ordering.is_implicit() || ordering.is_unordered()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should you use !ordering.is_explicit()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, fixed

@github-actions github-actions bot added awaiting review Awaiting review awaiting changes Awaiting changes awaiting committer review Awaiting committer review and removed awaiting change review Awaiting change review awaiting review Awaiting review awaiting changes Awaiting changes labels Jan 21, 2025
@EnricoMi
Copy link
Contributor Author

@westonpace saying: "I can't speak much to the code in compare.h / compare.cc as I'm not very familiar with it."

@pitrou @lidavidm @bkietz @zeroshade you seem to be familiar with compare.cc, can you take a look, please?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants