
WIP Optimize hash_aggregate when there are no null group keys #922

Closed
wants to merge 2 commits into from

Conversation

novemberkilo

ref #850

WIP - for purposes of discussion and feedback.

@alamb I have done the simplest possible thing to get the branching code path that avoids the is_valid check. However, I'd appreciate some input on how best to proceed:

  • I'd like to figure out how best to refactor this code to reduce duplication. I'm not sure that the duplication is that bad in the first place. If we were to improve it, is this best done via an additional macro, or perhaps by changing the API of eq_array to take an additional parameter that signals whether the array has no nulls?
  • How do I test eq_array_no_nulls? Should I look to follow the test for eq_array?

Thanks much in advance for the guidance and input.

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Aug 22, 2021
Contributor

@alamb alamb left a comment

Thanks @novemberkilo. This is looking like a good start.

How do I test eq_array_no_nulls? Should I look to follow the test for eq_array?

Yes, I think the tests for eq_array would be a good model.

I'm not sure that the duplication is that bad in the first place. If we were to improve it, is this best done via an additional macro

The goal of the optimization is to ensure that checking if the array has any nulls is done once per array, and not once per row (e.g. once per function call to eq_array).

If you used a macro I think you could have the compiler do the duplication for you. For example, you could define a macro like

macro_rules! eq_array_general {
    ($self:expr, $array:expr, $index:expr, $has_nulls:expr) => {{
        if $has_nulls { /* check for nulls */ }
        ...
    }};
}

And then you can call it like

    pub fn eq_array(&self, array: &ArrayRef, index: usize) -> bool {
        eq_array_general!(self, array, index, true)
    }
    pub fn eq_array_no_nulls(&self, array: &ArrayRef, index: usize) -> bool {
        eq_array_general!(self, array, index, false)
    }

And then we would be counting on the Rust compiler to optimize out the if false { .. } branch for the null check.
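As an illustration of the same specialization idea, here is a sketch using a const generic instead of a macro. The NullableArray type below is a simplified stand-in, not DataFusion's actual ScalarValue/ArrayRef API; the const generic plays the role of the macro's $has_nulls argument.

```rust
// Sketch only: `NullableArray` is a simplified stand-in for an Arrow
// array. The compiler emits two monomorphized copies of `eq_value`,
// and in the HAS_NULLS = false copy the validity check is removed
// entirely, just as with the `if false { .. }` in the macro expansion.
struct NullableArray {
    values: Vec<i64>,
    validity: Option<Vec<bool>>, // None means the array has no nulls
}

impl NullableArray {
    fn is_valid(&self, index: usize) -> bool {
        self.validity.as_ref().map_or(true, |v| v[index])
    }

    fn eq_value<const HAS_NULLS: bool>(&self, index: usize, scalar: i64) -> bool {
        // Compile-time flag: this branch disappears when HAS_NULLS is false.
        if HAS_NULLS && !self.is_valid(index) {
            return false;
        }
        self.values[index] == scalar
    }

    pub fn eq_array(&self, index: usize, scalar: i64) -> bool {
        self.eq_value::<true>(index, scalar)
    }

    pub fn eq_array_no_nulls(&self, index: usize, scalar: i64) -> bool {
        self.eq_value::<false>(index, scalar)
    }
}

fn main() {
    let no_nulls = NullableArray { values: vec![1, 2, 3], validity: None };
    assert!(no_nulls.eq_array_no_nulls(1, 2));

    let with_nulls = NullableArray { values: vec![1, 2], validity: Some(vec![true, false]) };
    assert!(!with_nulls.eq_array(1, 2)); // row 1 is null, so it never matches
}
```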

#[inline]
pub fn eq_array_no_nulls(&self, array: &ArrayRef, index: usize) -> bool {
    if let DataType::Dictionary(key_type, _) = array.data_type() {
        return self.eq_array_dictionary(array, index, key_type);
Contributor

I think this should perhaps also be self.eq_array_dictionary_no_nulls and have a corresponding dictionary special case.

@novemberkilo
Author

@alamb @Dandandan I've added a commit that tries to incorporate your suggestions and feedback. Did I understand them correctly? Can you please comment again on whether I am headed in the right direction?

I've not added tests to cover the additional code paths yet - I am aware that I need to do that.

Also, do you have any pointers in the direction of how best to measure the effects of the attempted optimisation? Thanks.

@alamb
Contributor

alamb commented Aug 30, 2021

Hi @novemberkilo -- Thanks! I will try and review this tomorrow. In terms of measuring performance, the "Performance Summary" in #808 (comment) might give you some good ideas of what to try and ways to show this is faster.

@alamb
Contributor

alamb commented Aug 30, 2021

(sorry for delays in reviewing, I am on vacation this week and am about to run out of time for the day)

@novemberkilo
Author

Ah please don't stress about delays @alamb -- no rush on my behalf ... especially if you are on vacation.

    .zip(null_information.iter())
    .all(|((array, scalar), has_nulls)| {
        scalar.eq_array(array, row, has_nulls)
    })
Contributor

I think broadly speaking this is the pattern, though I am not sure how well the compiler will be able to optimize given a runtime has_nulls parameter -- the check might need to be hoisted out of this loop (the call to all here).

Have you had a chance to try profiling?
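The hoisting being discussed could look like the following sketch, with plain slices standing in for Arrow arrays (none of these names are DataFusion's): the null/no-null decision is taken once per array, and the per-row loop in the no-null path never touches validity at all.

```rust
// Sketch only: plain slices stand in for Arrow arrays. The branch on
// nullability is made once per array, outside the per-row loop,
// instead of once per row inside it.
fn rows_equal_to(values: &[i64], validity: Option<&[bool]>, scalar: i64) -> Vec<bool> {
    match validity {
        // No-null fast path: the loop body has no validity check.
        None => values.iter().map(|v| *v == scalar).collect(),
        // Nullable path: the validity bitmap is consulted per row.
        Some(valid) => values
            .iter()
            .zip(valid)
            .map(|(v, ok)| *ok && *v == scalar)
            .collect(),
    }
}

fn main() {
    assert_eq!(rows_equal_to(&[1, 2, 1], None, 1), vec![true, false, true]);
    assert_eq!(
        rows_equal_to(&[1, 2, 1], Some(&[true, true, false]), 1),
        vec![true, false, false]
    );
}
```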

Author

@novemberkilo novemberkilo Aug 31, 2021

Hi, I don't understand the suggestion about hoisting has_nulls out of the all.

Would you be able to please sketch it out in pseudo code or similar?

I haven't used macros much and will go educate myself on how they work (particularly the optimisation role of the compiler). Will also get to profiling (although maybe not until the weekend).

@novemberkilo
Author

novemberkilo commented Sep 5, 2021

Hi @alamb - I've used datafusion_aggregate_bench to run up a comparison with master and the ratios are coming up at 0.99 so there's nothing significant in the way of a performance improvement yet.

Could you please help me understand your suggestion:

am not sure how well the compiler will be able to optimize given a runtime has_nulls parameter -- the check might need to be hoisted out of this loop (the call to all here)

I'm not sure what this means. Are you perhaps suggesting that I use an index lookup of null_information rather than using zip?

@alamb
Contributor

alamb commented Sep 10, 2021

Hi @alamb - I've used datafusion_aggregate_bench to run up a comparison with master and the ratios are coming up at 0.99 so there's nothing significant in the way of a performance improvement yet.

One reason you may not have seen any difference is that the aggregate benchmark has nulls in the grouping keys. Perhaps adjusting the following numbers and ensuring your code path is hit would be a good next step:

https://github.com/alamb/datafusion_aggregate_bench/blob/main/src/main.rs#L23-L24

@novemberkilo
Author

Thanks much @alamb

I followed your suggestion and set both of those constants to 0.0, and confirmed that the code path being hit is the one where array.is_valid() is skipped. I reran the benchmarks and still have no appreciable difference to report between my code and what is on master.

This is confusing to me ... perhaps the is_valid() call is not as expensive as we thought? Or else the optimisation is still somehow not visible. I'll think about it some more but thought I would post these findings as an interim update.

@alamb
Contributor

alamb commented Sep 12, 2021

This is confusing to me ... perhaps the is_valid() call is not as expensive as we thought?

Thanks for the update @novemberkilo 👍

This is definitely possible. I don't think we have measured the specific contribution that checking for nulls added to the overall runtime. If it turns out we can't see a difference in performance, perhaps the additional code complexity isn't worth it.

Unfortunately this does happen sometimes while doing performance optimizations :( I am sorry if I misled you here.

@Dandandan
Contributor

@novemberkilo thanks for checking this and opening this PR.

Next to the remarks by @alamb, I think there is also a more fundamental change that would be needed to get the branch outside of the loop.
Currently, the boolean check if *$has_nulls is still done per value while checking value equality. Ideally, this should be done entirely outside of the loop, as is done in the Arrow kernels too. Quite invasive changes and experiments might be needed to accomplish this.

A similar thing that could be tried here, as I started (but have not completed yet) in PR
#864 for the hash join, is to vectorize the hash collision check: instead of checking the collision for each value individually, it is delayed to a later stage, so it can be done in a simpler loop which can be specialized for non-null cases. The core idea is that the inner loop should contain as few branches, downcasts, etc. as possible.

Another reason might be that this is currently not a really hot part of the code. This could be checked by running some profiling tools (e.g. perf or valgrind), which can give hints about where the hot code lives. A micro benchmark could help too, to measure performance changes in this part of the code more accurately (instead of the performance of executing a full query).

Another source of overhead might be that currently one of the two values is a ScalarValue (instead of living in a contiguous array), which is potentially boxed, does not have good locality, and cannot be specialized for having no nulls. Also, the array is downcast in the inner loop based on the data type.
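As an illustration of the "downcast once, outside the inner loop" point above, here is a sketch with a simplified stand-in for ScalarValue (not DataFusion's actual enum) and a plain slice standing in for a downcast Arrow array.

```rust
// Sketch only: `ScalarValue` here is a simplified stand-in for
// DataFusion's enum. The enum match (and, in real code, the array
// downcast) happens once per array, so the hot loop compares plain
// primitives with no branching on type or nullability.
#[allow(dead_code)]
enum ScalarValue {
    Int64(Option<i64>),
    Utf8(Option<String>),
}

fn count_matches(scalar: &ScalarValue, values: &[i64]) -> usize {
    // Unpack the scalar once, outside the loop.
    let target = match scalar {
        ScalarValue::Int64(Some(v)) => *v,
        // A null or mismatched-type scalar can never match.
        _ => return 0,
    };
    // Hot loop: primitive comparison only.
    values.iter().filter(|v| **v == target).count()
}

fn main() {
    assert_eq!(count_matches(&ScalarValue::Int64(Some(2)), &[1, 2, 2, 3]), 2);
    assert_eq!(count_matches(&ScalarValue::Int64(None), &[1, 2]), 0);
}
```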

macro_rules! eq_array_general {
    ($array:expr, $index:expr, $ARRAYTYPE:ident, $VALUE:expr, $has_nulls:expr) => {{
        let array = $array.as_any().downcast_ref::<$ARRAYTYPE>().unwrap();
        if *$has_nulls {
Contributor

@Dandandan Dandandan Sep 12, 2021

Ideally, this branch should be outside of the loop at
https://github.com/apache/arrow-datafusion/pull/922/files#diff-03876812a8bef4074e517600fdcf8e6b49f1ea24df44905d6d806836fd61b2a8L360

But this might be hard to accomplish given the current design. I posted some ideas in my earlier comment on this PR.

@novemberkilo
Author

Unfortunately this does happen sometimes while doing performance optimizations :( I am sorry if I misled you here.

Hi @alamb - I am familiar with these sorts of outcomes when working on performance optimisation problems ... experimentation that yields negative results is also valuable. Please, no need for any apology :)

@novemberkilo
Author

novemberkilo commented Sep 13, 2021

@Dandandan and @alamb thanks much for your responses. I'm not clear on how to proceed with this PR, though. My read is that we probably close this PR and chalk up the implementation I presented here as unsuccessful/unnecessary. WDYT?

The approach that you are suggesting, @Dandandan, appears to be non-trivial and should probably be broken out or written up on the original issue? I can try to do some of the benchmarking you have suggested, but tbh I don't understand the actual problem well enough at the moment to work on this in a focused and productive manner. I will revisit the original issue and ask questions, but if either of you can point to any other documentation that may be relevant, that would be great. I will also keep an eye on #864 👍

@novemberkilo
Author

Concluding that this approach is not yielding the performance optimisation that we were reaching for. Per #922 (comment) we should probably look for a method similar to the one being adopted in #864.

Closing ...

@alamb
Contributor

alamb commented Sep 17, 2021

👍 thanks again for giving it a try
