Improve StringView support for SUBSTR #12044

Merged
27 commits merged on Sep 6, 2024

Conversation

Kev1n8
Contributor

@Kev1n8 Kev1n8 commented Aug 17, 2024

Which issue does this PR close?

Closes #12031
Closes #12033

Rationale for this change

What changes are included in this PR?

  1. For StringView input to SUBSTR, operate directly on the views instead of generating new Strings.
  2. I took the liberty of making sqllogictest treat Utf8View columns as Text.
  3. Added a bench file to record differences under the following conditions:
    In one case, both the input and output of SUBSTR are longer than 12 bytes, or both are shorter; in the other, the input is longer than 12 bytes while the output is shorter. The benchmark also compares Utf8View, Utf8, and LargeUtf8 with each other.
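
To illustrate why the 12-byte threshold matters, here is a minimal std-only sketch of the 16-byte view layout as described in the Arrow format: a 4-byte length, then either up to 12 inlined bytes, or a 4-byte prefix plus a buffer index and offset. The names `DecodedView`, `decode_view`, `encode_inline`, and `encode_buffered` are hypothetical helpers for this example, not arrow-rs APIs.

```rust
/// Decoded form of a 16-byte string view (illustrative type, not an arrow-rs API).
#[derive(Debug, PartialEq)]
enum DecodedView {
    /// Strings of at most 12 bytes live entirely inside the view.
    Inline { len: u32, data: [u8; 12] },
    /// Longer strings keep a 4-byte prefix and point into a data buffer.
    Buffered { len: u32, prefix: [u8; 4], buffer_index: u32, offset: u32 },
}

/// Split a raw view into its fields (little-endian layout).
fn decode_view(raw: u128) -> DecodedView {
    let bytes = raw.to_le_bytes();
    let len = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]);
    if len <= 12 {
        let mut data = [0u8; 12];
        data.copy_from_slice(&bytes[4..16]);
        DecodedView::Inline { len, data }
    } else {
        DecodedView::Buffered {
            len,
            prefix: [bytes[4], bytes[5], bytes[6], bytes[7]],
            buffer_index: (raw >> 64) as u32, // bits 64..96
            offset: (raw >> 96) as u32,       // bits 96..128
        }
    }
}

/// Build an inlined view for a string of at most 12 bytes.
fn encode_inline(s: &[u8]) -> u128 {
    assert!(s.len() <= 12);
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&(s.len() as u32).to_le_bytes());
    bytes[4..4 + s.len()].copy_from_slice(s);
    u128::from_le_bytes(bytes)
}

/// Build a view that points at `len` bytes stored in a data buffer.
fn encode_buffered(len: u32, prefix: [u8; 4], buffer_index: u32, offset: u32) -> u128 {
    (len as u128)
        | ((u32::from_le_bytes(prefix) as u128) << 32)
        | ((buffer_index as u128) << 64)
        | ((offset as u128) << 96)
}

fn main() {
    println!("{:?}", decode_view(encode_inline(b"alphabet")));
    println!("{:?}", decode_view(encode_buffered(128, *b"abcd", 0, 256)));
}
```

Under this layout, a substring of a long string can be expressed by adjusting the length, prefix, and offset fields, which is why the string data itself does not need to be copied.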

Are these changes tested?

yes

Are there any user-facing changes?

no

@Kev1n8
Contributor Author

Kev1n8 commented Aug 17, 2024

The following is my latest bench report of SUBSTR:


SHORTER THAN 12/substr_string_view [size=1024, strlen=12]
                        time:   [29.675 µs 29.964 µs 30.402 µs]
                        change: [-0.1382% +0.9754% +2.5969%] (p = 0.20 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
SHORTER THAN 12/substr_string [size=1024, strlen=12]
                        time:   [59.403 µs 59.823 µs 60.292 µs]
                        change: [+0.2182% +0.9242% +1.7185%] (p = 0.04 < 0.05)
                        Change within noise threshold.
SHORTER THAN 12/substr_large_string [size=1024, strlen=12]
                        time:   [41.462 µs 41.775 µs 42.064 µs]
                        change: [-0.7169% +0.3030% +1.2513%] (p = 0.57 > 0.05)
                        No change in performance detected.

LONGER THAN 12/substr_string_view [size=1024, count=64, strlen=128]
                        time:   [68.409 µs 68.896 µs 69.437 µs]
                        change: [-22.017% -10.028% +1.0113%] (p = 0.23 > 0.05)
                        No change in performance detected.
LONGER THAN 12/substr_string [size=1024, count=64, strlen=128]
                        time:   [101.98 µs 102.20 µs 102.55 µs]
                        change: [-0.1908% +0.0183% +0.3687%] (p = 0.92 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
LONGER THAN 12/substr_large_string [size=1024, count=64, strlen=128]
                        time:   [101.90 µs 102.32 µs 102.80 µs]
                        change: [+0.1753% +0.5640% +1.0536%] (p = 0.01 < 0.05)
                        Change within noise threshold.

SRC_LEN > 12, SUB_LEN < 12/substr_string_view [size=1024, count=6, strlen=128]
                        time:   [15.031 µs 15.180 µs 15.331 µs]
SRC_LEN > 12, SUB_LEN < 12/substr_string [size=1024, count=6, strlen=128]
                        time:   [33.738 µs 34.057 µs 34.370 µs]
SRC_LEN > 12, SUB_LEN < 12/substr_large_string [size=1024, count=6, strlen=128]
                        time:   [33.509 µs 34.122 µs 35.075 µs]
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) high mild
  1 (10.00%) high severe

SHORTER THAN 12/substr_string_view [size=4096, strlen=12]
                        time:   [120.22 µs 121.43 µs 122.70 µs]
                        change: [+1.3982% +2.3499% +3.4501%] (p = 0.00 < 0.05)
                        Performance has regressed.
SHORTER THAN 12/substr_string [size=4096, strlen=12]
                        time:   [237.06 µs 251.39 µs 268.81 µs]
                        change: [+1.6838% +7.2415% +14.130%] (p = 0.04 < 0.05)
                        Performance has regressed.
SHORTER THAN 12/substr_large_string [size=4096, strlen=12]
                        time:   [163.27 µs 164.24 µs 165.40 µs]
                        change: [-0.2464% +0.3706% +1.0851%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high mild

LONGER THAN 12/substr_string_view [size=4096, count=64, strlen=128]
                        time:   [275.69 µs 277.81 µs 280.16 µs]
                        change: [+0.9658% +1.8029% +2.6242%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
LONGER THAN 12/substr_string [size=4096, count=64, strlen=128]
                        time:   [416.84 µs 419.95 µs 423.71 µs]
                        change: [-1.6174% -0.0917% +1.2731%] (p = 0.91 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
LONGER THAN 12/substr_large_string [size=4096, count=64, strlen=128]
                        time:   [419.60 µs 421.94 µs 424.48 µs]
                        change: [+1.5505% +2.0977% +2.7392%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

SRC_LEN > 12, SUB_LEN < 12/substr_string_view [size=4096, count=6, strlen=128]
                        time:   [59.933 µs 60.122 µs 60.355 µs]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
SRC_LEN > 12, SUB_LEN < 12/substr_string [size=4096, count=6, strlen=128]
                        time:   [133.30 µs 134.66 µs 135.97 µs]
SRC_LEN > 12, SUB_LEN < 12/substr_large_string [size=4096, count=6, strlen=128]
                        time:   [131.37 µs 141.42 µs 158.42 µs]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe

substr_index_array_array_1000
                        time:   [64.130 µs 68.632 µs 73.376 µs]
                        change: [-16.971% -10.021% -2.4160%] (p = 0.01 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) functions labels Aug 17, 2024
let mut group = c.benchmark_group("SHORTER THAN 12");
group.sampling_mode(SamplingMode::Flat);
group.sample_size(10);
group.measurement_time(Duration::from_secs(10));
Contributor

Is there a reason we tune the sampling parameter?

Contributor Author

I actually simply copied these settings from benches/repeat. I'm not very sure about this.

if length == 0 {
    builder.append_null();
} else if length > 12 {
    let buffer_index = (*raw >> 64) as u32;
Contributor

We should use ByteView from arrow-rs:
https://github.com/apache/arrow-rs/blob/27789d7c9abb50796a4042e7e193703efe3c95b3/arrow-data/src/byte_view.rs#L44-L54

But ByteView lives in arrow-data, which DataFusion does not explicitly depend on. What's your opinion? @alamb

Contributor

I think we can add an explicit dependency on arrow-data, similar to

arrow-ord = { version = "52.2.0", default-features = false }

In general I think it would be better if the arrow crate re-exported these various structures (so DataFusion could depend only on arrow rather than all the sub crates)

Contributor

I made apache/arrow-rs#6275 to export this structure publicly

Contributor

It is also kind of annoying that the use of inline_value always comes with an immediate call to from_utf8_unchecked

                                let bytes =
                                    StringViewArray::inline_value(raw, length as usize);
                                let str = std::str::from_utf8_unchecked(
                                    &bytes[..length as usize],
                                );

I'll try and make a PR in arrow-rs to do that too

@XiangpengHao
Contributor

I think the PR is looking good, left some comments

Contributor

@alamb alamb left a comment

Thank you @Kev1n8 and @XiangpengHao -- I agree this PR is very close

This is a really neat PR

The only thing I think is needed is using ByteView, as suggested by @XiangpengHao in https://github.com/apache/datafusion/pull/12044/files#r1720861699

I left some smaller suggestions, that is the big one

datafusion/functions/src/unicode/substr.rs (resolved)
datafusion/sqllogictest/test_files/arrow_typeof.slt (outdated, resolved)
/// substr('alphabet', 3) = 'phabet'
/// substr('alphabet', 3, 2) = 'ph'
/// The implementation uses UTF-8 code points as characters
fn calculate_substr<'a, V, T>(string_array: V, args: &[ArrayRef]) -> Result<ArrayRef>
Contributor

I think this function adds an unnecessary level of indirection

Rather than having substr dispatch to calculate_substr, which then dispatches to calculate_string or calculate_string_view, I suspect the code would be easier to follow if substr dispatched to them directly

Contributor Author

Thanks for pointing this out. I didn't realize it earlier.

Contributor

@alamb alamb left a comment

Thank you @Kev1n8 and @XiangpengHao -- this is quite cool and I think looking close

As @XiangpengHao mentions in https://github.com/apache/datafusion/pull/12044/files#r1720861699 , I think we should be using ByteView instead of inlined comparison. If the stuff about using arrow-data isn't clear, I would be happy to work on it

I'll make a PR with some proposed changes to this one shortly

Comment on lines 76 to 86
pub(crate) fn optimized_utf8_to_str_type(
    arg_type: &DataType,
    name: &str,
) -> Result<DataType> {
    let support_list = ["substr"];
    if support_list.contains(&name) {
        Ok(DataType::Utf8View)
    } else {
        utf8_to_str_type(arg_type, name)
    }
}
Contributor

🤔 since this function simply returns Utf8View regardless of the input types, maybe we should just change substr() to directly return Ok(DataType::Utf8View) -- that would probably be the simplest

Contributor Author

yeah, I think so too.

Contributor

@alamb alamb left a comment

Thanks again @Kev1n8 and @XiangpengHao -- I went through this PR again and I pushed a few more commits. I think this exercise is exploring where using StringViewArray is valuable, though since it is early days it is taking a while.

I think I have two concerns with this PR:

  1. The logic for StringView is now entirely separate from the String/LargeString array
  2. There is some non trivial overlap for manipulating the views

Here is what I suggest to move forward:

  1. Add the benchmark in its own new PR (so we more easily compare just the code change in this PR)
  2. See if we can try and avoid some of the duplication (I left some specific comments)

// Safety:
// 1. idx < string_array.views.size()
// 2. builder is guaranteed to have corresponding blocks
unsafe {
Contributor

There seems to be some non-trivial duplication between this and the clauses below

Contributor Author

Maybe we could use value_unchecked to retrieve str in both cases.

// (2) we do not slice on utf-8 codepoint
unsafe {
    let bytes =
        StringViewArray::inline_value(raw, length as usize);
Contributor

I would expect that we never have to call append_value, as I think the substr calculation can be done entirely in terms of the views (and never modify the values in the buffer). Reading this code carefully, I do think that is the case, but it is quite implicit (it knows that calling append_value on a string that is less than 12 bytes will not push values into the buffer)

I wonder if we could refactor some of this code into functions with names

Perhaps something like

/// Modify a `view` with length <= 12 bytes so it reflects the substring [start..end]
fn substr_small_view(len: usize, view: u128, start: usize, end: usize) -> u128 { ... }

/// Modify a `view` with length > 12 bytes so it reflects the substring [start..end]
fn substr_large_view(len: usize, view: u128, start: usize, end: usize) -> u128 { ... }

Then we could simply update the views directly

However, the StringViewBuilder doesn't seem to allow for that at the moment....
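
As a rough, std-only sketch of what those two helpers might look like: the bodies below are an assumption, not the code that was merged. `substr_large_view` takes an extra `buffer` parameter that the suggested signature does not have, because a long view only stores a 4-byte prefix, so the bytes of the substring have to be read from the data buffer. `make_view_sketch` is a hypothetical local helper for building test views. All offsets are byte offsets that are assumed to fall on UTF-8 boundaries.

```rust
/// Hypothetical helper: encode `data` as a view. Data longer than 12 bytes is
/// assumed to already live at `offset` in data buffer `buffer_index`.
fn make_view_sketch(data: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let mut v = [0u8; 16];
    v[0..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    if data.len() <= 12 {
        v[4..4 + data.len()].copy_from_slice(data);
    } else {
        v[4..8].copy_from_slice(&data[..4]);
        v[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        v[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(v)
}

/// Sub-view of an inlined view (len <= 12): move the selected bytes to the
/// front of the inline area and leave the rest zeroed.
fn substr_small_view(len: usize, view: u128, start: usize, end: usize) -> u128 {
    debug_assert!(len <= 12 && start <= end && end <= len);
    let src = view.to_le_bytes();
    let mut dst = [0u8; 16];
    let new_len = end - start;
    dst[0..4].copy_from_slice(&(new_len as u32).to_le_bytes());
    dst[4..4 + new_len].copy_from_slice(&src[4 + start..4 + end]);
    u128::from_le_bytes(dst)
}

/// Sub-view of a buffered view (len > 12). The data buffer is only read,
/// never modified. `buffer` is an assumption added for this sketch.
fn substr_large_view(len: usize, view: u128, buffer: &[u8], start: usize, end: usize) -> u128 {
    debug_assert!(len > 12 && start <= end && end <= len);
    let src = view.to_le_bytes();
    let offset = (view >> 96) as u32 as usize;
    let sub = &buffer[offset + start..offset + end];
    let mut dst = [0u8; 16];
    dst[0..4].copy_from_slice(&(sub.len() as u32).to_le_bytes());
    if sub.len() <= 12 {
        // The result is short enough to be inlined into the view.
        dst[4..4 + sub.len()].copy_from_slice(sub);
    } else {
        dst[4..8].copy_from_slice(&sub[..4]); // new prefix
        dst[8..12].copy_from_slice(&src[8..12]); // same buffer index
        dst[12..16].copy_from_slice(&((offset + start) as u32).to_le_bytes());
    }
    u128::from_le_bytes(dst)
}

fn main() {
    let buffer: &[u8] = b"the quick brown fox jumps";
    let view = make_view_sketch(buffer, 0, 0);
    let sub = substr_large_view(buffer.len(), view, buffer, 4, 9);
    println!("sub-view length: {}", sub as u32);
    let small = make_view_sketch(b"alphabet", 0, 0);
    println!("small sub-view length: {}", substr_small_view(8, small, 2, 4) as u32);
}
```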

Contributor Author

@Kev1n8 Kev1n8 Aug 22, 2024

Maybe we could add a function like append_view_u128_unchecked(view: u128) in arrow/src/builder/generic_bytes_view_builder.rs to simply add a view with a given u128. Then the whole process would be:

  1. Get the str of the view by value_unchecked, then get the [start, end)
  2. sub_view = if end-start>12 substr_large_view() else substr_small_view() make_view()
  3. call append_view_u128_unchecked(sub_view) on the builder

Contributor

Reading this code carefully I do think is the case but it is quite implicit (it knows that calling append_value on a string that is less than 12 bytes)

Yes

Maybe we could add a function like append_view_u128_unchecked(view: u128)

I like this direction.

The current append_view_unchecked is basically only for long strings (as it asks for a block id), and we don't have a similar method for short strings; that's why we call append_value as a workaround.

We can probably have a function called append_inlined_view_unchecked(len: usize, prefix: &[u8]) to make the intention clearer

Contributor Author

The thing is that I think retrieving the &str by value_unchecked() is a must-do to calculate the [start, end), then the sub_view: u128 could be created based on the original view using make_view. I'll make some commits.
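
A minimal sketch of that [start, end) calculation, assuming PostgreSQL-style SUBSTR semantics (1-based start counted in characters, an optional non-negative count, and positions before the first character clipped away). The function name `substr_byte_range` is hypothetical, and DataFusion's exact edge-case handling may differ.

```rust
/// Map a 1-based character `start` and optional character `count` to a
/// byte range [begin, end) within `s`. Hypothetical helper for illustration.
fn substr_byte_range(s: &str, start: i64, count: Option<u64>) -> (usize, usize) {
    // 0-based index of the first requested character (may be negative).
    let first = start.saturating_sub(1);
    // 0-based index one past the last requested character, if a count is given.
    let last = count.map(|c| first.saturating_add(i64::try_from(c).unwrap_or(i64::MAX)));
    // Byte offset of the character at `char_idx`, clamped to the string bounds.
    let byte_at = |char_idx: i64| -> usize {
        let idx = char_idx.max(0) as usize;
        s.char_indices().nth(idx).map_or(s.len(), |(byte, _)| byte)
    };
    let begin = byte_at(first);
    let end = last.map_or(s.len(), |l| byte_at(l));
    (begin, end)
}

fn main() {
    let s = "alphabet";
    let (b, e) = substr_byte_range(s, 3, None);
    println!("{}", &s[b..e]);
    let (b, e) = substr_byte_range(s, 3, Some(2));
    println!("{}", &s[b..e]);
}
```

Because the range is expressed in bytes, it can be applied to a view's length and offset fields without re-encoding the string.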

@Kev1n8
Contributor Author

Kev1n8 commented Aug 22, 2024

  • Get the str of the view by value_unchecked, then get the [start, end)
  • sub_view = if end-start>12 substr_large_view() else substr_small_view()
  • call append_view_u128_unchecked(sub_view) on the builder

I can open a PR for that append_u128 function in arrow if it's considered a good idea, and then adjust this PR.


I refactored the code again so that it now behaves as follows:
Collect the views in a Vec<u128>, record the nulls using NullBufferBuilder, and construct the StringViewArray with new_unchecked.
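
The shape of that approach can be sketched without arrow-rs. In the std-only example below, a `Vec<bool>` stands in for NullBufferBuilder, the final StringViewArray construction is omitted, and the operation is simplified to "keep the first `n` characters". `view_for`, `bytes_of`, and `left_views` are hypothetical names, and a single data buffer (index 0) is assumed.

```rust
/// Encode `data` as a 16-byte view. Data longer than 12 bytes is assumed to
/// already live at `offset` in data buffer 0.
fn view_for(data: &[u8], offset: u32) -> u128 {
    let mut v = [0u8; 16];
    v[0..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    if data.len() <= 12 {
        v[4..4 + data.len()].copy_from_slice(data);
    } else {
        v[4..8].copy_from_slice(&data[..4]);
        // Bytes 8..12 (buffer index) stay zero: buffer 0.
        v[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(v)
}

/// Copy out the bytes a view refers to.
fn bytes_of(view: u128, buffer: &[u8]) -> Vec<u8> {
    let len = view as u32 as usize;
    if len <= 12 {
        view.to_le_bytes()[4..4 + len].to_vec()
    } else {
        let offset = (view >> 96) as u32 as usize;
        buffer[offset..offset + len].to_vec()
    }
}

/// Keep the first `n` characters of every row. Returns the new views and a
/// validity vector; the data buffer is shared with the input and untouched.
fn left_views(rows: &[Option<u128>], buffer: &[u8], n: usize) -> (Vec<u128>, Vec<bool>) {
    let mut views = Vec::with_capacity(rows.len());
    let mut validity = Vec::with_capacity(rows.len());
    for row in rows {
        match row {
            None => {
                views.push(0); // placeholder view for a null row
                validity.push(false);
            }
            Some(view) => {
                let bytes = bytes_of(*view, buffer);
                let s = std::str::from_utf8(&bytes).expect("views hold valid UTF-8");
                let end = s.char_indices().nth(n).map_or(s.len(), |(b, _)| b);
                // Only meaningful for long views; ignored when the result is inlined.
                let offset = (*view >> 96) as u32;
                views.push(view_for(&bytes[..end], offset));
                validity.push(true);
            }
        }
    }
    (views, validity)
}

fn main() {
    let buffer: &[u8] = b"a fairly long string value";
    let rows = [Some(view_for(buffer, 0)), None, Some(view_for(b"short", 0))];
    let (views, validity) = left_views(&rows, buffer, 15);
    println!("{} views, validity {:?}", views.len(), validity);
}
```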

@Kev1n8 Kev1n8 marked this pull request as ready for review August 23, 2024 11:20
@alamb
Contributor

alamb commented Aug 25, 2024

Thanks @Kev1n8 and @XiangpengHao -- I am running some benchmarks on this now.

BTW I was thinking that once we have completed #12119 we could potentially make substr always return Utf8View -- as it should always be better to use the views rather than Utf8Array

@Kev1n8
Contributor Author

Kev1n8 commented Aug 28, 2024

BTW I was thinking that once we have completed #12119 we could potentially make substr always return Utf8View -- as it should always be better to use the views rather than Utf8Array

Sounds great.

I think we can also improve the TRIM function in the same way, since it is quite similar to SUBSTR.

@alamb
Contributor

alamb commented Sep 5, 2024

Sorry -- here are the results of my benchmark run -- TLDR is this looks great to me

++ critcmp main stringview-output-for-substr
group                                                                              main                                   stringview-output-for-substr
-----                                                                              ----                                   ----------------------------
LONGER THAN 12/substr_large_string [size=1024, count=64, strlen=128]               1.01    147.7±0.16µs        ? ?/sec    1.00    146.4±0.10µs        ? ?/sec
LONGER THAN 12/substr_large_string [size=4096, count=64, strlen=128]               1.00    590.8±0.14µs        ? ?/sec    1.00    589.1±1.54µs        ? ?/sec
LONGER THAN 12/substr_string [size=1024, count=64, strlen=128]                     1.05    154.9±0.25µs        ? ?/sec    1.00    148.0±0.11µs        ? ?/sec
LONGER THAN 12/substr_string [size=4096, count=64, strlen=128]                     1.08    636.7±0.81µs        ? ?/sec    1.00    591.5±0.27µs        ? ?/sec
LONGER THAN 12/substr_string_view [size=1024, count=64, strlen=128]                1.50    158.2±0.20µs        ? ?/sec    1.00    105.4±0.04µs        ? ?/sec
LONGER THAN 12/substr_string_view [size=4096, count=64, strlen=128]                1.50    629.0±0.18µs        ? ?/sec    1.00    420.6±0.12µs        ? ?/sec
SHORTER THAN 12/substr_large_string [size=1024, strlen=12]                         1.00     56.8±0.02µs        ? ?/sec    1.03     58.3±0.93µs        ? ?/sec
SHORTER THAN 12/substr_large_string [size=4096, strlen=12]                         1.00    223.8±0.30µs        ? ?/sec    1.02    227.3±0.05µs        ? ?/sec
SHORTER THAN 12/substr_string [size=1024, strlen=12]                               1.05     89.3±0.05µs        ? ?/sec    1.00     84.9±0.04µs        ? ?/sec
SHORTER THAN 12/substr_string [size=4096, strlen=12]                               1.05    351.7±0.32µs        ? ?/sec    1.00    335.2±0.10µs        ? ?/sec
SHORTER THAN 12/substr_string_view [size=1024, strlen=12]                          2.82     58.4±0.04µs        ? ?/sec    1.00     20.7±0.01µs        ? ?/sec
SHORTER THAN 12/substr_string_view [size=4096, strlen=12]                          2.79    227.8±0.05µs        ? ?/sec    1.00     81.6±0.04µs        ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_large_string [size=1024, count=6, strlen=128]    1.00     40.9±0.01µs        ? ?/sec    1.00     41.1±0.03µs        ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_large_string [size=4096, count=6, strlen=128]    1.01    157.6±0.13µs        ? ?/sec    1.00    156.3±0.13µs        ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_string [size=1024, count=6, strlen=128]          1.06     43.9±0.11µs        ? ?/sec    1.00     41.5±0.08µs        ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_string [size=4096, count=6, strlen=128]          1.07    169.7±0.03µs        ? ?/sec    1.00    158.4±0.09µs        ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_string_view [size=1024, count=6, strlen=128]     1.74     43.7±0.01µs        ? ?/sec    1.00     25.1±0.04µs        ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_string_view [size=4096, count=6, strlen=128]     1.69    166.4±0.07µs        ? ?/sec    1.00     98.6±0.06µs        ? ?/sec

Contributor

@alamb alamb left a comment

Thank you @Kev1n8 and @XiangpengHao - I am sorry for the delay in review / approval / testing of this PR

@alamb
Contributor

alamb commented Sep 5, 2024

I also merged up from main for this PR to get the arrow 53 release (and thus the changes to add arrow-data as a dependency were removed)

@alamb
Contributor

alamb commented Sep 5, 2024

Filed #12338 to track the idea of using StringViewArray always

@alamb
Contributor

alamb commented Sep 6, 2024

🚀

Successfully merging this pull request may close these issues.

Improve performance of SUBSTR for StringViewArray
3 participants