-
Notifications
You must be signed in to change notification settings - Fork 838
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Faster parquet utf8 validation using simdjson
#6668
base: main
Are you sure you want to change the base?
Conversation
Ok(_) => Ok(()), | ||
Err(e) => Err(general_err!("encountered non UTF-8 data: {}", e)), | ||
Err(_) => { | ||
let e = simdutf8::compat::from_utf8(val).unwrap_err(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We call compat from_utf8
again to get the same error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the role of simdutf8::basic::from_utf8 and re-run with simdutf8::compat -- does this deserve a code comment?
(at least the .unwrap_err()
safety deserves one)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same in offset_buffer.rs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I agree deserves some comments explaining why we rerun it in case of error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is a positive sentiment about using simdutf8
for faster validation, I can do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we have our own from_utf8
that wraps the simdutf8
implementation? Then the weird basic/compat path would only need to be documented once (and make it easier to replace other from_utf8
invocations that @alamb identified).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll make a PR that does this in the next few days if no one beats me to it
@@ -69,6 +69,7 @@ paste = { version = "1.0" } | |||
half = { version = "2.1", default-features = false, features = ["num-traits"] } | |||
sysinfo = { version = "0.32.0", optional = true, default-features = false, features = ["system"] } | |||
crc32fast = { version = "1.4.2", optional = true, default-features = false } | |||
simdutf8 = { version = "0.1.5"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be optional as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How mature is the library and its dependencies?
My random spike led me to https://github.com/rusticstuff/simdutf8/blob/main/src/implementation/aarch64/neon.rs#L16 and https://docs.rs/flexpect/latest/flexpect/ lacks documentation.
Should we help simdutf8 to bring it to arrow's maturity level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems it is just some macro helper for clippy split off as crate / dependency. Doesn't seem too bad.
I'm not sure that 5% really justifies an additional dependency, especially one that uses so much unsafe... |
Hm yeah wondering about that. I think that 5% speed up for Parquet might be quite valuable though, given that it often translates in close to 5% faster query execution for queries where Parquet scan is a bottleneck (quite some DF benchmarks actually involving string data). |
FWIW some other projects are using |
I am not sure exactly the usecase here, but what about simply disabling utf8 validation for known good data? |
The "use case" of this PR is just that utf8 validation takes time, this PR improves the performance. I think having a option to disable it makes sense, but would be good to minimize the cost of validation as well. |
So what shall we do with this PR? Make it an optional opt-in feature of the parquet crate that people can enable if they want more performance? |
I believe this PR is solely for performance enhancement. Introducing an optional opt-in feature deserves a separate PR. |
Coming from #6921 (comment), I have seen much larger performance (~15%) improvements using simdutf8 with StringViewArray + x86, especially when strings are long (>128 byte). |
What I suggest we do with this PR is get some end to end performance numbers (aka run the DataFusion clickbench benchmark with a pinned arrow version with this change) Assuming that looks promising I recommend creating a PR that has an optional feature (enabled by default) for using simdjson for utf8 validation. |
simdjson
This is my benchmark results with Clickbench non-partitioned and filter pushdown enabled. Benchmarked on x86 AMD 9900x. Some scan-dominate queries can get 20% improvements.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Dandandan and @findepi, @XiangpengHao @doki23 and @tustvold
@etseidl I wonder if you have any thoughts on this PR?
I double checked that this PR catches the important validation path in parquet. There are some other places where utf8 is validated, but they seem like they are relatively
It also appears this library is used by polars which gives me some confidence it is stable and will have community support if there are issues: https://crates.io/crates/simdutf8/reverse_dependencies
Thus I think we should proceed and add a flag to disable the feature as a follow on PR in case anyone would like to disable this
None that haven't already been voiced. It seems like a fairly low risk (especially if made optional) way to get a significant speed up in string handling. +1 |
Which issue does this PR close?
Closes #6667
Rationale for this change
Improves performance for about 4-5% (on M1 Pro) on strings (plain encoding):
What changes are included in this PR?
Are there any user-facing changes?