Faster parquet utf8 validation using `simdjson` #6668

Dandandan · 2024-10-31T19:22:52Z

Which issue does this PR close?

Rationale for this change

Improves performance by about 4-5% (on M1 Pro) on strings (plain encoding):

arrow_array_reader/StringArray/plain encoded, mandatory, no NULLs
                        time:   [740.81 µs 746.51 µs 752.11 µs]
                        change: [-5.8127% -5.2637% -4.6414%] (p = 0.00 < 0.05)
                        Performance has improved.
arrow_array_reader/StringArray/plain encoded, optional, no NULLs
                        time:   [743.62 µs 748.70 µs 754.14 µs]
                        change: [-4.2825% -3.6551% -3.0212%] (p = 0.00 < 0.05)
                        Performance has improved.
arrow_array_reader/StringArray/plain encoded, optional, half NULLs
                        time:   [633.43 µs 638.47 µs 643.71 µs]
                        change: [-5.1930% -4.5414% -3.8189%] (p = 0.00 < 0.05)

What changes are included in this PR?

Are there any user-facing changes?

Dandandan · 2024-10-31T19:34:46Z

parquet/src/arrow/array_reader/byte_view_array.rs

        Ok(_) => Ok(()),
-        Err(e) => Err(general_err!("encountered non UTF-8 data: {}", e)),
+        Err(_) => {
+            let e = simdutf8::compat::from_utf8(val).unwrap_err();


We call `simdutf8::compat::from_utf8` again to get the same error as before; the faster `basic` variant does not report error details.

The role of simdutf8::basic::from_utf8 and the re-run with simdutf8::compat: does this deserve a code comment?

(at least the .unwrap_err() safety deserves one)

same in offset_buffer.rs

Yeah, I agree this deserves some comments explaining why we rerun it in case of an error.

If there is a positive sentiment about using simdutf8 for faster validation, I can do so.

Could we have our own from_utf8 that wraps the simdutf8 implementation? Then the weird basic/compat path would only need to be documented once (and make it easier to replace other from_utf8 invocations that @alamb identified).

I'll make a PR that does this in the next few days if no one beats me to it
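A minimal sketch of what such a wrapper could look like, so the fast-check-then-rerun pattern is documented in one place. To keep the sketch compilable without extra crates, `std::str::from_utf8` stands in for both passes; the comments mark where `simdutf8::basic::from_utf8` and `simdutf8::compat::from_utf8` would go. The names `check_valid_utf8`, `Utf8CheckError`, `fast_is_valid`, and `detailed_error` are hypothetical and not taken from the PR.

```rust
use std::fmt;

/// Hypothetical error type for this sketch (the PR itself uses `general_err!`).
#[derive(Debug, PartialEq)]
pub struct Utf8CheckError {
    /// Number of leading bytes that are valid UTF-8.
    pub valid_up_to: usize,
}

impl fmt::Display for Utf8CheckError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "encountered non UTF-8 data: invalid sequence after byte {}",
            self.valid_up_to
        )
    }
}

/// Fast pass: only answers "valid or not".
/// Stand-in for `simdutf8::basic::from_utf8(val).is_ok()`.
fn fast_is_valid(val: &[u8]) -> bool {
    std::str::from_utf8(val).is_ok()
}

/// Detailed pass: recomputes where validation failed.
/// Stand-in for `simdutf8::compat::from_utf8(val).err()`.
fn detailed_error(val: &[u8]) -> Option<std::str::Utf8Error> {
    std::str::from_utf8(val).err()
}

/// Validate `val`, paying for error details only on the (rare) failure path.
pub fn check_valid_utf8(val: &[u8]) -> Result<(), Utf8CheckError> {
    if fast_is_valid(val) {
        return Ok(());
    }
    // The fast pass rejected `val`, so the detailed pass must reject it too.
    // This is the invariant that makes the `.unwrap_err()` in the diff sound.
    match detailed_error(val) {
        Some(e) => Err(Utf8CheckError {
            valid_up_to: e.valid_up_to(),
        }),
        None => unreachable!("fast and detailed validators disagree"),
    }
}
```

With a wrapper like this, call sites in `byte_view_array.rs` and `offset_buffer.rs` would only call `check_valid_utf8(val)?`, and the basic/compat explanation lives in a single doc comment.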

Dandandan · 2024-10-31T19:35:15Z

parquet/Cargo.toml

@@ -69,6 +69,7 @@ paste = { version = "1.0" }
 half = { version = "2.1", default-features = false, features = ["num-traits"] }
 sysinfo = { version = "0.32.0", optional = true, default-features = false, features = ["system"] }
 crc32fast = { version = "1.4.2", optional = true, default-features = false }
+simdutf8 = { version = "0.1.5"}


It could be optional as well.

How mature is the library and its dependencies?
My random spike led me to https://github.com/rusticstuff/simdutf8/blob/main/src/implementation/aarch64/neon.rs#L16 and https://docs.rs/flexpect/latest/flexpect/ lacks documentation.
Should we help simdutf8 to bring it to arrow's maturity level?

It seems it is just some macro helper for clippy split off as crate / dependency. Doesn't seem too bad.

tustvold · 2024-10-31T20:17:55Z

I'm not sure that 5% really justifies an additional dependency, especially one that uses so much unsafe...

Dandandan · 2024-10-31T20:31:11Z

> I'm not sure that 5% really justifies an additional dependency, especially one that uses so much unsafe...

Hm yeah wondering about that.

I think that a 5% speedup for Parquet might be quite valuable though, given that it often translates into close to 5% faster query execution for queries where the Parquet scan is a bottleneck (quite a few DataFusion benchmarks actually involve string data).

Dandandan · 2024-11-01T12:12:01Z

FWIW some other projects are using simdutf8 as well, like polars https://github.com/pola-rs/polars/blob/main/Cargo.toml#L77 and simd-json

alamb · 2024-11-07T22:57:06Z

I am not sure exactly the usecase here, but what about simply disabling utf8 validation for known good data?

Proposal: Add unsafe option to disable UTF8 validation on parquet read #6701

Dandandan · 2024-11-08T05:23:30Z

> I am not sure exactly the usecase here, but what about simply disabling utf8 validation for known good data?
>
> Proposal: Add unsafe option to disable UTF8 validation on parquet read #6701

The "use case" of this PR is just that utf8 validation takes time, this PR improves the performance.

I think having an option to disable it makes sense, but it would be good to minimize the cost of validation as well.
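To make the two approaches concrete, here is a small sketch of a validated decode next to an opt-out one, using only the standard library. The function names `decode_checked` and `decode_unchecked` are hypothetical; they are not the API proposed in #6701. The opt-out is an `unsafe fn` because skipping validation is only sound when the caller already knows the bytes are valid UTF-8.

```rust
use std::str::Utf8Error;

/// Validating path: pays the UTF-8 check on every call.
pub fn decode_checked(val: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(val)
}

/// Opt-out path for known-good data: no validation cost at all.
///
/// # Safety
/// The caller must guarantee that `val` is valid UTF-8; otherwise later
/// string operations on the result are undefined behavior.
pub unsafe fn decode_unchecked(val: &[u8]) -> &str {
    // SAFETY: upheld by the caller, per this function's contract.
    unsafe { std::str::from_utf8_unchecked(val) }
}
```

Making the checked path cheaper helps every reader, while the unchecked path only helps callers who can vouch for their data, which is why the two are complementary rather than alternatives.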

alamb · 2024-12-17T20:03:22Z

So what shall we do with this PR? Make it an optional opt-in feature of the parquet crate that people can enable if they want more performance?

doki23 · 2024-12-24T03:19:11Z

I believe this PR is solely for performance enhancement. Introducing an optional opt-in feature deserves a separate PR.

XiangpengHao · 2025-01-01T16:22:57Z

> Improves performance by about 4-5% (on M1 Pro) on strings (plain encoding)

Coming from #6921 (comment), I have seen much larger performance improvements (~15%) using simdutf8 with StringViewArray on x86, especially when strings are long (>128 bytes).

alamb · 2025-01-01T17:40:15Z

What I suggest we do with this PR is get some end-to-end performance numbers (i.e., run the DataFusion ClickBench benchmark with a pinned arrow version that includes this change).

Assuming that looks promising, I recommend creating a PR that has an optional feature (enabled by default) for using simdutf8 for utf8 validation.
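As a rough illustration of how such a feature gate could look on the Rust side, the sketch below switches between simdutf8 and the standard library at compile time. The feature name `simdutf8` and the helper `is_valid_utf8` are assumptions made for this example, not names settled in this thread; in `Cargo.toml` the dependency would be marked `optional = true` and the feature listed under `default`.

```rust
/// Validation backed by simdutf8, compiled only when the (assumed)
/// `simdutf8` Cargo feature is enabled.
#[cfg(feature = "simdutf8")]
pub fn is_valid_utf8(val: &[u8]) -> bool {
    simdutf8::basic::from_utf8(val).is_ok()
}

/// Fallback using the standard library when the feature is disabled,
/// so the crate builds without the extra dependency.
#[cfg(not(feature = "simdutf8"))]
pub fn is_valid_utf8(val: &[u8]) -> bool {
    std::str::from_utf8(val).is_ok()
}
```

Callers see a single `is_valid_utf8` either way, so users who want to avoid the dependency can build with default features off without any source changes.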

XiangpengHao · 2025-01-02T16:05:02Z

These are my benchmark results with ClickBench, non-partitioned and with filter pushdown enabled, benchmarked on an x86 AMD 9900X. Some scan-dominated queries get up to 20% improvements.

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ no-simd-utf8 ┃ simd-utf8 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │       0.35ms │    0.36ms │     no change │
│ QQuery 1     │      50.50ms │   50.15ms │     no change │
│ QQuery 2     │      68.05ms │   67.37ms │     no change │
│ QQuery 3     │      89.46ms │   88.81ms │     no change │
│ QQuery 4     │     463.69ms │  459.30ms │     no change │
│ QQuery 5     │     527.10ms │  481.84ms │ +1.09x faster │
│ QQuery 6     │      50.25ms │   50.92ms │     no change │
│ QQuery 7     │      55.59ms │   55.34ms │     no change │
│ QQuery 8     │     562.75ms │  561.54ms │     no change │
│ QQuery 9     │     580.23ms │  580.26ms │     no change │
│ QQuery 10    │     159.95ms │  155.25ms │     no change │
│ QQuery 11    │     172.33ms │  168.08ms │     no change │
│ QQuery 12    │     730.38ms │  653.11ms │ +1.12x faster │
│ QQuery 13    │    1205.87ms │ 1100.43ms │ +1.10x faster │
│ QQuery 14    │     739.05ms │  666.65ms │ +1.11x faster │
│ QQuery 15    │     550.73ms │  552.73ms │     no change │
│ QQuery 16    │    1188.05ms │ 1159.64ms │     no change │
│ QQuery 17    │    1159.66ms │ 1146.29ms │     no change │
│ QQuery 18    │    2641.59ms │ 2652.32ms │     no change │
│ QQuery 19    │      81.54ms │   90.09ms │  1.10x slower │
│ QQuery 20    │     610.90ms │  589.75ms │     no change │
│ QQuery 21    │     705.48ms │  663.11ms │ +1.06x faster │
│ QQuery 22    │    1659.48ms │ 1357.19ms │ +1.22x faster │
│ QQuery 23    │    3534.78ms │ 3639.67ms │     no change │
│ QQuery 24    │     533.81ms │  475.18ms │ +1.12x faster │
│ QQuery 25    │     470.34ms │  373.89ms │ +1.26x faster │
│ QQuery 26    │     576.26ms │  476.26ms │ +1.21x faster │
│ QQuery 27    │    1121.43ms │ 1056.52ms │ +1.06x faster │
│ QQuery 28    │    4291.35ms │ 4288.26ms │     no change │
│ QQuery 29    │     228.93ms │  235.60ms │     no change │
│ QQuery 30    │     589.98ms │  541.70ms │ +1.09x faster │
│ QQuery 31    │     716.59ms │  702.51ms │     no change │
│ QQuery 32    │    2593.06ms │ 2528.18ms │     no change │
│ QQuery 33    │    2362.52ms │ 2336.75ms │     no change │
│ QQuery 34    │    2360.44ms │ 2334.17ms │     no change │
│ QQuery 35    │     705.78ms │  696.32ms │     no change │
│ QQuery 36    │     163.69ms │  156.20ms │     no change │
│ QQuery 37    │     129.14ms │   97.25ms │ +1.33x faster │
│ QQuery 38    │      88.72ms │   90.76ms │     no change │
│ QQuery 39    │     218.61ms │  213.61ms │     no change │
│ QQuery 40    │      71.73ms │   73.40ms │     no change │
│ QQuery 41    │      68.23ms │   67.56ms │     no change │
│ QQuery 42    │      69.22ms │   68.76ms │     no change │
└──────────────┴──────────────┴───────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary           ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (no-simd-utf8)   │ 34947.62ms │
│ Total Time (simd-utf8)      │ 33803.08ms │
│ Average Time (no-simd-utf8) │   812.74ms │
│ Average Time (simd-utf8)    │   786.12ms │
│ Queries Faster              │         12 │
│ Queries Slower              │          1 │
│ Queries with No Change      │         30 │
└─────────────────────────────┴────────────┘
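For reading the table, the "Change" column is the baseline time divided by the new time, which is not the same number as the percentage of time saved. A small sketch of both calculations, with hypothetical helper names `speedup` and `percent_saved`:

```rust
/// Speedup factor as shown in the "Change" column: baseline / new.
pub fn speedup(baseline_ms: f64, new_ms: f64) -> f64 {
    baseline_ms / new_ms
}

/// Share of the baseline time that was saved, in percent.
pub fn percent_saved(baseline_ms: f64, new_ms: f64) -> f64 {
    (baseline_ms - new_ms) / baseline_ms * 100.0
}
```

For example, `speedup(129.14, 97.25)` is about 1.33 (Query 37), while `percent_saved(34947.62, 33803.08)` is about 3.3 for the total run time across all 43 queries.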

alamb

Thank you @Dandandan and @findepi, @XiangpengHao @doki23 and @tustvold

@etseidl I wonder if you have any thoughts on this PR?

I double checked that this PR catches the important validation path in parquet. There are some other places where utf8 is validated, but they seem like they are relatively

https://github.com/search?q=repo%3Aapache%2Farrow-rs+from_utf8+path%3A%2F%5Eparquet%5C%2Fsrc%5C%2F%2F&type=code

It also appears this library is used by polars which gives me some confidence it is stable and will have community support if there are issues: https://crates.io/crates/simdutf8/reverse_dependencies

Thus I think we should proceed, and add a flag to disable the feature in a follow-on PR in case anyone would like to turn it off.

etseidl · 2025-01-08T23:45:35Z

> @etseidl I wonder if you have any thoughts on this PR?

None that haven't already been voiced. It seems like a fairly low risk (especially if made optional) way to get a significant speed up in string handling.

+1

Faster utf8 validation

eeb57e3

github-actions bot added the parquet Changes to the parquet crate label Oct 31, 2024

Move dependency

adbd07a

Dandandan commented Oct 31, 2024

alamb mentioned this pull request Jan 1, 2025

[POC] Experimental parquet decoder with first-class selection pushdown support #6921

Draft

alamb changed the title ~~Faster utf8 validation~~ Faster parquet utf8 validation using simdjson Jan 1, 2025

alamb approved these changes Jan 8, 2025
