Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster parquet utf8 validation using simdjson #6668

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions parquet/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -69,6 +69,7 @@ paste = { version = "1.0" }
half = { version = "2.1", default-features = false, features = ["num-traits"] }
sysinfo = { version = "0.32.0", optional = true, default-features = false, features = ["system"] }
crc32fast = { version = "1.4.2", optional = true, default-features = false }
simdutf8 = { version = "0.1.5"}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be optional as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How mature is the library and its dependencies?
My random spike led me to https://github.com/rusticstuff/simdutf8/blob/main/src/implementation/aarch64/neon.rs#L16 and https://docs.rs/flexpect/latest/flexpect/ lacks documentation.
Should we help simdutf8 to bring it to arrow's maturity level?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems it is just some macro helper for clippy split off as crate / dependency. Doesn't seem too bad.


[dev-dependencies]
base64 = { version = "0.22", default-features = false, features = ["std"] }
Expand Down
7 changes: 5 additions & 2 deletions parquet/src/arrow/array_reader/byte_view_array.rs
Original file line number Diff line number Diff line change
Expand Up @@ -679,9 +679,12 @@ impl ByteViewArrayDecoderDelta {

/// Check that `val` is a valid UTF-8 sequence
pub fn check_valid_utf8(val: &[u8]) -> Result<()> {
match std::str::from_utf8(val) {
match simdutf8::basic::from_utf8(val) {
Ok(_) => Ok(()),
Err(e) => Err(general_err!("encountered non UTF-8 data: {}", e)),
Err(_) => {
let e = simdutf8::compat::from_utf8(val).unwrap_err();
Copy link
Contributor Author

@Dandandan Dandandan Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We call compat from_utf8 again to get the same error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the role of simdutf8::basic::from_utf8 and re-run with simdutf8::compat -- does this deserve a code comment?

(at least the .unwrap_err() safety deserves one)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same in offset_buffer.rs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree deserves some comments explaining why we rerun it in case of error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is a positive sentiment about using simdutf8 for faster validation, I can do so.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we have our own from_utf8 that wraps the simdutf8 implementation? Then the weird basic/compat path would only need to be documented once (and make it easier to replace other from_utf8 invocations that @alamb identified).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make a PR that does this in the next few days if no one beats me to it

Err(general_err!("encountered non UTF-8 data: {}", e))
}
}
}

Expand Down
8 changes: 6 additions & 2 deletions parquet/src/arrow/buffer/offset_buffer.rs
Original file line number Diff line number Diff line change
Expand Up @@ -117,9 +117,13 @@ impl<I: OffsetSizeTrait> OffsetBuffer<I> {
///
/// [`Self::try_push`] can perform this validation check on insertion
pub fn check_valid_utf8(&self, start_offset: usize) -> Result<()> {
match std::str::from_utf8(&self.values.as_slice()[start_offset..]) {
match simdutf8::basic::from_utf8(&self.values.as_slice()[start_offset..]) {
Ok(_) => Ok(()),
Err(e) => Err(general_err!("encountered non UTF-8 data: {}", e)),
Err(_) => {
let e = simdutf8::compat::from_utf8(&self.values.as_slice()[start_offset..])
.unwrap_err();
Err(general_err!("encountered non UTF-8 data: {}", e))
}
}
}

Expand Down
Loading