-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[blocked] Switch to Utf8View for TPC-H #476
Conversation
@@ -115,3 +117,18 @@ warnings = "deny" | |||
[workspace.lints.clippy] | |||
all = { level = "deny", priority = -1 } | |||
or_fun_call = "deny" | |||
|
|||
[patch.crates-io] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will revert all these cheeky bits once upstream releases with fixes
FWIW this is unblocked if you were to use master of datafusion |
@@ -170,7 +170,7 @@ impl TableProvider for VortexMemTable { | |||
/// The array is flattened directly into the nearest Arrow-compatible encoding. | |||
async fn scan( | |||
&self, | |||
state: &SessionState, | |||
state: &dyn Session, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like datafusion mainline changed the API here: apache/datafusion#11516
Initial TPC-H benchmarks comparison:
Going to profile some of the big boys, starting with q19 |
Alright, a bit warmer now:
Digging in on q10 and q21, it looks like Arrow's |
PR upstream for better take kernel: apache/arrow-rs#6168 |
|
||
impl IntoCanonical for VarBinArray { | ||
fn into_canonical(self) -> VortexResult<Canonical> { | ||
Ok(Canonical::VarBin(self)) | ||
fn into_byteview(array: &VarBinArray) -> ArrayRef { | ||
let mut builder = GenericByteViewBuilder::<BinaryViewType>::with_capacity(array.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is wrong: this is only if the max(offsets) of the varbin is <= u32::MAX
if it's >u32::MAX (i.e. if more than 4GB of strings in a single array) then should construct via the iterator
s.field("inline", &"i".to_string()); | ||
} else { | ||
s.field("ref", unsafe { &self._ref }); | ||
s.field("ref", &"r".to_string()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix these
Looks like this still need C Data interface fixes upstream |
Also improve some comments
@@ -196,33 +196,32 @@ fn pack_primitives( | |||
/// | |||
/// It is expected this function is only called from [try_canonicalize_chunks], and thus all chunks have | |||
/// been checked to have the same DType already. | |||
#[allow(unused)] | |||
fn pack_views( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fwiw you might find it easier to use the arrow view array builder to avoid alignment issues
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
until we have our own bytes type this is pretty fiddly/errorprone
CI won't succeed until this is merged: Benches will continue to be slower than regular Utf8 until this merges: We may also want to consider how we handle arrays that canonicalize to something that doesn't fit in Arrow. It's not crazy to build a dictionary-encoded array that has >2GB of string data in a single buffer. That would add complexity to either our internal logic, or the into_arrow logic. Discussion upstream on i32/u32 for BinaryView happening at apache/arrow-rs#6172 |
Alright, the above 2 PRs have merged into arrow-rs, which means we now need to wait for them to make their way into DataFusion to get the pytests passing. Looks like there's some amount of agreement on the discussion ticket that i32 should actually be used for BinaryView. We basically have a few options on our end for how we want our VarBinView array to work:
|
Superceded by #757 |
Blocked on apache/datafusion#11518 and apache/arrow-rs#6077