Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compressing VarBinArray to ConstantArray loses info about offsets_ptype #1021

Open
XinyuZeng opened this issue Oct 12, 2024 · 6 comments
Open

Comments

@XinyuZeng
Copy link

Hi Vortex, I am not sure this is the desired behavior. For example, if we compress a LargeBinary or LargeUtf8 Arrow Array into Vortex's ConstantArray and then canonicalize it back, we will get Binary or Utf8 Arrow Array. This is because VarBinArray::from_iter always uses the u32 offsets builder:

let mut builder = VarBinBuilder::<u32>::with_capacity(iter.size_hint().0);

This can be reproduced by running the round_trip_arrow_compressed test. It is ignored but Arrow now supports comparing Structs:

// Ignoring since Struct arrays don't currently support equality.
// https://github.com/apache/arrow-rs/issues/5199
#[ignore]
#[test]
fn round_trip_arrow_compressed() {

The taxi dataset has a field store_and_fwd_flag which is mostly N. It is reasonable for a ConstantArray to just use u32 offset but if we have a ChunkedArray where the first chunk is Constant and the second chunk is not, we may have inconsistent Arrow schema between output RecordBatches? (while this may be the problem of Arrow missing a logical type)

@a10y
Copy link
Contributor

a10y commented Oct 14, 2024

Hey @XinyuZeng , thanks for taking a look! You are correct that in general, while we do our best to preserve round-tripping between Arrow arrays and the nearest Vortex encoding, after compression we can't guarantee to give you back the exact same Arrow encoding you started with. For example, if you provided a StringViewArray or BinaryViewArray, we would still give you back a StringArray / BinaryArray.

This is because of the Vortex type system encodes logical types (e.g. "these bytes are Utf8 encoded") VS the Arrow physical encoding types. Each of our logical types has a blessed "canonical" encoding which can represent values of that type while being zero-copy to Arrow.

To provide an exact round-trip guarantee would require storing information somewhere about the original Arrow encoding, which I don't think is something we're considering right now.

@XinyuZeng
Copy link
Author

Got it, thanks! Still wondering will this have a potential issue when connecting to DataFusion since they are using Arrow physical schema directly. For example, we always got Utf8 in schema (

DType::Utf8(_) => DataType::Utf8,
) but the canonical Array may be LargeUtf8 (
PType::I64 => Arc::new(unsafe {
LargeStringArray::new_unchecked(
). Maybe a corner case though, just accidentally think of.

@robert3005
Copy link
Member

This is indeed an issue. This particular case would be solved by #757. I don't recall why we didn't take the schema from the array. Likely needs to add a new trait i.e. ArrowDType and implement it for all our arrays to solve this issue. Then we could produce the true canonical dtype.

@robert3005
Copy link
Member

robert3005 commented Oct 15, 2024

@a10y pointed out that we might have enough metadata in our files and that we are still missing metadata in our inmemory arrays. However, we should make batches returned to datafusion have consistent dtypes

@XinyuZeng
Copy link
Author

I see. Btw datafusion has a plan to separate logical type out: apache/datafusion#11513. If that proposal is implemented then maybe there is no need for vortex to ensure consistent dtypes for output batches.

@a10y
Copy link
Contributor

a10y commented Oct 15, 2024

Yep we're tracking that keenly and agree it will be likely to help here 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants