-
Notifications
You must be signed in to change notification settings - Fork 837
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(arrow-select): concat
kernel will merge dictionary values for list of dictionaries
#6893
feat(arrow-select): concat
kernel will merge dictionary values for list of dictionaries
#6893
Conversation
TODO: - [ ] Add support to nested lists - [ ] Add more tests - [ ] Fix failing test
concat
kernel will merge dictionary values for list of dictionaries
arrow-select/src/concat.rs
Outdated
@@ -129,12 +130,106 @@ fn concat_dictionaries<K: ArrowDictionaryKeyType>( | |||
Ok(Arc::new(array)) | |||
} | |||
|
|||
fn concat_list_of_dictionaries<OffsetSize: OffsetSizeTrait, K: ArrowDictionaryKeyType>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would significantly reduce the codegen to instead split concatenating the dictionary values, from re-encoding the offsets. As an added bonus this could also be done recursively.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated, please let me know it that was what you meant, I think it will be slower this way, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've taken a quick look and left some suggestions on how to simplify this, in particular separating the concatenation of the list values and the list offsets will result in less code, require less codegen, and likely perform better.
i.e. something like (not tested)
DataType::List(f) => {
let values: Vec<_> = arrays.iter().map(|x| x.as_list::<i32>().values()).collect();
let new_values = concat(&values)?;
let offsets = OffsetBuffer::from_lengths(
arrays.iter().flat_map(|x| x.as_list::<i32>().offsets()
.windows(2).map(|w| w[1].as_usize() - w[0].as_usize()))
);
let nulls = ...
ListArray::new(f, offsets, new_values, nulls)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking very promising, just a small comment r.e. avoiding ArrayData
I'll try to find some time for a more thorough review in the next few days if nobody else gets there first
Which issue does this PR close?
Closes #6888
What changes are included in this PR?
merge
function inBufferOffset
for merging multipleBuffersOffset
into one (which needed for the concat merge)Are there any user-facing changes?
Yes, the
merge
function in theBufferOffset
struct