Support vectorized append and compare for multi group by (#12996) · apache/datafusion@345117b

* simple support vectorized append.

* fix tests.

* some logs.

* add `append_n` in `MaybeNullBufferBuilder`.

* impl basic append_batch

* fix equal to.

* define `GroupIndexContext`.

* define the structs useful in vectorizing.

* re-define some structs for vectorized operations.

* impl some vectorized logic.

* impl checking hashmap stage.

* fix compile.

* tmp

* define and impl `vectorized_compare`.

* fix compile.

* impl `vectorized_equal_to`.

* impl `vectorized_append`.

* finish the basic vectorized ops logic.

* impl `take_n`.

* fix `renaming clear` and `groups fill`.

* fix infinite loop due to rehashing.

* fix vectorized append.

* add counter.

* use extend rather than resize.

* remove dbg!.

* remove reserve.

* refactor the code to make it simpler and more performant.

* clear `scalarized_indices` in `intern` to avoid some corner cases.

* fix `scalarized_equal_to`.

* fall back to the fully scalarized `GroupValuesColumn` in streaming aggregation.

* add unit test for `VectorizedGroupValuesColumn`.

* add unit test for emitting first n in `VectorizedGroupValuesColumn`.

* sort out test code for group columns and add vectorized tests for primitives.

* add vectorized test for byte builder.

* add vectorized test for byte view builder.

* add test for the all-nulls and no-nulls branches in vectorized ops.

* fix clippy.

* fix fmt.

* fix compile in rust 1.79.

* improve comments.

* fix doc.

* add more comments to explain the really complex vectorized intern process.

* add comments to explain why we still need origin `GroupValuesColumn`.

* remove some stale comments.

* fix clippy.

* add comments for `vectorized_equal_to` and `vectorized_append`.

* fix clippy.

* use zip to simplify code.

* use izip to simplify code.

* Update datafusion/physical-plan/src/aggregates/group_values/group_column.rs

Co-authored-by: Jay Zhan <[email protected]>

* first_n attempt

Signed-off-by: jayzhan211 <[email protected]>

* add test

Signed-off-by: jayzhan211 <[email protected]>

* improve hashtable modifying in emit first n test.

* add `emit_group_index_list_buffer` to avoid allocating a new `Vec` to store the remaining group indices.

* make comments in VectorizedGroupValuesColumn::intern simpler and clearer.

* define `VectorizedOperationBuffers` to hold buffers used in vectorized operations to make code clearer.

* unify `VectorizedGroupValuesColumn` and `GroupValuesColumn`.

* fix fmt.

* fix comments.

* fix clippy.

---------

Signed-off-by: jayzhan211 <[email protected]>
Co-authored-by: Jay Zhan <[email protected]>
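
To illustrate what "vectorized append and compare" means in the bullets above, here is a minimal sketch in plain Rust. It is not the DataFusion implementation: `PrimitiveGroupColumn` is a hypothetical, simplified stand-in that stores group values in a `Vec<Option<i64>>` instead of Arrow buffers, and only the method names `vectorized_append` and `vectorized_equal_to` are borrowed from the commit message. The idea it shows is that each call handles a whole batch of row indices, rather than one row per call.

```rust
/// Hypothetical, simplified stand-in for one group-by column.
/// The real code works on Arrow arrays; this sketch uses `Vec<Option<i64>>`.
#[derive(Default)]
struct PrimitiveGroupColumn {
    /// Stored group values; `None` models a NULL group key.
    values: Vec<Option<i64>>,
}

impl PrimitiveGroupColumn {
    /// Append every input row listed in `rows` as a new group value,
    /// in a single call instead of one call per row.
    fn vectorized_append(&mut self, input: &[Option<i64>], rows: &[usize]) {
        self.values.extend(rows.iter().map(|&row| input[row]));
    }

    /// Compare stored values at `lhs_rows` with input values at `rhs_rows`.
    /// Results are written into `equal_to_results`; entries that are already
    /// `false` are skipped, so calls for several columns can be chained.
    fn vectorized_equal_to(
        &self,
        lhs_rows: &[usize],
        input: &[Option<i64>],
        rhs_rows: &[usize],
        equal_to_results: &mut [bool],
    ) {
        let pairs = lhs_rows.iter().zip(rhs_rows);
        for ((&lhs, &rhs), result) in pairs.zip(equal_to_results.iter_mut()) {
            if !*result {
                // An earlier column already ruled this row out.
                continue;
            }
            // In this sketch, NULL (`None`) compares equal to NULL,
            // which is the usual grouping semantics.
            *result = self.values[lhs] == input[rhs];
        }
    }
}
```

With a multi-column group by, one would call `vectorized_equal_to` once per column on the same `equal_to_results` buffer, so a row only stays `true` if every column matches.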

Rachelint and jayzhan211 authored Nov 6, 2024

1 parent c3a9847 commit 345117b

datafusion/common/src/utils/memory.rs

            
                      Original file line number
                      Diff line number
                      Diff line change
                  
    @@ -102,7 +102,7 @@ pub fn estimate_memory_size<T>(num_elements: usize, fixed_size: usize) -> Result
  
    #[cfg(test)]

    mod tests {

        use std::collections::HashSet;

        use std::{collections::HashSet, mem::size_of};

        use super::estimate_memory_size;

datafusion/core/tests/user_defined/user_defined_aggregates.rs

@@ -19,6 +19,7 @@
     //! user defined aggregate functions
     use std::hash::{DefaultHasher, Hash, Hasher};
+    use std::mem::{size_of, size_of_val};
     use std::sync::{
         atomic::{AtomicBool, Ordering},
         Arc,
@@ Expand Down @@
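
Both hunks add explicit imports from `std::mem`. This likely corresponds to the "fix compile in rust 1.79" bullet, since `size_of` and `size_of_val` were only added to the standard prelude in Rust 1.80 and must be imported by name on older toolchains. A minimal sketch of the pattern follows; the `sizes` function is a hypothetical helper for illustration and does not appear in the commit.

```rust
// Explicit import so the code also compiles on toolchains older than
// Rust 1.80, where these functions are not in the prelude.
use std::mem::{size_of, size_of_val};

/// Hypothetical helper: returns the size of one `u64` and the size of
/// a four-element `u64` array, both in bytes.
fn sizes() -> (usize, usize) {
    let values: [u64; 4] = [1, 2, 3, 4];
    (size_of::<u64>(), size_of_val(&values))
}
```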
