Make aggregate accumulators storage column-based #956
FYI @alamb this is a follow-up proposal on improving the hash aggregate implementation.
I think the idea of making accumulators (effectively) columnar is a great idea, FWIW. Mutable Arrow arrays would also be awesome if/when they become available.
@Dandandan

```rust
struct Accumulators {
    /// Logically maps group values to an index in `group_states`
    ///
    /// Uses the raw API of hashbrown to avoid actually storing the
    /// keys in the table
    ///
    /// keys: u64 hashes of the GroupValue
    /// values: (hash, index into `group_states`)
    map: RawTable<(u64, usize)>,
    // Accumulator state; keeps the state of each group
    accumulator_items: Vec<AccumulatorItem>,
    // group_states: Vec<GroupState>,
    group_by_values: Vec<Vec<ScalarValue>>,
    group_indices: Vec<Vec<u32>>,
}
```

`group_aggregate_batch`:

```rust
fn group_aggregate_batch(
    mode: &AggregateMode,
    random_state: &RandomState,
    group_expr: &[Arc<dyn PhysicalExpr>],
    aggr_expr: &[Arc<dyn AggregateExpr>],
    batch: RecordBatch,
    mut accumulators: Accumulators,
    aggregate_expressions: &[Vec<Arc<dyn PhysicalExpr>>],
) -> Result<Accumulators> {
    // evaluate the grouping expressions
    let group_values = evaluate(group_expr, &batch)?;

    // evaluate the aggregation expressions.
    // We could evaluate them after the `take`, but since we need to evaluate all
    // of them anyways, it is more performant to do it while they are together.
    let aggr_input_values = evaluate_many(aggregate_expressions, &batch)?;

    // 1.1 construct the key from the group values
    // 1.2 construct the mapping key if it does not exist
    // 1.3 add the row's index to `indices`

    // track which entries in `accumulators` have rows in this batch to aggregate
    let mut groups_with_rows = vec![];

    // 1.1 Calculate the group keys for the group values
    let mut batch_hashes = vec![0; batch.num_rows()];
    create_hashes(&group_values, random_state, &mut batch_hashes)?;

    for (row, hash) in batch_hashes.into_iter().enumerate() {
        let Accumulators {
            map,
            accumulator_items,
            group_by_values,
            group_indices,
        } = &mut accumulators;

        let entry = map.get_mut(hash, |(_hash, group_idx)| {
            let group_state_c = &group_by_values[*group_idx];
            group_values
                .iter()
                .zip(group_state_c.iter())
                .all(|(array, scalar)| scalar.eq_array(array, row))
        });

        match entry {
            // Existing entry for this group value
            Some((_hash, group_idx)) => {
                let indices = &mut group_indices[*group_idx];
                // 1.3
                if indices.is_empty() {
                    groups_with_rows.push(*group_idx);
                };
                indices.push(row as u32); // remember this row
            }
            // 1.2 Need to create new entry
            None => {
                // Copy group values out of arrays into `ScalarValue`s
                let col_group_by_values = group_values
                    .iter()
                    .map(|col| ScalarValue::try_from_array(col, row))
                    .collect::<Result<Vec<_>>>()?;
                let group_idx = group_by_values.len();
                group_by_values.push(col_group_by_values);
                // TODO: every aggregate's state needs to be initialized here,
                // not only the first accumulator's
                accumulator_items[0].init_state(group_idx);
                groups_with_rows.push(group_idx);
                group_indices.push(vec![row as u32]);
                // for hasher function, use precomputed hash value
                map.insert(hash, (hash, group_idx), |(hash, _group_idx)| *hash);
            }
        };
    }

    // Collect all indices + offsets based on keys in this vec
    let mut batch_indices_cc: UInt32Builder = UInt32Builder::new(0);
    let mut offsets = vec![0];
    let mut offset_so_far = 0;
    for group_idx in groups_with_rows.iter() {
        let indices = &accumulators.group_indices[*group_idx];
        batch_indices_cc.append_slice(indices)?;
        offset_so_far += indices.len();
        offsets.push(offset_so_far);
    }
    let batch_indices = batch_indices_cc.finish();

    // `Take` all values based on indices into Arrays
    let values: Vec<Vec<Arc<dyn Array>>> = aggr_input_values
        .iter()
        .map(|array| {
            array
                .iter()
                .map(|array| {
                    compute::take(
                        array.as_ref(),
                        &batch_indices,
                        None, // None: no index check
                    )
                    .unwrap()
                })
                .collect()
            // 2.3
        })
        .collect();

    // 2.1 for each key in this batch
    // 2.2 for each aggregation
    // 2.3 `slice` from each of its arrays the keys' values
    // 2.4 update / merge the accumulator with the values
    // 2.5 clear indices
    groups_with_rows
        .iter()
        .zip(offsets.windows(2))
        .try_for_each(|(group_idx, offsets)| {
            accumulators.group_indices[*group_idx].clear();
            accumulators
                .accumulator_items
                .iter_mut()
                .zip(values.iter())
                .try_for_each(|(accumulator, aggr_array)| {
                    let values = aggr_array
                        .iter()
                        .map(|array| array.slice(offsets[0], offsets[1] - offsets[0]))
                        .collect::<Vec<ArrayRef>>();
                    match mode {
                        AggregateMode::Partial => accumulator.update_batch(*group_idx, &values),
                        AggregateMode::FinalPartitioned | AggregateMode::Final => {
                            accumulator.merge_batch(*group_idx, &values)
                        }
                    }
                })
        });

    Ok(accumulators)
}
```

`Count`:

```rust
pub struct CountAccumulatorFly {
    count: Vec<u64>,
}

impl CountAccumulatorFly {
    /// new count accumulator
    pub fn new() -> Self {
        Self { count: vec![] }
    }
}

impl Accumulator for CountAccumulatorFly {
    fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
        let array = &values[0];
        self.count[0] += (array.len() - array.data().null_count()) as u64;
        Ok(())
    }

    fn update(&mut self, values: &[ScalarValue]) -> Result<()> {
        let value = &values[0];
        if !value.is_null() {
            self.count[0] += 1;
        }
        Ok(())
    }

    fn merge(&mut self, states: &[ScalarValue]) -> Result<()> {
        let count = &states[0];
        if let ScalarValue::UInt64(Some(delta)) = count {
            self.count[0] += *delta;
        } else {
            unreachable!()
        }
        Ok(())
    }

    fn merge_batch(&mut self, states: &[ArrayRef]) -> Result<()> {
        let counts = states[0].as_any().downcast_ref::<UInt64Array>().unwrap();
        let delta = &compute::sum(counts);
        if let Some(d) = delta {
            self.count[0] += *d;
        }
        Ok(())
    }

    fn state(&self) -> Result<Vec<ScalarValue>> {
        Ok(vec![ScalarValue::UInt64(Some(self.count[0]))])
    }

    fn evaluate(&self) -> Result<ScalarValue> {
        Ok(ScalarValue::UInt64(Some(self.count[0])))
    }
}

impl AccumulatorFly for CountAccumulatorFly {
    fn init_state(&mut self, index: usize) {
        assert_eq!(self.count.len(), index);
        self.count.push(0);
    }

    fn update_batch(&mut self, index: usize, values: &[ArrayRef]) -> Result<()> {
        let array = &values[0];
        self.count[index] += (array.len() - array.data().null_count()) as u64;
        Ok(())
    }

    fn update(&mut self, index: usize, values: &[ScalarValue]) -> Result<()> {
        let value = &values[0];
        if !value.is_null() {
            self.count[index] += 1;
        }
        Ok(())
    }

    fn merge(&mut self, index: usize, states: &[ScalarValue]) -> Result<()> {
        let count = &states[0];
        if let ScalarValue::UInt64(Some(delta)) = count {
            self.count[index] += *delta;
        } else {
            unreachable!()
        }
        Ok(())
    }

    fn merge_batch(&mut self, index: usize, states: &[ArrayRef]) -> Result<()> {
        let counts = states[0].as_any().downcast_ref::<UInt64Array>().unwrap();
        let delta = &compute::sum(counts);
        if let Some(d) = delta {
            self.count[index] += *d;
        }
        Ok(())
    }

    fn state(&self, index: usize) -> Result<Vec<ScalarValue>> {
        Ok(vec![ScalarValue::UInt64(Some(self.count[index]))])
    }

    fn evaluate(&self, index: usize) -> Result<ScalarValue> {
        Ok(ScalarValue::UInt64(Some(self.count[index])))
    }

    fn evaluate_all(&self) -> Result<ArrayRef> {
        ScalarValue::iter_to_array(self.count.iter().map(|x| ScalarValue::UInt64(Some(*x))))
    }

    fn state_all(&self) -> Result<Vec<Vec<ScalarValue>>> {
        let dt = Local::now();
        let result = Ok(vec![self
            .count
            .iter()
            .map(|x| ScalarValue::UInt64(Some(*x)))
            .collect()]);
        println!(
            "state_all usage millis: {}",
            Local::now().timestamp_millis() - dt.timestamp_millis()
        );
        result
    }
}
```
Did you do any performance profiling? There isn't a lot of "low hanging fruit" left in the GroupByHash implementation -- to get "several times" improvements in performance I think we are likely to have to start special casing (e.g. single-column vs. multi-column group by, as well as vectorized aggregators for certain column types).
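To make the "vectorized aggregators" idea concrete, here is a hedged sketch (not DataFusion code): COUNT state for all groups lives in one `Vec<u64>` and is updated in a single pass over the batch. Plain slices stand in for Arrow arrays, with `Option<u64>` modeling a nullable column; the function name and shapes are illustrative only.

```rust
// Columnar COUNT update: one Vec<u64> holds the state of every group.
// `group_ids[i]` is the group index of row i; `values[i]` is that row's
// (nullable) input value.
fn count_update_batch(counts: &mut Vec<u64>, group_ids: &[usize], values: &[Option<u64>]) {
    for (&g, v) in group_ids.iter().zip(values) {
        // grow the columnar state lazily as new groups appear
        if g >= counts.len() {
            counts.resize(g + 1, 0);
        }
        // COUNT only counts non-null inputs
        if v.is_some() {
            counts[g] += 1;
        }
    }
}
```

The point of the layout is that the hot loop touches one contiguous allocation instead of one boxed accumulator per group.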
@alamb For the count SQL, I used pprof to measure where the time is spent. The long time consumption is mainly in the following methods; the corresponding code snippets:

1. The per-group update loop:

```rust
groups_with_rows
    .iter()
    .zip(offsets.windows(2))
    .try_for_each(|(group_idx, offsets)| {
        accumulators.group_indices[*group_idx].clear();
        accumulators
            .accumulator_items
            .iter_mut()
            .zip(values.iter())
            .try_for_each(|(accumulator, aggr_array)| {
                let values = aggr_array
                    .iter()
                    .map(|array| array.slice(offsets[0], offsets[1] - offsets[0]))
                    .collect::<Vec<ArrayRef>>();
                match mode {
                    AggregateMode::Partial => accumulator.update_batch(*group_idx, &values),
                    AggregateMode::FinalPartitioned | AggregateMode::Final => {
                        accumulator.merge_batch(*group_idx, &values)
                    }
                }
            })
    });
```

2. hashbrown `get_mut`:

```rust
let entry = map.get_mut(hash, |(_hash, group_idx)| {
    let group_state_c = &group_by_values[*group_idx];
    group_values
        .iter()
        .zip(group_state_c.iter())
        .all(|(array, scalar)| scalar.eq_array(array, row)) // eq_array takes a long time
});
```

3. `create_hashes`:

```rust
create_hashes(&group_values, random_state, &mut batch_hashes)?;
```

4. `create_batch_from_map`:

```rust
let batch = create_batch_from_map(&mode, &accumulators, group_expr.len(), &schema);
```
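One way to attack the `eq_array` cost in hot spot 2 (a simplified sketch, not the DataFusion code): when the group key is a single fixed-width column, the key can be stored as a native value so the map's equality check is a plain integer compare instead of a per-row `ScalarValue` comparison. This uses `std::collections::HashMap` and skips the precomputed-hash trick for brevity; the function name is hypothetical.

```rust
use std::collections::HashMap;

// Map each row's native u64 key to a dense group index, assigning new
// indices in first-seen order. Equality inside the map is a u64 compare.
fn group_ids_for_batch(keys: &[u64], map: &mut HashMap<u64, usize>) -> Vec<usize> {
    keys.iter()
        .map(|&k| {
            let next_id = map.len();
            // existing key -> its group index; new key -> next free index
            *map.entry(k).or_insert(next_id)
        })
        .collect()
}
```

The same idea generalizes to multi-column keys by packing them into a fixed-width composite key where possible.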
Thanks for picking this up @ic4y. If you are getting a 10% improvement, this is already a fine achievement (if we don't slow down the low-cardinality case too much?). The storage is one part of the story; as @alamb says, it will require some changes in other places. Some ideas here:
@Dandandan Thank you very much for your suggestions. I tried two variants.

The first uses hashbrown's `get_each_mut` to look up a chunk of 16 hashes at a time, with separate passes for absent rows and the chunk remainder:

```rust
create_hashes(&group_values, random_state, &mut batch_hashes)?;

let mut absent_rows = vec![];
const CHUNK_SIZE: usize = 16;
let (chunks, remainder) = batch_hashes.as_chunks::<CHUNK_SIZE>();

for (num, chunk) in chunks.into_iter().enumerate() {
    let Accumulators { map, group_states } = &mut accumulators;
    let result = map.get_each_mut(*chunk, |index, (hash, group_idx)| {
        if batch_hashes[num * CHUNK_SIZE + index] != *hash {
            false
        } else {
            let group_state = &group_states[*group_idx];
            group_values
                .iter()
                .zip(group_state.group_by_values.iter())
                .all(|(array, scalar)| scalar.eq_array(array, index + num * CHUNK_SIZE))
        }
    });

    for (i, re) in result.iter().enumerate() {
        let row = num * CHUNK_SIZE + i;
        match re {
            Ok((_hash, group_idx)) => {
                let group_state = &mut group_states[*group_idx];
                // 1.3
                if group_state.indices.is_empty() {
                    groups_with_rows.push(*group_idx);
                };
                group_state.indices.push(row as u32); // remember this row
            }
            Err(unavailable) => match unavailable {
                UnavailableMutError::Absent => {
                    absent_rows.push(row);
                }
                UnavailableMutError::Duplicate(index) => {
                    let (_hash, group_idx) = result.get(*index).unwrap().as_ref().unwrap();
                    let group_state = &mut group_states[*group_idx];
                    // 1.3
                    if group_state.indices.is_empty() {
                        groups_with_rows.push(*group_idx);
                    };
                    group_state.indices.push(row as u32); // remember this row
                }
            },
        }
    }
}

// absent rows
for row in absent_rows {
    let hash = batch_hashes[row];
    let Accumulators { map, group_states } = &mut accumulators;
    let entry = map.get_mut(hash, |(_hash, group_idx)| {
        if *_hash != hash {
            false
        } else {
            let group_state = &group_states[*group_idx];
            group_values
                .iter()
                .zip(group_state.group_by_values.iter())
                .all(|(array, scalar)| scalar.eq_array(array, row))
        }
    });
    match entry {
        // Existing entry for this group value
        Some((_hash, group_idx)) => {
            let group_state = &mut group_states[*group_idx];
            // 1.3
            if group_state.indices.is_empty() {
                groups_with_rows.push(*group_idx);
            };
            group_state.indices.push(row as u32); // remember this row
        }
        // 1.2 Need to create new entry
        None => {
            let accumulator_set = create_accumulators(aggr_expr)
                .map_err(DataFusionError::into_arrow_external_error)?;
            // Copy group values out of arrays into `ScalarValue`s
            let group_by_values = group_values
                .iter()
                .map(|col| ScalarValue::try_from_array(col, row))
                .collect::<Result<Vec<_>>>()?;
            // Add new entry to group_states and save newly created index
            let group_state = GroupState {
                group_by_values: group_by_values.into_boxed_slice(),
                accumulator_set,
                indices: vec![row as u32], // 1.3
            };
            let group_idx = group_states.len();
            group_states.push(group_state);
            groups_with_rows.push(group_idx);
            // for hasher function, use precomputed hash value
            map.insert(hash, (hash, group_idx), |(hash, _group_idx)| *hash);
        }
    };
}

// remainder rows
for (i, &hash) in remainder.iter().enumerate() {
    let row = i + CHUNK_SIZE * chunks.len();
    let Accumulators { map, group_states } = &mut accumulators;
    let entry = map.get_mut(hash, |(_hash, group_idx)| {
        let group_state = &group_states[*group_idx];
        group_values
            .iter()
            .zip(group_state.group_by_values.iter())
            .all(|(array, scalar)| scalar.eq_array(array, row))
    });
    match entry {
        // Existing entry for this group value
        Some((_hash, group_idx)) => {
            let group_state = &mut group_states[*group_idx];
            // 1.3
            if group_state.indices.is_empty() {
                groups_with_rows.push(*group_idx);
            };
            group_state.indices.push(row as u32); // remember this row
        }
        // 1.2 Need to create new entry
        None => {
            let accumulator_set = create_accumulators(aggr_expr)
                .map_err(DataFusionError::into_arrow_external_error)?;
            // Copy group values out of arrays into `ScalarValue`s
            let group_by_values = group_values
                .iter()
                .map(|col| ScalarValue::try_from_array(col, row))
                .collect::<Result<Vec<_>>>()?;
            // Add new entry to group_states and save newly created index
            let group_state = GroupState {
                group_by_values: group_by_values.into_boxed_slice(),
                accumulator_set,
                indices: vec![row as u32], // 1.3
            };
            let group_idx = group_states.len();
            group_states.push(group_state);
            groups_with_rows.push(group_idx);
            // for hasher function, use precomputed hash value
            map.insert(hash, (hash, group_idx), |(hash, _group_idx)| *hash);
        }
    };
}
```

The second variant first collects, for every row, the candidate group indices for its hash via the raw `iter_hash` API, and only falls back to `get_mut` for rows without a match:

```rust
create_hashes(&group_values, random_state, &mut batch_hashes)?;

let map = &mut accumulators.map;
let mut allid = Vec::with_capacity(batch_hashes.len());
unsafe {
    for &hash in batch_hashes.iter() {
        let candidates = map.iter_hash(hash);
        allid.push(
            candidates
                .map(|bucket| {
                    let (_hash, group_idx) = bucket.as_ref();
                    *group_idx
                })
                .collect::<Vec<_>>(),
        )
    }
}

let mut gids = Vec::with_capacity(batch_hashes.len());
for (row, candidates) in allid.iter().enumerate() {
    let group_states = &mut accumulators.group_states;
    let mut gid = None;
    candidates.iter().for_each(|group_idx| {
        let group_state = &group_states[*group_idx];
        if group_values
            .iter()
            .zip(group_state.group_by_values.iter())
            .all(|(array, scalar)| scalar.eq_array(array, row))
        {
            gid = Some(*group_idx);
        }
    });
    gids.push(gid);
}

for (row, op_group_idx) in gids.into_iter().enumerate() {
    let Accumulators { map, group_states } = &mut accumulators;
    let hash = batch_hashes[row];
    match op_group_idx {
        None => {
            let entry = map.get_mut(hash, |(_hash, group_idx)| {
                let group_state = &group_states[*group_idx];
                group_values
                    .iter()
                    .zip(group_state.group_by_values.iter())
                    .all(|(array, scalar)| scalar.eq_array(array, row))
            });
            match entry {
                // Existing entry for this group value
                Some((_hash, group_idx)) => {
                    let group_state = &mut group_states[*group_idx];
                    // 1.3
                    if group_state.indices.is_empty() {
                        groups_with_rows.push(*group_idx);
                    };
                    group_state.indices.push(row as u32); // remember this row
                }
                // 1.2 Need to create new entry
                None => {
                    let accumulator_set = create_accumulators(aggr_expr)
                        .map_err(DataFusionError::into_arrow_external_error)?;
                    // Copy group values out of arrays into `ScalarValue`s
                    let group_by_values = group_values
                        .iter()
                        .map(|col| ScalarValue::try_from_array(col, row))
                        .collect::<Result<Vec<_>>>()?;
                    // Add new entry to group_states and save newly created index
                    let group_state = GroupState {
                        group_by_values: group_by_values.into_boxed_slice(),
                        accumulator_set,
                        indices: vec![row as u32], // 1.3
                    };
                    let group_idx = group_states.len();
                    group_states.push(group_state);
                    groups_with_rows.push(group_idx);
                    // for hasher function, use precomputed hash value
                    map.insert(hash, (hash, group_idx), |(hash, _group_idx)| *hash);
                }
            };
        }
        Some(group_idx) => {
            let group_state = &mut group_states[group_idx];
            // 1.3
            if group_state.indices.is_empty() {
                groups_with_rows.push(group_idx);
            };
            group_state.indices.push(row as u32); // remember this row
        }
    }
}
```

I really want to solve the performance problem of high-cardinality aggregation; it has troubled me for a long time. Is there any problem with the above code? How can we optimize further? Thanks.
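A note on where the remaining cost tends to sit: once a row-to-group-index mapping is known for the whole batch, simple aggregates do not need the `take` + per-group `slice` + `update_batch` round trip at all. A hedged sketch of the scatter-update alternative (plain `i64` slices instead of Arrow arrays, nulls omitted; not a drop-in replacement for the code above):

```rust
// Scatter-add SUM: one pass over the batch, no per-group `take`/`slice`.
// `sums` is the columnar SUM state for all groups; `group_ids[i]` is the
// group index assigned to row i during the hash-map probe phase.
fn sum_scatter(group_ids: &[usize], values: &[i64], sums: &mut [i64]) {
    for (&g, &v) in group_ids.iter().zip(values) {
        sums[g] += v;
    }
}
```

For high-cardinality data this removes the per-group allocation of sliced arrays, which is exactly the hot loop identified in the profile above.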
This proposal is very similar in spirit to what @tustvold and I are proposing in #4973 (comment).
(BTW there is lots of excitement on #4973 for anyone who wants to follow along.)
Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently, the aggregate code keeps the state of each group-by value by storing it in a `Vec` together with the group-by values. For low-cardinality aggregates this works OK, as there are only a limited number of keys and thus accumulators. But for medium/high-cardinality aggregates, the current way is inefficient:

- Each group keeps a `Vec` with a number of `AccumulatorItem`s, which adds at least 2 (`Box<dyn T>`) + 3 (empty `Vec`) + 3 (initial ) = 8 pointers (64 bytes) of overhead per item when storing one aggregate per group (e.g. one of count/avg/sum).
- State is kept and updated as `Vec<ScalarValue>`, with some cloning.
- Converting from `Vec`s and `ScalarValue`s and back to `Array`s again.

This issue is for the `Accumulator` state only, but a similar thing could be done for the `group_by_values` with a similar approach.

Describe the solution you'd like
We should define a trait that allows storing the required state in the accumulators in a contiguous/columnar manner. The idea here is that the required state can be stored in a `Vec`-like container, where each item at the index contains the current state. My proposal is to add an extra `index` to the methods, so the accumulators can update the state at a certain `index`. Some methods could be added to retrieve the entire state of the accumulator and/or convert the state values to array(s) in one go. If Arrow provides mutable arrays in the future, it could avoid the extra conversion step back to an `Array`.

For `Count` this would be changed, most notably storing the state as `Vec<u64>` (the `CountAccumulatorFly` code above). And the data structure of hash aggregates would change, moving the state out of the group state into `Accumulators` (the struct shown at the top).
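The proposed index-based API can be sketched as a trait. Names and signatures here are illustrative only (plain `i64` values instead of Arrow arrays and `ScalarValue`s), not the actual DataFusion trait:

```rust
// Sketch of an indexed accumulator: state for all groups lives in one
// contiguous Vec, and each method addresses one group's slot by index.
trait IndexedAccumulator {
    /// Initialize state for a newly created group at `index`
    fn init_state(&mut self, index: usize);
    /// Update the state at `index` with a batch of input values
    fn update_batch(&mut self, index: usize, values: &[i64]);
    /// Produce the final value for the group at `index`
    fn evaluate(&self, index: usize) -> i64;
}

struct CountState {
    counts: Vec<u64>,
}

impl IndexedAccumulator for CountState {
    fn init_state(&mut self, index: usize) {
        // groups are created in order, so the new slot is always at the end
        assert_eq!(self.counts.len(), index);
        self.counts.push(0);
    }
    fn update_batch(&mut self, index: usize, values: &[i64]) {
        self.counts[index] += values.len() as u64;
    }
    fn evaluate(&self, index: usize) -> i64 {
        self.counts[index] as i64
    }
}
```

Compared with one boxed `Accumulator` per group, the only per-group cost here is one `u64` slot.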
Describe alternatives you've considered
Additional context