Skip to content

Commit

Permalink
fix: collect_columns quadratic complexity (apache#11843)
Browse files Browse the repository at this point in the history
Fix accidental $O(n^2)$ in `collect_columns`.

There are the following ways to insert a clone into a hash set:

- **clone before check:** Basically `set.insert(x.clone())`. That's
  rather expensive if you have a lot of duplicates.
- **iter & clone:** That's what we do right now, but that's in $O(n^2)$.
- **check & insert:** Basically `if !set.contains(x) {set.insert(x.clone())}`
  That requires two hash probes though.
- **entry-based API:** Sadly the stdlib doesn't really offer any
  handle/entry-based APIs yet (see rust-lang/rust#60896 ),
  but `hashbrown` does, so we can use the nice
  `set.get_or_insert_owned(x)` which will only clone the reference if it
  doesn't exists yet and only hashes once.

We now use the last approach.
  • Loading branch information
crepererum authored Aug 6, 2024
1 parent e19e982 commit 16a3557
Showing 1 changed file with 3 additions and 4 deletions.
7 changes: 3 additions & 4 deletions datafusion/physical-expr/src/utils/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -17,9 +17,10 @@

mod guarantee;
pub use guarantee::{Guarantee, LiteralGuarantee};
use hashbrown::HashSet;

use std::borrow::Borrow;
use std::collections::{HashMap, HashSet};
use std::collections::HashMap;
use std::sync::Arc;

use crate::expressions::{BinaryExpr, Column};
Expand Down Expand Up @@ -204,9 +205,7 @@ pub fn collect_columns(expr: &Arc<dyn PhysicalExpr>) -> HashSet<Column> {
let mut columns = HashSet::<Column>::new();
expr.apply(|expr| {
if let Some(column) = expr.as_any().downcast_ref::<Column>() {
if !columns.iter().any(|c| c.eq(column)) {
columns.insert(column.clone());
}
columns.get_or_insert_owned(column);
}
Ok(TreeNodeRecursion::Continue)
})
Expand Down

0 comments on commit 16a3557

Please sign in to comment.