Skip to content

Commit

Permalink
Merge branch 'main' into feature/string_to_array
Browse files Browse the repository at this point in the history
# Conflicts:
#	datafusion/sqllogictest/test_files/string/string_view.slt
  • Loading branch information
Omega359 committed Nov 16, 2024
2 parents a8ed675 + 6d8313e commit 98a2828
Show file tree
Hide file tree
Showing 78 changed files with 2,007 additions and 873 deletions.
2 changes: 1 addition & 1 deletion Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -145,7 +145,7 @@ recursive = "0.1.1"
regex = "1.8"
rstest = "0.23.0"
serde_json = "1"
sqlparser = { version = "0.51.0", features = ["visitor"] }
sqlparser = { version = "0.52.0", features = ["visitor"] }
tempfile = "3"
tokio = { version = "1.36", features = ["macros", "rt", "sync"] }
url = "2.2"
Expand Down
24 changes: 24 additions & 0 deletions benchmarks/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -330,6 +330,30 @@ steps.
The tests sort the entire dataset using several different sort
orders.

## Sort TPCH

Test performance of end-to-end sort SQL queries. (While the `Sort` benchmark focuses on a single sort executor, this benchmark tests how sorting is executed across multiple CPU cores by benchmarking sorting the whole relational table.)

Sort integration benchmark runs whole table sort queries on TPCH `lineitem` table, with different characteristics. For example, different number of sort keys, different sort key cardinality, different number of payload columns, etc.

See [`sort_tpch.rs`](src/sort_tpch.rs) for more details.

### Sort TPCH Benchmark Example Runs
1. Run all queries with default setting:
```bash
cargo run --release --bin dfbench -- sort-tpch -p '....../datafusion/benchmarks/data/tpch_sf1' -o '/tmp/sort_tpch.json'
```

2. Run a specific query:
```bash
cargo run --release --bin dfbench -- sort-tpch -p '....../datafusion/benchmarks/data/tpch_sf1' -o '/tmp/sort_tpch.json' --query 2
```

3. Run all queries with `bench.sh` script:
```bash
./bench.sh run sort_tpch
```

## IMDB

Run Join Order Benchmark (JOB) on IMDB dataset.
Expand Down
18 changes: 18 additions & 0 deletions benchmarks/bench.sh
Original file line number Diff line number Diff line change
Expand Up @@ -75,6 +75,7 @@ tpch10: TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB),
tpch_mem10: TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), query from memory
parquet: Benchmark of parquet reader's filtering speed
sort: Benchmark of sorting speed
sort_tpch: Benchmark of sorting speed for end-to-end sort queries on TPCH dataset
clickbench_1: ClickBench queries against a single parquet file
clickbench_partitioned: ClickBench queries against a partitioned (100 files) parquet
clickbench_extended: ClickBench \"inspired\" queries against a single parquet (DataFusion specific)
Expand Down Expand Up @@ -175,6 +176,10 @@ main() {
# same data as for tpch
data_tpch "1"
;;
sort_tpch)
# same data as for tpch
data_tpch "1"
;;
*)
echo "Error: unknown benchmark '$BENCHMARK' for data generation"
usage
Expand Down Expand Up @@ -252,6 +257,9 @@ main() {
external_aggr)
run_external_aggr
;;
sort_tpch)
run_sort_tpch
;;
*)
echo "Error: unknown benchmark '$BENCHMARK' for run"
usage
Expand Down Expand Up @@ -549,6 +557,16 @@ run_external_aggr() {
$CARGO_COMMAND --bin external_aggr -- benchmark --partitions 4 --iterations 5 --path "${TPCH_DIR}" -o "${RESULTS_FILE}"
}

# Runs the sort integration benchmark
run_sort_tpch() {
TPCH_DIR="${DATA_DIR}/tpch_sf1"
RESULTS_FILE="${RESULTS_DIR}/sort_tpch.json"
echo "RESULTS_FILE: ${RESULTS_FILE}"
echo "Running sort tpch benchmark..."

$CARGO_COMMAND --bin dfbench -- sort-tpch --iterations 5 --path "${TPCH_DIR}" -o "${RESULTS_FILE}"
}


compare_benchmarks() {
BASE_RESULTS_DIR="${SCRIPT_DIR}/results"
Expand Down
4 changes: 3 additions & 1 deletion benchmarks/src/bin/dfbench.rs
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,7 @@ static ALLOC: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc;
#[global_allocator]
static ALLOC: mimalloc::MiMalloc = mimalloc::MiMalloc;

use datafusion_benchmarks::{clickbench, imdb, parquet_filter, sort, tpch};
use datafusion_benchmarks::{clickbench, imdb, parquet_filter, sort, sort_tpch, tpch};

#[derive(Debug, StructOpt)]
#[structopt(about = "benchmark command")]
Expand All @@ -43,6 +43,7 @@ enum Options {
Clickbench(clickbench::RunOpt),
ParquetFilter(parquet_filter::RunOpt),
Sort(sort::RunOpt),
SortTpch(sort_tpch::RunOpt),
Imdb(imdb::RunOpt),
}

Expand All @@ -57,6 +58,7 @@ pub async fn main() -> Result<()> {
Options::Clickbench(opt) => opt.run().await,
Options::ParquetFilter(opt) => opt.run().await,
Options::Sort(opt) => opt.run().await,
Options::SortTpch(opt) => opt.run().await,
Options::Imdb(opt) => opt.run().await,
}
}
1 change: 1 addition & 0 deletions benchmarks/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -20,5 +20,6 @@ pub mod clickbench;
pub mod imdb;
pub mod parquet_filter;
pub mod sort;
pub mod sort_tpch;
pub mod tpch;
pub mod util;
Loading

0 comments on commit 98a2828

Please sign in to comment.