Evaluating Performance of Dask June Release vs 2022.9.1 vs 2022.10.0+arrow nightly #14

hayesgb · 2022-09-22T15:49:04Z

I wanted to compare the performance of two releases of Dask, before and after the H2O benchmark work started. Thought I'd share the results here.

The two releases being compared are 2022.6.0 and 2022.9.1, run against the 50GB H2O Parquet dataset. A few notes:

The queries are written from the perspective of a naive user, so the advantage granted by column projection shows up clearly, particularly in Queries 1 & 2.
Queries 3 and 8 fail for me on the June release with KilledWorker exceptions, but succeed on the September release, where I used the shuffle-based GroupBy aggregation in Queries 3-7.
Query 6 can't be run yet because GroupBy median is not yet implemented (work is in flight).
The September release benefits from: a) AMM (Active Memory Manager) being turned on (a PR to turn it on by default is open), b) learnings regarding the bad behavior of categorical dtypes, and c) overall improvements in distributed.
Note that even though column projection is not implemented in Queries 8 & 9, and they do not use the shuffle-based GroupBy, we still see improvements.
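
For readers unfamiliar with the term, the idea behind a shuffle-based groupby aggregation can be sketched in plain Python. This is only a toy illustration, not Dask's implementation: the helper names (`hash_partition`, `shuffle_groupby_sum`) and the tiny in-memory "partitions" are hypothetical, and the column names `id1` / `v1` are borrowed from the H2O dataset purely for flavor. The point is that rows are first routed by key so that each output partition owns a disjoint set of groups, and each partition is then aggregated independently instead of funneling every partial result into one final task.

```python
from collections import defaultdict


def hash_partition(rows, key, npartitions):
    """Route each row to an output bucket chosen by hashing its group key."""
    buckets = [[] for _ in range(npartitions)]
    for row in rows:
        buckets[hash(row[key]) % npartitions].append(row)
    return buckets


def shuffle_groupby_sum(partitions, key, value, npartitions):
    """Toy shuffle-based groupby-sum over a list of row partitions."""
    # Step 1 (shuffle): move rows so all rows for a given key
    # end up in the same output partition.
    shuffled = [[] for _ in range(npartitions)]
    for part in partitions:
        for i, bucket in enumerate(hash_partition(part, key, npartitions)):
            shuffled[i].extend(bucket)

    # Step 2 (aggregate): each output partition is reduced on its own,
    # because no group is split across partitions any more.
    results = []
    for part in shuffled:
        sums = defaultdict(int)
        for row in part:
            sums[row[key]] += row[value]
        results.append(dict(sums))
    return results


# Two hypothetical input partitions; "id001" appears in both.
partitions = [
    [{"id1": "id001", "v1": 1}, {"id1": "id002", "v1": 2}],
    [{"id1": "id001", "v1": 3}, {"id1": "id003", "v1": 4}],
]
result = shuffle_groupby_sum(partitions, "id1", "v1", npartitions=2)

# Combine the per-partition results only to inspect them.
merged = {}
for part in result:
    merged.update(part)
```

Because every group lives in exactly one output partition, the result stays partitioned and no single worker has to hold all groups at once, which is the property that matters for high-cardinality groupbys like the ones in Queries 3-7.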

cc: @jrbourbeau @rjzamora @ian-r-rose @ncclementi @phobson


jrbourbeau · 2022-09-23T17:59:14Z

Thanks for putting this together @hayesgb! This is a really nice way to visualize compute time/memory performance. It's also encouraging to see the impact of everyone's recent development efforts.

A couple of things come to mind:

Could you share the script / notebook you used to generate these plots?
Now that we can use p2p shuffling with groupby aggregations (when using the nightly pyarrow release) I'm curious what impact that has on performance for these queries (specifically q3)

hayesgb · 2022-09-26T14:38:18Z

I've pushed up the compare_june_sept_release branch. I'm currently working on comparing the p2p implementation with string[pyarrow] dtypes.

ncclementi · 2022-10-31T19:39:48Z

I ran some of the queries that would be impacted by using pyarrow strings (I created a dataset where id1, id2, and id3 are pyarrow strings), and used p2p in some cases too. Here is the comparison with the previous runs that @hayesgb reported.

Note: I used the arrow nightly.

It turns out that q1 and q2 don't benefit from pyarrow strings, but in all the other cases we see an improvement.
That being said, I noticed that in q3 it took a long time for the last task to complete.

hayesgb changed the title ~~Evaluating Performance of Dask June Release Vs Today~~ Evaluating Performance of Dask June Release vs 2022.9.1 Sep 22, 2022

ncclementi changed the title ~~Evaluating Performance of Dask June Release vs 2022.9.1~~ Evaluating Performance of Dask June Release vs 2022.9.1 vs 2022.10.0+arrow nighlty Oct 31, 2022

ncclementi mentioned this issue Oct 31, 2022

Investigate string[pyarrow] performance on h2o groupby queries at 50 GB #18

Closed

3 tasks

This was referenced Jan 8, 2023

Test with new pyarrow dtypes and parquet improvements coiled/benchmarks#651

Open

Should add benchamrk option for p2p in test_q3 and test_q7 of h2o ? coiled/benchmarks#660

Open
