Evaluating Performance of Dask June Release vs 2022.9.1 vs 2022.10.0+arrow nightly #14

Open
hayesgb opened this issue Sep 22, 2022 · 3 comments

Comments

@hayesgb

hayesgb commented Sep 22, 2022

I wanted to compare the performance of two releases of Dask, before and after the H2O benchmark work started. Thought I'd share the results here.

The two releases compared are 2022.6.0 and 2022.9.1, run on the 50GB H2O Parquet dataset. A few notes:

  • The queries are written from the perspective of a naive user, so the results show the advantage granted by column projection, which is most visible in Queries 1 & 2.
  • Queries 3 and 8 fail for me on the June release with KilledWorker exceptions, but succeed on the September release with the shuffle-based GroupByAgg, which I used in Queries 3-7.
  • Query 6 can't be run yet because GroupBy median is not yet implemented (work is in flight).
  • The September release benefits from: a) AMM (the Active Memory Manager) being turned on (a PR to enable it by default is open), b) learnings about the bad behavior of categorical dtypes, and c) overall improvements in distributed.
  • Note that even though column projection is not implemented in Queries 8 & 9, and we do not use the shuffle-based GroupBy there, we still see improvements.

[Image: H2O_50GB_June_vs_Sept]

cc: @jrbourbeau @rjzamora @ian-r-rose @ncclementi @phobson

@hayesgb changed the title from "Evaluating Performance of Dask June Release Vs Today" to "Evaluating Performance of Dask June Release vs 2022.9.1" on Sep 22, 2022
@jrbourbeau
Member

Thanks for putting this together @hayesgb! This is a really nice way to visualize compute time/memory performance. It's also encouraging to see the impact of everyone's recent development efforts.

A couple of things come to mind:

  1. Could you share the script / notebook you used to generate these plots?
  2. Now that we can use p2p shuffling with groupby aggregations (when using the nightly pyarrow release), I'm curious what impact that has on performance for these queries (specifically q3).

@hayesgb
Author

hayesgb commented Sep 26, 2022

I've pushed up the compare_june_sept_release branch. I'm currently working on comparing the p2p implementation with string[pyarrow] dtypes.

@ncclementi
Contributor

ncclementi commented Oct 31, 2022

I ran some of the queries that would be affected by pyarrow strings (I created a dataset where id1, id2, and id3 are pyarrow strings), and used p2p in some cases too. Here is the comparison with the previous runs that @hayesgb reported.

Note: I used the pyarrow nightly build.

It turns out that q1 and q2 don't benefit from pyarrow strings, but in all the other cases we see an improvement.
That said, I noticed that in q3 it took a long time for the last task to complete.

[Image: Screen Shot 2022-10-31 at 3 28 11 PM]

@ncclementi changed the title from "Evaluating Performance of Dask June Release vs 2022.9.1" to "Evaluating Performance of Dask June Release vs 2022.9.1 vs 2022.10.0+arrow nightly" on Oct 31, 2022