Evaluating Performance of Dask June Release vs 2022.9.1 vs 2022.10.0+arrow nightly #14
Thanks for putting this together @hayesgb! This is a really nice way to visualize compute time/memory performance. It's also encouraging to see the impact of everyone's recent development efforts. A couple of things come to mind:
I've pushed up the
I ran some of the queries that would be impacted by using pyarrow strings (I created a dataset where id1, id2, and id3 are pyarrow str), and used p2p in some cases too. Here is the comparison with the previous runs that @hayesgb reported. Note: I used arrow nightly. It turns out that q1 and q2 don't benefit from having pyarrow str, but in all the other cases we see an improvement.
I wanted to compare the performance of two releases of Dask, before and after the H2O benchmark work started. Thought I'd share the results here.
The two releases being compared are 2022.6.0 vs 2022.9.1 on the 50GB H2O Parquet Dataset. A few notes:
KilledWorker Exceptions, but succeed in the Sept release with the shuffle-based GroupByAgg, which I leveraged in Queries 3-7.
GroupBy Median is not yet implemented (Work is in flight)
categorical dtypes
c) overall improvements in distributed
shuffle based GroupBy, we still see improvements.
cc: @jrbourbeau @rjzamora @ian-r-rose @ncclementi @phobson
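As a rough illustration of how a per-query time/memory comparison like this can be collected, here is a minimal sketch using only the standard library. It is not the harness used for the numbers in this issue: `measure` and `toy_query` are hypothetical names, and `tracemalloc` only sees Python allocations in the local process, so it would not capture worker memory on a distributed cluster.

```python
import time
import tracemalloc


def measure(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
    finally:
        # Always record the timing and stop tracing, even if fn raises.
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak


def toy_query(n):
    # Placeholder workload standing in for a benchmark query.
    return sum(i * i for i in range(n))


result, elapsed, peak = measure(toy_query, 100_000)
print(f"elapsed: {elapsed:.4f}s, peak traced memory: {peak} bytes")
```

Running the same function under each installed release and recording the pairs side by side gives the raw numbers for a comparison of this kind.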