What is your question?
This is all about shuffle. By default, Spark compresses all shuffle data before writing it out to disk, although this is configurable. When using the default shuffle implementation, we still use the CPU compression algorithm. The goal is to match that functionality when using the UCX-based shuffle plugin, where we want to avoid going back to the CPU if possible. There are two places where this can help us.
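For concreteness, here is a minimal sketch of how the relevant settings look in a SparkConf. The first two keys are stock Spark settings; the RAPIDS-specific codec key and the versioned shuffle manager class name are assumptions based on the plugin's naming conventions, so check your release's docs for the exact values.

```scala
import org.apache.spark.SparkConf

// Stock Spark: shuffle data is compressed on the CPU before it is written out.
val conf = new SparkConf()
  .set("spark.shuffle.compress", "true")    // default is already true
  .set("spark.io.compression.codec", "lz4") // CPU codec applied to shuffle blocks

// UCX-based RAPIDS shuffle (key and class name assumed; the shuffle manager
// package is versioned per Spark release, so adjust to match yours):
conf
  .set("spark.shuffle.manager",
    "com.nvidia.spark.rapids.spark330.RapidsShuffleManager")
  .set("spark.rapids.shuffle.compression.codec", "lz4") // compress on the GPU,
                                                        // avoiding a CPU round trip
```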
Because of this complexity, it is hard to predict the impact of compression on a given query/hardware setup, but we have seen improvements for queries using UCX that spill to disk a lot. We are still profiling and looking into more performance improvements, but the numbers are promising enough to try to put it into our next release. We are still evaluating whether it will be off by default.
Closing as answered.