Note that bigger batches == better GPU performance. One of the biggest performance pitfalls when working with GPUs is not providing enough data per operation. Giving a GPU twice as much data often results in the operation completing in less than twice the time; conversely, giving it half as much data often takes far more than half the original time. Could splitting the batch help in some situations? Maybe. But we would have to run a series of kernels and temporarily allocate double the batch memory to perform the split (which itself could trigger the OOM), and it would make the query run slower, not only because the split itself costs time but because it makes the downstream processing less efficient. It would be good to understand the scenarios where we think this is going to help, and to see if doing something upstream to control the batch size makes more sense than trying to put a workaround here (e.g., are we reading too large an input batch size?).
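For the "control it upstream" option, here is a minimal sketch of what that could look like, assuming the plugin's `spark.rapids.sql.batchSizeBytes` and `spark.rapids.sql.reader.batchSizeBytes` configs; treat the exact key names and values as assumptions and check them against your spark-rapids version:

```scala
// A minimal sketch of controlling batch size upstream rather than splitting
// batches mid-plan. The spark.rapids.sql.* keys below are assumptions based
// on the plugin's configuration docs; verify them for your version.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("batch-size-tuning-sketch")
  // Target size for batches the plugin produces when coalescing data
  // (e.g. via GpuCoalesceBatches).
  .config("spark.rapids.sql.batchSizeBytes", (512L << 20).toString)
  // Soft limit on batches produced by the file readers, so an oversized
  // input batch never reaches downstream operators in the first place.
  .config("spark.rapids.sql.reader.batchSizeBytes", (256L << 20).toString)
  .getOrCreate()
```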
Yes, that would be possible. The main issue would be in where to place them. Right now we place `GpuCoalesceBatches` in locations where we either know we will require a single batch, or where we think we might output small batches and combining them into a bigger batch would make processing more efficient. So we would also need to identify locations where splitting a batch into smaller pieces is worth the overhead. For `explode`, `posexplode`, and `ExpandExec` (which is used with `cube`) we already have special cases to avoid outputting too large a batch. `join` is horribly difficult because we practically have to do the join to have any idea what the output size is going to be, so to do this correctly we would probably want changes from cudf to help here. But in the short term we could look at putting a splitter in front of the join in some cases. In theory, though, any operator that does a project-like operation could add many more columns and make the output data much, much larger. For me, then, it is a question of use cases. Do we have a specific use case where we think this would help a query run when it cannot run today? If so, what is the impact on performance of doing this?
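To make the pre-join splitter idea concrete, here is a hypothetical sketch (not the plugin's actual code) of how one might pick split points: estimate the join's output size per stream-side row and cut the batch so each piece stays under a target size. `estimatedOutputBytesPerRow` and `targetBatchBytes` are assumed inputs, and the actual cutting would go through something like cudf's `Table.contiguousSplit`:

```scala
// Hypothetical helper, not spark-rapids code: choose row indices at which to
// split a stream-side batch before a join, given a (rough) per-row estimate
// of the join's output size. The caller would pass these indices to a batch
// splitting routine such as cudf's Table.contiguousSplit.
def splitIndices(numRows: Int,
                 estimatedOutputBytesPerRow: Long,
                 targetBatchBytes: Long): Seq[Int] = {
  require(numRows > 0 && estimatedOutputBytesPerRow > 0 && targetBatchBytes > 0)
  val rowsPerPiece = math.max(1L, targetBatchBytes / estimatedOutputBytesPerRow)
  // Cut points strictly inside (0, numRows); an empty result means no split.
  (rowsPerPiece until numRows.toLong by rowsPerPiece).map(_.toInt)
}

// Example: a 1,000,000-row batch where each input row is expected to produce
// ~64 bytes of join output, with a 16 MiB target per output piece:
// splitIndices(1000000, 64L, 16L << 20) == Seq(262144, 524288, 786432)
```

The hard part, as noted above, is the estimate itself: without doing (at least part of) the join, any per-row output estimate is a guess, which is why help from cudf would be needed to do this correctly.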