Note that bigger batches == better GPU performance. One of the biggest performance pitfalls when working with GPUs is not providing enough data per operation. Giving a GPU twice as much data often results in the operation completing in less than twice the time; conversely, giving it half as much data often takes far more than half the original time. Could splitting the batch help in some situations? Maybe. But we would have to run a series of kernels and temporarily allocate double the batch memory to perform the split (which itself could trigger the OOM), and it would make the query run slower, not only because the split itself costs time but because it makes the downstream processing less efficient. It would be good to understand the scenarios where we think this is going to help, and to see if doing something upstream to control the batch size makes more sense than trying to put a workaround here (e.g., are we reading too large an input batch size?).
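For the "control it upstream" option, here is a minimal sketch of what that could look like, assuming the plugin's `spark.rapids.sql.batchSizeBytes` and `spark.rapids.sql.reader.batchSizeBytes` configs; treat the exact key names and values as assumptions and check them against your spark-rapids version:

```scala
// A minimal sketch of controlling batch size upstream rather than splitting
// batches mid-plan. The spark.rapids.sql.* keys below are assumptions based
// on the plugin's configuration docs; verify them for your version.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("batch-size-tuning-sketch")
  // Target size for batches the plugin produces when coalescing data
  // (e.g. via GpuCoalesceBatches).
  .config("spark.rapids.sql.batchSizeBytes", (512L << 20).toString)
  // Soft limit on batches produced by the file readers, so an oversized
  // input batch never reaches downstream operators in the first place.
  .config("spark.rapids.sql.reader.batchSizeBytes", (256L << 20).toString)
  .getOrCreate()
```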
Yes, that would be possible. The main issue would be in where to place them. Right now we place `GpuCoalesceBatches` in locations where we either know we will require a single batch, or where we think we might output small batches and combining them into a bigger batch would make processing more efficient. So we would also need to identify locations where splitting a batch into smaller pieces is worth the overhead. For `explode`, `posexplode`, and `ExpandExec` (which is used with `cube`) we already have special cases to avoid outputting too large a batch. `join` is horribly difficult because we practically have to do the join to have any idea what the output size is going to be, so to do this correctly we would probably want changes from cudf to help here. But in the short term we could look at putting a splitter in front of the join in some cases. In theory, though, any operator that does a project-like operation could add many more columns and make the output data much, much larger. For me, then, it is a question of use cases. Do we have a specific use case where we think this would help a query run when it cannot run today? If so, what is the impact on performance of doing this?
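To make the pre-join splitter idea concrete, here is a hypothetical sketch (not the plugin's actual code) of how one might pick split points: estimate the join's output size per stream-side row and cut the batch so each piece stays under a target size. `estimatedOutputBytesPerRow` and `targetBatchBytes` are assumed inputs, and the actual cutting would go through something like cudf's `Table.contiguousSplit`:

```scala
// Hypothetical helper, not spark-rapids code: choose row indices at which to
// split a stream-side batch before a join, given a (rough) per-row estimate
// of the join's output size. The caller would pass these indices to a batch
// splitting routine such as cudf's Table.contiguousSplit.
def splitIndices(numRows: Int,
                 estimatedOutputBytesPerRow: Long,
                 targetBatchBytes: Long): Seq[Int] = {
  require(numRows > 0 && estimatedOutputBytesPerRow > 0 && targetBatchBytes > 0)
  val rowsPerPiece = math.max(1L, targetBatchBytes / estimatedOutputBytesPerRow)
  // Cut points strictly inside (0, numRows); an empty result means no split.
  (rowsPerPiece until numRows.toLong by rowsPerPiece).map(_.toInt)
}

// Example: a 1,000,000-row batch where each input row is expected to produce
// ~64 bytes of join output, with a 16 MiB target per output piece:
// splitIndices(1000000, 64L, 16L << 20) == Seq(262144, 524288, 786432)
```

The hard part, as noted above, is the estimate itself: without doing (at least part of) the join, any per-row output estimate is a guess, which is why help from cudf would be needed to do this correctly.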