Hive splits and multithreading #5552
-
My understanding is that we can provide splits to Velox and the splits will be processed in parallel (limited by the maxDrivers configuration). In the following code snippet, I've created 4 Hive splits and added them to the task. However, Velox treats them as a single split when executing the task, even though the output shows that 4 splits were created. As a result, there is no parallelism. Could you please help me understand if I'm doing something wrong here or if there is a gap in my understanding?
MatrixMultiply signature (code snippet collapsed in the original post)
Output (collapsed in the original post)
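For context, here is a minimal sketch of the pattern described above (the original snippet is collapsed, so this is not it). It assumes a task that was started with maxDrivers > 1 and whose plan contains a single TableScan node, and it slices one file into four byte-range splits. The connector id, file format, HiveConnectorSplit constructor argument order, and header paths are assumptions and may differ across Velox versions.

```cpp
// Sketch only: adds four byte-range splits for one file to a running task.
// Assumes `task` was started with maxDrivers > 1 and `scanNodeId` is the plan
// node id of the TableScan. The HiveConnectorSplit constructor argument order
// shown here (connectorId, filePath, fileFormat, start, length), the connector
// id value, and the header paths are assumptions that may vary by version.
#include <memory>
#include <string>

#include "velox/connectors/hive/HiveConnectorSplit.h"
#include "velox/dwio/common/Options.h"
#include "velox/exec/Task.h"

using namespace facebook::velox;

void addFourSplits(
    const std::shared_ptr<exec::Task>& task,
    const core::PlanNodeId& scanNodeId,
    const std::string& filePath,
    uint64_t fileSize) {
  constexpr int kNumSplits = 4;
  const uint64_t chunk = fileSize / kNumSplits;

  for (int i = 0; i < kNumSplits; ++i) {
    const uint64_t start = i * chunk;
    const uint64_t length = (i == kNumSplits - 1) ? (fileSize - start) : chunk;
    auto hiveSplit = std::make_shared<connector::hive::HiveConnectorSplit>(
        "test-hive" /* connectorId, assumed */,
        filePath,
        dwio::common::FileFormat::DWRF,
        start,
        length);
    task->addSplit(scanNodeId, exec::Split(std::move(hiveSplit)));
  }
  // Signal that no more splits are coming so the scan drivers can finish
  // once the queued splits are consumed.
  task->noMoreSplits(scanNodeId);
}
```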
Replies: 2 comments 1 reply
-
@saifmasood I think Velox does parallelize the table scan operation at the split level. The splits added via Task::addSplit() in the test are actually put into a shared groupSplitsStores in the task, indexed by the table scan plan node id, so all table scan operators can access the added splits. Each table scan operator fetches and processes one split at a time (the table scan operator's getOutput method calls Task::getSplitOrFuture() to do that). I am not sure how you determined that Velox processes all four as one split. There might be a timing-related race where one table scan operator runs very fast and processes all four splits before the second one even starts; you could experiment with more splits.
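To make that race easier to see, here is a standalone model (plain C++, not Velox code) of the pattern described in this reply: splits sit in a shared store and each table scan driver pulls one at a time. With only a few cheap splits, the first worker thread often drains the whole queue before the others start, which can look as if the splits were processed "as one".

```cpp
// Standalone model of the split-store consumption pattern (not Velox code).
#include <iostream>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <vector>

int main() {
  std::queue<int> splits;  // stand-in for the task's shared split store
  for (int i = 0; i < 4; ++i) {
    splits.push(i);
  }
  std::mutex mu;

  // Stand-in for Task::getSplitOrFuture(): hand out one split at a time.
  auto getSplit = [&]() -> std::optional<int> {
    std::lock_guard<std::mutex> lock(mu);
    if (splits.empty()) {
      return std::nullopt;
    }
    int s = splits.front();
    splits.pop();
    return s;
  };

  // Each "driver" loops, pulling and processing one split at a time.
  auto driver = [&](int id) {
    while (auto split = getSplit()) {
      std::lock_guard<std::mutex> lock(mu);
      std::cout << "driver " << id << " processed split " << *split << "\n";
    }
  };

  std::vector<std::thread> drivers;  // "maxDrivers" = 4
  for (int id = 0; id < 4; ++id) {
    drivers.emplace_back(driver, id);
  }
  for (auto& t : drivers) {
    t.join();
  }
  return 0;
}
```

Running this repeatedly often shows driver 0 processing all four splits because the work per split is tiny; with more splits (or heavier splits), the work spreads across drivers.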
-
Note that the table scan works on one row group at a time, and if the file is small and contains only one row group, then only the first split has actual data to process and all the rest are empty. A split doesn't need to start or end at a row group boundary: the table scan operator will continue processing beyond the end of its split until it finishes the row group, and, correspondingly, it will skip a row group if the split starts in the middle of it.
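As a concrete illustration of that rule (plain C++, not Velox code, with hypothetical offsets): a row group is processed by the split whose byte range contains the row group's starting offset, so with a single row group only the first split has real work.

```cpp
// Sketch of the ownership rule described above: a row group is processed by
// the split whose [start, start + length) range contains the row group's
// starting byte offset; other splits skip it even if they overlap its bytes.
#include <cstdint>
#include <iostream>
#include <vector>

struct SplitRange {
  uint64_t start;
  uint64_t length;
};

// Returns the index of the split that will process a row group starting at
// `rowGroupStart`, or -1 if no split covers that offset.
int owningSplit(const std::vector<SplitRange>& splits, uint64_t rowGroupStart) {
  for (size_t i = 0; i < splits.size(); ++i) {
    if (rowGroupStart >= splits[i].start &&
        rowGroupStart < splits[i].start + splits[i].length) {
      return static_cast<int>(i);
    }
  }
  return -1;
}

int main() {
  // A 1000-byte file cut into 4 equal byte ranges, but with a single row
  // group starting at offset 0: only split 0 has data to process, and
  // splits 1-3 are effectively empty, matching the behavior described above.
  std::vector<SplitRange> splits = {{0, 250}, {250, 250}, {500, 250}, {750, 250}};
  std::vector<uint64_t> rowGroupStarts = {0};  // one row group => one busy split

  for (uint64_t rg : rowGroupStarts) {
    std::cout << "row group at offset " << rg << " -> split "
              << owningSplit(splits, rg) << "\n";
  }
  return 0;
}
```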