Replies: 4 comments
-
I have the exact same question as Yuan. It looks like "4" is a magic number that was chosen in the initial commit of the repo. It would be really helpful if anyone could share the reasoning behind that decision.
-
Some context: #5637
-
It's for prefetching and instruction-level parallelism. In general, the grouping doesn't hurt when the pipeline or memory bandwidth is already saturated, so there is little need to decrease it. If the pipeline or memory bandwidth is not saturated, there may be room to increase it further, but we would need more data to tell. 4 should be enough for most platforms, though. One thing that makes it harder to make configurable is that the value must be decided at compile time, so any configuration here would need to take the form of a macro or template parameter.
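To illustrate the compile-time constraint, here is a minimal sketch of a grouped probe whose group size is a template parameter. It is not the Velox implementation: `probeSum`, `kStep`, `PROBE_PREFETCH`, and the toy one-slot-per-bucket table are hypothetical names chosen for this example, and the prefetch uses the GCC/Clang builtin `__builtin_prefetch` (a no-op on other compilers).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Prefetch hint: GCC/Clang builtin, compiled out elsewhere (assumption for portability).
#if defined(__GNUC__) || defined(__clang__)
#define PROBE_PREFETCH(addr) __builtin_prefetch(addr)
#else
#define PROBE_PREFETCH(addr) ((void)0)
#endif

// Hypothetical toy probe: sums table[hash & mask] over all hashes.
// kStep is the group size; it is a template parameter because the
// inner loops only unroll and overlap well when the bound is a constant.
template <int kStep>
std::uint64_t probeSum(
    const std::vector<std::uint64_t>& table,
    const std::vector<std::uint64_t>& hashes) {
  static_assert(kStep >= 1, "group size must be positive");
  // The toy table must have a power-of-two size so that masking works.
  assert(!table.empty() && (table.size() & (table.size() - 1)) == 0);

  const std::uint64_t mask = table.size() - 1;
  const std::uint64_t* slots = table.data();
  std::uint64_t sum = 0;
  std::size_t i = 0;

  // Full groups: first issue kStep independent prefetches, then do the loads.
  for (; i + kStep <= hashes.size(); i += kStep) {
    std::size_t index[kStep];
    for (int j = 0; j < kStep; ++j) {
      index[j] = static_cast<std::size_t>(hashes[i + j] & mask);
      PROBE_PREFETCH(slots + index[j]);
    }
    for (int j = 0; j < kStep; ++j) {
      sum += slots[index[j]];
    }
  }

  // Tail: remaining rows that do not fill a whole group.
  for (; i < hashes.size(); ++i) {
    sum += slots[hashes[i] & mask];
  }
  return sum;
}
```

Callers pick the group size at compile time, for example `probeSum<4>(table, hashes)` versus `probeSum<1>(table, hashes)`. Both return the same result; only the amount of overlapped memory access differs. Choosing the value at runtime would require instantiating each candidate and dispatching between them.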
-
Thank you!
-
Hi,
The hashagg/hashjoin probe always works in steps of "4":
https://github.com/facebookincubator/velox/blob/main/velox/exec/HashTable.cpp#L440-L466
https://github.com/facebookincubator/velox/blob/main/velox/exec/HashTable.cpp#L578-L598
I think this trick is there to enable cache-line prefetching and auto-vectorization to improve performance. Was the step count of "4" picked based on benchmark results on specific hardware? Do you think it makes sense to make this configurable? For example, on a small instance, a step of 1 may be better.
Thanks, -yuan