Add support for large num_items to device_select.cuh
#1422
Comments
cub.bench.select.if.base: signed versus unsigned offset types, [0] Tesla V100-SXM2-32GB [benchmark table not preserved]

Seeing some noticeable performance drops for:

cub.bench.select.flagged.base: signed versus unsigned offset types, [0] Tesla V100-SXM2-32GB [benchmark table not preserved]
Currently blocked by #1454. It turns out there's some performance degradation from simply changing the offset type. Given there's no easy choice for the offset type here, we want to revisit #1454 and come to a conclusion on a broader approach to offset type handling first, before continuing with this endeavour.
We have some tickets potentially related to this in PyTorch, like:
Do you have an ETA for this?
Hey @bhack, this is something we're actively working on. Are there other specific algorithms that you're interested in?
For example:

```cpp
// Usual two-pass CUB pattern: the first call (with a null temp-storage pointer) only
// queries the required temporary storage size; the second call performs the selection.
cub::DeviceSelect::Flagged(nullptr, temp_storage_bytes, counting_itr, itr,
                           out_temp.mutable_data_ptr<int64_t>(), (int*)num_nonzeros.get(), N, stream);
temp_storage = allocator.allocate(temp_storage_bytes);
cub::DeviceSelect::Flagged(temp_storage.get(), temp_storage_bytes, counting_itr, itr,
                           out_temp.mutable_data_ptr<int64_t>(), (int*)num_nonzeros.get(), N, stream);
```
@elstehle We are having another problem related to this with the just released (but already popular) SAM2 model by Meta. Any progress on this?
Thank you for letting us know that this came up again in another, very recent model, @bhack. We understand that this is of great importance to the community. Unfortunately, there's no straightforward solution that would not see significant slowdowns (in some cases 50% performance drops) when moving from 32-bit to 64-bit offset types. We are currently investigating options that mitigate the performance drops of 64-bit offset types. One such option is tracked in #2136.
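One family of mitigations that seems natural here, purely as my own illustration (not necessarily what #2136 proposes, and `dispatch_select_if` is a made-up stand-in for the internal dispatch layer, not an existing CUB entry point), is to pick the offset type per call based on the actual problem size, so inputs below 2^31 items keep the fast 32-bit path:

```cpp
// Illustration only: dispatch on the runtime problem size so that small inputs keep
// 32-bit offsets and only genuinely large inputs pay for 64-bit offsets.
// dispatch_select_if is a hypothetical stand-in, not an existing CUB entry point.
#include <cstdint>
#include <cstdio>
#include <limits>

template <typename OffsetT>
void dispatch_select_if(OffsetT num_items)  // stand-in for the internal dispatch layer
{
  std::printf("dispatching %lld items with a %zu-byte offset type\n",
              static_cast<long long>(num_items), sizeof(OffsetT));
}

void select_if(std::int64_t num_items)
{
  if (num_items <= std::numeric_limits<std::int32_t>::max())
    dispatch_select_if(static_cast<std::int32_t>(num_items));  // fast 32-bit path
  else
    dispatch_select_if(num_items);                             // large-input 64-bit path
}

int main()
{
  select_if(1'000'000);        // would use 32-bit offsets
  select_if(5'000'000'000LL);  // would use 64-bit offsets
}
```

The obvious cost of such a scheme would be compiling (and shipping) kernels for both offset types.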
Hi @elstehle, I noticed this will unblock NVIDIA/cuCollections#576 and rapidsai/cudf#16526.
In theory, yes, we could just make the offset type 64-bit, but that comes with a noticeable performance penalty. So we're trying various ways to mitigate the performance drops that come from using a wider offset type. With a more sophisticated approach, we were able to mitigate this slowdown significantly. We will likely pursue a streaming approach for `DeviceSelect`.
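To make the streaming idea concrete, here is a caller-side sketch of how it can already be emulated today on top of the existing 32-bit API. The helper name `select_flagged_large` is made up, error handling is omitted, and the actual CUB-internal approach will presumably operate below the public API; this only illustrates the chunking idea. Because `DeviceSelect::Flagged` is stable, concatenating the per-chunk outputs matches one large selection.

```cpp
// Sketch: process a >2^31-item input in chunks that fit the 32-bit num_items
// parameter, accumulating the selected count into a 64-bit total.
// Assumptions: d_in/d_flags/d_out support operator+; d_num_selected is device memory.
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>

template <typename InputIt, typename FlagIt, typename OutputIt>
cudaError_t select_flagged_large(InputIt d_in, FlagIt d_flags, OutputIt d_out,
                                 int* d_num_selected,            // per-chunk count (device)
                                 std::int64_t num_items,
                                 std::int64_t& num_selected_total,
                                 cudaStream_t stream)
{
  constexpr std::int64_t max_chunk = std::numeric_limits<int>::max();
  num_selected_total = 0;

  for (std::int64_t offset = 0; offset < num_items; offset += max_chunk)
  {
    const int n = static_cast<int>(std::min(max_chunk, num_items - offset));

    // Usual two-pass CUB pattern, applied per chunk with a 32-bit num_items.
    void* d_temp = nullptr;
    std::size_t temp_bytes = 0;
    cub::DeviceSelect::Flagged(d_temp, temp_bytes, d_in + offset, d_flags + offset,
                               d_out + num_selected_total, d_num_selected, n, stream);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceSelect::Flagged(d_temp, temp_bytes, d_in + offset, d_flags + offset,
                               d_out + num_selected_total, d_num_selected, n, stream);

    // Advance the output position by the number of items selected in this chunk.
    int h_num_selected = 0;
    cudaMemcpyAsync(&h_num_selected, d_num_selected, sizeof(int),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    num_selected_total += h_num_selected;
    cudaFree(d_temp);
  }
  return cudaGetLastError();
}
```

The per-chunk device-to-host copy of the selected count is the main extra synchronization such an emulation adds compared to a single call, which is why it only matters for inputs large enough to need chunking.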
I like the streaming idea. The performance degradation with small inputs is IMO negligible since the overall runtime is no more than one millisecond. Thanks for the great work!
Fixes #576. This PR fixes the large-input `retrieve_all` bug with a method similar to the streaming approach mentioned in NVIDIA/cccl#1422 (comment). To be reverted once the CCCL fix is in place.
Tasks
- `DeviceSelect` #1584
- `DeviceSelect::Unique` #2311
- `num_items` as a template parameter of `DeviceSelect` algorithms #2312
- `DeviceSelect` and `DevicePartition` #2238