Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Fix a bug in NN descent #1869

Merged

Conversation

enp1s0
Copy link
Member

@enp1s0 enp1s0 commented Oct 3, 2023

I got errors when compiling a program using raft NN-descent. This PR fixes the bug.

Error

e.g.

/home/.../include/raft/neighbors/detail/nn_descent.cuh(1158): error: invalid narrowing conversion from "unsigned long" to "int"
      h_rev_graph_old_{static_cast<size_t>(nrow_ * NUM_SAMPLES)},
  • nvcc
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
  • gcc
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
  • thrust : 2.2 (This is the cause of this error [detail])

Change

Use Class(...) instead of Class{...}.

Cause

The NN-descent code calls constructors of thrust::host_vector as shown below:

graph_host_buffer_{static_cast<size_t>(nrow_ * DEGREE_ON_DEVICE)},

However, this constructor is regarded as a list initialization.
This is the same as the following code outputting 1 instead of 2.

#include <iostream>
#include <vector>

int main() {
  std::vector<float> list{2};

  std::cout << list.size() << std::endl;
}

detail

@enp1s0 enp1s0 requested a review from a team as a code owner October 3, 2023 07:46
@copy-pr-bot
Copy link

copy-pr-bot bot commented Oct 3, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the cpp label Oct 3, 2023
@enp1s0 enp1s0 added bug Something isn't working non-breaking Non-breaking change labels Oct 3, 2023
@enp1s0 enp1s0 self-assigned this Oct 3, 2023
@enp1s0
Copy link
Member Author

enp1s0 commented Oct 3, 2023

Thank you, @divyegala and @RayWang96, for porting NN-descent to raft. Is this modification correct?

@RayWang96
Copy link

I believe you are right. And I'm curious why it can pass the unit test with this bug.

@cjnolet
Copy link
Member

cjnolet commented Oct 3, 2023

/ok to test

@divyegala
Copy link
Member

@enp1s0 thanks for this PR. Could you post your NVCC and GCC versions in the description of the PR for hindsight? It seems like RAFT's CI NVCC+GCC are able to compile fine, so we should make note of what causes this compilation failure

Copy link
Member

@divyegala divyegala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raft::device_* variants of mdarray do not have an initializer-list based constructor, so I prefer to keep brace initialization where possible

h_graph_new{nrow * num_samples},
h_list_sizes_new{nrow},
h_graph_old{nrow * num_samples},
h_dists(raft::make_host_matrix<DistData_t, size_t, raft::row_major>(nrow, node_degree)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can keep brace initializer

Comment on lines 1146 to 1152
d_data_(raft::make_device_matrix<__half, Index_t, raft::row_major>(
res, nrow_, build_config.dataset_dim)),
l2_norms_(raft::make_device_vector<DistData_t, Index_t>(res, nrow_)),
graph_buffer_(
raft::make_device_matrix<ID_t, Index_t, raft::row_major>(res, nrow_, DEGREE_ON_DEVICE)),
dists_buffer_(
raft::make_device_matrix<DistData_t, Index_t, raft::row_major>(res, nrow_, DEGREE_ON_DEVICE)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

raft::make_device_matrix<DistData_t, Index_t, raft::row_major>(res, nrow_, DEGREE_ON_DEVICE)),
graph_host_buffer_(nrow_ * DEGREE_ON_DEVICE),
dists_host_buffer_(nrow_ * DEGREE_ON_DEVICE),
d_locks_(raft::make_device_vector<int, Index_t>(res, nrow_)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

Comment on lines 1159 to 1160
d_list_sizes_new_(raft::make_device_vector<int2, Index_t>(res, nrow_)),
d_list_sizes_old_(raft::make_device_vector<int2, Index_t>(res, nrow_))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

@enp1s0
Copy link
Member Author

enp1s0 commented Oct 4, 2023

Thank you, @divyegala, @RayWang96, and @cjnolet, for reviewing the PR. I have modified the code based on the comments from @divyegala.

The compilation of the unit test passes because it uses int64_t as the index type, while I use int32_t in my program. The error is invalid narrowing conversion from "unsigned long" to "int" and "int" here is the index type, int32_t. In the case of the unit test, no narrowing conversion occurs because the index type is int64_t. But I'm not sure why the unit test works well while the memory is not allocated correctly. Let me look into this a little more. And I think this PR can be merged without waiting for the investigation.

@cjnolet
Copy link
Member

cjnolet commented Oct 4, 2023

Thanks @enp1s0. Release 23.10 code freeze has passed, though, and the branch is being prepped for release. Can you base your branch on the 23.12? I'm in the process of fixing some merge conflicts to our forward merger, but we can get your changes merged shortly after.

@cjnolet cjnolet changed the base branch from branch-23.10 to branch-23.12 October 4, 2023 04:44
@cjnolet cjnolet requested review from a team as code owners October 4, 2023 04:44
@divyegala
Copy link
Member

@enp1s0 I think it's quite likely that this issue is a result of your nvcc version. NN-Descent hard-codes internal index types to int32 so whether you use int64 or int32 index types for CAGRA it should not matter.

Are you saying that the original code compiles with int64 index type for CAGRA with your nvcc version?

@enp1s0 enp1s0 force-pushed the fix-nn_desccent-constructor-bug branch from 259e5a4 to 5ee1976 Compare October 4, 2023 06:57
@enp1s0
Copy link
Member Author

enp1s0 commented Oct 4, 2023

Thank you, @divyegala, for your comment. I have found that the difference in the thrust versions is causing the compilation error. Raft uses thrust 1.17.2 while my program uses thrust 2.2, which is included in the CUDA 12.2 toolkit. And the list initializer was introduced in thrust 2.1 [release note].

So, this is not a bug. Is it better to close this PR or merge it for the future thrust version update used in raft? (cc: @cjnolet)

@divyegala
Copy link
Member

@enp1s0 thanks for the investigations! We should merge this PR. Can you also add the thrust version to the description of this PR?

@divyegala
Copy link
Member

/ok to test

@cjnolet
Copy link
Member

cjnolet commented Oct 6, 2023

/ok to test

@cjnolet
Copy link
Member

cjnolet commented Oct 6, 2023

/merge

@rapids-bot rapids-bot bot merged commit d9fde97 into rapidsai:branch-23.12 Oct 6, 2023
59 checks passed
@enp1s0 enp1s0 deleted the fix-nn_desccent-constructor-bug branch October 6, 2023 03:50
divyegala pushed a commit to divyegala/raft that referenced this pull request Oct 6, 2023
I got errors when compiling a program using raft NN-descent. This PR fixes the bug.

## Error
e.g.
```
/home/.../include/raft/neighbors/detail/nn_descent.cuh(1158): error: invalid narrowing conversion from "unsigned long" to "int"
      h_rev_graph_old_{static_cast<size_t>(nrow_ * NUM_SAMPLES)},
```

- nvcc

```
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
```

- gcc
```
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
```

- thrust : 2.2 (This is the cause of this error [[detail](rapidsai#1869 (comment))])

# Change
Use `Class(...)` instead of `Class{...}`.

# Cause
The NN-descent code calls constructors of `thrust::host_vector` as shown below:
```cpp
graph_host_buffer_{static_cast<size_t>(nrow_ * DEGREE_ON_DEVICE)},
```
However, this constructor is regarded as a list initialization.
This is the same as the following code outputting 1 instead of 2.
```cpp
#include <iostream>
#include <vector>

int main() {
  std::vector<float> list{2};

  std::cout << list.size() << std::endl;
}
```

[detail](https://en.cppreference.com/w/cpp/language/list_initialization)

Authors:
  - tsuki (https://github.com/enp1s0)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)
  - Divye Gala (https://github.com/divyegala)

URL: rapidsai#1869
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cpp non-breaking Non-breaking change
Projects
Development

Successfully merging this pull request may close these issues.

4 participants