-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add nn_descent #325
base: main
Are you sure you want to change the base?
add nn_descent #325
Conversation
After discussing with @flying-sheep we decided to add |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, why not use string templating for kernels?
Something like
dist_kernel = kernel_code_inner = r"""
extern "C" __global__
void computeDistances_{NAME}(const float* pca,
float* out,
const unsigned int* pairs,
const long long int n_samples,
const long long int n_pcs,
const long long int n_neighbors)
{
long long int i1 = blockDim.x * blockIdx.x + threadIdx.x;
if(i1 >= n_samples){
return;
}
{PREPARE}
for (long long int j = 0; j < n_neighbors; j++){
{OUTER_LOOP}
for (long long int d = 0; d < n_pcs; d++) {
{INNER_LOOP}
}
out[i1 * n_neighbors + j] = {DIST_OP}
}
}
"""
and then replacing the variables as needed?
dist_kernel.format(name="cosine", ...)
This'd make adding new kernels hopefully less work going forward and also make things a bit easier to read, I think.
|
||
kernel_code = r""" | ||
extern "C" __global__ | ||
void computeDistances(const float* pca, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this only work with pca? should it be called something else? it looks like just normal euclidean distance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(this applies everywhere about argument names being pca etc.)
neighbors = cp.asarray(neighbors) | ||
distances = cp.asarray(distances) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just out of curiosity, what are these if not cupy arrays?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raft device arrays. You cant slice them or anything they are very bare bones
if metric == "euclidean": | ||
metric_to_use = "sqeuclidean" | ||
else: | ||
metric_to_use = metric | ||
idxparams = nn_descent.IndexParams(graph_degree=k, metric=metric_to_use) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if metric == "euclidean": | |
metric_to_use = "sqeuclidean" | |
else: | |
metric_to_use = metric | |
idxparams = nn_descent.IndexParams(graph_degree=k, metric=metric_to_use) | |
idxparams = nn_descent.IndexParams(graph_degree=k, metric="sqeuclidean" if metric == "euclidean" else metric) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or why not just
if metric == "euclidean": | |
metric_to_use = "sqeuclidean" | |
else: | |
metric_to_use = metric | |
idxparams = nn_descent.IndexParams(graph_degree=k, metric=metric_to_use) | |
if metric == "euclidean": | |
metric = "sqeuclidean" | |
idxparams = nn_descent.IndexParams(graph_degree=k, metric=metric) |
and then don't change the others?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I later need to know if the metric is sqeuclidean
or euclidean
to do the sqrt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ilan’s first suggestion still applies
) | ||
if metric == "euclidean": | ||
distances = cp.sqrt(distances) | ||
if metric in ("cosine", "euclidean", "sqeuclidean"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't be sqeuclidean
?
if metric in ("cosine", "euclidean", "sqeuclidean"): | |
if metric in ("cosine", "euclidean"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confused logic? How is the possible? see above?
|
||
float sum_i1 = 0.0f; | ||
for (long long int d = 0; d < n_pcs; d++) { | ||
sum_i1 += pca[i1 * n_pcs + d] * pca[i1 * n_pcs + d]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is https://docs.nvidia.com/cuda/cuda-math-api/cuda_math_api/group__CUDA__MATH__SINGLE.html?highlight=pow#_CPPv44powfff not feasible? Seems like two accesses to the same thing is not great, but would be very curious to find out if it actually is fine (i.e., how cheap this is).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same thing below re: data access.
Co-authored-by: Ilan Gold <[email protected]>
Co-authored-by: Ilan Gold <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good, but you should document that inner_product
isn’t an actual distance metric.
Talking about distances
and metrics could lead users into thinking that it is one unless you add a warning about it.
const unsigned int* pairs, | ||
const long long int n_samples, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const unsigned int* pairs, | |
const long long int n_samples, | |
const std::uint32_t* pairs, | |
const std::uint64_t n_samples, |
and so on
} | ||
""" | ||
|
||
# Create the RawKernel object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove these comments, they are obvious. As said: Comments should contain the “why”, the code should contain the “what”. ... = cp.RawKernel(...)
already says “create the RawKernel object” loudly.
@@ -21,7 +21,7 @@ | |||
from anndata import AnnData | |||
|
|||
AnyRandom = None | int | np.random.RandomState | |||
_Alogithms = Literal["brute", "ivfflat", "ivfpq", "cagra"] | |||
_Alogithms = Literal[_Alogithms_bbknn, Literal["nn_descent"]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_Alogithms = Literal[_Alogithms_bbknn, Literal["nn_descent"]] | |
_Algorithms = _Algorithms_bbknn | Literal["nn_descent"] |
or Union
if you have to
if metric == "euclidean": | ||
metric_to_use = "sqeuclidean" | ||
else: | ||
metric_to_use = metric | ||
idxparams = nn_descent.IndexParams(graph_degree=k, metric=metric_to_use) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ilan’s first suggestion still applies
add_self_neighbors = cp.arange(X.shape[0], dtype=cp.uint32) | ||
neighbors = cp.concatenate( | ||
(add_self_neighbors[:, None], neighbors[:, :-1]), axis=1 | ||
) | ||
add_self_distances = cp.zeros((X.shape[0], 1), dtype=cp.float32) | ||
distances = cp.concatenate((add_self_distances, distances[:, :-1]), axis=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment here explaining why you add self.
A valid comment would e.g. be something like
API xyz expects the kNN indices to contain a self column. The real distance metrics don‘t add this column since all self-distances are 0 in distance metrics, but
inner_product
, which isn’t an actual distance metric as it has non-0 values in the self column, does add it.
Co-authored-by: Ilan Gold <[email protected]>
This PR will add
cuvs.neighbors.nn_descent
to rsc. Howevernn_descent
doesn't add self. nn_descent will be supported incuvs 25.02