Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add nn_descent #325

Open
wants to merge 8 commits into
base: main
Choose a base branch
from
Open

add nn_descent #325

wants to merge 8 commits into from

Conversation

Intron7
Copy link
Member

@Intron7 Intron7 commented Feb 7, 2025

This PR will add cuvs.neighbors.nn_descent to rsc. However nn_descent doesn't add self. nn_descent will be supported in cuvs 25.02

@Intron7 Intron7 added the run-gpu-ci runs GPU CI label Feb 7, 2025
@Intron7 Intron7 requested a review from ilan-gold February 7, 2025 12:12
@github-actions github-actions bot removed the run-gpu-ci runs GPU CI label Feb 7, 2025
@Intron7 Intron7 added the run-gpu-ci runs GPU CI label Feb 7, 2025
@github-actions github-actions bot removed the run-gpu-ci runs GPU CI label Feb 7, 2025
@Intron7
Copy link
Member Author

Intron7 commented Feb 11, 2025

After discussing with @flying-sheep we decided to add self to nn_descent.

Copy link
Contributor

@ilan-gold ilan-gold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, why not use string templating for kernels?

Something like

dist_kernel = kernel_code_inner = r"""
extern "C" __global__
void computeDistances_{NAME}(const float* pca,
                    float* out,
                    const unsigned int* pairs,
                    const long long int n_samples,
                    const long long int n_pcs,
                    const long long int n_neighbors)
{
    long long int i1 = blockDim.x * blockIdx.x + threadIdx.x;
    if(i1 >= n_samples){
        return;
    }
    {PREPARE}
    for (long long int j = 0; j < n_neighbors; j++){
        {OUTER_LOOP}
        for (long long int d = 0; d < n_pcs; d++) {
            {INNER_LOOP}
        }
        out[i1 * n_neighbors + j] = {DIST_OP}
    }
}
"""

and then replacing the variables as needed?

dist_kernel.format(name="cosine", ...)

This'd make adding new kernels hopefully less work going forward and also make things a bit easier to read, I think.


kernel_code = r"""
extern "C" __global__
void computeDistances(const float* pca,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this only work with pca? should it be called something else? it looks like just normal euclidean distance

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(this applies everywhere about argument names being pca etc.)

Comment on lines +126 to +127
neighbors = cp.asarray(neighbors)
distances = cp.asarray(distances)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just out of curiosity, what are these if not cupy arrays?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raft device arrays. You cant slice them or anything they are very bare bones

src/rapids_singlecell/preprocessing/_neighbors.py Outdated Show resolved Hide resolved
Comment on lines +210 to +214
if metric == "euclidean":
metric_to_use = "sqeuclidean"
else:
metric_to_use = metric
idxparams = nn_descent.IndexParams(graph_degree=k, metric=metric_to_use)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if metric == "euclidean":
metric_to_use = "sqeuclidean"
else:
metric_to_use = metric
idxparams = nn_descent.IndexParams(graph_degree=k, metric=metric_to_use)
idxparams = nn_descent.IndexParams(graph_degree=k, metric="sqeuclidean" if metric == "euclidean" else metric)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or why not just

Suggested change
if metric == "euclidean":
metric_to_use = "sqeuclidean"
else:
metric_to_use = metric
idxparams = nn_descent.IndexParams(graph_degree=k, metric=metric_to_use)
if metric == "euclidean":
metric = "sqeuclidean"
idxparams = nn_descent.IndexParams(graph_degree=k, metric=metric)

and then don't change the others?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I later need to know if the metric is sqeuclidean or euclidean to do the sqrt

Copy link
Member

@flying-sheep flying-sheep Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ilan’s first suggestion still applies

src/rapids_singlecell/preprocessing/_neighbors.py Outdated Show resolved Hide resolved
)
if metric == "euclidean":
distances = cp.sqrt(distances)
if metric in ("cosine", "euclidean", "sqeuclidean"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't be sqeuclidean?

Suggested change
if metric in ("cosine", "euclidean", "sqeuclidean"):
if metric in ("cosine", "euclidean"):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confused logic? How is the possible? see above?

src/rapids_singlecell/preprocessing/_neighbors.py Outdated Show resolved Hide resolved

float sum_i1 = 0.0f;
for (long long int d = 0; d < n_pcs; d++) {
sum_i1 += pca[i1 * n_pcs + d] * pca[i1 * n_pcs + d];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is https://docs.nvidia.com/cuda/cuda-math-api/cuda_math_api/group__CUDA__MATH__SINGLE.html?highlight=pow#_CPPv44powfff not feasible? Seems like two accesses to the same thing is not great, but would be very curious to find out if it actually is fine (i.e., how cheap this is).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing below re: data access.

Copy link
Member

@flying-sheep flying-sheep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good, but you should document that inner_product isn’t an actual distance metric.

Talking about distances and metrics could lead users into thinking that it is one unless you add a warning about it.

Comment on lines +9 to +10
const unsigned int* pairs,
const long long int n_samples,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const unsigned int* pairs,
const long long int n_samples,
const std::uint32_t* pairs,
const std::uint64_t n_samples,

and so on

}
"""

# Create the RawKernel object
Copy link
Member

@flying-sheep flying-sheep Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove these comments, they are obvious. As said: Comments should contain the “why”, the code should contain the “what”. ... = cp.RawKernel(...) already says “create the RawKernel object” loudly.

@@ -21,7 +21,7 @@
from anndata import AnnData

AnyRandom = None | int | np.random.RandomState
_Alogithms = Literal["brute", "ivfflat", "ivfpq", "cagra"]
_Alogithms = Literal[_Alogithms_bbknn, Literal["nn_descent"]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_Alogithms = Literal[_Alogithms_bbknn, Literal["nn_descent"]]
_Algorithms = _Algorithms_bbknn | Literal["nn_descent"]

or Union if you have to

Comment on lines +210 to +214
if metric == "euclidean":
metric_to_use = "sqeuclidean"
else:
metric_to_use = metric
idxparams = nn_descent.IndexParams(graph_degree=k, metric=metric_to_use)
Copy link
Member

@flying-sheep flying-sheep Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ilan’s first suggestion still applies

Comment on lines +236 to +241
add_self_neighbors = cp.arange(X.shape[0], dtype=cp.uint32)
neighbors = cp.concatenate(
(add_self_neighbors[:, None], neighbors[:, :-1]), axis=1
)
add_self_distances = cp.zeros((X.shape[0], 1), dtype=cp.float32)
distances = cp.concatenate((add_self_distances, distances[:, :-1]), axis=1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment here explaining why you add self.

A valid comment would e.g. be something like

API xyz expects the kNN indices to contain a self column. The real distance metrics don‘t add this column since all self-distances are 0 in distance metrics, but inner_product, which isn’t an actual distance metric as it has non-0 values in the self column, does add it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants