
K-NN Classifier #263

Merged: krstopro merged 11 commits into elixir-nx:main from knn-classifier on May 14, 2024

Conversation

@krstopro (Member) commented May 12, 2024

There are several dilemmas I had while implementing this:

  • How to provide the k-NN algorithm and algorithm-specific options? The way it's currently done is either as :algorithm_name or as {:algorithm_name, algorithm_opts} (someone correct me if I'm wrong, but the latter should be the idiomatic Elixir way of specifying both the module and the options to pass to its initialization function). Another way would be to have a separate option for the algorithm name and another for the algorithm-specific options. A rough sketch of the tuple convention follows this list.
  • Literally every k-NN algorithm in Scholar takes num_neighbors as an option. However, I would prefer passing it as a separate option to KNNClassifier instead of nesting it inside the algorithm-specific options. That is, doing Scholar.Neighbors.KNNClassifier.fit(x, y, num_neighbors: 3, num_classes: 2) instead of Scholar.Neighbors.KNNClassifier.fit(x, y, {:brute, [num_neighbors: 3]}, num_classes: 2), and similarly for the metric option. Also, it is currently possible to do Scholar.Neighbors.KNNClassifier.fit(x, y, {:brute, [num_neighbors: 5]}, num_neighbors: 3, num_classes: 2), in which case num_neighbors: 5 silently overrides num_neighbors: 3. Perhaps an error should be raised to prevent this?
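
To make the first point concrete, here is a rough sketch of the {:algorithm_name, algorithm_opts} convention. The neighbor modules are the ones touched by this PR, but the helper itself and the exact fit arities are illustrative assumptions rather than the actual implementation:

defmodule AlgorithmOptionSketch do
  # Hypothetical helper: resolve the algorithm atom (or {atom, opts} tuple) to a
  # module and forward the nested options to that module's fit function.
  def fit_algorithm(algorithm, x) do
    {name, algorithm_opts} =
      case algorithm do
        {name, algorithm_opts} when is_atom(name) -> {name, algorithm_opts}
        name when is_atom(name) -> {name, []}
      end

    module =
      case name do
        :brute -> Scholar.Neighbors.BruteKNN
        :kd_tree -> Scholar.Neighbors.KDTree
        :random_projection_forest -> Scholar.Neighbors.RandomProjectionForest
      end

    module.fit(x, algorithm_opts)
  end
end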

TODO:

  • I think KDTree.predict/2 should be updated to return {neighbors, distances} (currently it returns just neighbors; a unit test is failing because of this). I might need help with this one.
  • Implement KNNClassifier.predict_proba/2.
  • Add more metrics, e.g. :euclidean.
  • Not sure, but Scholar.Options.metric might also need to be edited. An alternative is removing it from the k-NN modules and specifying the metrics as atoms in the docs.
  • Maybe a few more unit tests.

Lastly, I am sorry this took slightly longer to implement than I said it would. I suffered a horrible bike crash this week. I am fine, but still recovering, both physically and mentally. I have started implementing KNNRegressor in parallel with this one; it shouldn't take long.

@josevalim (Contributor)

How to provide k-NN algorithm and algorithm-specific options?

One option is to pass all options together. KNNClassifier then splits off (via Keyword.split) the options it uses and passes all remaining options to the underlying algorithm, which also uses NimbleOptions to validate them and raise in case of unknown or bad options.

FWIW, I'd also call it simply :algorithm.
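
A minimal sketch of that flow, assuming hypothetical module and schema names (not necessarily what this PR ends up with):

defmodule KNNClassifierOptionsSketch do
  @classifier_schema NimbleOptions.new!(
                       algorithm: [type: :atom, default: :brute],
                       num_classes: [type: :pos_integer, required: true]
                     )
  @classifier_keys [:algorithm, :num_classes]

  def fit(x, y, opts) do
    # Keep only the classifier's own keys; everything else (num_neighbors, metric, ...)
    # goes to the underlying algorithm, which validates it against its own
    # NimbleOptions schema and raises on unknown or bad options.
    {classifier_opts, algorithm_opts} = Keyword.split(opts, @classifier_keys)
    classifier_opts = NimbleOptions.validate!(classifier_opts, @classifier_schema)

    algorithm_module =
      case classifier_opts[:algorithm] do
        :brute -> Scholar.Neighbors.BruteKNN
        :kd_tree -> Scholar.Neighbors.KDTree
        other -> other
      end

    algorithm = algorithm_module.fit(x, algorithm_opts)
    %{algorithm: algorithm, labels: y, num_classes: classifier_opts[:num_classes]}
  end
end

With this shape, the call from the PR description becomes Scholar.Neighbors.KNNClassifier.fit(x, y, algorithm: :brute, num_neighbors: 3, num_classes: 2), and num_neighbors can no longer be given in two places at once.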

@josevalim (Contributor) commented May 12, 2024

Not sure, but Scholar.Options.metric might also need to be edited. An alternative is removing it from the k-NN modules and specifying the metrics as atoms in the docs.

What do you want to edit? We should probably make it consistent and make it always return a two-arity function. Is this what you want?

@josevalim (Contributor)

I think KDTree.predict/2 should be updated to return {neighbors, distances} (currently it returns just neighbors; a unit test is failing because of this). I might need help with this one.

@msluszniak could you please give a hand on this one? 🙌

Last, I am sorry it took me slightly longer to implement this than I said. I suffered a horrible bike crash this week. I am fine, but still recovering, both physically and mentally.

Sorry to hear that, but glad you are fine! Have a speedy recovery!

@krstopro (Member, Author)

Not sure, but Scholar.Options.metric might also need to be edited. An alternative is removing it from the k-NN modules and specifying the metrics as atoms in the docs.

What do you want to edit? We should probably make it consistent and make it always return a two-arity function. Is this what you want?

I am not sure if we want to specify the metric option as

type: {:custom, Scholar.Options, :metric, []},

or simply as type: {:in, [:minkowski, :cosine]}. Especially if we want to add more metrics, it might become an issue which of them are supported by the different k-NN algorithms. Another thing is, as you say, whether the normalization should be performed inside Scholar.Options.metric or inside the modules where the metric can be specified (as mentioned here).
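
For reference, here is one way such a custom validator could normalize everything to a two-arity function. This is only an illustration of the idea (the distance formulas are written inline with plain Nx calls), not the actual Scholar.Options.metric:

defmodule MetricOptionSketch do
  # Hypothetical validator for type: {:custom, MetricOptionSketch, :metric, []}.
  # Every accepted spelling of the option is normalized to a two-arity function,
  # so the k-NN modules never have to branch on metric atoms themselves.
  def metric(:minkowski), do: metric({:minkowski, 2})

  def metric({:minkowski, p}) when is_number(p) and p > 0 do
    {:ok,
     fn x, y ->
       Nx.pow(Nx.sum(Nx.pow(Nx.abs(Nx.subtract(x, y)), p), axes: [-1]), 1 / p)
     end}
  end

  def metric(:cosine) do
    {:ok,
     fn x, y ->
       num = Nx.sum(Nx.multiply(x, y), axes: [-1])

       den =
         Nx.multiply(
           Nx.sqrt(Nx.sum(Nx.pow(x, 2), axes: [-1])),
           Nx.sqrt(Nx.sum(Nx.pow(y, 2), axes: [-1]))
         )

       Nx.subtract(1, Nx.divide(num, den))
     end}
  end

  def metric(fun) when is_function(fun, 2), do: {:ok, fun}

  def metric(other) do
    {:error, "expected :minkowski, {:minkowski, p}, :cosine or a 2-arity function, got: #{inspect(other)}"}
  end
end

An {:in, [:minkowski, :cosine]} type would then only be needed by algorithms that genuinely cannot accept an arbitrary function.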

@krstopro (Member, Author) commented May 12, 2024

How to provide k-NN algorithm and algorithm-specific options?

One option is to pass all options together. KNNClassifier then splits off (via Keyword.split) the options it uses and passes all remaining options to the underlying algorithm, which also uses NimbleOptions to validate them and raise in case of unknown or bad options.

FWIW, I'd also call it simply :algorithm.

Yeah, this should be the way. :)

@josevalim (Contributor)

I am not sure if we want to specify the metric option as

Let's open up a separate issue to normalize how the metric is handled. My suggestion would be to use {:custom, Scholar.Options, :metric, []} everywhere and, if you cannot handle an arbitrary metric, explicitly opt in to only the ones you can.

@msluszniak (Contributor)

@msluszniak could you please give a hand on this one? 🙌

Sure, I'll work on that.

Last, I am sorry it took me slightly longer to implement this than I said. I suffered a horrible bike crash this week. I am fine, but still recovering, both physically and mentally.

I'm sorry, I wish you a quick recovery.

-defnp predict_n(tree, point, opts) do
-  k = opts[:k]
+defnp predict_n(tree, point) do
+  k = tree.num_neighbors
@msluszniak (Contributor) commented May 12, 2024

As we now pass num_neighbors in fit, I think we may add a note that there is no need to recompute the whole KDTree from scratch for a different number of nearest neighbors.

@krstopro (Member, Author)

Perhaps, but then we should do the same in BruteKNN and RandomProjectionForest; they all now take num_neighbors as an option to fit.

@msluszniak (Contributor)

I think KDTree.predict/2 should be updated to return {neighbors, distances} (currently it returns just neighbors; a unit test is failing because of this). I might need help with this one.

I've sent a PR with the improvement.

@krstopro (Member, Author)

I think KDTree.predict/2 should be updated to return {neighbors, distances} (currently it returns just neighbors; a unit test is failing because of this). I might need help with this one.

I've sent a PR with the improvement.

Very quick, thanks. :)
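
For reference, the contract being discussed is roughly the following (the shapes are my assumption, not something stated in the PR):

{neighbors, distances} = Scholar.Neighbors.KDTree.predict(tree, x)
# neighbors: {n, k} integer tensor with the indices of the k nearest training points
# distances: {n, k} float tensor with the corresponding distances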

@krstopro (Member, Author) commented May 14, 2024

Alright, almost done. I think there is a bug in predict_proba/2. For example, if we have

x_train = Nx.tensor([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]])
y_train = Nx.tensor([0, 0, 0, 1, 1])
model = Scholar.Neighbors.KNNClassifier.fit(x_train, y_train, num_neighbors: 3, num_classes: 2)
x = Nx.tensor([[1, 3], [4, 2], [3, 6]])

Then Scholar.Neighbors.KNNClassifier.predict(model, x) gives

#Nx.Tensor<
  s64[3]
  [0, 0, 1]
>

while Scholar.Neighbors.KNNClassifier.predict_proba(model, x) gives

#Nx.Tensor<
  f32[3][2]
  [
    [1.0, 0.0],
    [1.0, 0.0],
    [1.0, 0.0]
  ]
>

This doesn't seem right. I am having a look at it.

defn predict_proba(model, x) do
Contributor

Let's rename it to predict_probability, because that's what we call these functions everywhere else! However, if they are incorrect and we can't figure out why, we can remove this for now and add it in future PRs. :) Your call!

@krstopro (Member, Author)

I would rather investigate it now. I don't expect it to take long, but then, you never know. :)


indices =
  Nx.stack(
    [Nx.iota(Nx.shape(labels_pred), axis: 0), Nx.take(model.labels, labels_pred)],
Contributor

I think this is what we want?

Suggested change:
-    [Nx.iota(Nx.shape(labels_pred), axis: 0), Nx.take(model.labels, labels_pred)],
+    [Nx.iota(Nx.shape(labels_pred), axis: 0), labels_pred],

@krstopro (Member, Author) commented May 14, 2024

Yes, I think so. Let me rename labels_pred to neighbor_labels; I think it is a more suitable name.
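
For completeness, here is a standalone sketch (my own illustration, not necessarily the merged code) of what the fixed probability computation boils down to: scatter-add a count of one per neighbor into its class column, then divide by the number of neighbors.

defmodule PredictProbabilitySketch do
  # neighbor_labels has shape {n, k}: the class of each of the k nearest neighbors
  # of every one of the n query points.
  def predict_probability(neighbor_labels, num_classes) do
    {n, k} = Nx.shape(neighbor_labels)

    # Pair every (query row, neighbor class) into an index of the {n, num_classes} output.
    stacked = Nx.stack([Nx.iota({n, k}, axis: 0), neighbor_labels], axis: -1)
    indices = Nx.reshape(stacked, {n * k, 2})

    # Scatter-add one count per neighbor into its class column, then normalize by k.
    counts =
      Nx.indexed_add(
        Nx.broadcast(0.0, {n, num_classes}),
        indices,
        Nx.broadcast(1.0, {n * k})
      )

    Nx.divide(counts, k)
  end
end

For example, PredictProbabilitySketch.predict_probability(Nx.tensor([[0, 0, 1]]), 2) returns a {1, 2} tensor of roughly [[0.6667, 0.3333]].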

@krstopro merged commit ffaac87 into elixir-nx:main on May 14, 2024
2 checks passed
@krstopro deleted the knn-classifier branch on May 15, 2024 at 17:01