Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distributed silhouette #255

Merged
merged 16 commits into from
Oct 3, 2023
Merged

Conversation

nomadbl
Copy link
Contributor

@nomadbl nomadbl commented Jun 25, 2023

This PR implements a distributed implementation of silhouette score
https://arxiv.org/pdf/2303.14102.pdf

This implementation only supports Square Euclidean metric at the moment, but the extension to some (not all) other metrics is straightforward. See the paper.

This PR includes:

  • Docs incl. usage example (see below)
  • Implementation
  • a bit of a refactor of the existing silhouette code (break it down into more functions to reuse in the new distributed implementation).
  • Unit tests of the new implementation, including a comparison with the usual implementation.

Example including benchmarking

Julia> k=10; d=3; bs=100; n=10000;
Julia> pre = SilhouettesDistsPrecompute(k, d, Float32);
Julia> x = reshape(collect(Float32, 1:30000), 3, n); # direct computation is impractical on such a big dataset
Julia> a = reshape(repeat(collect(1:k),trunc(Int, n/k)), n);
Julia> batches_x = eachslice(reshape(x, 3, trunc(Int, n/bs), bs); dims=3); batches_a = eachslice(reshape(a, trunc(Int, n/bs), bs); dims=2);
Julia> @time [silhouettes_precompute_batch!(k, aa, xx, pre) for (xx, aa) in zip(batches_x, batches_a)]; # precompute vectors on big data
0.838358 seconds (2.06 M allocations: 141.495 MiB, 4.24% gc time, 99.00% compilation time)
Julia> @time sil = vcat([silhouettes(xx, aa, pre) for (xx, aa) in zip(batches_x, batches_a)]); # calculate silhouette scores in batched fashion
0.049759 seconds (53.78 k allocations: 4.440 MiB, 98.54% compilation time)
Julia> size(sil)
(10000,)

@codecov-commenter
Copy link

codecov-commenter commented Jun 25, 2023

Codecov Report

Attention: 8 lines in your changes are missing coverage. Please review.

Files Coverage Δ
src/Clustering.jl 100.00% <ø> (ø)
src/silhouette.jl 100.00% <100.00%> (+1.63%) ⬆️
src/utils.jl 95.23% <100.00%> (+1.48%) ⬆️
src/cluster_distances.jl 92.07% <92.07%> (ø)

📢 Thoughts on this report? Let us know!.

Copy link
Member

@alyst alyst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!
There are two major considerations:

  • could you please provide a benchmark similar to the original paper that shows how faster is this method in comparison with the original implementation?
  • since @jaksle suggested to contribute alternative clustering evaluation metrics (Providing additional intrinsic evaluation metrices #253), could you please coordinate with him that this PR would not clash with his code? I've provided some initial code review, but to avoid doing extra work, I suggest figuring out this question first.

src/silhouette.jl Outdated Show resolved Hide resolved
src/silhouette.jl Outdated Show resolved Hide resolved
src/silhouette.jl Outdated Show resolved Hide resolved
src/silhouette.jl Outdated Show resolved Hide resolved
src/silhouette.jl Outdated Show resolved Hide resolved
src/silhouette.jl Outdated Show resolved Hide resolved
src/silhouette.jl Outdated Show resolved Hide resolved
src/silhouette.jl Outdated Show resolved Hide resolved
src/silhouette.jl Outdated Show resolved Hide resolved
src/silhouette.jl Outdated Show resolved Hide resolved
@jaksle
Copy link
Contributor

jaksle commented Jun 25, 2023

Interesting algorithm! There should be no conflicts with what I plan to implement, other metrics do not need to use silhouettes code in any way.

@nomadbl
Copy link
Contributor Author

nomadbl commented Jun 25, 2023

Thanks for the thorough review!
I'll make the changes as you suggested and provide details in the comments as necessary.

@nomadbl
Copy link
Contributor Author

nomadbl commented Jun 26, 2023

I updated the API in the spirit of @alyst 's suggestion.

  • Silhouettes uses the streaming implementation on supported metrics by default (can be overriden, and falls back to basic implementation for unsupported metrics)
  • The precomputation struct can be used to call the precomputation method as a functor. This simplifies things for the user and allows easy extension later, because:
  • A new user facing function silhouettes_cache(::Type, metric::Union{Metric, SemiMetric}, nclusters, dims) handles the constructors so the user specifies the metric he is using and gets the correct struct (he is also thrown an error in case of unsupported metric).
  • etc, see changes in the code.

Also, I added a benchmark comparing the vanilla and streaming implementation. You can see the results on my machine below, clearly displaying O(N) vs O(N^2) behavior. Note the distances calculation is included in the benchmark time, I wasn't sure if this was appropriate but left it there for now.

"silhouette" => 4-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "n=20,000" => 2-element BenchmarkTools.BenchmarkGroup:
		  tags: []
		  "with_precalculation" => Trial(2.102 ms)
		  "without_precalculation" => Trial(4.916 s)
	  "n=10,000" => 2-element BenchmarkTools.BenchmarkGroup:
		  tags: []
		  "with_precalculation" => Trial(1.027 ms)
		  "without_precalculation" => Trial(913.840 ms)
	  "n=1,000" => 2-element BenchmarkTools.BenchmarkGroup:
		  tags: []
		  "with_precalculation" => Trial(91.534 μs)
		  "without_precalculation" => Trial(3.986 ms)
	  "n=100" => 2-element BenchmarkTools.BenchmarkGroup:
		  tags: []
		  "with_precalculation" => Trial(10.111 μs)
		  "without_precalculation" => Trial(42.006 μs)

@nomadbl nomadbl requested a review from alyst June 26, 2023 21:10
Copy link
Member

@alyst alyst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the benchmarking code!

Could you please share the plots (preferred) or textual output that compares the current silhouettes() performance with the new "all-at-once"? We have to confirm that there is an improvement, this is a prerequisite for merging the PR.

I see that you have advanced with the streaming/precomputed/distributed implementation. I understand your motivation, but let's move it out of scope of this PR:

  • Clustering.jl provides basic unsupervised learning algorithms, none of which were specifically optimized for very large datasets or distributed processing. Providing and maintaining such capabilities would require much more resources than the community can dedicate. Also, if the users cannot cluster their very large datasets with Clustering.jl, it would be of little use if silhouettes() can evaluate the quality of these non-existing clusters.
  • To really exploit the benefits of "streaming" silhouettes, you would need to support multi-processing or multi-threading, which would complicate the code even more. In fact, under-the-hood, this PR would provide precompute_silhouettes()! and silhouettes(precomputed::PrecomputedSilhouettes) functions, so you can still explore batched computation, multithreading and multiprocessing on the side.

So I suggest this PR only uses "all-at-once" for silhouettes() (given it is faster than the classical one).

Let's try to iron out the API for the next iteration. Afterwards it should be mostly cosmetic things.

test/silhouette.jl Outdated Show resolved Hide resolved
test/silhouette.jl Outdated Show resolved Hide resolved
src/silhouette.jl Outdated Show resolved Hide resolved
src/silhouette.jl Outdated Show resolved Hide resolved
src/Clustering.jl Outdated Show resolved Hide resolved
src/silhouette.jl Outdated Show resolved Hide resolved
src/silhouette.jl Outdated Show resolved Hide resolved
src/silhouette.jl Outdated Show resolved Hide resolved
src/utils.jl Outdated Show resolved Hide resolved
src/utils.jl Outdated Show resolved Hide resolved
@nomadbl
Copy link
Contributor Author

nomadbl commented Jun 30, 2023

@alyst
Thanks for your feedback.
I think we are getting sidetracked with this API thing. I provided it since it looked useful but I now see that we are going to get bogged down in it. It is hard to communicate over github on the API and all the details related to it.
The API I provided is near identical to the one you specified in your detailed comment. Take a look at line 239.
Please lets leave the final API to another PR. This API is completely functional, but you can change it to your liking separately.

I already provided benchmarking exactly as you asked. The two benchmarks compare the silhouettes() function with/without precomputation. I don't know how to get the plots, so you can run the benchmark yourself and see them. Here is the text results again for your convenience.

"silhouette" => 4-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "n=20,000" => 2-element BenchmarkTools.BenchmarkGroup:
		  tags: []
		  "with_precalculation" => Trial(2.102 ms)
		  "without_precalculation" => Trial(4.916 s)
	  "n=10,000" => 2-element BenchmarkTools.BenchmarkGroup:
		  tags: []
		  "with_precalculation" => Trial(1.027 ms)
		  "without_precalculation" => Trial(913.840 ms)
	  "n=1,000" => 2-element BenchmarkTools.BenchmarkGroup:
		  tags: []
		  "with_precalculation" => Trial(91.534 μs)
		  "without_precalculation" => Trial(3.986 ms)
	  "n=100" => 2-element BenchmarkTools.BenchmarkGroup:
		  tags: []
		  "with_precalculation" => Trial(10.111 μs)
		  "without_precalculation" => Trial(42.006 μs)

@nomadbl
Copy link
Contributor Author

nomadbl commented Jun 30, 2023

Another important thing I forgot to write just now:

The cached workflow is the main way I am using this in my code.
The reason, as I alluded to it before, is that frequently the data is so big that it is impractical or impossible to hold it all at once in RAM, so there might not even be a way to call the silhouettes() function in practice.
Additionally, I don't see how it is the job of a library to dictate to the user how he should compute things. If there are two ways of doing something, there is no reason to hide one of them.

@alyst
Copy link
Member

alyst commented Jun 30, 2023

The API I provided is near identical to the one you specified in your detailed comment. Take a look at line 239.

I have seen it. Yes, it is "near identical" in the sense that it has all the parameters, but in different order, and different parameters are positional vs keyword.
From the perspective of the user of Clustering.jl that would mean that the user would have no intuition how functions are organized, and should always refer to the manual.
The particular order is also not just a matter of taste: the input data are provided as positional arguments, method parameters are keywords with the default values corresponding to the most common use case.

Again, I very much welcome your contribution of this improvement to the community.
I just assume that you are also committed to put a bit of effort to integrate it into the existing infrastructure for the benefit of the community.
I am sorry that it means extra work for you, but I have mentioned this issue in the first review.

I already provided benchmarking exactly as you asked. The two benchmarks compare the silhouettes() function with/without precomputation. I don't know how to get the plots, so you can run the benchmark yourself and see them. Here is the text results again for your convenience.

Thank you, the improvement looks impressive.

The cached workflow is the main way I am using this in my code.
The reason, as I alluded to it before, is that frequently the data is so big that it is impractical or impossible to hold it all at once in RAM, so there might not even be a way to call the silhouettes() function in practice.
Additionally, I don't see how it is the job of a library to dictate to the user how he should compute things. If there are two ways of doing something, there is no reason to hide one of them.

Exporting and advertising certain API in the documentation means that the maintainers are committed to maintain both the functionality and the API.
Breaking API compatibility between the versions is painful both for the users and the maintainers.
I hope we both agree that enabling distributed processing is a big thing, which requires thorough consideration.
I personally would not do it before consulting with the other maintainers.
I also totally understand that you don't want to resolve all these API issues -- that why I proposed to not export this functionality in this PR.
But you can still use non-exported functions and types from your code.
For example, we were using Base.permute!!(), which was never exported in Julia Base, and recently we had to remove it, because Julia devs decided that it brings no benefits (#229).
Had permute!!() been exported, Julia devs would have to wait until 2.0 to break the API.

@nomadbl
Copy link
Contributor Author

nomadbl commented Jul 1, 2023

Thank you for your detailed explanations and comments. It is very much appreciated.
I adjusted the API according to your comment.
Additionally, the cache struct and function are not exported. I am totally fine with this for my usage, but I absolutely think it should be exported and properly documented once the API is ironed out. I'll help as much as I can with this. I think a discussion in another medium of communication would be helpful.

@nomadbl nomadbl requested a review from alyst July 3, 2023 18:13
@nomadbl
Copy link
Contributor Author

nomadbl commented Jul 20, 2023

@alyst
I think I addressed your concerns with the last commits, but let me know if I missed anything please.
Also, I understand if you don't have more time for this, so if that is the case kindly offer an additional reviewer.
Your efforts are very much appreciated in either case!

@alyst
Copy link
Member

alyst commented Jul 21, 2023

@nomadbl Sorry, I didn't have time to finalize this PR. It's on my agenda, I hope to do it over the weekend. There were some comments regarding how the code is divided into functions that were not completely addressed, but I figured out that instead of another review round, it would be easier for both of us if I just commit them myself. Anyway, thanks for your contribution, it is a very nice improvement for silhouettes!

@alyst alyst force-pushed the Distributed_Silhouette branch from a2e73cc to fede633 Compare July 26, 2023 15:47
Copy link
Member

@alyst alyst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After having a closer look at the PR, I've realized that SilhouettesCache actually does not do anything silhouette-specific. It is essentially a method for optimized cluster-to-point distances. So I've refactored the code renaming the SilhouettesCache into ClusterDistance and moving it to the new cluster_distance.jl source file. That way it is more visible and could be reused for other algorithms. The "classic" distance matrix-based approach is handled by the SimpleClusterDistance class. It also handles any metrics that don't have a dedicated implementation.

I have committed my changes directly to your branch.
There were also some changes to master branch to fix the CI for the new version of Distances.jl, so I had to rebase this PR and force-push it to fix the CI.
You would need a fresh local check-out if you want to continue working on it.

So please take a look at my changes. Unfortunately, it was a bit more than I have anticipated, but it was required to make sure that Clustering.jl codebase remains consistent.
Let me know if you have any comments. If you are fine, I'll merge the PR.

Thanks again for your work, the benchmarks show some dramatic improvements for SqEuclidean.
I think after this refactoring implementing #258 (CosineClusterDistance) should be easier.

src/silhouette.jl Outdated Show resolved Hide resolved
src/silhouette.jl Outdated Show resolved Hide resolved
src/silhouette.jl Outdated Show resolved Hide resolved
src/silhouette.jl Outdated Show resolved Hide resolved
src/silhouette.jl Outdated Show resolved Hide resolved
@nomadbl
Copy link
Contributor Author

nomadbl commented Jul 26, 2023

@alyst
Thanks a lot!
I'll give this a proper look soon and see if there's any specific comments.
A preliminary comment/clarification is that the method doesn't store distances, and it does not compute them either. What it does do is cache quantities used to compute sums of distances. So for example, here, the sums of distances of points in some cluster to another point.
But yes, I do agree on the general point that this can be possibly used anywhere such distance sums are used, to compute them more efficiently (time and memory wise), so it is a good idea to make a separate file for it.

Again, I'll take a closer look soon, and thanks for your help and input!

@alyst
Copy link
Member

alyst commented Sep 27, 2023

@nomadbl Did you have a change to look at the PR?

@nomadbl
Copy link
Contributor Author

nomadbl commented Sep 27, 2023 via email

Copy link
Contributor Author

@nomadbl nomadbl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great and honestly hard to tell the difference from the previous version aside from some naming changes. But I suppose the bigger changes are actually on the side of the "usual" calculation?
The suggestion I made is small but potentially important, so I'd appreciate it.

Again, sorry for the delay with this review.

src/cluster_distances.jl Outdated Show resolved Hide resolved
src/cluster_distances.jl Outdated Show resolved Hide resolved
@alyst
Copy link
Member

alyst commented Sep 30, 2023

@nomadbl

Looks great and honestly hard to tell the difference from the previous version aside from some naming changes. But I suppose the bigger changes are actually on the side of the "usual" calculation?

There were no changes to the formulas, but some refactorings to avoid code duplication and simplify logic by utilizing Julia method dispatch mechanism.
Calculation of cluster distances (ex-SilhouettesCache) is now separated from silhouette() code.

Thank you for the review. I have incorporated your feedback, and if there are no further concerns, I will merge it.

@nomadbl
Copy link
Contributor Author

nomadbl commented Sep 30, 2023

@alyst
I think merging is fine. It's better if I give improvement suggestions based on actual use from this point.

1 similar comment
@nomadbl
Copy link
Contributor Author

nomadbl commented Sep 30, 2023

@alyst
I think merging is fine. It's better if I give improvement suggestions based on actual use from this point.

alyst added 2 commits October 2, 2023 16:49
rename "cache" into ClusterDistances and
move to separate source file
@alyst alyst force-pushed the Distributed_Silhouette branch from 66aeb5a to 7da02f5 Compare October 2, 2023 23:50
@alyst alyst merged commit 24dba09 into JuliaStats:master Oct 3, 2023
@alyst
Copy link
Member

alyst commented Oct 8, 2023

Released as 0.15.5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants