-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Distributed silhouette #255
Conversation
Codecov ReportAttention:
📢 Thoughts on this report? Let us know!. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
There are two major considerations:
- could you please provide a benchmark similar to the original paper that shows how faster is this method in comparison with the original implementation?
- since @jaksle suggested to contribute alternative clustering evaluation metrics (Providing additional intrinsic evaluation metrices #253), could you please coordinate with him that this PR would not clash with his code? I've provided some initial code review, but to avoid doing extra work, I suggest figuring out this question first.
Interesting algorithm! There should be no conflicts with what I plan to implement, other metrics do not need to use silhouettes code in any way. |
Thanks for the thorough review! |
I updated the API in the spirit of @alyst 's suggestion.
Also, I added a benchmark comparing the vanilla and streaming implementation. You can see the results on my machine below, clearly displaying O(N) vs O(N^2) behavior. Note the distances calculation is included in the benchmark time, I wasn't sure if this was appropriate but left it there for now.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the benchmarking code!
Could you please share the plots (preferred) or textual output that compares the current silhouettes()
performance with the new "all-at-once"? We have to confirm that there is an improvement, this is a prerequisite for merging the PR.
I see that you have advanced with the streaming/precomputed/distributed implementation. I understand your motivation, but let's move it out of scope of this PR:
- Clustering.jl provides basic unsupervised learning algorithms, none of which were specifically optimized for very large datasets or distributed processing. Providing and maintaining such capabilities would require much more resources than the community can dedicate. Also, if the users cannot cluster their very large datasets with Clustering.jl, it would be of little use if silhouettes() can evaluate the quality of these non-existing clusters.
- To really exploit the benefits of "streaming" silhouettes, you would need to support multi-processing or multi-threading, which would complicate the code even more. In fact, under-the-hood, this PR would provide
precompute_silhouettes()!
andsilhouettes(precomputed::PrecomputedSilhouettes)
functions, so you can still explore batched computation, multithreading and multiprocessing on the side.
So I suggest this PR only uses "all-at-once" for silhouettes()
(given it is faster than the classical one).
Let's try to iron out the API for the next iteration. Afterwards it should be mostly cosmetic things.
@alyst I already provided benchmarking exactly as you asked. The two benchmarks compare the
|
Another important thing I forgot to write just now: The cached workflow is the main way I am using this in my code. |
I have seen it. Yes, it is "near identical" in the sense that it has all the parameters, but in different order, and different parameters are positional vs keyword. Again, I very much welcome your contribution of this improvement to the community.
Thank you, the improvement looks impressive.
Exporting and advertising certain API in the documentation means that the maintainers are committed to maintain both the functionality and the API. |
Thank you for your detailed explanations and comments. It is very much appreciated. |
@alyst |
@nomadbl Sorry, I didn't have time to finalize this PR. It's on my agenda, I hope to do it over the weekend. There were some comments regarding how the code is divided into functions that were not completely addressed, but I figured out that instead of another review round, it would be easier for both of us if I just commit them myself. Anyway, thanks for your contribution, it is a very nice improvement for silhouettes! |
a2e73cc
to
fede633
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After having a closer look at the PR, I've realized that SilhouettesCache
actually does not do anything silhouette-specific. It is essentially a method for optimized cluster-to-point distances. So I've refactored the code renaming the SilhouettesCache
into ClusterDistance
and moving it to the new cluster_distance.jl
source file. That way it is more visible and could be reused for other algorithms. The "classic" distance matrix-based approach is handled by the SimpleClusterDistance
class. It also handles any metrics that don't have a dedicated implementation.
I have committed my changes directly to your branch.
There were also some changes to master branch to fix the CI for the new version of Distances.jl, so I had to rebase this PR and force-push it to fix the CI.
You would need a fresh local check-out if you want to continue working on it.
So please take a look at my changes. Unfortunately, it was a bit more than I have anticipated, but it was required to make sure that Clustering.jl codebase remains consistent.
Let me know if you have any comments. If you are fine, I'll merge the PR.
Thanks again for your work, the benchmarks show some dramatic improvements for SqEuclidean.
I think after this refactoring implementing #258 (CosineClusterDistance
) should be easier.
@alyst Again, I'll take a closer look soon, and thanks for your help and input! |
@nomadbl Did you have a change to look at the PR? |
Sorry about that, I'll take a good look tomorrow.
…On Wed, Sep 27, 2023 at 11:18 PM Alexey Stukalov ***@***.***> wrote:
@nomadbl <https://github.com/nomadbl> Did you have a change to look at
the PR?
—
Reply to this email directly, view it on GitHub
<#255 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AEJJFSDJIGSERRJFYSULTT3X4SCYDANCNFSM6AAAAAAZTFRHLM>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great and honestly hard to tell the difference from the previous version aside from some naming changes. But I suppose the bigger changes are actually on the side of the "usual" calculation?
The suggestion I made is small but potentially important, so I'd appreciate it.
Again, sorry for the delay with this review.
There were no changes to the formulas, but some refactorings to avoid code duplication and simplify logic by utilizing Julia method dispatch mechanism. Thank you for the review. I have incorporated your feedback, and if there are no further concerns, I will merge it. |
@alyst |
1 similar comment
@alyst |
rename "cache" into ClusterDistances and move to separate source file
66aeb5a
to
7da02f5
Compare
Released as 0.15.5 |
This PR implements a distributed implementation of silhouette score
https://arxiv.org/pdf/2303.14102.pdf
This implementation only supports Square Euclidean metric at the moment, but the extension to some (not all) other metrics is straightforward. See the paper.
This PR includes:
Example including benchmarking