
Pre-processing recommendations? #4

Open

16thomja opened this issue Aug 26, 2022 · 4 comments

@16thomja

Hello!

First of all, bravo on making hierarchical density clustering applicable to big data! I'm considering applying your algorithm to a set of ~40 million 200-dimensional document embeddings.

However, I have a question that isn't strictly related to FISHDBC. I struggled with the curse of dimensionality (distance uniformity) with HDBSCAN, and I suspect I will also have trouble here. Do you know of a remedy (dimensionality reduction, a different metric, etc.) that can be efficiently applied to such a large volume of data?

@matteodellamico
Owner

I have two answers for you.

The first, and simplest, is: please try FISHDBC before dismissing it. The algorithm itself scales well even with very high dimensionality, provided the embedding is good (i.e., the intrinsic dimensionality is low). 40M points might be too much for the current implementation, but start with a subsample of, say, 100k or 1M. With a lot of memory and patience, you might even be able to handle a larger dataset.
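In code, that experiment would look something like this (toy stand-in data in place of your FastText matrix; see the README/demo for the full interface and output):

```python
import numpy as np
from flexible_clustering import FISHDBC

# Random stand-in for the real ~40M x 200 FastText embedding matrix.
rng = np.random.default_rng(42)
embeddings = rng.standard_normal((200_000, 200), dtype=np.float32)

# Draw a 100k subsample to gauge feasibility before scaling up.
idx = rng.choice(len(embeddings), size=100_000, replace=False)
sample = embeddings[idx]

def euclidean(x, y):
    # FISHDBC accepts any callable as its distance function.
    return np.linalg.norm(x - y)

fishdbc = FISHDBC(euclidean)
fishdbc.update(sample)  # points can also be fed incrementally, in batches
result = fishdbc.cluster()  # see the demo script for unpacking the output
```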

The second answer requires more, and different, work. Are we sure an embedding is the right thing? FISHDBC is designed to work with any distance function, so why should we embed our docs in a Euclidean space? Documents aren't inherently Euclidean objects, and I think it's quite likely that an embedding loses information. Can you design an ad-hoc distance/dissimilarity function (which doesn't need to be a metric, i.e., doesn't need to respect the triangle inequality) to feed to FISHDBC? My experience tells me that this often gives great results where embeddings fall short.
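As a toy illustration of what I mean (Jaccard dissimilarity over token sets is just one arbitrary choice, not a recommendation for your corpus):

```python
from flexible_clustering import FISHDBC

# A toy corpus; each document is represented as a set of tokens.
docs = ["the cat sat on the mat",
        "a cat on a mat",
        "stock markets fell sharply today"]
token_sets = [frozenset(d.lower().split()) for d in docs]

def jaccard_dissimilarity(a, b):
    # 1 - |a & b| / |a | b|: operates directly on the documents, with no
    # Euclidean embedding involved. FISHDBC only needs a callable.
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union

fishdbc = FISHDBC(jaccard_dissimilarity)
fishdbc.update(token_sets)
```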

In any case, please keep me posted; I'd love to know how this works out!

@16thomja
Author

Awesome, thanks for the advice! I suppose using an exotic distance would force me to move away from my current embedding algorithm, FastText. I've been reading about the ability of shared nearest neighbor (SNN) distance, computed on top of Euclidean distances, to alleviate the curse of dimensionality to some degree, but at this scale it would be a prohibitively expensive calculation. A real head scratcher!
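To show where the cost comes from, here's a rough sketch of SNN distance on toy data (k chosen arbitrarily); the all-points kNN query is the step that becomes prohibitive at 40M x 200:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy stand-in data; the real cost driver is the kNN search below.
rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 200)).astype(np.float32)

k = 20
nn = NearestNeighbors(n_neighbors=k).fit(X)
# Each point's k nearest neighbors by Euclidean distance (self included).
knn = nn.kneighbors(X, return_distance=False)
neighbor_sets = [set(row) for row in knn]

def snn_distance(i, j):
    # The fewer neighbors two points' kNN lists share, the farther apart
    # they are considered to be.
    shared = len(neighbor_sets[i] & neighbor_sets[j])
    return 1.0 - shared / k
```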

@matteodellamico
Owner

Can you give me a pointer to shared nearest neighbor distance? I suppose you'd need an embedding in any case. Intuitively, it doesn't look like you should need it, because density-based clustering already does something similar.

@16thomja
Author

Here's the paper from which I learned it:
https://imada.sdu.dk/~zimek/publications/SSDBM2010/SNN-SSDBM2010-preprint.pdf
I believe you're right: the core distance used in density-based clustering effectively does this already.
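For anyone who finds this later, a sketch of that mechanism: the core distance is the distance to a point's k-th nearest neighbor, and HDBSCAN-style clustering works on the mutual reachability distance built from it (toy data, arbitrary k):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.standard_normal((1_000, 200)).astype(np.float32)

k = 20
dists, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
core = dists[:, -1]  # core distance: distance to the k-th neighbor

def mutual_reachability(i, j):
    # Raw distance floored by both points' core distances: in sparse
    # regions this inflates small pairwise distances, which plays a role
    # similar to SNN's neighborhood-overlap correction.
    return max(core[i], core[j], float(np.linalg.norm(X[i] - X[j])))
```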
