
Pre-processing recommendations? #4

Open

16thomja opened this issue Aug 26, 2022 · 4 comments

@16thomja

Hello!

First of all, bravo on making hierarchical density clustering applicable to big data! I'm considering applying your algorithm to a set of ~40 million 200-dimensional document embeddings.

However, I have a question that isn't strictly related to FISHDBC. I struggled with the curse of dimensionality (distance uniformity) with HDBSCAN, and I suspect I will also have trouble here. Do you know of a remedy (dimensionality reduction, a different metric, etc.) that can be efficiently applied to such a large volume of data?

@matteodellamico
Owner

I have two answers for you.

The first, and simplest, is: please try FISHDBC before dismissing it. The algorithm itself scales well even with very high dimensionality, provided the embedding is good (i.e., the intrinsic dimensionality is low). 40M points might be too much for the current implementation, but start with a subsample of, say, 100k or 1M. With a lot of memory and patience, you might even be able to handle a larger dataset.
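In code, that experiment would look something like this (toy stand-in data in place of your FastText matrix; see the README/demo for the full interface and output):

```python
import numpy as np
from flexible_clustering import FISHDBC

# Random stand-in for the real ~40M x 200 FastText embedding matrix.
rng = np.random.default_rng(42)
embeddings = rng.standard_normal((200_000, 200), dtype=np.float32)

# Draw a 100k subsample to gauge feasibility before scaling up.
idx = rng.choice(len(embeddings), size=100_000, replace=False)
sample = embeddings[idx]

def euclidean(x, y):
    # FISHDBC accepts any callable as its distance function.
    return np.linalg.norm(x - y)

fishdbc = FISHDBC(euclidean)
fishdbc.update(sample)  # points can also be fed incrementally, in batches
result = fishdbc.cluster()  # see the demo script for unpacking the output
```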

The second answer requires more, and different, work. Are we sure an embedding is the right thing? FISHDBC is designed to work with any distance function, so why should we embed our docs in a Euclidean space? Documents aren't inherently Euclidean objects, and I think it's quite likely that an embedding loses information. Can you design an ad-hoc distance/dissimilarity function (which doesn't need to be a metric, i.e., doesn't need to respect the triangle inequality) to feed to FISHDBC? My experience tells me that this often gives great results where embeddings fall short.
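As a toy illustration of what I mean (Jaccard dissimilarity over token sets is just one arbitrary choice, not a recommendation for your corpus):

```python
from flexible_clustering import FISHDBC

# A toy corpus; each document is represented as a set of tokens.
docs = ["the cat sat on the mat",
        "a cat on a mat",
        "stock markets fell sharply today"]
token_sets = [frozenset(d.lower().split()) for d in docs]

def jaccard_dissimilarity(a, b):
    # 1 - |a & b| / |a | b|: operates directly on the documents, with no
    # Euclidean embedding involved. FISHDBC only needs a callable.
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union

fishdbc = FISHDBC(jaccard_dissimilarity)
fishdbc.update(token_sets)
```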

In any case, please keep me posted; I'd love to know how this works out!

@16thomja
Author

Awesome, thanks for the advice! I suppose using an exotic distance would force me to move away from my current embedding algorithm, FastText. I've been reading about the ability of shared nearest neighbor (SNN) distance, computed on top of Euclidean distances, to alleviate the curse of dimensionality to some degree, but at this scale it would be a prohibitively expensive calculation. A real head scratcher!
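To show where the cost comes from, here's a rough sketch of SNN distance on toy data (k chosen arbitrarily); the all-points kNN query is the step that becomes prohibitive at 40M x 200:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy stand-in data; the real cost driver is the kNN search below.
rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 200)).astype(np.float32)

k = 20
nn = NearestNeighbors(n_neighbors=k).fit(X)
# Each point's k nearest neighbors by Euclidean distance (self included).
knn = nn.kneighbors(X, return_distance=False)
neighbor_sets = [set(row) for row in knn]

def snn_distance(i, j):
    # The fewer neighbors two points' kNN lists share, the farther apart
    # they are considered to be.
    shared = len(neighbor_sets[i] & neighbor_sets[j])
    return 1.0 - shared / k
```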

@matteodellamico
Owner

Can you give me a pointer to shared nearest neighbor distance? I suppose you'd need an embedding in any case. Intuitively, it doesn't look like you should need it, because density-based clustering already does something similar.

@16thomja
Author

Here's the paper from which I learned it:
https://imada.sdu.dk/~zimek/publications/SSDBM2010/SNN-SSDBM2010-preprint.pdf
I believe you're right: the core distance used in density-based clustering effectively does this already.
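For anyone who finds this later, a sketch of that mechanism: the core distance is the distance to a point's k-th nearest neighbor, and HDBSCAN-style clustering works on the mutual reachability distance built from it (toy data, arbitrary k):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.standard_normal((1_000, 200)).astype(np.float32)

k = 20
dists, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
core = dists[:, -1]  # core distance: distance to the k-th neighbor

def mutual_reachability(i, j):
    # Raw distance floored by both points' core distances: in sparse
    # regions this inflates small pairwise distances, which plays a role
    # similar to SNN's neighborhood-overlap correction.
    return max(core[i], core[j], float(np.linalg.norm(X[i] - X[j])))
```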
