
Hyperparameters optimisation #65

Open
wiktorolszowy opened this issue Oct 2, 2024 · 0 comments

wiktorolszowy commented Oct 2, 2024

Hi! Thanks a lot for this package. I am interested in how best to choose values of the hyperparameters. There are five of them that seem particularly relevant:

  1. d: the number of hash functions used to initialize the LSH forest data structure; 128 by default.
  2. l: the number of prefix trees used to initialize the LSH forest data structure; 8 by default.
  3. k: the number of nearest neighbors used to create the k-nearest-neighbor graph; 10 by default.
  4. k_c: the scalar by which k is multiplied before querying the LSH forest; 10 by default.
  5. p: the size of the nodes, which affects the magnitude of their repelling force; 1/65 by default.

The first two parameters are from tmap.LSHForest and their default values are defined here. The remaining parameters are from tmap.layout_from_lsh_forest and their default values are defined here.

From the supplement (https://ndownloader.figstatic.com/files/21710592), p seems particularly important (cf. figures S1+S2+S3+S7). I often see tmap visualizations that are too sparse; in particular, some branches are very long while others are very short (e.g., at the leaves). The paper and its hyperparameter analysis are already 4 years old. I am wondering whether someone who has used this tool extensively and has experimented with these hyperparameters has developed rules of thumb for optimizing them, especially p, for example depending on the number of data points, and perhaps also on the approximate number of suspected clusters.
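For concreteness, here is a minimal sketch of what a grid search over these five hyperparameters might look like. The grid values below are illustrative guesses, not recommendations (the `10 / n_points` entry for p just encodes the intuition that larger datasets may need a smaller per-node repelling force), and the tmap calls are shown only as comments based on the package's documented names:

```python
# Hedged sketch: enumerating candidate hyperparameter settings for tmap.
# Parameter names follow the list above: d, l, k, k_c (here "kc"), p ("node_size").
from itertools import product

def candidate_configs(n_points):
    """Yield dicts of hyperparameter settings around the tmap defaults.

    The grids are illustrative, not tuned recommendations.
    """
    d_grid = [64, 128, 256]           # number of hash functions (default 128)
    l_grid = [8, 32]                  # number of prefix trees (default 8)
    k_grid = [10, 50]                 # k-nearest neighbors (default 10)
    kc_grid = [10]                    # query multiplier for k (default 10)
    p_grid = [1 / 65, 10 / n_points]  # node size / repelling force (default 1/65)
    for d, l, k, kc, p in product(d_grid, l_grid, k_grid, kc_grid, p_grid):
        yield {"d": d, "l": l, "k": k, "kc": kc, "node_size": p}

# Each candidate would then be applied roughly as (assuming tmap's API):
#   lsh = tmap.LSHForest(cfg["d"], cfg["l"])
#   ...batch_add MinHash vectors..., then lsh.index()
#   layout_cfg = tmap.LayoutConfiguration()
#   layout_cfg.k = cfg["k"]; layout_cfg.kc = cfg["kc"]
#   layout_cfg.node_size = cfg["node_size"]
#   x, y, s, t, _ = tmap.layout_from_lsh_forest(lsh, layout_cfg)
```

The open question is what objective to score each layout against; any pointers on that, or on narrowing these grids, would be appreciated.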
