When to consider using TopOMetry over UMAP? #8

Open
sgbaird opened this issue Mar 18, 2022 · 2 comments

sgbaird commented Mar 18, 2022

I think some people (myself included) will come across this method being familiar with UMAP and classical dimensionality reduction techniques, but it might not be clear when to use UMAP vs. TopOMetry. Could you comment on this?

Might consider adding to the docs and/or README, but feel free to ignore the suggestion as well.

sgbaird commented Mar 18, 2022

Came across this via Leland's twitter post btw.

davisidarta added the documentation and question labels Mar 18, 2022

davisidarta (Owner) commented

Hi @sgbaird! Thank you for your interest in TopOMetry.

> it might not be clear when to use UMAP vs. TopOMetry

I expected people to read the introduction, specifically pages 6-8 of our preprint. I'm not telling people to 'ditch UMAP'. TopOMetry's assumptions about data structure are looser than UMAP's. We basically assume that the number of neighbors k divided by the total number of samples approaches zero (i.e. the data comprises a set of topological manifolds, so we can do calculus on it). When data topology is highly non-uniform, as in biological data, TopOMetry resolves finer detail, as in the PBMC68K example (Fig. 2 of the manuscript). Even on non-biological data, such as in Natural Language Processing, TopOMetry can separate clusters better and provide denoised affinity matrices for downstream clustering algorithms to run on. An important hint that data may fall outside UMAP's assumptions is when the resulting embeddings are too different from one another.
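
One way to make "too different" concrete is to measure how many of each sample's k nearest neighbors two embeddings share. The following is a minimal sketch, assuming two embeddings of the same samples are already available as lists of coordinates (for instance one from UMAP and one from TopOMetry; the comment above does not say which pair is meant). The helper names `knn_indices` and `neighbor_overlap` are hypothetical and belong to neither library, and the brute-force search is only suitable for small data.

```python
import math

def knn_indices(points, k):
    """Return, for each point, the set of indices of its k nearest neighbors (brute force)."""
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append({j for _, j in dists[:k]})
    return neighbors

def neighbor_overlap(emb_a, emb_b, k):
    """Mean fraction of k nearest neighbors shared by two embeddings of the same samples."""
    if len(emb_a) != len(emb_b):
        raise ValueError("both embeddings must describe the same samples")
    nn_a = knn_indices(emb_a, k)
    nn_b = knn_indices(emb_b, k)
    return sum(len(a & b) / k for a, b in zip(nn_a, nn_b)) / len(emb_a)

# Toy data (made up for illustration): emb_b is a rescaled copy of emb_a,
# while emb_c places the same five samples in a different order along the line.
emb_a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
emb_b = [(2 * x, 2 * y) for x, y in emb_a]
emb_c = [(0.0, 0.0), (4.0, 0.0), (1.0, 0.0), (3.0, 0.0), (2.0, 0.0)]

same = neighbor_overlap(emb_a, emb_b, k=2)       # neighborhoods fully preserved
different = neighbor_overlap(emb_a, emb_c, k=2)  # neighborhoods largely scrambled
```

A score near 1 means the two embeddings agree on local neighborhoods; a low score suggests they disagree about local structure. What threshold counts as "too different" is a judgment call that this sketch does not settle.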

A second point is that TopOMetry is intended to be a comprehensive framework. Separate steps can be pipelined at the user's will (e.g. use only a first diffusion model followed by a specific layout technique, or reuse the same model to repeat any step). I'm not saying the default workflows are necessarily the best, nor that they are the best methods for approximating the Laplace-Beltrami Operator (LBO); they are simply the best currently available that rest on really solid mathematical ground. The idea is that TopOMetry works within a scikit-learn compatible workflow and that users can use its approximate kNN, affinity learning, orthogonal decomposition, and layout optimization modules separately, in any combination they wish. My intent is to let the community contribute thoughts and extensions to this initial work. After all, I have done everything so far by myself.
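
To illustrate what "separate, composable steps" can look like, here is a generic NumPy sketch of the first three stages named above (kNN search, affinity learning, orthogonal decomposition). This is not TopOMetry's API or its algorithms: the function names, the Gaussian kernel with a per-sample bandwidth, and the random data are all assumptions chosen for illustration, and the sketch assumes the kNN graph is connected and contains no duplicate points.

```python
import numpy as np

def knn_distances(X, k):
    """Step 1 (kNN): brute-force indices and distances of each sample's k nearest neighbors."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # exclude self-matches
    idx = np.argsort(d, axis=1)[:, :k]
    return idx, np.take_along_axis(d, idx, axis=1)

def affinity_matrix(idx, dist, n):
    """Step 2 (affinity): symmetric Gaussian kernel with a per-sample bandwidth."""
    sigma = dist[:, -1:]                         # distance to the k-th neighbor
    W = np.zeros((n, n))
    rows = np.repeat(np.arange(n), idx.shape[1])
    W[rows, idx.ravel()] = np.exp(-(dist / sigma) ** 2).ravel()
    return np.maximum(W, W.T)                    # symmetrize

def spectral_components(W, n_components):
    """Step 3 (decomposition): leading non-trivial eigenvectors of D^-1/2 W D^-1/2."""
    deg = W.sum(axis=1)
    S = W / np.sqrt(np.outer(deg, deg))
    vals, vecs = np.linalg.eigh(S)               # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][1:n_components + 1]  # skip the trivial top eigenvector
    return vecs[:, order]

# Hypothetical random data, used only to exercise the three steps.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
idx, dist = knn_distances(X, k=10)
W = affinity_matrix(idx, dist, n=len(X))
Y = spectral_components(W, n_components=2)
```

Because each stage only consumes the previous stage's output, any one of them could be swapped for a different method, and `Y` (or `W`) could then be handed to whichever layout optimization technique is preferred as a fourth step.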

> Might consider adding to the docs and/or README

I am indeed considering it, as this was the first question I received after sharing the manuscript. I will do it this week, along with some new tutorials.

> Came across this via Leland's twitter post btw.

Prof. Leland was very helpful in providing his insights and believing in me in the early stages of this project. I'm thankful he shared this. UMAP is seminal, groundbreaking work, and if I could see a little further it was by standing on the shoulders of giants.

davisidarta self-assigned this Mar 18, 2022
davisidarta added this to the update docs milestone Mar 18, 2022
davisidarta pinned this issue Mar 18, 2022