
Landmarked Re-Trainable Parametric UMAP #1153

Merged · 29 commits · Oct 19, 2024

Conversation

@jacobgolding (Contributor)

This pull request adds major functionality and documentation for re-training Parametric UMAP models, using landmark re-training to keep the embedding space consistent. This allows the embedding space to be updated smoothly, so that new data can be transformed without relying on good generalization from the original model. In addition to the library changes, I've added a new documentation page and notebook motivating this update and explaining how to use it (a usage sketch follows the list of changes below).

Changes include:

  • Allowing ParametricUMAP models to be re-trained by checking in _fit_embed_data() if an encoder is already present before making a new one.
  • Adding a _landmark_loss function to the UMAPModel class, along with landmark_loss_fn and landmark_loss_weight to both the UMAPModel and ParametricUMAP classes.
  • Editing construct_edge_dataset() to pass on landmark_positions as provided to .fit() or .fit_transform().
  • Fixing a bug where the user-defined optimizer was not actually passed to UMAPModel.
  • Making changes to properly add continued training to the ._history, in line with other keras model behavior.
  • Adding the ParametricUMAP class to the UMAP API Guide page of the documentation, since it now carries enough additional documentation that it should be readily available there.
  • Updates to the existing Parametric UMAP documentation page to mention re-training, and include new parameters for the .fit() method.
  • A small update to the existing "Transforming New Data With UMAP" page to indicate potential downsides of not re-training the model, and to link to the new documentation page.
  • A new documentation page, "Transforming New Data with Parametric UMAP" which motivates the problem and provides examples on how to use the new feature.
  • A new example notebook, MNIST_Landmarks.ipynb, which mirrors the documentation page.
  • Adding the load_ParametricUMAP function to __init__.py so that it can be imported from the library.
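
To make the intended workflow concrete, here is a minimal usage sketch. The parameter names (landmark_loss_weight, landmark_positions) are the ones listed above; the array layout for landmark_positions, including NaN rows for points that should move freely, is an assumed convention for illustration rather than something confirmed by the diff:

```python
import numpy as np
from umap.parametric_umap import ParametricUMAP

# X_original and X_new are placeholder arrays of shape (n_samples, n_features).
embedder = ParametricUMAP(landmark_loss_weight=0.01)  # weight value is illustrative
embedding = embedder.fit_transform(X_original)

# Re-train on old + new data, pinning the original points near their
# previous positions. One row per sample in the combined data; NaN rows
# mark points that are free to move (an assumed convention, see above).
X_combined = np.vstack([X_original, X_new])
landmark_positions = np.vstack(
    [embedding, np.full((len(X_new), embedding.shape[1]), np.nan)]
)
new_embedding = embedder.fit_transform(
    X_combined, landmark_positions=landmark_positions
)
```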

Questions for review:

  • Should any specific tests be written for ParametricUMAP that involve re-training?
  • Are the changes to the existing "Transforming New Data With UMAP" page sufficient? Should they be more comprehensive?
  • Running black over the repo as suggested in the contributing documentation has resulted in a number of changes across extra files. Should I remove these?
  • The landmark positions are somewhat awkwardly stored on the ParametricUMAP object itself so that they can be passed to construct_edge_dataset(); this was done to avoid changing the UMAP class code. A cleaner approach might let the .fit() and .fit_transform() methods on UMAP accept arbitrary kwargs and pass them through to self._fit_embed_data(). That would also remove the awkward check in .fit_transform() for whether the landmarks have already been set by the time .fit() runs (a rough sketch of this idea follows the list). If that would be preferable I can have a crack at making a cleaner version.
  • I am somewhat unhappy with the way training epochs are handled by ParametricUMAP at this stage. This might be a separate issue, but IMO it's magnified by the ability to re-train. Is there a change in nomenclature or notes in the documentation that could improve this?
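
For the fourth question above, the kwargs pass-through could look roughly like the following. This is a hypothetical simplification of the real method signatures, just to show the shape of the change:

```python
class UMAP:
    def fit(self, X, y=None, **kwargs):
        # ... nearest-neighbour search and graph construction as before ...
        self.embedding_ = self._fit_embed_data(X, self.n_epochs, **kwargs)
        return self

    def _fit_embed_data(self, X, n_epochs, **kwargs):
        # Base implementation ignores any extras.
        ...

class ParametricUMAP(UMAP):
    def _fit_embed_data(self, X, n_epochs, landmark_positions=None, **kwargs):
        # landmark_positions arrives directly, so there is no need to
        # stash it on self before calling fit(), and no need for
        # fit_transform() to check whether it was already set.
        ...
```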

@pep8speaks commented Oct 12, 2024

Hello @jacobgolding! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 17:89: E501 line too long (89 > 88 characters)

Line 150:23: E231 missing whitespace after ','

Line 739:53: E203 whitespace before ':'
Line 742:45: E203 whitespace before ':'
Line 1953:89: E501 line too long (115 > 88 characters)

Comment last updated at 2024-10-19 13:22:48 UTC

@lmcinnes (Owner)

This looks super-exciting. I'm not going to have time to look through it until later next week, but please don't mistake my slowness for a lack of enthusiasm.

@timsainb (Collaborator)

This looks awesome! I am on my phone at the moment and can't read through it but here is some quick feedback.

I wonder if it's possible to merge this with the PR below (which I should have integrated months ago, sorry). That PR was based on fchollet's update to support the new Keras; since his update didn't fully support PyTorch, the PR was written to add that support.

#1123

I agree about the awkward method of handling epochs and would be interested in any ideas about how to do it better (we also talked about this in the fchollet PR). The reason epochs are weird is that each edge is sampled a different number of times per epoch (up to 200, I think) based on the edge weight. One thought is that we could create an iterator that samples from edges with a probability, rather than a fixed number of times per epoch; a rough sketch of that idea follows.
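
A rough sketch of that sampler, assuming the graph's edge endpoints and weights are available as NumPy arrays (names here are placeholders, not the library's internals):

```python
import numpy as np

def weighted_edge_batches(heads, tails, weights, batch_size, rng=None):
    """Yield (head, tail) index batches, sampling each edge with
    probability proportional to its weight instead of repeating it a
    fixed number of times per epoch."""
    rng = rng or np.random.default_rng()
    probs = weights / weights.sum()
    while True:
        idx = rng.choice(len(heads), size=batch_size, p=probs)
        yield heads[idx], tails[idx]
```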

@lmcinnes (Owner) left a comment

The black changes are all fine. It's probably just a case of accumulated instances of black not getting run on files that got committed. I reviewed them all, and all is well, so we may as well leave them in here and get that all cleaned up.

I left some minor comments in docs. The rest looks good to me, especially since we had discussed the landmark loss setup already.

I do think it makes sense to have arbitrary keywords pass through on _fit_embed_data, especially if it cleans things up from your perspective. It is something that would also allow future extensions more easily, and costs very little in terms of complexity/changes now.

As for tests... in an ideal world we would have some. At least a test ensuring that (re-)training with landmark loss actually runs. Ensuring that it "does what we expect" is harder. Maybe a very simple case with, say, pendigits and a leave-one-out class, simply verifying that the held-out class largely forms a cluster by itself (a rough shape for such a test follows)? I don't think that kind of test is necessary for this PR right now.
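
Something like the following, using sklearn's digits as a stand-in for pendigits and the same NaN-rows-are-free convention assumed in the usage sketch earlier; a real cluster-purity check would still need choosing:

```python
import numpy as np
from sklearn.datasets import load_digits  # stand-in for pendigits
from umap.parametric_umap import ParametricUMAP

def test_landmark_retraining_runs():
    digits = load_digits()
    held_out = digits.target == 9
    X_base, X_new = digits.data[~held_out], digits.data[held_out]

    embedder = ParametricUMAP(landmark_loss_weight=0.01)
    base_embedding = embedder.fit_transform(X_base)

    # Pin the originally seen points; leave the held-out class free.
    landmarks = np.vstack(
        [base_embedding, np.full((len(X_new), 2), np.nan)]
    )
    retrained = embedder.fit_transform(
        np.vstack([X_base, X_new]), landmark_positions=landmarks
    )
    # Smoke check only; "forms a cluster by itself" would need a real
    # cluster measure on retrained[len(X_base):].
    assert retrained.shape == (len(digits.data), 2)
```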

As for getting Tim's pending changes and yours synced up... is there any chance we can merge this now, and get your changes merged over that later Tim? I think there is also some other work on dataset handling that some other people are working on.

Review comments on doc/transform_landmarked_pumap.rst and umap/parametric_umap.py were resolved.

Commit: typo in doc string.
Co-authored-by: Leland McInnes <[email protected]>
@jacobgolding (Contributor, Author)

Thanks Leland and Tim,

Glad to hear it looks ok, and good to confirm that the black changes are good to go. I'm not fussed about whether this or the Keras-backend PR is implemented first. Some changes will have to be made in construct_edge_dataset with the new dataset generators, which I'm happy to implement or help with.

I'm happy to write up a small test at some point in the future, I've got some pretty lightweight ones looking at classification accuracy on the pre/post retrained embeddings. I guess it depends on whether it's cheaper to do a classifier or clustering in the testing pipeline.

I had a go at the arbitrary kwarg passing to _fit_embed_data and it's cleaned up the ParametricUMAP fit and fit_transform functions a fair amount, so I'm happier with this version.

Re: training epochs, the only concrete suggestion I have is to look at renaming some of the attributes. When I see self.n_training_epochs = 1 in the code and Keras then shows me 10 training epochs, it's a little unintuitive, and it takes some looking around to work out what's happening. It's compounded by each sample in Keras' progress reports being an edge, not a data sample passed to fit, so neither the number of samples nor the number of epochs match a naive guess at what they should be (the sketch below illustrates the accounting). I do think the current implementation is one of the cleaner options though, and maybe the fix is more a documentation one; even a consolidated comment somewhere might help.
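
For anyone hitting the same confusion, a back-of-the-envelope illustration of the accounting. The repeat rule below (edges repeated roughly in proportion to weight, capped around 200 as Tim mentioned) is an approximation of the real behaviour, not the exact formula:

```python
import numpy as np

# embedder is a fitted ParametricUMAP instance (placeholder); its
# .graph_ attribute is a scipy sparse matrix whose .data holds the
# edge weights.
weights = embedder.graph_.data
max_repeats = 200  # cap mentioned above; approximate

# Each edge becomes a Keras training sample, repeated according to its
# weight, so one pass over the edge dataset covers the data many times.
repeats = np.clip(np.round(weights / weights.max() * max_repeats), 1, None)
samples_per_keras_epoch = int(repeats.sum())
print(samples_per_keras_epoch)  # far larger than the number of data points
```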

@lmcinnes (Owner)

LGTM.

Do you have a strong need to get your changes merged in first Tim, or can we merge this now and deal with the dataset changes later once you have the time to work through all the details?

@AMS-Hippo (Contributor)

Jacob, this looks great!

Very minor comments on the tutorial:

  • We ran the tutorial notebook on a "clean" installation, and it looks good.
  • The tutorial currently imports pandas. However, pandas doesn't seem to get used; the notebook ran without it. Since UMAP doesn't require pandas, I'd suggest removing it.
  • This last is me relaying somebody else's comments: I think the "standard" in tutorials is (i) not to use "!pip install", and (ii) to add a comment in the readme about any modules required by the tutorial but not by the main module. In this case, the latter is just matplotlib and (if kept) pandas.

@timsainb (Collaborator)

> Do you have a strong need to get your changes merged in first Tim, or can we merge this now and deal with the dataset changes later once you have the time to work through all the details?

Looks good to me, I can merge later!

@lmcinnes merged commit f123b91 into lmcinnes:master on Oct 19, 2024. 13 checks passed.