
Landmarked Re-Trainable Parametric UMAP #1153

Merged · 29 commits · Oct 19, 2024

Conversation

@jacobgolding (Contributor)

This pull request adds major functionality and documentation for re-training Parametric UMAP models, using landmark re-training to keep the embedding space consistent. This allows the embedding space to be updated smoothly, so that new data can be transformed without relying on good generalization from the original model. In addition to the library changes, I've added a new documentation page and notebook motivating this update and explaining how to use it (a usage sketch follows the list of changes below).

Changes include:

  • Allowing ParametricUMAP models to be re-trained by checking in _fit_embed_data() if an encoder is already present before making a new one.
  • Adding a _landmark_loss function to the UMAPModel class, along with landmark_loss_fn and landmark_loss_weight to both the UMAPModel and ParametricUMAP classes.
  • Editing construct_edge_dataset() to pass on landmark_positions as provided to .fit() or .fit_transform().
  • Fixing a bug where the user-defined optimizer was not actually passed to UMAPModel.
  • Making changes to properly add continued training to the ._history, in line with other keras model behavior.
  • Adding the ParametricUMAP class to the UMAP API Guide page of the documentation, since it now carries enough additional documentation that it should be readily available there.
  • Updates to the existing Parametric UMAP documentation page to mention re-training, and include new parameters for the .fit() method.
  • A small update to the existing "Transforming New Data With UMAP" page to indicate potential downsides of not re-training the model, and to link to the new documentation page.
  • A new documentation page, "Transforming New Data with Parametric UMAP" which motivates the problem and provides examples on how to use the new feature.
  • A new example notebook, MNIST_Landmarks.ipynb, which mirrors the documentation page.
  • Adding the load_ParametricUMAP function to __init__.py so that it can be imported from the library.
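
To make the intended workflow concrete, here is a minimal usage sketch. The parameter names (landmark_loss_weight, landmark_positions) are the ones listed above; the array layout for landmark_positions, including NaN rows for points that should move freely, is an assumed convention for illustration rather than something confirmed by the diff:

```python
import numpy as np
from umap.parametric_umap import ParametricUMAP

# X_original and X_new are placeholder arrays of shape (n_samples, n_features).
embedder = ParametricUMAP(landmark_loss_weight=0.01)  # weight value is illustrative
embedding = embedder.fit_transform(X_original)

# Re-train on old + new data, pinning the original points near their
# previous positions. One row per sample in the combined data; NaN rows
# mark points that are free to move (an assumed convention, see above).
X_combined = np.vstack([X_original, X_new])
landmark_positions = np.vstack(
    [embedding, np.full((len(X_new), embedding.shape[1]), np.nan)]
)
new_embedding = embedder.fit_transform(
    X_combined, landmark_positions=landmark_positions
)
```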

Questions for review:

  • Should any specific tests be written for ParametricUMAP that involve re-training?
  • Are the changes to the existing "Transforming New Data With UMAP" page sufficient? Should they be more comprehensive?
  • Running black over the repo as suggested in the contributing documentation has resulted in a number of changes across extra files. Should I remove these?
  • The landmark positions are somewhat awkwardly stored on the ParametricUMAP object itself so that they can be passed to construct_edge_dataset(); this was done to avoid changing the UMAP class code. A cleaner approach might let the .fit() and .fit_transform() methods on UMAP accept arbitrary kwargs and pass them through to self._fit_embed_data(). That would also remove the awkward check in .fit_transform() for whether the landmarks have already been set by the time .fit() runs (a rough sketch of this idea follows the list). If that would be preferable I can have a crack at making a cleaner version.
  • I am somewhat unhappy with the way training epochs are handled by ParametricUMAP at this stage. This might be a separate issue, but IMO it's magnified by the ability to re-train. Is there a change in nomenclature or notes in the documentation that could improve this?
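
For the fourth question above, the kwargs pass-through could look roughly like the following. This is a hypothetical simplification of the real method signatures, just to show the shape of the change:

```python
class UMAP:
    def fit(self, X, y=None, **kwargs):
        # ... nearest-neighbour search and graph construction as before ...
        self.embedding_ = self._fit_embed_data(X, self.n_epochs, **kwargs)
        return self

    def _fit_embed_data(self, X, n_epochs, **kwargs):
        # Base implementation ignores any extras.
        ...

class ParametricUMAP(UMAP):
    def _fit_embed_data(self, X, n_epochs, landmark_positions=None, **kwargs):
        # landmark_positions arrives directly, so there is no need to
        # stash it on self before calling fit(), and no need for
        # fit_transform() to check whether it was already set.
        ...
```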

@pep8speaks commented Oct 12, 2024

Hello @jacobgolding! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 17:89: E501 line too long (89 > 88 characters)

Line 150:23: E231 missing whitespace after ','

Line 739:53: E203 whitespace before ':'
Line 742:45: E203 whitespace before ':'
Line 1953:89: E501 line too long (115 > 88 characters)

Comment last updated at 2024-10-19 13:22:48 UTC

@lmcinnes (Owner)

This looks super-exciting. I'm not going to have time to look through it until later next week, but please don't mistake my slowness for a lack of enthusiasm.

@timsainb (Collaborator)

This looks awesome! I am on my phone at the moment and can't read through it but here is some quick feedback.

I wonder if it's possible to merge this with the PR below (which I should have integrated months ago, sorry). That PR was based on fchollet's update to support the new Keras; since his update didn't fully support PyTorch, the PR was written to add that support.

#1123

I agree about the awkward method of handling epochs and would be interested in any ideas about how to do it better (we also talked about this in the fchollet PR). The reason epochs are weird is that each edge is sampled a different number of times per epoch (up to 200, I think) based on the edge weight. One thought is that we could create an iterator that samples from edges with a probability, rather than a fixed number of times per epoch; a rough sketch of that idea follows.
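
A rough sketch of that sampler, assuming the graph's edge endpoints and weights are available as NumPy arrays (names here are placeholders, not the library's internals):

```python
import numpy as np

def weighted_edge_batches(heads, tails, weights, batch_size, rng=None):
    """Yield (head, tail) index batches, sampling each edge with
    probability proportional to its weight instead of repeating it a
    fixed number of times per epoch."""
    rng = rng or np.random.default_rng()
    probs = weights / weights.sum()
    while True:
        idx = rng.choice(len(heads), size=batch_size, p=probs)
        yield heads[idx], tails[idx]
```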

@lmcinnes (Owner) left a comment

The black changes are all fine. It's probably just a case of accumulated instances of black not getting run on files that got committed. I reviewed them all, and all is well, so we may as well leave them in here and get that all cleaned up.

I left some minor comments in docs. The rest looks good to me, especially since we had discussed the landmark loss setup already.

I do think it makes sense to have arbitrary keywords pass through on _fit_embed_data, especially if it cleans things up from your perspective. It is something that would also allow future extensions more easily, and costs very little in terms of complexity/changes now.

As for tests... in an ideal world we would have some. At least a test ensuring that (re-)training with landmark loss actually runs. Ensuring that it "does what we expect" is harder. Maybe a very simple case with, say, pendigits and a leave-one-out class, simply verifying that the held-out class largely forms a cluster by itself (a rough shape for such a test follows)? I don't think that kind of test is necessary for this PR right now.
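
Something like the following, using sklearn's digits as a stand-in for pendigits and the same NaN-rows-are-free convention assumed in the usage sketch earlier; a real cluster-purity check would still need choosing:

```python
import numpy as np
from sklearn.datasets import load_digits  # stand-in for pendigits
from umap.parametric_umap import ParametricUMAP

def test_landmark_retraining_runs():
    digits = load_digits()
    held_out = digits.target == 9
    X_base, X_new = digits.data[~held_out], digits.data[held_out]

    embedder = ParametricUMAP(landmark_loss_weight=0.01)
    base_embedding = embedder.fit_transform(X_base)

    # Pin the originally seen points; leave the held-out class free.
    landmarks = np.vstack(
        [base_embedding, np.full((len(X_new), 2), np.nan)]
    )
    retrained = embedder.fit_transform(
        np.vstack([X_base, X_new]), landmark_positions=landmarks
    )
    # Smoke check only; "forms a cluster by itself" would need a real
    # cluster measure on retrained[len(X_base):].
    assert retrained.shape == (len(digits.data), 2)
```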

As for getting Tim's pending changes and yours synced up... is there any chance we can merge this now, and get your changes merged over that later Tim? I think there is also some other work on dataset handling that some other people are working on.

Review comments on doc/transform_landmarked_pumap.rst and umap/parametric_umap.py were resolved.

Commit: typo in doc string.
Co-authored-by: Leland McInnes <[email protected]>
@jacobgolding (Contributor, Author)

Thanks Leland and Tim,

Glad to hear it looks ok, and good to confirm that the black changes are good to go. I'm not fussed about whether this or the Keras-backend PR is implemented first. Some changes will have to be made in construct_edge_dataset with the new dataset generators, which I'm happy to implement or help with.

I'm happy to write up a small test at some point in the future, I've got some pretty lightweight ones looking at classification accuracy on the pre/post retrained embeddings. I guess it depends on whether it's cheaper to do a classifier or clustering in the testing pipeline.

I had a go at the arbitrary kwarg passing to _fit_embed_data and it's cleaned up the ParametricUMAP fit and fit_transform functions a fair amount, so I'm happier with this version.

Re: training epochs, the only concrete suggestion I have is to look at renaming some of the attributes. When I see self.n_training_epochs = 1 in the code and Keras then shows me 10 training epochs, it's a little unintuitive, and it takes some looking around to work out what's happening. It's compounded by each sample in Keras' progress reports being an edge, not a data sample passed to fit, so neither the number of samples nor the number of epochs match a naive guess at what they should be (the sketch below illustrates the accounting). I do think the current implementation is one of the cleaner options though, and maybe the fix is more a documentation one; even a consolidated comment somewhere might help.
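
For anyone hitting the same confusion, a back-of-the-envelope illustration of the accounting. The repeat rule below (edges repeated roughly in proportion to weight, capped around 200 as Tim mentioned) is an approximation of the real behaviour, not the exact formula:

```python
import numpy as np

# embedder is a fitted ParametricUMAP instance (placeholder); its
# .graph_ attribute is a scipy sparse matrix whose .data holds the
# edge weights.
weights = embedder.graph_.data
max_repeats = 200  # cap mentioned above; approximate

# Each edge becomes a Keras training sample, repeated according to its
# weight, so one pass over the edge dataset covers the data many times.
repeats = np.clip(np.round(weights / weights.max() * max_repeats), 1, None)
samples_per_keras_epoch = int(repeats.sum())
print(samples_per_keras_epoch)  # far larger than the number of data points
```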

@lmcinnes (Owner)

LGTM.

Do you have a strong need to get your changes merged in first Tim, or can we merge this now and deal with the dataset changes later once you have the time to work through all the details?

@AMS-Hippo (Contributor)

Jacob, this looks great!

Very minor comments on the tutorial:

  • We ran the tutorial notebook on a "clean" installation, and it looks good.
  • The tutorial currently imports pandas. However, pandas doesn't seem to get used; the notebook ran without it. Since UMAP doesn't require pandas, I'd suggest removing it.
  • This last is me relaying somebody else's comments: I think the "standard" in tutorials is (i) not to use "!pip install", and (ii) to add a comment in the readme about any modules required by the tutorial but not by the main module. In this case, the latter is just matplotlib and (if kept) pandas.

@timsainb (Collaborator)

> Do you have a strong need to get your changes merged in first Tim, or can we merge this now and deal with the dataset changes later once you have the time to work through all the details?

Looks good to me, I can merge later!

@lmcinnes merged commit f123b91 into lmcinnes:master on Oct 19, 2024. 13 checks passed.