Landmarked Re-Trainable Parametric UMAP #1153

Merged: 29 commits, Oct 19, 2024
Changes from all commits

29 commits
8e5a0f5
Bug fix - optimizer is now passed to UMAPModel
Sep 23, 2024
6e4f0e1
Initial Retrainable Parametric UMAP.
Sep 23, 2024
64c9e34
landmarks parameter added to ParametricUMAP.__init__()
Sep 23, 2024
e617210
Landmark loss added. Can be included with a kwarg on fit and fit_tran…
Sep 24, 2024
7649987
Docstring templates - TODO
Sep 24, 2024
bdfbc6c
fit() and fit_transform() docstrings
Sep 24, 2024
587ae4f
ran black formatting
Sep 24, 2024
25d4b27
optional landmark loss function
Sep 30, 2024
7432b01
landmark loss weight
Sep 30, 2024
42bb2a4
Random seed passed to keras, bug fixes
Oct 2, 2024
5d1529a
Inference batch size
Oct 3, 2024
a05acf3
landmarks_ -> landmark_
Oct 4, 2024
6422e16
Retrain PUMAP docs
Oct 4, 2024
622f4f4
Parametric UMAP added to doc api page.
Oct 4, 2024
c72bb25
black cleanup
Oct 4, 2024
20864e0
Bug fix - cast landmark_positions to float32
Oct 11, 2024
1be36e2
Debugging - nan loss on retrain. check_array on landmarks
Oct 11, 2024
2c8718d
Debug updates (not working yet)
Oct 11, 2024
271c03f
Landmarks notebook included.
Oct 11, 2024
67b8186
relu loss added, fit_transform -> fit behaviour corrected.
Oct 12, 2024
ad8fe0c
Retrain PUMAP documentation and example notebook
Oct 12, 2024
630c5dd
Merge remote-tracking branch 'upstream/master'
Oct 12, 2024
5151325
Black cleanup
Oct 12, 2024
69e5913
Check for parametric_model existing, not encoder to allow for custom …
Oct 12, 2024
1cbe6ab
PEP 8 fixes
Oct 12, 2024
7abfd5b
Update umap/parametric_umap.py
jacobgolding Oct 17, 2024
803e23e
PR review. landmark_positions and other arbitrary kwargs now accepted…
Oct 17, 2024
45191cb
Update MNIST_Landmarks.ipynb per AMS suggestions
lmcinnes Oct 19, 2024
80d4bae
Editing raw JSON notebooks is hard
lmcinnes Oct 19, 2024
8 changes: 7 additions & 1 deletion doc/api.rst
@@ -1,14 +1,20 @@
UMAP API Guide
==============

UMAP has only a single class :class:`UMAP`.
UMAP has two classes: :class:`UMAP` and :class:`ParametricUMAP`, which inherits from it.

UMAP
----

.. autoclass:: umap.umap_.UMAP
:members:

ParametricUMAP
--------------

.. autoclass:: umap.parametric_umap.ParametricUMAP
:members:

A number of internal functions can also be accessed separately for more fine tuned work.

Useful Functions
4 changes: 2 additions & 2 deletions doc/conf.py
@@ -20,8 +20,8 @@
import os
import sys

sys.path.insert(0, os.path.abspath('.'))
sys.path.insert(0, os.path.abspath('..'))
sys.path.insert(0, os.path.abspath("."))
sys.path.insert(0, os.path.abspath(".."))


# -- General configuration ------------------------------------------------
Binary file added doc/images/retrain_pumap_emb_x1.png
Binary file added doc/images/retrain_pumap_emb_x2.png
Binary file added doc/images/retrain_pumap_history.png
Binary file added doc/images/retrain_pumap_p_emb_x1.png
Binary file added doc/images/retrain_pumap_p_emb_x2.png
Binary file added doc/images/retrain_pumap_summary_2_removed.png
1 change: 1 addition & 0 deletions doc/index.rst
@@ -61,6 +61,7 @@ PyPI install, presuming you have numba and sklearn and all its requirements
transform
inverse_transform
parametric_umap
transform_landmarked_pumap
sparse
supervised
clustering
6 changes: 5 additions & 1 deletion doc/parametric_umap.rst
@@ -91,7 +91,7 @@ This loads both the UMAP object and the parametric networks it contains.

Plotting loss
-------------
Parametric UMAP monitors loss during training using Keras. That loss will be printed after each epoch during training. This loss is saved in :python:`embedder.history`, and can be plotted:
Parametric UMAP monitors loss during training using Keras. That loss will be printed after each epoch during training. This loss is saved in :python:`embedder._history`, and can be plotted:

.. code:: python3

@@ -103,6 +103,8 @@ Parametric UMAP monitors loss during training using Keras. That loss will be pri

.. image:: images/umap-loss.png

Much like other Keras models, if you continue to train your model via its :python:`fit` method, :python:`embedder._history` will be updated with the losses from the additional training epochs.
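
For example (a minimal sketch, assuming a previously fitted ``embedder`` and some additional data ``X_new``, which is a placeholder name), another call to :python:`fit` appends its epoch losses to the same history:

.. code:: python3

import matplotlib.pyplot as plt

# Continue training the same parametric model on additional data;
# the recorded loss history accumulates across fit() calls.
embedder.fit(X_new)

plt.plot(embedder._history['loss'])
plt.ylabel('Loss')
plt.xlabel('Epoch')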

Parametric inverse_transform (reconstruction)
---------------------------------------------
To use a second neural network to learn an inverse mapping between data and embeddings, we simply need to pass `parametric_reconstruction=True` to the ParametricUMAP.
@@ -205,6 +207,8 @@ Additional important parameters
* **optimizer:** The optimizer used to train the neural network. By default, Adam (:python:`tf.keras.optimizers.Adam(1e-3)`) is used. You might be able to speed up or improve training by using a different optimizer.
* **parametric_embedding:** If set to false, a non-parametric embedding is learned, using the same code as the parametric embedding, which can serve as a direct comparison between parametric and non-parametric embedding using the same optimizer. The parametric embeddings are performed over the entire dataset simultaneously.
* **global_correlation_loss_weight:** Whether to additionally train on correlation of global pairwise relationships (multidimensional scaling)
* **landmark_loss_fn:** The loss function to use when re-training on landmarked data, where you have provided a desired location in the embedding space to the :python:`fit` method of the model. By default, Euclidean loss is used. For more information on re-training, landmarks, and why you might use them, see :doc:`transform_landmarked_pumap`.
* **landmark_loss_weight:** How to weight the landmark loss relative to the UMAP loss; 1.0 by default. A minimal usage sketch follows this list.
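
A minimal sketch of how these fit together (``X_initial``, ``X_new``, and ``landmark_positions`` are placeholder names; see :doc:`transform_landmarked_pumap` for a full worked example):

.. code:: python3

from umap import ParametricUMAP

embedder = ParametricUMAP()
embedding = embedder.fit_transform(X_initial)

# Re-train while pinning some points: pass one row of target embedding
# coordinates per training sample, with NaN rows for samples that are
# not landmarks.
embedder.landmark_loss_weight = 0.01
embedder.fit(X_new, landmark_positions=landmark_positions)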

Extending the model
-------------------
5 changes: 5 additions & 0 deletions doc/transform.rst
@@ -13,6 +13,11 @@ the latent space the classifier uses. Fortunately UMAP makes this
possible, albeit more slowly than some other transformers that allow
this.

This tutorial will step through a simple case where we expect the overall
distribution in our higher-dimensional vectors to be consistent between the
training and testing data. For more detail on how this can go wrong, and
how we can fix it using Parametric UMAP, see :doc:`transform_landmarked_pumap`.

To demonstrate this functionality we'll make use of
`scikit-learn <http://scikit-learn.org/stable/index.html>`__ and the
digits dataset contained therein (see :doc:`basic_usage` for an example
227 changes: 227 additions & 0 deletions doc/transform_landmarked_pumap.rst
@@ -0,0 +1,227 @@

Transforming New Data with Parametric UMAP
==========================================

There are many cases where one may want to take an existing UMAP model and use it to embed new data into the learned space. For a simple example where the overall distribution of the higher-dimensional training data matches that of the new data being embedded, see :doc:`transform`. We can't always be sure that this will be the case, however. To simulate a case where we have novel behaviour that we want to include in our embedding space, we will use the MNIST digits dataset (see :doc:`basic_usage` for a basic example).

To follow along with this example, see the MNIST_Landmarks notebook on the `GitHub repository <https://github.com/lmcinnes/umap/tree/master/notebooks/>`_.

.. code:: python3

import keras
from sklearn.model_selection import train_test_split

from umap import UMAP, ParametricUMAP

import matplotlib.pyplot as plt

import numpy as np
import pandas as pd

We'll start by loading the dataset and splitting it into two equal parts with ``sklearn``'s ``train_test_split`` function. This gives us two partitions to work with: one to train our original embedding and another to test it. To simulate new behaviour appearing in our data, we remove one MNIST category, ``N``, from the ``x1`` partition. In this case we'll use ``N=2``, so our model will be trained on all of the digits other than 2.

.. code:: python3

(X, y), (_, _) = keras.datasets.mnist.load_data()
x1, x2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=42)

# Reshape to 1D vectors
x1 = x1.reshape((x1.shape[0], 28*28))
x2 = x2.reshape((x2.shape[0], 28*28))

# Remove one category from the train dataset.
# In the case of MNIST digits, this will be the digit we are removing.
N = 2

x1 = x1[y1 != N]
y1 = y1[y1 != N]

print(x1.shape, x2.shape)

.. parsed-literal::

(26995, 784) (30000, 784)

New data with UMAP
------------------

To start with, we'll identify the issues with using UMAP as-is in this case, and then we'll see how to fix them with Parametric UMAP. First off, we need to train a ``UMAP`` model on our ``x1`` partition:

.. code:: python3

embedder = UMAP()

emb_x1 = embedder.fit_transform(x1)

Visualising our results:

.. code:: python3

plt.scatter(emb_x1[:,0], emb_x1[:,1], c=y1, cmap='Spectral', s=2, alpha=0.2)

.. image:: images/retrain_pumap_emb_x1.png


This is a clean and successful embedding, as we would expect from UMAP on this relatively simple example. We see the normal structure one would expect from embedding MNIST, but without any of the 2s. The ``UMAP`` class is built to be compatible with ``scikit-learn``, so embedding new data is as simple as calling the ``transform`` method on it. We'll pass through ``x2``, which contains unseen examples of the original classes, and also samples from our holdout class, ``N`` (the 2s).

To make samples from ``N`` stand out more, we'll over-plot them in black.

.. code:: python3

emb_x2 = embedder.transform(x2)

.. code:: python3

plt.scatter(emb_x2[:,0], emb_x2[:,1], c=y2, cmap='Spectral', s=2, alpha=0.2)
plt.scatter(emb_x2[y2==N][:,0], emb_x2[y2==N][:,1], c='k', s=2, alpha=0.5)

.. image:: images/retrain_pumap_emb_x2.png

While our ``UMAP`` embedder has correctly handled the classes present in ``x1``, it has treated examples from our holdout class ``N`` poorly. Many of these points are concentrated on top of existing classes, with some spread out between them. This inability to generalize is not unique to UMAP; it is a difficulty with learned embeddings more generally. It also may or may not be an issue, depending on your use case.

New data with Parametric UMAP
-----------------------------

We can improve this outcome with Parametric UMAP. Parametric UMAP differs from UMAP in that it learns the relationship between the data and embedding with a neural network, instead of learning embeddings directly. This means we can incorporate new data by continuing to train the neural network, updating the weights to incorporate our new information.

.. image:: images/pumap-only.png

For more complete information on Parametric UMAP and the many options it provides, see :doc:`parametric_umap`.

We will start addressing this by training a ``ParametricUMAP`` embedding model, and running the same experiment:

.. code:: python3

p_embedder = ParametricUMAP()

p_emb_x1 = p_embedder.fit_transform(x1)

.. code:: python3

plt.scatter(p_emb_x1[:,0], p_emb_x1[:,1], c=y1, cmap='Spectral', s=2, alpha=0.2)

.. image:: images/retrain_pumap_p_emb_x1.png

Again, we get good results on our initial embedding of ``x1``. If we pass ``x2`` through without re-training, we get a similar problem to our ``UMAP`` model:

.. code:: python3

p_emb_x2 = p_embedder.transform(x2)

.. code:: python3

plt.scatter(p_emb_x2[:,0], p_emb_x2[:,1], c=y2, cmap='Spectral', s=2, alpha=0.2)
plt.scatter(p_emb_x2[y2==N][:,0], p_emb_x2[y2==N][:,1], c='k', s=2, alpha=0.5)

.. image:: images/retrain_pumap_p_emb_x2.png

Re-training Parametric UMAP with landmarks
------------------------------------------

To update our embedding to include the new class, we'll fine-tune our existing ``ParametricUMAP`` model. Doing this without any other changes will start from where we left off, but our embedding space's structure may drift and change. This is because the UMAP loss function is invariant to translation and rotation, as it is only concerned with the relative positions and distances between points.

In order to keep our embedding space more consistent, we'll use the landmarks option for ``ParametricUMAP``. We retrain the model on the ``x2`` partition, along with some points chosen as landmarks from ``x1``. We'll include 1% of the samples in ``x1``, along with their current positions in the embedding space, to be used in the landmark loss function.

The default ``landmark_loss_fn`` is the Euclidean distance between the point's original position and its current one. The only change we'll make is to set ``landmark_loss_weight=0.01``.

.. code:: python3

# Select landmark indices from x1.
#
landmark_idx = list(np.random.choice(range(x1.shape[0]), int(x1.shape[0]/100), replace=False))

# Add the landmark points to x2 for training.
#
x2_lmk = np.concatenate((x2, x1[landmark_idx]))
y2_lmk = np.concatenate((y2, y1[landmark_idx]))

# Make our landmarks vector, which is nan where we have no landmark information.
#
landmarks = np.stack(
[np.array([np.nan, np.nan])]*x2.shape[0] + list(
p_embedder.transform(
x1[landmark_idx]
)
)
)

# Set landmark loss weight and continue training our Parametric UMAP model.
#
p_embedder.landmark_loss_weight = 0.01
p_embedder.fit(x2_lmk, landmark_positions=landmarks)
p_emb2_x2 = p_embedder.transform(x2)

# Check how x1 looks when embedded in the space retrained on x2 and landmarks.
#
p_emb2_x1 = p_embedder.transform(x1)


Plotting all of the different embeddings to compare them:

.. code:: python3

fig, axs = plt.subplots(3, 2, figsize=(16, 24), sharex=True, sharey=True)

axs[0,0].scatter(
emb_x1[:, 0], emb_x1[:, 1], c=y1, cmap='Spectral', s=2, alpha=0.2,
)
axs[0,0].set_ylabel('UMAP Embedding', fontsize=20)

axs[0,1].scatter(
emb_x2[:, 0], emb_x2[:, 1], c=y2, cmap='Spectral', s=2, alpha=0.2,
)
axs[0,1].scatter(
emb_x2[y2==N][:,0], emb_x2[y2==N][:,1], c='k', s=2, alpha=0.5,
)

axs[1,0].scatter(
p_emb_x1[:, 0], p_emb_x1[:, 1], c=y1, cmap='Spectral', s=2, alpha=0.2,
)
axs[1,0].set_ylabel('Initial P-UMAP Embedding', fontsize=20)

axs[1,1].scatter(
p_emb_x2[:, 0], p_emb_x2[:, 1], c=y2, cmap='Spectral', s=2, alpha=0.2,
)
axs[1,1].scatter(
p_emb_x2[y2==N][:,0], p_emb_x2[y2==N][:,1], c='k', s=2, alpha=0.5
)

axs[2,0].scatter(
p_emb2_x1[:, 0], p_emb2_x1[:, 1], c=y1, cmap='Spectral', s=2, alpha=0.2,
)
axs[2,0].set_ylabel('Updated P-UMAP Embedding', fontsize=20)
axs[2,0].set_xlabel(f'x1, No {N}s', fontsize=20)

axs[2,1].scatter(
p_emb2_x2[:, 0], p_emb2_x2[:, 1], c=y2, cmap='Spectral', s=2, alpha=0.2,
)
axs[2,1].scatter(
p_emb2_x2[y2==N][:,0], p_emb2_x2[y2==N][:,1], c='k', s=2, alpha=0.5,
)
axs[2,1].set_xlabel('x2, All Classes', fontsize=20)

plt.tight_layout()

.. image:: images/retrain_pumap_summary_2_removed.png

Here we see that our approach has been successful: the embedding space has been kept consistent, and we now have a clear cluster of our new class, the 2s. This new cluster shows up in a sensible part of the embedding space, and the rest of the structure is preserved.

It is worth double-checking here that the landmark loss is not too constraining; we still want a good UMAP structure.
To do so, we can interrogate the history of our embedder, which is retained through our re-training steps.

.. code:: python3

plt.plot(p_embedder._history['loss'])
plt.ylabel('Loss')
plt.xlabel('Epoch')

.. image:: images/retrain_pumap_history.png

We can identify the spike in loss where we introduce ``x2``, and confirm that the resulting loss is comparable to the loss from our initial training on ``x1``. This tells us that the model is not having to compromise too much between the UMAP loss and the landmark loss. If that were not the case, it could potentially be improved by lowering the ``landmark_loss_weight`` attribute of our embedder object. There is a tradeoff here between keeping the space consistent and minimizing the UMAP loss, but the key is that the embedding space varies smoothly, which makes downstream tasks easier to adjust. In this case, we could probably stand to increase the ``landmark_loss_weight`` to keep the space even more consistent.
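
Adjusting the weight is just a matter of setting the attribute and re-training (the value here is illustrative; to redo the full comparison you would start again from the model trained on ``x1``):

.. code:: python3

# Continue training with a stronger landmark term.
p_embedder.landmark_loss_weight = 0.1
p_embedder.fit(x2_lmk, landmark_positions=landmarks)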

In addition to ``landmark_loss_weight``, there are a number of other options available for getting better results on this or other examples:

- Continuing the training with a larger portion of points from the original data, in our case ``x1``. Not all of these points need to be landmarked, but they can contribute to a consistent graph structure in higher dimensions.
- Changing the ``landmark_loss_fn``. For example, if we want to allow points to move when they have to, we could truncate the default Euclidean loss function, letting the metaphorical rubber band snap at a certain distance and prioritising a good UMAP structure once we discover that sticking to the landmark position is not correct. A sketch of this idea follows the list.
- Being more intelligent with our selection of landmark points, for example using submodular optimization with a package like `apricot-select <https://apricot-select.readthedocs.io/en/latest/>`__ or choosing points from different parts of a hierarchical clustering like `HDBSCAN <https://hdbscan.readthedocs.io/en/latest/index.html>`__.
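
As an illustration of the second option, here is a sketch of a truncated landmark loss. The signature is an assumption (a function of the stored landmark positions and the corresponding current embedding positions, returning a per-point penalty, written with Keras 3 ops); check the ``ParametricUMAP`` docstring for the exact interface expected by ``landmark_loss_fn`` and for how the ``np.nan`` rows marking non-landmark points are handled.

.. code:: python3

import keras

def truncated_landmark_loss(landmark_positions, embedding, max_distance=2.0):
    # Euclidean distance between each landmark's stored position and where
    # the network currently places it.
    distances = keras.ops.norm(embedding - landmark_positions, axis=1)
    # Cap the penalty: beyond max_distance the metaphorical rubber band
    # snaps and the point is free to go where the UMAP loss prefers.
    return keras.ops.minimum(distances, max_distance)

p_embedder = ParametricUMAP(landmark_loss_fn=truncated_landmark_loss)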

6 changes: 3 additions & 3 deletions examples/mnist_torus_sphere_example.py
@@ -50,7 +50,7 @@ def torus_euclidean_grad(x, y, torus_dimensions=(2 * np.pi, 2 * np.pi)):
for i in range(x.shape[0]):
a = abs(x[i] - y[i])
if 2 * a < torus_dimensions[i]:
distance_sqr += a ** 2
distance_sqr += a**2
g[i] = x[i] - y[i]
else:
distance_sqr += (torus_dimensions[i] - a) ** 2
@@ -74,7 +74,7 @@ def torus_euclidean_grad(x, y, torus_dimensions=(2 * np.pi, 2 * np.pi)):
# Plot a torus
R = 2
r = 1
values = (R - np.sqrt(x ** 2 + y ** 2)) ** 2 + z ** 2 - r ** 2
values = (R - np.sqrt(x**2 + y**2)) ** 2 + z**2 - r**2
mlab.contour3d(x, y, z, values, color=(1.0, 1.0, 1.0), contours=[0])

# torus angles -> 3D
@@ -105,7 +105,7 @@ def torus_euclidean_grad(x, y, torus_dimensions=(2 * np.pi, 2 * np.pi)):

# Plot a sphere
r = 3
values = x ** 2 + y ** 2 + z ** 2 - r ** 2
values = x**2 + y**2 + z**2 - r**2
mlab.contour3d(x, y, z, values, color=(1.0, 1.0, 1.0), contours=[0])

# latitude, longitude -> 3D
1 change: 1 addition & 0 deletions examples/plot_algorithm_comparison.py
@@ -43,6 +43,7 @@
the equator and black to white from the south
to north pole.
"""

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
1 change: 1 addition & 0 deletions examples/plot_fashion-mnist_example.py
@@ -11,6 +11,7 @@
(as shown in this example), or by continuous variables,
or by density (as is common in datashader examples).
"""

import umap
import numpy as np
import pandas as pd
5 changes: 3 additions & 2 deletions examples/plot_feature_extraction_classification.py
@@ -20,6 +20,7 @@
used as a feature extraction technique. This small change results in a
substantial improvement compared to the model where raw data is used.
"""

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
@@ -45,7 +46,7 @@

# Classification with a linear SVM
svc = LinearSVC(dual=False, random_state=123)
params_grid = {"C": [10 ** k for k in range(-3, 4)]}
params_grid = {"C": [10**k for k in range(-3, 4)]}
clf = GridSearchCV(svc, params_grid)
clf.fit(X_train, y_train)
print(
@@ -58,7 +59,7 @@
params_grid_pipeline = {
"umap__n_neighbors": [5, 20],
"umap__n_components": [15, 25, 50],
"svc__C": [10 ** k for k in range(-3, 4)],
"svc__C": [10**k for k in range(-3, 4)],
}


1 change: 1 addition & 0 deletions examples/plot_mnist_example.py
@@ -13,6 +13,7 @@
0, and grouping triplets of 3,5,8 and 4,7,9 which can
blend into one another in some cases.
"""

import umap
from sklearn.datasets import fetch_openml
import matplotlib.pyplot as plt