
Archipelago - dict transformer for vectorizing persistence diagrams #1017

Closed
wants to merge 29 commits

Conversation

martinroyer (Collaborator):

Transformer that dictionary-wraps persistence diagram vectorizers, i.e. objects from gudhi.representations.vector_methods. One provides persistence diagram vectorizers (by way of either island or island_dict), and the Archipelago object will fit on and transform (= vectorize) lists or series of persistence diagrams (in pandas format). The object is sklearn-API consistent.

So the difference between Archipelago and the methods from gudhi.representations.vector_methods is that Archipelago operates on "whole" persistence diagrams, e.g. [(0, (0.0, 2.34)), (0, (0.0, 0.956)), (1, (0.536, 0.856)), (2, (1.202, 1.734))], whereas the methods from vector_methods tend to operate on a single dimension, e.g. lists of numpy arrays in $R^2$.
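To make the format gap concrete, here is a minimal sketch (split_by_dimension is an illustrative helper, not part of this PR) that breaks an old-format diagram into the per-dimension numpy arrays that vector_methods objects consume:

```python
import numpy as np
from collections import defaultdict

def split_by_dimension(diagram):
    """Group an old-format diagram into {dim: (n, 2) array of (birth, death)}."""
    by_dim = defaultdict(list)
    for dim, (birth, death) in diagram:
        by_dim[dim].append((birth, death))
    return {dim: np.asarray(points) for dim, points in by_dim.items()}

pdiagram1 = [(0, (0.0, 2.34)), (0, (0.0, 0.956)), (1, (0.536, 0.856)), (2, (1.202, 1.734))]
parts = split_by_dimension(pdiagram1)
# parts[0] has shape (2, 2); parts[1] and parts[2] each have shape (1, 2)
```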

So far it feels nice to have something like Archipelago(island_dict={2: BettiCurve(resolution=4), 0: TopologicalVector(threshold=3)}) work, and it also appears to work on both lists of diagrams and pandas.Series of diagrams.

This PR also contains minor edits to Atol, like a default quantiser.

@mglisse (Member) left a comment:

src/python/gudhi/representations/vector_methods.py
diagrams according to homology_dimensions.

Examples
>>> pdiagram1 = [(0, (0.0, 2.34)), (0, (0.0, 0.956)), (1, (0.536, 0.856)), (2, (1.202, 1.734))]
@mglisse (Member):

I am a bit reluctant to add more code that uses the old format for persistence diagrams, when we are trying to replace it with something more convenient (see #925, https://github.com/GUDHI/gudhi-devel/blob/master/src/python/gudhi/sklearn/cubical_persistence.py or #691). Of course we could change archipelago later to accept the new format, but it may become complicated if we try to be compatible with both.

@martinroyer (Collaborator, author):

I'm not sure what the new format is, but it seems it will be better optimized for Archipelago, so all the better. Do you mean you want me to wait for those PRs to come online before processing this one?

@mglisse (Member):

> I'm not sure what the new format is but it seems it will be better optimized for Archipelago, so all the better.

It is the one we discussed before the holidays. Instead of a list of (dim,(birth,death)), if the user asked for homology in dimensions [i,j,k], they get [dgm_i, dgm_j, dgm_k] where each dgm is a numpy array of shape (*,2).

It is better in that you already have numpy arrays for each dimension. However, because it is a list and not a dict, you do not have access to the labels [i,j,k]; the user has to provide them separately if you need them for something. It also makes it ambiguous, when we refer to the 0-th diagram, whether it corresponds to homology in dimension 0 or in dimension i (the first of the list).
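A short sketch of the format described above (the values are illustrative, reusing the example diagram from the PR description):

```python
import numpy as np

# The user asked for homology in dimensions [i, j, k]; here [0, 1, 2].
requested_dims = [0, 1, 2]
# New format: one (*, 2) numpy array per requested dimension, in order.
diagram = [
    np.array([[0.0, 2.34], [0.0, 0.956]]),  # dgm for the 1st requested dim
    np.array([[0.536, 0.856]]),             # dgm for the 2nd requested dim
    np.array([[1.202, 1.734]]),             # dgm for the 3rd requested dim
]
# The list carries no labels; to recover them, pair it with requested_dims:
labelled = dict(zip(requested_dims, diagram))
```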

> Do you mean you want me to wait for those PRs to come online before processing this one?

I don't know what I want 😉

Comment on lines 19 to 20
|fit on| and |transform = vectorize| list or series of persistence diagrams (in pandas format).
The object is sklearn-API consistent.
@mglisse (Member):

Isn't the pandas format controlled by transform_output in scikit-learn?

@martinroyer (Collaborator, author):

It is, but I have not implemented that feature: at the moment I force the output into pandas format. I will look into it, and if it is easy I will implement it.

@martinroyer (Collaborator, author):

Actually it is worked out within sklearn's TransformerMixin, so I hooked it up now (not sure if very safe though?) and it nicely gets rid of the annoying pandas dependency.

@mglisse (Member):

> (not sure if very safe though?)

Do you mean "safe" in terms of the version of sklearn that is required, or something else?
(the doc may need a small update to reflect this change)

@martinroyer (Collaborator, author):

Since the number (and names) of features out of Archipelago can only be known at transform run time, I change a class attribute accordingly, and I am not sure that is very safe...

@mglisse (Member):

> Since the number (and names) of features out of Archipelago can only be known at transform run time,

I see. It could be computed at fit time, which would better match the sklearn philosophy, but that wouldn't be convenient. I have no idea how bad it is, if it can interact strangely with cloning or something. For now, I don't see any obvious problem...

continue
this_dim_list_pdiags = [pdiags[dimension] for pdiags in by_dim_list_pdiags]
vectorized_dgms = self.archipelago_[dimension].transform(this_dim_list_pdiags)
running_transform_names += [f"{dimension} Center {i + 1}" for i in range(vectorized_dgms.shape[1])]
@mglisse (Member):

Maybe "feature" is more generic than "center" if the island is not necessarily atol. Although the name archipelago may look strange if we don't use atol at all.


class Archipelago(BaseEstimator, TransformerMixin):
"""
Transformer that dictionary-wraps persistence diagram vectorizers, i.e. objects from gudhi.representations.vector_methods.
@mglisse (Member):

(this may be more natural with the new format)
Maybe we should say that this class is helpful to avoid going through DimensionSelector + FeatureUnion or similar. It could even be nice to have an example somewhere comparing the two (we can add it later, though).
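For comparison, a hedged sketch of the FeatureUnion route mentioned above, built from plain scikit-learn pieces: a FunctionTransformer stands in for a dimension selector, a toy total-persistence function stands in for a real vectorizer, and the new per-dimension diagram format is assumed.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

def select_dim(d):
    # stand-in for a dimension selector: pick the d-th per-dimension array
    return FunctionTransformer(lambda X, d=d: [x[d] for x in X])

def total_persistence(X):
    # toy stand-in for a real vectorizer: one feature, sum of (death - birth)
    return np.array([[np.sum(dgm[:, 1] - dgm[:, 0])] for dgm in X])

union = FeatureUnion([
    ("dim0", Pipeline([("sel", select_dim(0)), ("vec", FunctionTransformer(total_persistence))])),
    ("dim1", Pipeline([("sel", select_dim(1)), ("vec", FunctionTransformer(total_persistence))])),
])
X = [[np.array([[0.0, 2.0], [0.0, 1.0]]), np.array([[0.5, 0.9]])]]  # one sample
out = union.fit_transform(X)  # shape (1, 2): one feature per homology dimension
```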

Comment on lines 28 to 40
>>> pdiagram1 = [(0, (0.0, 2.34)), (0, (0.0, 0.956)), (1, (0.536, 0.856)), (2, (1.202, 1.734))]
>>> pdiagram2 = [(0, (0.0, 3.34)), (0, (0.0, 2.956)), (1, (0.536, 1.856)), (2, (1.202, 2.734))]
>>> pdiagram3 = [(0, (1.0, 4.34)), (0, (2.0, 3.956)), (1, (1.536, 2.856)), (2, (3.202, 4.734))]
>>> list_pdiags = [pdiagram1, pdiagram2, pdiagram3]
>>> archipelago = Archipelago(island=Atol())
>>> archipelago.fit(X=list_pdiags)
>>> archipelago.transform(X=list_pdiags)
>>> archipelago = Archipelago(island_dict={2: BettiCurve(resolution=4), 0:Atol()})
>>> import pandas as pd
>>> series_pdiags = pd.Series(list_pdiags)
>>> archipelago.set_output(transform="pandas")
>>> archipelago.fit(X=series_pdiags)
>>> archipelago.transform(X=series_pdiags)
@mglisse (Member):

Looking at https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html, using the new format of persistence diagrams, and ignoring pandas for now, is this equivalent to the following?

from gudhi.representations import BettiCurve, Atol, Archipelago
import numpy as np
from sklearn.compose import ColumnTransformer
Xpdiagram1 = [np.array([[0.0, 2.34], [0.0, 0.956]]), np.array([[0.536, 0.856]]), np.array([[1.202, 1.734]])]
Xpdiagram2 = [np.array([[0.0, 3.34], [0.0, 2.956]]), np.array([[0.536, 1.856]]), np.array([[1.202, 2.734]])]
Xpdiagram3 = [np.array([[1.0, 3.34], [2.0, 2.956]]), np.array([[1.536, 2.856]]), np.array([[3.202, 4.734]])]
Xlist_pdiags = [Xpdiagram1, Xpdiagram2, Xpdiagram3]
ct=ColumnTransformer([("0",Atol(),0),("1",Atol(),1),("2",Atol(),2)])
print(ct.fit_transform(Xlist_pdiags))
ct=ColumnTransformer([("0",Atol(),0),("2",BettiCurve(resolution=4),2)])
print(ct.fit_transform(Xlist_pdiags))

I had to make a couple of changes in BettiCurve to let this run (I don't know exactly what `not X` is meant to test for):

@@ -381,7 +381,7 @@
         if not self.is_fitted():
             raise NotFittedError("Not fitted.")
 
-        if not X:
+        if X is None or len(X) == 0:
             X = [np.zeros((0, 2))]
         
         N = len(X)
@@ -408,7 +408,7 @@
 
         return np.array(bettis, dtype=int)[:, 0:-1]
 
-    def fit_transform(self, X):
+    def fit_transform(self, X, y=None):
         """
         The result is the same as fit(X) followed by transform(X), but potentially faster.
         """

and I get similar results, but they differ from what Archipelago returns for dimension 0 (maybe it is just related to random generators).

@martinroyer (Collaborator, author) commented on Jan 15, 2024:

Hmm, they differ because you made a mistake in copying; it should be

Xpdiagram3 = [np.array([[1.0, 4.34], [2.0, 3.956]]), np.array([[1.536, 2.856]]), np.array([[3.202, 4.734]])]

and then the results are equal indeed.

Also this seems to be working:

ct.set_output(transform="pandas")
Xdf_pdiags = pd.DataFrame(Xlist_pdiags)
print(ct.fit_transform(Xdf_pdiags))

so this is possibly the end for this PR, with the new pdiags format.

@martinroyer (Collaborator, author):

So the one thing this PR does better than your proposition is handle dimensions that fail at fit, by not registering the key into the archipelago_ dict.
That seems desirable for now?
Which makes me think a lot of methods from vector_methods are not protected against things like
`X is None or len(X) == 0`

@martinroyer (Collaborator, author):

After further discussion with Marc, I believe the better philosophy for gudhi is what he suggests:

  • if one asks for vectorizing of some homology dimensions (say 0, 1 and 2) we provide the structure for the dimensions (for instance with ColumnTransformers from this conversation),
  • and because there can be no homology features in some dimensions, if it turns out there is no data to vectorize, we should print a warning (e.g. "no information in dimension d, problems ahead, consider asking for different homology dimensions") but still let it proceed, possibly to a crash, instead of what I was doing, which was removing the problematic dimension.

So we can definitely drop this PR now.

We should (somehow?) salvage the changes to vector_methods: a default method for Atol so that it can be initialized with empty arguments, and some tweaks that can be generalized to all vector methods, such as implementing get_feature_names_out so that we can use the set_output feature from sklearn.

@mglisse (Member):

Yes, we should still include the other changes.

For reference, the discussion was mostly about how to handle empty diagrams in some dimensions. If the user asks for persistence in dimensions 0, 1 and 2, but the diagrams in dimension 2 are all empty (especially during fit), we have several options:

  1. we can throw an exception (possibly just for the vectorizers that do need fitting like Atol, or Landscape without an explicit sample_range)
  2. we can (print a warning and) continue, and for this dimension transform will either (preferably) return a harmless constant, or return nonsense (including NaN).
  3. we can (print a warning and) skip that dimension henceforth, transform will act as if the user had only asked for dimensions 0 and 1.

@martinroyer chose the last option in this PR. From a pure ML perspective, that option is convenient: a user can ask the pipeline to aggregate the results in several dimensions without caring about the details, and if one dimension is useless and the pipeline can automatically ignore it, that user is happy. Ignoring that dimension also has the advantage over returning a constant that it avoids a useless increase in dimension. It also goes well with the dictionary-like pandas output.
From a lower level perspective, the last option can be confusing. The output of transform does not have the expected shape (I can ask for 3 landscapes with 5 points each but get a vector of dimension 10 because one dimension was ignored).
Both options have their advantages. Note that this choice can be handled either inside each vectorizer (say Atol(skip_on_fit_error=True) for which fit([]) defines an empty list of centers, and after that transform returns a list of empty arrays, which hopefully disappear when ColumnTransformer tries to append them to the rest), or in a wrapper like ColumnTransformer.
In any case, we should check how we handle (lists of) empty diagrams.

@martinroyer (Collaborator, author) commented on Feb 13, 2024:

[PS: redrafted to reflect that this is consistent with @mglisse's option number 2]

Adding to that, I believe there are two "empty diagrams" cases to discuss, that of fit and that of transform.

As for fit, if we hold the view that once someone asks for vectorizing (say in dimensions 0 and 1) we should provide the structure for doing so (Marc's option 2, "provide the structure at all costs"):
for Atol that implies that even without data there should always be centers at the end of fit, so we could for instance do random generation:

    def fit(self, X, y=None, sample_weight=None):
        if not hasattr(self.quantiser, 'fit'):
            raise TypeError("quantiser %s has no `fit` attribute." % (self.quantiser))

        # In fitting we remove infinite birth/death time points so that every center is finite
        X = [dgm[~np.isinf(dgm).any(axis=1), :] for dgm in X if len(dgm)]

        if len(X) < self.quantiser.n_clusters:
            # If no point to fit, let's arbitrarily put centers in [0, 1)
            X += [np.array([[np.random.random(), np.random.random()]]) for _ in range(self.quantiser.n_clusters - len(X))]

(This would ensure that quantiser has enough points to fit n_clusters regardless of entry X.)

As for transform, one convention could be that empty diagrams are thought of as empty measures, and therefore vectorized as 0 (no mass). This ensures that no error is thrown, which would be pretty desirable at transform time?
For Atol, one way of achieving this is to replace empty diagrams with np.array([[np.inf, np.inf]]).

That looks good to me.
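As an illustration of that convention, here is a toy, Atol-like Gaussian-mass vectorizer (not GUDHI's actual Atol): a single point at infinity contributes zero mass with respect to any finite center, so an empty diagram vectorizes to zeros.

```python
import numpy as np

centers = np.array([[0.5, 0.5], [1.0, 1.5]])  # pretend these came from fit

def toy_atol_transform(diagrams, centers, scale=1.0):
    out = []
    for dgm in diagrams:
        if len(dgm) == 0:
            # empty diagram = empty measure: a point at infinity has no mass
            dgm = np.array([[np.inf, np.inf]])
        sq_dists = ((dgm[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        out.append(np.exp(-sq_dists / scale).sum(axis=0))
    return np.array(out)

vecs = toy_atol_transform([np.zeros((0, 2)), np.array([[0.5, 0.5]])], centers)
# first row (the empty diagram) is all zeros; second row has positive mass
```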

@MathieuCarriere (Collaborator) commented on Feb 16, 2024:

I am globally in favor of option (2), that is, return a constant for all data instances whose corresponding persistence diagrams are all empty. I think it matches ML practices better, where one usually uses homology dimensions without thinking too much beforehand about whether the persistence diagrams make sense or not. If they do not (and thus constants are returned), pipelines will not crash, but will contain redundant, useless constants that can be spotted a posteriori through post-analysis of the results, as is sometimes done in other contexts.

That being said, I don't know if this is the behavior of the different representation methods that we have...

@martinroyer (Collaborator, author) left a comment:

@VincentRouvreau hopefully these are the bugfixes you need!

Comment on lines -722 to +724
>>> atol_vectoriser = Atol(quantiser=KMeans(n_clusters=2, random_state=202006))
>>> atol_vectoriser = Atol(quantiser=KMeans(n_clusters=2, random_state=202006, n_init=10))
@martinroyer (Collaborator, author):

@VincentRouvreau I believe this is what you need part1

Comment on lines -121 to +127
atol_vectoriser = Atol(quantiser=KMeans(n_clusters=2, random_state=202006))
atol_vectoriser = Atol(quantiser=KMeans(n_clusters=2, random_state=202006, n_init=10))
# Atol will do
# X = np.concatenate([a,b,c])
# kmeans = KMeans(n_clusters=2, random_state=202006).fit(X)
# kmeans = KMeans(n_clusters=2, random_state=202006, n_init=10).fit(X)
# kmeans.labels_ will be : array([1, 0, 1, 0, 0, 1, 0, 0])
first_cluster = np.asarray([a[0], a[2], b[2]])
second_cluster = np.asarray([a[1], b[0], b[2], c[0], c[1]])
second_cluster = np.asarray([a[1], b[0], b[1], c[0], c[1]])
@martinroyer (Collaborator, author):

@VincentRouvreau I believe this is what you need part2

quantiser=KMeans(n_clusters=1, random_state=202006),
quantiser=KMeans(n_clusters=1, random_state=202006, n_init=10),
@martinroyer (Collaborator, author):

@VincentRouvreau I believe this is what you need part3/3

Martin ROYER added 3 commits June 7, 2024 16:40
- arbitrary centers when not enough points
- better filtering of infinite birth/death time points
- weighting before filtering
@martinroyer (Collaborator, author) commented on Jun 7, 2024:

So I implemented what we discussed in February in this PR: the ability (for Atol) to handle empty diagrams (on fit and transform), that translates naturally into the same ability for a dict transformer.

I added a test file that checks whether a vectorizer is able to handle various scenarios with input persistence diagrams (new format): fit / transform / fit when empty / transform when empty / pandas' set_output / ColumnTransformer.

We could rewrite and keep only the tests that interest us most if we want to have a sort of common interface for vectorization classes down the road, and add the classes to the test once they do what we want them to do.

I removed Archipelago because I believe it is well handled by scikit-learn's ColumnTransformer (see the example in the test file).

@martinroyer (Collaborator, author) commented on Jun 7, 2024:

So this PR has now become just an "Atol robustness update" coupled with some interesting tests for a shared vectorisation method interface. I can redraft it into one or two PRs if we want things to be clearer.

@martinroyer (Collaborator, author):

This is continued in PR #1096.

@martinroyer martinroyer deleted the archipelago branch June 28, 2024 09:08