
Archipelago - dict transformer for vectorizing persistence diagrams #1017

Closed
wants to merge 29 commits

Conversation

martinroyer (Collaborator):

Transformer that dictionary-wraps persistence diagram vectorizers, i.e. objects from gudhi.representations.vector_methods. One provides persistence diagram vectorizers (by way of either island or island_dict), and the Archipelago object will fit on and transform (= vectorize) lists or series of persistence diagrams (in pandas format). The object is sklearn-API consistent.

So the difference between Archipelago and the methods from gudhi.representations.vector_methods is that Archipelago operates on "whole" persistence diagrams, e.g. [(0, (0.0, 2.34)), (0, (0.0, 0.956)), (1, (0.536, 0.856)), (2, (1.202, 1.734))], whereas the methods from vector_methods tend to operate on a single dimension, e.g. lists of numpy arrays in $R^2$.
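To make the format gap concrete, here is a minimal sketch (split_by_dimension is an illustrative helper, not part of this PR) that breaks an old-format diagram into the per-dimension numpy arrays that vector_methods objects consume:

```python
import numpy as np
from collections import defaultdict

def split_by_dimension(diagram):
    """Group an old-format diagram into {dim: (n, 2) array of (birth, death)}."""
    by_dim = defaultdict(list)
    for dim, (birth, death) in diagram:
        by_dim[dim].append((birth, death))
    return {dim: np.asarray(points) for dim, points in by_dim.items()}

pdiagram1 = [(0, (0.0, 2.34)), (0, (0.0, 0.956)), (1, (0.536, 0.856)), (2, (1.202, 1.734))]
parts = split_by_dimension(pdiagram1)
# parts[0] has shape (2, 2); parts[1] and parts[2] each have shape (1, 2)
```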

So far it feels nice to have something like Archipelago(island_dict={2: BettiCurve(resolution=4), 0: TopologicalVector(threshold=3)}) work, and it also appears to work on both lists of diagrams and pandas.Series of diagrams.

This PR also contains minor edits to Atol, like a default quantiser.

@mglisse (Member) left a comment:

src/python/gudhi/representations/vector_methods.py
diagrams according to homology_dimensions.

Examples
>>> pdiagram1 = [(0, (0.0, 2.34)), (0, (0.0, 0.956)), (1, (0.536, 0.856)), (2, (1.202, 1.734))]
@mglisse (Member):

I am a bit reluctant to add more code that uses the old format for persistence diagrams, when we are trying to replace it with something more convenient (see #925, https://github.com/GUDHI/gudhi-devel/blob/master/src/python/gudhi/sklearn/cubical_persistence.py or #691). Of course we could change archipelago later to accept the new format, but it may become complicated if we try to be compatible with both.

@martinroyer (Collaborator, author):

I'm not sure what the new format is, but it seems it will be better optimized for Archipelago, so all the better. Do you mean you want me to wait for those PRs to come online before processing this one?

@mglisse (Member):

> I'm not sure what the new format is but it seems it will be better optimized for Archipelago, so all the better.

It is the one we discussed before the holidays. Instead of a list of (dim,(birth,death)), if the user asked for homology in dimensions [i,j,k], they get [dgm_i, dgm_j, dgm_k] where each dgm is a numpy array of shape (*,2).

It is better in that you already have numpy arrays for each dimension. However, because it is a list and not a dict, you do not have access to the labels [i,j,k]; the user has to provide them separately if you need them for something. It also makes it ambiguous, when we refer to the 0-th diagram, whether it corresponds to homology in dimension 0 or in dimension i (the first of the list).
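A short sketch of the format described above (the values are illustrative, reusing the example diagram from the PR description):

```python
import numpy as np

# The user asked for homology in dimensions [i, j, k]; here [0, 1, 2].
requested_dims = [0, 1, 2]
# New format: one (*, 2) numpy array per requested dimension, in order.
diagram = [
    np.array([[0.0, 2.34], [0.0, 0.956]]),  # dgm for the 1st requested dim
    np.array([[0.536, 0.856]]),             # dgm for the 2nd requested dim
    np.array([[1.202, 1.734]]),             # dgm for the 3rd requested dim
]
# The list carries no labels; to recover them, pair it with requested_dims:
labelled = dict(zip(requested_dims, diagram))
```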

> Do you mean you want me to wait for those PRs to come online before processing this one?

I don't know what I want 😉

Comment on lines 19 to 20
|fit on| and |transform = vectorize| list or series of persistence diagrams (in pandas format).
The object is sklearn-API consistent.
@mglisse (Member):

Isn't the pandas format controlled by transform_output in scikit-learn?

@martinroyer (Collaborator, author):

It is, but I have not implemented that feature: at the moment I force the output into pandas format. I will look into it, and if it is easy I will implement it.

@martinroyer (Collaborator, author):

Actually it is worked out within sklearn's TransformerMixin, so I hooked it up now (not sure if very safe though?) and it nicely gets rid of the annoying pandas dependency.

@mglisse (Member):

> (not sure if very safe though?)

Do you mean "safe" in terms of the version of sklearn that is required, or something else?
(the doc may need a small update to reflect this change)

@martinroyer (Collaborator, author):

Since the number (and names) of features out of Archipelago can only be known at transform run time, I change a class attribute accordingly, and I am not sure that is very safe...

@mglisse (Member):

> Since the number (and names) of features out of Archipelago can only be known at transform run time,

I see. It could be computed at fit time, which would better match the sklearn philosophy, but that wouldn't be convenient. I have no idea how bad it is, if it can interact strangely with cloning or something. For now, I don't see any obvious problem...

continue
this_dim_list_pdiags = [pdiags[dimension] for pdiags in by_dim_list_pdiags]
vectorized_dgms = self.archipelago_[dimension].transform(this_dim_list_pdiags)
running_transform_names += [f"{dimension} Center {i + 1}" for i in range(vectorized_dgms.shape[1])]
@mglisse (Member):

Maybe "feature" is more generic than "center" if the island is not necessarily atol. Although the name archipelago may look strange if we don't use atol at all.


class Archipelago(BaseEstimator, TransformerMixin):
"""
Transformer that dictionary-wraps persistence diagram vectorizers, i.e. objects from gudhi.representations.vector_methods.
@mglisse (Member):

(this may be more natural with the new format)
Maybe we should say that this class is helpful to avoid going through DimensionSelector + FeatureUnion or similar. It could even be nice to have an example somewhere comparing the two (we can add it later, though).
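For comparison, a hedged sketch of the FeatureUnion route mentioned above, built from plain scikit-learn pieces: a FunctionTransformer stands in for a dimension selector, a toy total-persistence function stands in for a real vectorizer, and the new per-dimension diagram format is assumed.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

def select_dim(d):
    # stand-in for a dimension selector: pick the d-th per-dimension array
    return FunctionTransformer(lambda X, d=d: [x[d] for x in X])

def total_persistence(X):
    # toy stand-in for a real vectorizer: one feature, sum of (death - birth)
    return np.array([[np.sum(dgm[:, 1] - dgm[:, 0])] for dgm in X])

union = FeatureUnion([
    ("dim0", Pipeline([("sel", select_dim(0)), ("vec", FunctionTransformer(total_persistence))])),
    ("dim1", Pipeline([("sel", select_dim(1)), ("vec", FunctionTransformer(total_persistence))])),
])
X = [[np.array([[0.0, 2.0], [0.0, 1.0]]), np.array([[0.5, 0.9]])]]  # one sample
out = union.fit_transform(X)  # shape (1, 2): one feature per homology dimension
```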

Comment on lines 28 to 40
>>> pdiagram1 = [(0, (0.0, 2.34)), (0, (0.0, 0.956)), (1, (0.536, 0.856)), (2, (1.202, 1.734))]
>>> pdiagram2 = [(0, (0.0, 3.34)), (0, (0.0, 2.956)), (1, (0.536, 1.856)), (2, (1.202, 2.734))]
>>> pdiagram3 = [(0, (1.0, 4.34)), (0, (2.0, 3.956)), (1, (1.536, 2.856)), (2, (3.202, 4.734))]
>>> list_pdiags = [pdiagram1, pdiagram2, pdiagram3]
>>> archipelago = Archipelago(island=Atol())
>>> archipelago.fit(X=list_pdiags)
>>> archipelago.transform(X=list_pdiags)
>>> archipelago = Archipelago(island_dict={2: BettiCurve(resolution=4), 0:Atol()})
>>> import pandas as pd
>>> series_pdiags = pd.Series(list_pdiags)
>>> archipelago.set_output(transform="pandas")
>>> archipelago.fit(X=series_pdiags)
>>> archipelago.transform(X=series_pdiags)
@mglisse (Member):

Looking at https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html, using the new format of persistence diagrams, and ignoring pandas for now, is this equivalent to the following?

from gudhi.representations import BettiCurve, Atol, Archipelago
import numpy as np
from sklearn.compose import ColumnTransformer
Xpdiagram1 = [np.array([[0.0, 2.34], [0.0, 0.956]]), np.array([[0.536, 0.856]]), np.array([[1.202, 1.734]])]
Xpdiagram2 = [np.array([[0.0, 3.34], [0.0, 2.956]]), np.array([[0.536, 1.856]]), np.array([[1.202, 2.734]])]
Xpdiagram3 = [np.array([[1.0, 3.34], [2.0, 2.956]]), np.array([[1.536, 2.856]]), np.array([[3.202, 4.734]])]
Xlist_pdiags = [Xpdiagram1, Xpdiagram2, Xpdiagram3]
ct=ColumnTransformer([("0",Atol(),0),("1",Atol(),1),("2",Atol(),2)])
print(ct.fit_transform(Xlist_pdiags))
ct=ColumnTransformer([("0",Atol(),0),("2",BettiCurve(resolution=4),2)])
print(ct.fit_transform(Xlist_pdiags))

I had to make a couple of changes in BettiCurve to let this run (I don't know exactly what `not X` is meant to test for):

@@ -381,7 +381,7 @@
         if not self.is_fitted():
             raise NotFittedError("Not fitted.")
 
-        if not X:
+        if X is None or len(X) == 0:
             X = [np.zeros((0, 2))]
         
         N = len(X)
@@ -408,7 +408,7 @@
 
         return np.array(bettis, dtype=int)[:, 0:-1]
 
-    def fit_transform(self, X):
+    def fit_transform(self, X, y=None):
         """
         The result is the same as fit(X) followed by transform(X), but potentially faster.
         """

and I get similar results, but they differ from what Archipelago returns for dimension 0 (maybe it is just related to random generators).

@martinroyer (Collaborator, author) commented on Jan 15, 2024:

Hmm, they differ because you made a mistake in copying; it should be

Xpdiagram3 = [np.array([[1.0, 4.34], [2.0, 3.956]]), np.array([[1.536, 2.856]]), np.array([[3.202, 4.734]])]

and then the results are equal indeed.

Also this seems to be working:

ct.set_output(transform="pandas")
Xdf_pdiags = pd.DataFrame(Xlist_pdiags)
print(ct.fit_transform(Xdf_pdiags))

so this is possibly the end for this PR, with the new pdiags format.

@martinroyer (Collaborator, author):

So the one thing this PR does better than your proposition is handle dimensions that fail at fit, by not registering the key into the archipelago_ dict.
That seems desirable for now?
Which makes me think a lot of methods from vector_methods are not protected against things like
`X is None or len(X) == 0`

@martinroyer (Collaborator, author):

After further discussion with Marc, I believe the better philosophy for gudhi is what he suggests:

  • if one asks for vectorizing of some homology dimensions (say 0, 1 and 2) we provide the structure for the dimensions (for instance with ColumnTransformers from this conversation),
  • and because there can be no homology features in some dimensions, if it turns out there is no data to vectorize, we should print a warning (e.g. "no information in dimension d, problems ahead, consider asking for different homology dimensions") but still let it proceed, possibly to a crash, instead of what I was doing, which was removing the problematic dimension.

So we can definitely drop this PR now.

We should (somehow?) salvage the changes to vector_methods: a default method for Atol so that it can be initialized with empty arguments, and some tweaks that can be generalized to all vector methods, such as implementing get_feature_names_out so that we can use the set_output feature from sklearn.

@mglisse (Member):

Yes, we should still include the other changes.

For reference, the discussion was mostly about how to handle empty diagrams in some dimensions. If the user asks for persistence in dimensions 0, 1 and 2, but the diagrams in dimension 2 are all empty (especially during fit), we have several options:

  1. we can throw an exception (possibly just for the vectorizers that do need fitting like Atol, or Landscape without an explicit sample_range)
  2. we can (print a warning and) continue, and for this dimension transform will either (preferably) return a harmless constant, or return nonsense (including NaN).
  3. we can (print a warning and) skip that dimension henceforth, transform will act as if the user had only asked for dimensions 0 and 1.

@martinroyer chose the last option in this PR. From a pure ML perspective, that option is convenient: a user can ask the pipeline to aggregate the results in several dimensions without caring about the details, and if one dimension is useless and the pipeline can automatically ignore it, that user is happy. Ignoring that dimension also has the advantage over returning a constant that it avoids a useless increase in dimension. It also goes well with the dictionary-like pandas output.
From a lower level perspective, the last option can be confusing. The output of transform does not have the expected shape (I can ask for 3 landscapes with 5 points each but get a vector of dimension 10 because one dimension was ignored).
Both options have their advantages. Note that this choice can be handled either inside each vectorizer (say Atol(skip_on_fit_error=True) for which fit([]) defines an empty list of centers, and after that transform returns a list of empty arrays, which hopefully disappear when ColumnTransformer tries to append them to the rest), or in a wrapper like ColumnTransformer.
In any case, we should check how we handle (lists of) empty diagrams.

@martinroyer (Collaborator, author) commented on Feb 13, 2024:

[PS: redrafted to reflect that this is consistent with @mglisse's option number 2]

Adding to that, I believe there are two "empty diagrams" cases to discuss, that of fit and that of transform.

As for fit, if we hold the view that once someone asks for vectorizing (say in dimensions 0 and 1) we should provide the structure for doing so (Marc's option 2, "provide the structure at all costs"):
for Atol that implies that even without data there should always be centers at the end of fit, so we could for instance do random generation:

    def fit(self, X, y=None, sample_weight=None):
        if not hasattr(self.quantiser, 'fit'):
            raise TypeError("quantiser %s has no `fit` attribute." % (self.quantiser))

        # In fitting we remove infinite birth/death time points so that every center is finite
        X = [dgm[~np.isinf(dgm).any(axis=1), :] for dgm in X if len(dgm)]

        if len(X) < self.quantiser.n_clusters:
            # If no point to fit, let's arbitrarily put centers in [0, 1)
            X += [np.array([[np.random.random(), np.random.random()]]) for _ in range(self.quantiser.n_clusters - len(X))]

(This would ensure that quantiser has enough points to fit n_clusters regardless of entry X.)

As for transform, one convention could be that empty diagrams are thought of as empty measures, and therefore vectorized as 0 (no mass). This ensures that no error is thrown, which would be pretty desirable at transform time?
For Atol, one way of achieving this is to replace empty diagrams with np.array([[np.inf, np.inf]]).

That looks good to me.
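As an illustration of that convention, here is a toy, Atol-like Gaussian-mass vectorizer (not GUDHI's actual Atol): a single point at infinity contributes zero mass with respect to any finite center, so an empty diagram vectorizes to zeros.

```python
import numpy as np

centers = np.array([[0.5, 0.5], [1.0, 1.5]])  # pretend these came from fit

def toy_atol_transform(diagrams, centers, scale=1.0):
    out = []
    for dgm in diagrams:
        if len(dgm) == 0:
            # empty diagram = empty measure: a point at infinity has no mass
            dgm = np.array([[np.inf, np.inf]])
        sq_dists = ((dgm[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        out.append(np.exp(-sq_dists / scale).sum(axis=0))
    return np.array(out)

vecs = toy_atol_transform([np.zeros((0, 2)), np.array([[0.5, 0.5]])], centers)
# first row (the empty diagram) is all zeros; second row has positive mass
```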

@MathieuCarriere (Collaborator) commented on Feb 16, 2024:

I am globally in favor of option (2), that is, return a constant for all data instances whose corresponding persistence diagrams are all empty. I think it matches ML practices better, where one usually uses homology dimensions without thinking too much beforehand about whether the persistence diagrams make sense or not. If they do not (and thus constants are returned), pipelines will not crash, but will contain redundant, useless constants that can be spotted a posteriori through post-analysis of the results, as is sometimes done in other contexts.

That being said, I don't know if this is the behavior of the different representation methods that we have...

@martinroyer (Collaborator, author) left a comment:

@VincentRouvreau hopefully these are the bugfixes you need!

Comment on lines -722 to +724
>>> atol_vectoriser = Atol(quantiser=KMeans(n_clusters=2, random_state=202006))
>>> atol_vectoriser = Atol(quantiser=KMeans(n_clusters=2, random_state=202006, n_init=10))
@martinroyer (Collaborator, author):

@VincentRouvreau I believe this is what you need part1

Comment on lines -121 to +127
atol_vectoriser = Atol(quantiser=KMeans(n_clusters=2, random_state=202006))
atol_vectoriser = Atol(quantiser=KMeans(n_clusters=2, random_state=202006, n_init=10))
# Atol will do
# X = np.concatenate([a,b,c])
# kmeans = KMeans(n_clusters=2, random_state=202006).fit(X)
# kmeans = KMeans(n_clusters=2, random_state=202006, n_init=10).fit(X)
# kmeans.labels_ will be : array([1, 0, 1, 0, 0, 1, 0, 0])
first_cluster = np.asarray([a[0], a[2], b[2]])
second_cluster = np.asarray([a[1], b[0], b[2], c[0], c[1]])
second_cluster = np.asarray([a[1], b[0], b[1], c[0], c[1]])
@martinroyer (Collaborator, author):

@VincentRouvreau I believe this is what you need part2

quantiser=KMeans(n_clusters=1, random_state=202006),
quantiser=KMeans(n_clusters=1, random_state=202006, n_init=10),
@martinroyer (Collaborator, author):

@VincentRouvreau I believe this is what you need part3/3

Martin ROYER added 3 commits June 7, 2024 16:40
- arbitrary centers when not enough points
- better filtering of infinite birth/death time points
- weighting before filtering
@martinroyer (Collaborator, author) commented on Jun 7, 2024:

So I implemented what we discussed in February in this PR: the ability (for Atol) to handle empty diagrams (on fit and transform), that translates naturally into the same ability for a dict transformer.

I added a test file that checks whether a vectorizer is able to handle various scenarios with input persistence diagrams (new format): fit / transform / fit when empty / transform when empty / pandas' set_output / ColumnTransformer.

We could rewrite and keep only the tests that interest us most if we want to have a sort of common interface for vectorization classes down the road, and add the classes to the test once they do what we want them to do.

I removed Archipelago because I believe it is well handled by scikit-learn's ColumnTransformer (see the example in the test file).

@martinroyer (Collaborator, author) commented on Jun 7, 2024:

So this PR has now become just an "Atol robustness update" coupled with some interesting tests for a shared vectorisation method interface. I can redraft it into one or two PRs if we want things to be clearer.

@martinroyer (Collaborator, author):

This is continued in PR #1096.

@martinroyer martinroyer deleted the archipelago branch June 28, 2024 09:08