Some robustifying fixes on Atol fit and tests for vectorizers #1096
Conversation
[0, 1)^2 is an arbitrary, questionable choice.
- fit
- fit empty diagrams
- transform
- transform empty diagrams
- sklearn set_output
- sklearn compose with ColumnTransformer (not sure how this could fail with all the other tests but who knows)
Some details below, but the approach looks fine.
n_clusters = self.quantiser.n_clusters
n_points = len(filtered_measures_concat)
if n_points < n_clusters:
    # If not enough points to fit (including 0), let's arbitrarily put centers in [0, 1)^2
In #1062, for Landscapes, the fallback is a "useless" grid [-inf, -inf, -inf, ...] for which transform will produce only zeros. Here, instead of picking points far enough away that they don't matter, you try to put them in a place where, at least for some datasets, they might be useful. Both approaches look OK to me (they both also print a message); I just wanted to point out the inconsistency (it won't block the PR).
So the latest commits from me are now consistent with this behaviour: if some centers cannot be produced by "fitting some data", they will be assigned to [-np.inf, -np.inf, ...] and they transform everything to 0.
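A minimal sketch of why such filler centers are inert at transform time, assuming a Gaussian-like contrast function (illustrative only, not GUDHI's exact transform code):

```python
import numpy as np

# One real center and one [-inf, -inf] filler center (values are made up).
centers = np.array([[0.5, 0.5], [-np.inf, -np.inf]])
points = np.array([[0.2, 0.3], [1.0, 0.8]])

# The distance from any finite point to the filler center is +inf, so a
# Gaussian-like contrast exp(-d) evaluates to exp(-inf) = 0 for that center.
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
print(np.exp(-dists))  # the filler-center column is exactly 0
```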
n_points = len(filtered_measures_concat)
if n_points < n_clusters:
    # If not enough points to fit (including 0), let's arbitrarily put centers in [0, 1)^2
    print(f"[Atol] had {n_points} points to fit {n_clusters} clusters, adding random points in [0, 1)^2.")
Wait, is Atol restricted to 2D? I thought it wasn't restricted to diagrams and could work in any dimension. Though if we fit on no points, I don't know how to guess a dimension for the random centers.
It is not, this is a hard one!
There is still the possibility of giving up (i.e. raise an exception if there are no points at all).
I have now favored and implemented this approach!
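For illustration, a hypothetical version of that guard (the helper name is made up, and this is not necessarily the exact code committed):

```python
def _check_fit_points(filtered_measures_concat):
    # Give up early: with no points at all there is no way to even guess a
    # dimension for the centers, so raising is the only sane option here.
    if len(filtered_measures_concat) == 0:
        raise ValueError("[Atol] fit was given no (finite) points to cluster.")
```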
    filtered_weights_concat = np.concatenate((filtered_weights_concat, np.ones(shape=(n_clusters - n_points))))
    filtered_measures_concat = np.concatenate((filtered_measures_concat, np.random.random((n_clusters - n_points, 2))))

self.quantiser.fit(X=filtered_measures_concat, sample_weight=filtered_weights_concat)
When the number of points equals the number of requested clusters, are clustering algorithms usually good at finding all of the points as centers? Another possibility, in the "too few points" case, would be to write to self.centers directly and skip the use of the quantiser.
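A rough sketch of that alternative (hypothetical helper, not what the PR implements): take every available point as a center and pad the remaining slots, without invoking the quantiser at all.

```python
import numpy as np

def direct_centers(points, n_clusters):
    # With fewer points than clusters, every point becomes a center and the
    # remaining slots are padded with inert [-inf, ...] filler centers.
    n_points, dim = points.shape
    filler = np.full((n_clusters - n_points, dim), -np.inf)
    centers = np.concatenate([points, filler])
    inertias = np.zeros(n_clusters)  # singleton/filler centers carry no inertia
    return centers, inertias
```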
First of all, thanks for all the great remarks.
On this specific question, I think they are OK at that, yes. And while I like your suggestion, if there are only, say, two points and we want five clusters, it is probably highly valuable to take those two points as cluster centers and do whatever for the other three.
So I am still thinking, but sort of converging towards doing the following:
- filter more with np.unique (there is an overhead, because all we really need to know is how many unique points, up to n_clusters, there are in the target set),
- learn what we can on those unique points and fill in the rest of the centers with [-np.inf, -np.inf] (that would be closer to Landscapes), and for these centers write the inertias as 0, and hopefully that would not raise any errors at transform time.
Will work on that!
> - filter more with np.unique (there is an overhead, because all we really need to know is how many unique points, up to n_clusters, there are in the target set)

Yeah, I don't think there is a function that begins sorting or hash-tabling and stops once it has found k distinct elements... I think the common case will be that the first n_clusters points are distinct; maybe you could first check that with np.unique (does it actually work on points?), and only go for something slower in other cases?
Actually, do we really care? Having a duplicated feature is not much worse than having a feature that is always 0, and this is only for a degenerate case that really shouldn't happen anyway, so I don't think we should make too much effort beyond "warn and don't crash". I only mentioned the possibility of duplicate points in passing, but it does not bother me.
> - fill in the rest of the centers with [-np.inf, -np.inf] (that would be closer to Landscapes), and for these centers write the inertias as 0, and hopefully that would not raise any errors at transform time.

As you like. Unless someone else chimes in, I don't really care what points you use as filler.
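(Regarding "does np.unique actually work on points?" above: it does, when given axis=0 so that it deduplicates rows. A quick check:)

```python
import numpy as np

points = np.array([[0.0, 1.0], [0.0, 1.0], [2.0, 3.0]])
# axis=0 treats each row as one point; 3 points -> 2 distinct ones here.
print(np.unique(points, axis=0))  # [[0. 1.] [2. 3.]]
```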
I agree that we do not really care, so I have dropped this!
weights_concat = np.concatenate(sample_weight)

self.quantiser.fit(X=measures_concat, sample_weight=weights_concat)
# In fitting we remove infinite birth/death time points so that every center is finite. We do not care about duplicates.
I am not sure if we should ask the user to do so before calling Atol (with DiagramSelector(use=True, point_type="finite") for instance), or if all representations that require no infinite birth/death time points should do the same. To be discussed.
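For reference, the "ask the user" option would look something like this in an sklearn pipeline (a sketch; the quantiser choice is arbitrary):

```python
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from gudhi.representations import Atol, DiagramSelector

# Filter infinite birth/death points explicitly before handing diagrams to Atol.
pipe = Pipeline([
    ("finite", DiagramSelector(use=True, point_type="finite")),
    ("atol", Atol(quantiser=KMeans(n_clusters=2, random_state=0))),
])
```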
I don't have a strong opinion on this, but a comment: Atol is slightly different from the other vectorizers in that it vectorizes measures in general (not just diagrams), so Atol.fit is specific and needed, and hopefully guarantees to an extent that the transform will work in that general case.
Perhaps there is a case to be made for filtering-but-printing/warning if a vectorizer requiring no infinite death points receives one?
No strong opinion here either... @mglisse or @MathieuCarriere?
> Perhaps there is a case to be made for filtering-but-printing/warning if a vectorizer requiring no infinite death points receives one?

Warning if the filtering removed any infinite points looks like a good idea (I think, not 100% sure). (And here the default of warnings.warn to print the warning only once looks good, since this is about explaining that users may want to filter themselves before Atol, not something specific to one dataset.)
It is true that it would be cleaner not to filter in Atol, but it should be cheap enough, so for convenience, it doesn't really bother me.
Today's thought: I'm not sure it wouldn't be simpler to just have vectorizers handle infinite time points (for instance by filtering). Since diagrams are the primary focus of vectorizers, I think it's OK that vectorizers handle diagrams in the more general case (that is, including infinite points), and leave the special treatments (using DiagramSelector) to special cases.
if n_points < n_clusters:
    # If not enough points to fit (including 0), we will arbitrarily put centers as [-np.inf]^measure_dim at the end.
    print(f"[Atol] had {n_points} points to fit {n_clusters} clusters, adding meaningless cluster centers.")
We could also use warnings here, no?
print(f"[Atol] had {n_points} points to fit {n_clusters} clusters, adding meaningless cluster centers.") | |
warnings.warn(f"[Atol] had {n_points} points to fit {n_clusters} clusters, adding meaningless cluster centers.", RuntimeWarning) |
This also requires, at the beginning of the file:
import warnings
A drawback of warnings.warn is that by default, IIRC, it warns only once. That's good for something like a deprecation warning, or a missing dependency. But for this kind of runtime warning, I think we want it to appear "always" by default (unless the user explicitly overrides it of course). It is probably possible.
Yes, it is up to the end user to activate it. But print may not be strong enough in this case.
OK, we could keep it like that, but we have to think about how we want to handle these cases.
If we want the warning to appear multiple times, one possibility is

with warnings.catch_warnings():
    warnings.simplefilter("always")
    warnings.warn("foo", RuntimeWarning)

Another one is to ensure that the message is different every time (with a global counter?).
A last one is to use print (preferably to sys.stderr).
I didn't see a convenient way to say "if you would print the warning if this was the first time, print it anyway, but if warnings have been disabled, don't"; the closest I came is replacing the second line above with warnings.simplefilter("always", append=True), i.e. "if there is no explicit filter, default to 'always' for this warning", which doesn't seem too bad.
I first implemented the warnings suggestion, then rolled back on it because I couldn't clearly see the benefits over print: it doesn't seem to me (I might be wrong) that the situation where we add infinite "bogus" centers is so dangerous or bad for the user that it needs special attention.
There is also https://docs.python.org/3/library/logging.html in the standard library, where we could for instance log this with severity INFO.
Whenever we use a plain print for a warning (i.e. not a normal output that may be saved in a file and used as input for something else), it may be good to give it the argument file=sys.stderr (with import sys at the beginning) so the two kinds of output don't get mixed up.
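A sketch of the logging variant (the logger name is hypothetical; note that INFO records only show once the application configures logging, e.g. with logging.basicConfig(level=logging.INFO)):

```python
import logging

logger = logging.getLogger("gudhi.representations")  # hypothetical logger name

def notify_filler_centers(n_points, n_clusters):
    # Severity INFO: an informational notice, quiet by default and
    # filterable/redirectable by the user through the logging config.
    logger.info("[Atol] had %d points to fit %d clusters, adding meaningless cluster centers.",
                n_points, n_clusters)
```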
I wouldn't say this print qualifies as a warning or error or abnormal output; I would say it is additional information on the nature of your fit data X with respect to your expectations on the size of the vectorization. Maybe I misunderstood, but it would seem weird to me to give this print a specific treatment compared to the others in the file.
Maybe I need to reword it then? Or we could also remove it altogether.
Let me rephrase. My comment applied to all the prints in this file, not just this one (I have been lazy about enforcing this), which also means that it won't be a requirement for this PR (we can change them all later in another PR).
The idea is:
- I write a script.py that computes some Atol transform and prints the output (maybe "1.5 2.3\n4.0 0.").
- I run python script.py > vect.txt to redirect the output and store it in a file.
- I now run another program that reads vect.txt (say with numpy.loadtxt).
If I have a string "adding meaningless cluster centers" in the middle of vect.txt, the program reading it will get confused and fail. If we use print("bla", file=sys.stderr) in Atol, then python script.py > vect.txt will print "bla" on the console and not in the file (it is a separate stream, which could be redirected separately with 2> log.txt), and the file remains unpolluted and usable by further programs.
stderr is for any kind of notice, warning, etc., not just errors. Is the principle of stdout (the default for print) vs stderr clearer this way?
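Concretely, the script side of this would be (a two-line sketch):

```python
import sys

print("1.5 2.3")  # data: goes to stdout, ends up in vect.txt after `> vect.txt`
print("[Atol] adding meaningless cluster centers.", file=sys.stderr)  # notice: stays on the console
```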
Yes, I understand the idea of stderr vs stdout. To that end, the logging module is much cleaner if we want to go in that direction.
@@ -9,6 +9,8 @@
 # - 2020/12 Gard: A more flexible Betti curve class capable of computing exact curves.
 # - 2021/11 Vincent Rouvreau: factorize _automatic_sample_range

+import warnings
If there is no more warnings.warn, this can probably go as well?
oops, yes!
This introduces fixes that allow Atol to not crash on fitting, essentially by adding random points to the target.
This is a follow-up to PR #1017, which can now be dropped.
Option: I add a series of tests that only cover Atol for now, but are ready to be extended to all vectorization methods. So one could work on another vectorization method and directly use the same test battery (a rough sketch follows the list below) to see if there's a problem with that method's:
- fit,
- fit on empty diagrams,
- transform,
- transform on empty diagrams,
- using sklearn set_output,
- using sklearn compose with ColumnTransformer (not sure how this could fail with all the other tests but who knows).
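A rough sketch of what such a shared battery could look like (names are illustrative, only Atol is listed, and the test bodies are trimmed to one case):

```python
import numpy as np
import pytest
from sklearn.cluster import KMeans
from gudhi.representations import Atol

# One factory per vectorizer; extending the battery = appending to this list.
VECTORIZERS = [lambda: Atol(quantiser=KMeans(n_clusters=2, random_state=0))]
DIAGRAMS = [np.array([[0.0, 1.0], [0.0, 2.0]]), np.array([[1.0, 3.0]])]
EMPTY = [np.empty((0, 2))]

@pytest.mark.parametrize("make_vectorizer", VECTORIZERS)
def test_fit_and_transform_handle_empty_diagrams(make_vectorizer):
    vectorizer = make_vectorizer().fit(DIAGRAMS)
    vectorizer.transform(EMPTY)  # must not crash on empty diagrams
```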
This test list comes as a sort of probe for a shared common interface of vectorizers; it could be enriched/edited along the way.
(Naturally we could also work on the interface in a separate PR where we deal with the transform_one question too, see #1056 (comment).)