
Some robustifying fixes on Atol fit and tests for vectorizers #1096

Merged
merged 14 commits into GUDHI:master on Jul 29, 2024

Conversation

martinroyer
Collaborator

This introduces fixes so that Atol no longer crashes on fitting, essentially by adding random points to the target.
This is a follow-up to PR #1017, which can now be dropped.

Option: I add a series of tests that only cover Atol for now, but are ready to be extended to all vectorization methods. So one could work on another vector method and directly use the same test battery to see whether there is a problem with that method's:
- fit,
- fit on empty diagrams,
- transform,
- transform on empty diagrams,
- using sklearn set_output,
- using sklearn compose with ColumnTransformer (not sure how this could fail given all the other tests, but who knows).
This test list serves as a sort of probe for a shared common interface of vectorizers; it could be enriched or edited along the way.

(Naturally, we could also work on the interface as a separate PR, where we would deal with the transform_one question too; see #1056 (comment).)
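For illustration, here is a minimal sketch of what one probe from such a battery could look like (a pytest-style sketch; the quantiser choice and the expected output shape are assumptions, not the exact tests added in this PR):

    import numpy as np
    from sklearn.cluster import KMeans
    from gudhi.representations.vector_methods import Atol

    def test_atol_transform_on_empty_diagrams():
        # Fit on regular diagrams, then transform a batch containing an
        # empty diagram: it should produce a (1, n_clusters) vector, not crash.
        atol = Atol(quantiser=KMeans(n_clusters=2))
        diagrams = [np.array([[0., 1.], [0.5, 2.]]), np.array([[1., 4.]])]
        atol.fit(diagrams)
        vectors = atol.transform([np.empty((0, 2))])
        assert vectors.shape == (1, 2)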

Martin ROYER added 4 commits June 28, 2024 10:37
[0, 1)^2 is arbitrary, questionable choice.
- fit
- fit empty diagrams
- transform
- transform empty diagrams
- sklearn set_output
- sklearn compose with ColumnTransformer (not sure how this could fail with all the other tests but who knows)
Member

@mglisse left a comment


Some details below, but the approach looks fine.

n_clusters = self.quantiser.n_clusters
n_points = len(filtered_measures_concat)
if n_points < n_clusters:
    # If not enough points to fit (including 0), let's arbitrarily put centers in [0, 1)^2
Member


In #1062, for Landscapes, the fallback is a "useless" grid [-inf, -inf, -inf, ...] for which transform will produce only zeros. Here, instead of picking points far enough away that they don't matter, you try to put them in a place where, at least for some datasets, they might be useful. Both approaches look OK to me (they both also print a message); I just wanted to point out the inconsistency (it won't block the PR).

Collaborator Author


So my latest commits are now consistent with this behaviour: if some centers cannot be produced by "fitting some data", they are assigned to [-np.inf, -np.inf, ...] and transform anything into 0.
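A quick numerical check of why such filler centers are inert (assuming an exponential contrast such as Atol's default):

    import numpy as np

    center = np.array([-np.inf, -np.inf])
    point = np.array([0.3, 0.7])
    # Any finite point is infinitely far from the filler center,
    # so its contribution exp(-distance) vanishes.
    distance = np.linalg.norm(point - center)  # inf
    print(np.exp(-distance))                   # 0.0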

n_points = len(filtered_measures_concat)
if n_points < n_clusters:
    # If not enough points to fit (including 0), let's arbitrarily put centers in [0, 1)^2
    print(f"[Atol] had {n_points} points to fit {n_clusters} clusters, adding random points in [0, 1)^2.")
Member


Wait, is Atol restricted to 2D? I thought it wasn't restricted to diagrams and could work in any dimension. Though if we fit on no points, I don't know how to guess a dimension for the random centers.

Collaborator Author


It is not; this is a hard one!

Member


There is still the possibility of giving up (i.e. raising an exception if there are no points at all).

Collaborator Author


I have now favored and implemented this approach!
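That is, a guard along these lines runs before any padding (a sketch; variable names are taken from the snippets above, not the exact merged code):

    # Give up early: with zero points there is no way to even guess
    # a dimension for filler centers.
    if n_points == 0:
        raise ValueError("[Atol] cannot fit: no points were given to fit on.")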

filtered_weights_concat = np.concatenate((filtered_weights_concat, np.ones(shape=(n_clusters - n_points))))
filtered_measures_concat = np.concatenate((filtered_measures_concat, np.random.random((n_clusters - n_points, 2))))

self.quantiser.fit(X=filtered_measures_concat, sample_weight=filtered_weights_concat)
Member


When the number of points equals the number of requested clusters, are clustering algorithms usually good at finding all of the points as centers? Another possibility, in the "too few points" case, would be to write to self.centers directly and skip the use of the quantiser.
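A sketch of that alternative, assuming the self.centers attribute and the variable names from the snippets above:

    if n_points <= n_clusters:
        # Too few points: take every point as a center and skip the quantiser.
        self.centers = filtered_measures_concat.copy()
    else:
        self.quantiser.fit(X=filtered_measures_concat, sample_weight=filtered_weights_concat)
        self.centers = self.quantiser.cluster_centers_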

Collaborator Author

@martinroyer Jun 28, 2024


First of all, thanks for all the great remarks.
On this specific question, I think they are OK at that, yes. And while I like your suggestion, if there are only, say, two points and we want five clusters, it is probably highly valuable to take those two points as cluster centers and do whatever for the other three.
So I am still thinking, but sort of converging toward doing the following:

  • filter more with np.unique (there is an overhead, because all we really need to know is how many unique points, up to n_clusters, there are in the target set)
  • learn what we can on those unique points, fill in the rest of the centers with [-np.inf, -np.inf] (closer to Landscapes), and for these centers set the inertias to 0, which hopefully would not raise any errors at transform time.

Will work on that!

Member


  • filter more with np.unique (there is an overhead, because all we really need to know is how many unique points, up to n_clusters, there are in the target set)

Yeah, I don't think there is a function that starts sorting or hash-tabling and stops once it has found k distinct elements... I think the common case will be that the first n_clusters points are distinct; maybe you could first check that with np.unique (does it actually work on points?), and only fall back to something slower in the other cases?
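For reference, np.unique does handle 2-D point arrays when given axis=0 (it operates on rows):

    import numpy as np

    points = np.array([[0., 1.], [0., 1.], [2., 3.]])
    print(np.unique(points, axis=0))  # two distinct rows: [[0. 1.], [2. 3.]]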

Actually, do we really care? Having a duplicated feature is not much worse than having a feature that is always 0, and this is only for a degenerate case that really shouldn't happen anyway, so I don't think we should make too much effort beyond "warn and don't crash". I only mentioned the possibility of duplicate points in passing, but it does not bother me.

  • fill in the rest of the centers with [-np.inf, -np.inf] (closer to Landscapes), and for these centers set the inertias to 0, which hopefully would not raise any errors at transform time.

As you like. Unless someone else chimes in, I don't really care what points you use as filler.

Collaborator Author


I agree that we do not really care, so I have dropped this!

weights_concat = np.concatenate(sample_weight)

self.quantiser.fit(X=measures_concat, sample_weight=weights_concat)
# In fitting we remove infinite birth/death time points so that every center is finite. We do not care about duplicates.
Contributor


I am not sure whether we should ask the user to do this before calling Atol (with DiagramSelector(use=True, point_type="finite") for instance), or whether all representations that require no infinite birth/death points should do the same. To be discussed.
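For the first option, the user-side filtering would look something like this in a sklearn Pipeline (the quantiser choice is an assumption):

    from sklearn.cluster import KMeans
    from sklearn.pipeline import Pipeline
    from gudhi.representations.preprocessing import DiagramSelector
    from gudhi.representations.vector_methods import Atol

    # Drop infinite birth/death points explicitly before Atol sees them.
    pipe = Pipeline([
        ("finite", DiagramSelector(use=True, point_type="finite")),
        ("atol", Atol(quantiser=KMeans(n_clusters=2))),
    ])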

Collaborator Author


I don't have a strong opinion on this, but one comment: Atol is slightly different from the other vectorizers in that it vectorizes $d$-dimensional measures, not just diagrams, so the filtering done in Atol.fit is specific and needed, and hopefully guarantees to an extent that Atol.fit will work in that general case.
Perhaps there is a case to be made for filtering-but-printing/warning when a vectorizer that requires no infinite death points receives one?

Contributor


No strong opinion here either... @mglisse or @MathieuCarriere?

Member


Perhaps there is a case to be made for filtering-but-printing/warning when a vectorizer that requires no infinite death points receives one?

Warning if the filtering removed any infinite points looks like a good idea (I think; not 100% sure). And here the default behaviour of warnings.warn, printing the warning only once, looks good, since this is about explaining that users may want to filter by themselves before Atol, not something specific to one dataset.

It is true that it would be cleaner not to filter in Atol, but it should be cheap enough, so for convenience it doesn't really bother me.

Collaborator Author


Today's thought: I wonder whether it isn't simpler to just have vectorizers handle infinite points themselves (for instance by filtering).
Since diagrams are the primary focus of vectorizers, I think it's OK for vectorizers to handle diagrams in the general case (that is, including infinite points), and to leave the special treatments (using DiagramSelector) to special cases.


if n_points < n_clusters:
    # If not enough points to fit (including 0), we will arbitrarily put centers as [-np.inf]^measure_dim at the end.
    print(f"[Atol] had {n_points} points to fit {n_clusters} clusters, adding meaningless cluster centers.")
Contributor


We could also use warnings here, no?

Suggested change:

    - print(f"[Atol] had {n_points} points to fit {n_clusters} clusters, adding meaningless cluster centers.")
    + warnings.warn(f"[Atol] had {n_points} points to fit {n_clusters} clusters, adding meaningless cluster centers.", RuntimeWarning)

This also requires, at the beginning of the file:

    import warnings

Member


A drawback of warnings.warn is that by default, IIRC, it warns only once. That's good for something like a deprecation warning or a missing dependency, but for this kind of runtime warning, I think we want it to appear "always" by default (unless the user explicitly overrides it, of course). It is probably possible.

Contributor

@VincentRouvreau Jul 3, 2024


Yes, it is up to the end user to activate it, but print may not be strong enough in this case.
OK, we can keep it like that, but we will have to think about how we want to handle these cases.

Member


If we want the warning to appear multiple times, one possibility is

    with warnings.catch_warnings():
        warnings.simplefilter("always")
        warnings.warn("foo", RuntimeWarning)

Another one is to ensure that the message is different every time (with a global counter?).
A last one is to use print (preferably to sys.stderr).

I didn't see a convenient way to say "print the warning as you would if this were the first time, but if warnings have been disabled, don't". The closest I came is replacing the second line above with warnings.simplefilter("always", append=True), i.e. "if there is no explicit filter, default to 'always' for this warning", which doesn't seem too bad.
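Spelled out, that last variant would be (message text assumed):

    import warnings

    # "If there is no explicit filter, default to 'always' for this warning":
    # the appended filter is consulted last, so any filter the user installed
    # (user filters are normally prepended) still takes precedence.
    warnings.simplefilter("always", append=True)
    warnings.warn("[Atol] adding meaningless cluster centers.", RuntimeWarning)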

Collaborator Author


I first implemented the warnings suggestion, then rolled back on it because I couldn't clearly see the benefits over print: it doesn't seem to me (I might be wrong) that the situation where we add infinite "bogus" centers is so dangerous or bad for the user that it needs special attention.

Member


There is also https://docs.python.org/3/library/logging.html in the standard library, where we could for instance log this with a severity INFO.

Whenever we use a plain print for a warning (i.e. not normal output that may be saved in a file and used as input for something else), it may be good to give it the argument file=sys.stderr (with import sys at the beginning of the file) so the two kinds of output don't get mixed up.
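Both options side by side (the message text is just an example):

    import sys
    import logging

    # Plain print, but on stderr, so redirected stdout stays clean:
    print("[Atol] adding meaningless cluster centers.", file=sys.stderr)

    # Or through the logging module, with INFO severity:
    logging.getLogger(__name__).info("[Atol] adding meaningless cluster centers.")

Note that INFO records are only displayed if the application configures logging accordingly; the default level of the root logger is WARNING.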

Collaborator Author

@martinroyer Jul 24, 2024


I wouldn't say this print qualifies as a warning, an error, or abnormal output; I would say it is additional information on the nature of your fit data X with respect to your expectations on the size of the vectorization. Maybe I misunderstood, but it would seem weird to me to give this print a specific treatment compared to the others in the file.

Maybe I need to reword it then? Or we could also remove it altogether.

Member


Let me rephrase. My comment applies to all the prints in this file, not just this one (I have been lazy about enforcing this). Which also means it won't be a requirement for this PR (we can change them all later in another PR).

The idea is:

  • I write a script.py that computes some Atol transform, and prints the output (maybe "1.5 2.3\n4.0 0.").
  • I run python script.py > vect.txt to redirect the output and store it in a file.
  • I now run another program that reads vect.txt (say with numpy.loadtxt)

If I have a string "adding meaningless cluster centers" in the middle of vect.txt, the program reading it will get confused and fail. If we use print("bla", file=sys.stderr) in Atol, then python script.py > vect.txt will print "bla" on the console and not in the file (stderr is a separate stream, which could be redirected separately with 2> log.txt), and the file remains unpolluted and usable by further programs.
stderr is for any kind of notice, warning, etc., not just errors.

Is the principle of stdout (the default for print) vs stderr clearer this way?

Collaborator Author


Yes, I understand the idea of stderr/stdout. To that end, the logging module is much cleaner if we want to go in that direction.

@@ -9,6 +9,8 @@
# - 2020/12 Gard: A more flexible Betti curve class capable of computing exact curves.
# - 2021/11 Vincent Rouvreau: factorize _automatic_sample_range

import warnings
Member


If there is no warnings.warn left, this import can probably go as well?

Collaborator Author


oops, yes!

@mglisse mentioned this pull request Jul 25, 2024
@VincentRouvreau added the 3.11.0 (GUDHI version 3.11.0) label Jul 29, 2024
@VincentRouvreau merged commit e6227ae into GUDHI:master Jul 29, 2024
7 checks passed