Some robustifying fixes on Atol fit and tests for vectorizers #1096
Conversation
[0, 1)^2 is an arbitrary, questionable choice.
- fit
- fit empty diagrams
- transform
- transform empty diagrams
- sklearn set_output
- sklearn compose with ColumnTransformer (not sure how this could fail with all the other tests but who knows)
Some details below, but the approach looks fine.
n_clusters = self.quantiser.n_clusters
n_points = len(filtered_measures_concat)
if n_points < n_clusters:
    # If not enough points to fit (including 0), let's arbitrarily put centers in [0, 1)^2
In #1062, for Landscapes, the fallback is a "useless" grid [-inf, -inf, -inf, ...] for which transform will produce only zeros. Here, instead of picking points far enough away that they don't matter, you try to put them in a place where, at least for some datasets, they might be useful. Both approaches look OK to me (they both also print a message); I just wanted to point out the inconsistency (it won't block the PR).
So the latest commits from me are now consistent with this behaviour: if some centers cannot be produced by "fitting some data", they will be assigned to [-np.inf, -np.inf, ...] and they transform everything to 0.
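A minimal sketch of why such filler centers are inert at transform time, assuming a Gaussian-like contrast function (illustrative only, not GUDHI's exact transform code):

```python
import numpy as np

# One real center and one [-inf, -inf] filler center (values are made up).
centers = np.array([[0.5, 0.5], [-np.inf, -np.inf]])
points = np.array([[0.2, 0.3], [1.0, 0.8]])

# The distance from any finite point to the filler center is +inf, so a
# Gaussian-like contrast exp(-d) evaluates to exp(-inf) = 0 for that center.
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
print(np.exp(-dists))  # the filler-center column is exactly 0
```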
n_points = len(filtered_measures_concat)
if n_points < n_clusters:
    # If not enough points to fit (including 0), let's arbitrarily put centers in [0, 1)^2
    print(f"[Atol] had {n_points} points to fit {n_clusters} clusters, adding random points in [0, 1)^2.")
Wait, is Atol restricted to 2D? I thought it wasn't restricted to diagrams and could work in any dimension. Though if we fit on no points, I don't know how to guess a dimension for the random centers.
It is not, this is a hard one!
There is still the possibility of giving up (i.e. raise an exception if there are no points at all).
I have now favored and implemented this approach!
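For illustration, a hypothetical version of that guard (the helper name is made up, and this is not necessarily the exact code committed):

```python
def _check_fit_points(filtered_measures_concat):
    # Give up early: with no points at all there is no way to even guess a
    # dimension for the centers, so raising is the only sane option here.
    if len(filtered_measures_concat) == 0:
        raise ValueError("[Atol] fit was given no (finite) points to cluster.")
```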
    filtered_weights_concat = np.concatenate((filtered_weights_concat, np.ones(shape=(n_clusters - n_points))))
    filtered_measures_concat = np.concatenate((filtered_measures_concat, np.random.random((n_clusters - n_points, 2))))

self.quantiser.fit(X=filtered_measures_concat, sample_weight=filtered_weights_concat)
When the number of points equals the number of requested clusters, are clustering algorithms usually good at finding all of the points as centers? Another possibility, in the "too few points" case, would be to write to self.centers directly and skip the use of the quantiser.
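A rough sketch of that alternative (hypothetical helper, not what the PR implements): take every available point as a center and pad the remaining slots, without invoking the quantiser at all.

```python
import numpy as np

def direct_centers(points, n_clusters):
    # With fewer points than clusters, every point becomes a center and the
    # remaining slots are padded with inert [-inf, ...] filler centers.
    n_points, dim = points.shape
    filler = np.full((n_clusters - n_points, dim), -np.inf)
    centers = np.concatenate([points, filler])
    inertias = np.zeros(n_clusters)  # singleton/filler centers carry no inertia
    return centers, inertias
```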
First of all, thanks for all the great remarks.
On this specific question, I think they are OK at that, yes. And while I like your suggestion, if there are only, say, two points and we want five clusters, it is probably highly valuable to take those two points as cluster centers and do whatever for the other three.
So I am still thinking, but sort of converging towards doing the following:
- filter more with np.unique (there is an overhead, because all we really need to know is how many unique points, up to n_clusters, there are in the target set),
- learn what we can on those unique points and fill in the rest of the centers with [-np.inf, -np.inf] (that would be closer to Landscapes), and for these centers write the inertias as 0, and hopefully that would not raise any errors at transform time.
Will work on that!
> - filter more with np.unique (there is an overhead, because all we really need to know is how many unique points, up to n_clusters, there are in the target set)

Yeah, I don't think there is a function that begins sorting or hash-tabling and stops once it has found k distinct elements... I think the common case will be that the first n_clusters points are distinct; maybe you could first check that with np.unique (does it actually work on points?), and only go for something slower in other cases?
Actually, do we really care? Having a duplicated feature is not much worse than having a feature that is always 0, and this is only for a degenerate case that really shouldn't happen anyway, so I don't think we should make too much effort beyond "warn and don't crash". I only mentioned the possibility of duplicate points in passing, but it does not bother me.
> - fill in the rest of the centers with [-np.inf, -np.inf] (that would be closer to Landscapes), and for these centers write the inertias as 0, and hopefully that would not raise any errors at transform time.

As you like. Unless someone else chimes in, I don't really care what points you use as filler.
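(Regarding "does np.unique actually work on points?" above: it does, when given axis=0 so that it deduplicates rows. A quick check:)

```python
import numpy as np

points = np.array([[0.0, 1.0], [0.0, 1.0], [2.0, 3.0]])
# axis=0 treats each row as one point; 3 points -> 2 distinct ones here.
print(np.unique(points, axis=0))  # [[0. 1.] [2. 3.]]
```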
I agree that we do not really care, so I have dropped this!
weights_concat = np.concatenate(sample_weight)

self.quantiser.fit(X=measures_concat, sample_weight=weights_concat)
# In fitting we remove infinite birth/death time points so that every center is finite. We do not care about duplicates.
I am not sure if we should ask the user to do so before calling Atol (with DiagramSelector(use=True, point_type="finite") for instance), or if all representations that require no infinite birth/death time points should do the same. To be discussed.
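For reference, the "ask the user" option would look something like this in an sklearn pipeline (a sketch; the quantiser choice is arbitrary):

```python
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from gudhi.representations import Atol, DiagramSelector

# Filter infinite birth/death points explicitly before handing diagrams to Atol.
pipe = Pipeline([
    ("finite", DiagramSelector(use=True, point_type="finite")),
    ("atol", Atol(quantiser=KMeans(n_clusters=2, random_state=0))),
])
```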
I don't have a strong opinion on this, but a comment: Atol is slightly different from the other vectorizers in that it vectorizes measures in general (not just diagrams), so Atol.fit is specific and needed, and hopefully guarantees to an extent that the transform will work in that general case.
Perhaps there is a case to be made for filtering-but-printing/warning if a vectorizer requiring no infinite death points receives one?
No strong opinion here either... @mglisse or @MathieuCarriere?
> Perhaps there is a case to be made for filtering-but-printing/warning if a vectorizer requiring no infinite death points receives one?

Warning if the filtering removed any infinite points looks like a good idea (I think, not 100% sure). (And here the default of warnings.warn to print the warning only once looks good, since this is about explaining that users may want to filter themselves before Atol, not something specific to one dataset.)
It is true that it would be cleaner not to filter in Atol, but it should be cheap enough, so for convenience, it doesn't really bother me.
Today's thought: I'm not sure it wouldn't be simpler to just have vectorizers handle infinite time points (for instance by filtering). Since diagrams are the primary focus of vectorizers, I think it's OK that vectorizers handle diagrams in the more general case (that is, including infinite points), and leave the special treatments (using DiagramSelector) to special cases.
if n_points < n_clusters:
    # If not enough points to fit (including 0), we will arbitrarily put centers as [-np.inf]^measure_dim at the end.
    print(f"[Atol] had {n_points} points to fit {n_clusters} clusters, adding meaningless cluster centers.")
We could also use warnings here, no?
print(f"[Atol] had {n_points} points to fit {n_clusters} clusters, adding meaningless cluster centers.") | |
warnings.warn(f"[Atol] had {n_points} points to fit {n_clusters} clusters, adding meaningless cluster centers.", RuntimeWarning) |
This also requires, at the beginning of the file:
import warnings
A drawback of warnings.warn is that by default, IIRC, it warns only once. That's good for something like a deprecation warning, or a missing dependency. But for this kind of runtime warning, I think we want it to appear "always" by default (unless the user explicitly overrides it of course). It is probably possible.
Yes, it is up to the end user to activate it. But print may not be strong enough in this case.
OK, we could keep it like that, but we have to think about how we want to handle these cases.
If we want the warning to appear multiple times, one possibility is

with warnings.catch_warnings():
    warnings.simplefilter("always")
    warnings.warn("foo", RuntimeWarning)

Another one is to ensure that the message is different every time (with a global counter?).
A last one is to use print (preferably to sys.stderr).
I didn't see a convenient way to say "if you would print the warning if this was the first time, print it anyway, but if warnings have been disabled, don't"; the closest I came is replacing the second line above with warnings.simplefilter("always", append=True), i.e. "if there is no explicit filter, default to 'always' for this warning", which doesn't seem too bad.
I first implemented the warnings suggestion, then rolled back on it because I couldn't clearly see the benefits over print: it doesn't seem to me (I might be wrong) that the situation where we add infinite "bogus" centers is so dangerous or bad for the user that it needs special attention.
There is also https://docs.python.org/3/library/logging.html in the standard library, where we could for instance log this with severity INFO.
Whenever we use a plain print for a warning (i.e. not a normal output that may be saved in a file and used as input for something else), it may be good to give it the argument file=sys.stderr (with import sys at the beginning) so the two kinds of output don't get mixed up.
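A sketch of the logging variant (the logger name is hypothetical; note that INFO records only show once the application configures logging, e.g. with logging.basicConfig(level=logging.INFO)):

```python
import logging

logger = logging.getLogger("gudhi.representations")  # hypothetical logger name

def notify_filler_centers(n_points, n_clusters):
    # Severity INFO: an informational notice, quiet by default and
    # filterable/redirectable by the user through the logging config.
    logger.info("[Atol] had %d points to fit %d clusters, adding meaningless cluster centers.",
                n_points, n_clusters)
```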
I wouldn't say this print qualifies as a warning or error or abnormal output; I would say it is additional information on the nature of your fit data X with respect to your expectations on the size of the vectorization. Maybe I misunderstood, but it would seem weird to me to give this print a specific treatment compared to the others in the file.
Maybe I need to reword it then? Or we could also remove it altogether.
Let me rephrase. My comment applied to all the prints in this file, not just this one (I have been lazy about enforcing this), which also means that it won't be a requirement for this PR (we can change them all later in another PR).
The idea is:
- I write a script.py that computes some Atol transform and prints the output (maybe "1.5 2.3\n4.0 0.").
- I run python script.py > vect.txt to redirect the output and store it in a file.
- I now run another program that reads vect.txt (say with numpy.loadtxt).
If I have a string "adding meaningless cluster centers" in the middle of vect.txt, the program reading it will get confused and fail. If we use print("bla", file=sys.stderr) in Atol, then python script.py > vect.txt will print "bla" on the console and not in the file (it is a separate stream, which could be redirected separately with 2> log.txt), and the file remains unpolluted and usable by further programs.
stderr is for any kind of notice, warning, etc., not just errors. Is the principle of stdout (the default for print) vs stderr clearer this way?
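Concretely, the script side of this would be (a two-line sketch):

```python
import sys

print("1.5 2.3")  # data: goes to stdout, ends up in vect.txt after `> vect.txt`
print("[Atol] adding meaningless cluster centers.", file=sys.stderr)  # notice: stays on the console
```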
Yes, I understand the idea of stderr vs stdout. To that end, the logging module is much cleaner if we want to go in that direction.
@@ -9,6 +9,8 @@
 # - 2020/12 Gard: A more flexible Betti curve class capable of computing exact curves.
 # - 2021/11 Vincent Rouvreau: factorize _automatic_sample_range

+import warnings
If there is no more warnings.warn, this can probably go as well?
oops, yes!
This introduces fixes that allow Atol to not crash on fitting, essentially by adding random points to the target.
This is a follow-up to PR #1017, which can now be dropped.
Option: I add a series of tests that only cover Atol for now, but are ready to be extended to all vectorization methods. So one could work on another vectorization method and directly use the same test battery (a rough sketch follows the list below) to see if there's a problem with that method's:
- fit,
- fit on empty diagrams,
- transform,
- transform on empty diagrams,
- using sklearn set_output,
- using sklearn compose with ColumnTransformer (not sure how this could fail with all the other tests but who knows).
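A rough sketch of what such a shared battery could look like (names are illustrative, only Atol is listed, and the test bodies are trimmed to one case):

```python
import numpy as np
import pytest
from sklearn.cluster import KMeans
from gudhi.representations import Atol

# One factory per vectorizer; extending the battery = appending to this list.
VECTORIZERS = [lambda: Atol(quantiser=KMeans(n_clusters=2, random_state=0))]
DIAGRAMS = [np.array([[0.0, 1.0], [0.0, 2.0]]), np.array([[1.0, 3.0]])]
EMPTY = [np.empty((0, 2))]

@pytest.mark.parametrize("make_vectorizer", VECTORIZERS)
def test_fit_and_transform_handle_empty_diagrams(make_vectorizer):
    vectorizer = make_vectorizer().fit(DIAGRAMS)
    vectorizer.transform(EMPTY)  # must not crash on empty diagrams
```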
This test list comes as a sort of probe for a shared common interface of vectorizers; it could be enriched/edited along the way.
(Naturally we could also work on the interface in a separate PR where we deal with the transform_one question too, see #1056 (comment).)