
Warning on RCA with chunk labeling not starting on zero or with gaps #275

Open
grudloff opened this issue Jan 20, 2020 · 2 comments

@grudloff (Contributor)

Description

RCA chunks are expected to start at zero and increase one by one; a warning is raised if they do not start at zero or contain gaps. Although these warnings are raised, they do not affect the result of the RCA fit.

Maybe we should be more user-friendly and allow the user to specify chunk ids as arbitrary non-negative integers (a negative value is interpreted as "not in any chunk"), even if they don't start at 0 or are not contiguous, just like we (and sklearn) do for methods that are fitted on a classic class vector y.

Steps/Code to Reproduce

from metric_learn import RCA

X = [[-0.05, 3.0], [0.05, -3.0],
     [0.1, -3.55], [-0.1, 3.55],
     [-0.95, -0.05], [0.95, 0.05],
     [0.4, 0.05], [-0.4, -0.05]]
chunks = [1, 1, 2, 2, 3, 3, 4, 4]

rca = RCA()
X = rca.fit_transform(X, chunks)

Expected Results

No unexpected warnings should be thrown.

Actual Results

The following warnings are thrown:

./metric-learn/rca.py:28: RuntimeWarning: Mean of empty slice.
  chunk_data[mask] -= chunk_data[mask].mean(axis=0)
./numpy/core/_methods.py:154: RuntimeWarning: invalid value encountered in true_divide
  ret, rcount, out=ret, casting='unsafe', subok=False)

Versions

Linux-5.0.0-37-generic-x86_64-with-Ubuntu-18.04-bionic
Python 3.6.9 (default, Nov 7 2019, 10:44:02)
[GCC 8.3.0]
NumPy 1.18.1
SciPy 1.4.1
Scikit-Learn 0.22.1
Metric-Learn 0.5.0

@perimosocordiae (Contributor)

Yes, it would be nicer to do pre-processing on chunk labels with np.unique(), similar to scikit-learn. It's potentially a performance hit, though, so we might want to allow users to skip it.
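For reference, a minimal sketch of the np.unique()-based preprocessing suggested above (the `remap_chunks` helper name is hypothetical, not part of metric-learn's API). `np.unique(..., return_inverse=True)` already yields contiguous 0-based codes, so arbitrary non-negative chunk ids can be remapped while negative ids are kept as "not in any chunk":

```python
import numpy as np

def remap_chunks(chunks):
    """Remap arbitrary non-negative chunk ids to contiguous ids
    starting at 0. Negative ids mean "not in any chunk" and are
    preserved as -1. (Hypothetical helper, for illustration only.)"""
    chunks = np.asarray(chunks)
    remapped = np.full(chunks.shape, -1, dtype=int)
    mask = chunks >= 0
    # return_inverse gives, for each element, its index into the
    # sorted unique values -- i.e. a contiguous 0-based relabeling
    _, remapped[mask] = np.unique(chunks[mask], return_inverse=True)
    return remapped

print(remap_chunks([1, 1, 2, 2, 3, 3, 4, 4]))   # -> [0 0 1 1 2 2 3 3]
print(remap_chunks([-1, 5, 5, 10, 10, -1]))     # -> [-1  0  0  1  1 -1]
```

With this preprocessing, the repro snippet above would produce no empty chunk 0 and hence no RuntimeWarning.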

@grudloff (Contributor, Author)

To solve this, it should only be necessary to replace the current max() computation with a unique() computation; I don't think the performance difference would be significant. In any case, this issue arises in the chunk mean centering, which is going to be deprecated in the future.
