Skip to content
This repository has been archived by the owner on Dec 6, 2023. It is now read-only.

ValueError: all the input array dimensions except for the concatenation axis must match exactly #29

Open
goerch opened this issue Nov 5, 2019 · 4 comments

Comments

@goerch
Copy link

goerch commented Nov 5, 2019

The following cell from my notebook

from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectFpr, SelectFdr, SelectFwe
from sklearn.preprocessing import FunctionTransformer, StandardScaler, MaxAbsScaler, RobustScaler
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.pipeline import Pipeline
from stability_selection import StabilitySelection
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
vt = VarianceThreshold()
scl = RobustScaler(with_centering=False)
lr = LogisticRegression(random_state=42, solver='lbfgs', C=.05, max_iter=10000, n_jobs=-1, verbose=1)
pipe = Pipeline([('vt', vt), ('scl', scl), ('lr', lr)])
clf = StabilitySelection(base_estimator=pipe, lambda_name='lr__C', lambda_grid=np.array([.075, .05, .025]), n_bootstrap_iterations=3)

gives the output

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:   34.1s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:   32.4s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:   32.9s finished

and then the traceback

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-41-f62a92b4924d> in <module>
     38 clf = StabilitySelection(base_estimator=pipe, lambda_name='lr__C', lambda_grid=np.array([.075, .05, .025]), n_bootstrap_iterations=3)
     39 
---> 40 clf.fit(X_train, y_train)
     41 #print(lr.C_)
     42 print(pd.Series(features)[vt.get_support(indices=True)[sel.get_support(indices=True)[clf.get_support()]]].sort_values().tail(60))

/opt/conda/lib/python3.6/site-packages/stability_selection/stability_selection.py in fit(self, X, y)
    344               for subsample in bootstrap_samples)
    345 
--> 346             stability_scores[:, idx] = np.vstack(selected_variables).mean(axis=0)
    347 
    348         self.stability_scores_ = stability_scores

/opt/conda/lib/python3.6/site-packages/numpy/core/shape_base.py in vstack(tup)
    281     """
    282     _warn_for_nonsequence(tup)
--> 283     return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
    284 
    285 

ValueError: all the input array dimensions except for the concatenation axis must match exactly

X_train is a scipy crs_matrix. Is this expected behaviour?

@goerch
Copy link
Author

goerch commented Nov 6, 2019

Monkey patching StabilitySelection with

from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.utils import check_array, check_random_state, check_X_y, safe_mask
from sklearn.externals.joblib import Parallel, delayed
from sklearn.feature_selection import SelectFromModel

def _return_estimator_from_pipeline(pipeline):
    """Returns the final estimator in a Pipeline, or the estimator
    if it is not"""
    if isinstance(pipeline, Pipeline):
        return pipeline._final_estimator
    else:
        return pipeline

def _bootstrap_generator(n_bootstrap_iterations, bootstrap_func, y,
                         n_subsamples, random_state=None):
    for _ in range(n_bootstrap_iterations):
        subsample = bootstrap_func(y, n_subsamples, random_state)
        if isinstance(subsample, tuple):
            for item in subsample:
                yield item
        else:
            yield subsample
            
def _fit_bootstrap_sample(base_estimator, X, y, lambda_name, lambda_value,
                          threshold=None):
    """
    Fits base_estimator on a bootstrap sample of the original data,
    and returns a mas of the variables that are selected by the fitted model.

    Parameters
    ----------
    base_estimator : Estimator
        Estimator to be fitted on the data

    X : {array-like, sparse matrix}, shape = [n_samples, n_features]
        The training input samples.

    y : array-like, shape = [n_samples]
        The target values.

    lambda_name : str
        Name of the penalization parameter of base_estimator

    lambda_value : float
        Value of the penalization parameter

    threshold : string, float, optional default None
        The threshold value to use for feature selection. Features whose
        importance is greater or equal are kept while the others are
        discarded. If "median" (resp. "mean"), then the ``threshold`` value is
        the median (resp. the mean) of the feature importances. A scaling
        factor (e.g., "1.25*mean") may also be used. If None and if the
        estimator has a parameter penalty set to l1, either explicitly
        or implicitly (e.g, Lasso), the threshold used is 1e-5.
        Otherwise, "mean" is used by default.

    Returns
    -------
    selected_variables : array-like, shape = [n_features]
        Boolean mask of selected variables.
    """

    base_estimator.set_params(**{lambda_name: lambda_value})
    base_estimator.fit(X, y)

    # TODO: Reconsider if we really want to use SelectFromModel here or not
    selector_model = _return_estimator_from_pipeline(base_estimator)
    variable_selector = SelectFromModel(estimator=selector_model,
                                        threshold=threshold,
                                        prefit=True)
    return variable_selector.get_support()

def fit(self, X, y):
    """Fit the stability selection model on the given data.

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape = [n_samples, n_features]
        The training input samples.

    y : array-like, shape = [n_samples]
        The target values.
    """

    self._validate_input()

    X, y = check_X_y(X, y, accept_sparse='csr')

    n_samples, n_variables = X.shape
    n_subsamples = np.floor(self.sample_fraction * n_samples).astype(int)
    n_lambdas = self.lambda_grid.shape[0]

    base_estimator = clone(self.base_estimator)
    random_state = check_random_state(self.random_state)
    stability_scores = np.zeros((n_variables, n_lambdas))

    for idx, lambda_value in enumerate(self.lambda_grid):
        if self.verbose > 0:
            print("Fitting estimator for lambda = %.5f (%d / %d) on %d bootstrap samples" %
                  (lambda_value, idx + 1, n_lambdas, self.n_bootstrap_iterations))

        bootstrap_samples = _bootstrap_generator(self.n_bootstrap_iterations,
                                                 self.bootstrap_func, y,
                                                 n_subsamples, random_state=random_state)

        selected_variables = Parallel(
            n_jobs=self.n_jobs, verbose=self.verbose,
            pre_dispatch=self.pre_dispatch
        )(delayed(_fit_bootstrap_sample)(clone(base_estimator),
                                         X=X[safe_mask(X, subsample), :],
                                         y=y[subsample],
                                         lambda_name=self.lambda_name,
                                         lambda_value=lambda_value,
                                         threshold=self.bootstrap_threshold)
          for subsample in bootstrap_samples)

        for i in range(0, self.n_bootstrap_iterations):
            print("Sample ", i, " selected ", selected_variables[i].shape)
        stability_scores[:, idx] = np.vstack(selected_variables).mean(axis=0)

    self.stability_scores_ = stability_scores
    return self

StabilitySelection.fit = fit

the debug output for the notebook is

Sample  0  selected  (30969,)
Sample  1  selected  (31011,)
Sample  2  selected  (30959,)

which indicates _fit_bootstrap_sample doesn't work with the feature selection of the notebook pipeline.

goerch pushed a commit to goerch/stability-selection that referenced this issue Nov 6, 2019
@thuijskens
Copy link
Collaborator

Hi @goerch - thanks for raising this issue.

It is hard for me to debug this without a minimal reproducible example (as I don't have access to your data).

Looking at your code, I wonder if the VarianceThreshold step is the cause for the different shapes in the output of _fit_bootstrap_sample. Depending on the bootstrap sample, this transformer may drop different variables which will result in a different data matrix that is passed to SelectFromModel. Perhaps you can log the shapes of the input and output data to VarianceThreshold and compare that to the shape of selected_variables.

@goerch
Copy link
Author

goerch commented Nov 6, 2019

Hi @thuijskens,

your assessment is correct. Commit goerch@9400866 contains a failing reproduction.

goerch pushed a commit to goerch/stability-selection that referenced this issue Nov 6, 2019
@goerch
Copy link
Author

goerch commented Nov 6, 2019

Hi @thuijskens,

commit goerch@37ac9be fixes the failing test case. But I'm not certain if it is even possible to support all different sklearn pipeline configurations.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants