sklearn compatibility with feature_names_out #395
I agree that aligning with the updated sklearn version 1.2 would be very helpful. It goes without saying that when using pipelines with multiple transformers, it is useful to have a dataframe at the output end with features traceable to the input features. The main benefit occurs downstream, when the transformed dataframes are passed on to predictors. For example, I currently have a situation where both XGBoost and LightGBM can take categorical features directly as input without having to encode them; they rely on the categorical dtype of the column in the dataframe to identify such features. Without a dataframe output from the transformers in the pipeline with properly labelled columns, it is unnecessarily complicated, if not virtually impossible, to change the dtype to categorical within the pipeline before it goes on to the predictor. I was very excited by this new feature in sklearn 1.2, hoping it would deal with this, not realising that third-party transformers also have to play ball and support the API. I therefore look forward to this being implemented soon! Narayan
Yes, I agree with all comments. In fact, I am playing with the `TargetEncoder`, and I found that it causes problems when integrated in a `Pipeline` or in a `ColumnTransformer` of sklearn, because `get_feature_names_out` will be called with a parameter and raises an error.

I will try to submit a PR if I find a simple solution that does not break anything.
After a more careful review, I think this is more work than expected. The reason is that, if you want to include a category encoder in a pipeline, unless the encoder is in the initial position, the input it receives is a numpy ndarray, not a pandas dataframe. The column names therefore won't exist (they will default to integers from 0 to n-1), which is not good behaviour. You cannot rely on the names of the inputs you see during fitting, because they might not exist! I think that, unfortunately, each category encoder will have to define its own method for this, because they behave differently depending on the `EncodingRelation`. So it's much more work than expected... Well, in the case of ONE_TO_ONE relationships it will be straightforward. I have made a PR with a partial fix, to see if the main contributors of the project like the approach and we can continue the work.
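The per-relation logic can be sketched in plain Python (the function names here are illustrative, not part of category_encoders): a ONE_TO_ONE encoder can simply pass the input names through, while an expanding encoder must emit one name per generated column.

```python
# Hypothetical sketch of relation-dependent output feature names.
# These helpers are illustrative only; they are not the library's API.

def one_to_one_names(input_features):
    """ONE_TO_ONE relation (e.g. ordinal/target-style encoders):
    every input column maps to exactly one output column, so the
    input names can be passed straight through."""
    return list(input_features)

def one_to_many_names(input_features, categories_per_feature):
    """Expanding relation (e.g. one-hot-style encoders): each input
    column yields one output column per learned category."""
    names = []
    for feat in input_features:
        for cat in categories_per_feature[feat]:
            names.append(f"{feat}_{cat}")
    return names

print(one_to_one_names(["color", "size"]))
# -> ['color', 'size']
print(one_to_many_names(
    ["color", "size"],
    {"color": ["red", "blue"], "size": ["S", "M", "L"]},
))
# -> ['color_red', 'color_blue', 'size_S', 'size_M', 'size_L']
```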
Hi Jaime, thanks for your work. When exactly does the pipeline example fail? I think we have tests for that. I agree that it is a lot of work. Basically, at the moment output feature names are calculated on the fly during fitting, whereas in sklearn the logic to determine output feature names lives in the `get_feature_names_out` function itself.
This is exactly the point, and also the way sklearn does it.
Thanks a lot!! Yes, I will be willing to invest more time, but I'd like to share my ideas with you first so that we agree on how to tackle this...

I will share a Google Colab notebook with you to explain the issue more carefully.
Ok, here is the colab notebook explaining the issue and a path to a solution (I hope it makes sense!). https://colab.research.google.com/drive/1lJJvXyAOoG9MBtNiH2qSTrqbzZtEPTeB?usp=sharing As a side comment, I think that in general the fact that inputs and outputs of transformers can be either dataframes or numpy arrays makes things more complicated than they need to be.
Hi Jaime, I think this is going in the right direction:

Here's a list of encoders and what the output feature names will look like:

Does this make it clear? Does it make sense to you? Did I miss anything?
Hello Paul! Thanks a lot. I agree it makes a lot of sense to follow the same path as sklearn. I will start by creating more tests for feature names, at least for the following cases (for all encoders):

And then I will try to implement all needed changes following how sklearn does it. I have some questions:
Maybe this was a workaround that we should try to fix?
Hi, thanks for sharing the notebook.
Ok, good. Yes, I will try to follow the sklearn approach as much as possible.

Anyway, the second question is a technical point; I think I can start and we can adjust later. It's more important for me to be clear on whether we want to change the API (regarding the `return_df` parameter).
Perfect! What do you think?
Thank you. Yes, this could be a good idea... However, to be honest, I am feeling less and less comfortable and confident with this PR. I started early this morning preparing the new test suites, following the order of the encoders, and I panicked when I discovered how entangled this library is with having pandas dataframes as inputs.

However, it is not straightforward to achieve similar functionality to sklearn here. The big difference is that sklearn transformers work internally on numpy arrays, whereas this library assumes pandas dataframes. And it's not so easy to get around this problem. The assumption of the input being a dataframe is everywhere! So I think there are two possible paths:

What do you think?
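For the ndarray case, sklearn's own convention is to fall back to generated default names `x0 … x{n-1}` when no column names were seen. A minimal sketch of that fallback (the function name is hypothetical, not sklearn's internal helper):

```python
# Sketch of the fallback sklearn applies when the fitted input carried
# no column names (e.g. a numpy ndarray): generate deterministic
# defaults "x0" ... "x{n-1}". The function name is illustrative only.

def resolve_input_features(input_features, n_features, fitted_names=None):
    if input_features is not None:
        return list(input_features)              # caller-supplied names win
    if fitted_names is not None:
        return list(fitted_names)                # names remembered from fit
    return [f"x{i}" for i in range(n_features)]  # sklearn-style defaults

print(resolve_input_features(None, 3))               # -> ['x0', 'x1', 'x2']
print(resolve_input_features(None, 2, ["a", "b"]))   # -> ['a', 'b']
print(resolve_input_features(["c1", "c2"], 2))       # -> ['c1', 'c2']
```

With a rule like this, an encoder in the middle of a pipeline still produces stable, traceable output names even when it only ever saw a bare ndarray.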
I have tested the second solution and it works OK. All new tests pass without issues, and only small modifications of the codebase were needed. If setting this configuration is acceptable, please check the PR. However, the old tests need plenty of adjustments, because sometimes they use plain lists as inputs (which causes a problem when trying to build the pandas dataframe) or because they test `return_df=False` (which, if we follow this path, won't have any effect). I am trying to repair them, but it's taking more time than expected... So I believe it's better if I leave it like this and wait for your confirmation as a maintainer...
You're right, this library internally works with dataframes.

This is not very pretty, but it works. However, I think that if we can make the existing approach work, we should rather do that (and maybe add some docs on how to use the encoders with pipelines) instead of refactoring all of the code, because we don't have (or at least I don't have) the resources for this. Nevertheless, we can include that functionality. Thanks for pointing all of this out. Great work so far! I'm also not super happy with the way it is ^^
I see your points! I think I would prefer a huge refactoring, but I don't have the time, knowledge, or energy to do that. But you are right: if there is a solution that, in order to work 100% smoothly, involves only setting the output configuration, it is preferable. This is what I have done in the last commit that you'll see in the PR. I think it's a good tradeoff, and the changes are truly minimal. I like it much more than my previous approach of forcing the configuration globally. This is a bit off topic, but I am very, very fond of sklearn. Going back to our work, please take a look at the PR and tell me if you are happy with it or if you want to modify something. Thanks!
Please check the last commit I uploaded just now, 673c876, because I fixed some tests that were failing (the main reason was that `get_feature_names_out` outputs a `np.array`, as in sklearn).
Great stuff you've been doing. I'll review your code probably on Friday or the weekend, since I'll be a little busy for the rest of the week. Your support (and thorough knowledge about pipelines) is greatly appreciated |
Thanks a lot for your kind words. It's been a pleasure. :)
Did #398 leave anything from this issue to do? (If so, can you summarize it, maybe spinning off a new issue; and if not, close the issue?)
Hi Ben, I've summarized the open point in #406 and will close this one.
Expected Behavior

The behaviour of `get_feature_names_out` should be the same as for sklearn, or there should be an explanation in the docs.

Actual Behavior

At the moment, the `feature_names_out_` attribute is set in `fit` and returned by the `get_feature_names_out` function. In sklearn, the `get_feature_names_out` function does not just return the values set in `fit`, but determines the output feature names from the given input feature names if possible.

Examples

`n_components`

I'd like to discuss this idea a little with the community. Please add some pros and cons of this approach. I'm sure I'm missing something.

Pros

Cons
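The difference described above can be sketched with two toy transformers (illustrative names only, not the library's classes): one freezes output names at fit time, the other derives them on demand from `input_features`, which is the behaviour sklearn expects.

```python
# Illustrative-only sketch of the two behaviours being compared.

class StoresNamesAtFit:
    """Current behaviour: output names are frozen when fit sees the data."""
    def fit(self, column_names):
        self.feature_names_out_ = list(column_names)
        return self
    def get_feature_names_out(self, input_features=None):
        return self.feature_names_out_   # ignores input_features entirely

class DerivesNamesOnDemand:
    """sklearn-style behaviour: output names are computed from the names
    handed in at call time, falling back to the names seen during fit."""
    def fit(self, column_names):
        self.feature_names_in_ = list(column_names)
        return self
    def get_feature_names_out(self, input_features=None):
        names = input_features if input_features is not None else self.feature_names_in_
        return [f"{n}_encoded" for n in names]

a = StoresNamesAtFit().fit(["x0", "x1"])
b = DerivesNamesOnDemand().fit(["x0", "x1"])
print(a.get_feature_names_out(["age", "city"]))  # -> ['x0', 'x1'] (stale)
print(b.get_feature_names_out(["age", "city"]))  # -> ['age_encoded', 'city_encoded']
```

Only the second variant stays consistent when a pipeline renames or reorders inputs upstream, which is why sklearn puts the naming logic in `get_feature_names_out` rather than in `fit`.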