sklearn compatibility with feature_names_out #395
I agree that aligning with the updated sklearn version 1.2 would be very helpful. It goes without saying that when using pipelines with multiple transformers, it is useful to have a dataframe at the output end with features traceable to the input features. The main benefit occurs downstream, when the transformed dataframes are passed on to predictors. For example, I currently have a situation where both XGBoost and LightGBM can take categorical features directly as input without having to encode them; they rely on the categorical dtype of the column in the dataframe to identify such features. Without a dataframe output from the transformers in the pipeline with properly labelled columns, it is unnecessarily complicated, if not virtually impossible, to change the dtype to categorical within the pipeline before it goes on to the predictor. I was very excited by this new feature in sklearn 1.2, hoping it would deal with this, not realising that third-party transformers also have to play ball and support the API. I therefore look forward to this being implemented soon! Narayan
Yes, I agree with all comments. In fact, I am playing with the `TargetEncoder`, and I found that it causes problems when integrated in a `Pipeline` or in a `ColumnTransformer` of sklearn, because `get_feature_names_out` will be called with a parameter and raises an error.

I will try to submit a PR if I find a simple solution that does not break anything.
After a more careful review, I think this is more work than expected. The reason is that, if you want to include a category encoder in a pipeline, unless the encoder is in the initial position, the input it receives is a numpy ndarray, not a pandas dataframe. The column names therefore won't exist (they will default to integers from 0 to n-1), which is not good behaviour. You cannot rely on the names of the inputs you see during fitting, because they might not exist! I think that, unfortunately, each category encoder will have to define its own method for this, because they behave differently depending on the `EncodingRelation`. So it's much more work than expected... Well, in the case of ONE_TO_ONE relationships it will be straightforward. I have made a PR with a partial fix, to see if the main contributors of the project like the approach and we can continue the work.
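The per-relation logic can be sketched in plain Python (the function names here are illustrative, not part of category_encoders): a ONE_TO_ONE encoder can simply pass the input names through, while an expanding encoder must emit one name per generated column.

```python
# Hypothetical sketch of relation-dependent output feature names.
# These helpers are illustrative only; they are not the library's API.

def one_to_one_names(input_features):
    """ONE_TO_ONE relation (e.g. ordinal/target-style encoders):
    every input column maps to exactly one output column, so the
    input names can be passed straight through."""
    return list(input_features)

def one_to_many_names(input_features, categories_per_feature):
    """Expanding relation (e.g. one-hot-style encoders): each input
    column yields one output column per learned category."""
    names = []
    for feat in input_features:
        for cat in categories_per_feature[feat]:
            names.append(f"{feat}_{cat}")
    return names

print(one_to_one_names(["color", "size"]))
# -> ['color', 'size']
print(one_to_many_names(
    ["color", "size"],
    {"color": ["red", "blue"], "size": ["S", "M", "L"]},
))
# -> ['color_red', 'color_blue', 'size_S', 'size_M', 'size_L']
```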
Hi Jaime, thanks for your work. When exactly does the pipeline example fail? I think we have tests for that. I agree that it is a lot of work. Basically, at the moment output feature names are calculated on the fly during fitting, whereas in sklearn the logic to determine output feature names lives in the `get_feature_names_out` function itself.
This is exactly the point, and also the way sklearn does it.
Thanks a lot!! Yes, I will be willing to invest more time, but I'd like to share my ideas with you first so that we agree on how to tackle this...

I will share a Google Colab notebook with you to explain the issue more carefully.
Ok, here is the colab notebook explaining the issue and a path to a solution (I hope it makes sense!). https://colab.research.google.com/drive/1lJJvXyAOoG9MBtNiH2qSTrqbzZtEPTeB?usp=sharing As a side comment, I think that in general the fact that inputs and outputs of transformers can be either dataframes or numpy arrays makes things more complicated than they need to be.
Hi Jaime, I think this is going in the right direction:

Here's a list of encoders and what the output feature names will look like:

Does this make it clear? Does it make sense to you? Did I miss anything?
Hello Paul! Thanks a lot. I agree it makes a lot of sense to follow the same path as sklearn. I will start by creating more tests for feature names, at least for the following cases (for all encoders):

And then I will try to implement all needed changes following how sklearn does it. I have some questions:
Maybe this was a workaround that we should try to fix?
Hi, thanks for sharing the notebook.
Ok, good. Yes, I will try to follow the sklearn approach as much as possible.

Anyway, the second question is a technical point; I think I can start and we can adjust later. It's more important for me to be clear on whether we want to change the API (regarding the `return_df` parameter).
Perfect! What do you think?
Thank you. Yes, this could be a good idea... However, to be honest, I am feeling less and less comfortable and confident with this PR. I started early this morning preparing the new test suites, following the order of the encoders, and I panicked when I discovered how entangled this library is with having pandas dataframes as inputs.

However, it is not straightforward to achieve similar functionality to sklearn here. The big difference is that sklearn transformers work internally on numpy arrays, whereas this library assumes pandas dataframes. And it's not so easy to get around this problem. The assumption of the input being a dataframe is everywhere! So I think there are two possible paths:

What do you think?
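For the ndarray case, sklearn's own convention is to fall back to generated default names `x0 … x{n-1}` when no column names were seen. A minimal sketch of that fallback (the function name is hypothetical, not sklearn's internal helper):

```python
# Sketch of the fallback sklearn applies when the fitted input carried
# no column names (e.g. a numpy ndarray): generate deterministic
# defaults "x0" ... "x{n-1}". The function name is illustrative only.

def resolve_input_features(input_features, n_features, fitted_names=None):
    if input_features is not None:
        return list(input_features)              # caller-supplied names win
    if fitted_names is not None:
        return list(fitted_names)                # names remembered from fit
    return [f"x{i}" for i in range(n_features)]  # sklearn-style defaults

print(resolve_input_features(None, 3))               # -> ['x0', 'x1', 'x2']
print(resolve_input_features(None, 2, ["a", "b"]))   # -> ['a', 'b']
print(resolve_input_features(["c1", "c2"], 2))       # -> ['c1', 'c2']
```

With a rule like this, an encoder in the middle of a pipeline still produces stable, traceable output names even when it only ever saw a bare ndarray.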
I have tested the second solution and it works OK. All new tests pass without issues, and only small modifications of the codebase were needed. If setting this configuration is acceptable, please check the PR. However, the old tests need plenty of adjustments, because sometimes they use plain lists as inputs (which causes a problem when trying to build the pandas dataframe) or because they test `return_df=False` (which, if we follow this path, won't have any effect). I am trying to repair them, but it's taking more time than expected... So I believe it's better if I leave it like this and wait for your confirmation as a maintainer...
You're right, this library internally works with dataframes.

This is not very pretty, but it works. However, I think that if we can make the existing approach work, we should rather do that (and maybe add some docs on how to use the encoders with pipelines) instead of refactoring all of the code, because we don't have (or at least I don't have) the resources for this. Nevertheless, we can include that functionality. Thanks for pointing all of this out. Great work so far! I'm also not super happy with the way it is ^^
I see your points! I think I would prefer a huge refactoring, but I don't have the time, knowledge, or energy to do that. But you are right: if there is a solution that, in order to work 100% smoothly, involves only setting the output configuration, it is preferable. This is what I have done in the last commit that you'll see in the PR. I think it's a good tradeoff, and the changes are truly minimal. I like it much more than my previous approach of forcing the configuration globally. This is a bit off topic, but I am very, very fond of sklearn. Going back to our work, please take a look at the PR and tell me if you are happy with it or if you want to modify something. Thanks!
Please check the last commit I uploaded just now, 673c876, because I fixed some tests that were failing (the main reason was that `get_feature_names_out` outputs a `np.array`, as in sklearn).
Great stuff you've been doing. I'll review your code probably on Friday or the weekend, since I'll be a little busy for the rest of the week. Your support (and thorough knowledge about pipelines) is greatly appreciated |
Thanks a lot for your kind words. It's been a pleasure. :)
Did #398 leave anything from this issue to do? (If so, can you summarize it, maybe spinning off a new issue; and if not, close the issue?)
Hi Ben, I've summarized the open point in #406 and will close this one.
Expected Behavior

The behaviour of `get_feature_names_out` should be the same as for sklearn, or there should be an explanation in the docs.

Actual Behavior

At the moment, the `feature_names_out_` attribute is set in `fit` and returned by the `get_feature_names_out` function. In sklearn, the `get_feature_names_out` function does not just return the values set in `fit`, but determines the output feature names from the given input feature names if possible.

Examples

`n_components`

I'd like to discuss this idea a little with the community. Please add some pros and cons of this approach. I'm sure I'm missing something.

Pros

Cons
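The difference described above can be sketched with two toy transformers (illustrative names only, not the library's classes): one freezes output names at fit time, the other derives them on demand from `input_features`, which is the behaviour sklearn expects.

```python
# Illustrative-only sketch of the two behaviours being compared.

class StoresNamesAtFit:
    """Current behaviour: output names are frozen when fit sees the data."""
    def fit(self, column_names):
        self.feature_names_out_ = list(column_names)
        return self
    def get_feature_names_out(self, input_features=None):
        return self.feature_names_out_   # ignores input_features entirely

class DerivesNamesOnDemand:
    """sklearn-style behaviour: output names are computed from the names
    handed in at call time, falling back to the names seen during fit."""
    def fit(self, column_names):
        self.feature_names_in_ = list(column_names)
        return self
    def get_feature_names_out(self, input_features=None):
        names = input_features if input_features is not None else self.feature_names_in_
        return [f"{n}_encoded" for n in names]

a = StoresNamesAtFit().fit(["x0", "x1"])
b = DerivesNamesOnDemand().fit(["x0", "x1"])
print(a.get_feature_names_out(["age", "city"]))  # -> ['x0', 'x1'] (stale)
print(b.get_feature_names_out(["age", "city"]))  # -> ['age_encoded', 'city_encoded']
```

Only the second variant stays consistent when a pipeline renames or reorders inputs upstream, which is why sklearn puts the naming logic in `get_feature_names_out` rather than in `fit`.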