
Dealing with highly imbalanced data in Autosklearn #1164

Closed
ShirinNajdi opened this issue Jun 25, 2021 · 7 comments

@ShirinNajdi

Hi,
In view of issue #113, I would like to know whether there is an update on including SMOTE in the Auto-sklearn package.

@mfeurer
Contributor

mfeurer commented Jun 25, 2021

Hey @ShirinNajdi, thanks a lot for your interest. Unfortunately, the underlying issues in scikit-learn are still there: scikit-learn/scikit-learn#3855 and scikit-learn/scikit-learn#9630

We'll re-evaluate whether we can use the imbalanced-learn extension to provide SMOTE in Auto-sklearn.

@ShirinNajdi
Author

scikit-learn/scikit-learn#9630

Thank you for the quick reply.
So, according to scikit-learn/scikit-learn#3855 (if I have understood it correctly), the problem is that in our case SMOTE should not be applied to the test set. But how about applying SMOTE to the training data and leaving the test data untouched? We can feed the training and test data separately to fit() in Auto-sklearn. Does that make sense?

A similar situation can be tackled with imblearn.pipeline when using Bayesian hyperparameter optimisation with SMOTE: the idea is to transform the training data with SMOTE while keeping the test data untouched (see the sketch at the end of this comment).

I would be happy to hear your thoughts.
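For concreteness, here is a minimal sketch of that imblearn.pipeline idea (toy data and an arbitrary classifier, not Auto-sklearn code): the SMOTE step is applied only during fit, so the test data are never resampled.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# imbalanced toy data: roughly 5% minority class
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),                # resampling happens only at fit time
    ("clf", RandomForestClassifier(random_state=0)),
])

pipe.fit(X_train, y_train)          # training data is oversampled internally
print(pipe.score(X_test, y_test))   # test data passes through untouched
```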

@mfeurer
Contributor

mfeurer commented Jun 30, 2021

The fact that transform needs to behave differently at fit time and at predict time is one issue. The other issue is that scikit-learn does not support changing the targets, so one cannot add new training samples within a pipeline. However, one would like to use SMOTE in the middle of a pipeline: after scaling and one-hot encoding, but before the classifier.

One could indeed use imblearn.pipeline, but we did not have the time to look into whether we can make use of that library.
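To illustrate the placement described above, here is a hedged sketch using imblearn's Pipeline, which (unlike scikit-learn's own Pipeline) accepts samplers as intermediate steps. The column names, toy data, and choice of classifier are purely illustrative and not part of Auto-sklearn.

```python
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy data with one categorical and one numeric column, ~10% positive class
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "color": rng.choice(["red", "green", "blue"], size=500),
    "value": rng.normal(size=500),
})
y = (rng.random(500) < 0.1).astype(int)

preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["color"]),
    ("scale", StandardScaler(), ["value"]),
])

pipe = Pipeline([
    ("preprocess", preprocess),                  # one-hot encoding + scaling first
    ("smote", SMOTE(random_state=0)),            # oversampling on the encoded features
    ("clf", LogisticRegression(max_iter=1000)),  # classifier sees the resampled data
])
pipe.fit(X, y)
```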

@simonprovost

@mfeurer Would you be open to a pull request implementing the SMOTE method in Auto-sklearn?

Cheers.

@mfeurer
Contributor

mfeurer commented Aug 24, 2021

Do you mean by integrating imblearn?

@mfeurer
Contributor

mfeurer commented Nov 17, 2021

Closing this for now as it is an issue in scikit-learn. We can reassess this once scikit-learn allows changing the number of data points in a pipeline.

@simonprovost

simonprovost commented May 30, 2024

@mfeurer Sorry for the very late response. I believe that when I made my comment, I had not thoroughly reviewed each of the comments and GitHub issues shared; it is understandable why this could not have been done at the time. Meanwhile, @ShirinNajdi, check out https://github.com/prabhant/gama/tree/imblearn. Otherwise, check out AMLTK and create your own search space: https://github.com/automl/amltk. This can also be done using GAMA, and a PR should be available by the end of the summer to facilitate all of this. I believe AMLTK is the better way to go for now, but to expand your search, also look into https://github.com/openml-labs/gama
