
Dealing with highly imbalanced data in Autosklearn #1164

Closed
ShirinNajdi opened this issue Jun 25, 2021 · 7 comments

@ShirinNajdi

Hi,
In view of issue #113, I would like to know whether there is an update on including SMOTE in the Auto-sklearn package.

@mfeurer
Contributor

mfeurer commented Jun 25, 2021

Hey @ShirinNajdi, thanks a lot for your interest. Unfortunately, the underlying issues in scikit-learn are still there: scikit-learn/scikit-learn#3855 and scikit-learn/scikit-learn#9630

We'll re-evaluate whether we can use the imbalanced-learn extension to provide SMOTE in Auto-sklearn.

@ShirinNajdi
Author

scikit-learn/scikit-learn#9630

Thank you for the quick reply.
So, according to scikit-learn/scikit-learn#3855 (if I have understood it correctly), the problem is that in our case SMOTE should not be applied to the test set. But how about applying SMOTE to the training data and leaving the test data untouched? We can feed the training and test data separately to fit() in Auto-sklearn. Does that make sense?

A similar situation can be tackled with imblearn.pipeline when using Bayesian hyperparameter optimisation with SMOTE: the idea is to transform the training data with SMOTE while keeping the test data untouched (see the sketch at the end of this comment).

I would be happy to hear your thoughts.
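For concreteness, here is a minimal sketch of that imblearn.pipeline idea (toy data and an arbitrary classifier, not Auto-sklearn code): the SMOTE step is applied only during fit, so the test data are never resampled.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# imbalanced toy data: roughly 5% minority class
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),                # resampling happens only at fit time
    ("clf", RandomForestClassifier(random_state=0)),
])

pipe.fit(X_train, y_train)          # training data is oversampled internally
print(pipe.score(X_test, y_test))   # test data passes through untouched
```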

@mfeurer
Contributor

mfeurer commented Jun 30, 2021

The fact that transform needs to behave differently at fit time and at predict time is one issue. The other issue is that scikit-learn does not support changing the targets, so one cannot add new training samples within a pipeline. However, one would like to use SMOTE in the middle of a pipeline: after scaling and one-hot encoding, but before the classifier.

One could indeed use imblearn.pipeline, but we did not have the time to look into whether we can make use of that library.
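To illustrate the placement described above, here is a hedged sketch using imblearn's Pipeline, which (unlike scikit-learn's own Pipeline) accepts samplers as intermediate steps. The column names, toy data, and choice of classifier are purely illustrative and not part of Auto-sklearn.

```python
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy data with one categorical and one numeric column, ~10% positive class
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "color": rng.choice(["red", "green", "blue"], size=500),
    "value": rng.normal(size=500),
})
y = (rng.random(500) < 0.1).astype(int)

preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["color"]),
    ("scale", StandardScaler(), ["value"]),
])

pipe = Pipeline([
    ("preprocess", preprocess),                  # one-hot encoding + scaling first
    ("smote", SMOTE(random_state=0)),            # oversampling on the encoded features
    ("clf", LogisticRegression(max_iter=1000)),  # classifier sees the resampled data
])
pipe.fit(X, y)
```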

@simonprovost

@mfeurer Would you be open to a pull request implementing the SMOTE method in Auto-sklearn?

Cheers.

@mfeurer
Contributor

mfeurer commented Aug 24, 2021

Do you mean by integrating imblearn?

@mfeurer
Contributor

mfeurer commented Nov 17, 2021

Closing this for now as it is an issue in scikit-learn. We can reassess this once scikit-learn allows changing the number of data points in a pipeline.

@simonprovost

simonprovost commented May 30, 2024

@mfeurer Sorry for the very late response. I believe that when I made my comment, I had not thoroughly reviewed each of the comments and GitHub issues shared; it is understandable why this could not have been done at the time. Meanwhile, @ShirinNajdi, check out https://github.com/prabhant/gama/tree/imblearn. Otherwise, check out AMLTK and create your own search space: https://github.com/automl/amltk. This can also be done using GAMA, and a PR should be available by the end of the summer to facilitate all of this. I believe AMLTK is the better way to go for now, but to expand your search, also look into https://github.com/openml-labs/gama
