How to implement Featuretools into my ML Process without data leakage? #14

Open
kilincali35 opened this issue Feb 21, 2023 · 0 comments

Comments

@kilincali35

I am exploring the possibility of integrating Featuretools into my pipeline so that I can create new features from my DataFrame.

Currently I am using GridSearchCV with a Pipeline embedded inside it. Since Featuretools creates new features by aggregating over columns, e.g. STD(column), I feel it is susceptible to data leakage. Your FAQ gives an example approach to tackle this, but it is not suitable for the Pipeline structure I am using.

Idea 0: I would love to integrate it directly into my Pipeline, but it does not seem to be compatible with Pipelines. Ideally, it would use each fold's training data to construct the features and then apply them to that fold's test data, K times. At the end, during the refit=True stage of GridSearchCV, it would use the whole data to construct them. If you have an example showing that this is possible, it would be very welcome.
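To make Idea 0 concrete, here is a minimal sketch of the fit-on-train, transform-on-test behavior inside a Pipeline. It does not call Featuretools at all: `GroupStdFeature` is a hypothetical pandas-based stand-in for an aggregation feature such as STD(column), and the column names `group` and `value`, the synthetic data, and the `Ridge` model are all assumptions made for illustration. Whether Featuretools can be wrapped the same way is exactly the open question here.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class GroupStdFeature(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for an aggregation feature such as STD(column)."""

    def __init__(self, group_col="group", value_col="value"):
        self.group_col = group_col
        self.value_col = value_col

    def fit(self, X, y=None):
        # Learn the per-group standard deviation only from the rows passed to fit.
        self.group_std_ = X.groupby(self.group_col)[self.value_col].std(ddof=0)
        self.fallback_ = float(X[self.value_col].std(ddof=0))
        return self

    def transform(self, X):
        # Look up the statistics learned in fit; unseen groups get the fallback.
        std = X[self.group_col].map(self.group_std_).fillna(self.fallback_)
        return pd.DataFrame(
            {self.value_col: X[self.value_col].to_numpy(), "group_std": std.to_numpy()},
            index=X.index,
        )


# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({"group": rng.integers(0, 5, size=200), "value": rng.normal(size=200)})
y = 2.0 * df["value"] + rng.normal(scale=0.1, size=200)

pipe = Pipeline([("agg", GroupStdFeature()), ("model", Ridge())])
search = GridSearchCV(pipe, {"model__alpha": [0.1, 1.0, 10.0]}, cv=5, refit=True)
search.fit(df, y)
```

Because the statistic is computed in `fit`, GridSearchCV recomputes it from the training rows of each fold, and the final refit recomputes it from the whole data.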

Idea 1: I can switch to a manual CV structure that is not embedded in a Pipeline. Inside it, I can use the training data to construct the new features and then apply them to the test data. This will run K times. At the end, all of the data can be used to build the final model.

This is the safest option, at the cost of extra time and complexity.
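A rough sketch of the manual loop described in Idea 1 could look like the following. As an assumption for illustration, the feature construction is a plain pandas group-wise STD in a hypothetical helper `add_group_std`, not a Featuretools call; the column names, synthetic data, and `Ridge` model are likewise made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold


def add_group_std(train, other):
    """Compute per-group STD on `train` and attach it to `other` (hypothetical helper)."""
    stats = train.groupby("group")["value"].std(ddof=0)
    fallback = float(train["value"].std(ddof=0))
    out = other[["value"]].copy()
    # Groups missing from `train` fall back to the overall training STD.
    out["group_std"] = other["group"].map(stats).fillna(fallback).to_numpy()
    return out


# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({"group": rng.integers(0, 5, size=200), "value": rng.normal(size=200)})
y = 2.0 * df["value"] + rng.normal(scale=0.1, size=200)

fold_mse = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(df):
    train, test = df.iloc[train_idx], df.iloc[test_idx]
    # Features for both parts of the fold use statistics from the training part only.
    model = Ridge(alpha=1.0).fit(add_group_std(train, train), y.iloc[train_idx])
    preds = model.predict(add_group_std(train, test))
    fold_mse.append(mean_squared_error(y.iloc[test_idx], preds))

# After CV, build the features and the final model on all of the data.
final_model = Ridge(alpha=1.0).fit(add_group_std(df, df), y)
```

The loop gives full control over which rows feed the aggregation, but hyperparameter search would have to be wrapped around it by hand.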

Idea 2: Use it on the whole data and ignore the possibility of leakage. I am not in favor of this, of course. But when I look at your GitHub page, all the examples combine the train and test data, create the features on the whole data, and only then split into train and test for modeling.

For example:
https://github.com/Featuretools/predict-taxi-trip-duration/blob/master/NYC%20Taxi%203%20-%20Simple%20Featuretools.ipynb

If you, as the developers of the project, consider that acceptable, I could give the whole-data approach a chance. Don't you think there is a leakage risk in the approach used in these Taxi Trip Duration examples?

What do you think? I would love to hear your intuition on Featuretools.
