How to implement Featuretools into my ML Process without data leakage? #14

kilincali35 · 2023-02-21T07:34:24Z

I am exploring whether I can integrate Featuretools into my pipeline to create new features from my DataFrame.

I currently use GridSearchCV with a Pipeline embedded inside it. Since Featuretools creates new features by aggregating over columns, such as STD(column), I think it is susceptible to data leakage. Your FAQ gives an example approach for handling this, but it does not fit the Pipeline structure I am using.

Idea 0: I would love to integrate it directly into my Pipeline, but it does not seem to be compatible with Pipelines. Ideally it would use each fold's train data to construct the features and then transform that fold's test data, K times. At the end, during the refit=True stage of GridSearchCV, it would use the whole dataset to construct them. If you have an example showing that this is possible, I would be glad to see it.
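To make Idea 0 concrete, here is a minimal sketch of the behaviour I am after. It does not use Featuretools at all: `GroupAggregateFeatures` is a hypothetical hand-rolled transformer, and the `customer_id` and `amount` columns and the synthetic data are made up for illustration. It only shows the fit-on-train, transform-on-test pattern that I would like an aggregation step to follow inside GridSearchCV.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class GroupAggregateFeatures(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for an aggregation feature such as STD(column)."""

    def __init__(self, group_col="customer_id", value_col="amount"):
        self.group_col = group_col
        self.value_col = value_col

    def fit(self, X, y=None):
        # Learn per-group statistics from the rows passed to fit only
        # (the training part of each fold when used inside GridSearchCV).
        grouped = X.groupby(self.group_col)[self.value_col]
        self.stats_ = grouped.agg(["mean", "std"])
        # Global fallback for groups that were not seen during fit.
        self.fallback_ = pd.Series(
            {"mean": X[self.value_col].mean(), "std": X[self.value_col].std()}
        )
        return self

    def transform(self, X):
        # Look up the statistics learned in fit; nothing is recomputed here.
        feats = self.stats_.reindex(X[self.group_col].to_numpy())
        feats.index = X.index
        feats = feats.fillna(self.fallback_)
        return pd.concat([X[[self.value_col]], feats], axis=1)


# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    {
        "customer_id": rng.integers(0, 10, size=200),
        "amount": rng.normal(100, 20, size=200),
    }
)
y = 0.5 * X["amount"] + rng.normal(0, 1, size=200)

pipe = Pipeline([("features", GroupAggregateFeatures()), ("model", Ridge())])
search = GridSearchCV(pipe, {"model__alpha": [0.1, 1.0, 10.0]}, cv=5, refit=True)
search.fit(X, y)  # with refit=True, the best pipeline is refit on all rows
```

Because the statistics live in `stats_` and are only set in `fit`, each fold's test rows are transformed with values learned from that fold's train rows, and the final refit uses the whole dataset.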

Idea 1: I can switch to a manual CV structure that is not embedded in a Pipeline. Inside it, I can use the train data to construct the new features and then transform the test data with them. This runs K times. At the end, all the data can be used to build the final model.

This is the safest option, at the cost of extra run time and complexity.
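A rough sketch of what I mean by the manual CV loop is below. Again, this is not Featuretools code: `fit_group_stats` and `apply_group_stats` are hypothetical helpers standing in for "construct features on train" and "transform test with them", and the columns and data are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold


def fit_group_stats(df, group_col, value_col):
    """Learn per-group mean/std from the given rows only."""
    stats = df.groupby(group_col)[value_col].agg(["mean", "std"])
    fallback = pd.Series({"mean": df[value_col].mean(), "std": df[value_col].std()})
    return stats, fallback


def apply_group_stats(df, stats, fallback, group_col, value_col):
    """Attach previously learned statistics; unseen groups get the fallback."""
    feats = stats.reindex(df[group_col].to_numpy())
    feats.index = df.index
    feats = feats.fillna(fallback)
    return pd.concat([df[[value_col]], feats], axis=1)


# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
data = pd.DataFrame(
    {
        "customer_id": rng.integers(0, 10, size=200),
        "amount": rng.normal(100, 20, size=200),
    }
)
target = 0.5 * data["amount"] + rng.normal(0, 1, size=200)

fold_scores = []
cv = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(data):
    train, test = data.iloc[train_idx], data.iloc[test_idx]
    # Features are constructed from the train rows of this fold only.
    stats, fallback = fit_group_stats(train, "customer_id", "amount")
    X_train = apply_group_stats(train, stats, fallback, "customer_id", "amount")
    X_test = apply_group_stats(test, stats, fallback, "customer_id", "amount")
    model = Ridge(alpha=1.0).fit(X_train, target.iloc[train_idx])
    fold_scores.append(r2_score(target.iloc[test_idx], model.predict(X_test)))

# Final model: features and estimator are both fit on all rows.
stats, fallback = fit_group_stats(data, "customer_id", "amount")
X_all = apply_group_stats(data, stats, fallback, "customer_id", "amount")
final_model = Ridge(alpha=1.0).fit(X_all, target)
```

The extra cost is visible here: the feature construction runs once per fold plus once more for the final model, and any hyperparameter search would have to wrap this whole loop.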

Idea 2: Use it on the whole dataset and ignore the possibility of leakage. I am not in favor of this, of course. But when I look at your GitHub page, all the examples combine the train and test data, create the features from the whole dataset, and only then split into train and test for modeling.

For example:
https://github.com/Featuretools/predict-taxi-trip-duration/blob/master/NYC%20Taxi%203%20-%20Simple%20Featuretools.ipynb

If you, as the developers of the project, consider that acceptable, I could give the whole-data approach a chance. Don't you think there is a leakage risk in the approach used in these Taxi Trip Duration examples?

What do you think? I would love to hear your intuition on Featuretools.
