You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I am exploring the possibility of implementing Featuretools into my pipeline, to be able to create new features from my Df.
Currently I am using a GridSearchCV, with a Pipeline embedded inside it. Since Featuretools is creating new features with aggregation on columns, like STD(column) etc, I feel like it is suspectible to data leakage. In your FAQ, youare giving an example approach to tackle it, which is not suitable for a Pipeline structure I am using.
Idea 0: I would love to integrate it directly into my Pipeline but it seems like not compatible with Pipelines. It would use fold train data to construct features, transform fold test data. K times. At the end, it would use whole data to construct, during Refit= True stage of GridSearchCV. If you have any example opposed to this fact, you are very welcome.
Idea 1: I can switch to a manual CV structure, not embedded into pipeline. And inside it, I can use Train data to construct new features, and test data to transform with these. It will work K times. At the end, all data can be used to construct Ultimate model.
It is the safest option, with time and complexity disadvantages.
Idea 2: Using it with whole data, ignore the leakage possibility. I am not in favor of this of course. But when I look at your Github page, all the examples are combining Train and Test data, creating these features with whole data. Then go on with Train-Test division for modeling.
Actually if you as the developers of the project think like that, I could give it a chance with whole data. Don't you think there is a leakage risk with the approach you are using at these Taxi Trip Duration examples?
What do you think, I would love to hear about your intuition on FeatureTools.
The text was updated successfully, but these errors were encountered:
I am exploring the possibility of implementing Featuretools into my pipeline, to be able to create new features from my Df.
Currently I am using a GridSearchCV, with a Pipeline embedded inside it. Since Featuretools is creating new features with aggregation on columns, like STD(column) etc, I feel like it is suspectible to data leakage. In your FAQ, youare giving an example approach to tackle it, which is not suitable for a Pipeline structure I am using.
Idea 0: I would love to integrate it directly into my Pipeline but it seems like not compatible with Pipelines. It would use fold train data to construct features, transform fold test data. K times. At the end, it would use whole data to construct, during Refit= True stage of GridSearchCV. If you have any example opposed to this fact, you are very welcome.
Idea 1: I can switch to a manual CV structure, not embedded into pipeline. And inside it, I can use Train data to construct new features, and test data to transform with these. It will work K times. At the end, all data can be used to construct Ultimate model.
It is the safest option, with time and complexity disadvantages.
Idea 2: Using it with whole data, ignore the leakage possibility. I am not in favor of this of course. But when I look at your Github page, all the examples are combining Train and Test data, creating these features with whole data. Then go on with Train-Test division for modeling.
For example;
https://github.com/Featuretools/predict-taxi-trip-duration/blob/master/NYC%20Taxi%203%20-%20Simple%20Featuretools.ipynb
Actually if you as the developers of the project think like that, I could give it a chance with whole data. Don't you think there is a leakage risk with the approach you are using at these Taxi Trip Duration examples?
What do you think, I would love to hear about your intuition on FeatureTools.
The text was updated successfully, but these errors were encountered: