Hi,
We are in the early stages of evaluating Feathr and are trying to determine its flexibility and usability compared to our current in-house solution. One important point is which types of features it can handle easily.
One type we often use is exemplified by "number of days since last transaction": we have to look back in time from the instance timestamp, find the most recent transaction, and compute the difference between the instance timestamp and the transaction timestamp. By "instance" I mean the training example in the offline case and the current request in the online case.
In our current solution we cover this by making the instance's timestamp available to the aggregation function that is applied to the backward window. In Feathr this approach wouldn't help much, of course, as the logic that can be used in a windowed aggregation is very restricted (maybe deliberately, to force the user to break up complex feature logic into simpler, reusable steps).
Anyway, based on the Feathr docs I came up with the following way to achieve this in the offline case: first, use `WindowAggTransformation` to create a feature that is the latest transaction's timestamp. Second, create another feature that is simply the observation timestamp. Finally, create a derived feature that is the difference of the two.
However, I'm not sure how this logic could translate elegantly to the online case, because the "instance timestamp" that was available in the raw data is not going to be available online, at least not naturally. I suppose we could create an input column in the live streaming data that is simply the request timestamp, but that seems a little too complicated. Or we could use the current time, which I think is available via the supported SparkSQL expressions. But this would mean the online case would be based on partly different sources than the offline case. That is certainly doable, but it seems to go against the spirit of streamlining the offline-to-online switch in terms of feature management.
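To make the intended semantics concrete, here is a minimal plain-Python sketch of the logic described above. It is not Feathr code and does not use any Feathr API; the function names, the 90-day default window, and the sample timestamps are all hypothetical and chosen only for illustration. The reference timestamp is passed in explicitly, so the same function covers the offline case (observation timestamp) and the online case (request timestamp).

```python
from bisect import bisect_right
from datetime import datetime, timedelta
from typing import Optional, Sequence


def latest_transaction_ts(
    transaction_ts: Sequence[datetime],
    reference_ts: datetime,
    window: timedelta,
) -> Optional[datetime]:
    """Step 1: latest transaction in (reference_ts - window, reference_ts].

    Assumes transaction_ts is sorted in ascending order.
    """
    # Index of the first transaction strictly after the reference timestamp.
    idx = bisect_right(transaction_ts, reference_ts)
    if idx == 0:
        return None  # no transaction at or before the reference timestamp
    latest = transaction_ts[idx - 1]
    if latest <= reference_ts - window:
        return None  # the most recent transaction lies outside the window
    return latest


def days_since_last_transaction(
    transaction_ts: Sequence[datetime],
    reference_ts: datetime,  # step 2: observation (offline) or request (online) time
    window: timedelta = timedelta(days=90),  # hypothetical window length
) -> Optional[int]:
    """Step 3: whole days between the reference timestamp and the latest transaction."""
    latest = latest_transaction_ts(transaction_ts, reference_ts, window)
    if latest is None:
        return None
    return (reference_ts - latest).days


# Hypothetical sample data for one customer.
transactions = [datetime(2022, 1, 1), datetime(2022, 1, 10), datetime(2022, 2, 1)]

# Offline: the reference is the observation timestamp of a training example.
offline_value = days_since_last_transaction(transactions, datetime(2022, 1, 15))

# Online: the reference is the request timestamp (or the current time).
online_value = days_since_last_transaction(transactions, datetime(2022, 2, 4, 12))

# No transaction inside the window yields None.
stale_value = days_since_last_transaction(transactions, datetime(2023, 1, 1))
```

With this sample data the offline value is 5, the online value is 3, and the last call returns `None`. The point of the sketch is that only the source of `reference_ts` differs between the two cases, which is exactly the part I am unsure how to express in Feathr.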
I'd be interested to learn whether this is a requirement you also encounter and how you solve it. My main question is whether there's a more direct way in Feathr that eludes me.
Thanks and best regards
Jonas