Added MASE metric and y_train parameter to objectives #4221
Conversation
Codecov Report
@@            Coverage Diff            @@
##             main   #4221     +/-   ##
========================================
+ Coverage    99.7%   99.7%    +0.1%
========================================
  Files         349     349
  Lines       38281   38309      +28
========================================
+ Hits        38162   38190      +28
  Misses        119     119
Force-pushed from d43834e to 57f19c9
evalml/tests/dependency_update_check/latest_dependency_versions.txt (outdated; resolved)
def objective_function(self, y_true, y_predicted, X=None, sample_weight=None):
    """Objective function for mean absolute scaled error for time series regression."""
    y_train = y_true
MASE takes in training data as a parameter. Since we don't use training data here, we substituted the actual data in place of training data for now, but I wanted to see what others think.
This brings up a very interesting question. A quick test shows that the results do change based on what you pass in to the y_train argument, meaning that the results we pass in with this are going to be less correct.
Looking through where we score pipelines, I don't think it would be too much extra lift to add the ability to pass y_train through. We have access to it in the upper-level time series pipeline scoring functions, and it's pretty common for time series work in general to require the historical data.
I'm curious what others think, though: should we update our objective functions to have access to y_train, or should we keep it as is?
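For context, the dependence on y_train can be reproduced outside the pipeline. Below is a hedged, pure-NumPy sketch of non-seasonal MASE (sktime's MeanAbsoluteScaledError may differ in details such as seasonality handling); the point is only that the denominator is computed from y_train, so substituting y_true for y_train changes the score:

```python
import numpy as np

def naive_mase(y_true, y_predicted, y_train):
    """Non-seasonal MASE sketch: scale the forecast MAE by the
    in-sample MAE of the naive one-step-lag forecast on y_train."""
    y_true = np.asarray(y_true, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    mae = np.mean(np.abs(y_true - y_predicted))
    scale = np.mean(np.abs(np.diff(y_train)))  # depends only on y_train
    return mae / scale

y_true = [3.0, 4.0, 5.0]
y_pred = [2.5, 4.5, 5.5]  # MAE = 0.5

# Same predictions, different y_train -> different MASE:
print(naive_mase(y_true, y_pred, y_train=[1.0, 2.0, 3.0]))  # scale 1.0 -> 0.5
print(naive_mase(y_true, y_pred, y_train=[1.0, 3.0, 5.0]))  # scale 2.0 -> 0.25
```

Passing y_true in place of y_train makes the scale reflect the test window rather than the history the model trained on, which is why the reported scores shift.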
I think we should update it to support MASE properly, but let's work on it in a follow-up issue. @remyogasawara can you file the follow-up in EvalML? Just basically stating the requirements that @eccabay wrote above?
This metric is a tricky one - I think there are a few details we need to discuss before it can be merged! Curious for other team members' input on this as well.
if isinstance(y_train, pd.Series):
    y_train = y_train.to_numpy()
mase = MeanAbsoluteScaledError()
return mase(y_true, y_predicted, y_train=y_train) * 100
According to the documentation, another parameter that we may want to consider leveraging is sp. With our Decomposer.determine_periodicity function (or, if the pipeline has a Decomposer, the decomposer's saved sp parameter), we should have the information necessary to pass that through. The question remains how, since we need information from X and y_train, but it could be done. From brief testing, adding the sp argument gives us lower values (aka "better" scores), and those would once again be considered more accurate, since we use sp during training.
I'm curious what other people think about the need to include sp as well.
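To make the sp discussion concrete, here is a hedged sketch of how seasonal scaling changes the denominator (this mirrors the common MASE definition; sktime's exact implementation may differ). With sp > 1 the scale comes from the seasonal-naive forecast on y_train, which for trending data is often larger, hence lower scores:

```python
import numpy as np

def seasonal_mase(y_true, y_predicted, y_train, sp=1):
    """MASE sketch with seasonal periodicity sp: the scale is the in-sample
    MAE of the seasonal-naive forecast y_hat[t] = y_train[t - sp]."""
    y_true = np.asarray(y_true, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    mae = np.mean(np.abs(y_true - y_predicted))
    scale = np.mean(np.abs(y_train[sp:] - y_train[:-sp]))
    return mae / scale

y_train = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_true, y_pred = [7.0, 8.0], [7.5, 8.5]  # MAE = 0.5

print(seasonal_mase(y_true, y_pred, y_train, sp=1))  # scale 1.0 -> 0.5
print(seasonal_mase(y_true, y_pred, y_train, sp=2))  # scale 2.0 -> 0.25
```

This also shows why sp would have to be threaded through scoring alongside y_train: both inputs live outside the objective itself.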
I agree we should support this! @remyogasawara can you file another issue for this so we can prioritize it after decomposition is added for multiseries?
def objective_function(self, y_true, y_predicted, X=None, sample_weight=None):
    """Objective function for mean absolute scaled error for time series regression."""
    y_train = y_true
I think we should update it to support MASE properly, but let's work on it in a follow-up issue. @remyogasawara can you file the follow-up in EvalML? Just basically stating the requirements that @eccabay wrote above?
One other small thing I forgot to request: can you add this (as well as SMAPE, because I forgot before 😅) to the API reference?
Force-pushed from 2491719 to 5569083
Force-pushed from 657d675 to bd072d7
LGTM, good work!
Force-pushed from 634f20c to a2ba1db
Looks good, great job being thorough with your changes! Just a few small nitpicks, but otherwise this looks ready to go
if (isinstance(y_train, pd.DataFrame) and (y_train.values == 0).all()) or (
    isinstance(y_train, pd.Series) and (y_train == 0).all()
):
I don't think you need to split it up this way - y_train.values should work regardless of whether it's a DataFrame or a Series!
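A quick check of that suggestion (an illustrative snippet, not the PR's actual code): pandas exposes .values as an ndarray on both Series and DataFrame, so a single expression covers both cases without isinstance branching:

```python
import pandas as pd

def all_zero(y_train):
    # .values returns a NumPy ndarray for both Series and DataFrame,
    # so the same comparison works for either input type.
    return bool((y_train.values == 0).all())

print(all_zero(pd.Series([0, 0, 0])))         # True
print(all_zero(pd.DataFrame({"a": [0, 0]})))  # True
print(all_zero(pd.Series([0, 1])))            # False
```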
Force-pushed from 7865012 to 90fb578
Force-pushed from 844a517 to c6f52e4
Resolves #4217