-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adjust the calculation for None intent and True Negatives to match the LUIS batch testing #254
Conversation
…les to improve table formatting and avoid truncating intent and entity names
…e functions private and adjusted tests accordingly
… a null intent is detected we should throw an error since we cannot do the comparison
…ny other intent, the full design of this is in microsoft#252 and UTs are passing with this checkin
… capture the None intent
Addressing issue microsoft#249 microsoft#249 Today's code does not support the compare command for Lex results since the format only supports "intentName" microsoft#252 will track the work to support Lex properly. Current design for LUIS and DF is as follows: null intents are not allowed. It probably means there is no "intent" entry in the JSON which makes the comparison void and we would throw an error when this happens. None intent and default_fallback_intents are treated like any other intent To calculate TP the actual and expected intents should match When the actual intent is different from the expected we will record a FP for the actual and a FN for the expected. For a given intent, if it doesn't match the actual or expected we then record a TN. This behavior is matching the calculations that LUIS batch testing does.
This is a good start, but honestly if we're calculating this, it looks like there is no longer a real reason to compute true negatives (maybe for no entities detected or expected), and any time we have a true positive, we also have a true negative. It seems like we might be able to do away with the confusion matrix results, and just have boolean results for everything... |
{ | ||
if (intent != actual && intent != expected) | ||
{ | ||
yield return TrueNegative( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, wondering if we really need to even bother tracking the true negatives, especially since we don't have the full model context and know all possible intents (see comment above).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed True negative calculations for intents
@@ -71,12 +71,15 @@ string getUtteranceId(LabeledUtterance utterance, int index) | |||
return utterance.GetUtteranceId() ?? index.ToString(CultureInfo.InvariantCulture); | |||
} | |||
|
|||
var intents = actualUtterances.Select(utterance => utterance.Intent).Distinct().ToList(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not an accurate representation of all possible intents. The only source of truth for all possible intents is the model itself. I think we may want to take this library out of the business of determining true negatives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point. I was assuming the test set would include all intents but this is not guarantueed and the predictioons are not guarantuees to contain all intents.
Agree, True Negatives are not giving much info in this context and are not using for any calculation. I would say even for entities, it doesn't add much value. What do you mean by having boolean results for everything? In reply to: 586041963 [](ancestors = 586041963) |
…n a null intent we now treat the null intent as any other intents and calculate TP,TN and FP for it and report it
} | ||
|
||
[Test] | ||
[TestCase("foo", "foo", 1, 0, 0, 0)] | ||
[TestCase(null, null, 0, 1, 0, 0)] | ||
[TestCase(null, null, 0, 0, 0, 0)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like this should still be a TruePositive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(If we consider the "None" intent as just any other intent, perhaps we should consider the "null" intent as any other intent)
} | ||
|
||
if (!isNoneIntent(expected)) | ||
else | ||
{ | ||
yield return FalseNegative( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point I was making about binary results was that we no longer need the concept of Positive or Negative, just true and false.
I think my main concern with this is that we could also solve it by just allowing the user to configure the "true negative" intent. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See main conversation comment - I think we may first want to see if we can solve by making the "true negative" intent configurable.
Superceded by #274 |
Here is the suggested design for this change:
When no intents match an existing intent here is how each provider labels the intent:
1- LUIS would return a None intent
{
"text": "helloooo",
"intent": "None",
"entities": [],
"score": 0.9076262,
"textScore": 0.0,
"timestamp": "2020-01-28T11:20:42.8469612-05:00"
}
2. Aws Lex would return a **null ** intent
{
"dialogState": "ElicitIntent",
"intentName": null,
"message": "Sorry, can you please repeat that?",
...
}
3. Dialogflow would return a **default fallback ** intent
{
"text": "how are you today?",
"intent": "Default_Fallback_Intent",
"entities": []
},
Today's code supports the compare command for LUIS,DF and Lex results.
null intents, None intent and default_fallback_intents are treated like any other intent
To calculate TP the actual and expected intents should match
When the actual intent is different from the expected we will record a FP for the actual and a FN for the expected.
We no longer calculate True positives since we don't have a good idea of all possible intents for a given model and since they are not used in the calculation of precision,recall and F measure
This behavior is matching the calculations that LUIS batch testing does.
Fixes #249