Adjust the calculation for None intent and True Negatives to match the LUIS batch testing #254

saroup · 2020-02-13T13:31:34Z

Here is the suggested design for this change:
When an utterance matches none of the model's intents, each provider labels the result differently:
1. LUIS returns a None intent:
{
  "text": "helloooo",
  "intent": "None",
  "entities": [],
  "score": 0.9076262,
  "textScore": 0.0,
  "timestamp": "2020-01-28T11:20:42.8469612-05:00"
}
2. AWS Lex returns a **null** intent:
{
  "dialogState": "ElicitIntent",
  "intentName": null,
  "message": "Sorry, can you please repeat that?",
  ...
}
3. Dialogflow returns a **default fallback** intent:
{
  "text": "how are you today?",
  "intent": "Default_Fallback_Intent",
  "entities": []
}
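To make the three shapes concrete, here is a minimal Python sketch (not the repository's C#) that pulls the intent label out of each kind of result. The helper name `predicted_intent` is hypothetical, and the key lookup order is an assumption based only on the payloads shown above, where LUIS and Dialogflow results carry `intent` and the Lex response carries `intentName`.

```python
import json

def predicted_intent(result: dict):
    """Return the intent label from a result dict, or None if it is absent or null.

    Assumption: results carry the label under "intent" (LUIS, Dialogflow)
    or "intentName" (Lex), as in the payloads quoted in this description.
    """
    for key in ("intent", "intentName"):
        if key in result:
            return result[key]
    return None

# Trimmed copies of the three payloads from the description.
luis = json.loads('{"text": "helloooo", "intent": "None", "entities": []}')
lex = json.loads('{"dialogState": "ElicitIntent", "intentName": null}')
dialogflow = json.loads(
    '{"text": "how are you today?", "intent": "Default_Fallback_Intent", "entities": []}'
)

labels = [predicted_intent(r) for r in (luis, lex, dialogflow)]
```

Note that the LUIS label is the string "None", while the Lex label is a real null, so the two are different values unless the caller maps one onto the other.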
Today's code supports the compare command for LUIS, Dialogflow (DF), and Lex results.
Null intents, the None intent, and Default_Fallback_Intent are treated like any other intent.
To count a true positive (TP), the actual and expected intents must match.
When the actual intent differs from the expected intent, we record a false positive (FP) for the actual intent and a false negative (FN) for the expected intent.
We no longer calculate true negatives, since we don't know all possible intents for a given model and since they are not used in the calculation of precision, recall, and F-measure.
This behavior matches the calculations that LUIS batch testing does.
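The counting rules above can be sketched as follows. This is an illustrative Python version, not the code in this PR; the function names and the sample utterance pairs are hypothetical, and returning 0.0 when a denominator is zero is an assumption rather than something the description specifies.

```python
from collections import Counter

def intent_counts(pairs):
    """Count TP/FP/FN per intent from (expected, actual) label pairs.

    Every label, including "None" or a null, is treated like any other intent.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for expected, actual in pairs:
        if expected == actual:
            tp[expected] += 1
        else:
            fp[actual] += 1    # the predicted intent was wrong
            fn[expected] += 1  # the expected intent was missed
    return tp, fp, fn

def precision_recall_f1(tp, fp, fn):
    """Standard definitions; 0.0 on an empty denominator (an assumption)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical (expected, actual) pairs.
pairs = [
    ("Greeting", "Greeting"),
    ("Greeting", "None"),
    ("None", "None"),
    ("BookFlight", "None"),
]
tp, fp, fn = intent_counts(pairs)
none_metrics = precision_recall_f1(tp["None"], fp["None"], fn["None"])
```

No true negative count appears anywhere in these formulas, which is why dropping it does not change precision, recall, or F-measure.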

Fixes #249

Addressing issue microsoft#249.
Today's code does not support the compare command for Lex results since the format only supports "intentName"; microsoft#252 will track the work to support Lex properly.
Current design for LUIS and DF is as follows:
null intents are not allowed. A null intent probably means there is no "intent" entry in the JSON, which makes the comparison void, so we would throw an error when this happens.
None intent and default_fallback_intents are treated like any other intent.
To calculate TP, the actual and expected intents should match.
When the actual intent is different from the expected, we will record a FP for the actual and a FN for the expected.
For a given intent, if it doesn't match the actual or expected, we then record a TN.
This behavior matches the calculations that LUIS batch testing does.

src/NLU.DevOps.CommandLine/Compare/CompareCommand.cs

src/NLU.DevOps.ModelPerformance.Tests/TestCaseSourceTests.cs

rozele · 2020-02-14T00:40:44Z

This is a good start, but honestly if we're calculating this, it looks like there is no longer a real reason to compute true negatives (maybe for no entities detected or expected), and any time we have a true positive, we also have a true negative. It seems like we might be able to do away with the confusion matrix results, and just have boolean results for everything...

src/NLU.DevOps.ModelPerformance/TestCaseSource.cs

rozele · 2020-02-14T00:43:17Z

src/NLU.DevOps.ModelPerformance/TestCaseSource.cs

+            {
+                if (intent != actual && intent != expected)
+                {
+                    yield return TrueNegative(


Hmm, wondering if we really need to even bother tracking the true negatives, especially since we don't have the full model context and don't know all possible intents (see comment above).

removed True negative calculations for intents

rozele · 2020-02-14T00:44:07Z

src/NLU.DevOps.ModelPerformance/TestCaseSource.cs

@@ -71,12 +71,15 @@ string getUtteranceId(LabeledUtterance utterance, int index)
                return utterance.GetUtteranceId() ?? index.ToString(CultureInfo.InvariantCulture);
            }

+            var intents = actualUtterances.Select(utterance => utterance.Intent).Distinct().ToList();


This is not an accurate representation of all possible intents. The only source of truth for all possible intents is the model itself. I think we may want to take this library out of the business of determining true negatives.

Good point. I was assuming the test set would include all intents, but this is not guaranteed, and the predictions are not guaranteed to contain all intents either.
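A small Python sketch of the problem raised here: if the intent list is derived from the observed utterances, the true negative count depends on which intents happen to show up. The model intent set and the labels below are hypothetical, and `true_negatives` only mirrors the condition in the diff hunk above (`intent != actual && intent != expected`); it is not the repository's code.

```python
def true_negatives(expected, actual, intents):
    """Count intents that are neither the expected nor the actual label."""
    return sum(1 for intent in intents if intent != actual and intent != expected)

# Hypothetical full set of intents defined in the model.
model_intents = {"Greeting", "BookFlight", "CancelFlight", "None"}

# Intents that happen to appear in the actual results of a small test run.
actual_intents = ["Greeting", "None", "Greeting"]
observed_intents = set(actual_intents)

# Intents the derived list never sees.
missing = model_intents - observed_intents

tn_observed = true_negatives("Greeting", "Greeting", observed_intents)
tn_model = true_negatives("Greeting", "Greeting", model_intents)
```

For the same utterance the count differs between the two intent lists, so any metric built on it would change with the composition of the test set rather than with model quality.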

saroup · 2020-02-14T01:31:17Z

Agreed, true negatives don't give much information in this context and are not used in any calculation. I would say that even for entities they don't add much value. What do you mean by having boolean results for everything?

rozele · 2020-02-14T20:36:45Z

src/NLU.DevOps.ModelPerformance.Tests/TestCaseSourceTests.cs

        }

        [Test]
        [TestCase("foo", "foo", 1, 0, 0, 0)]
-        [TestCase(null, null, 0, 1, 0, 0)]
+        [TestCase(null, null, 0, 0, 0, 0)]


Seems like this should still be a TruePositive?

(If we consider the "None" intent as just any other intent, perhaps we should consider the "null" intent as any other intent)

rozele · 2020-02-14T20:38:08Z

src/NLU.DevOps.ModelPerformance/TestCaseSource.cs

            }
-
-            if (!isNoneIntent(expected))
+            else
            {
                yield return FalseNegative(


The point I was making about binary results was that we no longer need the concept of Positive or Negative, just true and false.
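One possible reading of the "just true and false" idea, sketched in Python: each utterance yields a single boolean for whether the intent matched, and a summary is computed from those booleans. The function name and sample pairs are hypothetical, and this is only an interpretation of the comment, not a design agreed in this thread.

```python
def intent_matches(pairs):
    """Return one boolean per utterance: did the actual intent equal the expected one?"""
    return [expected == actual for expected, actual in pairs]

# Hypothetical (expected, actual) pairs; a null intent compares like any other value.
pairs = [("Greeting", "Greeting"), ("Greeting", "None"), (None, None)]

results = intent_matches(pairs)
accuracy = sum(results) / len(results)
```

Under this reading a null-versus-null comparison is simply a correct result, with no need to decide whether it is a positive or a negative.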

rozele · 2020-02-26T17:20:07Z

I think my main concern with this is that we could also solve it by just allowing the user to configure the "true negative" intent.

rozele

See main conversation comment - I think we may first want to see if we can solve by making the "true negative" intent configurable.
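A minimal Python sketch of what a configurable "true negative" intent could look like. Everything here is hypothetical, including the `classify` name and the `true_negative_intent` parameter; it is not the implementation in this PR or in any follow-up. It follows the earlier `isNoneIntent` guard visible in the diff above by skipping the FP or FN record when the corresponding side is the configured intent.

```python
def classify(expected, actual, true_negative_intent=None):
    """Return a list of (kind, intent) confusion results for one utterance.

    true_negative_intent is the label that means "no intent"; a match on it
    is a true negative instead of a true positive.
    """
    if expected == actual:
        kind = "TN" if expected == true_negative_intent else "TP"
        return [(kind, expected)]
    results = []
    if actual != true_negative_intent:
        results.append(("FP", actual))    # a real intent was predicted wrongly
    if expected != true_negative_intent:
        results.append(("FN", expected))  # a real intent was missed
    return results

# With the default (null), a null/null pair is a true negative.
null_case = classify(None, None)
# A LUIS user could configure the "None" intent instead.
luis_case = classify("None", "None", true_negative_intent="None")
missed = classify("foo", "None", true_negative_intent="None")
```

With this shape, users who want the configured intent treated like any other intent could pass a label that never occurs in their data.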

rozele · 2020-03-09T19:28:56Z

Superseded by #274

saroup and others added 30 commits August 23, 2019 18:01

Minor change in documentation to reflect current code

ad0ab2b

Merge remote-tracking branch 'upstream/master'

7240b52

Supporting 1 character entities where startPos is equal tp endPos

4efbf8e

making the change shorter by adding <=

30267f7

Adding unit tests for IsValid

8681525

added a separate test for single character entities

9783d77

Minor fix to test case to assert matched value.

8d196e3

Merge remote-tracking branch 'upstream/master'

5f9d3e2

Merge branch 'master' of github.com:saroup/NLU.DevOps into issue202

43bb831

Merge branch 'master' of https://github.com/microsoft/NLU.DevOps

9329951

Merge remote-tracking branch 'upstream/master'

c137789

When the body of the exception is null just show the message in the exception

38822c8

Adding Precision, Recall and F1 calculations to the compare command

db25a0b

Merge remote-tracking branch 'upstream/master'

d2f8645

renamed utilities to NLUAccuracy

677ca1c

modifified some function names and access level, also used consoleTables to improve table formatting and avoid truncating intent and entity names

78e5f9b

minor spacing change

291f840

Modifying tests first to remove any None related checks

8941658

fixed regex to match intent instead of entity

afd0fe4

adjusting expected results when None intent is actual or expected

983c041

Simplified the code further, renamed calcMetrics to calcAccuracy, made functions private and adjusted tests accordingly

be94fd0

fixing typo

d4e4759

minor styling modifications

5b285e1

Merge branch 'master' into saroup/noneIntent

c2fffab

Remove tests that took into consideration null intents. Eventually if a null intent is detected we should throw an error since we cannot do the comparison

aa2cfa0

Adjusting tests that had null intent to have empty intent

174244a

Changed the logic for calculating TNs, we now treat None intents as any other intent, the full design of this is in microsoft#252 and UTs are passing with this checkin

e0bde89

Pulling out the intents from the actual utterances so we make sure to capture the None intent

6f3625d

Adding a confusion table for false positive intents

4928359

saroup requested a review from rozele February 13, 2020 13:31

rozele reviewed Feb 14, 2020

src/NLU.DevOps.CommandLine/Compare/CompareCommand.cs (outdated)

rozele reviewed Feb 14, 2020

src/NLU.DevOps.ModelPerformance.Tests/TestCaseSourceTests.cs

rozele reviewed Feb 14, 2020

src/NLU.DevOps.ModelPerformance.Tests/TestCaseSourceTests.cs (outdated)

rozele reviewed Feb 14, 2020

src/NLU.DevOps.ModelPerformance.Tests/TestCaseSourceTests.cs (outdated)

rozele reviewed Feb 14, 2020

src/NLU.DevOps.ModelPerformance/TestCaseSource.cs (outdated)

rozele reviewed Feb 14, 2020

Merge https://github.com/microsoft/NLU.DevOps

8672e4e

saroup added 3 commits February 13, 2020 21:02

adjusting some nits and adding tests for null expected and actual intent

e929ae6

removed the calculations for true positives, and since Lex can return a null intent we now treat the null intent as any other intents and calculate TP,TN and FP for it and report it

70d2970

removing getting all intents

07b0067

rozele mentioned this pull request Feb 14, 2020

Add design doc for how confusion matrix results are calculated for NLU.DevOps #243

Closed

removed TN calculation from entity comparison

9ee1881

rozele reviewed Feb 14, 2020

rozele suggested changes Feb 26, 2020

rozele closed this Mar 9, 2020
