NLU++: A Multi-Label, Slot-Rich, Generalisable Dataset for Natural Language Understanding in Task-Oriented Dialogue
This dataset presents a challenging evaluation environment for dialogue NLU models. It is divided into two domains, banking and hotels, and provides high-quality examples that combine a large number of multi-label intents and slots. More details on the dataset can be found in our publication.
Domain | Number of examples | Number of intents | Number of slots |
---|---|---|---|
BANKING | 2,071 | 48 | 13 |
HOTELS | 1,009 | 40 | 14 |
ALL | 3,080 | 62 | 17 |
Example Query | Intents | Slots (values) |
---|---|---|
I want to change my restaurant reservation | change, booking, room | --- |
I am trying to make a transfer but it doesn’t let me | make, transfer_payment, not_working | --- |
Why can’t I amend my booking on tuesday? | why, change, booking, not_working | date (tuesday) |
How much less did I spend on Amazon during the current year? | how_much, less, transfer_payment | date_period (current year), company_name (Amazon) |
Can I make a reservation from the 1st of June to the 7th? | make, booking | date_from (1st of June), date_to (7th) |
The data is divided into two domains, banking and hotels, each in its corresponding directory. The data for each domain is split into 20 folds, each stored in a JSON file named fold0.json, fold1.json, etc. The structure of each example is the following:
{
"text": "How much did I spend in total until May on amazon prime?",
"intents": [
"how_much",
"transfer_payment_deposit"
],
"slots": {
"date_to": {
"text": "May",
"span": [
36,
39
],
"value": {
"day": 31,
"month": 5,
"year": 2022
}
},
"company_name": {
"text": "amazon prime",
"span": [
43,
55
],
"value": "amazon prime"
}
}
}
The intents (or intent modules) are given as a list in the field "intents" and the slot-values in the field "slots". If either field is missing, the example has no intents or slot-values, respectively. For each slot present, the character span is given in the field "span" and the canonical value in the field "value".
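As a minimal sketch, a single fold can be inspected with the standard json module (the file path below is illustrative, and we assume each fold file holds a list of example dicts):
import json

# Load one fold of the banking domain (the path is illustrative).
with open("banking/fold0.json") as f:
    examples = json.load(f)

for ex in examples:
    text = ex["text"]
    intents = ex.get("intents", [])  # a missing field means no intents
    slots = ex.get("slots", {})      # a missing field means no slot-values
    for name, slot in slots.items():
        start, end = slot["span"]
        # spans are character offsets into the raw text
        assert text[start:end] == slot["text"]
        print(name, slot["text"], slot["value"])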
For relative dates and times, the reference date is set to 2022/3/15 and the reference time is set to 09:00 a.m., e.g.:
{
"text": "today",
"slots": {
"date": {
"text": "today",
"span": [
0,
5
],
"value": {
"day": 15,
"month": 3,
"year": 2022
}
}
}
}
{
"text": "any table free in 2 hours?",
"intents": [
"request_info",
"restaurant",
"booking"
],
"slots": {
"time": {
"text": "in 2 hours",
"span": [
15,
25
],
"value": {
"hour": 11,
"minute": 0
}
}
}
}
The values of relative weekdays are always resolved to a date in the future, e.g.:
{
"text": "I'm leaving on Wednesday",
"slots": {
"date_to": {
"text": "Wednesday",
"span": [
15,
24
],
"value": {
"day": 16,
"month": 3,
"year": 2022
}
}
}
}
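These conventions can be reproduced with a few lines of Python; the sketch below (the helper names are ours, not part of the dataset tooling) shows how the examples above resolve against the reference date and time:
from datetime import datetime, timedelta

REFERENCE = datetime(2022, 3, 15, 9, 0)  # reference date 2022/3/15, reference time 09:00 a.m.

def resolve_in_hours(hours):
    # "in 2 hours" -> {"hour": 11, "minute": 0}
    t = REFERENCE + timedelta(hours=hours)
    return {"hour": t.hour, "minute": t.minute}

def resolve_weekday(weekday):
    # Relative weekdays resolve to the next such day strictly in the future:
    # "Wednesday" (weekday 2) -> 2022/3/16, since the reference day is a Tuesday.
    days_ahead = (weekday - REFERENCE.weekday()) % 7 or 7
    d = REFERENCE + timedelta(days=days_ahead)
    return {"day": d.day, "month": d.month, "year": d.year}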
For the experiments presented in the paper we adopt 3 data setups:
- 20-fold (or "low"): We use 1 fold for training and the other 19 folds for testing, doing this 20 times, each time with a different training fold. We then report the mean results across the 20 folds.
- 10-fold (or "mid"): We use 2 folds for training and the other 18 folds for testing, doing this 10 times with 10 pairs of training folds. We use consecutive indices for the training pairs (i.e. fold0.json and fold1.json, fold2.json and fold3.json, etc.). We then report the mean results across the 10 pairs of folds.
- Large: We use 18 folds for training and the other 2 for testing, doing this 10 times with 10 pairs of testing folds. This can be seen as an "inverse" 10-fold (see the sketch below).
These setups are designed to replicate the data setups found in production while not overfitting to a small test set.
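The three regimes can be summarised as the following fold pairings; this is an illustrative reconstruction, the actual split logic lives in data_loader.py:
def fold_splits(regime, n_folds=20):
    # Returns a list of (train_fold_ids, test_fold_ids) pairs for each regime.
    if regime == "low":    # 20-fold: train on 1 fold, test on the remaining 19
        return [([i], [j for j in range(n_folds) if j != i]) for i in range(n_folds)]
    if regime == "mid":    # 10-fold: train on consecutive pairs (0, 1), (2, 3), ...
        return [([i, i + 1], [j for j in range(n_folds) if j not in (i, i + 1)])
                for i in range(0, n_folds, 2)]
    if regime == "large":  # inverse 10-fold: train on 18 folds, test on the held-out pair
        return [([j for j in range(n_folds) if j not in (i, i + 1)], [i, i + 1])
                for i in range(0, n_folds, 2)]
    raise ValueError(f"unknown regime: {regime}")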
We also use 3 domain setups:
- BANKING: We train and test on banking data only.
- HOTELS: We train and test on hotels data only.
- ALL: We train and test on both domains. The folds of the two domains are joined for each data setup (i.e. banking/fold0.json is joined with hotels/fold0.json, and so on).
We provide the code to easily load all these setups in data_loader.py, e.g.:
from nlupp.data_loader import DataLoader
loader = DataLoader("<PATH_TO_NLUPP_DATA>")
banking_low = loader.get_data_for_experiment(domain="banking", regime="low")
In this example, the method returns a dictionary with the data already structured for the 20-fold BANKING experiments, i.e.
banking_low = {
    0: {"train": train_examples_for_fold_0,
        "test": test_examples_for_fold_0},
    1: {"train": train_examples_for_fold_1,
        "test": test_examples_for_fold_1},
    ...,
    19: {"train": train_examples_for_fold_19,
         "test": test_examples_for_fold_19}
}
The method can be called with domain = "banking", "hotels" or "all", and regime = "low", "mid" or "large".
To replicate the experiments, simply train 1 model per fold with the returned train-test splits and report the average results.
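A sketch of this loop, assuming user-supplied train_model and evaluate functions (these placeholders are not part of the repository):
from statistics import mean
from nlupp.data_loader import DataLoader

def train_model(train_examples):
    return None  # placeholder: fit and return a multi-label intent/slot model

def evaluate(model, test_examples):
    return 0.0   # placeholder: return a scalar metric, e.g. intent micro-F1, on the test fold

loader = DataLoader("<PATH_TO_NLUPP_DATA>")
splits = loader.get_data_for_experiment(domain="banking", regime="low")

fold_scores = [evaluate(train_model(s["train"]), s["test"]) for s in splits.values()]
print("mean score over the 20 folds:", mean(fold_scores))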
We also perform additional cross-domain experiments, where we train the models on all the data from one of the domains and test them on the generic intents of the other domain. To get the cross-domain data, pass either "banking-hotels" or "hotels-banking" as the domain argument (with no regime argument), e.g.:
banking_hotels = loader.get_data_for_experiment(domain="banking-hotels")
This will return:
banking_hotels = {
0: {"train": all_banking_examples,
"test": all_hotels_examples},
}
Note that in this case a single fold is returned. Also, all the non-generic intents and slots are removed from the annotations.
When using the NLU++ dataset in your work, please cite NLU++: A Multi-Label, Slot-Rich, Generalisable Dataset for Natural Language Understanding in Task-Oriented Dialogue.
@inproceedings{Casanueva2022,
author = {I{\~{n}}igo Casanueva and Ivan Vuli\'{c} and Georgios Spithourakis and Pawe\l~Budzianowski},
title = {NLU++: A Multi-Label, Slot-Rich, Generalisable Dataset for Natural Language Understanding in Task-Oriented Dialogue},
year = {2022},
month = {apr},
note = {Data available at https://github.com/PolyAI-LDN/task-specific-datasets},
url = {https://arxiv.org/abs/2204.13021},
booktitle = {TODO}
}
The datasets shared on this repository are licensed under the license found in the LICENSE file.