NLU++: A Multi-Label, Slot-Rich, Generalisable Dataset for Natural Language Understanding in Task-Oriented Dialogue
This dataset presents a challenging evaluation environment for dialogue NLU models. It is divided into two domains, banking and hotels, and provides high-quality examples that combine a large number of multi-label intents and slots. More details on the dataset can be found in our publication.
Domain | Number of examples | Number of intents | Number of slots |
---|---|---|---|
BANKING | 2,071 | 48 | 13 |
HOTELS | 1,009 | 40 | 14 |
ALL | 3,080 | 62 | 17 |
Example Query | Intents | Slots (values) |
---|---|---|
I want to change my restaurant reservation | change, booking, room | --- |
I am trying to make a transfer but it doesn’t let me | make, transfer_payment, not_working | --- |
Why can’t I amend my booking on tuesday? | why, change, booking, not_working | date (tuesday) |
How much less did I spend on Amazon during the current year? | how_much, less, transfer_payment | date_period (current year), company_name (Amazon) |
Can I make a reservation from the 1st of June to the 7th? | make, booking | date_from (1st of June), date_to (7th) |
The data is divided into two domains, banking and hotels, each in its corresponding directory. The data for each domain is split into 20 folds, each stored in a JSON file named fold0.json, fold1.json, etc. The structure of each example is the following:
{
"text": "How much did I spend in total until May on amazon prime?",
"intents": [
"how_much",
"transfer_payment_deposit"
],
"slots": {
"date_to": {
"text": "May",
"span": [
36,
39
],
"value": {
"day": 31,
"month": 5,
"year": 2022
}
},
"company_name": {
"text": "amazon prime",
"span": [
43,
55
],
"value": "amazon prime"
}
}
}
The intents (or intent modules) are given as a list in the field "intents" and the slot-values in the field "slots". If either field is missing, the example has no intents or slot-values, respectively. For each slot present, the character span is given in the field "span" and the canonical value in the field "value".
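As a minimal sketch, a single fold can be inspected with the standard json module (the file path below is illustrative, and we assume each fold file holds a list of example dicts):
import json

# Load one fold of the banking domain (the path is illustrative).
with open("banking/fold0.json") as f:
    examples = json.load(f)

for ex in examples:
    text = ex["text"]
    intents = ex.get("intents", [])  # a missing field means no intents
    slots = ex.get("slots", {})      # a missing field means no slot-values
    for name, slot in slots.items():
        start, end = slot["span"]
        # spans are character offsets into the raw text
        assert text[start:end] == slot["text"]
        print(name, slot["text"], slot["value"])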
For relative dates and times, the reference date is set to 2022/3/15 and the reference time is set to 09:00 a.m., e.g.:
{
"text": "today",
"slots": {
"date": {
"text": "today",
"span": [
0,
5
],
"value": {
"day": 15,
"month": 3,
"year": 2022
}
}
}
}
{
"text": "any table free in 2 hours?",
"intents": [
"request_info",
"restaurant",
"booking"
],
"slots": {
"time": {
"text": "in 2 hours",
"span": [
15,
25
],
"value": {
"hour": 11,
"minute": 0
}
}
}
}
The values of relative weekdays are always resolved to a date in the future, e.g.:
{
"text": "I'm leaving on Wednesday",
"slots": {
"date_to": {
"text": "Wednesday",
"span": [
15,
24
],
"value": {
"day": 16,
"month": 3,
"year": 2022
}
}
}
}
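These conventions can be reproduced with a few lines of Python; the sketch below (the helper names are ours, not part of the dataset tooling) shows how the examples above resolve against the reference date and time:
from datetime import datetime, timedelta

REFERENCE = datetime(2022, 3, 15, 9, 0)  # reference date 2022/3/15, reference time 09:00 a.m.

def resolve_in_hours(hours):
    # "in 2 hours" -> {"hour": 11, "minute": 0}
    t = REFERENCE + timedelta(hours=hours)
    return {"hour": t.hour, "minute": t.minute}

def resolve_weekday(weekday):
    # Relative weekdays resolve to the next such day strictly in the future:
    # "Wednesday" (weekday 2) -> 2022/3/16, since the reference day is a Tuesday.
    days_ahead = (weekday - REFERENCE.weekday()) % 7 or 7
    d = REFERENCE + timedelta(days=days_ahead)
    return {"day": d.day, "month": d.month, "year": d.year}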
For the experiments presented in the paper we adopt 3 data setups:
- 20-fold (or "low"): We use 1 fold for training and the other 19 folds for testing, doing this 20 times, each time with a different training fold. We then report the mean results across the 20 folds.
- 10-fold (or "mid"): We use 2 folds for training and the other 18 folds for testing, doing this 10 times with 10 pairs of training folds. We use consecutive indices for the training pairs (i.e. fold0.json and fold1.json, fold2.json and fold3.json, etc.). We then report the mean results across the 10 pairs of folds.
- Large: We use 18 folds for training and the other 2 for testing, doing this 10 times with 10 pairs of testing folds. This can be seen as an "inverse" 10-fold (see the sketch below).
These setups are designed to replicate the data setups found in production while not overfitting to a small test set.
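The three regimes can be summarised as the following fold pairings; this is an illustrative reconstruction, the actual split logic lives in data_loader.py:
def fold_splits(regime, n_folds=20):
    # Returns a list of (train_fold_ids, test_fold_ids) pairs for each regime.
    if regime == "low":    # 20-fold: train on 1 fold, test on the remaining 19
        return [([i], [j for j in range(n_folds) if j != i]) for i in range(n_folds)]
    if regime == "mid":    # 10-fold: train on consecutive pairs (0, 1), (2, 3), ...
        return [([i, i + 1], [j for j in range(n_folds) if j not in (i, i + 1)])
                for i in range(0, n_folds, 2)]
    if regime == "large":  # inverse 10-fold: train on 18 folds, test on the held-out pair
        return [([j for j in range(n_folds) if j not in (i, i + 1)], [i, i + 1])
                for i in range(0, n_folds, 2)]
    raise ValueError(f"unknown regime: {regime}")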
We also use 3 domain setups:
- BANKING: We train and test on banking data only.
- HOTELS: We train and test on hotels data only.
- ALL: We train and test on both domains. The folds of the two domains are joined for each data setup (i.e. banking/fold0.json is joined with hotels/fold0.json, and so on).
We provide the code to easily load all these setups in data_loader.py, e.g.:
from nlupp.data_loader import DataLoader
loader = DataLoader("<PATH_TO_NLUPP_DATA>")
banking_low = loader.get_data_for_experiment(domain="banking", regime="low")
In this example, the method returns a dictionary with the data already structured for the 20-fold BANKING experiments, i.e.
banking_low = {
    0: {"train": train_examples_for_fold_0,
        "test": test_examples_for_fold_0},
    1: {"train": train_examples_for_fold_1,
        "test": test_examples_for_fold_1},
    ...,
    19: {"train": train_examples_for_fold_19,
         "test": test_examples_for_fold_19}
}
The method can be called with domain = "banking", "hotels" or "all", and regime = "low", "mid" or "large".
To replicate the experiments, simply train 1 model per fold with the returned train-test splits and report the average results.
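A sketch of this loop, assuming user-supplied train_model and evaluate functions (these placeholders are not part of the repository):
from statistics import mean
from nlupp.data_loader import DataLoader

def train_model(train_examples):
    return None  # placeholder: fit and return a multi-label intent/slot model

def evaluate(model, test_examples):
    return 0.0   # placeholder: return a scalar metric, e.g. intent micro-F1, on the test fold

loader = DataLoader("<PATH_TO_NLUPP_DATA>")
splits = loader.get_data_for_experiment(domain="banking", regime="low")

fold_scores = [evaluate(train_model(s["train"]), s["test"]) for s in splits.values()]
print("mean score over the 20 folds:", mean(fold_scores))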
We also perform additional cross-domain experiments, where we train the models on all the data from one of the domains and test them on the generic intents of the other domain. To get the cross-domain data, pass either "banking-hotels" or "hotels-banking" as the domain argument (with no regime argument), e.g.:
banking_hotels = loader.get_data_for_experiment(domain="banking-hotels")
This will return:
banking_hotels = {
0: {"train": all_banking_examples,
"test": all_hotels_examples},
}
Note that in this case a single fold is returned. Also, all the non-generic intents and slots are removed from the annotations.
When using the NLU++ dataset in your work, please cite NLU++: A Multi-Label, Slot-Rich, Generalisable Dataset for Natural Language Understanding in Task-Oriented Dialogue.
@inproceedings{Casanueva2022,
author = {I{\~{n}}igo Casanueva and Ivan Vuli\'{c} and Georgios Spithourakis and Pawe\l~Budzianowski},
title = {NLU++: A Multi-Label, Slot-Rich, Generalisable Dataset for Natural Language Understanding in Task-Oriented Dialogue},
year = {2022},
month = {apr},
note = {Data available at https://github.com/PolyAI-LDN/task-specific-datasets},
url = {https://arxiv.org/abs/2204.13021},
booktitle = {TODO}
}
The datasets shared on this repository are licensed under the license found in the LICENSE file.