Each dataset is stored in multiple JSON files. For example, the ag dataset is stored in `train.json` and `test.json`.
The JSON file contains the following fields:

- `label_mapping`: a list of strings. The `label_mapping` maps an integer label to the actual meaning of that label. This list is not used by the algorithm.
- `cased`: a bool value indicating whether the dataset is cased or uncased. Sentences in uncased datasets are all lowercase.
- `paraphrase_field`: either `text0` or `text1`. The `paraphrase_field` indicates which sentence in each data record should be paraphrased.
- `data`: a list of data records. Each data record contains:
    - `label`: an integer indicating the classification label of the text.
    - `text0`:
        - For topic and sentiment classification datasets, `text0` stores the text to be classified.
        - For natural language inference datasets, `text0` stores the premise.
    - `text1`:
        - For topic and sentiment classification datasets, this field is omitted.
        - For natural language inference datasets, `text1` stores the hypothesis.
A topic / sentiment classification example:
```
{
    "label_mapping": [
        "World",
        "Sports",
        "Business",
        "Sci/Tech"
    ],
    "cased": true,
    "paraphrase_field": "text0",
    "data": [
        {
            "label": 1,
            "text0": "Boston won the NBA championship in 2008."
        },
        {
            "label": 3,
            "text0": "Apple releases its latest cell phone."
        },
        ...
    ]
}
```
A natural language inference example:
```
{
    "label_mapping": [
        "neutral",
        "entailment",
        "contradiction"
    ],
    "cased": true,
    "paraphrase_field": "text1",
    "data": [
        {
            "label": 0,
            "text0": "A person on a horse jumps over a broken down airplane.",
            "text1": "A person is training his horse for a competition."
        },
        {
            "label": 2,
            "text0": "A person on a horse jumps over a broken down airplane.",
            "text1": "A person is at a diner, ordering an omelette."
        },
        ...
    ]
}
```
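
For reference, a dataset file can be loaded with nothing more than the standard library. The sketch below is not part of fibber's API; it is a minimal loader that sanity-checks the fields described above (the path argument is up to you):

```python
import json

def load_dataset(path):
    """Load a dataset JSON file and sanity-check the fields described above."""
    with open(path) as f:
        dataset = json.load(f)

    assert isinstance(dataset["label_mapping"], list)
    assert isinstance(dataset["cased"], bool)
    assert dataset["paraphrase_field"] in ("text0", "text1")
    for record in dataset["data"]:
        # Every label must index into label_mapping; text0 is always present,
        # text1 only for natural language inference datasets.
        assert 0 <= record["label"] < len(dataset["label_mapping"])
        assert "text0" in record
    return dataset
```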
We provide scripts to help you easily download all datasets. There are two options:

**Download data preprocessed by us.**

We have preprocessed the datasets and uploaded them to AWS. Use the following command to download all of them:

```bash
python3 -m fibber.datasets.download_datasets
```
After executing the command, each dataset is stored at `~/.fibber/datasets/<dataset_name>/*.json`. For example, the ag dataset is stored in `~/.fibber/datasets/ag/`, and that folder contains two files, `train.json` and `test.json`.
**Download and process data from the original source.**

You can also download each dataset from its original source and process it locally:

```bash
python3 -m fibber.datasets.download_datasets --process_raw 1
```

This script downloads data from the original source to the `~/.fibber/datasets/<dataset_name>/raw/` folder, then processes the raw data to generate the JSON files.
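
Either way, you can quickly confirm the result from Python. This is only an illustrative check, assuming the default `~/.fibber` location and the ag dataset:

```python
import json
import os

# Assumed default location used by the download script.
dataset_dir = os.path.expanduser("~/.fibber/datasets/ag")

for split in ("train.json", "test.json"):
    with open(os.path.join(dataset_dir, split)) as f:
        dataset = json.load(f)
    print(split, "->", len(dataset["data"]), "records,",
          "labels:", dataset["label_mapping"])
```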
During the benchmark process, we save results in several files.
The intermediate result `<output_dir>/<dataset>-<strategy>-<date>-<time>-tmp.json` stores the paraphrased sentences. Strategies can take minutes or even hours to run on some datasets, so we save this result every 30 seconds. The file format is similar to the dataset file. For each data record, we add a new field, `text0_paraphrases` or `text1_paraphrases`, depending on the `paraphrase_field`.
An example is as follows.
```
{
    "label_mapping": [
        "World",
        "Sports",
        "Business",
        "Sci/Tech"
    ],
    "cased": true,
    "paraphrase_field": "text0",
    "data": [
        {
            "label": 1,
            "text0": "Boston won the NBA championship in 2008.",
            "text0_paraphrases": ["The 2008 NBA championship is won by Boston.", ...]
        },
        ...
    ]
}
```
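
Because this file is rewritten every 30 seconds, it can be polled to monitor a running strategy. Below is a hypothetical monitoring sketch; the file name is just one instance of the naming pattern above, not a real output:

```python
import json

# Hypothetical file name following the
# <output_dir>/<dataset>-<strategy>-<date>-<time>-tmp.json pattern.
with open("exp-ag/ag-RandomStrategy-2020-01-01-000000-tmp.json") as f:
    tmp = json.load(f)

# Count records that already carry paraphrases.
field = tmp["paraphrase_field"] + "_paraphrases"  # e.g. "text0_paraphrases"
done = sum(1 for record in tmp["data"] if field in record)
print(f"{done}/{len(tmp['data'])} records paraphrased so far")
```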
The result `<output_dir>/<dataset>-<strategy>-<date>-<time>-with-metrics.json` stores the paraphrased sentences together with their metrics. Computing metrics may take a few minutes on some datasets, so we save this result every 30 seconds as well. The file format is similar to the intermediate file. For each data record, we add two new fields, `original_text_metrics` and `paraphrase_metrics`.
An example is as follows.
```
{
    "label_mapping": [
        "World",
        "Sports",
        "Business",
        "Sci/Tech"
    ],
    "cased": true,
    "paraphrase_field": "text0",
    "data": [
        {
            "label": 1,
            "text0": "Boston won the NBA championship in 2008.",
            "text0_paraphrases": [..., ...],
            "original_text_metrics": {
                "EditDistanceMetric": 0,
                "USESemanticSimilarityMetric": 1.0,
                "GloVeSemanticSimilarityMetric": 1.0,
                "GPT2GrammarQualityMetric": 1.0,
                "BertClassifier": 1
            },
            "paraphrase_metrics": [
                {
                    "EditDistanceMetric": 7,
                    "USESemanticSimilarityMetric": 0.91,
                    "GloVeSemanticSimilarityMetric": 0.94,
                    "GPT2GrammarQualityMetric": 2.3,
                    "BertClassifier": 1
                },
                ...
            ]
        },
        ...
    ]
}
```
The `original_text_metrics` field stores a dict of several metrics computed by comparing the original text against itself. The `paraphrase_metrics` field is a list with the same length as the paraphrase list in this data record; each element is a dict comparing one paraphrased text against the original text.
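
As an example of consuming these fields, the sketch below (not a fibber API; the file name is hypothetical) collects the paraphrases whose predicted label differs from the true label, assuming `BertClassifier` stores the predicted label as the examples above suggest:

```python
import json

def flipped_paraphrases(record, paraphrase_field):
    """Return (paraphrase, metrics) pairs whose BertClassifier prediction
    differs from the record's label, i.e. candidate adversarial examples."""
    pairs = zip(record[paraphrase_field + "_paraphrases"],
                record["paraphrase_metrics"])
    return [(text, metrics) for text, metrics in pairs
            if metrics["BertClassifier"] != record["label"]]

with open("exp-ag/ag-RandomStrategy-2020-01-01-000000-with-metrics.json") as f:
    results = json.load(f)

for record in results["data"]:
    for text, metrics in flipped_paraphrases(record, results["paraphrase_field"]):
        print(record["label"], "->", metrics["BertClassifier"], ":", text)
```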