# MNLI-TR

This dataset was created by translating MultiNLI into Turkish using Amazon Translate. MultiNLI is an English dataset used for natural language inference tasks. The official website for the dataset can be found at NLI-TR.

## Dataset Details

This dataset consists of 433K machine-translated Turkish sentence pairs. The MultiNLI dataset was created by selecting a premise sentence from a preexisting text source and asking a human annotator to compose a novel sentence to pair with it as a hypothesis. MultiNLI covers ten different genres. Only five of the ten genres appear in the training set, but all ten appear in the development and test sets; this design makes it possible to assess model performance on unseen domains. MNLI-TR was created by translating MultiNLI using Amazon Translate.

Each pair is annotated with annotator labels, a genre, a gold label, a pair ID, and a prompt ID. The annotator labels are the labels assigned by the individual annotators. The genre describes the domain the premise and hypothesis come from. The gold label is the label chosen by the majority of annotators. The pair ID is a unique identifier assigned to each pair, and the prompt ID identifies the prompt shown to the annotators during the annotation process.

## Samples

Examples from the original MultiNLI dataset and the MNLI-TR dataset are given below.

An example from MultiNLI:

```json
{
        "annotator_labels": [
            "neutral"
        ],
        "genre": "telephone",
        "gold_label": "neutral",
        "pairID": "48889n",
        "promptID": "48889",
        "sentence1": "and uh then you'd be willing to give up your job to stay home and with or stay with the children",
        "sentence1_binary_parse": "( and ( uh ( then ( you ( 'd ( be ( willing ( to ( ( give up ) ( your ( job ( to ( ( ( stay ( ( home and ) with ) ) or ) ( stay ( with ( the children ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) )",
        "sentence1_parse": "(ROOT (FRAG (CC and) (NP (NP (NNP uh)) (SBAR (S (ADVP (RB then)) (NP (PRP you)) (VP (MD 'd) (VP (VB be) (ADJP (JJ willing) (S (VP (TO to) (VP (VB give) (PRT (RP up)) (NP (PRP$ your) (NN job) (S (VP (TO to) (VP (VP (VB stay) (UCP (ADVP (RB home)) (CC and) (PP (IN with)))) (CC or) (VP (VB stay) (PP (IN with) (NP (DT the) (NNS children)))))))))))))))))))",
        "sentence2": "Is your dream to stay at home?",
        "sentence2_binary_parse": "( ( ( Is your ) ( dream ( to ( stay ( at home ) ) ) ) ) ? )",
        "sentence2_parse": "(ROOT (SQ (VBZ Is) (NP (PRP$ your)) (NP (NP (NN dream)) (S (VP (TO to) (VP (VB stay) (PP (IN at) (NP (NN home))))))) (. ?)))"
}
```

The corresponding Turkish translation in MNLI-TR:

```json
{
        "annotator_labels": [
            "neutral"
        ],
        "genre": "telephone",
        "gold_label": "neutral",
        "pairID": "48889n",
        "promptID": "48889",
        "sentence1": "Ve o zaman evde kalmak ya da çocuklarla kalmak için işinden vazgeçersin.",
        "sentence2": "Hayaliniz evde kalmak mı?"
}
```
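
Because the translated records keep the original `pairID`, an English example and its Turkish counterpart can be aligned on that field. Below is a minimal sketch in Python (the function name and the in-memory record lists are illustrative, not part of the distribution):

```python
def align_by_pair_id(en_records, tr_records):
    """Pair each English MultiNLI record with its MNLI-TR translation.

    Both arguments are assumed to be lists of dicts parsed from the
    respective files, each carrying a "pairID" field as in the samples above.
    """
    tr_by_id = {r["pairID"]: r for r in tr_records}
    return [(en, tr_by_id[en["pairID"]])
            for en in en_records if en["pairID"] in tr_by_id]
```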

## Fields

| field | dtype |
| --- | --- |
| annotator_labels | string list |
| genre | string |
| gold_label | string |
| pairID | string |
| promptID | string |
| sentence1 | string |
| sentence2 | string |
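
Assuming each split is distributed as a JSON Lines file (one record per line, shaped like the samples above), the records can be loaded with a few lines of Python; the file name below is illustrative:

```python
import json

def load_nli_records(path):
    """Read one NLI split stored as JSON Lines (one JSON object per line)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Illustrative file name; check the NLI-TR release for the actual names.
train = load_nli_records("multinli_tr_1.0_train.jsonl")
print(train[0]["gold_label"], "|", train[0]["sentence1"])
```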

## Splits

Since the test split of MultiNLI is not publicly available, MNLI-TR contains only the training and development splits. The matched development set contains genres that are present in the training set, and the mismatched development set contains genres that are not present in the training set. In the table below, the first five genres form the matched part of the development set and the last five form the mismatched part.

| Genre | Train | Dev |
| --- | ---: | ---: |
| FICTION | 77,348 | 2,000 |
| GOVERNMENT | 77,350 | 2,000 |
| SLATE | 77,306 | 2,000 |
| TELEPHONE | 83,348 | 2,000 |
| TRAVEL | 77,350 | 2,000 |
| 9/11 | 0 | 2,000 |
| FACE-TO-FACE | 0 | 2,000 |
| LETTERS | 0 | 2,000 |
| OUP | 0 | 2,000 |
| VERBATIM | 0 | 2,000 |
| TOTAL | 392,702 | 20,000 |
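
The counts in the table can be re-derived from the raw records with a simple tally over the `genre` field; a sketch, reusing the loader shown earlier:

```python
from collections import Counter

def genre_counts(records):
    """Count premise-hypothesis pairs per genre in one split."""
    return Counter(r["genre"] for r in records)

# e.g. genre_counts(train) should reproduce the Train column above.
```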

## Dataset Creation

### Curation Rationale

Large annotated datasets in NLP are overwhelmingly in English, which is an obstacle to progress in other languages, and obtaining new annotated resources for each task in each language would be prohibitively expensive. Since modern machine translation systems are robust, they can be used to translate English NLI datasets into Turkish.

### Data Source

The premise sentences were collected from ten different freely available text sources. Nine of them come from the Open American National Corpus. For the tenth genre (Fiction), a number of fiction works were collected from Project Gutenberg (gutenberg.org). All sentences were then translated into Turkish using Amazon Translate.
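
The translation pipeline itself is not distributed with the dataset, but as a rough sketch, a single sentence can be translated with Amazon Translate through the boto3 client (the region choice and credentials setup here are assumptions on the reader's side):

```python
import boto3

# Assumes AWS credentials with access to Amazon Translate are configured.
translate = boto3.client("translate", region_name="us-east-1")

def to_turkish(text):
    """Translate one English sentence to Turkish with Amazon Translate."""
    response = translate.translate_text(
        Text=text,
        SourceLanguageCode="en",
        TargetLanguageCode="tr",
    )
    return response["TranslatedText"]

# e.g. to_turkish("Is your dream to stay at home?") should yield something
# close to the Turkish sample above, depending on the service version.
```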

### Annotations

MultiNLI used gethybrid.io for data annotation. During the annotation process, five different prompts were prepared according to the genres. Annotators were asked to create entailment, contradiction, and neutral hypotheses for the given premises; this process creates the premise-hypothesis pairs in the dataset. Workers were then presented with pairs of sentences and asked to supply a single label (entailment, contradiction, or neutral) for each pair. During validation, each pair was shown to four additional workers, which, together with the original annotator's label, adds up to five labels per pair in the dataset.
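
Put concretely, the gold label is the majority vote over the five annotator labels; as in MultiNLI, examples where no label wins a majority are marked with a placeholder. A minimal sketch:

```python
from collections import Counter

def gold_label(annotator_labels):
    """Majority vote over annotator labels; '-' marks no-consensus
    examples, following the convention used in MultiNLI."""
    label, count = Counter(annotator_labels).most_common(1)[0]
    return label if count > len(annotator_labels) / 2 else "-"

assert gold_label(["neutral"] * 3 + ["entailment", "contradiction"]) == "neutral"
```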

A detailed description of the dataset annotations can be found in the MultiNLI paper.

## Quality

There are two aspects to measuring the quality of the dataset:

  1. The quality of the original MultiNLI dataset
  2. The quality of the translation from MultiNLI to MNLI-TR

The quality of the MultiNLI dataset is discussed in detail in the MultiNLI paper; the quality of the translation is discussed here.

The validation of the translation process is described in the paper as follows:

"For expert evaluation, we grouped the translations into example sets of four sentences. We distributed the sets to the experts so that each set (and sentence) was examined by five randomly chosen experts and each expert co-examined approximately the same number of sets with each other expert. Each expert evaluated the translation by (i) grading the translation quality between 1 and 5 (inclusive; 5 the best) and (ii) checking if the translation altered the semantic relation. We distributed an annotation guide to the team to standardize the criteria. In total, 500 example sets (2,000 translated sentences) were examined by five experts, yielding 10,000 annotations. The annotation-level analysis compares each new annotation with the gold label on the original English example. The majority-level analysis compares only the majority label (if any) of the five new annotations with the English gold label"

The results of the validation process are given in the following table:

| Split | Translation Quality | Annotation-level Label Consistency | Majority-level Label Consistency |
| --- | --- | --- | --- |
| Train | 4.56 (0.80) | 89.96% | 96.22% |
| Dev | 4.42 (0.86) | 88.53% | 95.33% |
| Test | 4.49 (0.82) | 92.53% | 98.00% |
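
For reference, the two consistency measures can be computed as below. This is a sketch assuming each evaluated example is available as its English gold label plus the five expert labels; whether no-majority examples are excluded from the majority-level denominator is an assumption here, not stated in the paper excerpt.

```python
from collections import Counter

def label_consistency(examples):
    """examples: iterable of (gold_label, [five expert labels]) tuples.

    Returns (annotation-level, majority-level) consistency: the share of
    individual expert labels matching the English gold label, and the share
    of majority expert labels (where one exists) matching it.
    """
    ann_hits = ann_total = maj_hits = maj_total = 0
    for gold, labels in examples:
        ann_hits += sum(lab == gold for lab in labels)
        ann_total += len(labels)
        label, count = Counter(labels).most_common(1)[0]
        if count > len(labels) / 2:  # a majority label exists
            maj_total += 1
            maj_hits += label == gold
    return ann_hits / ann_total, maj_hits / maj_total
```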

## Additional Information

### Dataset Curators

Published by Emrah Budur, Rıza Özçelik, Tunga Güngör, and Christopher Potts.

### Citation Information

Please cite the following paper (arXiv) if you find this dataset useful:

Emrah Budur, Rıza Özçelik, Tunga Güngör, and Christopher Potts. 2020. Data and Representation for Turkish Natural Language Inference. In Proceedings of EMNLP.

```bibtex
@inproceedings{budur-etal-2020-data,
    title = "Data and {R}epresentation for {T}urkish {N}atural {L}anguage {I}nference",
    author = {Budur, Emrah  and
      {\"O}z{\c{c}}elik, R{\i}za  and
      G{\"u}ng{\"o}r, Tunga  and
      Potts, Christopher},
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.emnlp-main.662",
    doi = "10.18653/v1/2020.emnlp-main.662",
    pages = "8253--8267"
}
```

### Contact

Uploaded and documented by Arda Goktogan: ardagoktogan gmail com