The GLUE Benchmark is a group of nine classification tasks on single sentences or pairs of sentences (a short loading sketch follows the list):
- CoLA (Corpus of Linguistic Acceptability) Determine if a sentence is grammatically correct or not.
- MNLI (Multi-Genre Natural Language Inference) Determine if a sentence entails, contradicts, or is unrelated to a given hypothesis. (This dataset has two versions: matched, where the validation and test sets come from the same distribution as the training data, and mismatched, where they use out-of-domain data.)
- MRPC (Microsoft Research Paraphrase Corpus) Determine if two sentences are paraphrases of one another or not.
- QNLI (Question-answering Natural Language Inference) Determine if the answer to a question is in the second sentence or not. (This dataset is built from the SQuAD dataset.)
- QQP (Quora Question Pairs) Determine if two questions are semantically equivalent or not.
- RTE (Recognizing Textual Entailment) Determine if a sentence entails a given hypothesis or not.
- SST-2 (Stanford Sentiment Treebank) Determine if a sentence has a positive or negative sentiment.
- STS-B (Semantic Textual Similarity Benchmark) Determine the similarity of two sentences with a score from 0 to 5.
- WNLI (Winograd Natural Language Inference) Determine if a sentence with an ambiguous pronoun entails a version of the sentence in which the pronoun is replaced by a candidate referent. (This dataset is built from the Winograd Schema Challenge dataset.)
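
As a minimal sketch of how to get at these tasks, assuming the 🤗 Datasets library and the GLUE configuration names used on the Hugging Face Hub (lowercased task names such as `"cola"`, `"sst2"`, `"mnli"`), each one can be loaded by name:

```python
from datasets import load_dataset

# Load one GLUE task by its configuration name (here CoLA).
cola = load_dataset("glue", "cola")
print(cola)              # DatasetDict with "train", "validation", and "test" splits
print(cola["train"][0])  # a sentence plus its label (1 = acceptable, 0 = not)

# MNLI is the exception: its matched and mismatched versions surface as
# separate splits, e.g. "validation_matched" and "validation_mismatched".
mnli = load_dataset("glue", "mnli")
print(mnli["validation_matched"][0])
```

The same pattern works for the other seven tasks; only the configuration name changes.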