This is an attempt to compile a consolidated list of datasets available for different NLP tasks. It is a work in progress.
- AllenNLP DROP: https://allennlp.org/drop.html
- Google Natural Questions: https://ai.google.com/research/NaturalQuestions
- Stanford SQuAD 2.0: https://rajpurkar.github.io/SQuAD-explorer/
- Stanford SQuAD: https://rajpurkar.github.io/SQuAD-explorer/
- Microsoft MARCO: http://www.msmarco.org/dataset.aspx
- CMU RACE: http://www.cs.cmu.edu/~glai1/data/race/
- University of Washington TriviaQA: http://nlp.cs.washington.edu/triviaqa/
- Microsoft WikiQA: https://www.microsoft.com/en-us/download/details.aspx?id=52419
- CNN/Daily Mail: https://cs.nyu.edu/~kcho/DMQA/
- NewsQA: https://datasets.maluuba.com/NewsQA/dl
- TREC: http://trec.nist.gov/data/qamain.html
- CoQA: https://stanfordnlp.github.io/coqa/
- CSQA: https://amritasaha1812.github.io/CSQA/download/
- QuAC: https://quac.ai
A compilation of dialog generation datasets of all types: https://breakend.github.io/DialogDatasets/
- FB15K (FreeBase): https://github.com/ttrouill/complex/tree/master/datasets
- WN18 (WordNet): https://github.com/ttrouill/complex/tree/master/datasets
- IMDb Movie Review: http://ai.stanford.edu/~amaas/data/sentiment/
- Movie review data (Cornell): http://www.cs.cornell.edu/people/pabo/movie-review-data/
- Yelp dataset: https://www.yelp.com/dataset/challenge
- RDF data in HDT format: http://www.rdfhdt.org/datasets/
- Wikidata: https://www.wikidata.org/wiki/Wikidata:Database_download
- MS COCO: http://cocodataset.org/#home
- Flickr8K: http://nlp.cs.illinois.edu/HockenmaierGroup/Framing_Image_Description/KCCA.html
- Flickr30K: http://shannon.cs.illinois.edu/DenotationGraph/
- PASCAL: http://vision.cs.uiuc.edu/pascal-sentences/
- Visual Genome: http://visualgenome.org
- InstaPIC: https://github.com/cesc-park/attend2u
- YFCC100M: http://yfcc100m.appspot.com