Corpus Data Statement

Following Datasheets for Datasets (Gebru et al. 2020) and earlier NLP-specific work (Bender and Friedman 2018), we maintain this data statement to make the data's origins and details clear.

Motivation

  • For what purpose was the dataset created? This dataset is for studying computational models trained to reason about prototypical situations. It is anticipated that it will not be used directly in a downstream task, but rather as a way of studying the knowledge (and biases) about prototypical situations already contained in pre-trained models. The scraped data is sourced from fan websites for a game show (Family Feud), and is thus optimized for entertainment rather than scientific rigor.
  • Who created the dataset and on behalf of which entity?: See the author list in the paper.
  • Who funded the creation of the dataset?: See the acknowledgments in the paper.

Composition

Regarding the scraped training and scraped-dev sets:

  • What do the instances that comprise the dataset represent?: Each instance represents a survey question from the Family Feud game show and its reported answer clusters.
  • How many instances are there in total?: 9,789 instances.
  • Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?: This is a sample from the larger set of all transcriptions of such questions across these sites.
  • What data does each instance consist of?: Each instance is a question, a set of answers, and a count associated with each answer (a hypothetical loading sketch follows this list).
  • Is any information missing from individual instances?: This is unclear.
  • Are there recommended data splits (e.g., training, development/validation, testing)?: The data is sorted into suggested splits, which are kept separate within the data/ folder.
  • Are there any errors, sources of noise, or redundancies in the dataset?: All data was scraped from fan sites, and is therefore prone to erroneous or incomplete additions. Redundancies were found with various automatic metrics (such as edit distance) and removed during processing, as were obviously incomplete or incorrect answer sets (e.g., when answer counts totaled more than 100); a rough illustration of these filters appears after this list. However, we expect there to be noise and redundancies not captured by that process.
  • Is the dataset self-contained, or does it link to or otherwise rely on external resources?: The data is self-contained.
  • Does the dataset contain data that might be considered confidential? The data does not concern individuals and thus does not contain any information to identify persons. Crowdsourced answers do not provide any user identifiers.
  • Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? Not egregiously so (the questions were all designed to be shown on television, or are replications thereof).
  • Does the dataset contain data that might be considered sensitive in any way? As the questions address prototypical/stereotypical activities, models trained on more offensive material (such as large language models) may provide offensive answers to them. While we found a few questions that we worried would actively encourage models to provide offensive answers, we cannot guarantee that the data is free of such questions. Even a perfectly innocent version of this dataset would encourage models to express generalizations about situations, and may therefore elicit offensive material already contained in language models.
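
As a rough illustration of the instance format and the filtering described above, here is a minimal Python sketch. The field names ("question", "answers" as a mapping from answer string to survey count) and the file name data/train.jsonl are assumptions for illustration only, not necessarily the repository's actual schema; likewise, difflib.SequenceMatcher stands in for the edit-distance metric actually used. Consult the files in data/ for the real format.

```python
import json
from difflib import SequenceMatcher


def load_instances(path):
    # The data/ files store one JSON object per line (JSONL).
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


def plausible(instance):
    # Drop obviously incomplete or incorrect answer sets, e.g. when the
    # reported survey counts total more than 100.
    # NOTE: "answers" as a {answer: count} mapping is a hypothetical schema.
    return sum(instance["answers"].values()) <= 100


def near_duplicate(q1, q2, threshold=0.9):
    # Crude string-similarity check standing in for the edit-distance
    # redundancy filtering mentioned above (not the exact metric used).
    return SequenceMatcher(None, q1.lower(), q2.lower()).ratio() >= threshold


def deduplicate(instances):
    kept = []
    for inst in instances:
        if any(near_duplicate(inst["question"], k["question"]) for k in kept):
            continue
        kept.append(inst)
    return kept


if __name__ == "__main__":
    data = [i for i in load_instances("data/train.jsonl") if plausible(i)]
    print(len(deduplicate(data)), "instances after filtering")
```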

Collection Process

  • How was the data associated with each instance acquired?: See paper for details. Scraped data was acquired through fan transcriptions at https://www.familyfeudinfo.com and http://familyfeudfriends.arjdesigns.com/; crowdsourced data was acquired with FigureEight (now Appen).
  • If the dataset is a sample from a larger set, what was the sampling strategy?: Deterministic filtering was used (as noted elsewhere), but no probabilistic sampling was used.
  • Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated?: Crowdworkers were used to create the evaluation dataset. Time per task was measured, and per-task pay was set with the aim of providing a living wage.
  • Over what timeframe was the data collected?: Crowdsourced answers were collected between fall 2018 and spring 2019. Scraped data covers question-answer pairs collected since the origin of the show in 1976.
  • Annotator demographics: The original question-answer pairs were generated by surveys of US English speakers conducted from 1976 to the present day. The crowdsourced evaluation was geographically constrained to US English speakers but not otherwise constrained. Additional demographic data was not collected.

Preprocessing/cleaning/labeling

  • Was any preprocessing/cleaning/labeling of the data done?: Obvious typos in the crowdsourced answer set were corrected, and clearly incorrect answers were removed.

Uses

  • Has the dataset been used for any tasks already? The dataset has been used to train an interactive demo, but has not been deployed for other tasks.
  • Is there a repository that links to any or all papers or systems that use the dataset?: Such a list can be maintained here.
  • What (other) tasks could the dataset be used for? We encourage use of the dataset to study stereotypes in pre-trained language models.
  • Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?: All original questions were written with US television audiences in mind, and therefore characterize prototypical situations through that lens. Any usage that deploys this data to model prototypical situations globally will carry that bias.
  • Are there tasks for which the dataset should not be used?: We caution regarding free-form use of this dataset for interactive "commonsense question answering" purposes without more study of the biases and stereotypes learned by such models.

Distribution

  • This dataset is distributed here via GitHub.
  • Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)?: We use CC-BY-4.0; see LICENSE.
  • Have any third parties imposed IP-based or other restrictions on the data associated with the instances?: Not at this time.

Maintenance

  • Who is supporting/hosting/maintaining the dataset? The listed authors are maintaining/supporting the dataset. They pledge to help support issues, but cannot guarantee long-term support.
  • How can the owner/curator/manager of the dataset be contacted?: See author contacts in the paper, or post issues in this repository.
  • Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)?: We have started an omitted.jsonl file of instances to be removed from the training set; if other instances are found that should not be used in training, we can add them to that file. Tagged releases will be used, and a change history updated, for any changes to the training data.
  • If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If interested, contact the authors of the paper.