
Code review Q2 2020 : Data publication #3

luizaandrade opened this issue May 18, 2020 · 0 comments
Data cleaning code review checklist

Data source/survey round

Date

List of files to be checked [Add names or links]

  • Master script

  • Clean dataset(s)

  • Cleaning scripts

Identifiers

  • Deidentified data does not contain identifying variables
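A name-based screen can serve as a first pass on this item before a human review. The following is a minimal Python sketch (the checklist's do-file items suggest Stata, but the idea is language-agnostic); the PII name list and file layout are illustrative assumptions, not an exhaustive standard.

```python
# Sketch: flag column names in a CSV that look like direct identifiers.
# The pattern list below is illustrative only -- a manual review of the
# data is still required before publication.
import csv

PII_PATTERNS = ("name", "phone", "email", "address", "gps", "ssn")

def flag_identifying_columns(path):
    """Return header columns whose names match a known PII pattern."""
    with open(path, newline="") as f:
        header = next(csv.reader(f))
    return [c for c in header if any(p in c.lower() for p in PII_PATTERNS)]
```

Note that this only inspects column names; identifying content can also hide in free-text fields, so it complements rather than replaces a manual check.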

Reproducibility

  • All scripts run from the master after adding the correct folder path to line(s) X (and XX)
  • The master script is organized in a way that allows you to understand the general tasks being performed in the code
  • The master script tracks which scripts create and use which files
  • The data sets created by the reviewer are exactly the same as those shared by the coder
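One way to verify the last item is to compare file checksums of the coder's and the reviewer's outputs. A minimal Python sketch, with hypothetical file names:

```python
# Sketch: check that the reviewer's re-created dataset is byte-identical
# to the dataset shared by the coder, via SHA-256 checksums.
import hashlib

def file_checksum(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def datasets_match(coder_file, reviewer_file):
    """True if the two files have identical contents."""
    return file_checksum(coder_file) == file_checksum(reviewer_file)
```

Byte-level comparison can be too strict for formats that embed metadata such as a save timestamp (e.g. Stata .dta files); in that case a content-level comparison of the loaded data may be needed instead.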

Code organization and readability

  • Code names are informative
  • It is clear in the code why tasks are being executed
  • The code structure facilitates understanding of the tasks
  • Code uses white space to improve readability
  • There is extensive use of comments to explain the code
  • The code is efficient (tasks are executed in the simplest way possible, loops are used when needed rather than repeating lines, pre-defined functions are used)
  • Common tasks are abstracted and automated (e.g. using functions or macros)

Clean data set checks (pre-publication)

  • The data does not include direct identifiers
  • The data set has a clearly labeled, uniquely and fully identifying ID variable
  • The level of observation of the data set is clear from the dataset name, ID variables and documentation
  • Variables have informative labels or an accompanying dictionary
  • Categorical variables have clear and informative value labels
  • No modification is made from the raw to the clean data other than correcting problems
  • No raw variables are processed (winsorized, for example)
  • Variables can be easily traced back to the original questionnaire
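The ID-variable check above (uniquely and fully identifying) can be automated. A minimal Python sketch for CSV data; the column and file names are hypothetical, and in Stata the equivalent check would be `isid`:

```python
# Sketch: verify that an ID column uniquely and fully identifies the
# observations in a CSV dataset.
import csv
from collections import Counter

def check_id_variable(path, id_var):
    """Return (duplicate_ids, n_missing) for the given ID column.

    Both should be empty/zero for a valid identifying variable.
    """
    with open(path, newline="") as f:
        ids = [row[id_var] for row in csv.DictReader(f)]
    n_missing = sum(1 for v in ids if v == "")
    counts = Counter(v for v in ids if v != "")
    duplicates = sorted(v for v, n in counts.items() if n > 1)
    return duplicates, n_missing
```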

Data cleaning tasks

  • Are new variables being created in the cleaning do-files?
  • Are any changes being made to observation values in the cleaning do-files?
  • Check merges: Are any observations dropped? If so, is there a clear justification for that? If any observations didn't match, is that explained in the comments?
  • Are missing values coded consistently? Are extended missing values used?
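The merge check above corresponds to inspecting Stata's `_merge` variable (or using `merge ..., assert()`). As a language-agnostic illustration, a minimal Python sketch that classifies observations the same way, using hypothetical key lists:

```python
# Sketch: a merge diagnostic in the spirit of Stata's _merge codes,
# classifying keys as matched, master-only, or using-only.

def merge_report(master_keys, using_keys):
    """Return which keys matched and which appear on only one side."""
    master, using = set(master_keys), set(using_keys)
    return {
        "matched": sorted(master & using),      # _merge == 3
        "master_only": sorted(master - using),  # _merge == 1
        "using_only": sorted(using - master),   # _merge == 2
    }
```

Any non-empty `master_only` or `using_only` list should be explained in the code's comments, per the checklist item above.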