
Code review Q2 2020 : Data publication #3

luizaandrade opened this issue May 18, 2020 · 0 comments
Data cleaning code review checklist

Data source/survey round

Date

List of files to be checked [Add names or links]

  • Master script

  • Clean dataset(s)

  • Cleaning scripts

Identifiers

  • Deidentified data does not contain identifying variables
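A name-based screen can serve as a first pass on this item before a human review. The following is a minimal Python sketch (the checklist's do-file items suggest Stata, but the idea is language-agnostic); the PII name list and file layout are illustrative assumptions, not an exhaustive standard.

```python
# Sketch: flag column names in a CSV that look like direct identifiers.
# The pattern list below is illustrative only -- a manual review of the
# data is still required before publication.
import csv

PII_PATTERNS = ("name", "phone", "email", "address", "gps", "ssn")

def flag_identifying_columns(path):
    """Return header columns whose names match a known PII pattern."""
    with open(path, newline="") as f:
        header = next(csv.reader(f))
    return [c for c in header if any(p in c.lower() for p in PII_PATTERNS)]
```

Note that this only inspects column names; identifying content can also hide in free-text fields, so it complements rather than replaces a manual check.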

Reproducibility

  • All scripts run from the master after adding the correct folder path to line(s) X (and XX)
  • The master script is organized in a way that allows you to understand the general tasks being performed in the code
  • The master script tracks which scripts create and use which files
  • The data sets created by the reviewer are exactly the same as those shared by the coder
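One way to verify the last item is to compare file checksums of the coder's and the reviewer's outputs. A minimal Python sketch, with hypothetical file names:

```python
# Sketch: check that the reviewer's re-created dataset is byte-identical
# to the dataset shared by the coder, via SHA-256 checksums.
import hashlib

def file_checksum(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def datasets_match(coder_file, reviewer_file):
    """True if the two files have identical contents."""
    return file_checksum(coder_file) == file_checksum(reviewer_file)
```

Byte-level comparison can be too strict for formats that embed metadata such as a save timestamp (e.g. Stata .dta files); in that case a content-level comparison of the loaded data may be needed instead.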

Code organization and readability

  • Code names are informative
  • It is clear in the code why tasks are being executed
  • The code structure facilitates understanding of the tasks
  • Code uses white space to improve readability
  • There is extensive use of comments to explain the code
  • The code is efficient (tasks are executed in the simplest way possible, loops are used when needed rather than repeating lines, pre-defined functions are used)
  • Common tasks are abstracted and automated (e.g. using functions or macros)

Clean data set checks (pre-publication)

  • The data does not include direct identifiers
  • The data set has a clearly labeled, uniquely and fully identifying ID variable
  • The level of observation of the data set is clear from the dataset name, ID variables and documentation
  • Variables have informative labels or an accompanying dictionary
  • Categorical variables have clear and informative value labels
  • No modification is made from the raw to the clean data other than correcting problems
  • No raw variables are processed (winsorized, for example)
  • Variables can be easily traced back to the original questionnaire
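The ID-variable check above (uniquely and fully identifying) can be automated. A minimal Python sketch for CSV data; the column and file names are hypothetical, and in Stata the equivalent check would be `isid`:

```python
# Sketch: verify that an ID column uniquely and fully identifies the
# observations in a CSV dataset.
import csv
from collections import Counter

def check_id_variable(path, id_var):
    """Return (duplicate_ids, n_missing) for the given ID column.

    Both should be empty/zero for a valid identifying variable.
    """
    with open(path, newline="") as f:
        ids = [row[id_var] for row in csv.DictReader(f)]
    n_missing = sum(1 for v in ids if v == "")
    counts = Counter(v for v in ids if v != "")
    duplicates = sorted(v for v, n in counts.items() if n > 1)
    return duplicates, n_missing
```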

Data cleaning tasks

  • Are new variables being created in the cleaning do-files?
  • Are any changes being made to observation values in the cleaning do-files?
  • Check merges: Are any observations dropped? If so, is there a clear justification for that? If any observations didn't match, is that explained in the comments?
  • Are missing values coded consistently? Are extended missing values used?
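The merge check above corresponds to inspecting Stata's `_merge` variable (or using `merge ..., assert()`). As a language-agnostic illustration, a minimal Python sketch that classifies observations the same way, using hypothetical key lists:

```python
# Sketch: a merge diagnostic in the spirit of Stata's _merge codes,
# classifying keys as matched, master-only, or using-only.

def merge_report(master_keys, using_keys):
    """Return which keys matched and which appear on only one side."""
    master, using = set(master_keys), set(using_keys)
    return {
        "matched": sorted(master & using),      # _merge == 3
        "master_only": sorted(master - using),  # _merge == 1
        "using_only": sorted(using - master),   # _merge == 2
    }
```

Any non-empty `master_only` or `using_only` list should be explained in the code's comments, per the checklist item above.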