diff --git a/chapters/abbreviations.tex b/auxiliary/abbreviations.tex
similarity index 100%
rename from chapters/abbreviations.tex
rename to auxiliary/abbreviations.tex
diff --git a/chapters/conclusion.tex b/auxiliary/conclusion.tex
similarity index 93%
rename from chapters/conclusion.tex
rename to auxiliary/conclusion.tex
index 9fafe6c8d..d94de69dc 100644
--- a/chapters/conclusion.tex
+++ b/auxiliary/conclusion.tex
@@ -1,4 +1,4 @@
-We hope you have enjoyed \textit{Data for Development Impact: The DIME Analytics Resource Guide}.
+We hope you have enjoyed \textit{Development Research in Practice: The DIME Analytics Data Handbook}.
 Our aim was to teach you to handle data more efficiently, effectively, and ethically.
 We laid out a complete vision of the tasks of a modern researcher,
 from planning a project's data governance to publishing code and data
@@ -41,4 +41,4 @@ and come back to it anytime you need more information.
 We wish you all the best in your work
 and would love to hear any input you have on ours!\sidenote{
-You can share your comments and suggestion on this book through \url{https://worldbank.github.io/d4di}.}
+You can share your comments and suggestions on this book through \url{https://worldbank.github.io/dime-data-handbook}.}
diff --git a/chapters/notes.tex b/auxiliary/notes.tex
similarity index 82%
rename from chapters/notes.tex
rename to auxiliary/notes.tex
index bf8106dcf..2208ca157 100644
--- a/chapters/notes.tex
+++ b/auxiliary/notes.tex
@@ -1,6 +1,6 @@
 This is a draft peer review edition of
-\textit{Data for Development Impact:
-The DIME Analytics Resource Guide}.
+\textit{Development Research in Practice:
+The DIME Analytics Data Handbook}.
 This version of the book has been substantially revised
 since the first release in June 2019
 with feedback from readers and other experts.
@@ -12,9 +12,9 @@ This book is intended to remain a living product
 that is written and maintained in the open.
 The raw code and edit history are online at:
-\url{https://github.com/worldbank/d4di}.
+\url{https://github.com/worldbank/dime-data-handbook}.
 You can get a PDF copy at:
-\url{https://worldbank.github.com/d4di}.
+\url{https://worldbank.github.io/dime-data-handbook}.
 The website also includes the most updated instructions
 for providing feedback, as well as a log of errata and updates
 that have been made to the content.
@@ -27,7 +27,7 @@ \subsection{Feedback}
 We encourage feedback and corrections
 so that we can improve the contents of the book in future editions.
 Please visit
-\url{https://worldbank.github.com/d4di/feedback} to
+\url{https://worldbank.github.io/dime-data-handbook/feedback} to
 see different options on how to provide feedback.
 You can also email us at \url{dimeanalytics@worldbank.org}
 with input or comments, and we will be very thankful.
diff --git a/chapters/preamble.tex b/auxiliary/preamble.tex
similarity index 92%
rename from chapters/preamble.tex
rename to auxiliary/preamble.tex
index a9c79fdf0..d7c18fc7f 100644
--- a/chapters/preamble.tex
+++ b/auxiliary/preamble.tex
@@ -108,8 +108,8 @@
 % BOOK META-INFORMATION
 %----------------------------------------------------------------------------------------
-\title{Data for \\ \noindent Development Impact: \\ \bigskip
-\noindent The DIME Analytics \\ \noindent Resource Guide} % Title of the book
+\title{Development \\ \noindent Research \\ \noindent in Practice: \\ \bigskip
+\noindent The DIME Analytics \\ \noindent Data Handbook} % Title of the book
 \author{Kristoffer Bj{\"a}rkefur \\ \noindent Lu{\'i}za Cardoso de Andrade \\
 \noindent Benjamin Daniels \\ \noindent Maria Jones \\} % Author
@@ -126,13 +126,13 @@
 %Set this user input
 \newcommand{\gitfolder}{.git} %relative path to .git folder from .tex doc
-\newcommand{\reponame}{worldbank/d4di} % Name of account and repo be set in URL
+\newcommand{\reponame}{worldbank/dime-data-handbook} % Name of account and repo to be set in the URL
 %Based on this https://tex.stackexchange.com/questions/455396/how-to-include-the-current-git-commit-id-and-branch-in-my-document
-\CatchFileDef{\headfull}{\gitfolder/HEAD.}{} %Get path to head file for checked out branch
+\CatchFileDef{\headfull}{\gitfolder/HEAD}{} %Get path to head file for checked out branch
 \StrGobbleRight{\headfull}{1}[\head] %Remove end of line character
 \StrBehind[2]{\head}{/}[\branch] %Parse out the path only
-\CatchFileDef{\commit}{\gitfolder/refs/heads/\branch.}{} %Get the content of the branch head
+\CatchFileDef{\commit}{\gitfolder/refs/heads/\branch}{} %Get the content of the branch head
 \StrGobbleRight{\commit}{1}[\commithash] %Remove end of line character
 %Build the URL to this commit based on the information we now have
@@ -176,15 +176,18 @@
 \bigskip\par\smallcaps{Published by \thanklesspublisher}
-\par\smallcaps{\url{https://worldbank.github.com/d4di}}
+\par\smallcaps{\url{https://worldbank.github.io/dime-data-handbook}}
-\par Released under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
+\par Compiled from commit: \newline
+\vspace{-0.5cm}
+\commiturl
+\par Released under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.\newline
+\vspace{-0.5cm}
 \url{https://creativecommons.org/licenses/by/4.0}
 \par\textit{First printing, \monthyear}
-\par Compiled from: \commiturl
 \end{fullwidth}
 %----------------------------------------------------------------------------------------
@@ -194,7 +197,7 @@
 \cleardoublepage
 \chapter*{Notes on this edition} % The asterisk leaves out this chapter from the table of contents
-\input{chapters/notes.tex}
+\input{auxiliary/notes.tex}
 %----------------------------------------------------------------------------------------
 % Abbreviations
@@ -203,7 +206,7 @@ \chapter*{Notes on this edition} % The asterisk leaves out this chapter from the
 \cleardoublepage
 \chapter*{Abbreviations} % The asterisk leaves out this chapter from the table of contents
-\input{chapters/abbreviations.tex}
+\input{auxiliary/abbreviations.tex}
 %----------------------------------------------------------------------------------------
diff --git a/auxiliary/research-design.tex b/auxiliary/research-design.tex
new file mode 100644
index 000000000..f7d25cf06
--- /dev/null
+++ b/auxiliary/research-design.tex
@@ -0,0 +1,620 @@
+%-----------------------------------------------------------------------------------------------
+
+\begin{fullwidth}
+Research design is the process of defining the methods and data
+that will be used to answer a specific research question.
+You don't need to be an expert in research design to do effective data work,
+but it is essential that you understand the design of the study you are working on,
+and how it affects the data work.
+Without going into too much technical detail,
+as there are many excellent resources on impact evaluation design,
+this chapter presents a brief overview
+of the most common causal inference methods,
+focusing on implications for data structure and analysis.
+The intent of this chapter is to give you an understanding of
+the way in which each method constructs treatment and control groups,
+the data structures needed to estimate the corresponding effects,
+and specific code tools designed for each method (the list, of course, is not exhaustive).
+
+Thinking through research design before starting data work is important for several reasons.
+If you do not know how to calculate the correct estimator for your study,
+you will not be able to assess the statistical power of your research design.
+You will also be unable to make decisions in the field
+when you inevitably have to allocate scarce resources
+between tasks like maximizing sample size
+and ensuring follow-up with specific individuals.
+You will save a lot of time by understanding the way
+your data needs to be organized
+in order to produce meaningful analytics throughout your projects.
+Just as importantly, familiarity with each of these approaches
+will allow you to keep your eyes open for research opportunities:
+many of the most interesting projects occur because people in the field
+recognize the opportunity to implement one of these methods
+in response to an unexpected event.
+Intuitive knowledge of your project's chosen approach will make you
+much more effective at the analytical part of your work.
+
+This chapter first covers causal inference methods.
+Next, it discusses how to measure treatment effects and structure data for specific methods,
+including cross-sectional randomized control trials, difference-in-difference designs,
+regression discontinuity, instrumental variables, matching, and synthetic controls.
+
+\end{fullwidth}
+
+%-----------------------------------------------------------------------------------------------
+%-----------------------------------------------------------------------------------------------
+
+\section{Causality, inference, and identification}
+
+When we discuss the types of inputs commonly referred to as
+``programs'' or ``interventions'' -- ``treatments'' -- we are typically attempting to obtain estimates
+of program-specific \textbf{treatment effects}.
+These are the changes in outcomes attributable to the treatment.\cite{abadie2018econometric}
+ \index{treatment effect}
+The primary goal of research design is to establish \textbf{causal identification} for an effect.
+Causal identification means establishing that a change in an input directly altered an outcome.
+ \index{identification}
+When a study is well-identified, we can say with confidence
+that our estimator would,
+with an infinite amount of data,
+recover the true treatment effect.
+Under this condition, we can proceed to draw evidence from the limited samples we have access to,
+using statistical techniques to express the uncertainty of not having infinite data.
+Without identification, we cannot say that the estimate would be accurate,
+even with unlimited data, and therefore cannot attribute it to the treatment
+in the small samples that we typically have access to.
+More data is not a substitute for a well-identified experimental design.
+Therefore it is important to understand how exactly your study
+identifies its estimate of treatment effects,
+so you can calculate and interpret those estimates appropriately.
+
+All the study designs we discuss here use the potential outcomes framework\cite{athey2017state}
+to compare a group that received some treatment to another, counterfactual group.
+Each of these approaches can be used in two types of designs:
+\textbf{experimental} designs, in which the research team
+is directly responsible for creating the variation in treatment,
+and \textbf{quasi-experimental} designs, in which the team
+identifies a ``natural'' source of variation and uses it for identification.
+Neither type is inherently better or worse,
+and both types are capable of achieving causal identification in different contexts.
+
+%-----------------------------------------------------------------------------------------------
+\subsection{Estimating treatment effects using control groups}
+
+The key assumption behind estimating treatment effects is that every
+person, facility, or village (or whatever the unit of intervention is)
+has two possible states: their outcomes if they do not receive some treatment
+and their outcomes if they do receive that treatment.
+Each unit's treatment effect is the individual difference between these two states,
+and the \textbf{average treatment effect (ATE)} is the average of all
+individual differences across the potentially treated population.
+ \index{average treatment effect}
+This is the parameter that most research designs attempt to estimate,
+by establishing a \textbf{counterfactual}\sidenote{
+  \textbf{Counterfactual:} A statistical description of what would have happened to specific individuals in an alternative scenario, for example, a different treatment assignment outcome.}
+for the treatment group against which outcomes can be directly compared.
+ \index{counterfactual}
+There are several resources that provide more or less mathematically intensive
+approaches to understanding how various methods do this.
+\textit{Impact Evaluation in Practice} is a strong general guide to these methods.\sidenote{
+  \url{https://www.worldbank.org/en/programs/sief-trust-fund/publication/impact-evaluation-in-practice}}
+\textit{Causal Inference} and \textit{Causal Inference: The Mixtape}
+provide more detailed mathematical approaches to the tools.\sidenote{
+  \url{https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book}
+  \\ \noindent \url{http://scunning.com/cunningham_mixtape.pdf}}
+\textit{Mostly Harmless Econometrics} and \textit{Mastering Metrics}
+are excellent resources on the statistical principles behind all econometric approaches.\sidenote{
+  \url{https://www.researchgate.net/publication/51992844\_Mostly\_Harmless\_Econometrics\_An\_Empiricist's\_Companion}
+  \\ \noindent \url{https://assets.press.princeton.edu/chapters/s10363.pdf}}
+
+Intuitively, the problem is as follows: we can never observe the same unit
+in both its treated and untreated states simultaneously,
+so measuring and averaging these effects directly is impossible.\sidenote{
+  \url{https://www.stat.columbia.edu/~cook/qr33.pdf}}
+Instead, we typically make inferences from samples.
+\textbf{Causal inference} methods are those in which we are able to estimate the
+average treatment effect without observing individual-level effects,
+but through some comparison of averages with a \textbf{control} group.
+ \index{causal inference}\index{control group}
+Every research design is based on a way of comparing another set of observations --
+the ``control'' observations -- against the treatment group.
+They all work to establish that the control observations would have been
+identical \textit{on average} to the treated group in the absence of the treatment.
+Then, the mathematical properties of averages imply that the calculated
+difference in averages is equivalent to the average difference:
+exactly the parameter we are seeking to estimate.
+Therefore, almost all designs can be accurately described
+as a series of between-group comparisons.\sidenote{
+  \url{https://nickchk.com/econ305.html}}
+
+Most of the methods that you will encounter rely on some variant of this strategy,
+which is designed to maximize their ability to estimate the effect
+of an average unit being offered the treatment you want to evaluate.
+The focus on identification of the treatment effect, however,
+means there are several essential features of causal identification methods
+that are not common in other types of statistical and data science work.
+First, the econometric models and estimating equations used
+do not attempt to create a predictive or comprehensive model
+of how the outcome of interest is generated.
+Typically, causal inference designs are not interested in predictive accuracy,
+and the estimates and predictions that they produce
+will not be as good at predicting outcomes or fitting the data as other models.
+Second, when control variables or other variables are used in estimation,
+there is no guarantee that the resulting parameters are marginal effects.
+They can only be interpreted as correlative averages,
+unless there are additional sources of identification.
+The models you will construct and estimate are intended to do exactly one thing:
+to express the intention of your project's research design,
+and to accurately estimate the effect of the treatment it is evaluating.
+In other words, these models tell the story of the research design
+in a way that clarifies the exact comparison being made between control and treatment.
+
+%-----------------------------------------------------------------------------------------------
+\subsection{Experimental and quasi-experimental research designs}
+
+Experimental research designs explicitly allow the research team
+to change the condition of the populations being studied,\sidenote{
+  \url{https://dimewiki.worldbank.org/Experimental_Methods}}
+often in the form of government programs, NGO projects, new regulations,
+information campaigns, and many more types of interventions.\cite{banerjee2009experimental}
+The classic experimental causal inference method
+is the \textbf{randomized control trial (RCT)}.\sidenote{
+  \url{https://dimewiki.worldbank.org/Randomized_Control_Trials}}
+ \index{randomized control trials}
+In randomized control trials, the treatment group is randomized --
+that is, from an eligible population,
+a random group of units is given the treatment.
+Another way to think about these designs is how they establish the control group:
+a random subset of units is \textit{not} given access to the treatment,
+so that it may serve as a counterfactual for those who are.
+A randomized control group, intuitively, is meant to represent
+how things would have turned out for the treated group
+if they had not been treated, and it is particularly effective at doing so,
+as evidenced by its broad credibility in fields ranging from clinical medicine to development.
+Therefore RCTs are very popular tools for determining the causal impact
+of specific programs or policy interventions.\sidenote{
+  \url{https://www.nobelprize.org/prizes/economic-sciences/2019/ceremony-speech}}
+However, there are many other types of interventions that are impractical or unethical
+to effectively approach using an experimental strategy,
+and therefore there are limitations to accessing ``big questions''
+through RCT approaches.\sidenote{
+  \url{https://www.nber.org/papers/w14690.pdf}}
+
+Randomized designs all share several major statistical concerns.
+The first is that it is always possible, by chance,
+to select a control group that is not in fact very similar to the treatment group.
+This feature is called randomization noise, and all RCTs share the need to assess
+how randomization noise may impact the estimates that are obtained.
+(More detail on this later.)
+Second, take-up and implementation fidelity are extremely important,
+since programs will by definition have no effect
+if the population intended to be treated
+does not accept or does not receive the treatment.
+Loss of statistical power occurs quickly and is highly nonlinear:
+70\% take-up or efficacy doubles the required sample, and 50\% quadruples it.\sidenote{
+  \url{https://blogs.worldbank.org/impactevaluations/power-calculations-101-dealing-with-incomplete-take-up}}
+Such effects are also very hard to correct ex post,
+since they require strong assumptions about the randomness or non-randomness of take-up.
+Therefore a large amount of field time and descriptive work
+must be dedicated to understanding how these effects played out in a given study,
+and this effort may overshadow the effort put into the econometric design itself.
+
+\textbf{Quasi-experimental} research designs,\sidenote{
+  \url{https://dimewiki.worldbank.org/Quasi-Experimental_Methods}}
+by contrast, are causal inference methods based on events not controlled by the research team.
+Instead, they rely on ``experiments of nature'',
+in which natural variation can be argued to approximate
+the type of exogenous variation in treatment availability
+that a researcher would attempt to create with an experiment.\cite{dinardo2016natural}
+Unlike carefully planned experimental designs,
+quasi-experimental designs typically require the extra luck
+of having access to data collected at the right times and places
+to exploit events that occurred in the past,
+or having the ability to collect data in a time and place
+where an event that produces causal identification occurred or will occur.
+Therefore, these methods often use either secondary data,
+or they use primary data in a cross-sectional retrospective method,
+including administrative data or other new classes of routinely collected information.
+
+Quasi-experimental designs can therefore address a much broader range of questions,
+with much less effort in terms of executing an intervention.
+However, they require in-depth understanding of the precise events
+the researcher wishes to address in order to know what data to use
+and how to model the underlying natural experiment.
+Additionally, because the population exposed
+to such events is limited by the scale of the event,
+quasi-experimental designs are often power-constrained.
+Since the research team cannot change the population of the study
+or the treatment assignment, power is typically maximized by ensuring
+that sampling for data collection is carefully designed to match the study objectives
+and that attrition from the sampled groups is minimized.
+By construction, each unit's receipt of the treatment
+is unrelated to any of its other characteristics,
+and the ordinary least squares (OLS) regression
+of outcome on treatment, without any control variables,
+yields an unbiased estimate of the average treatment effect.
+
+Cross-sectional designs can also exploit variation in non-experimental data
+to argue that observed correlations do in fact represent causal effects.
+This can be true unconditionally -- which is to say that something random,
+such as winning the lottery, is a true random process and can tell you about the effect
+of getting a large amount of money.\cite{imbens2001estimating}
+It can also be true conditionally -- which is to say that once the
+characteristics that would affect both the likelihood of exposure to a treatment
+and the outcome of interest are controlled for,
+the process is as good as random:
+like arguing that once risk preferences are taken into account,
+exposure to an earthquake is unpredictable and post-event differences
+are causally related to the event itself.\cite{callen2015catastrophes}
+
+For cross-sectional designs, what needs to be carefully maintained in data
+is the treatment randomization process itself (whether experimental or not),
+as well as detailed information about differences
+in data quality and attrition across groups.\cite{athey2017econometrics}
+Only these details are needed to construct the appropriate estimator:
+clustering of the standard errors is required at the level
+at which the treatment is assigned to observations,
+and variables which were used to stratify the treatment
+must be included as controls (in the form of strata fixed effects).\sidenote{
+  \url{https://blogs.worldbank.org/impactevaluations/how-randomize-using-many-baseline-variables-guest-post-thomas-barrios}}
+\textbf{Randomization inference} can be used
+to estimate the underlying variability in the randomization process
+(more on this in the next chapter).
+\textbf{Balance checks}\sidenote{
+  \textbf{Balance checks:} Statistical tests of the similarity of treatment and control groups.}
+are often reported as evidence of an effective randomization,
+and are particularly important when the design is quasi-experimental
+(since then the randomization process cannot be simulated explicitly).
+However, controls for balance variables are usually unnecessary in RCTs,
+because it is certain that the true data-generating process
+has no correlation between the treatment and the balance factors.\sidenote{
+  \url{https://blogs.worldbank.org/impactevaluations/should-we-require-balance-t-tests-baseline-observables-randomized-experiments}}
+
+Analysis is typically straightforward \textit{once you have a strong understanding of the randomization}.
+A typical analysis will include a description of the sampling and randomization results,
+with analyses such as summary statistics for the eligible population,
+and balance checks for randomization and sample selection.
+The main results will usually be a primary regression specification
+(with multiple hypotheses appropriately adjusted for),
+and additional specifications with adjustments for non-response, balance, and other potential contamination.
+Robustness checks might include randomization-inference analysis or other placebo regression approaches.
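+As a minimal sketch, and assuming hypothetical variable names
+(\texttt{outcome}, \texttt{treatment}, \texttt{strata}, and \texttt{cluster\_id}),
+a primary specification for a stratified, clustered randomization
+might look like the following:
+\begin{verbatim}
+* Outcome on treatment with strata fixed effects;
+* standard errors clustered at the assignment level
+regress outcome i.treatment i.strata, vce(cluster cluster_id)
+\end{verbatim}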
+There are a number of user-written code tools that are also available
+to help with the complete process of data analysis,\sidenote{
+  \url{https://toolkit.povertyactionlab.org/resource/coding-resources-randomized-evaluations}}
+including to analyze balance\sidenote{
+  \url{https://dimewiki.worldbank.org/iebaltab}}
+and to visualize treatment effects.\sidenote{
+  \url{https://dimewiki.worldbank.org/iegraph}}
+Extensive tools and methods for analyzing selective non-response are available.\sidenote{
+  \url{https://blogs.worldbank.org/impactevaluations/dealing-attrition-field-experiments}}
+
+%-----------------------------------------------------------------------------------------------
+\subsection{Difference-in-differences}
+
+Where cross-sectional designs draw their estimates of treatment effects
+from differences in outcome levels in a single measurement,
+\textbf{differences-in-differences}\sidenote{
+  \url{https://dimewiki.worldbank.org/Difference-in-Differences}}
+designs (abbreviated as DD, DiD, diff-in-diff, and other variants)
+estimate treatment effects from \textit{changes} in outcomes
+between two or more rounds of measurement.
+ \index{difference-in-differences}
+In these designs, three control groups are used --
+the baseline level of treatment units,
+the baseline level of non-treatment units,
+and the endline level of non-treatment units.\sidenote{
+  \url{https://www.princeton.edu/~otorres/DID101.pdf}}
+The estimated treatment effect is the excess growth
+of units that receive the treatment, in the period they receive it:
+calculating that value is equivalent to taking
+the difference in means at endline and subtracting
+the difference in means at baseline
+(hence the singular ``difference-in-differences'').\cite{mckenzie2012beyond}
+The regression model includes a control variable for treatment assignment,
+and a control variable for time period,
+but the treatment effect estimate corresponds to
+an interaction variable for treatment and time:
+it indicates the group of observations for which the treatment is active.
+This model depends on the assumption that,
+in the absence of the treatment,
+the outcome of the two groups would have changed at the same rate over time,
+typically referred to as the \textbf{parallel trends} assumption.\sidenote{
+  \url{https://blogs.worldbank.org/impactevaluations/often-unspoken-assumptions-behind-difference-difference-estimator-practice}}
+Experimental approaches satisfy this requirement in expectation,
+but a given randomization should still be checked for pre-trends
+as an extension of balance checking.\sidenote{
+  \url{https://blogs.worldbank.org/impactevaluations/revisiting-difference-differences-parallel-trends-assumption-part-i-pre-trend}}
+
+There are two main types of data structures for differences-in-differences:
+\textbf{repeated cross-sections} and \textbf{panel data}.
+In repeated cross-sections, each successive round of data collection contains a random sample
+of observations from the treated and untreated groups;
+as in cross-sectional designs, both the randomization and sampling processes
+are critically important to maintain alongside the data.
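+In the simplest two-period case, a minimal sketch of the estimating equation
+(again with hypothetical variable names) interacts treatment assignment
+with an indicator for the post-treatment period;
+the coefficient on the interaction is the difference-in-differences estimate:
+\begin{verbatim}
+* Two-period diff-in-diff: the interaction term
+* carries the treatment effect estimate
+regress outcome i.treatment##i.post, vce(cluster cluster_id)
+\end{verbatim}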
+In panel data structures, we attempt to observe the exact same units
+at different points in time, so that we see the same individuals
+both before and after they have received treatment (or not).\sidenote{
+  \url{https://blogs.worldbank.org/impactevaluations/what-are-we-estimating-when-we-estimate-difference-differences}}
+This allows each unit's baseline outcome (the outcome before the intervention) to be used
+as an additional control for its endline outcome,
+which can provide large increases in power and robustness.\sidenote{
+  \url{https://blogs.worldbank.org/impactevaluations/another-reason-prefer-ancova-dealing-changes-measurement-between-baseline-and-follow}}
+When tracking individuals over time for this purpose,
+maintaining sampling and tracking records is especially important,
+because attrition will remove that unit's information
+from all points in time, not just the period in which the unit is unobserved.
+Panel-style experiments therefore require a lot more effort in field work
+for studies that use original data.\sidenote{
+  \url{https://www.princeton.edu/~otorres/Panel101.pdf}}
+Since baseline and endline may be far apart in time,
+it is important to create careful records during the first round
+so that follow-ups can be conducted with the same subjects,
+and attrition across rounds can be properly taken into account.\sidenote{
+  \url{https://blogs.worldbank.org/impactevaluations/dealing-attrition-field-experiments}}
+
+As with cross-sectional designs, difference-in-differences designs are widespread.
+Therefore there exist a large number of standardized tools for analysis.
+Our \texttt{ietoolkit} Stata package includes the \texttt{ieddtab} command,
+which produces standardized tables for reporting results.\sidenote{
+  \url{https://dimewiki.worldbank.org/ieddtab}}
+For more complicated versions of the model
+(and they can get quite complicated quite quickly),
+you can use an online dashboard to simulate counterfactual results.\sidenote{
+  \url{https://blogs.worldbank.org/impactevaluations/econometrics-sandbox-event-study-designs-co}}
+As in cross-sectional designs, these main specifications
+will always be accompanied by balance checks (using baseline values),
+as well as randomization, selection, and attrition analysis.
+In trials of this type, reporting experimental design and execution
+using the CONSORT style is common in many disciplines
+and will help you to track your data over time.\cite{schulz2010consort}
+
+%-----------------------------------------------------------------------------------------------
+\subsection{Regression discontinuity}
+
+\textbf{Regression discontinuity (RD)} designs exploit sharp breaks or limits
+in policy designs to separate a single group of potentially eligible recipients
+into comparable groups of individuals who do and do not receive a treatment.\sidenote{
+  \url{https://dimewiki.worldbank.org/Regression_Discontinuity}}
+These designs differ from cross-sectional and diff-in-diff designs
+in that the group eligible to receive treatment is not defined directly,
+but instead created during the treatment implementation.
+ \index{regression discontinuity}
+In an RD design, there is typically some program or event
+that has limited availability due to practical considerations or policy choices
+and is therefore made available only to individuals who meet a certain threshold requirement.
+The intuition of this design is that there is an underlying \textbf{running variable}
+that serves as the sole determinant of access to the program,
+and a strict cutoff determines the value of this variable at which eligibility stops.\cite{imbens2008regression}\index{running variable}
+Common examples are test score thresholds and income thresholds.\sidenote{
+  \url{https://blogs.worldbank.org/impactevaluations/regression-discontinuity-porn}}
+The idea is that individuals who are just above the threshold
+will be very nearly indistinguishable from those who are just under it,
+and their post-treatment outcomes are therefore directly comparable.\cite{lee2010regression}
+The key assumption here is that the running variable cannot be directly manipulated
+by the potential recipients.
+If the running variable is time (what is commonly called an ``event study''),
+there are special considerations.\cite{hausman2018regression}
+Similarly, spatial discontinuity designs are handled a bit differently due to their multidimensionality.\sidenote{
+  \url{https://blogs.worldbank.org/impactevaluations/spatial-jumps}}
+
+Regression discontinuity designs are, once implemented,
+very similar in analysis to cross-sectional or difference-in-differences designs.
+Depending on the data that is available,
+the analytical approach will center on comparing individuals
+who are narrowly on the inclusion side of the discontinuity
+against those who are narrowly on the exclusion side.\sidenote{
+  \url{https://cattaneo.princeton.edu/books/Cattaneo-Idrobo-Titiunik_2019\_CUP-Vol1.pdf}}
+The regression model will be identical to that of the corresponding design above,
+contingent on whether the data has one or more time periods
+and whether the same units are known to be observed repeatedly.
+The treatment effect will be identified, however, by the addition of a control
+for the running variable -- meaning that the treatment effect estimate
+will only be applicable for observations in a small window around the cutoff:
+in the lingo, the treatment effects estimated will be ``local'' rather than ``average''.
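+As a minimal sketch, a local estimate of this type can be obtained
+with the community-contributed \texttt{rdrobust} command
+(the variable names and the cutoff location of zero here are hypothetical):
+\begin{verbatim}
+* Local polynomial RD estimate with data-driven
+* bandwidth selection (ssc install rdrobust)
+rdrobust outcome running_var, c(0)
+\end{verbatim}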
+In the RD model, the functional form of the running variable control and the size of that window,
+often referred to as the choice of \textbf{bandwidth} for the design,
+are the critical parameters for the result.\cite{calonico2019regression}
+Therefore, RD analysis often includes extensive robustness checking
+using a variety of both functional forms and bandwidths,
+as well as placebo testing for non-realized locations of the cutoff.\sidenote{
+  \url{https://www.mdrc.org/sites/default/files/RDD\%20Guide\_Full\%20rev\%202016\_0.pdf}}
+
+In the analytical stage, regression discontinuity designs
+often include a large component of visual evidence presentation.\sidenote{
+  \url{https://faculty.smu.edu/kyler/courses/7312/presentations/baumer/Baumer\_RD.pdf}}
+These presentations help to suggest both the functional form
+of the underlying relationship and the type of change observed at the discontinuity,
+and help to avoid pitfalls in modeling that are difficult to detect with hypothesis tests.\sidenote{
+  \url{https://econ.lse.ac.uk/staff/spischke/ec533/RD.pdf}}
+Because these designs are so flexible compared to others,
+there is an extensive set of commands that help assess
+the efficacy and results from these designs under various assumptions.\sidenote{
+  \url{https://sites.google.com/site/rdpackages}}
+These packages support the testing and reporting
+of robust plotting and estimation procedures,
+tests for manipulation of the running variable,
+and tests for power, sample size, and randomization inference approaches
+that will complement the main regression approach used for point estimates.
+
+%-----------------------------------------------------------------------------------------------
+\subsection{Instrumental variables}
+
+\textbf{Instrumental variables (IV)} designs, unlike the previous approaches,
+begin by assuming that the treatment delivered in the study in question is
+linked to the outcome in a pattern such that its effect is not directly identifiable.
+Instead, similar to regression discontinuity designs,
+IV attempts to focus on a subset of the variation in treatment take-up,
+assessing only the limited window of variation that can be argued
+to be unrelated to other factors.\cite{angrist2001instrumental}
+To do so, the IV approach selects an \textbf{instrument}
+for the treatment status -- an otherwise-unrelated predictor of exposure to treatment
+that affects the take-up status of an individual.\sidenote{
+  \url{https://dimewiki.worldbank.org/instrumental_variables}}
+Whereas regression discontinuity designs are ``sharp'' --
+treatment status is completely determined by which side of a cutoff an individual is on --
+IV designs are ``fuzzy'', meaning that they do not completely determine
+the treatment status but instead influence the \textit{probability} of treatment.
+
+As in regression discontinuity designs,
+the fundamental form of the regression
+is similar to either cross-sectional or difference-in-differences designs.
+However, instead of controlling for the instrument directly,
+the IV approach typically uses the \textbf{two-stage least-squares (2SLS)} estimator.\sidenote{
+  \url{https://www.nuff.ox.ac.uk/teaching/economics/bond/IV\%20Estimation\%20Using\%20Stata.pdf}}
+This estimator forms a prediction of the probability that the unit receives treatment
+based on a regression against the instrumental variable.
+That prediction will, by assumption, be the portion of the actual treatment
+that is due to the instrument and not any other source,
+and since the instrument is unrelated to all other factors,
+this portion of the treatment can be used to assess its effects.
+Unfortunately, these estimators are known
+to have very high variances relative to other methods,
+particularly when the relationship between the instrument and the treatment is weak.\cite{young2017consistency}
+IV designs furthermore rely on strong but untestable assumptions
+about the relationship between the instrument and the outcome.\cite{bound1995problems}
+Therefore IV designs face intense scrutiny on the strength and exogeneity of the instrument,
+and tests for sensitivity to alternative specifications and samples
+are usually required with an instrumental variables analysis.
+However, the method has special experimental cases that are significantly easier to assess:
+for example, a randomized treatment \textit{assignment} can be used as an instrument
+for the eventual take-up of the treatment itself,
+especially in cases where take-up is expected to be low,
+or in circumstances where the treatment is available
+to those who are not specifically assigned to it (``encouragement designs'').
+
+In practice, there are a variety of packages that can be used
+to analyze data and report results from instrumental variables designs.
+While the built-in Stata command \texttt{ivregress} will often be used
+to create the final results, the built-in packages are not sufficient on their own.
+The \textbf{first stage} of the design should be extensively tested,
+to demonstrate the strength of the relationship between
+the instrument and the treatment variable being instrumented.\cite{stock2005weak}
+This can be done using the \texttt{weakiv} and \texttt{weakivtest} commands.\sidenote{
+  \url{https://www.carolinpflueger.com/WangPfluegerWeakivtest_20141202.pdf}}
+Additionally, tests should be run that identify and exclude individual
+observations or clusters that have extreme effects on the estimator,
+using customized bootstrap or leave-one-out approaches.\cite{young2017consistency}
+Finally, bounds can be constructed allowing for imperfections
+in the exogeneity of the instrument using loosened assumptions,
+particularly when the underlying instrument is not directly randomized.\sidenote{
+  \url{http://www.damianclarke.net/research/papers/practicalIV-CM.pdf}}
+
+
+%-----------------------------------------------------------------------------------------------
+\subsection{Matching}
+
+\textbf{Matching} methods use observable characteristics of individuals
+to directly construct treatment and control groups to be as similar as possible
+to each other, either before a randomization process
+or after the collection of non-randomized data.\sidenote{
+  \url{https://dimewiki.worldbank.org/Matching}}
+ \index{matching}
+Matching observations may be one-to-one or many-to-many;
+in any case, the result of a matching process
+is similar in concept to the use of randomization strata
+in simple randomized control trials.
+In this way, the method can be conceptualized
+as averaging across the results of a large number of ``micro-experiments''
+in which the randomized units are verifiably similar aside from the treatment.
+
+When matching is performed before a randomization process,
+it can be done on any observable characteristics,
+including outcomes, if they are available.
+The randomization should then record an indicator for each matching set,
+as these become equivalent to randomization strata and require controls in analysis.
+This approach is stratification taken to its most extreme:
+it reduces the number of potential randomizations dramatically
+from the possible number that would be available
+if the matching was not conducted,
+and therefore reduces the variance caused by the study design.
+When matching is done ex post in order to substitute for randomization,
+it is based on the assertion that within the matched groups,
+the assignment of treatment is as good as random.
+However, since most matching models rely on a specific linear model,
+such as \textbf{propensity score matching},\sidenote{
+  \textbf{Propensity Score Matching (PSM):} An estimation method that controls for the likelihood
+  that each unit of observation would receive treatment as predicted by observable characteristics.}
+they are open to the criticism of ``specification searching'',
+meaning that researchers can try different models of matching
+until one, by chance, leads to the final result that was desired;
+analytical approaches have shown that the better the fit of the matching model,
+the more likely it is that it has arisen by chance and is therefore biased.\cite{king2019propensity}
+Newer methods, such as \textbf{coarsened exact matching},\cite{iacus2012causal}
+are designed to remove some of the dependence on linearity.
+In all ex-post cases, pre-specification of the exact matching model
+can prevent some of the potential criticisms on this front,
+but ex-post matching in general is not regarded as a strong identification strategy.
+
+Analysis of data from matching designs is relatively straightforward;
+the simplest design only requires controls (indicator variables) for each group
+or, in the case of propensity scoring and similar approaches,
+weighting the data appropriately in order to balance the analytical samples on the selected variables.
+The \texttt{teffects} suite in Stata provides a wide variety
+of estimators and analytical tools for various designs.\sidenote{
+  \url{https://ssc.wisc.edu/sscc/pubs/stata_psmatch.htm}}
+The coarsened exact matching (\texttt{cem}) package applies the nonparametric approach.\sidenote{
+  \url{https://gking.harvard.edu/files/gking/files/cem-stata.pdf}}
+DIME's \texttt{iematch} command in the \texttt{ietoolkit} package produces matchings based on a single continuous matching variable.\sidenote{
+  \url{https://dimewiki.worldbank.org/iematch}}
+In any of these cases, detailed reporting of the matching model is required,
+including the resulting effective weights of observations,
+since in some cases the lack of overlapping supports for treatment and control
+means that a large number of observations will be weighted near zero
+and the estimated effect will be generated based on a subset of the data.
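+As a minimal sketch, an average treatment effect based on propensity score matching
+could be estimated as follows (the outcome, treatment, and covariates named here are hypothetical):
+\begin{verbatim}
+* ATE via propensity score matching,
+* modeling treatment on covariates x1 and x2
+teffects psmatch (outcome) (treatment x1 x2)
+\end{verbatim}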
+
+%-----------------------------------------------------------------------------------------------
+\subsection{Synthetic controls}
+
+\textbf{Synthetic control} is a relatively new method
+for the case when appropriate counterfactual individuals
+do not exist in reality and there are very few (often only one) treatment units.\cite{abadie2015comparative}
+ \index{synthetic controls}
+For example, state- or national-level policy changes
+that can only be analyzed as a single unit
+are typically very difficult to find valid comparators for,
+since the set of potential comparators is usually small and diverse
+and therefore there are no close matches to the treated unit.
+Intuitively, the synthetic control method works
+by constructing a counterfactual version of the treated unit
+using an average of the other units available.\cite{abadie2010synthetic}
+This is a particularly effective approach
+when the lower-level components of the units would be directly comparable:
+people, households, businesses, and so on in the case of states and countries;
+or passengers or cargo shipments in the case of transport corridors, for example.\cite{gobillon2016regional}
+This is because in those situations the average of the untreated units
+can be thought of as balancing by matching the composition of the treated unit.
+
+To construct this estimator, the synthetic controls method requires
+retrospective data on the treatment unit and possible comparators,
+including historical data on the outcome of interest for all units.
+The counterfactual blend is chosen by optimizing the prediction of past outcomes
+based on the potential input characteristics,
+and typically selects a small set of comparators to weight into the final analysis.
+These datasets therefore may not have a large number of variables or observations,
+but the extent of the time series both before and after the implementation
+of the treatment is a key source of power for the estimate,
+as is the number of counterfactual units available.
+Visualizations are often excellent demonstrations of these results.
+The \texttt{synth} package provides functionality for use in Stata and R,
+although since there are a large number of possible parameters
+and implementations of the design it can be complex to operate.\sidenote{
+  \url{https://web.stanford.edu/~jhain/synthpage.html}}
diff --git a/appendix/stata-guide.tex b/auxiliary/stata-guide.tex
similarity index 90%
rename from appendix/stata-guide.tex
rename to auxiliary/stata-guide.tex
index 526fbadb7..9c98b924f 100644
--- a/appendix/stata-guide.tex
+++ b/auxiliary/stata-guide.tex
@@ -40,6 +40,23 @@ \section{Using the code examples in this book}
+In the book, code examples are presented like the following:
+
+\codeexample{code.do}{./code/code.do}
+
+We ensure that each code block runs independently, is well-formatted,
+and uses built-in functions as much as possible.
+We will point to user-written functions when they provide important tools.
+In particular, we point to two suites of Stata commands developed by DIME Analytics,
+\texttt{ietoolkit}\sidenote{\url{https://dimewiki.worldbank.org/ietoolkit}} and
+\texttt{iefieldkit},\sidenote{\url{https://dimewiki.worldbank.org/iefieldkit}}
+which standardize our core data collection, management, and analysis workflows.
+We will comment the code generously (as you should),
+but you should reference Stata help-files by writing \texttt{help [command]}
+whenever you do not understand the command that is being used.
+We hope that these snippets will provide a foundation for your code style.
+Providing some standardization of Stata code style is also a goal of this team.
+
 You can access the raw code used in examples in this book in several ways.
 We use GitHub to version control everything in this book, the code included.
 To see the code on GitHub, go to: \url{https://github.com/worldbank/dime-data-handbook/tree/master/code}.
@@ -58,7 +75,36 @@ \section{Using the code examples in this book}
 with all the content used in writing this book,
 including the \LaTeX{} code used for the book itself.
 After extracting the .zip-file you will find all the code in a folder called \texttt{/code/}.
-\subsection{Understanding Stata code}
+\subsection{Writing and Understanding Stata code}
+
+``Good'' code has two elements: (1) it is correct, in that it runs without errors and produces the intended results,
+and (2) it is useful and comprehensible to someone who hasn't seen it before
+(or even yourself a few weeks, months, or years later).
+Many researchers have been trained to code correctly.
+However, when your code runs on your computer and you get the correct results,
+you are only half-done writing \textit{good} code.
+Good code is easy to read and replicate, making it easier to spot mistakes.
+Good code reduces sampling, randomization, and cleaning errors.
+Good code can easily be reviewed by others before it's published and replicated afterwards.
+
+You should think of code in terms of three major elements:
+\textbf{structure}, \textbf{syntax}, and \textbf{style}.
+We always tell people to ``code as if a stranger would read it''
+(from tomorrow, that stranger could be you!).
+The \textbf{structure} is the environment and file organization your code lives in:
+good structure means that it is easy to find individual pieces of code
+that correspond to specific tasks and outputs.
+Good structure also means that functional blocks are sufficiently independent from each other
+that they can be shuffled around, repurposed, and even deleted without damaging other portions.
+The \textbf{syntax} is the literal language of your code.
+Good syntax means that your code is readable
+in terms of how its mechanics implement ideas --
+it should not require arcane reverse-engineering
+to figure out what a code chunk is trying to do.
+\textbf{Style}, finally, is the way that the non-functional elements of your code convey its purpose.
+Elements like spacing, indentation, and naming conventions (or lack thereof) can make your code much more
+(or much less) accessible to someone who is reading it for the first time
+and needs to understand it quickly and correctly.
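+As a minimal sketch of what these elements can look like in practice, even a trivial task
+reads very differently when it is commented, spaced, and named deliberately
+(the file path and variable names here are hypothetical):
+\begin{verbatim}
+* Load the constructed analysis dataset
+use "${data}/analysis.dta", clear
+
+* Flag observations above the median of x
+summarize x, detail
+generate above_med = (x > r(p50)) if !missing(x)
+\end{verbatim}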
 Whether you are new to Stata or have used it for decades,
 you will always run into commands that
diff --git a/bibliography.bib b/bibliography.bib
index 1bb68eea3..81a18b455 100644
--- a/bibliography.bib
+++ b/bibliography.bib
@@ -537,6 +537,18 @@ @article{begg1996improving
 publisher={American Medical Association}
 }
+@article{jepdataquality,
+Author = {Norwood, Janet L.},
+Title = {Distinguished Lecture on Economics in Government: Data Quality and Public Policy},
+Journal = {Journal of Economic Perspectives},
+Volume = {4},
+Number = {2},
+Year = {1990},
+Month = {June},
+Pages = {3-12},
+DOI = {10.1257/jep.4.2.3},
+URL = {https://www.aeaweb.org/articles?id=10.1257/jep.4.2.3}}
+
 @article{carril2017dealing,
 title={Dealing with misfits in random treatment assignment},
 author={Carril, Alvaro},
@@ -548,6 +560,69 @@ @article{carril2017dealing
 publisher={SAGE Publications Sage CA: Los Angeles, CA}
 }
+@article{mccullough2008economics,
+  title={Do economics journal archives promote replicable research?},
+  author={McCullough, Bruce D and McGeary, Kerry Anne and Harrison, Teresa D},
+  journal={Canadian Journal of Economics/Revue canadienne d'{\'e}conomique},
+  volume={41},
+  number={4},
+  pages={1406--1420},
+  year={2008},
+  publisher={Wiley Online Library}
+}
+
+@article{camerer2016evaluating,
+  title={Evaluating replicability of laboratory experiments in economics},
+  author={Camerer, Colin F and Dreber, Anna and Forsell, Eskil and Ho, Teck-Hua and Huber, J{\"u}rgen and Johannesson, Magnus and Kirchler, Michael and Almenberg, Johan and Altmejd, Adam and Chan, Taizan and others},
+  journal={Science},
+  volume={351},
+  number={6280},
+  pages={1433--1436},
+  year={2016},
+  publisher={American Association for the Advancement of Science}
+}
+
+@book{ozier2019replication,
+  title={Replication Redux: The Reproducibility Crisis and the Case of Deworming},
+  author={Ozier, Owen},
+  year={2019},
+  publisher={The World Bank}
+}
+
+@article{hamermesh2007replication,
+  title={Replication in economics},
+  author={Hamermesh, Daniel S},
+  journal={Canadian Journal of Economics/Revue canadienne d'{\'e}conomique},
+  volume={40},
+  number={3},
+  pages={715--733},
+  year={2007},
+  publisher={Wiley Online Library}
+}
+
+@article{chang2015economics,
+  title={Is economics research replicable? Sixty published papers from thirteen journals say 'usually not'},
+  author={Chang, Andrew C and Li, Phillip},
+  journal={Available at SSRN 2669564},
+  year={2015}
+}
+
+@misc{vilhuber_lars_2020_3911311,
+  author = {Vilhuber, Lars},
+  title = {{Implementing Increased Transparency and
+            Reproducibility in Economics}},
+  month = apr,
+  year = 2020,
+  note = {{The opinions expressed in this talk are solely the
+           author's, and do not represent the views of the
+           U.S.
+           Census Bureau, the American Economic
+           Association, or any of the funding agencies.}},
+  publisher = {Zenodo},
+  version = {2020-06-27},
+  doi = {10.5281/zenodo.3911311},
+  url = {https://doi.org/10.5281/zenodo.3911311}
+}
+
 @article{athey2017econometrics,
 title={The econometrics of randomized experiments},
 author={Athey, Susan and Imbens, Guido W},
@@ -558,6 +633,15 @@ @article{athey2017econometrics
 publisher={Elsevier}
 }
+@inproceedings{swanson2020research,
+  title={Research Transparency Is on the Rise in Economics},
+  author={Swanson, Nicholas and Christensen, Garret and Littman, Rebecca and Birke, David and Miguel, Edward and Paluck, Elizabeth Levy and Wang, Zenan},
+  booktitle={AEA Papers and Proceedings},
+  volume={110},
+  pages={61--65},
+  year={2020}
+}
+
 @article{christensen2018transparency,
 title={Transparency, reproducibility, and the credibility of economics research},
 author={Christensen, Garret and Miguel, Edward},
diff --git a/chapters/0-introduction.tex b/chapters/0-introduction.tex
new file mode 100644
index 000000000..e0ddb225b
--- /dev/null
+++ b/chapters/0-introduction.tex
@@ -0,0 +1,353 @@
+\begin{fullwidth}
+Welcome to \textit{Development Research in Practice: The DIME Analytics Data Handbook}.
+This book is intended to teach all users of development data
+how to handle data effectively, efficiently, and ethically.
+It takes lessons, tools, and processes that emerged from the DIME portfolio
+and compiles them into a single narrative about doing data work
+that we hope will provide a foundation for these professional skills.
+
+DIME generates high-quality and operationally relevant data and research
+to transform development policy, help reduce extreme poverty, and secure shared prosperity.
+It develops customized data and evidence ecosystems to produce actionable information
+and recommend specific policy pathways to maximize impact.
+DIME conducts research in 60 countries with 200 agencies, leveraging a
+US\$180 million research budget to shape the design and implementation of
+US\$18 billion in development finance.
+DIME also provides advisory services to 30 multilateral and bilateral development agencies.
+DIME's research is organized into four primary topic pillars:
+Economic Transformation and Growth;
+Gender, Economic Opportunity, and Fragility;
+Governance and Institution Building;
+and Infrastructure and Climate Change.
+DIME has included dozens of research economists,
+and, over the years, has employed hundreds of full-time research assistants, field coordinators, and staff.
+The team has conducted over 325 impact evaluations.
+This book exists to take advantage of that concentration and scale of research,
+to synthesize many resources for data collection and research,
+and to make DIME tools available to the larger community of development researchers.
+
+As part of its broader mission, DIME invests in public goods
+to improve the quality and reproducibility of development research around the world.
+One key early innovation at DIME was the creation of DIME Analytics,
+the team responsible for writing and maintaining this book.
+DIME Analytics is responsible for ensuring the quality of research practices across DIME.
+This is done through an intensive, collaborative innovation cycle:
+DIME Analytics onboards and supports research assistants and field coordinators,
+provides standard tools and workflows to all teams,
+delivers intensive support when new tasks or challenges arise,
+and then develops and integrates lessons from those engagements to bring to the full team.
+
+\textit{Development Research in Practice} is intended for a broad audience.
+It synthesizes and compiles the key ideas, best practices, and research tools
+that the members of the DIME Analytics team have developed while supporting
+DIME's global impact evaluation portfolio over the last decade.
+
+\end{fullwidth}
+
+%------------------------------------------------
+
+\section{How to read this book}
+
+The book aims to be a highly practical resource so the reader can
+immediately begin to collaborate effectively on large, long-term research projects
+that already use the methods and tools outlined in this book.
+This book walks the reader through data work at each stage
+of an empirical research project, from design to publication.
+Each chapter focuses on a single stage in the data work process.
+We begin with ethical principles to guide empirical research,
+focusing on research reproducibility, transparency, and credibility.
+
+\textbf{Chapter 1} outlines a set of practices and ideals that help to ensure that
+research consumers can be confident in the conclusions reached,
+and research work can be assumed and verified to be reliable.
+We also introduce three primary tools that are used to document
+the aims and methods of a research project,
+ensuring that meta-information about your research is available
+and that you approach all data work with an eye towards the future.
+
+\textbf{Chapter 2} will teach you to structure your data work for collaborative research,
+while ensuring the privacy and security of research participants.
+It discusses the importance of planning the tools that will be used;
+lays the groundwork to structure the research project at its outset --
+long before any data is acquired --
+and provides suggestions for collaborative workflows and tools.
+It also describes essential ethical practices around data,
+as well as common pitfalls in legal and practical management of data
+that respect the rights of research participants.
+
+\textbf{Chapter 3} turns to the measurement framework,
+a special set of information that describes the data in your project
+and how you plan to use it -- what you are studying and why.
+Setting up a measurement framework means translating a research design to a data work plan,
+including master datasets that are appropriate to the design,
+tracking and monitoring field work across time,
+and implementing and evaluating experimental designs in a rigorous manner.
+
+\textbf{Chapter 4} covers data acquisition. We start with
+the legal and institutional frameworks for data ownership and licensing,
+to ensure that you are aware of the rights and responsibilities
+of using data collected by you or by others.
+We provide a deep dive on collecting high-quality primary electronic survey data,
+including developing and deploying survey instruments.
+Finally, we discuss secure data handling during transfer, sharing, and storage,
+which is essential in protecting the privacy of respondents in any data.
+
+\textbf{Chapter 5} describes workflows for data processing.
+It details how to construct ``tidy'' data at the appropriate units of analysis,
+how to ensure uniquely identified datasets, and
+how to routinely incorporate data quality checks into the workflow.
+It also provides guidance on de-identification and cleaning of personally-identified data,
+focusing on how to understand and structure data
+so that it is ready for indicator construction and analytical work.
+
+\textbf{Chapter 6} discusses data analysis.
+It begins with data construction, or the creation of new variables
+from the raw data acquired or collected in the field.
+It introduces core principles for writing analytical code
+and creating, exporting, and storing research outputs
+such as figures and tables reproducibly using dynamic documents.
+
+\textbf{Chapter 7} turns to publication of research outputs,
+including manuscripts, code, and data.
+This chapter discusses
+how to effectively collaborate on technical writing
+using {\LaTeX} as a document preparation system.
+It covers how and why to release or publish datasets
+in an accessible, citable, and safe fashion.
+Finally, it provides guidelines for preparing
+functional and informative reproducibility packages
+that contain all the code, data, and meta-information needed
+for others to evaluate and reproduce your work.
+
+After reading each chapter, you should understand
+what tasks your team will be performing,
+where in the data workflow each task falls,
+and how to implement them according to best practices.
+You should also understand how the various stages tie together,
+and what inputs and outputs are required from each.
+Then, the references and links contained in each chapter
+will lead you to detailed descriptions of individual
+ideas, tools, and processes when you need to implement the tasks yourself.
+In particular, highly specific implementation details
+will often be found on the \textbf{DIME Wiki}.\sidenote{Like this:
+\url{https://dimewiki.worldbank.org/Primary_Data_Collection}}
+
+The DIME Wiki is one of DIME Analytics' flagship products,
+a free online collection of our resources and best practices.\sidenote{
+\url{https://dimewiki.worldbank.org}}
+This book complements the DIME Wiki by providing a structured narrative
+of the data workflow for a typical research project.
+The Wiki, by contrast, provides unstructured but detailed and up-to-date information
+on how to complete each task, and links to further practical resources.
+For some implementation portions where precise code is particularly important,
+we provide minimal code examples either in the book or on the DIME Wiki.
+All code guidance is software-agnostic, but code examples are provided in Stata.
+
+\section{Handling original data is a core research task}
+
+An empirical revolution has changed the face of development research rapidly over the last decade.
+Increasingly, researchers are working not just with complex data,
+but with \textit{original} data:
+datasets either collected by the research team themselves
+or acquired through a unique agreement with a project partner.
+Original data, especially data collected or assembled by the team itself,
+requires that the team carefully document how the data was created, handled, and analyzed.
+These tasks now carry as much weight in the quality of the evidence
+as the research design and the statistical approaches do.
+At the same time, the scope and scale of empirical research projects are expanding:
+more people are working on the same data over longer timeframes.
+For that reason, the central premise of this book is that data work is a ``social process''.
+This means that the many different people on a team need to have the same ideas
+about what is to be done, and when and where and by whom,
+so that they can collaborate effectively on a large, long-term research project.
+
+In the past, these processes were often treated as a ``black box'' in research.
+A published manuscript might exhaustively detail
+research designs, estimation strategies, and theoretical frameworks,
+but would typically reserve very little space for detailed descriptions
+of how data was actually collected and handled.
+Not only is it almost impossible to assess the quality of the data in such a paper,
+it is very hard for research teams -- particularly new staff --
+to understand and learn the skills and tools needed to do this well.
+There are few guides to these conventions, standards, and best practices
+that are fast becoming a necessity for empirical research.
+Since 2010, reproducibility practices in development have been rapidly adopted,\cite{swanson2020research}
+in part due to increasing requirements by publishers and funders to release code and data.
+However, little practical guidance on complete data handling workflows is available for practitioners,
+aside from one relatively recent handbook on reproducibility.\cite{christensen2019transparent}
+This book aims to fill that gap.
+It covers data workflows at all stages of the research process,
+from design to data acquisition and analysis.
+
+The Analytics team has invested many hours over many years
+learning from data work across DIME's portfolio,
+identifying inefficiencies and barriers to success,
+developing tools and trainings, and standardizing best-practice workflows at DIME.
+It has also invested significant energy in the language and materials
+used to teach these workflows to new team members,
+and, in many cases, in software tools that support these workflows explicitly.
+DIME team members often work on diverse portfolios of projects
+with a wide range of teammates, and we have found
+that standardizing core processes across all projects
+results in higher-quality work with fewer opportunities to make mistakes.
+In that way, the Analytics team is DIME's method of ``institutionalizing''
+tools and practices, developed and refined over time,
+that give the department a common base of knowledge and practice.
+In 2018, for example, DIME adopted universal reproducibility checks
+conducted by the Analytics team;
+the lessons from this practice helped move the DIME team
+from a situation in 2018 in which 50\% of submitted papers
+required significant revision to pass,
+to one in 2019 in which 64\% of papers passed without any revision required.
+
+This book is intended to share these ideas, practices, and tools
+with everyone who interacts with development data.
+Its content is not sector-specific;
+it will not teach you econometrics,
+or how to design an impact evaluation.
+There are many excellent existing resources on those topics.
+Instead, this book will teach you how to think about all aspects of your research from a data perspective,
+how to structure research projects to maximize data quality,
+and how to institute transparent and reproducible workflows.
+
+Whether it is collected through surveys, shared by partner organizations,
+or acquired from ``big'' data sources like sensors, satellites, or call data records,
+data handling and documentation are key skills for researchers and staff.
+Standard processes and documentation practices
+are important throughout the research process to accurately convey
+and implement the intended research design in reality.\cite{vilhuber_lars_2020_3911311}
+For example, statistical code is typically an essential part of
+research design components such as sampling, randomization, and power analysis
+(a brief sketch of reproducible randomization appears below).
+As data is obtained, its quality must be validated,
+linkages between datasets must be organized and managed,
+and errors must be identified and corrected.
+Once raw data is in hand, researchers must create and analyze the final measures that
+are the motivation for the research study;
+then they must conduct the actual analyses.
+When these are done in an ad-hoc or project-specific manner,
+it is very difficult for others to understand what is being done --
+in that case, a reader has to simply trust that the author did these things right.
+Standardizing and documenting these processes
+makes it possible to evaluate and understand
+the exact details of each step of this work
+alongside any final research outputs.
+
+Researchers therefore need to maintain records of the handling and processing of all their data,
+which involves managing and collating different types of information,
+often at different levels of analysis and different stages in time.
+The tight linkages between documentation, data quality, and policy decisions
+have long been recognized by research entities such as government statistical agencies,\cite{jepdataquality}
+and must now be imported into the practice of researchers who collect original data
+rather than relying on data from, for example, highly experienced statistical agencies.
+
+A breakdown in any part of this data pipeline
+means that the results become unreliable.\cite{mccullough2008economics}
+If that happens, the results cannot be faithfully interpreted
+as being an accurate picture of the intended research design.\sidenote{
+ \url{https://blogs.worldbank.org/impactevaluations/more-replication-economics}}
+Because we almost never have ``laboratory'' settings in this type of research,
+such a failure has a very high cost:
+we will have wasted the investments that were made into knowledge generation,
+with little ability to reproduce or recreate the situation
+where we intended to operate the research project.\cite{camerer2016evaluating}
+Hence, accurate and reproducible data management and analysis is a core component
+of the credibility of development research.
+Being able to implement these tasks accurately and reproducibly
+is essential to the success and credibility of any modern research output.
+
+
+\section{Documenting data work with standardized code}
+
+One method of solving this problem in the social context
+is what we refer to as \textbf{process standardization}.
+Process standardization means that there is
+little ambiguity about how something ought to be done,
+and therefore the tools to do it can be set in advance.
+Standard processes help other people understand your work.
+Work should be well-documented in the sense that others can:
+(1) quickly understand what a particular process or output is supposed to be doing;
+(2) evaluate whether or not it does that thing correctly; and
+(3) modify it efficiently either to test alternative hypotheses
+or to adapt it into their own work.
+
+Modern quantitative research already relies heavily
+on statistical software tools, written in various coding languages,
+to standardize analytical work.
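+
+As a minimal illustration of what such standardized code can look like,
+consider a random assignment written so that it returns
+the same result every time it is run.
+This sketch is ours, not a required template:
+the dataset \texttt{baseline.dta}, the identifier \texttt{id},
+and the specific seed are hypothetical placeholders.
+
+\begin{verbatim}
+* A minimal sketch of reproducible random assignment in Stata.
+* Reproducibility rests on three habits: versioning, sorting, and seeding.
+version 13.1                  // Fix the random-number algorithm version
+use "baseline.dta", clear     // Hypothetical dataset, one row per unit
+isid id, sort                 // Require a unique ID and a stable sort order
+set seed 287608               // Seed chosen once, in advance, and recorded
+gen random = runiform()       // Draw a random number for each unit
+sort random
+gen treatment = (_n <= _N/2)  // Assign the first half to treatment
+\end{verbatim}
+
+Because the algorithm version, the sort order, and the seed are all fixed in code,
+anyone re-running this do-file obtains exactly the same assignment.
+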
+Outputs like regression tables and data visualizations
+are created using code in statistical software for two primary reasons.
+The first is that using a standard command or package ensures that the work is done correctly,
+and the second is that it ensures the same procedure can be confirmed or checked
+at a later date or using different data.
+Keeping a clear, human-readable record of these code and data structures is critical.
+While it is often \textit{possible} to perform nearly all the relevant tasks
+through an interactive user interface or even through software such as Excel,
+we strongly advise against this practice.
+In the context of statistical analysis,
+the practice of writing all work using standard code is widely accepted.
+To support this practice, DIME now maintains strict portfolio-wide standards
+about how analytical code should be maintained and made accessible
+before, during, and after release or publication.
+
+Over the last few years, DIME has extended the same principles to preparing data for analysis,
+which often comprises just as much (or more) of the manipulation done to the data
+over the life cycle of a research project.
+A major aim of this book is to encourage research teams
+to think about the tools and processes they use
+for designing, collecting, and handling data
+just as carefully as they do those for analytical tasks.
+Correspondingly, a major contribution of DIME Analytics
+has been tools and standard practices
+for implementing these tasks using statistical software.
+While we assume that you are going to do nearly all data work using code,
+many development researchers come from economics and statistics backgrounds
+and often understand code to be a means to an end rather than an output itself.
+We believe that this must change somewhat:
+in particular, we think that development practitioners
+must think about their code and programming workflows
+just as methodologically as they think about their research workflows,
+and think of code and data as research outputs, just as manuscripts and briefs are.
+
+This approach arises because we see the code as the ``recipe'' for the analysis.
+The code tells others exactly what was done,
+how they can do it again in the future,
+and provides a roadmap and knowledge base for further original work.\cite{hamermesh2007replication}
+Performing every task through written code
+creates a record of every task you performed.\cite{ozier2019replication}
+It also prevents direct interaction
+with the data files that could lead to non-reproducible processes.\cite{chang2015economics}
+Finally, DIME Analytics has invested a lot of time in developing code as a learning tool:
+the examples we have written and the commands we provide
+are designed to provide a framework for common practice
+across the entire DIME team, so that everyone is able to
+read, review, and provide feedback on the work of others
+starting from the same basic ideas about how various tasks are done.
+
+Most specific code tools have a learning and adaptation process,
+meaning you will become most comfortable with each tool
+only by using it in real-world work.
+To support your process of learning reproducible tools and workflows,
+we reference free and open-source tools wherever possible,
+and point to more detailed instructions when relevant.
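+
+As one concrete example of treating code as the recipe for an output,
+the sketch below shows how a regression table can be produced entirely from code
+and regenerated at will.
+It is only an illustration under stated assumptions:
+it uses Stata's built-in \texttt{auto} example dataset,
+the user-written \texttt{esttab} command from the \texttt{estout} package,
+and an arbitrary placeholder output filename.
+
+\begin{verbatim}
+* A sketch of producing an analytical output entirely from code.
+ssc install estout, replace   // User-written package providing esttab
+sysuse auto, clear            // Load a built-in example dataset
+regress price mpg weight      // Run the analysis; estimates are stored
+* Write the stored estimates to a formatted LaTeX table
+esttab using "price-regression.tex", replace se label
+\end{verbatim}
+
+Because the table is written by the script rather than copied by hand,
+re-running the do-file after any correction regenerates it exactly.
+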
+Stata, as proprietary software, is the notable exception to this preference
+for free and open-source tools,
+due to its persistent popularity in development economics and econometrics.\sidenote{
+ \url{https://aeadataeditor.github.io/presentation-20191211/\#9}}
+This book also includes, as an appendix,
+the \textbf{DIME Analytics Stata Style Guide}
+that we use in our work, which provides
+standards for coding in Stata so that code styles
+can be harmonized across teams for easier understanding and reuse of code.
+Stata has relatively few resources of this type available,
+and we hope that the one we have created and shared here
+will be an asset to all its users.
+
+While adopting the workflows and mindsets described in this book requires an up-front cost,
+it will save you (and your collaborators) a lot of time and hassle very quickly.
+In part this is because you will learn how to implement essential practices directly;
+in part because you will find tools for the more advanced practices;
+and most importantly because you will acquire the mindset of doing research with a high-quality data focus.
+We hope you will find this book helpful for accomplishing all of the above,
+and that mastery of data helps you make an impact.
+We hope that by the end of the book,
+you will have learned how to handle data more efficiently, effectively, and ethically
+at all stages of the research process.
+
+\mainmatter
diff --git a/chapters/1a-reproducibility.tex b/chapters/1-reproducibility.tex
similarity index 100%
rename from chapters/1a-reproducibility.tex
rename to chapters/1-reproducibility.tex
diff --git a/chapters/planning-data-work.tex b/chapters/2-collaboration.tex
similarity index 100%
rename from chapters/planning-data-work.tex
rename to chapters/2-collaboration.tex
diff --git a/chapters/sampling-randomization-power.tex b/chapters/3-measurement.tex
similarity index 100%
rename from chapters/sampling-randomization-power.tex
rename to chapters/3-measurement.tex
diff --git a/chapters/data-collection.tex b/chapters/4-acquisition.tex
similarity index 100%
rename from chapters/data-collection.tex
rename to chapters/4-acquisition.tex
diff --git a/chapters/data-processing.tex b/chapters/5-processing.tex
similarity index 100%
rename from chapters/data-processing.tex
rename to chapters/5-processing.tex
diff --git a/chapters/data-analysis.tex b/chapters/6-analysis.tex
similarity index 100%
rename from chapters/data-analysis.tex
rename to chapters/6-analysis.tex
diff --git a/chapters/introduction.tex b/chapters/introduction.tex
deleted file mode 100644
index 4b34244d3..000000000
--- a/chapters/introduction.tex
+++ /dev/null
@@ -1,254 +0,0 @@
-\begin{fullwidth}
-Welcome to \textit{Data for Development Impact}.
-This book is intended to teach all users of development data
-how to handle data effectively, efficiently, and ethically.
-An empirical revolution has changed the face of research economics rapidly over the last decade.
-%had to remove cite {\cite{angrist2017economic}} because of full page width
-Today, especially in the development subfield, working with raw data --
-whether collected through surveys or acquired from ``big'' data sources like sensors, satellites, or call data records --
-is a key skill for researchers and their staff.
-At the same time, the scope and scale of empirical research projects is expanding:
-more people are working on the same data over longer timeframes.
-As the ambition of development researchers grows, so too has the complexity of the data
-on which they rely to make policy-relevant research conclusions.
-Yet there are few guides to the conventions, standards, and best practices -that are fast becoming a necessity for empirical research. -This book aims to fill that gap. - -This book is targeted to everyone who interacts with development data: -graduate students, research assistants, policymakers, and empirical researchers. -It covers data workflows at all stages of the research process, from design to data acquisition and analysis. -Its content is not sector-specific; it will not teach you econometrics, or how to design an impact evaluation. -There are many excellent existing resources on those topics. -Instead, this book will teach you how to think about all aspects of your research from a data perspective, -how to structure research projects to maximize data quality, -and how to institute transparent and reproducible workflows. -The central premise of this book is that data work is a ``social process'', -in which many people need to have the same idea about what is to be done, and when and where and by whom, -so that they can collaborate effectively on large, long-term research projects. -It aims to be a highly practical resource: we provide code snippets, links to checklists and other practical tools, -and references to primary resources that allow the reader to immediately put recommended processes into practice. - -\end{fullwidth} - -%------------------------------------------------ - -\section{Doing credible research at scale} - -The team responsible for this book is known as \textbf{DIME Analytics}.\sidenote{ -\url{https://www.worldbank.org/en/research/dime/data-and-analytics}} -The DIME Analytics team is part of the \textbf{Development Impact Evaluation (DIME)} Department\sidenote{ -\url{https://www.worldbank.org/en/research/dime}} -within the World Bank's \textbf{Development Economics (DEC) Vice Presidency}.\sidenote{ -\url{https://www.worldbank.org/en/about/unit/unit-dec}} - -DIME generates high-quality and operationally relevant data and research -to transform development policy, help reduce extreme poverty, and secure shared prosperity. -It develops customized data and evidence ecosystems to produce actionable information -and recommend specific policy pathways to maximize impact. -DIME conducts research in 60 countries with 200 agencies, leveraging a -US\$180 million research budget to shape the design and implementation of -US\$18 billion in development finance. -DIME also provides advisory services to 30 multilateral and bilateral development agencies. -Finally, DIME invests in public goods (such as this book) to improve the quality and reproducibility of development research around the world. - -DIME Analytics was created to take advantage of the concentration and scale of research at DIME to develop and test solutions, -to ensure high quality data collection and research across the DIME portfolio, -and to make training and tools publicly available to the larger community of development researchers. -\textit{Data for Development Impact} compiles the ideas, best practices and software tools Analytics -has developed while supporting DIME's global impact evaluation portfolio. - -The \textbf{DIME Wiki} is one of our flagship products, a free online collection of our resources and best practices.\sidenote{ -\url{https://dimewiki.worldbank.org}} -This book complements the DIME Wiki by providing a structured narrative of the data workflow for a typical research project. 
-We will not give a lot of highly specific details in this text, -but we will point you to where they can be found.\sidenote{Like this: -\url{https://dimewiki.worldbank.org/Primary_Data_Collection}} -Each chapter focuses on one task, providing a primarily narrative account of: -what you will be doing; where in the workflow this task falls; -when it should be done; and how to implement it according to best practices. - -We will use broad terminology throughout this book to refer to research team members: -\textbf{principal investigators (PIs)} who are responsible for -the overall design and stewardship of the study; -\textbf{field coordinators (FCs)} who are responsible for -the implementation of the study on the ground; -and \textbf{research assistants (RAs)} who are responsible for -handling data processing and analytical tasks. - - -\section{Adopting reproducible tools} - -We assume througout all of this book -that you are going to do nearly all of your data work though code. -It may be possible to perform all relevant tasks -through the user interface in some statistical software, -or even through less field-specific software such as Excel. -However, we strongly advise against it. -The reason for that are the transparency, reproducibility and credibility principles -discussed in Chapter 1. -Writing code creates a record of every task you performed. -It also prevents direct interaction -with the data files that could lead to non-reproducible processes. -Think of the code as a recipe to create your results: -other people can follow it, reproduce it, -and even disagree with your the amount of spices you added -(or some of your coding decisions). -Many development researchers come from economics and statistics backgrounds -and often understand code to be a means to an end rather than an output itself. -We believe that this must change somewhat: -in particular, we think that development practitioners -must begin to think about their code and programming workflows -just as methodologically as they think about their research workflows. - -Most tools have a learning and adaptation process, -meaning you will become most comfortable with each tool -only by using it in real-world work. -To support your process of learning reproducible tools and workflows, -will reference free and open-source tools wherever possible, -and point to more detailed instructions when relevant. -Stata, as a proprietary software, is the notable exception here -due to its current popularity in development economics.\sidenote{ - \url{https://aeadataeditor.github.io/presentation-20191211/\#9}} -This book also includes -the DIME Analytics Stata Style Guide -that we use in our work, which provides -some new standards for coding so that code styles -can be harmonized across teams for easier understanding and reuse of code. -Stata has relatively few resources of this type available, -and the ones that we have created and shared here -we hope will be an asset to all its users. - - -\section{Writing reproducible code in a collaborative environment} -Throughout the book, we refer to the importance of good coding practices. -These are the foundation of reproducible and credible data work, -and a core part of the new data science of development research. -Code today is no longer a means to an end (such as a research paper), -rather it is part of the output itself: a means for communicating how something was done, -in a world where the credibility and transparency of data cleaning and analysis is increasingly important. 
-As this is fundamental to the remainder of the book's content, -we provide here a brief introduction to \textbf{``good'' code} and \textbf{process standardization}. - -``Good'' code has two elements: (1) it is correct, i.e. it doesn't produce any errors, -and (2) it is useful and comprehensible to someone who hasn't seen it before -(or even yourself a few weeks, months or years later). -Many researchers have been trained to code correctly. -However, when your code runs on your computer and you get the correct results, -you are only half-done writing \textit{good} code. -Good code is easy to read and replicate, making it easier to spot mistakes. -Good code reduces sampling, randomization, and cleaning errors. -Good code can easily be reviewed by others before it's published and replicated afterwards. - -Process standardization means that there is -little ambiguity about how something ought to be done, -and therefore the tools to do it can be set in advance. -Standard processes for code help other people to ready your code.\sidenote{ -\url{https://dimewiki.worldbank.org/Stata_Coding_Practices}} -Code should be well-documented, contain extensive comments, and be readable in the sense that others can: -(1) quickly understand what a portion of code is supposed to be doing; -(2) evaluate whether or not it does that thing correctly; and -(3) modify it efficiently either to test alternative hypotheses -or to adapt into their own work.\sidenote{\url{https://kbroman.org/Tools4RR/assets/lectures/07_clearcode.pdf}} - -You should think of code in terms of three major elements: -\textbf{structure}, \textbf{syntax}, and \textbf{style}. -We always tell people to ``code as if a stranger would read it'' -(from tomorrow, that stranger could be you!). -The \textbf{structure} is the environment your code lives in: -good structure means that it is easy to find individual pieces of code that correspond to tasks. -Good structure also means that functional blocks are sufficiently independent from each other -that they can be shuffled around, repurposed, and even deleted without damaging other portions. -The \textbf{syntax} is the literal language of your code. -Good syntax means that your code is readable -in terms of how its mechanics implement ideas -- -it should not require arcane reverse-engineering -to figure out what a code chunk is trying to do. -\textbf{Style}, finally, is the way that the non-functional elements of your code convey its purpose. -Elements like spacing, indentation, and naming (or lack thereof) can make your code much more -(or much less) accessible to someone who is reading it for the first time and needs to understand it quickly and correctly. - -As you gain experience in coding -and get more confident with the way you implement these suggestions, -you will feel more empowered to apply critical thinking to the way you handle data. -For example, you will be able to predict which section -of your script are more likely to create errors. -This may happen intuitively, but you will improve much faster as a coder -if you do it purposefully. -Ask yourself, as you write code and explore results: -Do I believe this number? -What can go wrong in my code? -How will missing values be treated in this command? -What would happen if more observations would be added to the dataset? -Can my code be made more efficient or easier to understand? 
- -\subsection{Code examples} -For some implementation portions where precise code is particularly important, -we will provide minimal code examples either in the book or on the DIME Wiki. -All code guidance is software-agnostic, but code examples are provided in Stata. -In the book, code examples will be presented like the following: - -\codeexample{code.do}{./code/code.do} - -We ensure that each code block runs independently, is well-formatted, -and uses built-in functions as much as possible. -We will point to user-written functions when they provide important tools. -In particular, we point to two suites of Stata commands developed by DIME Analytics, -\texttt{ietoolkit}\sidenote{\url{https://dimewiki.worldbank.org/ietoolkit}} and -\texttt{iefieldkit},\sidenote{\url{https://dimewiki.worldbank.org/iefieldkit}} -which standardize our core data collection, management, and analysis workflows. -We will comment the code generously (as you should), -but you should reference Stata help-files by writing \texttt{help [command]} -whenever you do not understand the command that is being used. -We hope that these snippets will provide a foundation for your code style. -Providing some standardization to Stata code style is also a goal of this team; -we provide our guidance on this in the Stata Style Guide in the Appendix. - -\section{Outline of this book} - -This book covers each stage of an empirical research project, from design to publication. -We start with ethical principles to guide empirical research, -focusing on research transparency and the right to privacy. -In Chapter 1, we outline a set of practices that help to ensure -research participants are appropriately protected and -research consumers can be confident in the conclusions reached. -Chapter 2 will teach you to structure your data work to be efficient, -collaborative and reproducible. -It discusses the importance of planning data work at the outset of the research project -- -long before any data is acquired -- and provides suggestions for collaborative workflows and tools. -In Chapter 3, we turn to research design, -focusing specifically on how to measure treatment effects -and structure data for common experimental and quasi-experimental research methods. -We provide an overview of research designs frequently used for -causal inference, and consider implications for data structure. -Chapter 4 concerns sampling and randomization: -how to implement both simple and complex designs reproducibly, -and how to use power calculations and randomization inference -to critically and quantitatively assess -sampling and randomization to make optimal choices when planning studies. - -Chapter 5 covers data acquisition. We start with -the legal and institutional frameworks for data ownership and licensing, -dive in depth on collecting high-quality survey data, -and finally discuss secure data handling during transfer, sharing, and storage. -Chapter 6 teaches reproducible and transparent workflows for data processing and analysis, -and provides guidance on de-identification of personally-identified data, -focusing on how to organize data work so that it is easy to code the desired analysis. -In Chapter 7, we turn to publication. You will learn -how to effectively collaborate on technical writing, -how and why to publish data, -and guidelines for preparing functional and informative replication packages. 
- - -While adopting the workflows and mindsets described in this book requires an up-front cost, -it will save you (and your collaborators) a lot of time and hassle very quickly. -In part this is because you will learn how to implement essential practices directly; -in part because you will find tools for the more advanced practices; -and most importantly because you will acquire the mindset of doing research with a high-quality data focus. -We hope you will find this book helpful for accomplishing all of the above, -and that mastery of data helps you make an impact. -We hope that by the end of the book, -you will have learned how to handle data more efficiently, effectively and ethically -at all stages of the research process. - -\mainmatter diff --git a/manuscript.tex b/manuscript.tex index 70f8b495b..f189f7d8e 100644 --- a/manuscript.tex +++ b/manuscript.tex @@ -16,16 +16,16 @@ % %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -\input{chapters/preamble.tex} +\input{auxiliary/preamble.tex} %---------------------------------------------------------------------------------------- % INTRODUCTION %---------------------------------------------------------------------------------------- \cleardoublepage -\chapter{Introduction: Data for development impact} % The asterisk leaves out this chapter from the table of contents +\chapter{Introduction: Development research in practice} -\input{chapters/introduction.tex} +\input{chapters/0-introduction.tex} %---------------------------------------------------------------------------------------- % CHAPTER 1 @@ -35,16 +35,16 @@ \chapter{Introduction: Data for development impact} % The asterisk leaves out th \chapter{Chapter 1: Reproducibility, transparency, and credibility} \label{ch:1} -\input{chapters/1a-reproducibility.tex} +\input{chapters/1-reproducibility.tex} %---------------------------------------------------------------------------------------- % CHAPTER 2 %---------------------------------------------------------------------------------------- -\chapter{Chapter 2: Collaborating on code and data} +\chapter{Chapter 2: Setting the stage for collaboration} \label{ch:2} -\input{chapters/planning-data-work.tex} +\input{chapters/2-collaboration.tex} %---------------------------------------------------------------------------------------- % CHAPTER 3 @@ -54,7 +54,7 @@ \chapter{Chapter 2: Collaborating on code and data} \chapter{Chapter 3: Establishing a measurement framework} \label{ch:3} -\input{chapters/sampling-randomization-power.tex} +\input{chapters/3-measurement.tex} %---------------------------------------------------------------------------------------- @@ -62,10 +62,10 @@ \chapter{Chapter 3: Establishing a measurement framework} %---------------------------------------------------------------------------------------- -\chapter{Chapter 4: Acquiring data} +\chapter{Chapter 4: Acquiring development data} \label{ch:4} -\input{chapters/data-collection.tex} +\input{chapters/4-acquisition.tex} @@ -73,10 +73,10 @@ \chapter{Chapter 4: Acquiring data} % CHAPTER 5 %---------------------------------------------------------------------------------------- -\chapter{Chapter 5: Cleaning data} +\chapter{Chapter 5: Cleaning and processing research data} \label{ch:5} -\input{chapters/data-processing.tex} +\input{chapters/5-processing.tex} %---------------------------------------------------------------------------------------- % CHAPTER 6 @@ -85,13 +85,13 @@ \chapter{Chapter 5: Cleaning data} \chapter{Chapter 6: Analyzing research data} \label{ch:6} 
-\input{chapters/data-analysis.tex}
+\input{chapters/6-analysis.tex}

%----------------------------------------------------------------------------------------
%	CHAPTER 7
%----------------------------------------------------------------------------------------

-\chapter{Chapter 7: Publishing collaborative research}
+\chapter{Chapter 7: Publishing research outputs}
\label{ch:7}

\input{chapters/7-publication.tex}

@@ -100,9 +100,9 @@ \chapter{Chapter 7: Publishing collaborative research}
% Conclusion
%----------------------------------------------------------------------------------------

-\chapter{Bringing it all together}
+\chapter*{Bringing it all together} % The asterisk leaves out this chapter from the table of contents

-\input{chapters/conclusion.tex}
+\input{auxiliary/conclusion.tex}

%----------------------------------------------------------------------------------------
%	APPENDIX : Stata Style Guide
@@ -111,10 +111,20 @@ \chapter{Bringing it all together}

\chapter{Appendix: The DIME Analytics Stata Style Guide}
\label{ap:1}

-\input{appendix/stata-guide.tex}
+\input{auxiliary/stata-guide.tex}

%----------------------------------------------------------------------------------------
+%----------------------------------------------------------------------------------------
+% APPENDIX : Research Design
+%----------------------------------------------------------------------------------------
+
+\chapter{Appendix: Research design for impact evaluation}
+\label{ap:2}
+
+\input{auxiliary/research-design.tex}
+
+%----------------------------------------------------------------------------------------

\backmatter
diff --git a/mkdocs/docs/bookpdf/Data-for-Development-Impact.pdf b/mkdocs/docs/bookpdf/Data-for-Development-Impact.pdf
deleted file mode 100644
index 3855c90ed..000000000
Binary files a/mkdocs/docs/bookpdf/Data-for-Development-Impact.pdf and /dev/null differ
diff --git a/mkdocs/docs/feedback.md b/mkdocs/docs/feedback.md
index 98429dfe9..cb790ad77 100644
--- a/mkdocs/docs/feedback.md
+++ b/mkdocs/docs/feedback.md
@@ -12,7 +12,7 @@ You can always send us an email with any type of feedback to [dimeanalytics@worl

### Feedback through GitHub.com

-If you are familiar with GitHub you can use the repository [github.com/worldbank/d4di](https://github.com/worldbank/d4di) to provide feedback. You can, for example, create [issues](https://www.github.com/worldbank/d4di/issues) with suggestions for improvement or contribute directly by forking the repository.
+If you are familiar with GitHub you can use the repository [github.com/worldbank/dime-data-handbook](https://github.com/worldbank/dime-data-handbook) to provide feedback. You can, for example, create [issues](https://www.github.com/worldbank/dime-data-handbook/issues) with suggestions for improvement or contribute directly by forking the repository.

## Already addressed errata
diff --git a/mkdocs/docs/index.md b/mkdocs/docs/index.md
index aaf80962e..1bee491c6 100644
--- a/mkdocs/docs/index.md
+++ b/mkdocs/docs/index.md
@@ -1,6 +1,6 @@
# Home

-Welcome to Data for Development Impact: The DIME Analytics Resource Guide.
+Welcome to Development Research in Practice: The DIME Analytics Data Handbook.
This book is intended to serve as an introduction to the primary tasks required in development research, from experimental design to data collection to data analysis to publication.
@@ -9,5 +9,5 @@ and is produced by [DIME Analytics](https://www.worldbank.org/en/research/dime/d

### Download the Book in PDF Format

-[Download from Github](https://github.com/worldbank/d4di/raw/master/mkdocs/docs/bookpdf/Data-for-Development-Impact.pdf)
-
+[Download from GitHub](https://github.com/worldbank/dime-data-handbook/raw/gh-pages/bookpdf/development-research-in-practice.pdf)
+
diff --git a/mkdocs/mkdocs.yml b/mkdocs/mkdocs.yml
index bd47aa255..825f17ae0 100644
--- a/mkdocs/mkdocs.yml
+++ b/mkdocs/mkdocs.yml
@@ -1,4 +1,4 @@
-site_name: Data for Development Impact
+site_name: Development Research in Practice
theme:
  name: material
  palette: