diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex
index c2ecc4ab2..d9ab2419b 100644
--- a/chapters/3-measurement.tex
+++ b/chapters/3-measurement.tex
@@ -3,32 +3,42 @@
 \begin{fullwidth}
 In this chapter we will show how you can save a lot of time
-and increase the quality of your research by planning your project's data requirements
-in advance, based on your research design.
-There are many published resources about
-the theories behind different research designs.
-This chapter will instead focus on how the design
-impacts a project's data requirements.
-We assume you have a working familiarity
-with the research designs mentioned here.
-If needed, you can reference \textcolor{red}{Appendix XYZ},
-where you will find more details
-and specific references for common impact evaluation methods.
+and increase the quality of your research by
+planning your project's data requirements in advance.
+Planning data requirements involves more than
+simply listing the key outcome variables.
+You need to understand how to structure the project's data
+to best answer the research questions,
+and create the tools to share this understanding across your team.
+
+The first section of this chapter discusses how to
+determine your project's data needs,
+and introduces \textit{DIME's Data Map} template.
+The template includes:
+one data linkage table,
+one or several master datasets, and
+one or several data flowcharts.
+These three tools will help communicate the project's data requirements
+both across the team and across time.
+This section also discusses what specific research data you need
+based on your project's research design,
+and how to document those data needs in the data map.
+We will discuss two types of variables:
+variables that tie your research design
+to the observations in the data,
+which we call \textbf{research variables};
+and variables that correspond to observations of the real world,
+which we call \textbf{measurement variables}.
+The project's data map needs to account for both.
+
+The second section of this chapter covers two activities where
+research data is created by the research team
+instead of being observed in the real world.
+Those two activities are random sampling and random assignment.
+Special attention is paid to how to ensure that
+these and other random processes are reproducible,
+which is critical for the credibility of your research.
-Planning data requirements is more than just listing key outcome variables.
-It requires understanding how to structure the project's data to best answer the research questions,
-and creating the tools to share this understanding across your team.
-The first section of this chapter discusses how to determine the data needs of the project,
-based on the research design and measurement framework,
-and how to document these through a data map and master dataset(s).
-The second section of this chapter covers random sampling and assignment
-and the necessary practices to ensure that
-these and other random processes are reproducible.
-Almost all research designs rely on a random component
-for the results of the research to be a valid interpretation of the real world.
-This includes both how a sample is representative to the population studied,
-and how the counterfactual observations in experimental design are statistically indistinguishable
-from the treatment observations.
 The chapter concludes with a discussion of power calculations and randomization inference,
 and how both are important tools to make optimal choices when planning data work.
@@ -37,18 +47,13 @@
 %-----------------------------------------------------------------------------------------------
-\section{Translating research design to master data}
+\section{Creating a data map}
 In most projects, more than one data source is needed
 to answer the research question.
-These could be multiple survey rounds,
-data acquired from different partners (e.g. administrative data,
-web scraping, implementation monitoring, etc)
-or complex combinations of these.
-For example, you may have different \textbf{units of observation}\sidenote{
-  The \textbf{unit of observation} is the unit at or for which data is collected. See
-  \url{https://dimewiki.worldbank.org/Unit\_of\_Observation}},
-and their level may vary from round to round.
-
+These could be data from multiple survey rounds,
+data acquired from different partners
+(such as administrative data, implementation data, or sensor data),
+web scraping,
+or complex combinations of these and other sources.
 However your study is structured,
 you need to know how to link data from all sources
 and analyze the relationships between the units that appear in them
 to answer all your research questions.
@@ -56,114 +61,243 @@ \section{Translating research design to master data}
 but your whole research team is unlikely to have the same understanding,
 at all times, of all the datasets required.
 The only way to make sure that the full team shares the same understanding
-is to create \textit{master datasets} and a \textit{data map}.
-
-\index{Master datasets}\textbf{Master datasets} serve three key functions.
-First, they list all the units that are eligible for the study,
-and enable you to map the data to the research design.
-Second, in designs where your team has direct control over interventions or other field work,
-they allow you to plan sampling and treatment assignment before going to the field.
-Finally, they constitute the single unambiguous location where all information
-related to the implementation and validity of your research is stored,
-as well as all information needed to correctly identify any observation in any of your project's datasets.
-The \textbf{data map} diagrams each data source the project will use,
-the unit of response, frequency of measurement,
-and the level(s) at which it can be linked to other datasets within the framework.
-
-
-\subsection{Creating master datasets and a data map}
-
-A \textbf{master dataset}\sidenote{
+is to create a \textbf{data map}\index{Data map}.\sidenote{
+  \url{https://dimewiki.worldbank.org/Data\_Map}}
+DIME's data map template has three components:
+one \textit{data linkage table},\index{Data linkage table}
+one or several \textit{master datasets},\index{Master datasets}
+and one or several \textit{data flowcharts}.\index{Data flowchart}
+
+A \textbf{data linkage table}\sidenote{
+  \url{https://dimewiki.worldbank.org/Data\_Linkage\_Table}}
+lists all the datasets that will be used in the project.
+Its most important function is to indicate
+how all those datasets can be linked when
+combining information from multiple data sources.
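+
+For example, a few rows of a data linkage table for a hypothetical project
+(all dataset and variable names here are purely illustrative)
+could look like this,
+with the baseline and endline datasets merging one-to-one on \texttt{hh\_id},
+and school records merging many-to-one onto the household data
+through a school identifier recorded at baseline:
+
+\begin{verbatim}
+Data source       Dataset            Unit of obs.   Project ID
+----------------  -----------------  -------------  ----------
+Baseline survey   baseline.dta       household      hh_id
+Endline survey    endline.dta        household      hh_id
+School records    school_admin.dta   school         school_id
+\end{verbatim}
+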
+Each \textbf{master dataset}\sidenote{
   \url{https://dimewiki.worldbank.org/Master\_Data\_Set}}
-\index{master datasets}
-details all project-wide time-invariant information
-about all observations encountered,
-as well as their relationship to the research design,
-typically summarized by sampling and treatment status.
-Having a plan for how to get the raw data into analysis shape
-before you acquire it,
-and making sure that the full research team knows where to find this information,
-will save you a ton of time during the course of the project
-and increase the quality of your research.
-
-You should create a master dataset
+lists all observations relevant to the project
+for a \textbf{unit of observation}\sidenote{
+  \url{https://dimewiki.worldbank.org/Unit\_of\_Observation}}
+and is the authoritative source for all research data
+about that unit of observation,
+including unique identifiers, sample status, and treatment assignment.
+\textbf{Data flowcharts}\sidenote{
+  \url{https://dimewiki.worldbank.org/Data\_Flow\_Chart}}
+list all data sources that are needed to create each analysis dataset,
+and what manipulations of these data sources are necessary
+to get to the final analysis dataset(s),
+such as merging, appending, or other linkages.
+
+The process of drafting the data map is itself useful,
+as it is an opportunity for the principal investigators
+to communicate their vision of the data environment,
+and for research assistants to communicate
+their understanding of that vision.
+The data map should be drafted at the outset of a project,
+before any data is acquired,
+but it is not a static document;
+it will need to be updated as the project evolves.
+
+\subsection{Data linkage table}
+
+To create a data map according to DIME's template,
+the first step is to create a \textbf{data linkage table}
+by listing, in a spreadsheet, all the data sources you know you will use.
+If one source of data will result in two different datasets,
+then list each dataset on its own row.
+For each dataset, list the unit of observation
+and the name of the project ID variable for that unit of observation.
+Your project should only have one project ID variable per unit of observation.
+When you list a dataset in the data linkage table --
+which should be done before that dataset is acquired --
+you should always make sure that the dataset will
+be fully and uniquely identified by the project ID,
+or make a plan for how
+the new dataset will be linked to the project ID.
+It is very labor-intensive to work with a dataset that
+does not have an unambiguous way to link to the project ID,
+and such datasets are a major source of error.
+
+The data linkage table should indicate whether
+datasets can be merged one-to-one (for example,
+merging baseline and endline datasets
+that use the same unit of observation),
+or whether two datasets need to be merged many-to-one
+(for example, school administrative data merged with student data).
+Your data map must indicate which ID variables
+can be used -- and how -- when merging datasets.
+The data linkage table is also a great place to list other metadata,
+such as the source of your data, its backup locations,
+the nature of the data license, and so on.
+
+\subsection{Master datasets}
+
+The second step in creating a data map is to create one \textbf{master dataset}
+for each unit of observation
-Therefore, any unit
-that will be used in sampling or treatment assignment,
-must have a master dataset,
-and that master dataset -- not field data --
-should be used when sampling or assigning treatment.
-Master data sets are often created from field data,
-but master data sets should be treated differently
-as there can be several field datasets,
-but only one authoritative master dataset.
-Master data sets should be created as soon as you
-start to get information about units.
-For example, when receiving a type of administrative data set for the first time,
-or after doing a respondent listing before a survey.
-
-You also need to record how all datasets for each unit of observation
-will link or merge with each other as needed.
-This linking scheme is called a \textbf{data map}\sidenote{
-  \url{https://dimewiki.worldbank.org/data\_map} (TO BE CREATED)}.
-\index{data maps}
-A data map is more than just a list of datasets.
-Its purpose is to specify the characteristics and linkages of those datasets.
-To link properly, the master datasets must include fully unique ID variables.\sidenote{
-  \url{https://dimewiki.worldbank.org/ID\_Variable\_Properties}}
-The master datasets should indicate whether datasets should be merge one-to-one,
-for example, merging baseline data and endline data that use the same unit of observation,
-or whether two datasets should be merged many-to-one,
-for example, school administrative data merged with student data.
-Your data map must indicate which ID variables can be used and how to merge datasets.
-It is common that administrative data use IDs
-that are different than the project IDs,
-and the linkage between those should be clear from your master dataset.
-
-The data map should also include metadata about the handling of all information.
-These characteristics may be updated as the project progresses.
-For example, you will need to note the original source of each dataset,
-as well as the project folder where
-the raw original data and codebooks are stored
-and where the back-ups for the each raw dataset are stored.
-
-Some of the characteristics in your master datasets and your data map
-should be filled in during the planning stage,
-but both of them should be active resources
-that are updated as your project evolves.
-Finally, your master data should not include any unlabeled missing values.
-If the information is missing for one unit,
-then the reason should always be indicated with a code.
-An example for such reason could be that a unit was not included in the treatment assignment
-as it was not sampled in the first place,
-or was not located in the data collection at a given round.
-
-\subsection{Defining study comparisons using master data}
-
-Your research design will determine what statistical comparisons you need
-to estimate in your analytical work.
-The research designs discussed here compare a group that received
-some kind of \textbf{treatment}\sidenote{
+that will be used in any significant research activity.
+Examples of such activities are data collection, data analysis,
+sampling, and treatment assignment.
+The master dataset should include, and be the authoritative source of,
+all \textbf{research variables}\sidenote{
+  \textbf{Research variables:} Research data that identifies observations
+  and maps research design information to those observations.
+  Research variables are time-invariant and
+  often, but not always, controlled by the research team.
+  Examples include
+  ID variables, sampling status, treatment status, and treatment uptake.}
+but should not include any \textbf{measurement variables}.\sidenote{
+  \textbf{Measurement variables:} Data that
+  measures attributes or records responses of research subjects.
+  Measurement variables are not controlled by the research team
+  and often vary over time.
+  Examples include characteristics of the research subject,
+  outcome variables, and control variables, among many others.}
+Research variables and measurement variables
+often come from the same source,
+but should not be stored in the same way.
+For example, if you acquire administrative data that includes both
+information on eligibility for the study (a research variable)
+and data on the topic of your study (measurement variables),
+you should first decide which variables are research variables,
+and store them in the master dataset,
+while storing the measurement variables in a cleaned dataset
+as described in Chapter 5.
+It is common that you will have to update
+your master datasets throughout your project.
+
+The most important function of the master dataset
+is to be the authoritative source
+for how all observations are identified.
+This means that the master datasets should include
+identifying information such as names and contact information,
+but also your \textbf{project ID}.\sidenote{
+  \textbf{Project ID:} The main ID used in your project to identify
+  observations.
+  You should never have multiple project IDs for the same unit of observation.
+  The project ID must uniquely and fully identify all observations in the project.
+  See \url{https://dimewiki.worldbank.org/ID\_Variable\_Properties} for more details.}
+The project ID is the ID variable used in the data linkage table,
+and is therefore how observations are linked across datasets.
+Your master dataset may list alternative IDs that are used,
+for example, by a partner organization.
+However, you must not use such an ID as your project ID,
+as you would then not be in control of
+who can re-identify data that you publish.
+The project ID must be created by the project team,
+and the linkage to direct identifiers
+should only be known to people listed on the IRB protocol.
+If you receive a dataset with an alternative ID,
+you should immediately replace it with your project ID,
+and the alternative ID should be dropped
+as a part of your de-identification (see Chapters 5 and 7).
+Your master dataset serves as the linkage between
+all other identifying information and your project ID.
+Since your master dataset is full of identifying information,
+it must always be encrypted.
+
+The starting point for the master dataset is typically a sampling frame
+(more on sampling frames later in this chapter).
+However, you should continuously update the master dataset with
+all observations ever encountered in your project,
+even if those observations are not eligible for the study.
+Examples include new observations listed during monitoring activities
+or observations that are connected to respondents in the study,
+for example in a social network module.
+This is useful because,
+if you ever need to perform a record linkage such as a fuzzy match
+on string variables like proper names,
+the more information you have, the fewer errors you are likely to make.
+If you ever need to do a fuzzy match,
+you should always do it between the master dataset
+and the dataset without an unambiguous identifier,
+as in the sketch below.
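+
+This sketch uses the user-written Stata command \texttt{matchit}
+(available from SSC) to link a dataset that arrived without a project ID
+to the master dataset by comparing names.
+All file and variable names here are hypothetical,
+and the candidate matches should always be reviewed by hand.
+
+\begin{verbatim}
+* Fuzzy-match an attendance list without a project ID against
+* the master dataset (ssc install matchit)
+use "master_dataset.dta", clear
+keep hh_id respondent_name
+
+* Compare every name in the master dataset to every name in the
+* attendance list and compute a string-similarity score
+matchit hh_id respondent_name ///
+    using "attendance_list.dta", ///
+    idusing(attendee_row) txtusing(attendee_name)
+
+* Review candidate pairs manually, starting from the best matches,
+* before accepting any link and recording hh_id in the new dataset
+gsort -similscore
+\end{verbatim}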
+You should not do anything with that dataset until
+you have successfully merged
+the project IDs from the master dataset.
+Any new observations that you encounter in this process
+should also be added to the master dataset.
+
+Since the master dataset is the authoritative source
+of the project ID and all research variables,
+it serves as an unambiguous method of mapping
+the observations in your study to your research design.
+
+\subsection{Data flowcharts}
+
+The third and final step in creating the data map is to create \textbf{data flowcharts}.
+Each analysis dataset
+(see Chapter 6 for a discussion of why you likely need multiple analysis datasets)
+should have a data flowchart showing how it was created.
+The flowchart is a diagram
+where each starting point is either a master dataset
+or a dataset listed in the data linkage table.
+The data flowchart should include instructions on how
+the datasets can be combined to create the analysis dataset.
+The operations used to combine the data could include
+appending, one-to-one merging,
+many-to-one or one-to-many merging, collapsing, and a broad variety of others.
+You must list which variable or set of variables
+should be used in each operation,
+and note whether the operation creates a new variable or combination of variables
+to identify the newly linked data.
+These variables should be project IDs when possible.
+Examples of exceptions are time variables in longitudinal data,
+and sub-units like farm plots that belong to farmers with project IDs.
+Once you have acquired the datasets listed in the flowchart,
+you can add to the data flowchart the number of observations
+in each starting dataset
+and the number of observations each resulting dataset
+should have after each operation.
+This is a great method to track attrition and to make sure that
+the operations used to combine datasets did not create unwanted duplicates
+or incorrectly drop any observations.
+
+A data flowchart can be created in a flowchart drawing tool
+(there are many free alternatives online) or
+by using shapes in Microsoft PowerPoint.
+You can also do this simply by drawing on a piece of paper and taking a photo,
+but we recommend a digital tool
+so that flowcharts can easily be updated over time.
+
+\section{Relating research design to the data map}
+
+After you have set up your data map,
+you need to think carefully about your research design
+and about which research variables you will need in the data analysis
+to relate differences in measurement variables
+to your research design.
+We assume you have a working familiarity
+with the research designs mentioned here.
+If needed, you can reference \textcolor{red}{Appendix XYZ},
+where you will find more details
+and specific references for common impact evaluation methods.
+
+\subsection{Defining research variables related to your research design}
+
+As DIME primarily works on impact evaluations,
+we focus our discussion here on research designs
+that compare a group that received
+some kind of \textbf{treatment}\index{Treatment}\sidenote{
+  \textbf{Treatment:} The general word for the evaluated intervention or event.
-  This includes being offered training or cash transfer from a program, experiencing a natural disaster etc.}
-against a counterfactual control group.\sidenote{
-  \textbf{Counterfactual:} A statistical description of what would have happened
+  This includes being offered a training or a cash transfer from a program,
+  or experiencing a natural disaster, among many other things.}
+against a counterfactual control group\index{Counterfactual}.\sidenote{
+  \textbf{Counterfactual:} A statistical description of
+  what would have happened
 to specific individuals in an alternative scenario, for example,
 a different treatment assignment outcome.}
-\index{counterfactual}
+
 The key assumption is that each
-person, facility, or village (or whatever the unit of intervention is)
+person, facility, or village
+(or whatever the unit of treatment is)
 had two possible states:
 their outcome if they did receive the treatment
 and their outcome if they did not receive that treatment.
 The average impact of the treatment, or the ATE\sidenote{
-  The \textbf{average treatment effect (ATE)}
-  is the expected average change in outcome
-  that untreated units would have experienced
+  The \textbf{average treatment effect (ATE)}
+  is the expected average change in outcome
+  that untreated units would have experienced
   had they been treated.},
-is defined as the difference
+is defined as the difference
 between these two states averaged over all units.
 
 However, we can never observe the same unit
@@ -172,32 +306,36 @@ \subsection{Defining study comparisons using master data}
 Instead, the treatment group is compared to a control group
 that is statistically indistinguishable,
 which makes the average impact of the treatment
-mathematically equivalent to the difference in averages between the groups.
+mathematically equivalent to
+the difference in averages between the groups.
 Statistical similarity is often defined
-as \textbf{balance} between two or more groups.
+as \textbf{balance} between two or more groups.
 Since balance tests are commonly run for impact evaluations,
-DIME Analytics created a Stata command to
+DIME Analytics created a Stata command to
 standardize and automate the creation of nicely-formatted balance tables:
 \texttt{iebaltab}\sidenote{
   \url{https://dimewiki.worldbank.org/iebaltab}}.
-Each research design has a different method for identifying the statistically-similar control group.
-for how the statistically similar control group is identified.
-The rest of this section covers how data requirements differ
-between different research designs.
-What does not differ, however, is that the authoritative source
-for which units are in the treatment group and which are in the control group
-should always be one or several variables in your master dataset.
-You will often have to merge that data to other datasets,
-but that is an easy task if you created a data map.
+Each research design has a different method for
+identifying the statistically-similar control group.
+The rest of this section covers how research data requirements
+differ between those different methods.
+What does not differ, however,
+is that these data requirements are all research variables,
+and that the research variables discussed below
+should always be included in the master dataset.
+You will often have to merge
+the research variables into other datasets,
+but that is an easy task
+if you have created a data linkage table.
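+
+As an illustration, the sketch below shows how this might look in practice:
+treatment status stored in the master dataset is merged onto a baseline dataset
+using the project ID, and \texttt{iebaltab} is then used to test for balance.
+All file and variable names here are hypothetical.
+
+\begin{verbatim}
+* Start from the baseline data
+use "baseline.dta", clear
+
+* Merge in the treatment assignment from the master dataset using
+* the project ID. Every baseline observation must match, but the
+* master dataset may list units that were never surveyed.
+merge 1:1 hh_id using "master_dataset.dta", ///
+    keepusing(treatment) assert(match using) keep(match) nogen
+
+* Balance table of baseline covariates across treatment and control
+iebaltab age hh_size income, grpvar(treatment) ///
+    rowvarlabels savetex("balance_table.tex") replace
+\end{verbatim}
+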
 %%%%% Experimental design
 
-In \textbf{experimental research designs},
-such as \textbf{randomized control trials (RCTs)},\sidenote{
+In \textbf{experimental research designs},
+such as\index{randomized control trials}\index{experimental research designs}
+\textbf{randomized control trials (RCTs)},\sidenote{
   \url{https://dimewiki.worldbank.org/Randomized\_Control\_Trials}}
-\index{randomized control trials} \index{experimental research designs}
 the research team determines which members of the studied population
 will receive the treatment.
 This is typically done by a randomized process
@@ -209,8 +347,10 @@ \subsection{Defining study comparisons using master data}
 then the two groups will, on average, be statistically indistinguishable.
 The treatment will therefore not be correlated
 with anything but the impact of that treatment.\cite{duflo2007using}
-The randomized assignment should be done using the master data,
-and the result should be saved there before being merged to other datasets.
+The randomized assignment should be done
+using data from the master dataset,
+and the result should be saved back to the master dataset
+before being merged into other datasets.
 
 %%%%% Quasi-experimental design
 
@@ -229,114 +369,85 @@ \subsection{Defining study comparisons using master data}
 to exploit events that occurred in the past.
-Therefore, these methods often use either secondary data,
-including administrative data or other classes of routinely-collected information,
-and it is important that your data map documents
-how that data is merged to any other data.
+Therefore, these methods often use secondary data,
+such as administrative data or other classes of routinely-collected information,
+and it is important that your data linkage table documents
+how this data can be linked to the rest of the data in your project.
 
-%%%%% Regression discontinuity
+%%%%% Research variables
 
-\textbf{Regression discontinuity (RD)}\sidenote{
+No matter the design, you should be very clear about
+which of the data points you observe or collect are research variables.
+For example,
+\textbf{regression discontinuity (RD)}\sidenote{
   \url{https://dimewiki.worldbank.org/Regression\_Discontinuity}}
 \index{regression discontinuity}
 designs exploit sharp breaks or limits
-in policy designs to separate a single group of potentially eligible recipients
-into comparable groups of individuals who do and do not receive a treatment.
-Common examples are test score thresholds and income thresholds,
-where the individuals on one side of some threshold receive
-a treatment but those on the other side do not.\sidenote{
-  \url{https://blogs.worldbank.org/impactevaluations/regression-discontinuity-porn}}
-The intuition is that, on average,
-individuals immediately on one side of the threshold
-are statistically indistinguishable from the individuals on the other side,
-and the only difference is receiving the treatment.
-In your data you need an unambiguous way
-to define which observations were above or below the cutoff.
+in policy designs.
 The cutoff determinant, or running variable,
-is often a continuous variable
-that is used to divide the sample into two or more groups.
-Both the running variable and a categorical cutoff variable, should be saved in your master dataset.
+determines which side of the cutoff each observation falls on.
-
-%%%%% IV regression
-
-\textbf{Instrumental variables (IV)}\sidenote{
+In \textbf{instrumental variables (IV)}\sidenote{
   \url{https://dimewiki.worldbank.org/Instrumental\_Variables}}
 \index{instrumental variables}
-designs, unlike the previous approaches,
-assume that the treatment effect is not directly identifiable.
-Similar to RD designs,
-IV designs focus on a subset of the variation in treatment take-up.
-Where RD designs use a \textit{sharp} or binary cutoff,
-IV designs are \textit{fuzzy}, meaning that the input does not completely determine
-the treatment status, but instead influence the \textit{probability of treatment}.
-You will need variables in your data
-that can be used to estimate the probability of treatment for each unit.
-These variables are called \textbf{instruments}.
-In IV designs, instead of the ordinary regression estimator,
-a special version called two-stage least squares (2SLS) is used
-to estimate the impact of the treatment.
-Stata has a built-in command called \texttt{ivregress},
-and another popular implementation is the user-written command \texttt{ivreg2}.
-
-
-%%%%% Matching
-
-\textbf{Matching}\sidenote{
+designs, the \textbf{instruments} influence the \textit{probability} of treatment.
+These research variables should be collected
+and stored in the master dataset.
+Both the running variable in RD designs
+and the instruments in IV designs
+are among the rare examples of research variables
+that may vary over time.
+In such cases, your research design should
+clearly indicate, ex ante, at what point in time they will be recorded,
+and this should be clearly documented in your master dataset.
+
+In \textbf{matching} designs, observations are often grouped
+by a stratum, group, index, or propensity score.\sidenote{
   \url{https://dimewiki.worldbank.org/Matching}}
-methods use observable characteristics to construct
-sets of treatment and control units
-where the observations in each set
-are as similar as possible.
 \index{matching}
-These sets can either consist of exactly one treatment and one control observation (one-to-one),
-a set of observations where
-both groups have more than one observation represented (many-to-many),
-or where only one group has more than one observation included (one-to-many).
-By now you can probably guess that
-the result of the matching needs to be saved in the master dataset.
-This is best done by assigning an ID to each matching set,
-and create a variable in the master dataset
-with the ID for the set each unit belongs to.
-The matching can even be done before the randomized assignment,
-so that treatment can be randomized within each matching set.
-This is a type of experimental design.
-Furthermore, if no control observations were identified before the treatment,
-then matching can be used to ex-post identify a control group.
-Many matching algorithms can only match on a single variable,
-so you first have to turn many variables into a single variable
-by using an index or a propensity score.\sidenote{
-  \url{https://dimewiki.worldbank.org/Propensity\_Score\_Matching}}
-DIME Analytics developed a command to match observations
-based on this single continuous variable: \texttt{iematch}\sidenote{
-  \url{https://dimewiki.worldbank.org/iematch}},
-part of the \texttt{ietoolkit} package.
+Like all research variables, the matching results
+should be stored in the master dataset.
+This is best done by assigning a matching ID
+to each matched pair or group,
+and creating a variable in the master dataset
+with the matching ID each unit belongs to.
+In all these cases, fidelity to the design is also important to record.
+A program intended for students that scored under 50\% on a test
+might have some cases where the program is offered to someone who scored 51\% on the test,
+or someone who scored 49\% on the test might decline to participate in the program.
+Differences between assignments and realizations
+should also be recorded in the master dataset.
 
 %-----------------------------------------------------------------------------------------------
 
-\subsection{Structuring complex data}
+\subsection{Time periods in data maps}
 
-Your data map and master dataset requirements also depends on
+Your data map should also take into consideration
 whether you are using data from one time period or several.
 A study that observes data in only one time period
 is called a \textbf{cross-sectional study}.
 \index{cross-sectional data}
 This type of data is relatively easy to collect and handle because
-you do not need to track individuals across time.
+you do not need to track individuals across time,
+and therefore no additional information is needed in your data map.
 Instead, the challenge in a cross-sectional study is to
 show that the control group is indeed a valid counterfactual
 to the treatment group.
 
-Observations over multiple time periods,
-referred to as \textbf{longitudinal data},
-\index{longitudinal data}
-can consist of either \textbf{repeated cross-sections}
-\index{repeated cross-sectional data}
-or \textbf{panel data}.
-\index{panel data}
+Observations over multiple time periods,
+referred to as \textbf{longitudinal data}\index{longitudinal data},
+can consist of either
+\textbf{repeated cross-sections}\index{repeated cross-sectional data}
+or \textbf{panel data}\index{panel data}.
 In repeated cross-sections,
 each successive round of data collection
 uses a new random sample of observations from the treatment and control groups,
-but in a panel data study the same observations are tracked and included each round.
-If you are using panel data,
-then your data map must document how the different rounds will be merged or appended,
-and your master dataset will be the authoritative source of which ID that will be used.
+but in a panel data study
+the same observations are tracked and included each round.
+If each round of data collection is a separate activity,
+then each round should be treated as a separate source of data
+and get its own row in the data linkage table.
+If the data is collected continuously,
+or at frequent intervals,
+then it can be treated as a single data source.
+The data linkage table must document
+how the different rounds will be merged or appended
+when panel data is collected in separate activities.
 
 You must keep track of the \textit{attrition rate} in panel data,
 which is the share of observations not observed in follow-up data.
@@ -345,97 +456,86 @@ \subsection{Structuring complex data}
 For example, poorer households may live in more informal dwellings,
 patients with worse health conditions might not survive to follow-up,
 and so on.
-If this is the case, then your results might only be an effect of your remaining sample
-being a subset of the original sample that were better or worse off from the beginning.
-You should have a variable in your master dataset that indicates attrition.
-A balance check using the attrition variable can provide insights
-as to whether the lost observations were systematically different
+If this is the case,
+then your results might only be an effect of your remaining sample
+being a subset of the original sample
+that was better or worse off from the beginning.
+You should have a variable in your master dataset
+that indicates attrition.
+A balance check using the attrition variable
+can provide insights as to whether the lost observations
+were systematically different
 compared to the rest of the sample.
 
 %-----------------------------------------------------------------------------------------------
 
 \subsection{Monitoring data}
 
-For any study with an ex-ante design,
+For any study with an ex-ante design,
 \textbf{monitoring data}\index{monitoring data}\sidenote{\url{
 https://dimewiki.worldbank.org/Monitoring\_Data}}
-is very important for understanding whether field realities match the research design.
-\index{monitoring data}
-Monitoring data is used to understand if the
-assumptions made during the research design corresponds to what is true in reality.
-The most typical example is to make sure that, in an experimental design,
+is very important for understanding whether the
+assumptions made during the research design
+correspond to what is true in reality.
+The most typical example is to make sure that,
+in an experimental design,
 the treatment was implemented according to your treatment assignment.
+While it is always better to monitor all activities,
+it might be too costly.
+In those cases, you can sample a smaller number of critical activities and monitor them.
+This will not be detailed enough to be used as a control in your analysis,
+but it will still give you a way to
+assess the validity of your research design assumptions.
-Treatment implementation is often carried out by partners,
-and field realities may be more complex realities than foreseen during research design.
-Furthermore, the field staff of your partner organization,
-might not be aware that their actions are the implementation of your research.
+Treatment implementation is often carried out by partners,
+and field realities may be more complex than foreseen during research design.
+Furthermore, the field staff of your partner organization
+might not be aware that their actions are the implementation of your research.
 Therefore, you must acquire monitoring data that
+tells you how well the treatment assignment in the field
 corresponds to your intended treatment assignment,
-for nearly all research designs.
-
-Another example of a research design where monitoring data is important
-are regression discontinuity (RD) designs
-where the discontinuity is a cutoff for eligibility of the treatment.
-For example,
-let's say your project studies the impact of a program for students that scored under 50\% at a test.
-We might have the exact results of the tests for all students,
-and therefore know who should be offered the program,
-however that is not the same as knowing who attended the program.
-A teacher might offer the program to someone that scored 51\% at the test,
-and someone that scored 49\% at the might decline to participate in the program.
-We should not pass judgment on a teacher that offers a program to a student
-they think can benefit from it,
-but if that was not inline with our research assumptions,
-then we need to understand how common that was.
-Otherwise the result of our research will not be helpful
-in evaluating the program.
-
-Monitoring data is particularly prone to errors
-relating to merging with other data set.
-If you send a team to simply record the name of all people attending a training,
-then potentially different spellings of names
--- especially when names have to be transcribed from other alphabets to the Latin alphabet --
-will be the biggest source of error in your monitoring activity.
-The time to discuss and document how monitoring data will be merged with precision
-to the rest of your data, is when you are creating your data map.
-The solution is often very simple, it is just a matter of solving it before it is too late.
-An example of a solution is to provide the monitoring teams
-with lists of the people's names spelled the same way as in your master dataset.
-Include both the people you expect to attend the training,
-and people that you do not expect to attend,
-but do not tell the monitoring team which person you expect to attend the training.
-
-Additionally, instruct the monitoring teams to record the name of any
-participant not in your lists.
-Add those names to your master datasets,
-as the most complete information possible will help you
-if you at any point in your project end up without an ID to merge on,
-and will have to compare names when merging data.
-Finally, while it is always better to monitor all activities,
-it might be to costly.
-In those cases you can sample a smaller number of critical activities and monitor them.
-This will not be detailed enough to be used as a control in your analysis,
-but it will still give an idea of the validity of your research design assumptions.
+for nearly all experimental research designs.
+
+Monitoring data is particularly prone to errors
+when linking it with the rest of the data in your project.
+Often, monitoring activities are done by
+sending a team to simply record the names of all people attending a training,
+or by a partner organization sharing their administrative data,
+which is rarely maintained in the same format or structure as your research data.
+In both those cases it can be difficult to make sure that
+the project ID or any other unambiguous identifier in your master datasets
+is used to record who is who.
+Planning ahead for this when the monitoring activity is added to the data linkage table
+is the best protection against ending up with a poor correlation
+between treatment uptake and treatment assignment
+and no way to tell whether that poor correlation is just
+a result of a fuzzy link between the monitoring data and the rest of your data.
 
 %-----------------------------------------------------------------------------------------------
 %-----------------------------------------------------------------------------------------------
 
-\section{Implementing random sampling and treatment assignments}
+\section{Research variables created by randomization}
 
-Random sampling and treatment assignment are two core elements of research design.
-In experimental methods, random sampling and treatment assignment directly determine
+Random sampling and random treatment assignment are two research activities
+at the core of research design
+that generate research variables.
+In experimental methods,
+random sampling and treatment assignment directly determine
 the set of individuals who are going to be observed
 and what their status will be for the purpose of effect estimation.
-In quasi-experimental methods, random sampling determines what populations the study
+In quasi-experimental methods,
+random sampling determines what populations the study
 will be able to make meaningful inferences about,
 and random treatment assignment creates counterfactuals.
-Randomization\sidenote{
-  \textbf{Randomization} is often used interchangeably to mean random treatment assignment.
-  In this book however, \textit{randomization} will only be used to describe the process of generating
-  a sequence of unrelated numbers, i.e. a random process.
-  \textit{Randomization} will never be used to mean the process of assigning units in treatment and control groups,
+\textbf{Randomization}\sidenote{
+  \textbf{Randomization} is often used interchangeably
+  to mean random treatment assignment.
+  In this book, however, \textit{randomization} will only
+  be used to describe the process of generating
+  a sequence of unrelated numbers, i.e. a random process.
+  \textit{Randomization} will never be used to mean
+  the process of assigning units to treatment and control groups;
   that will always be called \textit{random treatment assignment},
-  or a derivative thereof.}
+  or a derivative thereof.}
 is used to ensure that a sample is representative
 and that any treatment and control groups
 are statistically indistinguishable after treatment assignment.
@@ -455,7 +555,7 @@ \section{Implementing random sampling and treatment assignments}
 \textit{Power calculation} and \textit{randomization inference}
 are the main methods by which these probabilities of error are assessed.
 These analyses are particularly important in the initial phases of development research --
-typically conducted before any actual field work occurs --
+typically conducted before any data acquisition or field work occurs --
 and have implications for feasibility, planning, and budgeting.
 
 %-----------------------------------------------------------------------------------------------
@@ -467,7 +567,7 @@ \subsection{Randomizing sampling and treatment assignment}
 This process can be used, for example, to select a subset from all eligible units
 to be included in data collection when the cost of collecting data on everyone is prohibitive.\sidenote{
   \url{https://dimewiki.worldbank.org/Sample\_Size\_and\_Power\_Calculations}}
-But it can also be used to select a sub-sample of your observations to test a computationally heavy process
+But it can also be used to select a sub-sample of your observations to test a computationally heavy process
 before running it on the full data.
 \textbf{Randomized treatment assignment} is the process of assigning observations to different treatment arms.
 This process is central to experimental research design.
@@ -477,16 +577,26 @@ \subsection{Randomizing sampling and treatment assignment}
 will be observed at all in the course of data collection,
 randomized assignment determines if each individual
 will be observed as a treatment observation or used as a counterfactual.
+
 The list of units to sample or assign from may be called
 a \textbf{sampling universe}, a \textbf{listing frame}, or something similar.
-This list should always be your \textbf{master dataset} when possible.
-The rare exceptions when master datasets cannot be used is when sampling must be done in real time --
-for example, randomly sampling patients as they arrive at a health facility.
-In those cases, it is important that you collect enough data during the real time sampling,
-such that you can create a master dataset over these individuals afterwards.
+This list should always be your \textbf{master dataset} when possible,
+and the result should always be saved in the master dataset
+before being merged into any other data.
+One example of the rare exceptions
+when a master dataset cannot be used is
+when sampling must be done in real time --
+for example, randomly sampling patients
+as they arrive at a health facility.
+In those cases,
+it is important that you collect enough data
+during the real-time sampling
+so that you can add these individuals,
+and the result of the sampling,
+to your master dataset afterwards.
 
 % implement uniform-probability random sampling
-The simplest form of sampling is
+The simplest form of sampling is
 \textbf{uniform-probability random sampling}.
 This means that every eligible observation in the master dataset
 has an equal probability of being selected.
@@ -536,7 +646,7 @@ \subsection{Programming reproducible random processes}
 This section introduces strict rules:
 these are non-negotiable (but thankfully simple).
 Stata, like most statistical software,
 uses a \textbf{pseudo-random number generator}
-which, in ordinary research use,
+which, in ordinary research use,
-produces sequences of number that are as good as random.\sidenote{
+produces sequences of numbers that are as good as random.\sidenote{
 \url{https://dimewiki.worldbank.org/Randomization\_in\_Stata}}
 However, for \textit{reproducible} randomization,
 we need two additional properties:
@@ -617,19 +727,19 @@ \subsection{Programming reproducible random processes}
 that randomized assignment results be revealed in the field.
 It is possible to do this using survey software or live events,
 such as a live lottery.
 These methods typically do not leave a record of the randomization,
-and as such are never reproducible.
+and as such are never reproducible.
-However, you can often run your randomization in advance
+However, you can often run your randomization in advance
-even when you do not have list of eligible units in advance.
+even when you do not have a list of eligible units in advance.
-Let's say you want to, at various health facilities,
+Let's say you want to, at various health facilities,
 randomly select a sub-sample of patients as they arrive.
-You can then have a pre-generated list
+You can then have a pre-generated list
-with a random order of ''in sample'' and ''not in sample''.
+with a random order of ``in sample'' and ``not in sample''.
 Your field staff would then go through this list in order
 and cross off one randomized result as it is used for a patient.
 This is especially beneficial if you are implementing a more complex randomization,
-for example, sample 10\% of the patients, show a video for 50\% of the sample,
-and ask a longer version of the questionnaire to 20\% of both
+for example, sampling 10\% of the patients, showing a video to 50\% of the sample,
+and asking a longer version of the questionnaire to 20\% of both
 the group of patients that watch the video and those that did not.
-The real time randomization is much more likely to be implemented correctly,
-if your field staff simply can follow a list with the randomized categories
+The real-time randomization is much more likely to be implemented correctly
+if your field staff can simply follow a list with the randomized categories
@@ -640,7 +750,7 @@
 Finally, if this real-time randomization implementation is done using survey software,
 then the pre-generated list of randomized categories can be preloaded into the questionnaire.
-Then the field team can follow a list of respondent IDs
+Then the field team can follow a list of respondent IDs
 that are randomized into the appropriate categories,
 and the survey software can show a video
 and control which version of the questionnaire is asked.
 This way, you reduce the risk of errors in field randomization.
@@ -772,12 +882,12 @@ \section{Doing power calculations for research design}
 so such a study would never be able to say anything about the effect size
 that is practically relevant.
 Conversely, the \textbf{minimum sample size}
 pre-specifies expected effect sizes and tells you
 how large a study's sample would need to be to detect that effect,
-which can tell you what resources you would need
+which can tell you what resources you would need
 to implement a useful study.
 
 % what is randomization inference
 \textbf{Randomization inference}\sidenote{
-  \url{https://dimewiki.worldbank.org/Randomization\_Inference}}
+  \url{https://dimewiki.worldbank.org/Randomization\_Inference}}
 is used to analyze the likelihood
 \index{randomization inference}
 that the randomized assignment process,
 by chance,