From fdfc528e96e3fab2a1dbcc6357c3e95918278496 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 18 Aug 2020 10:41:04 -0400 Subject: [PATCH 01/41] [ch3] data plan intro --- chapters/3-measurement.tex | 35 +++++++++++++++++++++++++++++------ 1 file changed, 29 insertions(+), 6 deletions(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index c2ecc4ab2..bb0f2ae64 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -37,17 +37,13 @@ %----------------------------------------------------------------------------------------------- -\section{Translating research design to master data} +\section{Translating research design to a data plan} In most projects, more than one data source is needed to answer the research question. These could be multiple survey rounds, data acquired from different partners (e.g. administrative data, web scraping, implementation monitoring, etc) or complex combinations of these. -For example, you may have different \textbf{units of observation}\sidenote{ - The \textbf{unit of observation} is the unit at or for which data is collected. See - \url{https://dimewiki.worldbank.org/Unit\_of\_Observation}}, -and their level may vary from round to round. However your study is structured, you need to know how to link data from all sources and analyze the relationships between the units that appear in them @@ -56,7 +52,34 @@ \section{Translating research design to master data} but your whole research team is unlikely to have the same understanding, at all times, of all the datasets required. The only way to make sure that the full team shares the same understanding -is to create \textit{master datasets} and a \textit{data map}. +is to create a \textbf{data plan}\index{Data plan}.\sidenote{ + \url{https://dimewiki.worldbank.org/Data\_Plan}} +DIME's data plan template has three components; +one \textit{data linkage table},\index{Data linkage table} +one or several \textit{master datasets}\index{Master datasets} +and one or several \textit{data flow charts}.\index{Data flowchart} + +A \textbf{data linkage table}\sidenote{ + \url{https://dimewiki.worldbank.org/Data\_Linkage\_Table}} +lists all the datasets that will be used in the project. +Its most important function is to indicate how all those datasets can be be linked when +combining information from multiple data sources. + +\textbf{Master datasets}\sidenote{ + \url{https://dimewiki.worldbank.org/Master\_Data\_Set}} +list all observations your project ever encounter. +You will need one master dataset per \textbf{unit of observation}. +The master dataset is the authoritative source for all research meta information +such as ID values, sampling and treatment status, etc. + +\textbf{Data flow charts}\sidenote{ + \url{https://dimewiki.worldbank.org/Data\_Flow\_Chart}} +list all datasets that are needed to create each analysis dataset, +and how they should be combined. +You will, in general, need one data flow chart per analysis dataset +(see Chapter 6 for discussion on why you most likely need multiple analysis datasets) +but if two analysis datasets are very similar, +they can be included in the same chart. \index{Master datasets}\textbf{Master datasets} serve three key functions. 
First, they list all the units that are eligible for the study, From 40b1459cbb5a585b7ba5c7c7b604a3df1e60b31b Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 18 Aug 2020 11:35:03 -0400 Subject: [PATCH 02/41] [ch3] data linkage table --- chapters/3-measurement.tex | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index bb0f2ae64..4bdecd867 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -82,6 +82,33 @@ \section{Translating research design to a data plan} they can be included in the same chart. \index{Master datasets}\textbf{Master datasets} serve three key functions. +\subsection{Creating a data plan} + +The components of the data plan is the best tool a research team has for +the lead researchers to communicate their vision for the data work, +and for the research assistants to communicate their understanding of that vision. +While the time to create the data plan is before any data has been acquired, +it should be a dynamic documentation, +that you should keep up to date as the project evolves. + +To create a \textbf{data linkage table} you simply start by listing +all the datasets you know that you will use in a spreadsheet. +If one source of data will results in two different dataset, +then list each dataset on a new row. +Then for each dataset list the unit of observation, +and the name of the ID variable for that unit of observation. +Your project should only have one main ID variable per unit of observation, +so make sure that all datasets with the same unit of observation +have the same ID variable. +Make sure that before acquiring any data, +you have listed the dataset in the data linkage table, +and have a plan for how you will make sure that the data +is identified using your main ID variable. +Something very labor expensive to fix and a big source of error, +is when a research team ends up with a dataset they do not have an +unambiguous way to link to other data sources. +The data linkage table is also a great place to list other type of meta data, +such as the source of your data, and backup location. First, they list all the units that are eligible for the study, and enable you to map the data to the research design. Second, in designs where your team has direct control over interventions or other field work, From 61280a20af47ca1aa3558e7706e3fdd170293e0c Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 18 Aug 2020 14:29:12 -0400 Subject: [PATCH 03/41] [ch3] master datasets --- chapters/3-measurement.tex | 142 +++++++++++++++---------------------- 1 file changed, 59 insertions(+), 83 deletions(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index 4bdecd867..ce227cdf0 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -81,7 +81,7 @@ \section{Translating research design to a data plan} but if two analysis datasets are very similar, they can be included in the same chart. -\index{Master datasets}\textbf{Master datasets} serve three key functions. + \subsection{Creating a data plan} The components of the data plan is the best tool a research team has for @@ -108,88 +108,64 @@ \subsection{Creating a data plan} is when a research team ends up with a dataset they do not have an unambiguous way to link to other data sources. The data linkage table is also a great place to list other type of meta data, -such as the source of your data, and backup location. 
-First, they list all the units that are eligible for the study, -and enable you to map the data to the research design. -Second, in designs where your team has direct control over interventions or other field work, -they allow you to plan sampling and treatment assignment before going to the field. -Finally, they constitute the single unambiguous location where all information -related to the implementation and validity of your research is stored, -as well as all information needed to correctly identify any observation in any of your project's datasets. -The \textbf{data map} diagrams each data source the project will use, -the unit of response, frequency of measurement, -and the level(s) at which it can be linked to other datasets within the framework. - - -\subsection{Creating master datasets and a data map} - -A \textbf{master dataset}\sidenote{ - \url{https://dimewiki.worldbank.org/Master\_Data\_Set}} -\index{master datasets} -details all project-wide time-invariant information -about all observations encountered, -as well as their relationship to the research design, -typically summarized by sampling and treatment status. -Having a plan for how to get the raw data into analysis shape -before you acquire it, -and making sure that the full research team knows where to find this information, -will save you a ton of time during the course of the project -and increase the quality of your research. - -You should create a master dataset -for each unit of observation -relevant to the research. -This includes all units used for significant research activity, -like data collection or data analysis. -Therefore, any unit -that will be used in sampling or treatment assignment, -must have a master dataset, -and that master dataset -- not field data -- -should be used when sampling or assigning treatment. -Master data sets are often created from field data, -but master data sets should be treated differently -as there can be several field datasets, -but only one authoritative master dataset. -Master data sets should be created as soon as you -start to get information about units. -For example, when receiving a type of administrative data set for the first time, -or after doing a respondent listing before a survey. - -You also need to record how all datasets for each unit of observation -will link or merge with each other as needed. -This linking scheme is called a \textbf{data map}\sidenote{ - \url{https://dimewiki.worldbank.org/data\_map} (TO BE CREATED)}. -\index{data maps} -A data map is more than just a list of datasets. -Its purpose is to specify the characteristics and linkages of those datasets. -To link properly, the master datasets must include fully unique ID variables.\sidenote{ - \url{https://dimewiki.worldbank.org/ID\_Variable\_Properties}} -The master datasets should indicate whether datasets should be merge one-to-one, -for example, merging baseline data and endline data that use the same unit of observation, -or whether two datasets should be merged many-to-one, -for example, school administrative data merged with student data. -Your data map must indicate which ID variables can be used and how to merge datasets. -It is common that administrative data use IDs -that are different than the project IDs, -and the linkage between those should be clear from your master dataset. - -The data map should also include metadata about the handling of all information. -These characteristics may be updated as the project progresses. 
-For example, you will need to note the original source of each dataset, -as well as the project folder where -the raw original data and codebooks are stored -and where the back-ups for the each raw dataset are stored. - -Some of the characteristics in your master datasets and your data map -should be filled in during the planning stage, -but both of them should be active resources -that are updated as your project evolves. -Finally, your master data should not include any unlabeled missing values. -If the information is missing for one unit, -then the reason should always be indicated with a code. -An example for such reason could be that a unit was not included in the treatment assignment -as it was not sampled in the first place, -or was not located in the data collection at a given round. +such as the source of your data, and backup locations etc. + +You should have one \textbf{master dataset} for each \textbf{unit of observation}\sidenote{ + \url{https://dimewiki.worldbank.org/Unit\_of\_Observation}} +used in any significant research activity. +Examples of such activities are data collection, data analysis, +sampling or treatment assignment. +The master dataset should include and be the authoritative source of +all \textbf{research variables}\sidenote{ + \textbf{Research variables:} Research related meta data that identifies observations + and maps research design information those observation. + Research variables are time-invariant and + often, but not always, controlled by the research team. + Examples include + ID variables, sampling status, treatment status, treatment uptake.} +but not include any \textbf{measurement variables}\sidenote{ + \textbf{Measurement variables:} Data that corresponds to observations of the real world. Research variables are not controlled by the research team and often vary over time. + Examples include characteristics of the research subject, outcome variables, input variables among many others.}. +Research variables and measurement variables often come from the same source, +but should not be stored in the same way. +For example, if you are shared administrative data that both includes +information on eligibility to be included in the study (research variable) +and information on outcome on the topic of your study (measurement variable) +you should first decide which variables are research variables, +remove them during the data cleaning (see Chapter 6) +and instead store them in your master dataset. +It is common that you will have to update +your master datasets throughout your project. +Examples of research variables +that you cannot know when you initially set up your master dataset +are treatment uptake and attrition variables. + +The most important function of the master dataset +is to be the authoritative source of identifiers, +and all observations listed should have a unique project ID\sidenote{ + \url{https://dimewiki.worldbank.org/ID\_Variable\_Properties}}. +You should also list all other identifiers your project interact with, +such as names, addresses and other ID values used by your partner organization, +and serve as the linkage between those identifiers and the project ID. +Because of this, there are very few cases where your master datasets +does not need to be encrypted. +Even when a partner organization have a unique identifier, +you should always create a project ID specific to your project only, +as you are otherwise not in control over who can re-identify your de-identified dataset. 
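As a concrete illustration of that linkage role, a minimal Stata sketch of attaching the project ID to data that arrives identified only by a partner organization's own ID could look like the following; every file and variable name here is a hypothetical placeholder, not a required convention:

* Sketch: use the master dataset to attach the project ID (hh_id) to partner
* data that is identified only by the partner organization's ID (partner_id)
use "master_households.dta", clear
keep hh_id partner_id
drop if missing(partner_id)           // units with no partner identifier cannot be linked this way
isid partner_id                       // required for a one-to-many merge from the master side
merge 1:m partner_id using "partner_admin_data.dta", keep(match using)
count if _merge == 2                  // partner records that could not be linked to a project ID
drop _merge
save "partner_admin_data_with_projectid.dta", replace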
+You should include all observations ever encountered, +even if they are not eligible for your study, +because if you ever end up needed to do a fuzzy match, +you will do fewer errors the more information you have. +If you acquire any data without your project ID, +you should always start by understanding how that data + can be linked to the master dataset, +and then merge the project IDs to the new data, +before you do anything with that dataset. +With the master datasets as an up to date authoritative source +of the project IDs and all research variables, +you have an unambiguous method of mapping +the observations in your study to your research design. + \subsection{Defining study comparisons using master data} From 5f4c08bf498396abc641959f2f1ed4756538bc8a Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 18 Aug 2020 14:54:47 -0400 Subject: [PATCH 04/41] [ch3] data flow chart --- chapters/3-measurement.tex | 29 +++++++++++++++++++++++++++++ 1 file changed, 29 insertions(+) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index ce227cdf0..c71070302 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -166,6 +166,35 @@ \subsection{Creating a data plan} you have an unambiguous method of mapping the observations in your study to your research design. +The third and final component of the data plan is the \textbf{data flow charts}. +All analysis datasets +(see Chapter 6 for discussion on why you likely need multiple analysis datasets) +should have a data flow chart that is a diagram +where each starting point is either a master dataset +or a dataset listed in the data linkage table. +The data flow chart should include instructions on how +all datasets should be combined to create the analysis dataset. +The operations used to combine the data could be +appending, one-to-one merging, +many-to-one/one-to-many merging or any other method. +You should also list which ID variable or set of ID variables +should be used in these operations, +and if the operation results in a new variable or set of variable +that identifies the data. +Once you have the data used in the flow chart, +you can also start listing the number of observations the datasets +should have before and after each observation. +This is a great tool to track attrition and to make sure that +the operations used to combined dataset did not create unwanted duplicates +or incorrectly dropped any observations. + +A data flow chart can be created in a flow chart drawing tool +(there are many free alternatives online), +by using shapes in Microsoft PowerPoint, +or simply by drawing on a piece of paper and take a photo. +However, we recommend that the charts are created in a digital tool +so that new versions can easily be created +once you learn new things about your research throughout your project. \subsection{Defining study comparisons using master data} From 7ff7e863a28e414d54f586c7677baa3c27cf2e08 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 18 Aug 2020 14:55:06 -0400 Subject: [PATCH 05/41] [ch3] data plan edits --- chapters/3-measurement.tex | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index c71070302..dccf07a63 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -64,21 +64,18 @@ \section{Translating research design to a data plan} lists all the datasets that will be used in the project. 
Its most important function is to indicate how all those datasets can be be linked when combining information from multiple data sources. - \textbf{Master datasets}\sidenote{ \url{https://dimewiki.worldbank.org/Master\_Data\_Set}} list all observations your project ever encounter. -You will need one master dataset per \textbf{unit of observation}. The master dataset is the authoritative source for all research meta information such as ID values, sampling and treatment status, etc. - \textbf{Data flow charts}\sidenote{ \url{https://dimewiki.worldbank.org/Data\_Flow\_Chart}} list all datasets that are needed to create each analysis dataset, -and how they should be combined. +and how they should be created by merging, appending +or in any other way link different datasets. You will, in general, need one data flow chart per analysis dataset -(see Chapter 6 for discussion on why you most likely need multiple analysis datasets) -but if two analysis datasets are very similar, +but if two or more analysis datasets are created very similarly, they can be included in the same chart. @@ -100,6 +97,12 @@ \subsection{Creating a data plan} Your project should only have one main ID variable per unit of observation, so make sure that all datasets with the same unit of observation have the same ID variable. +The data linkage table should indicate whether datasets should be merge one-to-one, +for example, merging baseline data and endline data that use the same unit of observation, +or whether two datasets should be merged many-to-one, +for example, school administrative data merged with student data. +Your data map must indicate which ID variables can be used and how to merge datasets. + Make sure that before acquiring any data, you have listed the dataset in the data linkage table, and have a plan for how you will make sure that the data From 15c6078d7bc902f043b0d5d612e88b2f2ee96455 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 18 Aug 2020 15:26:53 -0400 Subject: [PATCH 06/41] [ch3] research designs and research variables --- chapters/3-measurement.tex | 55 +++++++++++++++++++++----------------- 1 file changed, 31 insertions(+), 24 deletions(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index dccf07a63..a4a1c296c 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -9,11 +9,7 @@ the theories behind different research designs. This chapter will instead focus on how the design impacts a project's data requirements. -We assume you have a working familiarity -with the research designs mentioned here. -If needed, you can reference \textcolor{red}{Appendix XYZ}, -where you will find more details -and specific references for common impact evaluation methods. + Planning data requirements is more than just listing key outcome variables. It requires understanding how to structure the project's data to best answer the research questions, @@ -199,10 +195,19 @@ \subsection{Creating a data plan} so that new versions can easily be created once you learn new things about your research throughout your project. -\subsection{Defining study comparisons using master data} +\subsection{Defining research variables related to your research design} + +After you have set up your data plan +you need to carefully think about your research design +and which research variables you will need +to make inferences about differences in measurements variables +in relation to your research design. +We assume you have a working familiarity +with the research designs mentioned here. 
+If needed, you can reference \textcolor{red}{Appendix XYZ}, +where you will find more details +and specific references for common impact evaluation methods. -Your research design will determine what statistical comparisons you need -to estimate in your analytical work. The research designs discussed here compare a group that received some kind of \textbf{treatment}\sidenote{ \textbf{Treatment:} The general word for the evaluated intervention or event. @@ -239,23 +244,25 @@ \subsection{Defining study comparisons using master data} \texttt{iebaltab}\sidenote{ \url{https://dimewiki.worldbank.org/iebaltab}}. Each research design has a different method for identifying the statistically-similar control group. -for how the statistically similar control group is identified. The rest of this section covers how data requirements differ between different research designs. -What does not differ, however, is that the authoritative source -for which units are in the treatment group and which are in the control group -should always be one or several variables in your master dataset. +What does not differ, however, +is that these data requirements are all research variables. +The source for this data requirements varies +between research design and between projects, +but the authoritative source for that type of data should +always be the master datasets. You will often have to merge that data to other datasets, -but that is an easy task if you created a data map. +but that is an easy task if you created a data linkage table. %%%%% Experimental design In \textbf{experimental research designs}, -such as \textbf{randomized control trials (RCTs)},\sidenote{ +such as\index{randomized control trials}\index{experimental research designs} +\textbf{randomized control trials (RCTs)},\sidenote{ \url{https://dimewiki.worldbank.org/Randomized\_Control\_Trials}} -\index{randomized control trials} \index{experimental research designs} the research team determines which members of the studied population will receive the treatment. This is typically done by a randomized process @@ -287,8 +294,8 @@ \subsection{Defining study comparisons using master data} to exploit events that occurred in the past. Therefore, these methods often use either secondary data, including administrative data or other classes of routinely-collected information, -and it is important that your data map documents -how that data is merged to any other data. +and it is important that your data linkage table documents +how this data is linked to the rest of the data in your project. %%%%% Regression discontinuity @@ -330,11 +337,11 @@ \subsection{Defining study comparisons using master data} You will need variables in your data that can be used to estimate the probability of treatment for each unit. These variables are called \textbf{instruments}. -In IV designs, instead of the ordinary regression estimator, -a special version called two-stage least squares (2SLS) is used -to estimate the impact of the treatment. -Stata has a built-in command called \texttt{ivregress}, -and another popular implementation is the user-written command \texttt{ivreg2}. +Instrument variables is a rare example of research variables that might vary over time, +as your probability of being treated might change. +In these cases your research design should +ex-ante clearly indicate what point of time this will be recorded, +and this should be clearly documented in your master dataset. 
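One hedged sketch of how such a time-stamped research variable might be added to the master dataset is shown below; the file name, variable names, and measurement date are hypothetical:

* Sketch: store a time-varying research variable (here, an instrument measured
* on a pre-specified date) in the master dataset
use "master_households.dta", clear
merge 1:1 hh_id using "instrument_2020-06-30.dta", assert(master match) nogenerate
rename distance_to_office instrument_jun2020   // the variable name documents the measurement date
save "master_households.dta", replace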
%%%%% Matching @@ -351,10 +358,10 @@ \subsection{Defining study comparisons using master data} or where only one group has more than one observation included (one-to-many). By now you can probably guess that the result of the matching needs to be saved in the master dataset. -This is best done by assigning an ID to each matching set, +This is best done by assigning a matching ID to each matched set, and create a variable in the master dataset with the ID for the set each unit belongs to. -The matching can even be done before the randomized assignment, +The matching can also be done before the randomized assignment, so that treatment can be randomized within each matching set. This is a type of experimental design. Furthermore, if no control observations were identified before the treatment, From 0b9af99d064a130aae0a909d3d37ffab364516cc Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 18 Aug 2020 17:29:34 -0400 Subject: [PATCH 07/41] [ch3] time periods --- chapters/3-measurement.tex | 28 ++++++++++++++++------------ 1 file changed, 16 insertions(+), 12 deletions(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index a4a1c296c..41571daff 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -376,32 +376,36 @@ \subsection{Defining research variables related to your research design} part of the \texttt{ietoolkit} package. %----------------------------------------------------------------------------------------------- -\subsection{Structuring complex data} +\subsection{Time periods and data plans} -Your data map and master dataset requirements also depends on +Your data plan should also take into consideration whether you are using data from one time period or several. A study that observes data in only one time period is called a \textbf{cross-sectional study}. \index{cross-sectional data} This type of data is relatively easy to collect and handle because -you do not need to track individuals across time. +you do not need to track individuals across time, +and therefore requires no additional information in your data plan. Instead, the challenge in a cross-sectional study is to show that the control group is indeed a valid counterfactual to the treatment group. Observations over multiple time periods, -referred to as \textbf{longitudinal data}, -\index{longitudinal data} -can consist of either \textbf{repeated cross-sections} -\index{repeated cross-sectional data} -or \textbf{panel data}. -\index{panel data} +referred to as \textbf{longitudinal data}\index{longitudinal data}, +can consist of either \textbf{repeated cross-sections}\index{repeated cross-sectional data} +or \textbf{panel data}\index{panel data}. In repeated cross-sections, each successive round of data collection uses a new random sample of observations from the treatment and control groups, but in a panel data study the same observations are tracked and included each round. -If you are using panel data, -then your data map must document how the different rounds will be merged or appended, -and your master dataset will be the authoritative source of which ID that will be used. +If each round of data collection is a separate activity, +they should be treated as a separate source of data +and get their own row in the data linkage table. +If the data is continuously collected, +or at frequent intervals, +then it can be treated as a single data source. 
+The data linkage table must document +how the different rounds will be merged or appended +when panel data collected in separate activities. You must keep track of the \textit{attrition rate} in panel data, which is the share of observations not observed in follow-up data. From 56b79f497951d8a31f4091e7ae0a79afe8261acb Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 18 Aug 2020 17:48:30 -0400 Subject: [PATCH 08/41] [ch3] monitoring data --- chapters/3-measurement.tex | 48 ++++++++++++++++---------------------- 1 file changed, 20 insertions(+), 28 deletions(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index 41571daff..658d80030 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -425,20 +425,26 @@ \subsection{Time periods and data plans} \subsection{Monitoring data} For any study with an ex-ante design, -\textbf{monitoring data}\sidenote{\url{ +\textbf{monitoring data}\index{monitoring data}\sidenote{\url{ https://dimewiki.worldbank.org/Monitoring\_Data}} is very important for understanding whether field realities match the research design. -\index{monitoring data} Monitoring data is used to understand if the assumptions made during the research design corresponds to what is true in reality. The most typical example is to make sure that, in an experimental design, the treatment was implemented according to your treatment assignment. +While it is always better to monitor all activities, +it might be to costly. +In those cases you can sample a smaller number of critical activities and monitor them. +This will not be detailed enough to be used as a control in your analysis, +but it will still give a way to +estimate the validity of your research design assumptions. Treatment implementation is often carried out by partners, and field realities may be more complex realities than foreseen during research design. Furthermore, the field staff of your partner organization, might not be aware that their actions are the implementation of your research. -Therefore, you must acquire monitoring data that tells you how well the treatment assignment in the field +Therefore, you must acquire monitoring data that +tells you how well the treatment assignment in the field corresponds to your intended treatment assignment, for nearly all research designs. @@ -460,31 +466,17 @@ \subsection{Monitoring data} in evaluating the program. Monitoring data is particularly prone to errors -relating to merging with other data set. -If you send a team to simply record the name of all people attending a training, -then potentially different spellings of names --- especially when names have to be transcribed from other alphabets to the Latin alphabet -- -will be the biggest source of error in your monitoring activity. -The time to discuss and document how monitoring data will be merged with precision -to the rest of your data, is when you are creating your data map. -The solution is often very simple, it is just a matter of solving it before it is too late. -An example of a solution is to provide the monitoring teams -with lists of the people's names spelled the same way as in your master dataset. -Include both the people you expect to attend the training, -and people that you do not expect to attend, -but do not tell the monitoring team which person you expect to attend the training. - -Additionally, instruct the monitoring teams to record the name of any -participant not in your lists. 
-Add those names to your master datasets, -as the most complete information possible will help you -if you at any point in your project end up without an ID to merge on, -and will have to compare names when merging data. -Finally, while it is always better to monitor all activities, -it might be to costly. -In those cases you can sample a smaller number of critical activities and monitor them. -This will not be detailed enough to be used as a control in your analysis, -but it will still give an idea of the validity of your research design assumptions. +relating to correctly linking this data with the rest of the data in your project. +Often monitoring activities is done by +sending a team to simply record the name of all people attending a training, +or by a partner organization share their administrative data. +In both those cases it can be difficult to make sure that +the same identifiers we have in our master datasets are using to record who is who. +Planning ahead for this, +already when the monitoring activity is added to the data linkage table, +is the best protection you have from ending up with poor correlation +between treatment uptake and your expectations that +is mainly a result of a fuzzy link between your monitor data and the rest of your data. %----------------------------------------------------------------------------------------------- From fb9f870f64e361fb4510c14606c51caefe15f675 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 18 Aug 2020 17:59:48 -0400 Subject: [PATCH 09/41] [ch3] linking dataplan and randomization section --- chapters/3-measurement.tex | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index 658d80030..7e55348be 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -481,16 +481,17 @@ \subsection{Monitoring data} %----------------------------------------------------------------------------------------------- %----------------------------------------------------------------------------------------------- -\section{Implementing random sampling and treatment assignments} +\section{Randomized creation of research variables} -Random sampling and treatment assignment are two core elements of research design. +Random sampling and treatment assignment are two research variables +often created by the research team that are core elements of research design. In experimental methods, random sampling and treatment assignment directly determine the set of individuals who are going to be observed and what their status will be for the purpose of effect estimation. In quasi-experimental methods, random sampling determines what populations the study will be able to make meaningful inferences about, and random treatment assignment creates counterfactuals. -Randomization\sidenote{ +\textbf{Randomization}\sidenote{ \textbf{Randomization} is often used interchangeably to mean random treatment assignment. In this book however, \textit{randomization} will only be used to describe the process of generating a sequence of unrelated numbers, i.e. a random process. @@ -540,11 +541,15 @@ \subsection{Randomizing sampling and treatment assignment} as a treatment observation or used as a counterfactual. The list of units to sample or assign from may be called a \textbf{sampling universe}, a \textbf{listing frame}, or something similar. -This list should always be your \textbf{master dataset} when possible. 
-The rare exceptions when master datasets cannot be used is when sampling must be done in real time -- +This list should always be your \textbf{master dataset} when possible, +and the result should always be saved in the master dataset before merged to any other data. +The rare exceptions when master datasets cannot be used is +when sampling must be done in real time -- for example, randomly sampling patients as they arrive at a health facility. In those cases, it is important that you collect enough data during the real time sampling, -such that you can create a master dataset over these individuals afterwards. +such that you can add these individuals, +and the result of the sampling, +to your master dataset afterwards. % implement uniform-probability random sampling The simplest form of sampling is From 5245326ed56f08c99db722aba149d8774be765fb Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 18 Aug 2020 18:12:34 -0400 Subject: [PATCH 10/41] [ch3] update intro --- chapters/3-measurement.tex | 39 +++++++++++++++++++------------------- 1 file changed, 20 insertions(+), 19 deletions(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index 7e55348be..c9dd780f3 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -3,28 +3,29 @@ \begin{fullwidth} In this chapter we will show how you can save a lot of time -and increase the quality of your research by planning your project's data requirements -in advance, based on your research design. -There are many published resources about -the theories behind different research designs. -This chapter will instead focus on how the design -impacts a project's data requirements. - - +and increase the quality of your research by +planning your project's data requirements in advance. Planning data requirements is more than just listing key outcome variables. -It requires understanding how to structure the project's data to best answer the research questions, +It requires understanding how to structure the project's data +to best answer the research questions, and creating the tools to share this understanding across your team. + The first section of this chapter discusses how to determine the data needs of the project, -based on the research design and measurement framework, -and how to document these through a data map and master dataset(s). -The second section of this chapter covers random sampling and assignment -and the necessary practices to ensure that -these and other random processes are reproducible. -Almost all research designs rely on a random component -for the results of the research to be a valid interpretation of the real world. -This includes both how a sample is representative to the population studied, -and how the counterfactual observations in experimental design are statistically indistinguishable -from the treatment observations. +and introduces DIME's data plan template that is composed of +one data linkage table, +one or several master datasets and +one or several data flow charts. +This section also discusses what specific research data you need +based on your projects research design, +and how those data needs are documented in the data plan. + +The second section of this chapter covers two types of research data +that instead of being observed in the real world are created by the research team. +Those two types of research data are random sampling and random assignment. 
+Special focus is spent on how to ensure that +these and other random processes are reproducible, +which is critical for the credibility of your research. + The chapter concludes with a discussion of power calculations and randomization inference, and how both are important tools to make optimal choices when planning data work. From f048fea5b82ccaae431c23d9e0ab87a19aa1b159 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 18 Aug 2020 18:20:16 -0400 Subject: [PATCH 11/41] [ch3] research/measurement data def update --- chapters/3-measurement.tex | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index c9dd780f3..5e35f465d 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -75,7 +75,6 @@ \section{Translating research design to a data plan} but if two or more analysis datasets are created very similarly, they can be included in the same chart. - \subsection{Creating a data plan} The components of the data plan is the best tool a research team has for @@ -117,14 +116,18 @@ \subsection{Creating a data plan} sampling or treatment assignment. The master dataset should include and be the authoritative source of all \textbf{research variables}\sidenote{ - \textbf{Research variables:} Research related meta data that identifies observations - and maps research design information those observation. + \textbf{Research variables:} Research data that identifies observations + and maps research design information to those observations. Research variables are time-invariant and often, but not always, controlled by the research team. Examples include ID variables, sampling status, treatment status, treatment uptake.} but not include any \textbf{measurement variables}\sidenote{ - \textbf{Measurement variables:} Data that corresponds to observations of the real world. Research variables are not controlled by the research team and often vary over time. + \textbf{Measurement variables:} Data that + corresponds to direct observations of the real world, + recorded sentiments of the subjects of the research + or any other aspect we are trying to measure. + Research variables are not controlled by the research team and often vary over time. Examples include characteristics of the research subject, outcome variables, input variables among many others.}. Research variables and measurement variables often come from the same source, but should not be stored in the same way. From beb39a61e8d432e7bbca2c1e8db3ed321ff8c9b5 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 18 Aug 2020 20:38:08 -0400 Subject: [PATCH 12/41] [ch3] proofread - intro --- chapters/3-measurement.tex | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index 5e35f465d..38b7c666c 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -5,23 +5,28 @@ In this chapter we will show how you can save a lot of time and increase the quality of your research by planning your project's data requirements in advance. -Planning data requirements is more than just listing key outcome variables. +Planning data requirements should be more than +just a listing of the key outcome variables. It requires understanding how to structure the project's data to best answer the research questions, and creating the tools to share this understanding across your team. 
-The first section of this chapter discusses how to determine the data needs of the project, +The first section of this chapter discusses how to +determine the data needs of the project, and introduces DIME's data plan template that is composed of one data linkage table, one or several master datasets and one or several data flow charts. +These three tools should be the way data requirements are communicated +both across the team and across time. This section also discusses what specific research data you need based on your projects research design, and how those data needs are documented in the data plan. -The second section of this chapter covers two types of research data -that instead of being observed in the real world are created by the research team. -Those two types of research data are random sampling and random assignment. +The second section of this chapter covers two activities where +research data is created by the research team +instead of being observed in the real world. +Those two research activities are random sampling and random assignment. Special focus is spent on how to ensure that these and other random processes are reproducible, which is critical for the credibility of your research. From 8aeb59cedd05e8102d596d7fd6697eedc68f89a2 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 18 Aug 2020 21:12:28 -0400 Subject: [PATCH 13/41] [ch3] proofread - create a dataplan --- chapters/3-measurement.tex | 133 ++++++++++++++++++++----------------- 1 file changed, 72 insertions(+), 61 deletions(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index 38b7c666c..ab0d11898 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -46,7 +46,6 @@ \section{Translating research design to a data plan} data acquired from different partners (e.g. administrative data, web scraping, implementation monitoring, etc) or complex combinations of these. - However your study is structured, you need to know how to link data from all sources and analyze the relationships between the units that appear in them to answer all your research questions. @@ -64,58 +63,61 @@ \section{Translating research design to a data plan} A \textbf{data linkage table}\sidenote{ \url{https://dimewiki.worldbank.org/Data\_Linkage\_Table}} lists all the datasets that will be used in the project. -Its most important function is to indicate how all those datasets can be be linked when +Its most important function is to indicate +how all those datasets can be be linked when combining information from multiple data sources. \textbf{Master datasets}\sidenote{ \url{https://dimewiki.worldbank.org/Master\_Data\_Set}} -list all observations your project ever encounter. -The master dataset is the authoritative source for all research meta information +list all observations your project ever encounter +are the authoritative source for all research data such as ID values, sampling and treatment status, etc. \textbf{Data flow charts}\sidenote{ \url{https://dimewiki.worldbank.org/Data\_Flow\_Chart}} list all datasets that are needed to create each analysis dataset, and how they should be created by merging, appending or in any other way link different datasets. -You will, in general, need one data flow chart per analysis dataset -but if two or more analysis datasets are created very similarly, -they can be included in the same chart. 
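To make the combination steps that a data flow chart documents concrete, one such step might translate into Stata roughly as in the sketch below, where the file names, ID variable, and expected observation count are all hypothetical:

* Sketch of one data flow chart step: building a household-level analysis dataset
use "master_households.dta", clear                       // start from the master dataset
merge 1:1 hh_id using "baseline_survey.dta", generate(_m_baseline)
merge 1:1 hh_id using "endline_survey.dta", generate(_m_endline)
* the flow chart records the observation counts expected before and after each
* operation; asserting them catches unwanted duplicates or dropped observations early
assert _N == 1200                                        // hypothetical expected number of households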
\subsection{Creating a data plan}

-The components of the data plan is the best tool a research team has for
+The components of the data plan are the best tools a research team has for
the lead researchers to communicate their vision for the data work,
and for the research assistants to communicate their understanding of that vision.
While the time to create the data plan is before any data has been acquired,
it should be a dynamic document,
one that you keep up to date as the project evolves.

-To create a \textbf{data linkage table} you simply start by listing
+To create a data plan according to DIME's template,
+first create a \textbf{data linkage table} by listing
all the datasets you know you will use in a spreadsheet.
If one source of data will result in two different datasets,
-then list each dataset on a new row.
-Then for each dataset list the unit of observation,
-and the name of the ID variable for that unit of observation.
-Your project should only have one main ID variable per unit of observation,
-so make sure that all datasets with the same unit of observation
-have the same ID variable.
+then list each dataset on its own row.
+Then for each dataset list the \textbf{unit of observation}\sidenote{
+  \url{https://dimewiki.worldbank.org/Unit\_of\_Observation}},
+and the name of the main ID variable for that unit of observation.
+Your project should only have one main ID variable per unit of observation,
+called the project ID.
+When you list a dataset in the data linkage table --
+which should be done before that dataset is acquired --
+you should always make sure that the dataset will
+be fully identified by the project ID,
+or make a plan for how
+the new dataset will be linked to the project ID.
+It is very labor intensive to work with a dataset that
+does not have an unambiguous way to link to the project ID,
+and it is a big source of error.
+
+The data linkage table should indicate whether
+datasets should be merged one-to-one, for example,
+merging baseline data and endline data that use the same unit of observation,
+or whether two datasets should be merged many-to-one,
+for example, school administrative data merged with student data.
+It must also indicate which ID variables
+can be used, and how, when merging datasets.
The data linkage table is also a great place to list other types of metadata,
-such as the source of your data, and backup locations etc.
+such as the source of your data, backup locations, etc.

-You should have one \textbf{master dataset} for each \textbf{unit of observation}\sidenote{
-  \url{https://dimewiki.worldbank.org/Unit\_of\_Observation}}
+Second, create one \textbf{master dataset}
+for each unit of observation
@@ -131,10 +133,13 @@ \subsection{Creating a data plan} \textbf{Measurement variables:} Data that corresponds to direct observations of the real world, recorded sentiments of the subjects of the research - or any other aspect we are trying to measure. - Research variables are not controlled by the research team and often vary over time. - Examples include characteristics of the research subject, outcome variables, input variables among many others.}. -Research variables and measurement variables often come from the same source, + or any other aspect your project is studying. + Research variables are not controlled by the research team + and often vary over time. + Examples include characteristics of the research subject, + outcome variables, input variables among many others.}. +Research variables and measurement variables +often come from the same source, but should not be stored in the same way. For example, if you are shared administrative data that both includes information on eligibility to be included in the study (research variable) @@ -144,34 +149,39 @@ \subsection{Creating a data plan} and instead store them in your master dataset. It is common that you will have to update your master datasets throughout your project. -Examples of research variables -that you cannot know when you initially set up your master dataset -are treatment uptake and attrition variables. The most important function of the master dataset -is to be the authoritative source of identifiers, -and all observations listed should have a unique project ID\sidenote{ - \url{https://dimewiki.worldbank.org/ID\_Variable\_Properties}}. -You should also list all other identifiers your project interact with, -such as names, addresses and other ID values used by your partner organization, -and serve as the linkage between those identifiers and the project ID. -Because of this, there are very few cases where your master datasets -does not need to be encrypted. +is to be the authoritative source of for the project ID. +This means that all observations listed +should have a uniquely and fully identified by the project ID.\sidenote{ + \url{https://dimewiki.worldbank.org/ID\_Variable\_Properties}} +You should also list all other identifiers used in your project, +such as names, addresses and other IDs used by your partner organizations, +and the master datasets serves as +the linkage between those identifiers and the project ID. +Because of this, master datasets must, +with very few exceptions, always be encrypted. Even when a partner organization have a unique identifier, you should always create a project ID specific to your project only, -as you are otherwise not in control over who can re-identify your de-identified dataset. -You should include all observations ever encountered, -even if they are not eligible for your study, -because if you ever end up needed to do a fuzzy match, +as you are otherwise not in control over +who can re-identify your de-identified dataset. + +You should include all observations ever encountered +in your master datasets, +even if they are not eligible for your study. +Because if you ever need to do a fuzzy match, +on string variables like proper names, you will do fewer errors the more information you have. -If you acquire any data without your project ID, -you should always start by understanding how that data - can be linked to the master dataset, -and then merge the project IDs to the new data, -before you do anything with that dataset. 
-With the master datasets as an up to date authoritative source -of the project IDs and all research variables, -you have an unambiguous method of mapping +If you ever need to do a fuzzy match, +you should always do that between the master dataset +and the dataset without an unambiguous identifier. +You should not do anything with that dataset until +you have successfully merged +the project IDs from the master dataset. + +With the master datasets as an authoritative source +of the project ID and all research variables, +it serves as an unambiguous method of mapping the observations in your study to your research design. The third and final component of the data plan is the \textbf{data flow charts}. @@ -189,11 +199,12 @@ \subsection{Creating a data plan} should be used in these operations, and if the operation results in a new variable or set of variable that identifies the data. -Once you have the data used in the flow chart, -you can also start listing the number of observations the datasets -should have before and after each observation. -This is a great tool to track attrition and to make sure that -the operations used to combined dataset did not create unwanted duplicates +Once you have acquired the datasets listed in the flow chart, +you can also start listing the number of observations that both +the starting point dataset should start with and +the combined datasets should have after each operation. +This is a great method to track attrition and to make sure that +the operations used to combine datasets did not create unwanted duplicates or incorrectly dropped any observations. A data flow chart can be created in a flow chart drawing tool From 3bf93f7a3df30e03af512a9bc724a31cafed61d4 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 18 Aug 2020 21:47:24 -0400 Subject: [PATCH 14/41] [ch3] proofread - reserach vars in research design --- chapters/3-measurement.tex | 70 ++++++++++++++++++++++++-------------- 1 file changed, 45 insertions(+), 25 deletions(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index ab0d11898..c13c6d39b 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -219,9 +219,9 @@ \subsection{Defining research variables related to your research design} After you have set up your data plan you need to carefully think about your research design -and which research variables you will need -to make inferences about differences in measurements variables -in relation to your research design. +and which research variables you will need in the data analysis +to infer the relation between differences in measurements variables +and your research design. We assume you have a working familiarity with the research designs mentioned here. If needed, you can reference \textcolor{red}{Appendix XYZ}, @@ -229,16 +229,20 @@ \subsection{Defining research variables related to your research design} and specific references for common impact evaluation methods. The research designs discussed here compare a group that received -some kind of \textbf{treatment}\sidenote{ +some kind of \textbf{treatment}\index{Treatment}\sidenote{ \textbf{Treatment:} The general word for the evaluated intervention or event. 
- This includes being offered training or cash transfer from a program, experiencing a natural disaster etc.} -against a counterfactual control group.\sidenote{ - \textbf{Counterfactual:} A statistical description of what would have happened + This includes being offered training + or cash transfer from a program, + experiencing a natural disaster etc.} +against a counterfactual control group\index{Counterfactual}.\sidenote{ + \textbf{Counterfactual:} A statistical description of + what would have happened to specific individuals in an alternative scenario, for example, a different treatment assignment outcome.} -\index{counterfactual} + The key assumption is that each -person, facility, or village (or whatever the unit of intervention is) +person, facility, or village +(or whatever the unit of treatment is) had two possible states: their outcome if they did receive the treatment and their outcome if they did not receive that treatment. The average impact of the treatment, or the ATE\sidenote{ @@ -255,7 +259,8 @@ \subsection{Defining research variables related to your research design} Instead, the treatment group is compared to a control group that is statistically indistinguishable, which makes the average impact of the treatment -mathematically equivalent to the difference in averages between the groups. +mathematically equivalent to +the difference in averages between the groups. Statistical similarity is often defined as \textbf{balance} between two or more groups. Since balance tests are commonly run for impact evaluations, @@ -263,18 +268,21 @@ \subsection{Defining research variables related to your research design} standardize and automate the creation of nicely-formatted balance tables: \texttt{iebaltab}\sidenote{ \url{https://dimewiki.worldbank.org/iebaltab}}. -Each research design has a different method for identifying the statistically-similar control group. -The rest of this section covers how data requirements differ -between different research designs. +Each research design has a different method for +identifying the statistically-similar control group. +The rest of this section covers how research data requirements +differ between those different methods. What does not differ, however, is that these data requirements are all research variables. -The source for this data requirements varies -between research design and between projects, +The source for the required research variables varies +between research designs and between projects, but the authoritative source for that type of data should -always be the master datasets. -You will often have to merge that data to other datasets, -but that is an easy task if you created a data linkage table. +always be a master dataset. +You will often have to merge +the research variables to other datasets, +but that is an easy task +if you created a data linkage table. %%%%% Experimental design @@ -294,8 +302,10 @@ \subsection{Defining research variables related to your research design} then the two groups will, on average, be statistically indistinguishable. The treatment will therefore not be correlated with anything but the impact of that treatment.\cite{duflo2007using} -The randomized assignment should be done using the master data, -and the result should be saved there before being merged to other datasets. +The randomized assignment should be done +using data from the master dataset, +and the result should be saved back to the master dataset, +before being merged to other datasets. 
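A minimal sketch of what that could look like in Stata follows; the seed value, file name, and variable names are hypothetical, and the balance check is only indicated as one option:

* Sketch: reproducible treatment assignment generated from the master dataset
use "master_households.dta", clear
isid hh_id                               // the project ID must uniquely identify the master dataset
version 16                               // fix the random-number generator version
set seed 287143                          // seed drawn once and documented (hypothetical value)
sort hh_id                               // unique, replicable sort order before the random draw
generate double rand = runiform()
sort rand
generate byte treatment = (_n <= _N/2)   // assign half of the units to treatment
* balance on pre-treatment characteristics can then be checked, e.g. with iebaltab
save "master_households.dta", replace    // the master dataset remains the authoritative record

Fixing the version, the seed, and the sort order is what makes the draw reproducible, and saving the result back to the master dataset keeps it the single authoritative record of treatment status.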
%%%%% Quasi-experimental design @@ -315,7 +325,7 @@ \subsection{Defining research variables related to your research design} Therefore, these methods often use either secondary data, including administrative data or other classes of routinely-collected information, and it is important that your data linkage table documents -how this data is linked to the rest of the data in your project. +how this data can be linked to the rest of the data in your project. %%%%% Regression discontinuity @@ -341,6 +351,16 @@ \subsection{Defining research variables related to your research design} that is used to divide the sample into two or more groups. Both the running variable and a categorical cutoff variable, should be saved in your master dataset. +The running variable is one of the exceptions where +a research variable in the master dataset +may vary over time, +and an observation may be on different sides of the cutoff, +depending on when you make the cutoff. +In these cases your research design should +ex-ante clearly indicate what point in time +the running variable will be recorded, +and this should be clearly documented in your master dataset. + %%%%% IV regression @@ -357,13 +377,13 @@ \subsection{Defining research variables related to your research design} You will need variables in your data that can be used to estimate the probability of treatment for each unit. These variables are called \textbf{instruments}. -Instrument variables is a rare example of research variables that might vary over time, +Instrument variables are another example of research variables +that might vary over time, as your probability of being treated might change. -In these cases your research design should +Again, in these cases your research design should ex-ante clearly indicate what point of time this will be recorded, and this should be clearly documented in your master dataset. - %%%%% Matching \textbf{Matching}\sidenote{ @@ -383,7 +403,7 @@ \subsection{Defining research variables related to your research design} with the ID for the set each unit belongs to. The matching can also be done before the randomized assignment, so that treatment can be randomized within each matching set. -This is a type of experimental design. +This would then be a type of experimental design. Furthermore, if no control observations were identified before the treatment, then matching can be used to ex-post identify a control group. Many matching algorithms can only match on a single variable, From f104481b1862585e8f7b62a3f305ce85bd098643 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Tue, 18 Aug 2020 22:02:38 -0400 Subject: [PATCH 15/41] [ch3] time and monitoring --- chapters/3-measurement.tex | 102 ++++++++++++++++++++++--------------- 1 file changed, 62 insertions(+), 40 deletions(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index c13c6d39b..6d4b1f952 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -416,7 +416,7 @@ \subsection{Defining research variables related to your research design} part of the \texttt{ietoolkit} package. %----------------------------------------------------------------------------------------------- -\subsection{Time periods and data plans} +\subsection{Time periods in data plans} Your data plan should also take into consideration whether you are using data from one time period or several. 
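The matching workflow described above can be sketched in a few lines. The variable names are hypothetical, and the full option list is documented in the \texttt{ietoolkit} help files.

    * Match untreated to treated units on a single baseline variable
    use "${master}/master_households.dta", clear
    iematch, grpdummy(treatment) matchvar(baseline_index)

    * The matched-set identifier this creates should be saved back to the
    * master dataset alongside the other research variables.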
@@ -431,21 +431,23 @@ \subsection{Time periods and data plans} Observations over multiple time periods, referred to as \textbf{longitudinal data}\index{longitudinal data}, -can consist of either \textbf{repeated cross-sections}\index{repeated cross-sectional data} +can consist of either +\textbf{repeated cross-sections}\index{repeated cross-sectional data} or \textbf{panel data}\index{panel data}. In repeated cross-sections, each successive round of data collection uses a new random sample of observations from the treatment and control groups, -but in a panel data study the same observations are tracked and included each round. +but in a panel data study +the same observations are tracked and included each round. If each round of data collection is a separate activity, -they should be treated as a separate source of data +then they should be treated as a separate source of data and get their own row in the data linkage table. If the data is continuously collected, or at frequent intervals, then it can be treated as a single data source. The data linkage table must document how the different rounds will be merged or appended -when panel data collected in separate activities. +when panel data is collected in separate activities. You must keep track of the \textit{attrition rate} in panel data, which is the share of observations not observed in follow-up data. @@ -454,11 +456,15 @@ \subsection{Time periods and data plans} For example, poorer households may live in more informal dwellings, patients with worse health conditions might not survive to follow-up, and so on. -If this is the case, then your results might only be an effect of your remaining sample -being a subset of the original sample that were better or worse off from the beginning. -You should have a variable in your master dataset that indicates attrition. -A balance check using the attrition variable can provide insights -as to whether the lost observations were systematically different +If this is the case, +then your results might only be an effect of your remaining sample +being a subset of the original sample +that were better or worse off from the beginning. +You should have a variable in your master dataset + that indicates attrition. +A balance check using the attrition variable +can provide insights as to whether the lost observations +were systematically different compared to the rest of the sample. %----------------------------------------------------------------------------------------------- @@ -467,10 +473,11 @@ \subsection{Monitoring data} For any study with an ex-ante design, \textbf{monitoring data}\index{monitoring data}\sidenote{\url{ https://dimewiki.worldbank.org/Monitoring\_Data}} -is very important for understanding whether field realities match the research design. -Monitoring data is used to understand if the -assumptions made during the research design corresponds to what is true in reality. -The most typical example is to make sure that, in an experimental design, +is very important for understanding if the +assumptions made during the research design +corresponds to what is true in reality. +The most typical example is to make sure that, +in an experimental design, the treatment was implemented according to your treatment assignment. While it is always better to monitor all activities, it might be to costly. 
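Returning briefly to attrition, a minimal sketch of how the attrition indicator could be constructed in the master dataset is shown below; the file and variable names are hypothetical. Note that only the ID variable is brought in from the follow-up data, so that no measurement variables end up in the master dataset.

    * Flag observations from the original sample not found at endline
    use "${data}/endline_clean.dta", clear
    keep hh_id
    tempfile endline_ids
    save    `endline_ids'

    use "${master}/master_households.dta", clear
    merge 1:1 hh_id using `endline_ids', gen(endline_merge)
    assert endline_merge != 2                 // endline should never contain unknown IDs
    gen    attrited = (endline_merge == 1)    // in the master data, missing at endline
    drop   endline_merge

A balance check on \texttt{attrited}, for example with \texttt{iebaltab}, then shows whether the lost observations differ systematically from the rest of the sample.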
@@ -486,56 +493,65 @@ \subsection{Monitoring data}
 Therefore, you must acquire monitoring data that tells you
 how well the treatment assignment in the field
 corresponds to your intended treatment assignment,
-for nearly all research designs.
+for nearly all experimental research designs.
 
-Another example of a research design where monitoring data is important
+An example of a non-experimental research design
+for which monitoring data also is important
 are regression discontinuity (RD) designs
-where the discontinuity is a cutoff for eligibility of the treatment.
+where the discontinuity is
+a cutoff for eligibility of the treatment.
 For example,
-let's say your project studies the impact of a program for students that scored under 50\% at a test.
+let's say your project studies the impact of a program
+for students that scored under 50\% at a test.
 We might have the exact results of the tests for all students,
 and therefore know who should be offered the program,
 however that is not the same as knowing who attended the program.
 A teacher might offer the program to someone that scored 51\% at the test,
 and someone that scored 49\% at the test might decline to participate in the program.
-We should not pass judgment on a teacher that offers a program to a student
-they think can benefit from it,
-but if that was not inline with our research assumptions,
-then we need to understand how common that was.
+We need to understand how common this was,
+and if one case was more common than the other.
 Otherwise the result of our research will not be helpful in evaluating the program.
 
 Monitoring data is particularly prone to errors
-relating to correctly linking this data with the rest of the data in your project.
+when linking it with the rest of the data in your project.
 Often monitoring activities is done by sending a team to
 simply record the name of all people attending a training,
 or by a partner organization share their administrative data.
 In both those cases it can be difficult to make sure that
-the same identifiers we have in our master datasets are using to record who is who.
+the project ID or any other unambiguous identifiers in our master datasets
+is used to record who is who.
 Planning ahead for this,
 already when the monitoring activity is added to the data linkage table,
-is the best protection you have from ending up with poor correlation
-between treatment uptake and your expectations that
-is mainly a result of a fuzzy link between your monitor data and the rest of your data.
+is the best protection from ending up with poor correlation
+between treatment uptake and treatment assignment,
+without a way to tell if the poor correlation is just
+a result of a fuzzy link between your monitor data and the rest of your data.
 
 %-----------------------------------------------------------------------------------------------
 %-----------------------------------------------------------------------------------------------
-\section{Randomized creation of research variables}
+\section{Research variables created by randomization}
 
-Random sampling and treatment assignment are two research variables
-often created by the research team that are core elements of research design.
-In experimental methods, random sampling and treatment assignment directly determine
+Random sampling and treatment assignment are two research activities
+at the core of research design
+that generate research variables.
+In experimental methods, +random sampling and treatment assignment directly determine the set of individuals who are going to be observed and what their status will be for the purpose of effect estimation. -In quasi-experimental methods, random sampling determines what populations the study +In quasi-experimental methods, +random sampling determines what populations the study will be able to make meaningful inferences about, and random treatment assignment creates counterfactuals. \textbf{Randomization}\sidenote{ - \textbf{Randomization} is often used interchangeably to mean random treatment assignment. - In this book however, \textit{randomization} will only be used to describe the process of generating + \textbf{Randomization} is often used interchangeably + to mean random treatment assignment. + In this book however, \textit{randomization} will only + be used to describe the process of generating a sequence of unrelated numbers, i.e. a random process. - \textit{Randomization} will never be used to mean the process of assigning units in treatment and control groups, + \textit{Randomization} will never be used to mean + the process of assigning units in treatment and control groups, that will always be called \textit{random treatment assignment}, or a derivative thereof.} is used to ensure that a sample is representative and @@ -557,7 +573,7 @@ \section{Randomized creation of research variables} \textit{Power calculation} and \textit{randomization inference} are the main methods by which these probabilities of error are assessed. These analyses are particularly important in the initial phases of development research -- -typically conducted before any actual field work occurs -- +typically conducted before any data acquisition or field work occurs -- and have implications for feasibility, planning, and budgeting. %----------------------------------------------------------------------------------------------- @@ -579,14 +595,20 @@ \subsection{Randomizing sampling and treatment assignment} will be observed at all in the course of data collection, randomized assignment determines if each individual will be observed as a treatment observation or used as a counterfactual. + The list of units to sample or assign from may be called a \textbf{sampling universe}, a \textbf{listing frame}, or something similar. This list should always be your \textbf{master dataset} when possible, -and the result should always be saved in the master dataset before merged to any other data. -The rare exceptions when master datasets cannot be used is +and the result should always be saved in the master dataset +before merged to any other data. +One example of the rare exceptions +when master datasets cannot be used is when sampling must be done in real time -- -for example, randomly sampling patients as they arrive at a health facility. -In those cases, it is important that you collect enough data during the real time sampling, +for example, randomly sampling patients +as they arrive at a health facility. +In those cases, +it is important that you collect enough data +during the real time sampling, such that you can add these individuals, and the result of the sampling, to your master dataset afterwards. 
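To make the reproducibility requirement concrete, the following is a minimal sketch of a random assignment drawn from a master dataset. The seed, file names, and the simple 50/50 rule are hypothetical; the point is the combination of a fixed version, a documented seed, and a stable sort order.

    * Reproducible 50/50 assignment from the master dataset (hypothetical names)
    ieboilstart, version(13.1)       // harmonize Stata version and settings (ietoolkit)
    `r(version)'

    use "${master}/master_villages.dta", clear
    isid village_id, sort            // IDs must be unique; sorting gives a stable order
    set seed 745632                  // seed drawn once and documented in the code

    gen  rand      = runiform()
    sort rand, stable
    gen  treatment = (_n <= _N / 2)

    * Save, then merge the result back into the master dataset
    keep village_id treatment
    save "${master}/treatment_assignment.dta", replace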
From 22fe1ae382f8f4aa0be7cf4e660fe240a882b9a5 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kristoffer=20Bj=C3=A4rkefur?= Date: Wed, 19 Aug 2020 10:47:08 -0400 Subject: [PATCH 16/41] [ch3] apply ben's suggestions Co-authored-by: Benjamin Daniels --- chapters/3-measurement.tex | 103 ++++++++++++++++++------------------- 1 file changed, 51 insertions(+), 52 deletions(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index 6d4b1f952..03a513a0b 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -20,13 +20,13 @@ These three tools should be the way data requirements are communicated both across the team and across time. This section also discusses what specific research data you need -based on your projects research design, +based on your project's research design, and how those data needs are documented in the data plan. The second section of this chapter covers two activities where research data is created by the research team instead of being observed in the real world. -Those two research activities are random sampling and random assignment. +Those two activities are random sampling and random assignment. Special focus is spent on how to ensure that these and other random processes are reproducible, which is critical for the credibility of your research. @@ -43,8 +43,8 @@ \section{Translating research design to a data plan} In most projects, more than one data source is needed to answer the research question. These could be multiple survey rounds, -data acquired from different partners (e.g. administrative data, -web scraping, implementation monitoring, etc) +data acquired from different partners (such as administrative data, +web scraping, implementation monitoring, and so on) or complex combinations of these. However your study is structured, you need to know how to link data from all sources and analyze the relationships between the units that appear in them @@ -69,42 +69,41 @@ \section{Translating research design to a data plan} \textbf{Master datasets}\sidenote{ \url{https://dimewiki.worldbank.org/Master\_Data\_Set}} list all observations your project ever encounter -are the authoritative source for all research data -such as ID values, sampling and treatment status, etc. +and are the authoritative source for all research data +such as ID values, sampling and treatment statuses, and so on. \textbf{Data flow charts}\sidenote{ \url{https://dimewiki.worldbank.org/Data\_Flow\_Chart}} list all datasets that are needed to create each analysis dataset, -and how they should be created by merging, appending -or in any other way link different datasets. +and how they should be created by merging, appending, +or in any other way linking different datasets. \subsection{Creating a data plan} -The components of the data plan is the best tools a research team has for +The data plan is the best tool a research team has for the lead researchers to communicate their vision for the data work, and for the research assistants to communicate their understanding of that vision. While the time to create the data plan is before any data has been acquired, -it should be a dynamic documentation, -that you should keep up to date as the project evolves. +it should be a piece of documentation that is +continuously updated as the project evolves. To create a data plan according to DIME's template, first, start by creating a \textbf{data linkage table} by listing -all the datasets you know that you will use in a spreadsheet. 
-If one source of data will results in two different dataset, -then list each dataset on a their own row. -Then for each dataset list the \textbf{unit of observation}\sidenote{ +all the datasets you know you will use in a spreadsheet. +If one source of data will result in two different datasets, +then list each dataset on its own row. +For each dataset, list the \textbf{unit of observation}\sidenote{ \url{https://dimewiki.worldbank.org/Unit\_of\_Observation}}, -and the name of the main ID variable for that unit of observation. -Your project should only have one main ID variable per unit of observation, -called the project ID. +and the name of the project ID variable for that unit of observation. +Your project should only have one project ID variable per unit of observation. When you list a dataset in the data linkage table -- which should be done before that dataset is acquired -- you should always make sure that the dataset will -be fully identified by the project ID, +be fully and uniquely identified by the project ID, or make a plan for how the new dataset will be linked to the project ID. -It is very labor expensive to work with a dataset that +It is very labor intensive to work with a dataset that do not have an unambiguous way to link to the project ID, -and it is a big source of error. +and it is a major source of error. The data linkage table should indicate whether datasets should be merge one-to-one, for example, @@ -112,15 +111,15 @@ \subsection{Creating a data plan} or whether two datasets should be merged many-to-one, for example, school administrative data merged with student data. Your data map must indicate which ID variables -can be used and how when marging datasets. -The data linkage table is also a great place to list other type of meta data, -such as the source of your data, backup locations, etc. +can be used -- and how -- when merging datasets. +The data linkage table is also a great place to list other types of metadata, +such as the source of your data, its backup locations, and so on. Second, you should then create one \textbf{master dataset} for each unit of observation used in any significant research activity. Examples of such activities are data collection, data analysis, -sampling or treatment assignment. +sampling, and treatment assignment. The master dataset should include and be the authoritative source of all \textbf{research variables}\sidenote{ \textbf{Research variables:} Research data that identifies observations @@ -128,7 +127,7 @@ \subsection{Creating a data plan} Research variables are time-invariant and often, but not always, controlled by the research team. Examples include - ID variables, sampling status, treatment status, treatment uptake.} + ID variables, sampling status, treatment status, and treatment uptake.} but not include any \textbf{measurement variables}\sidenote{ \textbf{Measurement variables:} Data that corresponds to direct observations of the real world, @@ -151,17 +150,17 @@ \subsection{Creating a data plan} your master datasets throughout your project. The most important function of the master dataset -is to be the authoritative source of for the project ID. +is to be the authoritative source for the project ID. 
This means that all observations listed -should have a uniquely and fully identified by the project ID.\sidenote{ +should be uniquely and fully identified by the included project ID variable.\sidenote{ \url{https://dimewiki.worldbank.org/ID\_Variable\_Properties}} You should also list all other identifiers used in your project, -such as names, addresses and other IDs used by your partner organizations, -and the master datasets serves as +such as names, addresses, or other IDs used by partner organizations, +and the master datasets can then serve as the linkage between those identifiers and the project ID. Because of this, master datasets must, with very few exceptions, always be encrypted. -Even when a partner organization have a unique identifier, +Even when a partner organization has a unique identifier, you should always create a project ID specific to your project only, as you are otherwise not in control over who can re-identify your de-identified dataset. @@ -169,9 +168,9 @@ \subsection{Creating a data plan} You should include all observations ever encountered in your master datasets, even if they are not eligible for your study. -Because if you ever need to do a fuzzy match, +This is because, if you ever need to perform a record linkage such as a fuzzy match on string variables like proper names, -you will do fewer errors the more information you have. +you will make fewer errors the more information you have. If you ever need to do a fuzzy match, you should always do that between the master dataset and the dataset without an unambiguous identifier. @@ -185,42 +184,42 @@ \subsection{Creating a data plan} the observations in your study to your research design. The third and final component of the data plan is the \textbf{data flow charts}. -All analysis datasets +Each analysis dataset (see Chapter 6 for discussion on why you likely need multiple analysis datasets) should have a data flow chart that is a diagram where each starting point is either a master dataset or a dataset listed in the data linkage table. The data flow chart should include instructions on how all datasets should be combined to create the analysis dataset. -The operations used to combine the data could be +The operations used to combine the data could, for example, be appending, one-to-one merging, -many-to-one/one-to-many merging or any other method. -You should also list which ID variable or set of ID variables +many-to-one or one-to-many merging, collapsing, or a broad variety of others. +You must list which ID variable or set of ID variables should be used in these operations, -and if the operation results in a new variable or set of variable -that identifies the data. +and if the operation creates a new variable or combination of variables +that identifies the newly linked data. Once you have acquired the datasets listed in the flow chart, you can also start listing the number of observations that both the starting point dataset should start with and the combined datasets should have after each operation. This is a great method to track attrition and to make sure that the operations used to combine datasets did not create unwanted duplicates -or incorrectly dropped any observations. +or incorrectly drop any observations. A data flow chart can be created in a flow chart drawing tool (there are many free alternatives online), by using shapes in Microsoft PowerPoint, -or simply by drawing on a piece of paper and take a photo. +or simply by drawing on a piece of paper and taking a photo. 
However, we recommend that the charts are created in a digital tool so that new versions can easily be created once you learn new things about your research throughout your project. \subsection{Defining research variables related to your research design} -After you have set up your data plan +After you have set up your data plan, you need to carefully think about your research design and which research variables you will need in the data analysis -to infer the relation between differences in measurements variables +to infer the relation between differences in measurement variables and your research design. We assume you have a working familiarity with the research designs mentioned here. @@ -231,9 +230,9 @@ \subsection{Defining research variables related to your research design} The research designs discussed here compare a group that received some kind of \textbf{treatment}\index{Treatment}\sidenote{ \textbf{Treatment:} The general word for the evaluated intervention or event. - This includes being offered training - or cash transfer from a program, - experiencing a natural disaster etc.} + This includes things like being offered a training, + a cash transfer from a program, + or experiencing a natural disaster, among many others.} against a counterfactual control group\index{Counterfactual}.\sidenote{ \textbf{Counterfactual:} A statistical description of what would have happened @@ -269,7 +268,7 @@ \subsection{Defining research variables related to your research design} \texttt{iebaltab}\sidenote{ \url{https://dimewiki.worldbank.org/iebaltab}}. -Each research design has a different method for +Each research design has a different method for identifying the statistically-similar control group. The rest of this section covers how research data requirements differ between those different methods. @@ -440,7 +439,7 @@ \subsection{Time periods in data plans} but in a panel data study the same observations are tracked and included each round. If each round of data collection is a separate activity, -then they should be treated as a separate source of data +then they should be treated as separate sources of data and get their own row in the data linkage table. If the data is continuously collected, or at frequent intervals, @@ -517,16 +516,16 @@ \subsection{Monitoring data} when linking it with the rest of the data in your project. Often monitoring activities is done by sending a team to simply record the name of all people attending a training, -or by a partner organization share their administrative data. +or by a partner organization sharing their administrative data, +which is rarely maintained in the same format or structure as your research data. In both those cases it can be difficult to make sure that the project ID or any other unambiguous identifiers in our master datasets is used to record who is who. -Planning ahead for this, -already when the monitoring activity is added to the data linkage table, +Planning ahead for this when the monitoring activity is added to the data linkage table is the best protection from ending up with poor correlation between treatment uptake and treatment assignment, without a way to tell if the poor correlation is just -a result of a fuzzy link between your monitor data and the rest of your data. +a result of a fuzzy link between monitoring data and the rest of your data. 
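As an illustration of the check this planning makes possible, the sketch below compares uptake recorded in a hypothetical attendance dataset against the assignment stored in the master dataset, assuming the attendance records were collected using the project ID as planned in the data linkage table.

    * Compare recorded uptake with assigned treatment status (hypothetical names)
    use "${master}/master_households.dta", clear
    merge 1:1 hh_id using "${data}/training_attendance.dta", gen(attend_merge)
    assert attend_merge != 2             // every attendance record links to a known project ID
    gen    attended = (attend_merge == 3)
    tab    treatment attended, row       // uptake by assignment arm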
%----------------------------------------------------------------------------------------------- From 72f9c94deaaf61dd0e861a6f2eca10bb05cdb74a Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kristoffer=20Bj=C3=A4rkefur?= Date: Thu, 20 Aug 2020 13:22:51 -0400 Subject: [PATCH 17/41] [ch3] maria edits intro and data plan Co-authored-by: Maria --- chapters/3-measurement.tex | 107 ++++++++++++++++++++----------------- 1 file changed, 57 insertions(+), 50 deletions(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index 03a513a0b..5a0508f71 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -5,23 +5,24 @@ In this chapter we will show how you can save a lot of time and increase the quality of your research by planning your project's data requirements in advance. -Planning data requirements should be more than -just a listing of the key outcome variables. -It requires understanding how to structure the project's data +Planning data requirements requires more than +simply listing the key outcome variables. +You need to understand how to structure the project's data to best answer the research questions, and creating the tools to share this understanding across your team. The first section of this chapter discusses how to -determine the data needs of the project, -and introduces DIME's data plan template that is composed of +determine your project's data needs, +and introduces DIME's data plan template. +The template includes: one data linkage table, -one or several master datasets and +one or several master datasets, and one or several data flow charts. -These three tools should be the way data requirements are communicated +These three tools will help to communicate the project's data requirements both across the team and across time. This section also discusses what specific research data you need based on your project's research design, -and how those data needs are documented in the data plan. +and how to document those data needs in the data plan. The second section of this chapter covers two activities where research data is created by the research team @@ -42,10 +43,10 @@ \section{Translating research design to a data plan} In most projects, more than one data source is needed to answer the research question. -These could be multiple survey rounds, -data acquired from different partners (such as administrative data, -web scraping, implementation monitoring, and so on) -or complex combinations of these. +These could be data from multiple survey rounds, +data acquired from different partners (such as administrative data, implementation data, sensor data), +web scraping, +or complex combinations of these and other sources. However your study is structured, you need to know how to link data from all sources and analyze the relationships between the units that appear in them to answer all your research questions. 
@@ -55,7 +56,7 @@ \section{Translating research design to a data plan} The only way to make sure that the full team shares the same understanding is to create a \textbf{data plan}\index{Data plan}.\sidenote{ \url{https://dimewiki.worldbank.org/Data\_Plan}} -DIME's data plan template has three components; +DIME's data plan template has three components: one \textit{data linkage table},\index{Data linkage table} one or several \textit{master datasets}\index{Master datasets} and one or several \textit{data flow charts}.\index{Data flowchart} @@ -70,25 +71,28 @@ \section{Translating research design to a data plan} \url{https://dimewiki.worldbank.org/Master\_Data\_Set}} list all observations your project ever encounter and are the authoritative source for all research data -such as ID values, sampling and treatment statuses, and so on. +such as unique identifiers, sample status and treatment assignment +(the following two sections of this chapter discuss how to generate these variables). \textbf{Data flow charts}\sidenote{ \url{https://dimewiki.worldbank.org/Data\_Flow\_Chart}} list all datasets that are needed to create each analysis dataset, -and how they should be created by merging, appending, -or in any other way linking different datasets. +and what manipulation of these data sources is necessary +to get to the final analysis dataset(s), +such as merging, appending, or other linkages. \subsection{Creating a data plan} The data plan is the best tool a research team has for the lead researchers to communicate their vision for the data work, and for the research assistants to communicate their understanding of that vision. -While the time to create the data plan is before any data has been acquired, -it should be a piece of documentation that is -continuously updated as the project evolves. - -To create a data plan according to DIME's template, first, -start by creating a \textbf{data linkage table} by listing -all the datasets you know you will use in a spreadsheet. +The data plan should be drafted at the outset of a project, +before any data is acquired, +but it is not a static document; +it will need to be updated as the project evolves. + +To create a data plan according to DIME's template, +the first step is to create a \textbf{data linkage table} by listing +all the data sources you know you will use in a spreadsheet. If one source of data will result in two different datasets, then list each dataset on its own row. For each dataset, list the \textbf{unit of observation}\sidenote{ @@ -106,18 +110,20 @@ \subsection{Creating a data plan} and it is a major source of error. The data linkage table should indicate whether -datasets should be merge one-to-one, for example, -merging baseline data and endline data that use the same unit of observation, -or whether two datasets should be merged many-to-one, -for example, school administrative data merged with student data. +datasets can be merged one-to-one (for example, +merging baseline and endline datasets +that use the same unit of observation), +or whether two datasets need to be merged many-to-one +(for example, school administrative data merged with student data). Your data map must indicate which ID variables can be used -- and how -- when merging datasets. -The data linkage table is also a great place to list other types of metadata, -such as the source of your data, its backup locations, and so on. 
+The data linkage table is also a great place to list other metadata, +such as the source of your data, its backup locations, +the nature of the data license, and so on. -Second, you should then create one \textbf{master dataset} +The second step in creating a data plan is to create one \textbf{master dataset} for each unit of observation -used in any significant research activity. +that will be used in any significant research activity. Examples of such activities are data collection, data analysis, sampling, and treatment assignment. The master dataset should include and be the authoritative source of @@ -133,14 +139,14 @@ \subsection{Creating a data plan} corresponds to direct observations of the real world, recorded sentiments of the subjects of the research or any other aspect your project is studying. - Research variables are not controlled by the research team + Measurement variables are not controlled by the research team and often vary over time. Examples include characteristics of the research subject, outcome variables, input variables among many others.}. Research variables and measurement variables often come from the same source, but should not be stored in the same way. -For example, if you are shared administrative data that both includes +For example, if you acquire administrative data that both includes information on eligibility to be included in the study (research variable) and information on outcome on the topic of your study (measurement variable) you should first decide which variables are research variables, @@ -178,41 +184,42 @@ \subsection{Creating a data plan} you have successfully merged the project IDs from the master dataset. -With the master datasets as an authoritative source +Since the master datasets is the authoritative source of the project ID and all research variables, it serves as an unambiguous method of mapping the observations in your study to your research design. -The third and final component of the data plan is the \textbf{data flow charts}. +The third and final step in creating the data plan is to create \textbf{data flow charts}. Each analysis dataset (see Chapter 6 for discussion on why you likely need multiple analysis datasets) -should have a data flow chart that is a diagram +should have a data flow chart showing how it was created. +The flow chart is a diagram where each starting point is either a master dataset or a dataset listed in the data linkage table. The data flow chart should include instructions on how -all datasets should be combined to create the analysis dataset. -The operations used to combine the data could, for example, be +the datasets can be combined to create the analysis dataset. +The operations used to combine the data could include: appending, one-to-one merging, many-to-one or one-to-many merging, collapsing, or a broad variety of others. You must list which ID variable or set of ID variables -should be used in these operations, -and if the operation creates a new variable or combination of variables -that identifies the newly linked data. +should be used in each operation, +and note whether the operation creates a new variable or combination of variables +to identify the newly linked data. Once you have acquired the datasets listed in the flow chart, -you can also start listing the number of observations that both -the starting point dataset should start with and -the combined datasets should have after each operation. 
+you can add to the data flow charts the number of observations that +the starting point dataset has +and the number of observation each resulting datasets +should have after each operation. This is a great method to track attrition and to make sure that the operations used to combine datasets did not create unwanted duplicates or incorrectly drop any observations. A data flow chart can be created in a flow chart drawing tool -(there are many free alternatives online), -by using shapes in Microsoft PowerPoint, -or simply by drawing on a piece of paper and taking a photo. -However, we recommend that the charts are created in a digital tool -so that new versions can easily be created -once you learn new things about your research throughout your project. +(there are many free alternatives online) or +by using shapes in Microsoft PowerPoint. +You can also do this simply by drawing on a piece of paper and taking a photo, +but we recommend a digital tool +so that flow charts can easily be updated over time. \subsection{Defining research variables related to your research design} From 3cc719c7671cf7b570369e444d0307fa6c1702df Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kristoffer=20Bj=C3=A4rkefur?= Date: Thu, 20 Aug 2020 13:31:01 -0400 Subject: [PATCH 18/41] [ch3] mj proofread Co-authored-by: Maria --- chapters/3-measurement.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index 5a0508f71..1a74463b5 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -9,7 +9,7 @@ simply listing the key outcome variables. You need to understand how to structure the project's data to best answer the research questions, -and creating the tools to share this understanding across your team. +and create the tools to share this understanding across your team. The first section of this chapter discusses how to determine your project's data needs, From 768f421fbb015e70559cfa8681981d93752d0549 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kristoffer=20Bj=C3=A4rkefur?= Date: Thu, 20 Aug 2020 13:43:43 -0400 Subject: [PATCH 19/41] [ch3] luiza proofread Co-authored-by: Luiza Andrade --- chapters/3-measurement.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index 1a74463b5..e2bfb4ee8 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -162,7 +162,7 @@ \subsection{Creating a data plan} \url{https://dimewiki.worldbank.org/ID\_Variable\_Properties}} You should also list all other identifiers used in your project, such as names, addresses, or other IDs used by partner organizations, -and the master datasets can then serve as +and the master datasets will then serve as the linkage between those identifiers and the project ID. Because of this, master datasets must, with very few exceptions, always be encrypted. 
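One way to use the observation counts recorded in a data flow chart is to write them into the construction code as explicit checks, so that unwanted duplicates or dropped observations stop the code immediately. The counts and names below are hypothetical.

    * Counts taken from the data flow chart (hypothetical)
    use "${data}/baseline_clean.dta", clear
    assert _N == 1200                      // starting point listed in the flow chart

    merge 1:1 hh_id using "${data}/endline_clean.dta", gen(round_merge)
    assert _N == 1200                      // no duplicates or unexpected additions

    count if round_merge == 1              // households not found at endline
    assert r(N) <= 100                     // attrition within the documented bound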
From 096fd2aa151c77997da63695211f31c5c0708070 Mon Sep 17 00:00:00 2001 From: Luiza Date: Thu, 20 Aug 2020 13:54:24 -0400 Subject: [PATCH 20/41] ch 3 - data plan review --- auxiliary/preamble.tex | 14 ----- chapters/3-measurement.tex | 107 ++++++++++++++++++------------------- 2 files changed, 53 insertions(+), 68 deletions(-) diff --git a/auxiliary/preamble.tex b/auxiliary/preamble.tex index d7c18fc7f..efcd771c5 100644 --- a/auxiliary/preamble.tex +++ b/auxiliary/preamble.tex @@ -124,19 +124,6 @@ \usepackage{xstring} \usepackage{catchfile} -%Set this user input -\newcommand{\gitfolder}{.git} %relative path to .git folder from .tex doc -\newcommand{\reponame}{worldbank/dime-data-handbook} % Name of account and repo be set in URL - -%Based on this https://tex.stackexchange.com/questions/455396/how-to-include-the-current-git-commit-id-and-branch-in-my-document -\CatchFileDef{\headfull}{\gitfolder/HEAD}{} %Get path to head file for checked out branch -\StrGobbleRight{\headfull}{1}[\head] %Remove end of line character -\StrBehind[2]{\head}{/}[\branch] %Parse out the path only -\CatchFileDef{\commit}{\gitfolder/refs/heads/\branch}{} %Get the content of the branch head -\StrGobbleRight{\commit}{1}[\commithash] %Remove end of line characted - -%Build the URL to this commit based on the information we now have -\newcommand{\commiturl}{\url{https://github.com/\reponame/commit/\commithash}} %---------------------------------------------------------------------------------------- @@ -180,7 +167,6 @@ \par Compiled from commit: \newline \vspace{-0.5cm} -\commiturl \par Released under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.\newline \vspace{-0.5cm} diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index e2bfb4ee8..cb9f2a105 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -1,59 +1,58 @@ %----------------------------------------------------------------------------------------------- \begin{fullwidth} - -In this chapter we will show how you can save a lot of time -and increase the quality of your research by -planning your project's data requirements in advance. -Planning data requirements requires more than -simply listing the key outcome variables. -You need to understand how to structure the project's data -to best answer the research questions, -and create the tools to share this understanding across your team. - -The first section of this chapter discusses how to -determine your project's data needs, -and introduces DIME's data plan template. -The template includes: -one data linkage table, -one or several master datasets, and -one or several data flow charts. -These three tools will help to communicate the project's data requirements -both across the team and across time. -This section also discusses what specific research data you need -based on your project's research design, -and how to document those data needs in the data plan. - -The second section of this chapter covers two activities where -research data is created by the research team -instead of being observed in the real world. -Those two activities are random sampling and random assignment. -Special focus is spent on how to ensure that -these and other random processes are reproducible, -which is critical for the credibility of your research. - -The chapter concludes with a discussion of power calculations and randomization inference, -and how both are important tools to make optimal choices when planning data work. 
- - + + In this chapter we will show how you can save a lot of time + and increase the quality of your research by + planning your project's data requirements in advance. + Planning data requirements should be more than + just a listing of the key outcome variables. + It requires understanding how to structure the project's data + to best answer the research questions, + and creating the tools to share this understanding across your team. + + The first section of this chapter discusses how to + determine the data needs of the project, + and introduces DIME's data plan template that is composed of + one data linkage table, + one or several master datasets and + one or several data flow charts. + These three tools should be the way data requirements are communicated + both across the team and across time. + This section also discusses what specific research data you need + based on your project's research design, + and how those data needs are documented in the data plan. + + The second section of this chapter covers two activities where + research data is created by the research team + instead of being observed in the real world. + Those two activities are random sampling and random assignment. + Special focus is spent on how to ensure that + these and other random processes are reproducible, + which is critical for the credibility of your research. + + The chapter concludes with a discussion of power calculations and randomization inference, + and how both are important tools to make optimal choices when planning data work. + + \end{fullwidth} %----------------------------------------------------------------------------------------------- \section{Translating research design to a data plan} -In most projects, more than one data source is needed to answer the research question. These could be data from multiple survey rounds, data acquired from different partners (such as administrative data, implementation data, sensor data), web scraping, or complex combinations of these and other sources. +Most projects require data source to answer all their research questions. However your study is structured, you need to know how to link data from all sources -and analyze the relationships between the units that appear in them -to answer all your research questions. +and analyze the relationship between their units +to manage its data successfully. You might think that you are able to keep all the relevant details in your head, but your whole research team is unlikely to have the same understanding, at all times, of all the datasets required. -The only way to make sure that the full team shares the same understanding +The only way to make sure that the whole team shares the same understanding is to create a \textbf{data plan}\index{Data plan}.\sidenote{ \url{https://dimewiki.worldbank.org/Data\_Plan}} DIME's data plan template has three components: @@ -69,16 +68,14 @@ \section{Translating research design to a data plan} combining information from multiple data sources. \textbf{Master datasets}\sidenote{ \url{https://dimewiki.worldbank.org/Master\_Data\_Set}} -list all observations your project ever encounter +list all observations at your project ever encountered and are the authoritative source for all research data -such as unique identifiers, sample status and treatment assignment -(the following two sections of this chapter discuss how to generate these variables). +such as ID values, sampling and treatment statuses, and so on. 
\textbf{Data flow charts}\sidenote{ \url{https://dimewiki.worldbank.org/Data\_Flow\_Chart}} list all datasets that are needed to create each analysis dataset, -and what manipulation of these data sources is necessary -to get to the final analysis dataset(s), -such as merging, appending, or other linkages. +and how they should be created by merging, appending, +or in any other way linking different data tables. \subsection{Creating a data plan} @@ -106,7 +103,7 @@ \subsection{Creating a data plan} or make a plan for how the new dataset will be linked to the project ID. It is very labor intensive to work with a dataset that -do not have an unambiguous way to link to the project ID, +cannot be unambiguously linked to the project ID, and it is a major source of error. The data linkage table should indicate whether @@ -126,20 +123,22 @@ \subsection{Creating a data plan} that will be used in any significant research activity. Examples of such activities are data collection, data analysis, sampling, and treatment assignment. -The master dataset should include and be the authoritative source of +The master dataset is the authoritative source of all \textbf{research variables}\sidenote{ \textbf{Research variables:} Research data that identifies observations - and maps research design information to those observations. + and maps research design information to them. Research variables are time-invariant and often, but not always, controlled by the research team. Examples include - ID variables, sampling status, treatment status, and treatment uptake.} -but not include any \textbf{measurement variables}\sidenote{ + ID variables, sampling status, treatment status, and treatment uptake.}, +and therefore all such variables should be present, +unlike \textbf{measurement variables}\sidenote{ \textbf{Measurement variables:} Data that corresponds to direct observations of the real world, recorded sentiments of the subjects of the research or any other aspect your project is studying. Measurement variables are not controlled by the research team + These variables are not controlled by the research team and often vary over time. Examples include characteristics of the research subject, outcome variables, input variables among many others.}. @@ -148,8 +147,8 @@ \subsection{Creating a data plan} but should not be stored in the same way. For example, if you acquire administrative data that both includes information on eligibility to be included in the study (research variable) -and information on outcome on the topic of your study (measurement variable) -you should first decide which variables are research variables, +and information on your outcome or interest (measurement variable), +you should first identify research variables, remove them during the data cleaning (see Chapter 6) and instead store them in your master dataset. It is common that you will have to update @@ -162,7 +161,7 @@ \subsection{Creating a data plan} \url{https://dimewiki.worldbank.org/ID\_Variable\_Properties}} You should also list all other identifiers used in your project, such as names, addresses, or other IDs used by partner organizations, -and the master datasets will then serve as +and the master datasets can then serve as the linkage between those identifiers and the project ID. Because of this, master datasets must, with very few exceptions, always be encrypted. @@ -467,7 +466,7 @@ \subsection{Time periods in data plans} being a subset of the original sample that were better or worse off from the beginning. 
You should have a variable in your master dataset - that indicates attrition. +that indicates attrition. A balance check using the attrition variable can provide insights as to whether the lost observations were systematically different From 3b4042f923e2f9e5c691344673dc52ebc399ccbd Mon Sep 17 00:00:00 2001 From: Luiza Date: Thu, 20 Aug 2020 14:03:39 -0400 Subject: [PATCH 21/41] Revert "ch 3 - data plan review" This reverts commit 096fd2aa151c77997da63695211f31c5c0708070. --- auxiliary/preamble.tex | 14 +++++ chapters/3-measurement.tex | 107 +++++++++++++++++++------------------ 2 files changed, 68 insertions(+), 53 deletions(-) diff --git a/auxiliary/preamble.tex b/auxiliary/preamble.tex index efcd771c5..d7c18fc7f 100644 --- a/auxiliary/preamble.tex +++ b/auxiliary/preamble.tex @@ -124,6 +124,19 @@ \usepackage{xstring} \usepackage{catchfile} +%Set this user input +\newcommand{\gitfolder}{.git} %relative path to .git folder from .tex doc +\newcommand{\reponame}{worldbank/dime-data-handbook} % Name of account and repo be set in URL + +%Based on this https://tex.stackexchange.com/questions/455396/how-to-include-the-current-git-commit-id-and-branch-in-my-document +\CatchFileDef{\headfull}{\gitfolder/HEAD}{} %Get path to head file for checked out branch +\StrGobbleRight{\headfull}{1}[\head] %Remove end of line character +\StrBehind[2]{\head}{/}[\branch] %Parse out the path only +\CatchFileDef{\commit}{\gitfolder/refs/heads/\branch}{} %Get the content of the branch head +\StrGobbleRight{\commit}{1}[\commithash] %Remove end of line characted + +%Build the URL to this commit based on the information we now have +\newcommand{\commiturl}{\url{https://github.com/\reponame/commit/\commithash}} %---------------------------------------------------------------------------------------- @@ -167,6 +180,7 @@ \par Compiled from commit: \newline \vspace{-0.5cm} +\commiturl \par Released under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.\newline \vspace{-0.5cm} diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index cb9f2a105..e2bfb4ee8 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -1,58 +1,59 @@ %----------------------------------------------------------------------------------------------- \begin{fullwidth} - - In this chapter we will show how you can save a lot of time - and increase the quality of your research by - planning your project's data requirements in advance. - Planning data requirements should be more than - just a listing of the key outcome variables. - It requires understanding how to structure the project's data - to best answer the research questions, - and creating the tools to share this understanding across your team. - - The first section of this chapter discusses how to - determine the data needs of the project, - and introduces DIME's data plan template that is composed of - one data linkage table, - one or several master datasets and - one or several data flow charts. - These three tools should be the way data requirements are communicated - both across the team and across time. - This section also discusses what specific research data you need - based on your project's research design, - and how those data needs are documented in the data plan. - - The second section of this chapter covers two activities where - research data is created by the research team - instead of being observed in the real world. - Those two activities are random sampling and random assignment. 
- Special focus is spent on how to ensure that - these and other random processes are reproducible, - which is critical for the credibility of your research. - - The chapter concludes with a discussion of power calculations and randomization inference, - and how both are important tools to make optimal choices when planning data work. - - + +In this chapter we will show how you can save a lot of time +and increase the quality of your research by +planning your project's data requirements in advance. +Planning data requirements requires more than +simply listing the key outcome variables. +You need to understand how to structure the project's data +to best answer the research questions, +and create the tools to share this understanding across your team. + +The first section of this chapter discusses how to +determine your project's data needs, +and introduces DIME's data plan template. +The template includes: +one data linkage table, +one or several master datasets, and +one or several data flow charts. +These three tools will help to communicate the project's data requirements +both across the team and across time. +This section also discusses what specific research data you need +based on your project's research design, +and how to document those data needs in the data plan. + +The second section of this chapter covers two activities where +research data is created by the research team +instead of being observed in the real world. +Those two activities are random sampling and random assignment. +Special focus is spent on how to ensure that +these and other random processes are reproducible, +which is critical for the credibility of your research. + +The chapter concludes with a discussion of power calculations and randomization inference, +and how both are important tools to make optimal choices when planning data work. + + \end{fullwidth} %----------------------------------------------------------------------------------------------- \section{Translating research design to a data plan} +In most projects, more than one data source is needed to answer the research question. These could be data from multiple survey rounds, data acquired from different partners (such as administrative data, implementation data, sensor data), web scraping, or complex combinations of these and other sources. -Most projects require data source to answer all their research questions. However your study is structured, you need to know how to link data from all sources -and analyze the relationship between their units -to manage its data successfully. +and analyze the relationships between the units that appear in them +to answer all your research questions. You might think that you are able to keep all the relevant details in your head, but your whole research team is unlikely to have the same understanding, at all times, of all the datasets required. -The only way to make sure that the whole team shares the same understanding +The only way to make sure that the full team shares the same understanding is to create a \textbf{data plan}\index{Data plan}.\sidenote{ \url{https://dimewiki.worldbank.org/Data\_Plan}} DIME's data plan template has three components: @@ -68,14 +69,16 @@ \section{Translating research design to a data plan} combining information from multiple data sources. 
\textbf{Master datasets}\sidenote{ \url{https://dimewiki.worldbank.org/Master\_Data\_Set}} -list all observations at your project ever encountered +list all observations your project ever encounter and are the authoritative source for all research data -such as ID values, sampling and treatment statuses, and so on. +such as unique identifiers, sample status and treatment assignment +(the following two sections of this chapter discuss how to generate these variables). \textbf{Data flow charts}\sidenote{ \url{https://dimewiki.worldbank.org/Data\_Flow\_Chart}} list all datasets that are needed to create each analysis dataset, -and how they should be created by merging, appending, -or in any other way linking different data tables. +and what manipulation of these data sources is necessary +to get to the final analysis dataset(s), +such as merging, appending, or other linkages. \subsection{Creating a data plan} @@ -103,7 +106,7 @@ \subsection{Creating a data plan} or make a plan for how the new dataset will be linked to the project ID. It is very labor intensive to work with a dataset that -cannot be unambiguously linked to the project ID, +do not have an unambiguous way to link to the project ID, and it is a major source of error. The data linkage table should indicate whether @@ -123,22 +126,20 @@ \subsection{Creating a data plan} that will be used in any significant research activity. Examples of such activities are data collection, data analysis, sampling, and treatment assignment. -The master dataset is the authoritative source of +The master dataset should include and be the authoritative source of all \textbf{research variables}\sidenote{ \textbf{Research variables:} Research data that identifies observations - and maps research design information to them. + and maps research design information to those observations. Research variables are time-invariant and often, but not always, controlled by the research team. Examples include - ID variables, sampling status, treatment status, and treatment uptake.}, -and therefore all such variables should be present, -unlike \textbf{measurement variables}\sidenote{ + ID variables, sampling status, treatment status, and treatment uptake.} +but not include any \textbf{measurement variables}\sidenote{ \textbf{Measurement variables:} Data that corresponds to direct observations of the real world, recorded sentiments of the subjects of the research or any other aspect your project is studying. Measurement variables are not controlled by the research team - These variables are not controlled by the research team and often vary over time. Examples include characteristics of the research subject, outcome variables, input variables among many others.}. @@ -147,8 +148,8 @@ \subsection{Creating a data plan} but should not be stored in the same way. For example, if you acquire administrative data that both includes information on eligibility to be included in the study (research variable) -and information on your outcome or interest (measurement variable), -you should first identify research variables, +and information on outcome on the topic of your study (measurement variable) +you should first decide which variables are research variables, remove them during the data cleaning (see Chapter 6) and instead store them in your master dataset. 
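In practice, this separation can be as simple as
saving the two sets of variables to different files
as soon as the data is acquired.
A minimal sketch in Stata,
where every file and variable name is a hypothetical placeholder, is:

\begin{verbatim}
* Separate research variables from measurement variables in one
* acquired administrative dataset (all names here are hypothetical)
use "admin_data_raw.dta", clear
preserve
    keep school_id eligibility_status       // research variables
    merge 1:1 school_id using "master_school.dta", nogenerate
    save "master_school.dta", replace       // update the master dataset
restore
drop eligibility_status                     // keep only measurement variables
save "admin_data_clean.dta", replace
\end{verbatim}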
It is common that you will have to update @@ -161,7 +162,7 @@ \subsection{Creating a data plan} \url{https://dimewiki.worldbank.org/ID\_Variable\_Properties}} You should also list all other identifiers used in your project, such as names, addresses, or other IDs used by partner organizations, -and the master datasets can then serve as +and the master datasets will then serve as the linkage between those identifiers and the project ID. Because of this, master datasets must, with very few exceptions, always be encrypted. @@ -466,7 +467,7 @@ \subsection{Time periods in data plans} being a subset of the original sample that were better or worse off from the beginning. You should have a variable in your master dataset -that indicates attrition. + that indicates attrition. A balance check using the attrition variable can provide insights as to whether the lost observations were systematically different From 87617acb4b0d515ea186a4d58febf51403d8f373 Mon Sep 17 00:00:00 2001 From: Maria Date: Thu, 20 Aug 2020 14:34:00 -0400 Subject: [PATCH 22/41] Update chapters/3-measurement.tex --- chapters/3-measurement.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index e2bfb4ee8..5272ccb4a 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -148,7 +148,7 @@ \subsection{Creating a data plan} but should not be stored in the same way. For example, if you acquire administrative data that both includes information on eligibility to be included in the study (research variable) -and information on outcome on the topic of your study (measurement variable) +and data on the topic of your study (measurement variable) you should first decide which variables are research variables, remove them during the data cleaning (see Chapter 6) and instead store them in your master dataset. From eea94d40c64696c1997d91673a9d49c41c1584a2 Mon Sep 17 00:00:00 2001 From: Benjamin Daniels Date: Thu, 20 Aug 2020 16:07:55 -0400 Subject: [PATCH 23/41] Rewrite research variables section --- chapters/3-measurement.tex | 433 +++++++++++++------------------------ 1 file changed, 153 insertions(+), 280 deletions(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index 5272ccb4a..9fc167ed6 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -3,32 +3,32 @@ \begin{fullwidth} In this chapter we will show how you can save a lot of time -and increase the quality of your research by +and increase the quality of your research by planning your project's data requirements in advance. Planning data requirements requires more than -simply listing the key outcome variables. +simply listing the key outcome variables. You need to understand how to structure the project's data -to best answer the research questions, +to best answer the research questions, and create the tools to share this understanding across your team. -The first section of this chapter discusses how to -determine your project's data needs, -and introduces DIME's data plan template. -The template includes: +The first section of this chapter discusses how to +determine your project's data needs, +and introduces DIME's data plan template. +The template includes: one data linkage table, one or several master datasets, and -one or several data flow charts. +one or several data flow charts. These three tools will help to communicate the project's data requirements both across the team and across time. 
-This section also discusses what specific research data you need +This section also discusses what specific research data you need based on your project's research design, and how to document those data needs in the data plan. -The second section of this chapter covers two activities where -research data is created by the research team +The second section of this chapter covers two activities where +research data is created by the research team instead of being observed in the real world. Those two activities are random sampling and random assignment. -Special focus is spent on how to ensure that +Special focus is spent on how to ensure that these and other random processes are reproducible, which is critical for the credibility of your research. @@ -45,8 +45,8 @@ \section{Translating research design to a data plan} In most projects, more than one data source is needed to answer the research question. These could be data from multiple survey rounds, data acquired from different partners (such as administrative data, implementation data, sensor data), -web scraping, -or complex combinations of these and other sources. +web scraping, +or complex combinations of these and other sources. However your study is structured, you need to know how to link data from all sources and analyze the relationships between the units that appear in them to answer all your research questions. @@ -59,123 +59,123 @@ \section{Translating research design to a data plan} DIME's data plan template has three components: one \textit{data linkage table},\index{Data linkage table} one or several \textit{master datasets}\index{Master datasets} -and one or several \textit{data flow charts}.\index{Data flowchart} +and one or several \textit{data flow charts}.\index{Data flowchart} A \textbf{data linkage table}\sidenote{ \url{https://dimewiki.worldbank.org/Data\_Linkage\_Table}} lists all the datasets that will be used in the project. -Its most important function is to indicate +Its most important function is to indicate how all those datasets can be be linked when combining information from multiple data sources. \textbf{Master datasets}\sidenote{ \url{https://dimewiki.worldbank.org/Master\_Data\_Set}} list all observations your project ever encounter and are the authoritative source for all research data -such as unique identifiers, sample status and treatment assignment -(the following two sections of this chapter discuss how to generate these variables). +such as unique identifiers, sample status and treatment assignment +(the following two sections of this chapter discuss how to generate these variables). \textbf{Data flow charts}\sidenote{ \url{https://dimewiki.worldbank.org/Data\_Flow\_Chart}} list all datasets that are needed to create each analysis dataset, -and what manipulation of these data sources is necessary -to get to the final analysis dataset(s), -such as merging, appending, or other linkages. +and what manipulation of these data sources is necessary +to get to the final analysis dataset(s), +such as merging, appending, or other linkages. \subsection{Creating a data plan} The data plan is the best tool a research team has for -the lead researchers to communicate their vision for the data work, +the lead researchers to communicate their vision for the data work, and for the research assistants to communicate their understanding of that vision. 
-The data plan should be drafted at the outset of a project, -before any data is acquired, +The data plan should be drafted at the outset of a project, +before any data is acquired, but it is not a static document; it will need to be updated as the project evolves. -To create a data plan according to DIME's template, +To create a data plan according to DIME's template, the first step is to create a \textbf{data linkage table} by listing all the data sources you know you will use in a spreadsheet. -If one source of data will result in two different datasets, +If one source of data will result in two different datasets, then list each dataset on its own row. For each dataset, list the \textbf{unit of observation}\sidenote{ \url{https://dimewiki.worldbank.org/Unit\_of\_Observation}}, and the name of the project ID variable for that unit of observation. -Your project should only have one project ID variable per unit of observation. +Your project should only have one project ID variable per unit of observation. When you list a dataset in the data linkage table -- which should be done before that dataset is acquired -- you should always make sure that the dataset will -be fully and uniquely identified by the project ID, -or make a plan for how +be fully and uniquely identified by the project ID, +or make a plan for how the new dataset will be linked to the project ID. It is very labor intensive to work with a dataset that -do not have an unambiguous way to link to the project ID, +do not have an unambiguous way to link to the project ID, and it is a major source of error. -The data linkage table should indicate whether -datasets can be merged one-to-one (for example, -merging baseline and endline datasets +The data linkage table should indicate whether +datasets can be merged one-to-one (for example, +merging baseline and endline datasets that use the same unit of observation), or whether two datasets need to be merged many-to-one (for example, school administrative data merged with student data). -Your data map must indicate which ID variables +Your data map must indicate which ID variables can be used -- and how -- when merging datasets. The data linkage table is also a great place to list other metadata, -such as the source of your data, its backup locations, +such as the source of your data, its backup locations, the nature of the data license, and so on. -The second step in creating a data plan is to create one \textbf{master dataset} +The second step in creating a data plan is to create one \textbf{master dataset} for each unit of observation that will be used in any significant research activity. -Examples of such activities are data collection, data analysis, +Examples of such activities are data collection, data analysis, sampling, and treatment assignment. The master dataset should include and be the authoritative source of all \textbf{research variables}\sidenote{ \textbf{Research variables:} Research data that identifies observations and maps research design information to those observations. - Research variables are time-invariant and + Research variables are time-invariant and often, but not always, controlled by the research team. 
Examples include - ID variables, sampling status, treatment status, and treatment uptake.} + ID variables, sampling status, treatment status, and treatment uptake.} but not include any \textbf{measurement variables}\sidenote{ - \textbf{Measurement variables:} Data that - corresponds to direct observations of the real world, - recorded sentiments of the subjects of the research - or any other aspect your project is studying. + \textbf{Measurement variables:} Data that + corresponds to direct observations of the real world, + recorded sentiments of the subjects of the research + or any other aspect your project is studying. Measurement variables are not controlled by the research team and often vary over time. - Examples include characteristics of the research subject, + Examples include characteristics of the research subject, outcome variables, input variables among many others.}. -Research variables and measurement variables +Research variables and measurement variables often come from the same source, but should not be stored in the same way. -For example, if you acquire administrative data that both includes -information on eligibility to be included in the study (research variable) -and data on the topic of your study (measurement variable) -you should first decide which variables are research variables, -remove them during the data cleaning (see Chapter 6) -and instead store them in your master dataset. -It is common that you will have to update +For example, if you acquire administrative data that both includes +information on eligibility to be included in the study (research variable) +and data on the topic of your study (measurement variable) +you should first decide which variables are research variables, +remove them during the data cleaning (see Chapter 6) +and instead store them in your master dataset. +It is common that you will have to update your master datasets throughout your project. -The most important function of the master dataset +The most important function of the master dataset is to be the authoritative source for the project ID. -This means that all observations listed +This means that all observations listed should be uniquely and fully identified by the included project ID variable.\sidenote{ \url{https://dimewiki.worldbank.org/ID\_Variable\_Properties}} -You should also list all other identifiers used in your project, +You should also list all other identifiers used in your project, such as names, addresses, or other IDs used by partner organizations, -and the master datasets will then serve as +and the master datasets will then serve as the linkage between those identifiers and the project ID. -Because of this, master datasets must, +Because of this, master datasets must, with very few exceptions, always be encrypted. -Even when a partner organization has a unique identifier, -you should always create a project ID specific to your project only, -as you are otherwise not in control over -who can re-identify your de-identified dataset. +Even when a partner organization has a unique identifier, +you should always create a project ID specific to your project only, +as you are otherwise not in control over +who can re-identify your de-identified dataset. -You should include all observations ever encountered +You should include all observations ever encountered in your master datasets, even if they are not eligible for your study. 
This is because, if you ever need to perform a record linkage such as a fuzzy match -on string variables like proper names, +on string variables like proper names, you will make fewer errors the more information you have. If you ever need to do a fuzzy match, you should always do that between the master dataset @@ -184,7 +184,7 @@ \subsection{Creating a data plan} you have successfully merged the project IDs from the master dataset. -Since the master datasets is the authoritative source +Since the master datasets is the authoritative source of the project ID and all research variables, it serves as an unambiguous method of mapping the observations in your study to your research design. @@ -192,71 +192,71 @@ \subsection{Creating a data plan} The third and final step in creating the data plan is to create \textbf{data flow charts}. Each analysis dataset (see Chapter 6 for discussion on why you likely need multiple analysis datasets) -should have a data flow chart showing how it was created. -The flow chart is a diagram -where each starting point is either a master dataset +should have a data flow chart showing how it was created. +The flow chart is a diagram +where each starting point is either a master dataset or a dataset listed in the data linkage table. -The data flow chart should include instructions on how +The data flow chart should include instructions on how the datasets can be combined to create the analysis dataset. The operations used to combine the data could include: -appending, one-to-one merging, +appending, one-to-one merging, many-to-one or one-to-many merging, collapsing, or a broad variety of others. You must list which ID variable or set of ID variables should be used in each operation, and note whether the operation creates a new variable or combination of variables to identify the newly linked data. -Once you have acquired the datasets listed in the flow chart, -you can add to the data flow charts the number of observations that +Once you have acquired the datasets listed in the flow chart, +you can add to the data flow charts the number of observations that the starting point dataset has and the number of observation each resulting datasets -should have after each operation. +should have after each operation. This is a great method to track attrition and to make sure that the operations used to combine datasets did not create unwanted duplicates or incorrectly drop any observations. A data flow chart can be created in a flow chart drawing tool (there are many free alternatives online) or -by using shapes in Microsoft PowerPoint. +by using shapes in Microsoft PowerPoint. You can also do this simply by drawing on a piece of paper and taking a photo, but we recommend a digital tool -so that flow charts can easily be updated over time. +so that flow charts can easily be updated over time. \subsection{Defining research variables related to your research design} After you have set up your data plan, -you need to carefully think about your research design +you need to carefully think about your research design and which research variables you will need in the data analysis -to infer the relation between differences in measurement variables +to infer the relation between differences in measurement variables and your research design. We assume you have a working familiarity with the research designs mentioned here. 
If needed, you can reference \textcolor{red}{Appendix XYZ}, -where you will find more details +where you will find more details and specific references for common impact evaluation methods. The research designs discussed here compare a group that received some kind of \textbf{treatment}\index{Treatment}\sidenote{ \textbf{Treatment:} The general word for the evaluated intervention or event. This includes things like being offered a training, - a cash transfer from a program, + a cash transfer from a program, or experiencing a natural disaster, among many others.} against a counterfactual control group\index{Counterfactual}.\sidenote{ - \textbf{Counterfactual:} A statistical description of + \textbf{Counterfactual:} A statistical description of what would have happened to specific individuals in an alternative scenario, for example, a different treatment assignment outcome.} The key assumption is that each -person, facility, or village +person, facility, or village (or whatever the unit of treatment is) had two possible states: their outcome if they did receive the treatment and their outcome if they did not receive that treatment. The average impact of the treatment, or the ATE\sidenote{ - The \textbf{average treatment effect (ATE)} - is the expected average change in outcome - that untreated units would have experienced + The \textbf{average treatment effect (ATE)} + is the expected average change in outcome + that untreated units would have experienced had they been treated.}, -is defined as the difference +is defined as the difference between these two states averaged over all units. However, we can never observe the same unit @@ -265,35 +265,35 @@ \subsection{Defining research variables related to your research design} Instead, the treatment group is compared to a control group that is statistically indistinguishable, which makes the average impact of the treatment -mathematically equivalent to +mathematically equivalent to the difference in averages between the groups. Statistical similarity is often defined -as \textbf{balance} between two or more groups. +as \textbf{balance} between two or more groups. Since balance tests are commonly run for impact evaluations, -DIME Analytics created a Stata command to +DIME Analytics created a Stata command to standardize and automate the creation of nicely-formatted balance tables: \texttt{iebaltab}\sidenote{ \url{https://dimewiki.worldbank.org/iebaltab}}. Each research design has a different method for -identifying the statistically-similar control group. -The rest of this section covers how research data requirements +identifying the statistically-similar control group. +The rest of this section covers how research data requirements differ between those different methods. -What does not differ, however, +What does not differ, however, is that these data requirements are all research variables. The source for the required research variables varies -between research designs and between projects, +between research designs and between projects, but the authoritative source for that type of data should always be a master dataset. -You will often have to merge -the research variables to other datasets, -but that is an easy task +You will often have to merge +the research variables to other datasets, +but that is an easy task if you created a data linkage table. 
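For example, once the treatment assignment recorded in the master dataset
has been merged onto your analysis data,
a balance table of the kind described above can be produced in a single command.
A minimal sketch using \texttt{iebaltab},
where the covariate and group variable names are hypothetical
and only a few of the command's options are shown, is:

\begin{verbatim}
* Balance of baseline covariates across treatment and control groups
* (all variable names here are hypothetical)
iebaltab age hh_size baseline_income, grpvar(treatment) ///
    savetex("outputs/balance_table.tex") replace
\end{verbatim}

The command's help file and the DIME Wiki entry cited above
document the full set of options.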
%%%%% Experimental design -In \textbf{experimental research designs}, +In \textbf{experimental research designs}, such as\index{randomized control trials}\index{experimental research designs} \textbf{randomized control trials (RCTs)},\sidenote{ \url{https://dimewiki.worldbank.org/Randomized\_Control\_Trials}} @@ -308,7 +308,7 @@ \subsection{Defining research variables related to your research design} then the two groups will, on average, be statistically indistinguishable. The treatment will therefore not be correlated with anything but the impact of that treatment.\cite{duflo2007using} -The randomized assignment should be done +The randomized assignment should be done using data from the master dataset, and the result should be saved back to the master dataset, before being merged to other datasets. @@ -330,198 +330,71 @@ \subsection{Defining research variables related to your research design} to exploit events that occurred in the past. Therefore, these methods often use either secondary data, including administrative data or other classes of routinely-collected information, -and it is important that your data linkage table documents +and it is important that your data linkage table documents how this data can be linked to the rest of the data in your project. -%%%%% Regression discontinuity +%%%%% Research variables -\textbf{Regression discontinuity (RD)}\sidenote{ +No matter the design, you should be very clear about +which data points you observe or collect are research variables. +For example, +\textbf{regression discontinuity (RD)}\sidenote{ \url{https://dimewiki.worldbank.org/Regression\_Discontinuity}} \index{regression discontinuity} designs exploit sharp breaks or limits -in policy designs to separate a single group of potentially eligible recipients -into comparable groups of individuals who do and do not receive a treatment. -Common examples are test score thresholds and income thresholds, -where the individuals on one side of some threshold receive -a treatment but those on the other side do not.\sidenote{ - \url{https://blogs.worldbank.org/impactevaluations/regression-discontinuity-porn}} -The intuition is that, on average, -individuals immediately on one side of the threshold -are statistically indistinguishable from the individuals on the other side, -and the only difference is receiving the treatment. -In your data you need an unambiguous way -to define which observations were above or below the cutoff. +in policy designs. The cutoff determinant, or running variable, -is often a continuous variable -that is used to divide the sample into two or more groups. -Both the running variable and a categorical cutoff variable, should be saved in your master dataset. -The running variable is one of the exceptions where -a research variable in the master dataset -may vary over time, -and an observation may be on different sides of the cutoff, -depending on when you make the cutoff. -In these cases your research design should -ex-ante clearly indicate what point in time -the running variable will be recorded, -and this should be clearly documented in your master dataset. - - -%%%%% IV regression - -\textbf{Instrumental variables (IV)}\sidenote{ +In \textbf{instrumental variables (IV)}\sidenote{ \url{https://dimewiki.worldbank.org/Instrumental\_Variables}} \index{instrumental variables} -designs, unlike the previous approaches, -assume that the treatment effect is not directly identifiable. -Similar to RD designs, -IV designs focus on a subset of the variation in treatment take-up. 
-Where RD designs use a \textit{sharp} or binary cutoff, -IV designs are \textit{fuzzy}, meaning that the input does not completely determine -the treatment status, but instead influence the \textit{probability of treatment}. -You will need variables in your data -that can be used to estimate the probability of treatment for each unit. -These variables are called \textbf{instruments}. -Instrument variables are another example of research variables -that might vary over time, -as your probability of being treated might change. -Again, in these cases your research design should -ex-ante clearly indicate what point of time this will be recorded, -and this should be clearly documented in your master dataset. - -%%%%% Matching - -\textbf{Matching}\sidenote{ +designs, the \textbf{instruments} influence the \textit{probability} of treatment. +These research variables should be collected and stored in master data. +In \textbf{matching} designs, observations are often grouped +by a strata, grouping, index, or propensity score.\sidenote{ \url{https://dimewiki.worldbank.org/Matching}} -methods use observable characteristics to construct -sets of treatment and control units -where the observations in each set -are as similar as possible. \index{matching} -These sets can either consist of exactly one treatment and one control observation (one-to-one), -a set of observations where -both groups have more than one observation represented (many-to-many), -or where only one group has more than one observation included (one-to-many). -By now you can probably guess that -the result of the matching needs to be saved in the master dataset. -This is best done by assigning a matching ID to each matched set, -and create a variable in the master dataset -with the ID for the set each unit belongs to. -The matching can also be done before the randomized assignment, -so that treatment can be randomized within each matching set. -This would then be a type of experimental design. -Furthermore, if no control observations were identified before the treatment, -then matching can be used to ex-post identify a control group. -Many matching algorithms can only match on a single variable, -so you first have to turn many variables into a single variable -by using an index or a propensity score.\sidenote{ - \url{https://dimewiki.worldbank.org/Propensity\_Score\_Matching}} -DIME Analytics developed a command to match observations -based on this single continuous variable: \texttt{iematch}\sidenote{ - \url{https://dimewiki.worldbank.org/iematch}}, -part of the \texttt{ietoolkit} package. - -%----------------------------------------------------------------------------------------------- -\subsection{Time periods in data plans} - -Your data plan should also take into consideration -whether you are using data from one time period or several. -A study that observes data in only one time period is called -a \textbf{cross-sectional study}. -\index{cross-sectional data} -This type of data is relatively easy to collect and handle because -you do not need to track individuals across time, -and therefore requires no additional information in your data plan. -Instead, the challenge in a cross-sectional study is to -show that the control group is indeed a valid counterfactual to the treatment group. 
-
-Observations over multiple time periods,
-referred to as \textbf{longitudinal data}\index{longitudinal data},
-can consist of either
-\textbf{repeated cross-sections}\index{repeated cross-sectional data}
-or \textbf{panel data}\index{panel data}.
-In repeated cross-sections,
-each successive round of data collection uses a new random sample
-of observations from the treatment and control groups,
-but in a panel data study
-the same observations are tracked and included each round.
-If each round of data collection is a separate activity,
-then they should be treated as separate sources of data
-and get their own row in the data linkage table.
-If the data is continuously collected,
-or at frequent intervals,
-then it can be treated as a single data source.
-The data linkage table must document
-how the different rounds will be merged or appended
-when panel data is collected in separate activities.
-
-You must keep track of the \textit{attrition rate} in panel data,
-which is the share of observations not observed in follow-up data.
-It is common that the observations not possible to track
-can be correlated with the outcome you study.
-For example, poorer households may live in more informal dwellings,
-patients with worse health conditions might not survive to follow-up,
-and so on.
-If this is the case,
-then your results might only be an effect of your remaining sample
-being a subset of the original sample
-that were better or worse off from the beginning.
-You should have a variable in your master dataset
-	that indicates attrition.
-A balance check using the attrition variable
-can provide insights as to whether the lost observations
-were systematically different
-compared to the rest of the sample.

%-----------------------------------------------------------------------------------------------
\subsection{Monitoring data}

For any study with an ex-ante design,
\textbf{monitoring data}\index{monitoring data}\sidenote{\url{
https://dimewiki.worldbank.org/Monitoring\_Data}}
is very important for understanding whether the
assumptions made during the research design
correspond to what is true in reality.
The most typical example is to make sure that,
in an experimental design,
the treatment was implemented according to your treatment assignment.
While it is always better to monitor all activities,
it might be too costly.
In those cases, you can sample a smaller number of critical activities and monitor them.
This will not be detailed enough to be used as a control in your analysis,
but it will still give you a way to
estimate the validity of your research design assumptions.
Treatment implementation is often carried out by partners,
and field realities may be more complex than foreseen during research design.
Furthermore, the field staff of your partner organization
might not be aware that their actions are part of the implementation of your research.
Therefore, you must acquire monitoring data that
tells you how well the treatment assignment in the field
corresponds to your intended treatment assignment,
for nearly all experimental research designs.

-An example of a non-experimental research design
-for which monitoring data also is important
-are regression discontinuity (RD) designs
-where he discontinuity is
-a cutoff for eligibility of the treatment.
-For example,
-let's say your project studies the impact of a program
-for students that scored under 50\% at a test.
-We might have the exact results of the tests for all students,
-and therefore know who should be offered the program,
-however that is not the same as knowing who attended the program.
-A teacher might offer the program to someone that scored 51\% at the test,
-and someone that scored 49\% at the might decline to participate in the program.
-We need to understand how common this was,
-and if one case was more common than the other.
-Otherwise the result of our research will not be helpful
-in evaluating the program.
-
-Monitoring data is particularly prone to errors
+Monitoring data is particularly prone to errors
when linking it with the rest of the data in your project.
-Often monitoring activities is done by
+Often monitoring activities are done by
sending a team to simply record the names of all people attending a training,
or by a partner organization sharing their administrative data,
which is rarely maintained in the same format or structure as your research data.
@@ -529,7 +402,7 @@ \subsection{Monitoring data}
the project ID or any other unambiguous identifiers in your master datasets
are used to record who is who.
Planning ahead for this when the monitoring activity is added to the data linkage table
-is the best protection from ending up with poor correlation
+is the best protection from ending up with poor correlation
between treatment uptake and treatment assignment,
without a way to tell if the poor correlation
is just a result of a fuzzy link between monitoring data and the rest of your data.

%-----------------------------------------------------------------------------------------------
\section{Research variables created by randomization}

-Random sampling and treatment assignment are two research activities
-at the core elements of research design,
+Random sampling and treatment assignment are two research activities
+at the core of research design
that generate research variables.
-In experimental methods,
+In experimental methods,
random sampling and treatment assignment directly determine
the set of individuals who are going to be observed
and what their status will be for the purpose of effect estimation.
-In quasi-experimental methods,
+In quasi-experimental methods,
random sampling determines what populations
the study will be able to make meaningful inferences about,
and random treatment assignment creates counterfactuals.
\textbf{Randomization}\sidenote{
-	\textbf{Randomization} is often used interchangeably
+	\textbf{Randomization} is often used interchangeably
	to mean random treatment assignment.
-	In this book however, \textit{randomization} will only
+	In this book however, \textit{randomization} will only
	be used to describe the process of generating
-	a sequence of unrelated numbers, i.e. 
a random process. - \textit{Randomization} will never be used to mean + a sequence of unrelated numbers, i.e. a random process. + \textit{Randomization} will never be used to mean the process of assigning units in treatment and control groups, that will always be called \textit{random treatment assignment}, - or a derivative thereof.} + or a derivative thereof.} is used to ensure that a sample is representative and that any treatment and control groups are statistically indistinguishable after treatment assignment. @@ -591,7 +464,7 @@ \subsection{Randomizing sampling and treatment assignment} This process can be used, for example, to select a subset from all eligible units to be included in data collection when the cost of collecting data on everyone is prohibitive.\sidenote{ \url{https://dimewiki.worldbank.org/Sample\_Size\_and\_Power\_Calculations}} -But it can also be used to select a sub-sample of your observations to test a computationally heavy process +But it can also be used to select a sub-sample of your observations to test a computationally heavy process before running it on the full data. \textbf{Randomized treatment assignment} is the process of assigning observations to different treatment arms. This process is central to experimental research design. @@ -607,20 +480,20 @@ \subsection{Randomizing sampling and treatment assignment} This list should always be your \textbf{master dataset} when possible, and the result should always be saved in the master dataset before merged to any other data. -One example of the rare exceptions -when master datasets cannot be used is +One example of the rare exceptions +when master datasets cannot be used is when sampling must be done in real time -- -for example, randomly sampling patients +for example, randomly sampling patients as they arrive at a health facility. -In those cases, +In those cases, it is important that you collect enough data during the real time sampling, -such that you can add these individuals, -and the result of the sampling, +such that you can add these individuals, +and the result of the sampling, to your master dataset afterwards. % implement uniform-probability random sampling -The simplest form of sampling is +The simplest form of sampling is \textbf{uniform-probability random sampling}. This means that every eligible observation in the master dataset has an equal probability of being selected. @@ -670,7 +543,7 @@ \subsection{Programming reproducible random processes} This section introduces strict rules: these are non-negotiable (but thankfully simple). Stata, like most statistical software, uses a \textbf{pseudo-random number generator} -which, in ordinary research use, +which, in ordinary research use, produces sequences of number that are as good as random.\sidenote{ \url{https://dimewiki.worldbank.org/Randomization\_in\_Stata}} However, for \textit{reproducible} randomization, we need two additional properties: @@ -751,19 +624,19 @@ \subsection{Programming reproducible random processes} that randomized assignment results be revealed in the field. It is possible to do this using survey software or live events, such as a live lottery. These methods typically do not leave a record of the randomization, -and as such are never reproducible. -However, you can often run your randomization in advance +and as such are never reproducible. +However, you can often run your randomization in advance even when you do not have list of eligible units in advance. 
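One way to do so is to randomize over a list of anonymous slots
rather than over named units,
and to prepare that list in advance with a short, reproducible script.
A minimal sketch in Stata is shown below;
the version number is only an example,
and the seed, number of slots, and sampling share are hypothetical placeholders.

\begin{verbatim}
* Pre-generate a random ordering of "in sample" / "not in sample" slots
* for real-time sampling (seed, slots, and share are placeholders)
version 13.1          // fix the random-number generator version
set seed 287104       // seed drawn once from a truly random source
clear
set obs 500           // number of slots on the field list
gen slot      = _n
gen random    = runiform()
gen in_sample = (random <= 0.10)   // 10 percent sampling probability
keep slot in_sample
\end{verbatim}

Because the software version and the seed are fixed,
re-running this script reproduces exactly the same list.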
-Let's say you want to, at various health facilities, +Let's say you want to, at various health facilities, randomly select a sub-sample of patients as they arrive. -You can then have a pre-generated list +You can then have a pre-generated list with a random order of ''in sample'' and ''not in sample''. Your field staff would then go through this list in order and cross off one randomized result as it is used for a patient. This is especially beneficial if you are implementing a more complex randomization, -for example, sample 10\% of the patients, show a video for 50\% of the sample, -and ask a longer version of the questionnaire to 20\% of both +for example, sample 10\% of the patients, show a video for 50\% of the sample, +and ask a longer version of the questionnaire to 20\% of both the group of patients that watch the video and those that did not. The real time randomization is much more likely to be implemented correctly, if your field staff simply can follow a list with the randomized categories @@ -774,7 +647,7 @@ \subsection{Programming reproducible random processes} Finally, if this real-time randomization implementation is done using survey software, then the pre-generated list of randomized categories can be preloaded into the questionnaire. -Then the field team can follow a list of respondent IDs +Then the field team can follow a list of respondent IDs that are randomized into the appropriate categories, and the survey software can show a video and control which version of the questionnaire is asked. This way, you reduce the risk of errors in field randomization. @@ -906,12 +779,12 @@ \section{Doing power calculations for research design} so such a study would never be able to say anything about the effect size that is practically relevant. Conversely, the \textbf{minimum sample size} pre-specifies expected effect sizes and tells you how large a study's sample would need to be to detect that effect, -which can tell you what resources you would need +which can tell you what resources you would need to implement a useful study. % what is randomization inference \textbf{Randomization inference}\sidenote{ - \url{https://dimewiki.worldbank.org/Randomization\_Inference}} + \url{https://dimewiki.worldbank.org/Randomization\_Inference}} is used to analyze the likelihood \index{randomization inference} that the randomized assignment process, by chance, From 73c2c2c1acc688590b13800e9cca00ed38e23d32 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 20 Aug 2020 16:33:09 -0400 Subject: [PATCH 24/41] [ch3] flow chart -> flowchart --- chapters/3-measurement.tex | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index 9fc167ed6..18ceba5fb 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -17,7 +17,7 @@ The template includes: one data linkage table, one or several master datasets, and -one or several data flow charts. +one or several data flowcharts. These three tools will help to communicate the project's data requirements both across the team and across time. 
This section also discusses what specific research data you need @@ -59,7 +59,7 @@ \section{Translating research design to a data plan} DIME's data plan template has three components: one \textit{data linkage table},\index{Data linkage table} one or several \textit{master datasets}\index{Master datasets} -and one or several \textit{data flow charts}.\index{Data flowchart} +and one or several \textit{data flowcharts}.\index{Data flowchart} A \textbf{data linkage table}\sidenote{ \url{https://dimewiki.worldbank.org/Data\_Linkage\_Table}} @@ -73,7 +73,7 @@ \section{Translating research design to a data plan} and are the authoritative source for all research data such as unique identifiers, sample status and treatment assignment (the following two sections of this chapter discuss how to generate these variables). -\textbf{Data flow charts}\sidenote{ +\textbf{Data flowcharts}\sidenote{ \url{https://dimewiki.worldbank.org/Data\_Flow\_Chart}} list all datasets that are needed to create each analysis dataset, and what manipulation of these data sources is necessary @@ -189,14 +189,14 @@ \subsection{Creating a data plan} it serves as an unambiguous method of mapping the observations in your study to your research design. -The third and final step in creating the data plan is to create \textbf{data flow charts}. +The third and final step in creating the data plan is to create \textbf{data flowcharts}. Each analysis dataset (see Chapter 6 for discussion on why you likely need multiple analysis datasets) -should have a data flow chart showing how it was created. -The flow chart is a diagram +should have a data flowchart showing how it was created. +The flowchart is a diagram where each starting point is either a master dataset or a dataset listed in the data linkage table. -The data flow chart should include instructions on how +The data flowchart should include instructions on how the datasets can be combined to create the analysis dataset. The operations used to combine the data could include: appending, one-to-one merging, @@ -205,8 +205,8 @@ \subsection{Creating a data plan} should be used in each operation, and note whether the operation creates a new variable or combination of variables to identify the newly linked data. -Once you have acquired the datasets listed in the flow chart, -you can add to the data flow charts the number of observations that +Once you have acquired the datasets listed in the flowchart, +you can add to the data flowcharts the number of observations that the starting point dataset has and the number of observation each resulting datasets should have after each operation. @@ -214,12 +214,12 @@ \subsection{Creating a data plan} the operations used to combine datasets did not create unwanted duplicates or incorrectly drop any observations. -A data flow chart can be created in a flow chart drawing tool +A data flowchart can be created in a flowchart drawing tool (there are many free alternatives online) or by using shapes in Microsoft PowerPoint. You can also do this simply by drawing on a piece of paper and taking a photo, but we recommend a digital tool -so that flow charts can easily be updated over time. +so that flowcharts can easily be updated over time. 
\subsection{Defining research variables related to your research design} From 173d1d6657ea8a3cd4e79c3a1bd1d55b915b7cbc Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 20 Aug 2020 16:35:49 -0400 Subject: [PATCH 25/41] [ch3] data source in flowchart --- chapters/3-measurement.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index 18ceba5fb..4d3aaf972 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -75,7 +75,7 @@ \section{Translating research design to a data plan} (the following two sections of this chapter discuss how to generate these variables). \textbf{Data flowcharts}\sidenote{ \url{https://dimewiki.worldbank.org/Data\_Flow\_Chart}} -list all datasets that are needed to create each analysis dataset, +list all data sources that are needed to create each analysis dataset, and what manipulation of these data sources is necessary to get to the final analysis dataset(s), such as merging, appending, or other linkages. From faa620cb5c4fee35dff4d01282961318a9bbd19a Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 20 Aug 2020 16:38:37 -0400 Subject: [PATCH 26/41] [ch3] measure variable def - record responses You do not measure responses, you record them --- chapters/3-measurement.tex | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index 4d3aaf972..779fadb13 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -136,9 +136,8 @@ \subsection{Creating a data plan} ID variables, sampling status, treatment status, and treatment uptake.} but not include any \textbf{measurement variables}\sidenote{ \textbf{Measurement variables:} Data that - corresponds to direct observations of the real world, - recorded sentiments of the subjects of the research - or any other aspect your project is studying. + data the measures attributes or + record responses of research subjects. Measurement variables are not controlled by the research team and often vary over time. Examples include characteristics of the research subject, From 4ac1bd84aaf99f9b8d09c6004068ea136cc92f5a Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 20 Aug 2020 16:48:30 -0400 Subject: [PATCH 27/41] [ch3] measurement vars def - typos --- chapters/3-measurement.tex | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index 779fadb13..a62bc01df 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -134,14 +134,13 @@ \subsection{Creating a data plan} often, but not always, controlled by the research team. Examples include ID variables, sampling status, treatment status, and treatment uptake.} -but not include any \textbf{measurement variables}\sidenote{ +but not include any \textbf{measurement variables}.\sidenote{ \textbf{Measurement variables:} Data that - data the measures attributes or - record responses of research subjects. + measures attributes or records responses of research subjects. Measurement variables are not controlled by the research team and often vary over time. Examples include characteristics of the research subject, - outcome variables, input variables among many others.}. + outcome variables, input variables among many others.} Research variables and measurement variables often come from the same source, but should not be stored in the same way. 
From 40875f5d9606bee420c91f583c52d33b0a19279c Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 20 Aug 2020 16:57:15 -0400 Subject: [PATCH 28/41] [ch3] research and measure variables in intro --- chapters/3-measurement.tex | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index a62bc01df..1aecfc959 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -10,6 +10,13 @@ You need to understand how to structure the project's data to best answer the research questions, and create the tools to share this understanding across your team. +We will discuss two types of variables: +variables that tie your research design +to the observations in the data +which we call \textbf{research variables}; +and variables that correspond to observations of the real world, +which we call \textbf{measurement variables}. +The project's data plan needs to account for both. The first section of this chapter discusses how to determine your project's data needs, From 03cdf2e636c45e54e5dbb078b3705cc534cedd0f Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 20 Aug 2020 17:37:09 -0400 Subject: [PATCH 29/41] [ch3] data map vision --- chapters/3-measurement.tex | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index 1aecfc959..5a9e6d7fb 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -15,7 +15,7 @@ to the observations in the data which we call \textbf{research variables}; and variables that correspond to observations of the real world, -which we call \textbf{measurement variables}. +which we call \textbf{measurement variables}. The project's data plan needs to account for both. The first section of this chapter discusses how to @@ -89,9 +89,11 @@ \section{Translating research design to a data plan} \subsection{Creating a data plan} -The data plan is the best tool a research team has for -the lead researchers to communicate their vision for the data work, -and for the research assistants to communicate their understanding of that vision. +The process of drafting the data plan is itself useful, +as it is an opportunity for the principal investigators +to communicate their vision of the data environment, +and for research assistants to communicate +their understanding of that vision. The data plan should be drafted at the outset of a project, before any data is acquired, but it is not a static document; From 27dee583f8c60b28fee2d31d9275c6b7d82b02af Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 20 Aug 2020 17:39:14 -0400 Subject: [PATCH 30/41] [ch3] DIME <3 IE --- chapters/3-measurement.tex | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index 5a9e6d7fb..1cd089641 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -241,7 +241,9 @@ \subsection{Defining research variables related to your research design} where you will find more details and specific references for common impact evaluation methods. -The research designs discussed here compare a group that received +As DIME primarily works on impact evaluations, +we focus our discussion here on research designs +that compare a group that received some kind of \textbf{treatment}\index{Treatment}\sidenote{ \textbf{Treatment:} The general word for the evaluated intervention or event. 
This includes things like being offered a training, From d305ab2c71522ecaa20290059ca31a5ea0bbfca2 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 20 Aug 2020 17:40:56 -0400 Subject: [PATCH 31/41] [ch3] all are research vars --- chapters/3-measurement.tex | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index 1cd089641..c4fd50095 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -290,10 +290,8 @@ \subsection{Defining research variables related to your research design} differ between those different methods. What does not differ, however, is that these data requirements are all research variables. -The source for the required research variables varies -between research designs and between projects, -but the authoritative source for that type of data should -always be a master dataset. +And that the research variables discussed below +should always be included in the master dataset. You will often have to merge the research variables to other datasets, but that is an easy task From b6f584c8998d0e1052aa1d5ab137a10a2e2122cc Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 20 Aug 2020 17:50:14 -0400 Subject: [PATCH 32/41] [ch3] research vars in master, measurement vars in cleaned --- chapters/3-measurement.tex | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index c4fd50095..f6bbaf4d0 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -154,11 +154,12 @@ \subsection{Creating a data plan} often come from the same source, but should not be stored in the same way. For example, if you acquire administrative data that both includes -information on eligibility to be included in the study (research variable) +information on eligibility for the study (research variable) and data on the topic of your study (measurement variable) you should first decide which variables are research variables, -remove them during the data cleaning (see Chapter 6) -and instead store them in your master dataset. +and store them in the master dataset, +while storing the measurement variables in a cleaned dataset +as described in Chapter 5. It is common that you will have to update your master datasets throughout your project. From 99df373f2f2ded7f10c4096891e1125854f5d4a5 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 20 Aug 2020 17:52:04 -0400 Subject: [PATCH 33/41] [ch3] meas vars - examples --- chapters/3-measurement.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index f6bbaf4d0..f34641152 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -149,7 +149,7 @@ \subsection{Creating a data plan} Measurement variables are not controlled by the research team and often vary over time. Examples include characteristics of the research subject, - outcome variables, input variables among many others.} + outcome variables, and control variables among many others.} Research variables and measurement variables often come from the same source, but should not be stored in the same way. 
From 651e17a809e0d91627582f57a4f8e1b3b8e23dc2 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 20 Aug 2020 18:15:37 -0400 Subject: [PATCH 34/41] [ch3] reintroducing design darlings --- chapters/3-measurement.tex | 20 ++++++++++++++++---- 1 file changed, 16 insertions(+), 4 deletions(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index f34641152..4ef7d8313 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -358,13 +358,25 @@ \subsection{Defining research variables related to your research design} \url{https://dimewiki.worldbank.org/Instrumental\_Variables}} \index{instrumental variables} designs, the \textbf{instruments} influence the \textit{probability} of treatment. -These research variables should be collected and stored in master data. +These research variables should be collected +and stored in the master dataset. +Both the running variable in RD designs +and the instruments in IV designs, +are among the rare examples of research variables +that may vary over time. +In such cases your research design should +ex-ante clearly indicate what point of time they will be recorded, +and this should be clearly documented in your master dataset. + In \textbf{matching} designs, observations are often grouped by a strata, grouping, index, or propensity score.\sidenote{ \url{https://dimewiki.worldbank.org/Matching}} -While you need not include the undelying variables in master data, -the match information itself is part of the experimental design -and should usually be recorded there. +Like all research variables, the matching results +should be stored in the master dataset. +This is best done by assigning a matching ID +to each matched pair or group, +and create a variable in the master dataset +with the matching ID each unit belongs to. In all these cases, fidelity to the design is important to note as well. A program intended for students that scored under 50\% on a test might have some cases where the program is offered to someone that scored 51\% at the test, From a3757770c2ff11e49e2498348cc541c24bfa8201 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 20 Aug 2020 18:33:44 -0400 Subject: [PATCH 35/41] [ch3] bring back time --- chapters/3-measurement.tex | 52 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 52 insertions(+) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index 4ef7d8313..f1870dc40 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -384,6 +384,58 @@ \subsection{Defining research variables related to your research design} Differences between assignments and realizations should also be recorded in the master datasets. +%----------------------------------------------------------------------------------------------- +\subsection{Time periods in data maps} + +Your data map should also take into consideration +whether you are using data from one time period or several. +A study that observes data in only one time period is called +a \textbf{cross-sectional study}. +\index{cross-sectional data} +This type of data is relatively easy to collect and handle because +you do not need to track individuals across time, +and therefore requires no additional information in your data map. +Instead, the challenge in a cross-sectional study is to +show that the control group is indeed a valid counterfactual to the treatment group. 
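A balance check on baseline characteristics is one common way to support that claim. The sketch below is illustrative only: it uses Python, the dataset and variable names are hypothetical, and the appropriate test and covariates depend on your design.

\begin{verbatim}
import pandas as pd
from scipy import stats

# Hypothetical baseline dataset with a treatment indicator and covariates.
df = pd.read_csv("baseline.csv")
covariates = ["hh_size", "income", "age_head"]

for var in covariates:
    treated = df.loc[df["treatment"] == 1, var].dropna()
    control = df.loc[df["treatment"] == 0, var].dropna()
    t, p = stats.ttest_ind(treated, control, equal_var=False)  # Welch's t-test
    print(f"{var}: treated mean = {treated.mean():.2f}, "
          f"control mean = {control.mean():.2f}, p-value = {p:.3f}")
\end{verbatim}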
+ +Observations over multiple time periods, +referred to as \textbf{longitudinal data}\index{longitudinal data}, +can consist of either +\textbf{repeated cross-sections}\index{repeated cross-sectional data} +or \textbf{panel data}\index{panel data}. +In repeated cross-sections, +each successive round of data collection uses a new random sample +of observations from the treatment and control groups, +but in a panel data study +the same observations are tracked and included each round. +If each round of data collection is a separate activity, +then they should be treated as separate sources of data +and get their own row in the data linkage table. +If the data is continuously collected, +or at frequent intervals, +then it can be treated as a single data source. +The data linkage table must document +how the different rounds will be merged or appended +when panel data is collected in separate activities. + +You must keep track of the \textit{attrition rate} in panel data, +which is the share of observations not observed in follow-up data. +It is common that the observations not possible to track +can be correlated with the outcome you study. +For example, poorer households may live in more informal dwellings, +patients with worse health conditions might not survive to follow-up, +and so on. +If this is the case, +then your results might only be an effect of your remaining sample +being a subset of the original sample +that were better or worse off from the beginning. +You should have a variable in your master dataset + that indicates attrition. +A balance check using the attrition variable +can provide insights as to whether the lost observations +were systematically different +compared to the rest of the sample. + %----------------------------------------------------------------------------------------------- \subsection{Monitoring data} From 2141b74521f58147f7ddff828606dea614cbc40e Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Thu, 20 Aug 2020 18:35:43 -0400 Subject: [PATCH 36/41] [ch3] new sections & data plan -> data map --- chapters/3-measurement.tex | 50 +++++++++++++++++++++----------------- 1 file changed, 28 insertions(+), 22 deletions(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index f1870dc40..b1286616b 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -10,17 +10,10 @@ You need to understand how to structure the project's data to best answer the research questions, and create the tools to share this understanding across your team. -We will discuss two types of variables: -variables that tie your research design -to the observations in the data -which we call \textbf{research variables}; -and variables that correspond to observations of the real world, -which we call \textbf{measurement variables}. -The project's data plan needs to account for both. The first section of this chapter discusses how to determine your project's data needs, -and introduces DIME's data plan template. +and introduces \textit{DIME's Data Map} template. The template includes: one data linkage table, one or several master datasets, and @@ -29,7 +22,14 @@ both across the team and across time. This section also discusses what specific research data you need based on your project's research design, -and how to document those data needs in the data plan. +and how to document those data needs in the data map. 
+We will discuss two types of variables: +variables that tie your research design +to the observations in the data +which we call \textbf{research variables}; +and variables that correspond to observations of the real world, +which we call \textbf{measurement variables}. +The project's data map needs to account for both. The second section of this chapter covers two activities where research data is created by the research team @@ -47,7 +47,7 @@ %----------------------------------------------------------------------------------------------- -\section{Translating research design to a data plan} +\section{Creating a data map} In most projects, more than one data source is needed to answer the research question. These could be data from multiple survey rounds, @@ -61,9 +61,9 @@ \section{Translating research design to a data plan} but your whole research team is unlikely to have the same understanding, at all times, of all the datasets required. The only way to make sure that the full team shares the same understanding -is to create a \textbf{data plan}\index{Data plan}.\sidenote{ - \url{https://dimewiki.worldbank.org/Data\_Plan}} -DIME's data plan template has three components: +is to create a \textbf{data map}\index{Data map}.\sidenote{ + \url{https://dimewiki.worldbank.org/Data\_Map}} +DIME's data map template has three components: one \textit{data linkage table},\index{Data linkage table} one or several \textit{master datasets}\index{Master datasets} and one or several \textit{data flowcharts}.\index{Data flowchart} @@ -87,19 +87,19 @@ \section{Translating research design to a data plan} to get to the final analysis dataset(s), such as merging, appending, or other linkages. -\subsection{Creating a data plan} - -The process of drafting the data plan is itself useful, +The process of drafting the data map is itself useful, as it is an opportunity for the principal investigators to communicate their vision of the data environment, and for research assistants to communicate their understanding of that vision. -The data plan should be drafted at the outset of a project, +The data map should be drafted at the outset of a project, before any data is acquired, but it is not a static document; it will need to be updated as the project evolves. -To create a data plan according to DIME's template, +\subsection{Data linkage table} + +To create a data map according to DIME's template, the first step is to create a \textbf{data linkage table} by listing all the data sources you know you will use in a spreadsheet. If one source of data will result in two different datasets, @@ -130,7 +130,9 @@ \subsection{Creating a data plan} such as the source of your data, its backup locations, the nature of the data license, and so on. -The second step in creating a data plan is to create one \textbf{master dataset} +\subsection{Master datasets} + +The second step in creating a data map is to create one \textbf{master dataset} for each unit of observation that will be used in any significant research activity. Examples of such activities are data collection, data analysis, @@ -197,7 +199,9 @@ \subsection{Creating a data plan} it serves as an unambiguous method of mapping the observations in your study to your research design. -The third and final step in creating the data plan is to create \textbf{data flowcharts}. +\subsection{Data flowcharts} + +The third and final step in creating the data map is to create \textbf{data flowcharts}. 
Each analysis dataset (see Chapter 6 for discussion on why you likely need multiple analysis datasets) should have a data flowchart showing how it was created. @@ -229,9 +233,9 @@ \subsection{Creating a data plan} but we recommend a digital tool so that flowcharts can easily be updated over time. -\subsection{Defining research variables related to your research design} +\section{Relate research design to a data map} -After you have set up your data plan, +After you have set up your data map, you need to carefully think about your research design and which research variables you will need in the data analysis to infer the relation between differences in measurement variables @@ -242,6 +246,8 @@ \subsection{Defining research variables related to your research design} where you will find more details and specific references for common impact evaluation methods. +\subsection{Defining research variables related to your research design} + As DIME primarily works on impact evaluations, we focus our discussion here on research designs that compare a group that received From bce35bd9d55aee34e34d7583104642652b2c0333 Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Wed, 26 Aug 2020 10:10:01 -0400 Subject: [PATCH 37/41] [ch3] why list everyone ever encountered --- chapters/3-measurement.tex | 26 ++++++++++++++++---------- 1 file changed, 16 insertions(+), 10 deletions(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index b1286616b..a442278d7 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -74,12 +74,13 @@ \section{Creating a data map} Its most important function is to indicate how all those datasets can be be linked when combining information from multiple data sources. -\textbf{Master datasets}\sidenote{ +Each \textbf{master dataset}\sidenote{ \url{https://dimewiki.worldbank.org/Master\_Data\_Set}} -list all observations your project ever encounter +list all observations relevant to the project +for a \textbf{unit of observation}\sidenote{ + \url{https://dimewiki.worldbank.org/Unit\_of\_Observation}} and are the authoritative source for all research data -such as unique identifiers, sample status and treatment assignment -(the following two sections of this chapter discuss how to generate these variables). +such as unique identifiers, sample status and treatment assignment. \textbf{Data flowcharts}\sidenote{ \url{https://dimewiki.worldbank.org/Data\_Flow\_Chart}} list all data sources that are needed to create each analysis dataset, @@ -104,8 +105,7 @@ \subsection{Data linkage table} all the data sources you know you will use in a spreadsheet. If one source of data will result in two different datasets, then list each dataset on its own row. -For each dataset, list the \textbf{unit of observation}\sidenote{ - \url{https://dimewiki.worldbank.org/Unit\_of\_Observation}}, +For each dataset, list the , unit of observation and the name of the project ID variable for that unit of observation. Your project should only have one project ID variable per unit of observation. When you list a dataset in the data linkage table -- @@ -181,10 +181,16 @@ \subsection{Master datasets} as you are otherwise not in control over who can re-identify your de-identified dataset. -You should include all observations ever encountered -in your master datasets, -even if they are not eligible for your study. 
-This is because, if you ever need to perform a record linkage such as a fuzzy match +While the master dataset starting point often is a sampling frame +(more on sampling frames later in this chapter), +you should continuously update it with +all observations ever encountered in your project, +even if those observations are not eligible for your study. +Examples includes new observations listed during monitoring activities +or observations that respondents in your study mention in, +for example, a social network module. +This is because, +if you ever need to perform a record linkage such as a fuzzy match on string variables like proper names, you will make fewer errors the more information you have. If you ever need to do a fuzzy match, From dd5fb8ae2207f5cd9d9a8baa43a634ae2edd985a Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Wed, 26 Aug 2020 10:45:41 -0400 Subject: [PATCH 38/41] [ch3] project IDs --- chapters/3-measurement.tex | 43 +++++++++++++++++++++++++------------- 1 file changed, 29 insertions(+), 14 deletions(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index a442278d7..8726d082d 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -166,20 +166,34 @@ \subsection{Master datasets} your master datasets throughout your project. The most important function of the master dataset -is to be the authoritative source for the project ID. -This means that all observations listed -should be uniquely and fully identified by the included project ID variable.\sidenote{ - \url{https://dimewiki.worldbank.org/ID\_Variable\_Properties}} -You should also list all other identifiers used in your project, -such as names, addresses, or other IDs used by partner organizations, -and the master datasets will then serve as -the linkage between those identifiers and the project ID. -Because of this, master datasets must, -with very few exceptions, always be encrypted. -Even when a partner organization has a unique identifier, -you should always create a project ID specific to your project only, -as you are otherwise not in control over -who can re-identify your de-identified dataset. +is to be the authoritative source +for how all observations are identified. +This means that the master datasets should include +identifying information such as names, contact information, +but also your \textbf{project ID}.\sidenote{ + \textbf{Project ID:} The main ID used in your project to identify + observations. + You should never have multiple project IDs for the same unit of observation. + The project ID must be uniquely and fully identified all observations in the project. + See \url{https://dimewiki.worldbank.org/ID\_Variable\_Properties} for more details.} +The project ID should be the ID variable used in the data linkage table, +and is therefore how observations are linked across datasets. +Your master dataset may list alternative IDs used, +for example, by a partner organization. +However, you should not use such an ID as your project ID, +as you would then not be in control over +who can re-identify data that you publish. +The project ID must be created by the project team, +and the linkage to direct identifiers +should only be known to people listed on the IRB. +If you receive a dataset with an alternative ID, +you should immediately replaced it with your project ID, +and the alternative ID should be dropped +as a part of your de-identification (see Chapters 5 and 7). 
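As an illustration of that workflow, the sketch below (Python, with hypothetical file, ID, and variable names) uses the master dataset as the only crosswalk between a partner's ID and the project ID, and drops the partner ID before the data is used further.

\begin{verbatim}
import pandas as pd

# Hypothetical names; the master dataset is the only place where the
# partner's ID and the project ID are stored together.
incoming  = pd.read_csv("partner_data_raw.csv")   # has partner_id only
crosswalk = pd.read_csv("master_clinic.csv")[["partner_id", "project_id"]]

linked = incoming.merge(crosswalk, on="partner_id",
                        how="left", validate="m:1", indicator=True)

# Resolve any observation without a project ID (and add it to the
# master dataset if needed) before using the data.
unmatched = linked[linked["_merge"] == "left_only"]
assert unmatched.empty, f"{len(unmatched)} observations have no project ID"

# Drop the alternative ID as part of de-identification.
linked = linked.drop(columns=["partner_id", "_merge"])
linked.to_csv("partner_data_with_project_id.csv", index=False)
\end{verbatim}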
+Your master dataset serves as the linkage between +all other identifying information and your project ID. +Since your master dataset is full of identifying information, +it must always be stored encrypted. While the master dataset starting point often is a sampling frame (more on sampling frames later in this chapter), @@ -199,6 +213,7 @@ \subsection{Master datasets} You should not do anything with that dataset until you have successfully merged the project IDs from the master dataset. +Any new observations that you Since the master datasets is the authoritative source of the project ID and all research variables, From 9182ca3a07adf589000e2231349fb7c379dfb6fe Mon Sep 17 00:00:00 2001 From: kbjarkefur Date: Wed, 26 Aug 2020 10:57:44 -0400 Subject: [PATCH 39/41] [ch3] ids in flowcharts --- chapters/3-measurement.tex | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index 8726d082d..3ccc0f429 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -234,10 +234,13 @@ \subsection{Data flowcharts} The operations used to combine the data could include: appending, one-to-one merging, many-to-one or one-to-many merging, collapsing, or a broad variety of others. -You must list which ID variable or set of ID variables +You must list which variable or set of variables should be used in each operation, and note whether the operation creates a new variable or combination of variables to identify the newly linked data. +These variables should be project IDs when possible. +Examples of exception are time variables in longitudinal data, +and sub-units like farm plots that belong to farmers with project IDs. Once you have acquired the datasets listed in the flowchart, you can add to the data flowcharts the number of observations that the starting point dataset has From fceb7dd76920ef8a5469f6b66e1b504f32bd8497 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kristoffer=20Bj=C3=A4rkefur?= Date: Wed, 26 Aug 2020 11:00:56 -0400 Subject: [PATCH 40/41] [ch3] ben review Co-authored-by: Benjamin Daniels --- chapters/3-measurement.tex | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index 3ccc0f429..ca84b0960 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -76,11 +76,12 @@ \section{Creating a data map} combining information from multiple data sources. Each \textbf{master dataset}\sidenote{ \url{https://dimewiki.worldbank.org/Master\_Data\_Set}} -list all observations relevant to the project +lists all observations relevant to the project for a \textbf{unit of observation}\sidenote{ \url{https://dimewiki.worldbank.org/Unit\_of\_Observation}} -and are the authoritative source for all research data -such as unique identifiers, sample status and treatment assignment. +and is the authoritative source for all research data +about that unit of observation +including unique identifiers, sample status, and treatment assignment. \textbf{Data flowcharts}\sidenote{ \url{https://dimewiki.worldbank.org/Data\_Flow\_Chart}} list all data sources that are needed to create each analysis dataset, @@ -105,7 +106,7 @@ \subsection{Data linkage table} all the data sources you know you will use in a spreadsheet. If one source of data will result in two different datasets, then list each dataset on its own row. 
-For each dataset, list the , unit of observation +For each dataset, list the unit of observation and the name of the project ID variable for that unit of observation. Your project should only have one project ID variable per unit of observation. When you list a dataset in the data linkage table -- @@ -187,7 +188,7 @@ \subsection{Master datasets} and the linkage to direct identifiers should only be known to people listed on the IRB. If you receive a dataset with an alternative ID, -you should immediately replaced it with your project ID, +you should immediately replace it with your project ID, and the alternative ID should be dropped as a part of your de-identification (see Chapters 5 and 7). Your master dataset serves as the linkage between @@ -200,7 +201,7 @@ \subsection{Master datasets} you should continuously update it with all observations ever encountered in your project, even if those observations are not eligible for your study. -Examples includes new observations listed during monitoring activities +Examples include new observations listed during monitoring activities or observations that respondents in your study mention in, for example, a social network module. This is because, From 9a1547cb61395e4fbfebb18fadd621445f4137f2 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kristoffer=20Bj=C3=A4rkefur?= Date: Wed, 26 Aug 2020 11:42:48 -0400 Subject: [PATCH 41/41] [ch3] maria review Co-authored-by: Maria --- chapters/3-measurement.tex | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/chapters/3-measurement.tex b/chapters/3-measurement.tex index ca84b0960..d9ab2419b 100644 --- a/chapters/3-measurement.tex +++ b/chapters/3-measurement.tex @@ -175,13 +175,13 @@ \subsection{Master datasets} \textbf{Project ID:} The main ID used in your project to identify observations. You should never have multiple project IDs for the same unit of observation. - The project ID must be uniquely and fully identified all observations in the project. + The project ID must uniquely and fully identify all observations in the project. See \url{https://dimewiki.worldbank.org/ID\_Variable\_Properties} for more details.} -The project ID should be the ID variable used in the data linkage table, +The project ID is the ID variable used in the data linkage table, and is therefore how observations are linked across datasets. -Your master dataset may list alternative IDs used, +Your master dataset may list alternative IDs that are used, for example, by a partner organization. -However, you should not use such an ID as your project ID, +However, you must not use such an ID as your project ID, as you would then not be in control over who can re-identify data that you publish. The project ID must be created by the project team, @@ -194,20 +194,20 @@ \subsection{Master datasets} Your master dataset serves as the linkage between all other identifying information and your project ID. Since your master dataset is full of identifying information, -it must always be stored encrypted. +it must always be encrypted. -While the master dataset starting point often is a sampling frame -(more on sampling frames later in this chapter), -you should continuously update it with +The starting point for the master dataset is typically a sampling frame +(more on sampling frames later in this chapter). +However, you should continuously update the master dataset with all observations ever encountered in your project, -even if those observations are not eligible for your study. 
+even if those observations are not eligible for the study. Examples include new observations listed during monitoring activities -or observations that respondents in your study mention in, -for example, a social network module. -This is because, +or observations that are connected to respondents in the study, +for example in a social network module. +This is useful because, if you ever need to perform a record linkage such as a fuzzy match on string variables like proper names, -you will make fewer errors the more information you have. +the more information you have the fewer errors you are likely to make If you ever need to do a fuzzy match, you should always do that between the master dataset and the dataset without an unambiguous identifier.