From d675a87c88f77c218ffa70d9baa970be3b4b0377 Mon Sep 17 00:00:00 2001
From: Luiza
Date: Wed, 17 Feb 2021 18:38:23 -0500
Subject: [PATCH] #476 : "raw data" and "data set"
---
 00-introduction.Rmd    | 2 +-
 01-reproducibility.Rmd | 6 ++--
 02-collaboration.Rmd   | 6 ++--
 03-measurement.Rmd     | 12 ++++----
 04-acquisition.Rmd     | 18 ++++++------
 05-processing.Rmd      | 64 +++++++++++++++++++++---------------------
 06-analysis.Rmd        | 8 +++---
 07-publication.Rmd     | 4 +--
 8 files changed, 60 insertions(+), 60 deletions(-)

diff --git a/00-introduction.Rmd b/00-introduction.Rmd
index aae971ff..eb58d5c0 100644
--- a/00-introduction.Rmd
+++ b/00-introduction.Rmd
@@ -18,7 +18,7 @@ This book aims to be a highly practical resource so the reader can immediately b
 **Chapter \@ref(processing)** describes data processing tasks. It details how to construct "tidy" data at the appropriate units of analysis, how to ensure uniquely identified datasets, and how to routinely incorporate data quality checks into the workflow. It also provides guidance on de-identification and cleaning of personally-identified data, focusing on how to understand and structure data so that it is ready for indicator construction and analytical work.
 
-**Chapter \@ref(analysis)** discusses data analysis tasks. It begins with data construction, or the creation of new variables from the raw data acquired or collected in the field. It introduces core principles for writing analytical code and creating, exporting, and storing research outputs such as figures and tables reproducibly using dynamic documents.
+**Chapter \@ref(analysis)** discusses data analysis tasks. It begins with data construction, or the creation of new variables from the original data acquired or collected in the field. It introduces core principles for writing analytical code and creating, exporting, and storing research outputs such as figures and tables reproducibly using dynamic documents.
**Chapter \@ref(publication)** outlines the publication of research outputs, including manuscripts, code, and data. This chapter discusses how to effectively collaborate on technical writing using dynamic documents. It also covers how and why to publish datasets in an accessible, citable, and safe fashion. Finally, it provides guidelines for preparing functional and informative reproducibility packages that contain all the code, data, and meta-information needed for others to evaluate and reproduce your work. diff --git a/01-reproducibility.Rmd b/01-reproducibility.Rmd index f82775b8..b6a90159 100644 --- a/01-reproducibility.Rmd +++ b/01-reproducibility.Rmd @@ -429,8 +429,8 @@ the **original data** (including corrections)^[ that becomes the functional basis for research work.]\index{original data} should be immediately placed in a secure permanent storage system. Before analytical work begins, you should create a "for-publication" -copy of the original dataset by removing potentially identifying information.\index{de-identification} -This will become the raw data, and must be +copy of the acquired dataset by removing potentially identifying information.\index{de-identification} +This will become the original data, and must be placed in an archival repository where it can be cited.^[@vilhuber2020report]\index{data publication} This can initially be done under embargo or with limited release, in order to protect your data and future work. @@ -450,7 +450,7 @@ provide specific repositories in which they require the deposit of data they fun and you should take advantage of these when possible. If this is not provided, you must be aware of privacy issues with directly identifying data and questions of data ownership -before uploading raw data to any third-party server, whether public or not;\index{data ownership} +before uploading original data to any third-party server, whether public or not;\index{data ownership} this is a legal question for your home organization. 
If data that is required for analysis must be placed under restricted use or restricted access, including data that can never be distributed directly by you to third parties, diff --git a/02-collaboration.Rmd b/02-collaboration.Rmd index 389f7e1e..bc1b78f0 100644 --- a/02-collaboration.Rmd +++ b/02-collaboration.Rmd @@ -587,7 +587,7 @@ it seems to be "enough but not too much" for most purposes. ```{block2, type = 'ex'} ### Demand for Safe Spaces Case Study: Writing Code That Others Can Read {-} -To ensure that all team members were able to easily read and understand data work, *Demand for Safe Spaces* code files were extensively commented. Comments typically took the form of "what – why": what is this section of code doing, and why is it necessary. The below snippet from a do-file cleaning one of the raw data files illustrates the use of comments: +To ensure that all team members were able to easily read and understand data work, *Demand for Safe Spaces* code files were extensively commented. Comments typically took the form of "what – why": what is this section of code doing, and why is it necessary. The below snippet from a do-file cleaning one of the original datasets illustrates the use of comments: ![](examples/ch2-writing-code-that-others-can-read.png) @@ -604,7 +604,7 @@ To bring all these smaller code files together, you must maintain a master scrip A master script is the map of all your project's data work which serves as a table of contents for the instructions that you code. Anyone should be able to follow and reproduce all your work from -raw data to all outputs by simply running this single script. +the original data to all outputs by simply running this single script. 
By follow, we mean someone external to the project who has the master script and all the input data can (i) run all the code and recreate all outputs, (ii) have a general understanding of what is being done at every step, and @@ -987,7 +987,7 @@ The **initial de-identification** process strips the data of direct identifiers as early in the process as possible, to create a working de-identified dataset that can be shared *within the research team* without the need for encryption. -This data set should always be used when possible. +This dataset should always be used when possible. The **final de-identification** process involves making a decision about the trade-off between risk of disclosure and utility of the data diff --git a/03-measurement.Rmd b/03-measurement.Rmd index d6b48a5e..05738524 100644 --- a/03-measurement.Rmd +++ b/03-measurement.Rmd @@ -125,7 +125,7 @@ The **data linkage table**^[ More details on DIME's data linkage table template and an example can be found on the DIME Wiki: https://dimewiki.worldbank.org/Data_Linkage_Table.] -lists all the raw datasets that will be used in the project, +lists all the original datasets that will be used in the project, what data sources they are created from, and how they relate to each other. For each **unit of observation**^[ @@ -144,7 +144,7 @@ you will create **data flowcharts**,^[ More details on DIME's data flow chart template and an example can be found on the DIME Wiki: https://dimewiki.worldbank.org/Data_Flow_Chart.] -describing how the raw datasets and master datasets +describing how the original datasets and master datasets are to be combined and manipulated to create analysis datasets. Each component will be discussed in more detail below. @@ -174,8 +174,8 @@ outcomes of interest, and control variables among many others. 
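The master-script pattern described above, a single entry point that reproduces everything from the original data to the final outputs, can be sketched as follows. This is an illustrative sketch only: the stage file names and the `run_all` helper are hypothetical, and in a Stata project the same role would be played by a master do-file.

```python
from pathlib import Path

# The root of the project; in a real master script this is the only
# line each collaborator edits to point at their own copy of the project.
PROJECT_ROOT = Path(".")

# Ordered "table of contents" for the project's data work.
STAGES = [
    "code/01-import-original-data.py",
    "code/02-clean-data.py",
    "code/03-construct-indicators.py",
    "code/04-analysis.py",
]

def run_all(run_stage):
    """Run every stage in order; run_stage executes a single script path."""
    for stage in STAGES:
        run_stage(PROJECT_ROOT / stage)

# For illustration only: record the execution order instead of running
# real scripts (which do not exist here).
executed = []
run_all(executed.append)
print([p.name for p in executed])
# → ['01-import-original-data.py', '02-clean-data.py',
#    '03-construct-indicators.py', '04-analysis.py']
```

Keeping the stage list in one place also documents the order in which outputs depend on each other.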
To create a data map according to DIME's template,\index{data linkage table} the first step is to create a **data linkage table** by listing all the data sources you know you will use in a spreadsheet, -and the raw datasets that will be created from them. -If one source of data will result in two different raw datasets, +and the original datasets that will be created from them. +If one source of data will result in two different datasets, then list each dataset on its own row. For each dataset, list the unit of observation and the name of the **project ID**^[ @@ -223,7 +223,7 @@ the nature of the data license, and so on. The main unit of observation in the platform survey datasets is the respondent and it is uniquely identified by the variable id. However, implicit association tests (IAT) were collected through a specialized software that outputs two datasets for each IAT instrument: one at respondent level, containing the final scores; and one with detailed information on each stimulus used in the test (images or expressions to be associated with concepts). Three IAT instruments were used: one testing the association between gender and career choices; one testing the association between car choice and safety concerns; and one testing the association between car choice and openness to sexual advances. -As a result, the raw data for the platform survey component of the project consisted in 7 datasets: 1 for the platform survey, and 6 for the IAT -- 3 with IAT scores (one for each instrument) and 3 with detailed stimuli data (one for each instrument). All 7 datasets are stored in the same raw data folder. The data linkage table lists their file names and indicates how their ID variables are connected. Note that the raw stimulus data does not have a unique identifier, since the same stimulus can be shown repeatedly, so the “ID var” field is blank for these datasets. 
+As a result, the original data for the platform survey component of the project consisted of 7 datasets: 1 for the platform survey, and 6 for the IAT -- 3 with IAT scores (one for each instrument) and 3 with detailed stimuli data (one for each instrument). All 7 datasets are stored in the same raw data folder. The data linkage table lists their file names and indicates how their ID variables are connected. Note that the raw stimulus data does not have a unique identifier, since the same stimulus can be shown repeatedly, so the “ID var” field is blank for these datasets.
 
 | Data source | Raw dataset name | Unit of observation
(ID var) | Parent unit
(ID var) | |-------------|------------------|---------------------------------|--------------------------| @@ -365,7 +365,7 @@ and multi-level data like "district-school-teacher-student" structures. ```{block2, type = "ex"} ### Demand for Safe Spaces Example: Creating Data Flowcharts -The data flow chart indicates how the raw datasets are processed and combined to create a final respondent-level dataset that will be used for analysis. The analysis dataset resulting from this process is shown in green. The raw datasets are shown in blue (refer to \ref{@linkage} for details on the raw datasets). The name of the uniquely identifying variable in the dataset is indicated in the format (ID: variable_name). +The data flow chart indicates how the original datasets are processed and combined to create a final respondent-level dataset that will be used for analysis. The analysis dataset resulting from this process is shown in green. The original datasets are shown in blue (refer to the data linkage table example for details on the original datasets). The name of the uniquely identifying variable in the dataset is indicated in the format (ID: variable_name). Each operation that changes the level of observation of the data is summarized in the flow chart. The chart also summarizes how datasets will be combined. Since these are the most error-prone data processing tasks, having a high-level plan for how they will be executed helps clarify the process for everyone in the data team, preventing future mistakes. 
diff --git a/04-acquisition.Rmd b/04-acquisition.Rmd index 046aa5a5..0c3133fe 100644 --- a/04-acquisition.Rmd +++ b/04-acquisition.Rmd @@ -7,7 +7,7 @@ Data acquisition can take many forms, including: primary data generated through surveys; private sector partnerships granting access to new data sources, such as administrative and sensor data; digitization of paper records, including administrative data; web scraping; -primary data capture by unmanned aerial vehicles or other types of remote sensing; +data captured by unmanned aerial vehicles or other types of remote sensing; or novel integration of various types of datasets, such as combining survey and sensor data. Much of the recent push toward credibility in the social sciences has focused on analytical practices. However, credible development research depends, first and foremost, on the quality of the acquired data. @@ -657,7 +657,7 @@ address any issues that arose during piloting and cover frequently asked questions. The manual must also describe survey protocols and conventions, such as how to select or confirm the identity of respondents, -and standardized means for recording responses such as ``Don't know".^[ +and standardized means for recording responses such as "Don't know".^[ For more details and examples of common survey protocols see the DIME Wiki: https://dimewiki.worldbank.org/Survey_Protocols] @@ -1011,9 +1011,9 @@ There is absolutely no way to restore the data if you lose your key, so we cannot stress enough the importance of using a password manager, or equally secure solution, to store these encryption keys. -It is becoming more and more common that development research -is done on data set that is too big to store on a regular computer, -and instead the data is stored and processed in a cloud environment. 
+It is becoming more and more common for development research
+to use data that is too big to be stored on a regular computer
+and needs to be stored and processed in a cloud environment instead.
 There are many available cloud storage solutions and you need to understand how the data is encrypted and how the keys are handled. This is likely another case where a regular research team will have to ask a cybersecurity expert.
@@ -1040,11 +1040,11 @@ This should be on your computer, and could be in a shared folder. If your data source is a survey and the data was encrypted during data collection, then you will need *both* the private key used during data collection to be able to download the data, *and* the key used when you created the encrypted folder to save it there.
-This your first copy of your raw data, and the copy you will use for cleaning and analysis.
+This is your first copy of your original data, and the copy you will use for cleaning and analysis.
 1. Create a second encrypted folder on an external drive that you can keep in a secure location. Copy the data you just downloaded to this second encrypted folder.
-This is the ``master" backup copy of the raw data.
+This is the "master" backup copy of the original data.
 You should never work with this data on a day-to-day basis. You should not use the same encrypted folder or the same key as above, because if you use the same key and lose the key,
@@ -1057,8 +1057,8 @@ and thereby do not risk losing access by losing an encryption key. Either you can create this on your computer and upload it to a long-term cloud storage service (not a sync software), or you can create it on another external hard drive or computer that you then store in a second location, for example, at another office of your organization.
-This is the ``golden master" backup copy of the raw data.
+This is the "golden master" backup copy of the original data.
+You should never store the "golden master" copy in a synced folder, as it would be deleted in the cloud storage if it is deleted on your computer. You should also never work with this data; it exists only for recovery purposes. diff --git a/05-processing.Rmd b/05-processing.Rmd index 37d9c8bf..675a086b 100644 --- a/05-processing.Rmd +++ b/05-processing.Rmd @@ -6,7 +6,7 @@ most of which are not immediately suited for analysis. The process of preparing data for analysis has many different names: data cleaning, data munging, data wrangling. But they all mean the same thing -- -transforming raw data into a convenient format for your intended use. +transforming data into a convenient format for your intended use. This is the most time-consuming step of a project's data work, particularly when primary data is involved; it is also essential for data quality. @@ -17,7 +17,7 @@ We consider creating new variables, imputing values and correcting outliers to be research decisions, and will discuss those in the next chapter. Therefore, the clean dataset, which is the main output from the workflow discussed in this chapter, -contains the same information as the raw data, +contains the same information as the original data, but in a format that is ready for use with statistical software. @@ -33,18 +33,18 @@ The final section discusses how to examine each variable in your dataset and make sure that it is as well documented and as easy to use as possible. Each of these tasks is implemented through code, and resulting datasets can be reproduced exactly by running this code. -The raw data files are kept exactly as they were acquired, +The original data files are kept exactly as they were acquired, and no changes are made directly to them. 
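A minimal sketch of this rule, assuming hypothetical paths and a hypothetical sign-error correction: the cleaning script reads the original file, applies the fix in code, and writes a new intermediate file, leaving the original untouched.

```python
from pathlib import Path
import csv
import tempfile

# Hypothetical layout: an original file that is never edited by hand.
original = Path(tempfile.mkdtemp()) / "original-data.csv"
original.write_text("id,age\n1,-34\n2,29\n")  # stands in for the acquired file

# The cleaning script reads the original and applies a correction in code...
with original.open() as f:
    rows = list(csv.DictReader(f))
for row in rows:
    row["age"] = str(abs(int(row["age"])))  # hypothetical sign-error fix

# ...then writes a *new* intermediate dataset, so rerunning the script
# always reproduces the cleaned file from the untouched original.
cleaned_path = original.with_name("cleaned-data.csv")
with cleaned_path.open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "age"])
    writer.writeheader()
    writer.writerows(rows)

assert original.read_text() == "id,age\n1,-34\n2,29\n"  # original unchanged
```

Because the correction lives only in code, deleting every intermediate file and rerunning the script recovers the exact same cleaned dataset.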
```{block2, type = 'summary'} ### Summary: Cleaning and processing research data {-} -After data is acquired, it must be structured for analysis in accordance with the research design, as laid out in the data linkage tables and the data flowcharts discussed in Chapter \@ref(measurement). Data processing requires first transforming the unprocessed materials from partners or data collection into the appropriate tables and units of observation, then producing clean data sets that match the ground truth observations. To do this, you will: +After data is acquired, it must be structured for analysis in accordance with the research design, as laid out in the data linkage tables and the data flowcharts discussed in Chapter \@ref(measurement). Data processing requires first transforming the unprocessed materials from partners or data collection into the appropriate tables and units of observation, then producing clean datasets that match the ground truth observations. To do this, you will: -**1. Tidy the data.** Many raw datasets as received will not have an unambiguous identifier, and the rows in the dataset often will not match the units of observation specified by the research plan and data linkage table. To prepare the data for analysis, you must: +**1. Tidy the data.** Many datasets will not have an unambiguous identifier as received, and the rows in the dataset often will not match the units of observation specified by the research plan and data linkage table. To prepare the data for analysis, you must: - Determine the *unique identifier* for each unit of observation that you require. -- Transform the raw data so that the desired *unit of observation* uniquely identifies rows in each dataset. +- Transform the data so that the desired *unit of observation* uniquely identifies rows in each dataset. **2. 
Validate data quality.** Data completeness and quality should be validated upon receipt to ensure the data is an accurate representation of the characteristics and individuals it is supposed to contain. This includes: @@ -52,7 +52,7 @@ After data is acquired, it must be structured for analysis in accordance with th - Making sure data points are *consistent* across variables and datasets. - Exploring the *distributions* of key variables to identify outliers and other unexpected patterns. -**3. De-identify, clean, and prepare the data.** You should archive and/or publish the raw data after processing and de-identifying. Before publication, you should ensure that the processed version is highly accurate and appropriately protects the privacy of individuals, by doing the following: +**3. De-identify, clean, and prepare the data.** You should archive and/or publish the data after processing and de-identifying. Before publication, you should ensure that the processed version is highly accurate and appropriately protects the privacy of individuals, by doing the following: - *De-identifying the data*, in accordance with best practices and relevant privacy regulations. - *Correcting data points* which are identified as being in error compared to ground reality. @@ -111,7 +111,7 @@ and the **unit of observation**^[ can be found on the DIME Wiki: https://dimewiki.worldbank.org/Unit_of_Observation. ]\index{unit of observation} -may be ambiguous in many raw datasets. +may be ambiguous in many datasets. This section will present what we call a *tidy* data format, which is, in our experience, the ideal format to handle tabular data. We will treat tidying data as the first step in data cleaning even though, in practice, @@ -147,8 +147,8 @@ must be listed in the **master dataset**.^[ Ensuring that observations are uniquely and fully identified is arguably the most important step in data cleaning. 
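A minimal sketch of such a uniqueness check in pandas (column names are hypothetical; in Stata, commands such as `ieduplicates` implement a fuller, documented version of this workflow):

```python
import pandas as pd

# A small acquired data table; column names are hypothetical.
df = pd.DataFrame({
    "respondent_id": [101, 102, 102, 103],
    "consent":       [1,   1,   1,   0],
})

# Surface duplicated IDs for documentation and follow-up,
# rather than silently dropping rows.
dups = df[df.duplicated("respondent_id", keep=False)]
print(sorted(dups["respondent_id"].unique()))  # → [102]

# After duplicates are investigated and resolved, assert uniqueness
# before saving, so the script fails loudly if the problem returns.
resolved = df.drop_duplicates("respondent_id", keep="first")
assert resolved["respondent_id"].is_unique
```

The assertion at the end is the important habit: every script that saves a dataset should first prove that its ID variable uniquely identifies the rows.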
It may be the case that the variables expected to uniquely identify
-the raw data contain either missing or duplicate values.^[
- We use the expression **raw data**
+the data contain either missing or duplicate values.^[
+ We use the expression **original dataset**
 to refer to the "data in the state it was originally received by the research team". In other sources, you will also see it used to refer to the "corrected and compiled dataset created from received information,
@@ -158,7 +158,7 @@ the raw data contain either missing or duplicate values.^[ original data collected by the research team. ]
-It is also possible for a raw dataset to not include an unique identifier,
+It is also possible that a dataset does not include a unique identifier,
 or that the identifier is not a suitable **project ID**.^[
 More details on what makes an ID variable a suitable Project ID variable
@@ -168,12 +168,12 @@ Suitable project IDs should, for example, not involve long strings that are difficult to work with, such as a name, or be an ID that is known outside the research team. In such cases, cleaning begins by
-adding a project ID to the raw data.
+adding a project ID to the acquired data.
 If a project ID already exists, for this unit of observation, then you should carefully merge it from the master dataset
-to the raw data
+to the acquired data
 using other identifying information.^[Such operations are commonly called "merges" in Stata, and "joins" in R's `tidyverse` dialect. ]
 If a project ID does not exist, then you need to generate one, add it to the master dataset,
-and then merge it back into the raw data.
Note that while digital survey tools create unique identifiers for each data submission, that is not the same as having a unique ID variable @@ -207,7 +207,7 @@ it is important to keep a record of all cases of duplicated IDs encountered ```{block2, type = 'ex'} ### Demand for Safe Spaces Case Study: Establishing a Unique Identifier -All datasets have a "unit of observation", and the first columns of each dataset should uniquely identify which unit is being observed. In the *Demand for Safe Spaces* project, as in all projects, the first few lines of code that import each raw dataset immediately ensure that this is true and apply any corrections from the field needed to fix errors with uniqueness. +All datasets have a "unit of observation", and the first columns of each dataset should uniquely identify which unit is being observed. In the *Demand for Safe Spaces* project, as in all projects, the first few lines of code that import each original dataset immediately ensure that this is true and apply any corrections from the field needed to fix errors with uniqueness. The code segment below imports the crowdsourced ride data and uses the `ieduplicates` command to remove duplicate values of the uniquely identifying variable in the dataset. The corresponding filled `ieduplicates` form included below shows how the command generates information documenting and resolving duplicated IDs in data collection. After applying the corrections, the code confirms that the data is uniquely identified by riders and ride IDs and saves it in an optimized format. @@ -221,9 +221,9 @@ The code segment below imports the crowdsourced ride data and uses the `ieduplic -### Tidying raw data {-} +### Tidying data {-} -Though raw data can be acquired in all shapes and sizes, +Though data can be acquired in all shapes and sizes, it is most commonly received as one or multiple data tables. 
These data tables can organize information in multiple ways, and not all of them result in easy-to-handle datasets. @@ -236,15 +236,15 @@ A data table is tidy when each column represents one **variable**,^[ each row represents one observation, and all variables in it have the same unit of observation. Every other format is *untidy*. -This may seem trivial, but raw data, +This may seem trivial, but data, and raw survey data in particular, is rarely received in a tidy format. -The most common case of untidy raw data encountered in development research +The most common case of untidy data acquired in development research is a dataset with multiple units of observations stored in the same data table. Take, for example, a household survey that includes household-level questions, as well as a household member roster. -Such raw datasets usually consists of a single data table +Such raw datasets usually consist of a single data table where questions from the household member roster are saved in different columns, one for each member, with a corresponding member suffix, and household-level questions are represented by one column each. @@ -308,7 +308,7 @@ You must be sure that identifying variables are consistent across data tables, so they can always be linked. Reshaping is the type of transformation we referred to in the example of how you calculate -the share of women in a wide data set. +the share of women in a wide dataset. The important difference is that in a tidy workflow, instead of transforming the data for each operation, @@ -359,7 +359,7 @@ This means that there must be a master dataset for households. (You may have a master dataset for household members as well if you think it is important for your research, but it is not strictly required.) -The household data set would then be stored in a folder called, +The household dataset would then be stored in a folder called, for example, `baseline-hh-survey/`. 
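The household-roster example above can be sketched in pandas: the untidy table, with one column per member, is split into a tidy household-level table and a tidy member-level table that link through the household ID (column names are hypothetical).

```python
import pandas as pd

# Untidy acquired table: one row per household, with member-level
# questions stored as suffixed columns (age_1, age_2, ...).
untidy = pd.DataFrame({
    "hh_id":  [1, 2],
    "region": ["North", "South"],  # household-level variable
    "age_1":  [34, 51],            # member-level variable, member 1
    "age_2":  [29, 48],            # member-level variable, member 2
})

# Member-level table: one row per household member.
members = pd.wide_to_long(untidy[["hh_id", "age_1", "age_2"]],
                          stubnames="age", i="hh_id", j="member",
                          sep="_").reset_index()

# Household-level table: one row per household.
households = untidy[["hh_id", "region"]]

# Each table now has a single unit of observation, uniquely identified.
assert members.set_index(["hh_id", "member"]).index.is_unique
print(members.sort_values(["hh_id", "member"])["age"].tolist())
# → [34, 29, 51, 48]
```

With the data in this shape, a statistic like average member age is a one-line aggregation instead of arithmetic across suffixed columns.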
In that folder you would save both the household-level data table with the same name as the folder,
@@ -389,9 +389,9 @@ Preparing the data for analysis, the last task in this chapter, is much simpler when that is the case.
 ```{block2, type = 'ex'}
-### Demand for Safe Spaces Case Study: Tidying Raw Data
+### Demand for Safe Spaces Case Study: Tidying Data
 
-The unit of observations in a raw dataset does not always match the relevant unit of analysis for a study. One of the first steps required is creating datasets at the levels of analysis desired for analytical work. In the case of the crowdsourced ride data used in *Demand for Safe Spaces*, the raw datasets show one *task* per row. Remember that in each metro ride, study participants were asked to complete three tasks: one before boarding the train, one during the ride, and one after leaving the train. So the relevant unit of analysis, one metro *trip*, was broken into three rows in this dataset. To create a dataset at this level, the research team took two steps, outlined in the data flowchart (see box Creating Data Flowcharts in Chapter \@ref(measurement)). First, three separate datasets were created, one for each task, containing only the variables created during that task. Then the ride level dataset was created by combining the variables in each task dataset for each individual ride (identified by the session variable).
+The unit of observation in an original dataset does not always match the relevant unit of analysis for a study. One of the first steps required is creating datasets at the levels of analysis desired for analytical work. In the case of the crowdsourced ride data used in *Demand for Safe Spaces*, the raw datasets show one *task* per row. Remember that in each metro ride, study participants were asked to complete three tasks: one before boarding the train, one during the ride, and one after leaving the train.
So the relevant unit of analysis, one metro *trip*, was broken into three rows in this dataset. To create a dataset at this level, the research team took two steps, outlined in the data flowchart (see box Creating Data Flowcharts in Chapter \@ref(measurement)). First, three separate datasets were created, one for each task, containing only the variables created during that task. Then the ride level dataset was created by combining the variables in each task dataset for each individual ride (identified by the session variable).
 The code below shows the example of the ride task script. It keeps only the ride task rows and columns from the raw dataset.
@@ -482,7 +482,7 @@ through which sampled units are directly assigned to individual enumerators. For data received from partners, such as administrative data, this may be harder to validate. In these cases, cross-referencing with other data sources can help to ensure completeness.
-It is often the case that raw data includes duplicate or missing entries,
+It is often the case that the data as originally acquired includes duplicate or missing entries,
 which may occur due to typos, failed submissions to data servers, or other mistakes.^[
 More details on how to deal with duplicates during surveys
@@ -647,7 +647,7 @@ both the initial and the final de-identification processes. Initial de-identification reduces risk and simplifies workflows. Once you create a de-identified version of the dataset, you no longer need to interact directly with the encrypted data.
-Note that if the data tidying resulted in multiple raw data tables,
+Note that if the data tidying resulted in multiple data tables,
 each will need to be de-identified separately, but the workflow will be the same for all of them.
@@ -684,7 +684,7 @@ assess them against the analysis plan and first ask yourself for each variable: *will this variable be needed for the analysis?*
 If not, the variable should be dropped.
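A minimal pandas sketch of this initial de-identification decision, dropping direct identifiers the analysis does not need and keeping only the project ID as the link back to the confidential data (all variable names are hypothetical):

```python
import pandas as pd

# Cleaned data table still containing direct identifiers.
clean = pd.DataFrame({
    "project_id": ["P001", "P002"],
    "name":       ["Maria", "Joana"],        # direct identifier, not needed
    "phone":      ["555-0100", "555-0199"],  # direct identifier, not needed
    "income":     [1200, 950],
})

# The list of dropped variables is itself useful documentation.
pii_columns = ["name", "phone"]
deidentified = clean.drop(columns=pii_columns)

# Confirm no listed identifier survived before saving or sharing.
assert not set(pii_columns) & set(deidentified.columns)
print(list(deidentified.columns))  # → ['project_id', 'income']
```

The `deidentified` table can then circulate within the team without encryption, while the identified version stays in secure storage.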
Don't be afraid to drop too many variables the first time, -as you can always go back and extract additional variables from the raw data, +as you can always go back and extract additional variables from the original dataset, but you cannot go back in time and drop a PII variable that was leaked. For each confidential variable that is needed in the analysis, ask yourself: @@ -796,7 +796,7 @@ of any of these characteristics could be caused by data entry errors. At this point, it is more important to document your findings than to directly address any irregularities found. -There is a very limited set of changes that should be made to the raw data during cleaning. +There is a very limited set of changes that should be made to the original dataset during cleaning. They are described in the next two sections, and are usually applied to each variable as you examine it. Most of the transformations that result in new variables @@ -845,7 +845,7 @@ If your team decides to follow up on and correct these issues, the follow-up process must also be thoroughly documented. Be very careful not to include confidential information in documentation that is not securely stored, or that you intend to release as part of a replication package or data publication. -Finally, remember not to make changes directly to the raw data. +Finally, remember not to make changes directly to the original dataset. Instead, any corrections must be done as part of data cleaning, applied through code, and saved to a new intermediate dataset. @@ -919,9 +919,9 @@ In Stata, this information can be elegantly conserved using extended missing val ] -We recommend that the cleaned dataset be kept as similar to the raw data as possible. +We recommend that the cleaned dataset be kept as similar to the original dataset as possible. This is particularly important regarding variable names: -keeping them consistent with the raw data makes data processing and construction more transparent. 
+keeping them consistent with the original dataset makes data processing and construction more transparent. Unfortunately, not all variable names are informative. In such cases, one important piece of documentation makes the data easier to handle: the variable dictionary. @@ -981,7 +981,7 @@ The *Demand for Safe Spaces* team relied mostly on the `iecodebook` command for ![](examples/iecodebook.png) -Column B contains the corrected variable labels, column D indicates the value labels to be used for categorical variables, and column I recodes the underlying numbers in those variables. The differences between columns E and A indicate changes to variable names. Typically, it is strongly recommended not to rename variables at the cleaning stage, as it is important to maintain correspondence to the raw data. However, that was not possible in this case, as the same question had inconsistent variable names across multiple transfers of the data from the technology firm that managed the mobile application. In fact, this is one of the two cleaning tasks that could not be performed through `iecodebook` directly (the other was transformation of string variables to categorical format for increased efficiency). The code below shows a few examples of how these cleaning tasks were carried out directly in the script. +Column B contains the corrected variable labels, column D indicates the value labels to be used for categorical variables, and column I recodes the underlying numbers in those variables. The differences between columns E and A indicate changes to variable names. Typically, it is strongly recommended not to rename variables at the cleaning stage, as it is important to maintain correspondence to the original dataset. However, that was not possible in this case, as the same question had inconsistent variable names across multiple transfers of the data from the technology firm that managed the mobile application. 
In fact, this is one of the two cleaning tasks that could not be performed through `iecodebook` directly (the other was transformation of string variables to categorical format for increased efficiency). The code below shows a few examples of how these cleaning tasks were carried out directly in the script. ![](examples/ch5-recording-and-annotating-data.png) diff --git a/06-analysis.Rmd b/06-analysis.Rmd index 38affa9f..f0d9e5de 100644 --- a/06-analysis.Rmd +++ b/06-analysis.Rmd @@ -17,11 +17,11 @@ but also for the smooth implementation of a project. In this chapter, we discuss the necessary steps to transform -cleaned raw data into informative analysis outputs such as tables and figures. +cleaned data into informative analysis outputs such as tables and figures. The suggested workflow starts where the last chapter ended: with the outputs of data cleaning. The first section covers variable construction: -transforming the raw data into economically meaningful indicators. +transforming the cleaned data into economically meaningful indicators. The second section discusses the analysis code itself. We do not offer instructions on how to conduct specific analyses, as that is determined by research design, @@ -116,7 +116,7 @@ planned during research design\index{research design}, with the pre-analysis plan serving as a guide.\index{pre-analysis plan} During construction, data will typically be reshaped, merged, and aggregated to change the level of the data points -from the **unit of *observation* ** in the raw data +from the **unit of *observation* ** in the original dataset to the **unit of *analysis* **.^[ More details on the concepts of unit of observation and unit of analysis @@ -274,7 +274,7 @@ and to be extensively documented, separately from other data construction tasks.
```{block2, type = 'ex'} ### Demand for Safe Spaces Case study: Integrating Multiple Data Sources -The raw crowsourced data acquired for the *Demand for Safe Spaces* study was received by the research team in a different level of observation than the one relevant for analysis. The unit of analysis is a ride, and each ride was represented in the raw data by three rows, one for questions answered before boarding the train, one for those answered during the trip and one for those answered after leaving the train. The *Tidying raw data* example explains how the team created three different intermediate datasets for each of these tasks. To create the complete ride-level dataset, the team combined the individual task datasets. The code below shows how the team assured that all observations had merged as expected. Two different approaches depending on what you expect are shown. +The raw crowdsourced data acquired for the *Demand for Safe Spaces* study was received by the research team at a different level of observation than the one relevant for analysis. The unit of analysis is a ride, and each ride was represented in the original dataset by three rows: one for questions answered before boarding the train, one for those answered during the trip, and one for those answered after leaving the train. The *Tidying data* example explains how the team created three different intermediate datasets for each of these tasks. To create the complete ride-level dataset, the team combined the individual task datasets. The code below shows how the team ensured that all observations had merged as expected. Two different approaches are shown, depending on what you expect. The first code chunk shows the quality assurance protocol for when the team expected that all observations would exist in all datasets, so that each merge would have only matched observations. To test that this was the case, the team used the option `assert(3)`.
When two datasets are merged in Stata, each observation gets the code 1, 2, or 3. A code of 1 means that the observation existed only in the dataset in memory (called the “master data”), 2 means that the observation existed only in the other dataset (called the “using data”), and 3 means that it existed in both. `assert(3)` tests that all observations existed in both datasets and were coded as 3. diff --git a/07-publication.Rmd b/07-publication.Rmd index 1c03acb9..bf8e1b02 100644 --- a/07-publication.Rmd +++ b/07-publication.Rmd @@ -318,7 +318,7 @@ that was not produced by your team, but you will still need to carefully explain the process of acquiring this data). If you follow the steps outlined in Chapter \@ref(processing), when you get to the publication stage you will have -a cleaned data set and supporting documentation ready. +a cleaned dataset and supporting documentation ready. Second, you should separately catalog the analysis dataset used for the research output you are publishing. @@ -772,7 +772,7 @@ a consortium of social science data editors.^[ ```{block2, type = 'ex'} ### Demand for Safe Spaces Case Study: Releasing a Reproducibility Package -The *Demand for Safe Spaces* reproducibility package, released on the World Bank’s GitHub organization, contains all the materials necessary for another researcher to access raw materials and reproduce all the results include with the paper. The Reproducibility Package folder contains a README file with all the instructions for executing the code. Among other things, it provides licensing information for the materials, software and hardware requirements including time needed to run, and instructions for accessing and placing the raw data before running the code (which must be downloaded separately). Finally, it has a detailed list of the code files that will run, their data inputs, and the outputs of each process.
+The *Demand for Safe Spaces* reproducibility package, released on the World Bank’s GitHub organization, contains all the materials necessary for another researcher to access raw materials and reproduce all the results included with the paper. The Reproducibility Package folder contains a README file with all the instructions for executing the code. Among other things, it provides licensing information for the materials, software and hardware requirements (including the time needed to run), and instructions for accessing and placing the original data (which must be downloaded separately) before running the code. Finally, it has a detailed list of the code files that will run, their data inputs, and the outputs of each process. ![](examples/README.png)
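
The matched-merge protocol described in the *Demand for Safe Spaces* example above can be sketched in Stata. This is a hypothetical sketch, not the study's actual code: the file names and the use of `session` as the ride identifier are assumptions based on the case study description.

```stata
* Hypothetical sketch: combine the three task-level datasets
* into a single ride-level dataset, one row per ride ("session").
use "ride_task.dta", clear

* assert(3) stops the merge with an error unless every observation
* is matched in both datasets (merge code 3); nogenerate drops the
* _merge variable afterwards, since no unmatched rows can remain.
merge 1:1 session using "checkin_task.dta",  assert(3) nogenerate
merge 1:1 session using "checkout_task.dta", assert(3) nogenerate

save "ride_level.dta", replace
```

When unmatched observations are expected, the allowed merge codes can be listed instead (for example, `assert(1 3)` to permit master-only rows), or `_merge` can be kept and tabulated explicitly as part of the quality assurance protocol.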