#476 : "raw data" and "data set"
luizaandrade committed Feb 17, 2021
1 parent a4d206d commit d675a87
Showing 8 changed files with 60 additions and 60 deletions.
2 changes: 1 addition & 1 deletion 00-introduction.Rmd
@@ -18,7 +18,7 @@ This book aims to be a highly practical resource so the reader can immediately b

**Chapter \@ref(processing)** describes data processing tasks. It details how to construct "tidy" data at the appropriate units of analysis, how to ensure uniquely identified datasets, and how to routinely incorporate data quality checks into the workflow. It also provides guidance on de-identification and cleaning of personally-identified data, focusing on how to understand and structure data so that it is ready for indicator construction and analytical work.

-**Chapter \@ref(analysis)** discusses data analysis tasks. It begins with data construction, or the creation of new variables from the raw data acquired or collected in the field. It introduces core principles for writing analytical code and creating, exporting, and storing research outputs such as figures and tables reproducibly using dynamic documents.
+**Chapter \@ref(analysis)** discusses data analysis tasks. It begins with data construction, or the creation of new variables from the original data acquired or collected in the field. It introduces core principles for writing analytical code and creating, exporting, and storing research outputs such as figures and tables reproducibly using dynamic documents.

**Chapter \@ref(publication)** outlines the publication of research outputs, including manuscripts, code, and data. This chapter discusses how to effectively collaborate on technical writing using dynamic documents. It also covers how and why to publish datasets in an accessible, citable, and safe fashion. Finally, it provides guidelines for preparing functional and informative reproducibility packages that contain all the code, data, and meta-information needed for others to evaluate and reproduce your work.

6 changes: 3 additions & 3 deletions 01-reproducibility.Rmd
@@ -429,8 +429,8 @@ the **original data** (including corrections)^[
that becomes the functional basis for research work.]\index{original data}
should be immediately placed in a secure permanent storage system.
Before analytical work begins, you should create a "for-publication"
-copy of the original dataset by removing potentially identifying information.\index{de-identification}
-This will become the raw data, and must be
+copy of the acquired dataset by removing potentially identifying information.\index{de-identification}
+This will become the original data, and must be
placed in an archival repository where it can be cited.^[@vilhuber2020report]\index{data publication}
This can initially be done under embargo or with limited release,
in order to protect your data and future work.
@@ -450,7 +450,7 @@ provide specific repositories in which they require the deposit of data they fun
and you should take advantage of these when possible.
If this is not provided, you must be aware of privacy issues
with directly identifying data and questions of data ownership
-before uploading raw data to any third-party server, whether public or not;\index{data ownership}
+before uploading original data to any third-party server, whether public or not;\index{data ownership}
this is a legal question for your home organization.
If data that is required for analysis must be placed under restricted use or restricted access,
including data that can never be distributed directly by you to third parties,
6 changes: 3 additions & 3 deletions 02-collaboration.Rmd
@@ -587,7 +587,7 @@ it seems to be "enough but not too much" for most purposes.
```{block2, type = 'ex'}
### Demand for Safe Spaces Case Study: Writing Code That Others Can Read {-}
-To ensure that all team members were able to easily read and understand data work, *Demand for Safe Spaces* code files were extensively commented. Comments typically took the form of "what – why": what is this section of code doing, and why is it necessary. The below snippet from a do-file cleaning one of the raw data files illustrates the use of comments:
+To ensure that all team members were able to easily read and understand data work, *Demand for Safe Spaces* code files were extensively commented. Comments typically took the form of "what – why": what is this section of code doing, and why is it necessary. The below snippet from a do-file cleaning one of the original datasets illustrates the use of comments:
![](examples/ch2-writing-code-that-others-can-read.png)
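Since the do-file itself appears only as an image, the sketch below illustrates the "what – why" pattern in Stata; the file path and variable names are hypothetical and not taken from the Demand for Safe Spaces code:

```stata
* WHAT: Keep only completed platform survey interviews
* WHY:  Incomplete records are missing the outcome variables and would
*       distort summary statistics if they were retained
use "${raw_data}/platform_survey.dta", clear
keep if complete == 1

* WHAT: Standardize station names before merging with the station list
* WHY:  The same station appears under several spellings, which would
*       cause a merge on station name to fail
replace station_name = strproper(strtrim(station_name))
```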
@@ -604,7 +604,7 @@ To bring all these smaller code files together, you must maintain a master scrip
A master script is the map of all your project's data work
which serves as a table of contents for the instructions that you code.
Anyone should be able to follow and reproduce all your work from
-raw data to all outputs by simply running this single script.
+the original data to all outputs by simply running this single script.
By follow, we mean that someone external to the project who has the master script and all the input data can
(i) run all the code and recreate all outputs,
(ii) have a general understanding of what is being done at every step, and
@@ -987,7 +987,7 @@ The **initial de-identification** process strips the data of direct identifiers
as early in the process as possible,
to create a working de-identified dataset that
can be shared *within the research team* without the need for encryption.
-This data set should always be used when possible.
+This dataset should always be used when possible.
The **final de-identification** process involves
making a decision about the trade-off between
risk of disclosure and utility of the data
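As a minimal sketch of the initial de-identification step described above, written in Stata with hypothetical file paths and variable names, the direct identifiers are simply dropped before the working copy is saved:

```stata
* WHAT: Create the working de-identified copy of the acquired data
* WHY:  This is the copy shared within the research team, so it must not
*       contain direct identifiers that would require encryption
use "${encrypted}/survey_confidential.dta", clear
drop respondent_name phone_number home_address gps_latitude gps_longitude
save "${intermediate}/survey_deidentified.dta", replace
```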
12 changes: 6 additions & 6 deletions 03-measurement.Rmd
@@ -125,7 +125,7 @@ The **data linkage table**^[
More details on DIME's data linkage table template
and an example can be found on the DIME Wiki:
https://dimewiki.worldbank.org/Data_Linkage_Table.]
-lists all the raw datasets that will be used in the project,
+lists all the original datasets that will be used in the project,
what data sources they are created from,
and how they relate to each other.
For each **unit of observation**^[
@@ -144,7 +144,7 @@ you will create **data flowcharts**,^[
More details on DIME's data flow chart template
and an example can be found on the DIME Wiki:
https://dimewiki.worldbank.org/Data_Flow_Chart.]
-describing how the raw datasets and master datasets
+describing how the original datasets and master datasets
are to be combined and manipulated to create analysis datasets.
Each component will be discussed in more detail below.

@@ -174,8 +174,8 @@ outcomes of interest, and control variables among many others.
To create a data map according to DIME's template,\index{data linkage table}
the first step is to create a **data linkage table** by listing
all the data sources you know you will use in a spreadsheet,
-and the raw datasets that will be created from them.
-If one source of data will result in two different raw datasets,
+and the original datasets that will be created from them.
+If one source of data will result in two different datasets,
then list each dataset on its own row.
For each dataset, list the unit of observation
and the name of the **project ID**^[
@@ -223,7 +223,7 @@ the nature of the data license, and so on.
The main unit of observation in the platform survey datasets is the respondent and it is uniquely identified by the variable id. However, implicit association tests (IAT) were collected through a specialized software that outputs two datasets for each IAT instrument: one at respondent level, containing the final scores; and one with detailed information on each stimulus used in the test (images or expressions to be associated with concepts). Three IAT instruments were used: one testing the association between gender and career choices; one testing the association between car choice and safety concerns; and one testing the association between car choice and openness to sexual advances.
-As a result, the raw data for the platform survey component of the project consisted in 7 datasets: 1 for the platform survey, and 6 for the IAT -- 3 with IAT scores (one for each instrument) and 3 with detailed stimuli data (one for each instrument). All 7 datasets are stored in the same raw data folder. The data linkage table lists their file names and indicates how their ID variables are connected. Note that the raw stimulus data does not have a unique identifier, since the same stimulus can be shown repeatedly, so the “ID var” field is blank for these datasets.
+As a result, the original data for the platform survey component of the project consisted of 7 datasets: 1 for the platform survey, and 6 for the IAT -- 3 with IAT scores (one for each instrument) and 3 with detailed stimuli data (one for each instrument). All 7 datasets are stored in the same raw data folder. The data linkage table lists their file names and indicates how their ID variables are connected. Note that the raw stimulus data does not have a unique identifier, since the same stimulus can be shown repeatedly, so the “ID var” field is blank for these datasets.
| Data source | Raw dataset name | Unit of observation <br> (ID var) | Parent unit <br> (ID var) |
|-------------|------------------|---------------------------------|--------------------------|
@@ -365,7 +365,7 @@ and multi-level data like "district-school-teacher-student" structures.
```{block2, type = "ex"}
### Demand for Safe Spaces Example: Creating Data Flowcharts
-The data flow chart indicates how the raw datasets are processed and combined to create a final respondent-level dataset that will be used for analysis. The analysis dataset resulting from this process is shown in green. The raw datasets are shown in blue (refer to \ref{@linkage} for details on the raw datasets). The name of the uniquely identifying variable in the dataset is indicated in the format (ID: variable_name).
+The data flow chart indicates how the original datasets are processed and combined to create a final respondent-level dataset that will be used for analysis. The analysis dataset resulting from this process is shown in green. The original datasets are shown in blue (refer to the data linkage table example for details on the original datasets). The name of the uniquely identifying variable in the dataset is indicated in the format (ID: variable_name).
Each operation that changes the level of observation of the data is summarized in the flow chart. The chart also summarizes how datasets will be combined. Since these are the most error-prone data processing tasks, having a high-level plan for how they will be executed helps clarify the process for everyone in the data team, preventing future mistakes.
18 changes: 9 additions & 9 deletions 04-acquisition.Rmd
@@ -7,7 +7,7 @@ Data acquisition can take many forms, including:
primary data generated through surveys;
private sector partnerships granting access to new data sources, such as administrative and sensor data;
digitization of paper records, including administrative data; web scraping;
-primary data capture by unmanned aerial vehicles or other types of remote sensing;
+data captured by unmanned aerial vehicles or other types of remote sensing;
or novel integration of various types of datasets, such as combining survey and sensor data.
Much of the recent push toward credibility in the social sciences has focused on analytical practices.
However, credible development research depends, first and foremost, on the quality of the acquired data.
@@ -657,7 +657,7 @@ address any issues that arose during piloting
and cover frequently asked questions.
The manual must also describe survey protocols and conventions,
such as how to select or confirm the identity of respondents,
-and standardized means for recording responses such as ``Don't know".^[
+and standardized means for recording responses such as "Don't know".^[
For more details and examples of common survey protocols
see the DIME Wiki:
https://dimewiki.worldbank.org/Survey_Protocols]
@@ -1011,9 +1011,9 @@ There is absolutely no way to restore the data if you lose your key,
so we cannot stress enough the importance of using a password manager,
or equally secure solution, to store these encryption keys.

-It is becoming more and more common that development research
-is done on data set that is too big to store on a regular computer,
-and instead the data is stored and processed in a cloud environment.
+It is becoming more and more common for development research
+to use data that is too big to be stored on a regular computer
+and needs to be stored and processed in a cloud environment instead.
There are many available cloud storage solutions
and you need to understand how the data is encrypted and how the keys are handled.
This is likely another case where a regular research team will have to ask a cybersecurity expert.
@@ -1040,11 +1040,11 @@ This should be on your computer, and could be in a shared folder.
If your data source is a survey and the data was encrypted during data collection,
then you will need *both* the private key used during data collection to be able to download the data,
*and* the key used when you created the encrypted folder to save it there.
-This your first copy of your raw data, and the copy you will use for cleaning and analysis.
+This is your first copy of your original data, and the copy you will use for cleaning and analysis.

1. Create a second encrypted folder on an external drive that you can keep in a secure location.
Copy the data you just downloaded to this second encrypted folder.
-This is the ``master" backup copy of the raw data.
+This is the "master" backup copy of the original data.
You should never work with this data on a day-to-day basis.
You should not use the same encrypted folder or the same key as above,
because if you use the same key and lose the key,
Expand All @@ -1057,8 +1057,8 @@ and thereby do not risk losing access by losing an encryption key.
Either you can create this on your computer and upload it to a long-term cloud storage service (not a sync software),
or you can create it on another external hard drive or computer that you then store in a second location,
for example, at another office of your organization.
-This is the ``golden master" backup copy of the raw data.
-You should never store the ``golden master" copy in a synced folder,
+This is the "golden master" backup copy of the original data.
+You should never store the "golden master" copy in a synced folder,
as it would be deleted in the cloud storage if it is deleted on your computer.
You should also never work with this data;
it exists only for recovery purposes.
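The copying step for the backup copies can itself be scripted. The sketch below assumes the encrypted folders have already been created and unlocked with your encryption tool, and all paths are hypothetical:

```stata
* WHAT: Place a copy of the downloaded data in each backup location
* WHY:  The "master" and "golden master" copies exist only for recovery
*       and should never be used for day-to-day work
copy "C:/encrypted-working/survey_original.dta" "E:/encrypted-master/survey_original.dta"
copy "C:/encrypted-working/survey_original.dta" "F:/encrypted-golden-master/survey_original.dta"
```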
