
Commit

updated documentation following workshop
deanmarchiori committed Apr 18, 2024
1 parent dc72e62 commit 44c623a
Showing 6 changed files with 47 additions and 6 deletions.
7 changes: 7 additions & 0 deletions R/othertext_lookup.R
@@ -11,10 +11,17 @@
#' collected. This function provides a manual look up reference so free text responses
#' can be compared to the original questions in the validation workflow.
#'
#' This function can be expanded by providing a tibble with two columns, `name` and
#' `other_name`, which map the question name in ODK to the name of the corresponding
#' question containing the 'other' or free-text response.
#'
#' @param questionnaire The ODK questionnaire. Used to ensure the correct look up table is found.
#'
#' @return tibble
#' @export
#' @examples
#' othertext_lookup(questionnaire = c("animal_owner"))
#'
#'
othertext_lookup <- function(questionnaire = c("animal_owner")){

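As an illustration of the expansion described in the documentation above, a custom lookup table might be built as follows; the question names used here are made-up examples, not values from the actual questionnaire:

```{r, eval=FALSE}
# Hypothetical custom lookup mapping each ODK question name to the name of the
# question holding its free-text 'other' response (illustrative names only).
custom_lookup <- tibble::tibble(
  name       = c("animal_species", "feed_type"),
  other_name = c("animal_species_other", "feed_type_other")
)

# The built-in lookup for the animal owner questionnaire
othertext_lookup(questionnaire = c("animal_owner"))
```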
9 changes: 9 additions & 0 deletions man/othertext_lookup.Rd

Binary file added vignettes/img/integration.png
Binary file added vignettes/img/pipeline2.png
18 changes: 13 additions & 5 deletions vignettes/integration.Rmd
@@ -40,7 +40,11 @@ Also below is the relevant target that performs the joining operation and integr
)
```

It is critical that data validation steps are performed correctly to ensure the integration of multiple data sets is successful. Where there are missing, malformed or duplicate primary keys, the expectations around the relationship type will not hold.

An overview of this integration process is below.

![](img/integration.png)
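For example, a minimal pre-join check on the primary key could look like the sketch below; the data frame names and the `record_id` key column are assumptions for illustration only:

```{r, eval=FALSE}
# Sketch: confirm the primary key is present, complete and unique before joining
stopifnot(
  !any(is.na(animal_owner$record_id)),      # no missing keys
  !any(duplicated(animal_owner$record_id))  # no duplicate keys
)

# The join is only performed once the key checks above pass
combined <- dplyr::left_join(animal_owner, animal_health, by = "record_id")
```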

## Types of Data

@@ -56,7 +60,11 @@ Throughout the data cleaning pipeline, we take in raw data and convert it to som

Over the course of a long data collection exercise, standards and formats can diverge. This makes the data cleaning steps difficult and will slow down the ability to integrate data as above. Some general strategies can help to mitigate these risks:

- Ensure that data are formatted to a standard where each row uniquely identifies a record and has a unique identifier, without duplicates or missing values.
- Ensure all data are stored as a data.frame or tibble.
- In R, ensure that each column has a single, correct column type (e.g. character or numeric, and *not* list).
- Each column name should be unique and formatted consistently to avoid spaces and special characters in column names. Hint: Use `janitor::clean_names()` (see the sketch after this list).
- Design and enforce a Primary Key or Unique Identifier for each data set that will be meaningful and immutable.
- Think about storing data in a 'tidy' format where possible. See here: <https://www.jstatsoft.org/article/view/v059i10>
- Store raw data in a machine-readable format (e.g. CSV)
- Set some metadata standards at the start of the project around columns and data types. It is understandable that these might change over time, but having these standards will help plan how to best accommodate changes without breaking existing work.
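A short sketch of some of these checks is below; the file path and column names are illustrative assumptions:

```{r, eval=FALSE}
# Read machine-readable raw data and standardise the column names
raw <- readr::read_csv("data/raw/field_data.csv") |>
  janitor::clean_names()

# Confirm a usable unique identifier and sensible column types
stopifnot(
  !any(is.na(raw$record_id)),             # identifier has no missing values
  !any(duplicated(raw$record_id)),        # identifier has no duplicates
  !any(vapply(raw, is.list, logical(1)))  # no list-columns
)
```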
19 changes: 18 additions & 1 deletion vignettes/mechanics.Rmd
@@ -40,7 +40,9 @@ Once these rules have been defined, they are 'confronted' with the data. Those r

### Integrating corrections

The final step involves correcting the data and converting it into a 'semi-clean' data set. This involves reading in the validation log, scanning for any changes that are indicated, and then correcting the existing values with the newly supplied values.

To ensure corrections are applied as expected, a function `validation_checks()` is provided to compare the before and after data, along with the validation logs. This function will error if the checks are not satisfied; when successful, it returns the summary output from `arsenal::comparedf()`.
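The underlying comparison can also be inspected directly; a minimal sketch, assuming the objects are named `raw_data` and `semi_clean_data` and are keyed by `record_id`:

```{r, eval=FALSE}
library(arsenal)

# Compare the data before and after corrections, matched on the record identifier
comparison <- comparedf(raw_data, semi_clean_data, by = "record_id")
summary(comparison)  # the same summary output returned by validation_checks()
```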

A conceptual overview of this process is outlined below.

@@ -50,6 +52,10 @@ A conceptual overview of this process is outlined below.

### Pipeline

The code sample below demonstrates a typical end-to-end validation pipeline. While the data pre-processing steps for any given project may vary, much of this code can be re-used as a template for future pipelines. A simplified sketch of the general shape is given after the pipeline diagram below.

```{r, eval=FALSE}
fs_mosquito_field_targets <- tar_plan(
  # FS mosquito field data googlesheets ids
  # ...
)
```

@@ -287,6 +293,17 @@ A directed acyclic graph of a {targets} pipeline is shown below for a real examp

![](img/targets.png)
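For orientation, the general shape of such a validation plan might look like the following sketch; the target names, file path and rules are illustrative assumptions, not the project's actual pipeline:

```{r, eval=FALSE}
library(targets)
library(tarchetypes)

example_validation_targets <- tar_plan(
  # Track the raw input file and read it in
  tar_target(raw_file, "data/raw/field_data.csv", format = "file"),
  tar_target(raw_data, readr::read_csv(raw_file)),

  # Define validation rules and confront them with the data
  tar_target(rules, validate::validator(!is.na(record_id), is_unique(record_id))),
  tar_target(confronted, validate::confront(raw_data, rules)),

  # Keep the violating records as a log for manual review and correction
  tar_target(validation_log, validate::violating(raw_data, confronted))
)
```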

### Revised pipeline

In practice, after reading through the targets pipeline, a refinement to the flow chart is required to reflect how the pipeline works beyond its initial creation.

To ensure validation log corrections are applied correctly, any existing validation log is read in first. The raw data are then corrected, and the validation log is re-run on the 'semi-clean' data so that only new violations that were not already reviewed and corrected are flagged. Any new violations are appended to the existing log and uploaded to Dropbox. A sketch of this step follows the diagram below.

![](img/pipeline2.png)
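A sketch of the 'flag only new violations' step might look like this; the log object and column names are assumptions:

```{r, eval=FALSE}
library(dplyr)

# Violations found on the 'semi-clean' data are compared against the existing log;
# anything already reviewed is dropped and only genuinely new rows are appended.
new_violations <- anti_join(
  current_violations, existing_log,
  by = c("record_id", "rule")
)
updated_log <- bind_rows(existing_log, new_violations)
```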

## Complex Cases

In some cases, such as with questionnaire data, multiple different logs are created
