
Commit

updated documentation following workshop
deanmarchiori committed Apr 18, 2024
1 parent dc72e62 commit 44c623a
Showing 6 changed files with 47 additions and 6 deletions.
7 changes: 7 additions & 0 deletions R/othertext_lookup.R
@@ -11,10 +11,17 @@
#' collected. This function provides a manual look up reference so free text responses
#' can be compared to the original questions in the validation workflow.
#'
#' This function can be expanded by providing a tibble with two columns, `name` and
#' `other_name`, which map the question name in ODK to the name of the corresponding
#' question containing the 'other' or free-text response.
#'
#' @param questionnaire The ODK questionnaire. Used to ensure the correct look up table is found.
#'
#' @return tibble
#' @export
#' @examples
#' othertext_lookup(questionnaire = c("animal_owner"))
#'
#'
othertext_lookup <- function(questionnaire = c("animal_owner")){

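As an illustration of the expansion described in the documentation above, a custom lookup table might be built as follows; the question names used here are made-up examples, not values from the actual questionnaire:

```{r, eval=FALSE}
# Hypothetical custom lookup mapping each ODK question name to the name of the
# question holding its free-text 'other' response (illustrative names only).
custom_lookup <- tibble::tibble(
  name       = c("animal_species", "feed_type"),
  other_name = c("animal_species_other", "feed_type_other")
)

# The built-in lookup for the animal owner questionnaire
othertext_lookup(questionnaire = c("animal_owner"))
```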
9 changes: 9 additions & 0 deletions man/othertext_lookup.Rd

Binary file added vignettes/img/integration.png
Binary file added vignettes/img/pipeline2.png
18 changes: 13 additions & 5 deletions vignettes/integration.Rmd
@@ -40,7 +40,11 @@ Also below is the relevant target that performs the joining operation and integr
)
```

It is critical that data validation steps are performed correctly to ensure the integration of multiple data sets is successful. Where there are missing, malformed or duplicate primary keys, the expectations around the relationship type will not hold.

An overview of this integration process is below.

![](img/integration.png)
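For example, a minimal pre-join check on the primary key could look like the sketch below; the data frame names and the `record_id` key column are assumptions for illustration only:

```{r, eval=FALSE}
# Sketch: confirm the primary key is present, complete and unique before joining
stopifnot(
  !any(is.na(animal_owner$record_id)),      # no missing keys
  !any(duplicated(animal_owner$record_id))  # no duplicate keys
)

# The join is only performed once the key checks above pass
combined <- dplyr::left_join(animal_owner, animal_health, by = "record_id")
```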

## Types of Data

@@ -56,7 +60,11 @@ Throughout the data cleaning pipeline, we take in raw data and convert it to som

Over the course of a long data collection exercise, standards and formats can diverge. This makes the data cleaning steps difficult and will slow down the ability to integrate data as above. Some general strategies can help to mitigate these risks:

- Ensure that data are formatted to a standard where each row uniquely identifies a record and has a unique identifier, without duplicates or missing values.
- Ensure all data are stored as a data.frame or tibble.
- In R, ensure that each column has a single, correct column type (e.g. character or numeric, and *not* list).
- Each column name should be unique and formatted consistently to avoid spaces and special characters in column names. Hint: Use `janitor::clean_names()` (see the sketch after this list).
- Design and enforce a Primary Key or Unique Identifier for each data set that will be meaningful and immutable.
- Think about storing data in a 'tidy' format where possible. See here: <https://www.jstatsoft.org/article/view/v059i10>
- Store raw data in a machine-readable format (e.g. CSV)
- Set some metadata standards at the start of the project around columns and data types. It is understandable that these might change over time, but having these standards will help plan how to best accommodate changes without breaking existing work.
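A short sketch of some of these checks is below; the file path and column names are illustrative assumptions:

```{r, eval=FALSE}
# Read machine-readable raw data and standardise the column names
raw <- readr::read_csv("data/raw/field_data.csv") |>
  janitor::clean_names()

# Confirm a usable unique identifier and sensible column types
stopifnot(
  !any(is.na(raw$record_id)),             # identifier has no missing values
  !any(duplicated(raw$record_id)),        # identifier has no duplicates
  !any(vapply(raw, is.list, logical(1)))  # no list-columns
)
```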
19 changes: 18 additions & 1 deletion vignettes/mechanics.Rmd
@@ -40,7 +40,9 @@ Once these rules have been defined, they are 'confronted' with the data. Those r

### Integrating corrections

The final step involves correcting the data and converting it into a 'semi-clean' data set. This involves reading in the validation log, scanning for any changes that are indicated, and then correcting the existing values with the newly supplied values.

To ensure corrections are applied as expected, a function `validation_checks()` is provided to compare the before and after data, along with the validation logs. This function will error if the checks are not satisfied; when successful, it returns the summary output from `arsenal::comparedf()`.
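The underlying comparison can also be inspected directly; a minimal sketch, assuming the objects are named `raw_data` and `semi_clean_data` and are keyed by `record_id`:

```{r, eval=FALSE}
library(arsenal)

# Compare the data before and after corrections, matched on the record identifier
comparison <- comparedf(raw_data, semi_clean_data, by = "record_id")
summary(comparison)  # the same summary output returned by validation_checks()
```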

A conceptual overview of this process is outlined below.

@@ -50,6 +52,10 @@ A conceptual overview of this process is outlined below.

### Pipeline

The code sample below demonstrates a typical end-to-end validation pipeline. While the data pre-processing steps for any given project may vary, much of this code can be re-used as a template for future pipelines. A simplified sketch of the general shape is given after the pipeline diagram below.

```{r, eval=FALSE}
fs_mosquito_field_targets <- tar_plan(
  # FS mosquito field data googlesheets ids
  # ...
)
```

@@ -287,6 +293,17 @@ A directed acyclic graph of a {targets} pipeline is shown below for a real examp

![](img/targets.png)
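For orientation, the general shape of such a validation plan might look like the following sketch; the target names, file path and rules are illustrative assumptions, not the project's actual pipeline:

```{r, eval=FALSE}
library(targets)
library(tarchetypes)

example_validation_targets <- tar_plan(
  # Track the raw input file and read it in
  tar_target(raw_file, "data/raw/field_data.csv", format = "file"),
  tar_target(raw_data, readr::read_csv(raw_file)),

  # Define validation rules and confront them with the data
  tar_target(rules, validate::validator(!is.na(record_id), is_unique(record_id))),
  tar_target(confronted, validate::confront(raw_data, rules)),

  # Keep the violating records as a log for manual review and correction
  tar_target(validation_log, validate::violating(raw_data, confronted))
)
```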

### Revised pipeline

In practice, after reading through the targets pipeline, a refinement to the flow chart is required to reflect how the pipeline works beyond its initial creation.

To ensure validation log corrections are applied correctly, any existing validation log is read in first. The raw data are then corrected, and the validation log is re-run on the 'semi-clean' data so that only new violations that were not already reviewed and corrected are flagged. Any new violations are appended to the existing log and uploaded to Dropbox. A sketch of this step follows the diagram below.

![](img/pipeline2.png)
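A sketch of the 'flag only new violations' step might look like this; the log object and column names are assumptions:

```{r, eval=FALSE}
library(dplyr)

# Violations found on the 'semi-clean' data are compared against the existing log;
# anything already reviewed is dropped and only genuinely new rows are appended.
new_violations <- anti_join(
  current_violations, existing_log,
  by = c("record_id", "rule")
)
updated_log <- bind_rows(existing_log, new_violations)
```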

## Complex Cases

In some cases, such as with questionnaire data, multiple different logs are created
