Case Study: Report of Fields With Missing Values
A research institution conducted a randomized trial to evaluate the effectiveness of a surgical technique. Before beginning data analysis, the research team wanted to know whether data collection was incomplete for any patient and to confirm that all critical fields had been completed. With 155 records, 477 fields, and 6 events, a manual audit of the data was impractical. The statistician was asked whether it was possible to create a report of all fields containing missing values for each patient-event. Three obstacles complicated the request:
1. Branching Logic: When branching logic applies, fields do not appear on the data input form unless certain conditions are met. If the field does not appear, we expect the data to be missing and the field should not be included in the report.
2. Applicability of Forms: Some forms may not apply to all patients. Most notably, the Adverse Event form for a patient is not filled out if the patient did not experience an adverse event. Fields in these scenarios should not be included in the report.
3. Clinical Interest: Not all fields are of clinical interest to the researchers, and some may not be vital to the analysis. Only fields of clinical interest should be included in the report.
To complete the request, a function was written that can access the REDCap database and search for missing values. The function returns a `data.frame` object listing the patient ID, REDCap event name, data access group, number of missing fields, and the list of missing fields.
The `missingSummary` function is not a formal part of the `redcapAPI` package, though it makes use of the package functionality. The function can be downloaded as a gist from GitHub. After saving the code to the location `gist_location`, it can be loaded into the R workspace using

```r
# gist_location holds the path where the gist was saved
source(gist_location)
```
`missingSummary` is a generic function with an active method for `redcapApiConnection` objects. The core arguments are:
- `rcon`: a REDCap connection object generated by `redcapConnection`.
- `excludeMissingForms`: when `TRUE`, this assumes that if a form contains only missing values, those values were intended to be missing (such as no adverse events), and those fields are left off the report.
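
To illustrate the rule behind `excludeMissingForms`, here is a minimal sketch in base R. It is not the code from the gist: the toy `records` data frame, the `forms` mapping, and all field names are hypothetical and exist only to show how a form with nothing but missing values can be dropped from the count.

```r
# Toy records covering two hypothetical forms:
# "demographics" (age, sex) and "adverse_event" (ae_type, ae_date).
records <- data.frame(
  id      = c("p1", "p2", "p3"),
  age     = c(54, NA, 61),
  sex     = c("F", "M", "F"),
  ae_type = c(NA, "bleeding", NA),
  ae_date = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical mapping of form name to the fields on that form.
forms <- list(
  demographics  = c("age", "sex"),
  adverse_event = c("ae_type", "ae_date")
)

# Start from the raw missingness of every field on every form.
is_missing <- is.na(records[, unlist(forms, use.names = FALSE)])

# If every field on a form is NA for a row, assume the form was never
# used for that row and stop counting its fields as missing.
for (fields in forms) {
  form_unused <- rowSums(!is.na(records[, fields, drop = FALSE])) == 0
  is_missing[form_unused, fields] <- FALSE
}

n_missing <- rowSums(is_missing)
```

In this toy data only `p2` has a partly completed adverse event form, so only `p2` is charged with a missing `ae_date`; the untouched adverse event forms for `p1` and `p3` contribute nothing.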
For `redcapApiConnection` objects, the user may also specify `proj`, a `redcapProjectInfo` object (created by the `redcapProjectInfo` function), and the `batch.size` for limiting the number of record IDs pulled in any one batch.
If using the function in "offline mode" (meaning using the data downloads instead of the API), the `rcon` argument is replaced by `records` and `meta_data`, which take the file paths of the raw/unlabelled data download and the data dictionary, respectively.
`missingSummary` operates by doing the following tasks:
- Export records from REDCap.
- Export meta data from REDCap.
- Translate REDCap branching logic to `R` expressions.
- Designate fields excluded by branching logic as non-missing.
- Designate fields excluded by unused forms as non-missing.
- Apply `is.na` to all fields.
- Produce the summary of results.
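
The branching-logic steps above can be sketched on toy data. This is a simplified illustration, not the gist's implementation: `logic_to_r` is a hypothetical helper that handles only bracketed field names, `=`, `<>`, `and`, and `or`, and the `meta` and `records` data frames are invented for the example.

```r
# Toy data dictionary: field names with REDCap-style branching logic.
meta <- data.frame(
  field_name      = c("smoker", "packs_per_day", "pregnant"),
  branching_logic = c("", "[smoker] = '1'", "[sex] = '2' and [age] < 50"),
  stringsAsFactors = FALSE
)

# Toy records (raw, unlabelled values).
records <- data.frame(
  id            = c("p1", "p2", "p3"),
  sex           = c("1", "2", "2"),
  age           = c(40, 35, 62),
  smoker        = c("0", "1", NA),
  packs_per_day = c(NA, NA, NA),
  pregnant      = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical, deliberately simplified translator from REDCap logic to R.
logic_to_r <- function(x) {
  x <- gsub("\\[([A-Za-z0-9_]+)\\]", "\\1", x)          # [field] -> field
  x <- gsub("<>", "!=", x, fixed = TRUE)                # <>      -> !=
  x <- gsub("(?<![<>!=])=(?!=)", "==", x, perl = TRUE)  # lone =  -> ==
  x <- gsub("\\band\\b", "&", x, ignore.case = TRUE)    # and     -> &
  x <- gsub("\\bor\\b", "|", x, ignore.case = TRUE)     # or      -> |
  x
}

# Apply is.na to the fields, then clear the flag wherever the field was hidden.
is_missing <- is.na(records[, meta$field_name])
for (i in seq_len(nrow(meta))) {
  logic <- meta$branching_logic[i]
  if (!nzchar(logic)) next
  shown <- eval(parse(text = logic_to_r(logic)), envir = records)
  # A field is treated as hidden when its condition is FALSE or cannot be evaluated.
  hidden <- is.na(shown) | !shown
  is_missing[hidden, meta$field_name[i]] <- FALSE
}

# One row per record: count of missing fields and their names.
summary_df <- data.frame(
  id        = records$id,
  n_missing = rowSums(is_missing),
  missing   = apply(is_missing, 1, function(r) paste(colnames(is_missing)[r], collapse = ", ")),
  stringsAsFactors = FALSE
)
```

Treating an unevaluable condition as "hidden" is a design choice of this sketch; a stricter report could flag those rows instead.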
This procedure deals with obstacles 1 and 2. Obstacle 3 is dealt with outside of the function, and we will show how to do this later.
The basic summary, which addresses obstacles 1 and 2, is produced with:
```r
library(redcapAPI)
library(stringr)

# Bracketed values are placeholders for the site-specific URL and API token.
options(redcap_api_url = [REDCAP_API_ADDRESS])
rcon <- redcapConnection(token = [SUPER_SECRET_TOKEN])

Miss <- missingSummary(rcon)
```
The resulting summary produces the following data frame (only the first six rows are shown):

| | patient_id_incl | redcap_event_name | redcap_data_access_group | n_missing | missing |
|---|---|---|---|---|---|
| 1 | 10-abc | baseline_arm_1 | dag1 | 3 | reason, partner_satisfaction, lying |
| 2 | 10-abc | surgery_arm_1 | dag1 | 0 | |
| 3 | 10-abc | 6_week_followup_arm_1 | dag1 | 1 | pt_initials_pod_7 |
| 4 | 10-abc | 3_month_followup_arm_1 | dag1 | 8 | pain_interfere, reason, partner_satisfaction, overall_satisfaction, incon16, recur_incon17, bulk_agent18, urin_ret19 |
| 5 | 10-abc | 6_month_followup_arm_1 | dag1 | 0 | |
| 6 | 10-abc | 12_month_followup_arm_1 | dag1 | 4 | limit, reason, partner_satisfaction, incon16 |
Not every variable in the table above is a concern when it is missing. For instance, if initials are missing from `pt_initials_pod_7` at the six week follow-up, it won't affect the results of the analysis. To reduce this table to just the clinically meaningful variables, we require additional information from the researchers.
In this case, the researchers provided a list of fields they felt were crucial to gather. With their list of fields, the following code was run:
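```r
# Fields the researchers identified as crucial.
varsToKeep <- c("hosp_6m", "er_visits_6_m", "clinic_6_m", ...
                [A lot of other field names],
                "prolapse", "sexual", "urinary_incont")

Miss$missing <- as.character(Miss$missing)
Miss$label <- NA_character_

# VAULT is not defined in this excerpt; it is assumed to be the data frame of
# exported records whose columns carry their REDCap field labels.
for (i in seq_len(nrow(Miss))) {
  # Split the comma-separated field names and keep only the crucial ones.
  tmp <- unlist(str_split(Miss$missing[i], ", "))
  tmp <- tmp[tmp %in% varsToKeep]
  # Look up the labels before collapsing the names into a single string.
  lbl <- if (length(tmp) > 0) paste(Hmisc::label(VAULT[, tmp]), collapse = "\r\n") else ""
  tmp <- if (length(tmp) > 0) paste(tmp, collapse = "\r\n") else ""
  Miss$missing[i] <- tmp
  Miss$label[i] <- lbl
}
```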
The `label` column was added for the convenience of the data entry staff, since the field label is probably more familiar to them than the field name. The resulting data frame shows:
| | patient_id_incl | redcap_event_name | redcap_data_access_group | n_missing | missing | label |
|---|---|---|---|---|---|---|
| 1 | 10-abc | baseline_arm_1 | dag1 | 3 | | |
| 2 | 10-abc | surgery_arm_1 | dag1 | 0 | | |
| 3 | 10-abc | 6_week_followup_arm_1 | dag1 | 1 | | |
| 4 | 10-abc | 3_month_followup_arm_1 | dag1 | 8 | incon16; recur_incon17; bulk_agent18; urin_ret19 | 18. Has the patient had surgery for new urinary incontinence since the study procedure ?; 19. Has the patient had surgery for recurrent or persistent urinary incontinence since the study procedure?; 20. Has the patient undergone bulking agent Injection (i.e. Collagen, Contigen, Durasphere, Coaptite) for post-op urinary incontinence since the study procedure?; 21. Does the patient have or has the patient had voiding dysfunction or urinary retention requiring intermittent self-catheritization or surgery (i.e. urethrolysis) since the study procedure. (Do not include catheterization less than 6 weeks postop.) |
| 5 | 10-abc | 6_month_followup_arm_1 | dag1 | 0 | | |
| 6 | 10-abc | 12_month_followup_arm_1 | dag1 | 4 | incon16 | 18. Has the patient had surgery for new urinary incontinence since the study procedure ? |
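
The same filtering can be tried on toy data without a REDCap connection. The sketch below is a vectorized base R variant under stated assumptions: the short `varsToKeep` list, the `field_labels` lookup (standing in for labels stored on the exported records), and the extra `n_relevant` column are all hypothetical.

```r
# Toy version of the summary returned by missingSummary().
Miss <- data.frame(
  patient_id_incl   = c("10-abc", "10-abc", "10-abc"),
  redcap_event_name = c("baseline_arm_1", "surgery_arm_1", "3_month_followup_arm_1"),
  missing           = c("reason, partner_satisfaction, lying", "",
                        "pain_interfere, incon16, urin_ret19"),
  stringsAsFactors = FALSE
)

# Hypothetical short list of clinically relevant fields.
varsToKeep <- c("incon16", "urin_ret19")

# Hypothetical label lookup; the wording of the labels is invented.
field_labels <- c(
  incon16    = "Surgery for new urinary incontinence?",
  urin_ret19 = "Voiding dysfunction or urinary retention?"
)

# Split each comma-separated list and keep only the relevant field names.
kept <- lapply(strsplit(Miss$missing, ", ", fixed = TRUE),
               function(x) x[x %in% varsToKeep])

# Collapse names and labels, one field per line within the cell.
Miss$missing    <- vapply(kept, paste, character(1), collapse = "\r\n")
Miss$label      <- vapply(kept, function(x) paste(field_labels[x], collapse = "\r\n"),
                          character(1))
Miss$n_relevant <- lengths(kept)
```

Unlike `n_missing`, which still counts every missing field, the hypothetical `n_relevant` column counts only the fields that survive the filter, which makes it easy to sort or subset the report.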
Note: In the code, I specify the characters `\r\n` to separate each field and field label. I replaced these with semi-colons for the wiki presentation. The `\r\n` separator is useful when passing a CSV file back to the researchers because it places each field on a separate line within its cell, which makes for easy reading.
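
As a small sketch of the two presentations, the base R snippet below builds one toy row, writes it to a CSV with the `\r\n` separator intact, and derives a semi-colon version for display. The `report` data frame and the temporary file path are assumptions made for the example.

```r
# One toy row with two missing fields joined by "\r\n".
report <- data.frame(
  patient_id_incl = "10-abc",
  missing         = paste(c("incon16", "urin_ret19"), collapse = "\r\n"),
  stringsAsFactors = FALSE
)

# CSV for the researchers: the quoted cell keeps its embedded line break.
csv_path <- tempfile(fileext = ".csv")
write.csv(report, csv_path, row.names = FALSE)

# Display version: swap the line breaks for semi-colons.
display <- gsub("\r\n", "; ", report$missing, fixed = TRUE)
```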