course data conundrum #24

aspina7 · 2024-11-11T00:56:32Z

Situation

We currently have many different versions of the course datasets scattered across various exercise and course_folder repos.

Background

Most of the datasets are derived from the epirhandbook ebola datasets; however, these new versions weren't created reproducibly, and there are several names for mostly the same content.
Similarly, the covid datasets come from the fulton course, but they weren't simply copied over, and they have different names in different places.

Assessment

file location and naming
The main issues are summarised in this Google Sheet.
repo_data: all datasets across our GitHub.
course folder datasets: datasets called with import() in course folders (NB: this may be missing calls made in Rmd files)
Course exercises datasets: datasets called with import() in course exercise repos (includes calls in both R and Rmd files)
Pivot Table 2: summary of course exercises datasets to show what files are in which courses (see screenshot below)
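As a rough illustration of how import() calls could be collected from a repo, here is a minimal sketch in base R. It is not the actual search.r script: the function name find_import_calls() and the list of file extensions are assumptions made for the example. It scans both R and Rmd files, so calls in Rmd files are not skipped.

```r
# Hypothetical helper (not the real search.r): list the data files that are
# referenced on lines containing an import() call, in .R and .Rmd scripts.
find_import_calls <- function(repo_dir) {
  scripts <- list.files(repo_dir, pattern = "\\.(r|rmd)$",
                        ignore.case = TRUE, recursive = TRUE,
                        full.names = TRUE)

  rows <- lapply(scripts, function(path) {
    lines <- readLines(path, warn = FALSE)
    hits <- grep("import\\(", lines, value = TRUE)
    # Assumption: dataset paths are quoted strings ending in .rds/.xlsx/.csv
    quoted <- unlist(regmatches(
      hits, gregexpr("[\"'][^\"']+\\.(rds|xlsx|csv)[\"']", hits)
    ))
    if (length(quoted) == 0) return(NULL)
    data.frame(script = path,
               dataset = gsub("[\"']", "", quoted),
               stringsAsFactors = FALSE)
  })

  # Drop scripts without matches, then stack the rest into one table
  rows <- Filter(Negate(is.null), rows)
  if (length(rows) == 0) return(NULL)
  do.call(rbind, rows)
}
```

Calling find_import_calls() on a local clone of a repo returns one row per referenced file, which could then be written to a spreadsheet. Paths built from variables rather than quoted strings would not be picked up by this approach.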

file content
Scripts for comparing datasets (ebola and covid) can be found in the ae-ideas::data_compare repo. search.r produces the Excel lists above, and compare.r analyses file dimensions and content.
For example, looking only at ebola-related linelists, we see below that the numbers of rows and columns differ widely across datasets.
Then, looking at the variables present in linelist_combined_20141201.rds only, we see that the intro version uses gender whereas the stats and rmd versions use sex (we should use the latter), and that the stats course includes extra variables for "bleeding" and "healthcare worker".
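The kind of comparison described above can be sketched in base R as follows. This is not the actual compare.r script: the function names and the toy data frames are made up for illustration, and the column names are assumptions. In practice each version would be loaded with import() or readRDS() first.

```r
# Hypothetical helpers (not the real compare.r).

# One row per dataset version with its number of rows and columns.
compare_dims <- function(datasets) {
  data.frame(
    dataset = names(datasets),
    n_rows = vapply(datasets, nrow, integer(1)),
    n_cols = vapply(datasets, ncol, integer(1)),
    row.names = NULL
  )
}

# Logical matrix: rows are variable names, columns are dataset versions,
# TRUE where that version contains that variable.
compare_names <- function(datasets) {
  all_vars <- sort(unique(unlist(lapply(datasets, names))))
  presence <- do.call(
    cbind,
    lapply(datasets, function(d) all_vars %in% names(d))
  )
  rownames(presence) <- all_vars
  presence
}

# Toy stand-ins for two versions of the same linelist (illustrative only)
versions <- list(
  intro = data.frame(case_id = 1:3, gender = c("f", "m", "f")),
  stats = data.frame(case_id = 1:5, sex = c("f", "m", "f", "m", "f"),
                     bleeding = c(TRUE, FALSE, FALSE, TRUE, FALSE))
)

dims <- compare_dims(versions)
presence <- compare_names(versions)
```

Rows of presence that are not TRUE in every column point to variables that are missing or named differently in some versions, such as gender versus sex.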

While covid looks better in terms of the number of rows and columns, there are some discrepancies in variable naming.

Recommendation

Re-think which datasets are actually needed (i.e. we should substantially reduce the number)
Go back and reproducibly create master datasets, containing the variables needed across all courses, from the epirhandbook versions of linelist_raw.xlsx and linelist_cleaned.rds
Create subsets of those for specific courses as necessary. As described in {appliedepidata}, this needs to be reproducible in a parent > child format.
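One way the parent > child idea could look in practice is sketched below in base R. It does not use {appliedepidata}, and the function names, column names and toy data are all assumptions made for the example. The point is that every course subset is produced by code from a single master dataset, and the script fails loudly if a course asks for a variable the master does not have.

```r
# Hypothetical helpers illustrating a parent > child derivation.

# Rename legacy variable names to the agreed ones (here: gender -> sex).
harmonise_names <- function(df, renames = c(gender = "sex")) {
  old <- intersect(names(renames), names(df))
  names(df)[match(old, names(df))] <- unname(renames[old])
  df
}

# Derive a course-specific child dataset from the parent (master) dataset.
derive_child <- function(parent, keep_vars) {
  missing_vars <- setdiff(keep_vars, names(parent))
  if (length(missing_vars) > 0) {
    stop("Not in parent dataset: ", paste(missing_vars, collapse = ", "))
  }
  parent[, keep_vars, drop = FALSE]
}

# Toy master dataset (illustrative only; the real one would be built from
# linelist_raw.xlsx and linelist_cleaned.rds)
master <- harmonise_names(data.frame(
  case_id = 1:4,
  gender = c("f", "m", "m", "f"),
  age = c(23, 41, 8, 65),
  bleeding = c(TRUE, FALSE, FALSE, TRUE)
))

# Child datasets for two courses, each derived from the same parent
intro_linelist <- derive_child(master, c("case_id", "sex", "age"))
stats_linelist <- derive_child(master, c("case_id", "sex", "age", "bleeding"))
```

Keeping the list of variables per course in one script means any change to the master propagates to every course dataset when the script is re-run.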

Timeline

Depending on decisions around course restructuring, the datasets could be revamped by the end of January 2025.

aspina7 assigned jarvisc1 and nsbatra Nov 11, 2024
