course data conundrum #24

Open
aspina7 opened this issue Nov 11, 2024 · 0 comments
Situation

We currently have many different versions of the course datasets floating around various exercise and course_folder repos.

Background

Most of the datasets are derivations of the epirhandbook ebola datasets; however, these new versions weren't created reproducibly, and there are several names for mostly the same content.
Similarly, the covid datasets come from the fulton course, but they weren't simply copied over unchanged, and they have different names in different places.

Assessment

file location and naming
The main issues are summarised in this google sheet.
repo_data: all datasets across our github.
course folder datasets: datasets called with import() in course folders (NB: this may be missing calls made in Rmd files)
course exercises datasets: datasets called with import() in course exercise repos (includes calls in both R and Rmd files)
Pivot Table 2: summary of course exercises datasets to show what files are in which courses (see screenshot below)

[Screenshot: Pivot Table 2, course exercises datasets by course]
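As a rough illustration of how the import() calls listed in those sheets could be collected, here is a minimal base R sketch. It is not the actual search.r: the function name `find_import_calls` and the regular expression are assumptions, and it only catches calls whose first argument is a quoted file name (so something like `import(here("data", "x.rds"))` would be missed).

```r
# Hypothetical helper (not the real search.r): list the file names passed
# to import() in the R and Rmd scripts found under a directory.
find_import_calls <- function(dir) {
  # Collect .R and .Rmd scripts recursively (extension match ignores case)
  scripts <- list.files(dir, pattern = "\\.(r|rmd)$", ignore.case = TRUE,
                        recursive = TRUE, full.names = TRUE)

  # Matches import("file") or import('file'); rio::import() is matched too
  pattern <- "import\\(\\s*[\"']([^\"']+)[\"']"

  results <- lapply(scripts, function(path) {
    lines <- readLines(path, warn = FALSE)
    # Keep the first import() call found on each matching line
    hits <- regmatches(lines, regexpr(pattern, lines, perl = TRUE))
    if (length(hits) == 0) return(NULL)
    data.frame(script  = basename(path),
               dataset = sub(pattern, "\\1", hits, perl = TRUE),
               stringsAsFactors = FALSE)
  })

  # One row per import() call; NULL if nothing was found
  do.call(rbind, results)
}
```

Calling `find_import_calls("path/to/course_repo")` (path is a placeholder) returns one row per matched call, which could then be tabulated by course.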

file content
Scripts for comparing datasets (ebola and covid) can be found in the ae-ideas::data_compare repo. search.r produces the Excel lists above, and compare.r analyses file dimensions and content.
For example, looking only at the ebola-related linelists, we see below that the number of rows and columns differs widely across datasets.
Then, looking at the variables in linelist_combined_20141201.rds alone, we see that the intro version uses gender whereas the stats and rmd versions use sex (we should use the latter); the stats course version also includes extra variables for "bleeding" and "healthcare worker".

[Screenshot: comparison of ebola linelist dimensions and variables]
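The kind of comparison described above (dimensions, then which variables exist in which version) can be sketched in base R as follows. This is not the actual compare.r; the helper name `compare_datasets` is an assumption, and the two toy data frames are made-up stand-ins that only mimic the gender/sex and "bleeding" differences mentioned in the text.

```r
# Hypothetical helper (not the real compare.r): summarise dimensions and
# variable presence for a named list of data frames.
compare_datasets <- function(datasets) {
  all_vars <- sort(unique(unlist(lapply(datasets, names))))

  dims <- data.frame(
    dataset = names(datasets),
    n_rows  = unname(vapply(datasets, nrow, integer(1))),
    n_cols  = unname(vapply(datasets, ncol, integer(1)))
  )

  # Logical matrix: one row per variable, one column per dataset
  presence <- vapply(datasets, function(d) all_vars %in% names(d),
                     logical(length(all_vars)))
  rownames(presence) <- all_vars

  list(dims = dims, presence = presence)
}

# Toy stand-ins for two course versions (invented values, not real data)
intro <- data.frame(case_id = 1:3, gender = c("f", "m", "f"))
stats <- data.frame(case_id = 1:5,
                    sex = c("f", "m", "f", "m", "f"),
                    bleeding = c(TRUE, FALSE, FALSE, TRUE, FALSE))

cmp <- compare_datasets(list(intro = intro, stats = stats))
cmp$dims

# Variables that are not present in every version
not_shared <- rownames(cmp$presence)[rowSums(cmp$presence) < ncol(cmp$presence)]
not_shared
```

Running the same helper on the covid files would surface the variable-naming differences in the same way.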

While covid looks better in terms of the number of rows and columns, there are some discrepancies in variable naming.

[Screenshot: comparison of covid dataset variable names]

Recommendation

  1. Re-think which datasets are actually needed (i.e. we should substantially reduce the number)
  2. Go back and create master datasets with necessary variables across all courses reproducibly from epirhandbook versions of linelist_raw.xlsx and linelist_cleaned.rds
  3. Create subsets of those for specific courses as necessary. As described in {appliedepidata}, this needs to be reproducible in a parent>child format.
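To make the parent>child idea in the steps above concrete, here is a minimal base R sketch. It does not reflect how {appliedepidata} actually works: `make_master`, `derive_course_data`, and the toy `raw` data are all hypothetical, and the variable lists per course are invented for illustration.

```r
# Hypothetical parent > child derivation: the master is built once from the
# raw data, and every course dataset is derived from the master by code.
make_master <- function(raw) {
  # Harmonise naming so all courses see `sex` rather than `gender`
  names(raw)[names(raw) == "gender"] <- "sex"
  raw
}

derive_course_data <- function(master, keep_vars) {
  # Fail loudly if a course asks for a variable the master does not have
  missing_vars <- setdiff(keep_vars, names(master))
  if (length(missing_vars) > 0) {
    stop("Variables not in master: ", paste(missing_vars, collapse = ", "))
  }
  master[, keep_vars, drop = FALSE]
}

# Toy raw data (invented values standing in for the real linelist)
raw <- data.frame(case_id  = 1:4,
                  gender   = c("f", "m", "m", "f"),
                  bleeding = c(FALSE, TRUE, FALSE, FALSE))

master      <- make_master(raw)
intro_child <- derive_course_data(master, c("case_id", "sex"))
stats_child <- derive_course_data(master, c("case_id", "sex", "bleeding"))
```

Because each child is produced by a function call on the master, re-running the script after a change to the master regenerates every course version consistently.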

Timeline

Depending on decisions around course restructuring, the datasets could be revamped by the end of January 2025.
