Situation
We currently have many different versions of our datasets floating around various exercise and course_folder repos.
Background
Most of the datasets are derivations of the epirhandbook ebola datasets; however these new versions weren't created reproducibly and there are several names for mostly the same content.
Similarly, the covid datasets come from the Fulton course, but they were not simply moved over unchanged, and they have different names in different places.
Assessment
file location and naming
The main issues are summarised in this google sheet:
repo_data: all datasets across our GitHub.
course folder datasets: datasets called with import() in course folders (NB: might be ignoring Rmd file calls).
Course exercises datasets: datasets called with import() in course exercise repos (includes both R and Rmd file calls).
Pivot Table 2: summary of course exercises datasets to show what files are in which courses (see screenshot below).
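As a rough illustration of how datasets called with import() could be listed, here is a minimal base R sketch. It is not the actual search.r script; the helper name `find_import_calls`, the regular expression, and the set of file extensions are all assumptions made for this example.

```r
# Hypothetical helper (not the real search.r): list data files named inside
# import() calls in a set of R / Rmd scripts.
find_import_calls <- function(paths) {
  # Assumption: the dataset appears as a quoted string ending in a known
  # data-file extension, on the same line as the import( call.
  pattern <- "[\"']([^\"']+\\.(rds|xlsx|csv))[\"']"
  results <- lapply(paths, function(p) {
    lines <- readLines(p, warn = FALSE)
    hits <- grep("import\\(", lines, value = TRUE)
    matches <- regmatches(hits, gregexpr(pattern, hits, perl = TRUE))
    files <- gsub("[\"']", "", unlist(matches))
    if (length(files) == 0) return(NULL)
    data.frame(script = p, dataset = basename(files), stringsAsFactors = FALSE)
  })
  # Scripts without any match contribute NULL and are dropped by rbind
  do.call(rbind, results)
}

# Small demo on a temporary script (made-up content)
tmp <- tempfile(fileext = ".R")
writeLines(
  c("library(rio)",
    "linelist <- import(here(\"data\", \"linelist_cleaned.rds\"))",
    "x <- 1"),
  tmp
)
found <- find_import_calls(tmp)
print(found)
```

In a real repo the `paths` argument could come from `list.files(repo_dir, pattern = "\\.(R|r|Rmd|rmd)$", recursive = TRUE, full.names = TRUE)`, where `repo_dir` is a placeholder for a local clone. Multi-line import() calls would be missed by this line-based approach.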
file content
Scripts for comparing datasets (ebola and covid) can be found in the ae-ideas::data_compare repo. search.r produces the Excel lists above, and compare.r analyses file dimensions and content.
For example, looking only at ebola-related linelists, we see below that the number of rows and columns differs widely across datasets.
Then, looking at the variables present in linelist_combined_20141201.rds only, we see that the intro version uses gender whereas stats and rmd use sex (we should use the latter), and the stats course includes extra variables for "bleeding" and "healthcare worker".
While covid looks better in terms of the number of rows and columns, there are some discrepancies in variable naming.
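The kind of comparison described above can be sketched in base R as follows. This is not the actual compare.r script: the three data frames are made-up toy stand-ins for the intro, stats and rmd versions, and in practice each would be read from file (for example with `readRDS()` or `rio::import()`).

```r
# Toy stand-ins for course versions of the same linelist (made-up values)
versions <- list(
  intro = data.frame(case_id = 1:3, gender = c("f", "m", "f")),
  stats = data.frame(case_id = 1:4, sex = c("f", "m", "f", "m"),
                     bleeding = c("no", "yes", "no", "no")),
  rmd   = data.frame(case_id = 1:4, sex = c("f", "m", "f", "m"))
)

# Number of rows and columns per version
dims <- data.frame(
  version = names(versions),
  n_rows  = vapply(versions, nrow, integer(1)),
  n_cols  = vapply(versions, ncol, integer(1)),
  row.names = NULL
)
print(dims)

# Presence/absence matrix: one row per variable name, one column per version
all_vars <- sort(unique(unlist(lapply(versions, names))))
presence <- vapply(versions,
                   function(d) all_vars %in% names(d),
                   logical(length(all_vars)))
rownames(presence) <- all_vars
print(presence)

# Variables that are not shared by every version
inconsistent <- all_vars[rowSums(presence) < length(versions)]
print(inconsistent)
```

A presence/absence matrix makes naming clashes such as gender versus sex easy to spot, because each appears in only some of the columns.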
Recommendation
Re-think which datasets are actually needed (i.e. we should substantially reduce the number)
Go back and create master datasets with necessary variables across all courses reproducibly from epirhandbook versions of linelist_raw.xlsx and linelist_cleaned.rds
Create subsets of those for specific courses, as necessary, as described in {appliedepidata}. This also needs to be reproducible, in a parent>child format.
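One way to picture the parent>child idea is sketched below in base R. This is only an illustration under stated assumptions, not the {appliedepidata} workflow itself: the `master` data frame, the `child_specs` list and the `make_child` helper are hypothetical, and a real master dataset would be built by script from the epirhandbook files rather than typed in.

```r
# Hypothetical master ("parent") dataset with made-up values
master <- data.frame(
  case_id           = 1:4,
  sex               = c("f", "m", "f", "m"),
  bleeding          = c("no", "yes", "no", "no"),
  healthcare_worker = c("no", "no", "yes", "no")
)

# Each child is defined only by the parent columns it keeps (hypothetical spec)
child_specs <- list(
  intro = c("case_id", "sex"),
  stats = c("case_id", "sex", "bleeding", "healthcare_worker")
)

# Derive a child from the parent, failing loudly if a requested variable
# does not exist in the parent
make_child <- function(parent, vars) {
  missing_vars <- setdiff(vars, names(parent))
  if (length(missing_vars) > 0) {
    stop("Not in parent: ", paste(missing_vars, collapse = ", "))
  }
  parent[, vars, drop = FALSE]
}

children <- lapply(child_specs, make_child, parent = master)
print(names(children$intro))
```

Because every child is regenerated from the same parent by code, variable names cannot drift between courses, and a request for a variable that only exists under another name (such as gender) stops with an error instead of silently creating a new version.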
Timeline
Depending on decisions around course restructuring, the datasets could be revamped by the end of January 2025.