# Import Process
The Dengue Fever data is imported from the Thai MOPH every two weeks; the ministry provides Access and CSV files that begin the process. The steps below accomplish this. Note that the final import deliverable is the `unique_case_data` table in the Dengue-Cases-Database.
- Transfer the Access and CSV files to idmodeling1.jhsph.edu
- Transfer the Access file from idmodeling1.jhsph.edu to sphhs-bioepi01.umass.edu
- Extract all of the Access table's fields to a CSV file
- Update `$DATA_DIR/dengue_case_metadata.txt`
- Add a corresponding R file to `$CODE_DIR/file_specific_code`
- Run `make test-input`
- Run `make import`
- Run `make test-output`
Every two weeks the Thai MOPH transfers via SFTP an Access file and a CSV file (we ignore the CSV) for that period to `idmodeling1.jhsph.edu:/home/sasithan/incoming`, and then emails project members to announce the new files. The Access file format used for delivery has changed over time but is now `.mdb` files named like `wk22_2017.mdb`. The Access file contains one table whose columns have varied in the past, in both column name and which columns are included. The import code normalizes these later in the process.
UMass then transfers the Access file to `sphhs-bioepi01.umass.edu/$DATA_DIR/raw/rolling`. (File names were normalized somewhat in the past, but are now left as-is.) Connections from idmodeling1 to the sphhs-bioepi01 server go through port 6393.
UMass manually runs commands from the MDB Tools project to extract all of the Access table's fields to a CSV file located at `$DATA_DIR/original_data`. We extract the CSV file ourselves rather than relying on the one provided by the MOPH because we want to ensure we are extracting all fields using their native names. There are two steps:
1. List all tables in the Access database file (there should be only one):

   ```shell
   $ mdb-tables final_2016.mdb
   ```

2. Use the table name from step 1 to export the table, e.g.:

   ```shell
   $ mdb-export final_2016.mdb final_2016 > ../../original_data/dengue_cases_2016_final.csv
   ```
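After exporting, it can help to sanity-check the new CSV's header before recording the file in the metadata table. This is a minimal Python sketch, not part of the import code; the sample data and the `read_header` helper are illustrative:

```python
# Minimal sketch (not part of the import code): read the header row of an
# extracted CSV so it can be eyeballed before updating the metadata file.
import csv
import io

def read_header(f):
    """Return the column names from the first row of a CSV file object."""
    return next(csv.reader(f))

# Invented stand-in for a real file under $DATA_DIR/original_data.
sample = io.StringIO("DISEASE,SEX,Year,Month,Day\n66,1,2017,5,30\n")
header = read_header(sample)
print(header)  # the native column names, exactly as mdb-export wrote them
```

Checking the header at this point catches schema changes before they surface later in the pipeline.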
(See the note in Servers-Databases-And-Code-Repos#data_dir-variable regarding the `$DATA_DIR` variable.)
UMass manually maintains the file `$DATA_DIR/dengue_case_metadata.txt`, which tracks information about each CSV file we extract. It captures which files from the `original_data` folder to read, and also stores some metadata about each file that is used in later processing. Tracking this is important partly because the set of included columns has changed over time. The file contains these columns:
- `link_name`: The file in `$DATA_DIR/original_data`.
- `date_delivered`: When we obtained the file.
- `week`: Which week of the year the data concerns (this can be extracted from the file name).
- `data_era`: I am not sure what this field is for. At the moment we are in era 3; it should be in the old documentation.
- `data_format`: The format the dates were exported in, e.g., 'mdy'.
- `file_name`: The name of the original file in `$DATA_DIR/raw/rolling` (this usually differs from the `link_name` because we have a naming scheme in place).
- `order`: A unique ID for files, based on the order in which we received them.
- `delivery_year`: The year the file reports cases in. This makes it easy to extract all files reporting cases for a specific year.
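To make the columns concrete, here is a hedged Python sketch that reads a metadata file shaped as described above and selects the files for one delivery year. It assumes a tab-delimited file with a header row (check the real file's delimiter), and the sample rows are invented, not real deliveries:

```python
# Hedged sketch: parse a metadata file with the columns described above and
# select files by delivery_year. Assumes tab-delimited with a header row;
# the sample rows are invented.
import csv
import io

sample = io.StringIO(
    "link_name\tdate_delivered\tweek\tdata_era\tdata_format\tfile_name\torder\tdelivery_year\n"
    "dengue_cases_2017_wk_18.csv\t2017-05-12\t18\t3\tmdy\twk18_2017.mdb\t101\t2017\n"
    "dengue_cases_2016_final.csv\t2017-01-20\t52\t3\tmdy\tfinal_2016.mdb\t95\t2016\n"
)
rows = list(csv.DictReader(sample, delimiter="\t"))
files_2017 = [r["link_name"] for r in rows if r["delivery_year"] == "2017"]
print(files_2017)
```

This is the kind of lookup the `delivery_year` column exists to support.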
Note: Over time the MOPH has added additional data, resulting in different columns being available depending on the year. Overall, there are two major sets of data available: count-only data from 1968 to 2005, and case details available from 1999 onward. The former is contained in the file `$DATA_DIR/original_data/dengue_counts_1968_to_2005.csv` and contains per-province counts, one month per row. The more recent files contain actual cases, one case per row. Again, which columns are provided has varied over time. For example, `dengue_cases_1999.csv` through `dengue_cases_2001.csv` contain these columns:

```
SEX,AGE,RPDT,ILDT,disease,province,opd,status,hospital,year,ageyrmo,hospitalgr,metropol
```

But the most recent files contain:

```
DISEASE,SEX,Year,Month,Day,MARIETAL,RACE,OCCUPAT,ADDRCODE,PROVINCE,METROPOL,HOSPITAL,TYPE,RESULT,DATESICK,DATEDEFINE,DATEDEATH,DATERECORD,DATEREACH,ORGANISM,COMPLICA,icd10,PeriodSick_define,PeriodSick_death
```
The import code normalizes these later in the process.
Lastly, there are annual 'final' files that contain all the cases for each year, e.g., `dengue_cases_2014_final.csv`, with these columns:

```
"DISEASE","SEX","YEAR","MONTH","DAY","MARIETAL","RACE","OCCUPAT","ADDRCODE","PROVINCE","METROPOL","HOSPITAL","TYPE","RESULT","DATESICK","DATEDEFINE","DATEDEATH","DATERECORD","DATEREACH","ORGANISM","COMPLICA","icd10"
```
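Because each header implies which schema a file uses, telling the eras apart can be sketched by testing for marker columns. The column names below are copied from the examples on this page; the era labels and the `guess_schema` function are illustrative, not part of the import code:

```python
# Illustrative sketch: guess a file's schema from its header using marker
# columns taken from the example headers on this page.
EARLY_MARKERS = {"SEX", "AGE", "RPDT", "ILDT", "disease", "province"}
RECENT_MARKERS = {"DISEASE", "DATESICK", "DATEDEFINE", "DATERECORD"}

def guess_schema(columns):
    """Classify a header as 'early', 'recent', or 'unknown'."""
    cols = set(columns)
    if RECENT_MARKERS <= cols:
        return "recent"
    if EARLY_MARKERS <= cols:
        return "early"
    return "unknown"

print(guess_schema(["DISEASE", "SEX", "DATESICK", "DATEDEFINE", "DATERECORD", "icd10"]))
```

The real normalization is driven by the per-file R mappings described next rather than by automatic detection.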
To insulate the code from column differences, we maintain an R file for each CSV file that was extracted. These files are located under the `$CODE_DIR/file_specific_code` directory in the locally-cloned repo, and are named like `dengue_cases_2017_wk_18.R`. Here are typical contents:
```r
source('standard_rolling_dump_mapping.R')

column_map <- c(column_map, list(
  date_sick_mdyhms   = 'datesick',
  date_define_mdyhms = 'datedefine',
  date_death_mdyhms  = 'datedeath',
  date_record_mdyhms = 'daterecord',
  date_reach_mdyhms  = 'datereach'
))
```
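The actual import code is R; purely to illustrate how such a mapping is applied, here is a Python sketch that renames a row's native columns to the canonical names (the sample row is invented):

```python
# Python illustration only; the real import code is R. Apply a column map
# (canonical name -> native name) to one row of an extracted CSV.
column_map = {
    "date_sick_mdyhms": "datesick",
    "date_define_mdyhms": "datedefine",
}
row = {"datesick": "5/30/2017 0:00:00", "datedefine": "6/1/2017 0:00:00"}
normalized = {canonical: row[native] for canonical, native in column_map.items()}
print(normalized)
```

Because each CSV gets its own mapping file, downstream code only ever sees the canonical names on the left-hand side.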
To test the data directory for consistency, run this Makefile target:

```shell
$ make test-input
```
Automated-Tests provides details on what input tests are run.
To actually import the data, do:

```shell
$ make import
```
The details of this step are covered in Make-Import-Details.
To consistency-check the resulting database, run:

```shell
$ make test-output
```
See Automated-Tests for details on what output tests are run.