# Import Process
The Dengue Fever data is imported from the Thai MOPH every two weeks; the ministry provides Access and CSV files that begin the process. The steps below accomplish this. Note that the final import deliverable is the `unique_case_data` table in the Dengue-Cases-Database.
- Transfer the Access and CSV files to idmodeling1.jhsph.edu
- Transfer the Access file from idmodeling1.jhsph.edu to sphhs-bioepi01.umass.edu
- Extract all of the Access table's fields to a CSV file
- Update `$DATA_DIR/dengue_case_metadata.txt`
- Add a corresponding R file to `$CODE_DIR/file_specific_code`
- Run `make test-input`
- Run `make import`
- Run `make test-output`
Every two weeks the Thai MOPH transfers via SFTP an Access file and a CSV file (we ignore the CSV) for that period to `idmodeling1.jhsph.edu:/home/sasithan/incoming`, and then emails project members to announce the new files. The Access file format used for delivery has changed over time but is now `.mdb` files named like `wk22_2017.mdb`. The Access file contains one table whose columns have varied in the past, in both column name and which columns are included. The import code normalizes these later in the process.
UMass then transfers the Access file to `sphhs-bioepi01.umass.edu/$DATA_DIR/raw/rolling`. (File names were normalized somewhat in the past, but are now left as-is.) Connections from idmodeling1 to the sphhs-bioepi01 server go through port 6393.
UMass manually runs commands from the MDB Tools project to extract all of the Access table's fields to a CSV file located at `$DATA_DIR/original_data`. We extract the CSV file ourselves rather than relying on the one provided by the MOPH because we want to ensure we are extracting all fields using their native names. There are two steps:
1. List all tables in the Access database file (there should be only one):

   ```shell
   $ mdb-tables final_2016.mdb
   ```

2. Use the table name from step 1 to export the table, e.g.:

   ```shell
   $ mdb-export final_2016.mdb final_2016 > ../../original_data/dengue_cases_2016_final.csv
   ```
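After exporting, it can help to sanity-check the new CSV's header before recording the file in the metadata table. This is a minimal Python sketch, not part of the import code; the sample data and the `read_header` helper are illustrative:

```python
# Minimal sketch (not part of the import code): read the header row of an
# extracted CSV so it can be eyeballed before updating the metadata file.
import csv
import io

def read_header(f):
    """Return the column names from the first row of a CSV file object."""
    return next(csv.reader(f))

# Invented stand-in for a real file under $DATA_DIR/original_data.
sample = io.StringIO("DISEASE,SEX,Year,Month,Day\n66,1,2017,5,30\n")
header = read_header(sample)
print(header)  # the native column names, exactly as mdb-export wrote them
```

Checking the header at this point catches schema changes before they surface later in the pipeline.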
(See the note in Servers-Databases-And-Code-Repos#data_dir-variable regarding the `$DATA_DIR` variable.)
UMass manually maintains the file `$DATA_DIR/dengue_case_metadata.txt`, which tracks information about each CSV file we extract. It captures which files from the `original_data` folder to read, and also stores some metadata about each file that is used in later processing. Tracking this is important partly because the set of included columns has changed over time. The file contains these columns:
- `link_name`: The file in `$DATA_DIR/original_data`.
- `date_delivered`: When we obtained the file.
- `week`: Which week of the year the data concerns (this can be extracted from the file name).
- `data_era`: I am not sure what this field is for. At the moment we are in era 3; it should be in the old documentation.
- `data_format`: The format the dates were exported in, e.g., 'mdy'.
- `file_name`: The name of the original file in `$DATA_DIR/raw/rolling` (this usually differs from the `link_name` because we have a naming scheme in place).
- `order`: A unique ID for files, based on the order in which we received them.
- `delivery_year`: The year the file reports cases in. This makes it easy to extract all files reporting cases for a specific year.
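To make the columns concrete, here is a hedged Python sketch that reads a metadata file shaped as described above and selects the files for one delivery year. It assumes a tab-delimited file with a header row (check the real file's delimiter), and the sample rows are invented, not real deliveries:

```python
# Hedged sketch: parse a metadata file with the columns described above and
# select files by delivery_year. Assumes tab-delimited with a header row;
# the sample rows are invented.
import csv
import io

sample = io.StringIO(
    "link_name\tdate_delivered\tweek\tdata_era\tdata_format\tfile_name\torder\tdelivery_year\n"
    "dengue_cases_2017_wk_18.csv\t2017-05-12\t18\t3\tmdy\twk18_2017.mdb\t101\t2017\n"
    "dengue_cases_2016_final.csv\t2017-01-20\t52\t3\tmdy\tfinal_2016.mdb\t95\t2016\n"
)
rows = list(csv.DictReader(sample, delimiter="\t"))
files_2017 = [r["link_name"] for r in rows if r["delivery_year"] == "2017"]
print(files_2017)
```

This is the kind of lookup the `delivery_year` column exists to support.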
Note: Over time the MOPH has added additional data, resulting in different columns being available depending on the year. Overall, there are two major sets of data available: count-only data from 1968 to 2005, and case details available from 1999 onward. The former is contained in the file `$DATA_DIR/original_data/dengue_counts_1968_to_2005.csv` and contains per-province counts, one month per row. The more recent files contain actual cases, one case per row. Again, which columns are provided has varied over time. For example, `dengue_cases_1999.csv` through `dengue_cases_2001.csv` contain these columns:

```
SEX,AGE,RPDT,ILDT,disease,province,opd,status,hospital,year,ageyrmo,hospitalgr,metropol
```

But the most recent files contain:

```
DISEASE,SEX,Year,Month,Day,MARIETAL,RACE,OCCUPAT,ADDRCODE,PROVINCE,METROPOL,HOSPITAL,TYPE,RESULT,DATESICK,DATEDEFINE,DATEDEATH,DATERECORD,DATEREACH,ORGANISM,COMPLICA,icd10,PeriodSick_define,PeriodSick_death
```
The import code normalizes these later in the process.
Lastly, there are annual 'final' files that contain all the cases for each year, e.g., `dengue_cases_2014_final.csv`, with these columns:

```
"DISEASE","SEX","YEAR","MONTH","DAY","MARIETAL","RACE","OCCUPAT","ADDRCODE","PROVINCE","METROPOL","HOSPITAL","TYPE","RESULT","DATESICK","DATEDEFINE","DATEDEATH","DATERECORD","DATEREACH","ORGANISM","COMPLICA","icd10"
```
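Because each header implies which schema a file uses, telling the eras apart can be sketched by testing for marker columns. The column names below are copied from the examples on this page; the era labels and the `guess_schema` function are illustrative, not part of the import code:

```python
# Illustrative sketch: guess a file's schema from its header using marker
# columns taken from the example headers on this page.
EARLY_MARKERS = {"SEX", "AGE", "RPDT", "ILDT", "disease", "province"}
RECENT_MARKERS = {"DISEASE", "DATESICK", "DATEDEFINE", "DATERECORD"}

def guess_schema(columns):
    """Classify a header as 'early', 'recent', or 'unknown'."""
    cols = set(columns)
    if RECENT_MARKERS <= cols:
        return "recent"
    if EARLY_MARKERS <= cols:
        return "early"
    return "unknown"

print(guess_schema(["DISEASE", "SEX", "DATESICK", "DATEDEFINE", "DATERECORD", "icd10"]))
```

The real normalization is driven by the per-file R mappings described next rather than by automatic detection.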
To insulate the code from column differences, we maintain an R file for each CSV file that was extracted. These files are located under the `$CODE_DIR/file_specific_code` directory in the locally-cloned repo, and are named like `dengue_cases_2017_wk_18.R`. Here are typical contents:
```r
source('standard_rolling_dump_mapping.R')

column_map <- c(column_map, list(
  date_sick_mdyhms   = 'datesick',
  date_define_mdyhms = 'datedefine',
  date_death_mdyhms  = 'datedeath',
  date_record_mdyhms = 'daterecord',
  date_reach_mdyhms  = 'datereach'
))
```
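The actual import code is R; purely to illustrate how such a mapping is applied, here is a Python sketch that renames a row's native columns to the canonical names (the sample row is invented):

```python
# Python illustration only; the real import code is R. Apply a column map
# (canonical name -> native name) to one row of an extracted CSV.
column_map = {
    "date_sick_mdyhms": "datesick",
    "date_define_mdyhms": "datedefine",
}
row = {"datesick": "5/30/2017 0:00:00", "datedefine": "6/1/2017 0:00:00"}
normalized = {canonical: row[native] for canonical, native in column_map.items()}
print(normalized)
```

Because each CSV gets its own mapping file, downstream code only ever sees the canonical names on the left-hand side.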
To test the data directory for consistency, run this Makefile target:

```shell
$ make test-input
```
Automated-Tests provides details on what input tests are run.
To actually import the data, do:

```shell
$ make import
```
The details of this step are covered in Make-Import-Details.
To consistency-check the resulting database, run:

```shell
$ make test-output
```
See Automated-Tests for details on what output tests are run.