Skip to content

etl_mapping_instructions

Richard Starr edited this page May 17, 2017 · 2 revisions

Mapping from data files into OMOP CDM using the ETL.Stage schema and procedures

  • Create omop database and load vocabulary.  

    • This can use the default OMOP database creation and vocabulary load ( and then run the update_exisiting_cdm_with_audit_ddl.sql ),

    • Or you can use my "create_new_omop_cdm+with_audit_ddl.sql" ad then load the vocabulary tables using the default OMOP data and load script.

    • The vocabularies are available from the ohdsi Athena website ( http://www.ohdsi.org/web/athena/ ). Note that some of the vocabularies require licenses.

  • Create ETL tables using the "create_etl_tables.sql" script.  

  • Load the source data files to the etl.stage* tables.  

    • The easiest way to do this for me has been to load the source data files into a new schema in the database in their original format.

    • Map from the source tables to the etl.stage* tables.  There are two scripts in the codes for this.  One is the "default_stage_insert.sql", which is just an empty insert statement for each of the tables.  

    • The other is the insert script for the mimic conversion, "example_stage_insert_mimic.sql".  This should give an example of the fields to include and where they go.  Some tables are mandatory due to constraints in the system.

  • Mapping the source tables to the ETL tables:

    • NOTE: the various code_source_type fields may be left null, but this may cause unwanted mappings. The process is more accurate if the coding systems used are known or can be determined.

    • Etl.stage_person. This table is used to create the individual patient records for the dataset. This should be unique patients.

      • person_source_value = this is the patient_id or other unique identifier for the patient.  This is the value that will link the patient with their records in the other files.

      • Gender - map from the gender_source_value into single character genders ( 'M', 'F', 'A', 'U', 'O' or null )

      • Dob/Dod ( date of birth/date of death ) - convert to date/timestamp format

      • gender_source_value - the unconverted gender value as represented in the source data file ( 'Male', 'male', 'M', 0 ).

      • Ethnicity - ethnicity source value should be converted to ( 'Hispanic' or null )

      • ethnicity_source_value = the unconverted ethnicity value as represented in the source data file.

      • Race - race source value should be converted to one of the following values ( 'White', 'Black', 'Asian', 'American Indian or Alaska Native', 'Native Hawaiian or Other Pacific Islander', 'Other' or null )

      • Address information fields ( address_1, address_2, city, state, zip, county )

      • Location_source_value – this should be a unique value to represent the address of the patient. It is used to create an entry in the location table. This source value is used to link to the patient’s home address. It is a varchar(50) size field, so something like substr( concat_wc( ‘_’, zip, address_1, address_2, city, state ), 1, 50 ) and then whenever you reference this address later in the code, this concatenated source value field is how it is linked. For the patient, this may be the only time it is used, but locations are also created for providers and care site locations.

      • Provider_source_value – this is the link to the provider record created in the table. Usually NPI or some substr(concat_wc…. ) created value that is used to link into the provider table.

      • Care_site_source _value – this is the link to the care_site for the patient. It is probably not applicable in this table, but could be for a default provider/clinic.

    • Etl.stage_provider. This is used to create the individual providers in the dataset.

      • provider_source_value – a unique identifier for the provider. Could be based on the NPI or name information. This should be a value that is accessible in the other tables ( such as visit or procedure or Rx ) as it is used to link that information to the correct provider. A varchar(50) column.

      • provider_name – varchar(50) column to hold provide name

      • NPI – national provider identifier. A 10 digit number to identify health providers.

      • DEA – the DEA registration number of a provider. Should be a alphanumeric value of 9 characters.

      • specialty_source_value – Should be the omop concept_code listed in the SNOMED Provider_specialty domain.

      • care_site_source_value - the unique identifier for the care_site

      • location_source_value – the unique identifier for the location of the provider. substr( concat_wc( ‘_’, zip, address_1, address_2, city, state ), 1, 50 )

      • gender – see etl.stage_person

      • year_of_birth – just the 4 digit year value of the provider’s DOB

      • address_1– see etl.stage_person

      • address_2– see etl.stage_person

      • city– see etl.stage_person

      • state– see etl.stage_person

      • zip– see etl.stage_person

      • county– see etl.stage_person

      • gender_source_value– see etl.stage_person

    • etl.stage_visit. This is used to create visit/encounter information. There could be multiple visits/day or everything could be combined into a single visit/day. Different data sources behave differently.

      • visit_source_value – a unique value for the visit ( visit_id or concat_ws(‘_’, claim_number, line_number ), or something unique to the dataset. It will be used to link the visit to the condition/procedure/etc records.

      • visit_type – one of the following values ( ‘IP’, ‘OUT’, ‘ER’, ‘LONGTERM’ ) to describe the type of visit. IP=inpatient ( hospital ), OUT=outpatient( doctor visit or other ), ER= Emergency room/department, LONGTERM= longterm care( nursing home, etc ).

      • visit_source_type_value – source of the data. One of the following values ( ‘CLAIM’, ‘EHR’, ‘STUDY’ )

      • visit_start_date,

      • visit_end_date,

      • total_charge,

      • total_cost,

      • total_paid,

      • paid_by_payer,

      • paid_by_patient,

      • paid_patient_copay,

      • paid_patient_coinsurance,

      • paid_patient_deductible,

      • paid_by_primary,

      • person_source_value – unique identifier for the patient

      • provider_source_value – unique identifier for the provider

      • care_site_source_value – unique identifier for the care_site

    • etl.stage_condition. Table used to create the diagnosis records. This table will also map other condition_type records ( ICD9/10 code can be observations, procedures or measurements also ).

      • condition_code_source_type – the vocabulary id for the coding system used. Should be in the list ( ‘SNOMED’, ‘ICD10CM’, ‘ICD9CM’, ‘HCPCS’, ‘MedDRA’ )

      • condition_source_value – the value of the code, without decimals. For example, 299.00 = diagnosis of autism and should be entered as ‘29900’, with a condition_code_source_type = ‘ICD9CM’

      • condition_source_type_value – the source for the value of the code. This is an OMOP concept id that represents the type of data source ( claim, ehr,etc ) and the relative position of the diagnosis code ( primary diagnosis, 1^st^ position, 2^nd^ position, etc ). The domain_id = ‘Type Concept’, vocabulary_id = ‘Condition Type’. For example, the condition type concept for the primary diagnosis would be 38000183 for an inpatient detail record. It would be 45756835 for the first position of a claims record. Usually there are multiple sequential position records for a distinct visit/line. The Type concept_ids are sequential also. See the mimic etl source for an example.

      • start_date,

      • end_date,

      • stop_reason – usually not populated

      • visit_source_value – unique identifier for visit

      • person_source_value – unique identifier for patient

      • provider_source_value – unique identifier for provider

    • etl.stage_lab. Used to create records for labs and measurements

      • measurement_source_type – the omop vocabulary_id for the coding system used for the lab records ( ‘LOINC’, ‘ICD10CM’, ‘HCPCS’, ‘CPT4’, ‘ICD9CM’, ‘SNOMED’ )

      • measurement_source_value – the value of the code. No decimal.

      • measurement_date – date and timestamp

      • operator_source_value – ‘<’, ‘=’ (default), etc should map to the concept table with domain id of ‘Meas Value Operator’

      • unit_source_value – unit of measurement should map to the concept table with domain_id of ‘Unit’

      • value_source_value – original source value of measurement

      • range_low,

      • range_high,

      • visit_source_value – unique identifier for visit

      • person_source_value –unique identifier for patient

      • provider_source_value – unique identifier for provider

    • etl.stage_procedure. Used to create procedure records.

      • procedure_code_source_type – omop vocabulary_id for the type of coding system used ( ‘ICD10CM’, ‘HCPCS’, ‘CPT4’, ‘ICD9CM’, ICD9Proc’, ‘SNOMED’ )

      • procedure_source_value – procedure code, without decimals.

      • procedure_source_type_value – see explanation fro the condition_source_type_value, but the vocabulary_id = ‘Procedure Type’

      • code_modifier – modifier to the code. Should be of concept_class_id like ‘%Modifier%’

      • procedure_date

      • quantity

      • stop_reason

      • total_charge,

      • total_cost,

      • total_paid,

      • paid_by_payer,

      • paid_by_patient,

      • paid_patient_copay,

      • paid_patient_coinsurance,

      • paid_patient_deductible,

      • paid_by_primary,

      • visit_source_value – unique identifier for visit

      • person_source_value –unique identifier for patient

      • provider_source_value – unique identifier for provider

    • etl.stage_rx. Used to create drug records.

      • drug_source_type – coding system used ( ‘NDC’, ‘RxNorm’, ‘SNOMED’, ‘VA Product’, ‘HCPCS’, ‘CPT4’ )

      • drug_source_value – code representing the drug

      • drug_start_date,

      • drug_end_date,

      • stop_reason,

      • refills,

      • quantity,

      • days_supply,

      • dose_unit_source_value – shoud be of domain_id = ‘Unit’

      • effective_drug_dose,

      • total_charge,

      • total_cost,

      • total_paid,

      • paid_by_payer,

      • paid_by_patient,

      • paid_patient_copay,

      • paid_patient_coinsurance,

      • paid_patient_deductible,

      • paid_by_primary,

      • paid_ingredient_cost,

      • pait_dispensing_fee,

      • route_source_value – should be of domain_id = ‘Route’

      • visit_source_value – unique identifier for visit

      • person_source_value –unique identifier for patient

      • provider_source_value – unique identifier for provider

  • After the data is loaded into the staging tables, It is ready to call the functions to load into the OMOP CDM tables:

    • Run the run_all_create_procedures.sh script to create the individual postgres functions.

    • Run the run_all_create_qa_procedures.sh to create the QA postgres functions.

    • Run the run_omop_load.sql. This will run the generic_create_* scripts in the proper order. The process logs to the etl.logmessage table, so you can track progress or if you need to debug failures.

    • Run the run_qa.sql script to create statistics to check how many records were mapped successfully into the concept tables. This is an indicator of how well data was mapped.

    • Run the commands in the analyze_QA_output.sql script manually to check the QA statistics.

    • Run the deleted_loaded_entries.sql script. This will removed unmapped entries in the ‘main” tables ( condition, procedure, rx, measurement ) that may have gotten successfully mapped into another domain. This prevents duplicate records and an excessive number of unmapped concepts.