Skip to content

Admin ETL Lite

Piotr Wargulak edited this page Sep 10, 2024 · 1 revision

ETL Lite

Table of contents

Introduction

The ETL Lite Module is used to extract data from relational databases, transform it using Velocity templates, and then load it into an OpenMRS system using the application provided Spring services.

This module can be found here.

Prerequisites

The following prerequisite skills are required to work with the ETL-Lite module.

Structured Query Language (SQL)

SQL is used to write queries to extract data from a relational database.

Velocity Templates

Velocity Templates are used in the following sections of ETL-Lite

  • Extract to configure and execute dynamic queries
  • Transform to transform the extracted data into a different data structure
  • Load to load the transformed data back into OpenMRS using Spring beans.

Cron Syntax

CRON Syntax is used to schedule certain mappings to be executed periodically.

For a tutorial visit: http://www.cronmaker.com

Architecture

The architecture of the ETL-Lite module is as depicted in the figure.

For the system to work, at least one database and one mapping needs to be configured.

ETL-Lite extracts data from a source database, transforms it and then loads it into the target system.

  • The source database can either be the database of the current installation or a supported external database
  • The target database is always the database associated with the current installation.

A mapping consists of the details of how to extract, transform and load a certain set of data.

Loading data into the target database is implementation specific and needs to use services deployed as Spring beans provided by that specific OpenMRS deployment.

The ETL-Lite module also performs certain mappings periodically.

The settings for databases are stored using the Config Service with all sensitive data like passwords encrypted.

Access Rights

The ETL-Lite module protects access to its functionality by means of two roles

  • ETL Settings Administrator
  • ETL Mappings Administrator

The ETL Settings Administrator role is required in order to create and setup databases and the ETL Mappings Administrator role is required to actually create mappings and work with the ETL-Lite process.

In OpenMRS, only the users with admin privileges have both roles and can access ETL functionalities.

Interface

To access the ETL module the user should be logged as an admin and click “ETL Lite” button on system administration page.

The ETL-Lite interface consists of two tabs, a Settings tab, where an administrator can create databases as well as configure services and a Mappings tab used to configure the actual ETL process.

Settings

The Settings tab is used to configure services as well as create databases. Click on "Add Database" to add a database source.

When you choose the type of the database, the JDBC URL and the Test Query fields will be auto-populated based on the chosen database.

Supported database types:

  • MYSQL
  • POSTGRESQL
  • MSSQL

Note: It is recommended to use a read only database user in ETL Settings.

Mappings

The Mappings tab is used to configure the ETL process. You need at least one database in order to be able to save a mapping. You can have multiple mappings for the same database source.

Click on "Add ETL Mapping" to add a new mapping.

For more information on how to create mappings, refer the Process page.

Mappings can be saved by clicking on the "Save" button. After saving the mapping, you will see a "Test" button to test your mapping right there on the ETL-Lite interface.

Only the extract and the transform parts of ETL can be tested in the interface.

The load part can be tested by querying the OpenMRS database using any database tool to see the inserted data and hence ETL-Lite does not provide any features for the same.

Process

Working with ETL mainly involves setting up at least one database source and one mapping.

The ETL process is best demonstrated with an example.

Requirement

Given there exists a table called patient with the following data.

Table: Patient

Fieldid first_name last_name age phone_number last_visit_time
1 John Doe 33 +48 666 777 666 2017-08-17 13:00:05
2 James Smith 12 2018-11-12 12:10:02
3 Ralph Kowalski 88 666777666
4 Don Don 31 900 900 900
5 Bob Bob +48 123 456 789
6 Alice Malice 0000000000 2019-02-03 14:18:15
7 Victoria Lure 44 2017-08-17 19:22:58
8 Diana Nowak 109 12345678 1999-01-03 03:11:29
9 Benny Bill 19
10 Linda Lay 72 666777666 1992-11-05 04:22:11

And a table: patient_address

id patient_id street city state_province country zip_code
1 1 14th Street New York New York USA 81818
2 1 Konwaliowa Gdynia Pomorskie Poland 81651
3 2 Dmowskiego Gdańsk Pomorskie Poland 80423
4 2 Lelewela Mlawa Mazowieckie Poland 06500
5 4 Długa Wąchock Świętokrzyskie Poland 27215
6 5 Kings Cross London NULL England 12345
7 8 NULL Moscow NULL Russia NULL
8 9 La Rambla Barcelona NULL Spain NULL
9 9 NULL Madrid NULL Spain NULL

We can transform those tables to get the following data (Only the patients with age less than 100 that are referenced in both tables along with their zip codes).

firstName lastName zipCode Id lastVisitDate age
John Doe 81818 81651 1 2017-08-17 33
James Smith 80423 06500 2 2018-11-12 12
Don Doe 27215 4 31
Benny Bill 9 19

Solution

We’ll now use the ETL-Lite module to setup a mapping to perform this transformation and perform the load.

Setup

In the Settings tab, click on "Add Database" and setup the source database to point to our OpenMRS database and click on "Save" to save the setting. Once saved, ensure that the test connectivity to the source database is successful by clicking on the "Test" button.

Extract

In the Mappings tab, click on "Add ETL Mapping" to create a new mapping.

Choose the source as the source created in the earlier step and provide the SQL in Query (Extract) to create the resulting table.

SELECT
	p.id id,
	p.first_name firstName,
	p.last_name lastName,
	p.age,
	p.last_visit_time lastVisitTime,
	pa.zip_code zipCode
FROM patient p
JOIN patient_address pa ON p.id = pa.patient_id
WHERE p.age < 100;

This SQL query chooses only patients with age less than 100 from the patient and patient_address table join.

The field Fetch Size is the parameter to retrieve the data in chunks and field Test Results Size is the maximum number of test results to be displayed on the interface. By default they are set with values 1000 and 10 respectively.

Click on "Save" to save the mapping. The ETL-Lite module should respond with a successful save message.

The "Test" button should now be visible. Since we only have the extract section filled-in, click on the "Test" button to only run the extraction query. The results should be displayed in-screen.

We are now ready to move to stage II of the ETL process, the Transform of this data.

Transform

The Transform section accepts a velocity template. The Transform template provides two in-built velocity script variables for use in the template.

$rows

The output of the extracted query is available in this variable as an array of maps.

$outs

This array is initially empty. The output of transformation goes into this variable again as an array of objects.

The simplest transform

In the same mapping we just created, try making the transform template contain the following expression:

$rows

Click on "Save" and click on the "Test" button. You will notice that there was no additional output as a result of transformation.

Even with the transform template as:

$outs

there is no visible additional output.

Attention

It’s important to realize that the output of the transformation step is not the output of the velocity template, but what gets added to the $outs array.

With the following transform template:

#foreach ($row in $rows)
	$outs.add($row)
#end

We finally see the output of both the extract and the transformation step!

Since we have done just a pass-through of the extracted data, we see that the results of both the extract and the transform steps are the same at this point in time.

Final Transformation

#set($patient_map = $util.newMap());
#foreach( $row in $rows )
	#if ($patient_map.containsKey($row.id))
		#set($patient =$patient_map.get($row.id))
	#else
		#set($patient = $util.newMap());
		
		$patient.put("id", $row.id);
		$patient.put("firstName", $row.firstName);
		$patient.put("lastName", $row.lastName);
		$patient.put("age", $row.age);
		
		$patient.put('lastVisitDate', $util.formatDate($row.lastVisitTime, "YYYY-MM-dd"));
	#end
	
	#if (!$patient.containsKey("zipCodes"))
		$patient.put("zipCodes", "");
	#end
	
	#if ($row.zipCode)
		$patient.put("zipCodes", $patient.get("zipCodes").concat(" ").concat($row.zipCode));
	#end
	
		$patient_map.put($patient.id, $patient);
#end

$outs.addAll($patient_map.values());

This transformation creates a new map, goes through each row of the table input from SQL, and puts it’s values into created mapping. We get the transformed output in the application.

Load

The Load template is also velocity template driven and has access to both the extracted data as $rows as well as the transformed data $outs.

Note: Since both the Transform template and the Load template are both velocity templates, it’s possible to completely eliminate the Transform step and perform transformation directly in the Load step, thus making the ETL a two step process. The three separate steps however provide greater clarity when designing a mapping.

Using any service beans injected in the Settings tab, we can load the transformed data into our system. Services are injected using the following syntax:

userService:com.janssen.connectforlife.service.UserService,
patientService:com.janssen.connectforlife.service.PatientService

Individual services are separated using a , and the service variable name for use in the velocity template is separated from the fully qualified class name using a : colon character.

A hypothetical Load template might however look like this:

#if ($outs.size() > 0)
	#foreach ($out in $outs)
		$userService.load($out)
	#end
#end

This load template loads all records from the transformation to application.

Troubleshooting

  1. Data fails to import. What can I do to check what went wrong?

    Sometimes there might be failures during data import to the OpenMRS. These failures will be logged into the table etl_error_logs along stacktrace of the exception and can be analysed. For this, the parameters sourceKey and sourceValue should be set in the Load template of the mapping. sourceKey would provide the key identifier for a record from the source and sourceValue would be the record value, which would help in identifying the failed records.

    For example, we can set sourceKey and sourceValue as below:

     #set ($sourceKey = "PatientId")
     #set ($sourceValue = $row.patientId)
    

    Here ‘PatientId’ is the column name, which is key identifier and ‘$row.patientId’ is the unique value for each record.

Advanced ETL

This section will explore more advanced features of the ETL-Lite module.

Velocity Template for Extract

Though it might not seem so, the extract query also supports velocity template syntax.

There are two ways to invoke the ETL process.

Event Driven On Demand Execution

Sometimes an application is required to import a single record from an external system on demand in real time. For these cases, the application can send an event using the event relay system with any parameters as required (say a user ID).

The Extract query can access these parameters as part of it’s provided $params script variable. It then has the flexibility to actual perform two different queries:

#if ($params.userId)
	select * from users where userid = :userId
#else
	select * from users
#end

The on-demand execution will run the first query where a single user’s record is extracted, transformed and then loaded into the system. If the $params variable is not present as would be the case for a scheduled execution, then the second query would be used.

Note: The individual params are also available as BIND variables to the SQL. This is provided as a security feature as otherwise the user would be forced to use a query like:

select * from users where userid = '$params.userId'

Note that this is highly prone to SQL injection attacks. Using BIND variables also improves performance, as the same query can be prepared by the database to be re-used for different values.

Scheduled execution

Each mapping also comes with a configurable CRON expression to run the mapping periodically. At the scheduled time, the entire ETL process of that mapping will be executed.

For this to be really effective, typically you would need one of two conditions to be fulfilled.

The Extract query is intelligent to pick only the new data in the source system depending on say the current time. This is typically done by using a where clause filter in the Extract query.

The Load template is intelligent enough to know when to insert or update data into the system, if an intelligent Extract is not possible.

Multiple mappings

Sometimes on-demand execution requires executing multiple mappings in order. The first mapping might load a user master record. The subsequent mappings might load entities related to the user record, which otherwise cannot be loaded without a user master record.

This can be achieved by passing a string of mappings to execute to the OpenMRS event and the ETL-Lite module will execute them in order.

Import Failure Logging in ETL

Sometimes there might be failures during data import from different sources to the OpenMRS. These failures will be logged into the table etl_error_logs along stacktrace of the exception and can be analysed. For this, the parameters sourceKey and sourceValue should be set in the Load template of the mapping. sourceKey would provide the key identifier for a record from the source and sourceValue would be the record value, which would help in identifying the failed records.

For example, we can set sourceKey and sourceValue as below:

#set ($sourceKey = "PatientId")
#set ($sourceValue = $row.patientId)

Here ‘PatientId’ is the column name, which is key identifier and ‘$row.patientId’ is the unique value for each record.

Reference

List of Variables Available in Velocity Templates

Variable Stage Description
$params ETL Any parameters supplied to the SQL query in the extract section
$rows TL Access to the extracted data as an array
$outs TL Transformation output. Empty array to start with in transformation template
$util ETL A wrapper for handy util methods

List of Helpers in $util

Name Description
$util.newObject(String className) Creates and returns a new object instance using Context from OpenMRS
$util.loadClass(String className) Loads class from OpenMRS Context
$util.formatDate(Date date, String format) Converts the Date object to the required String format
$util.toLong(String value) Converts String to Long
$util.stringToDate(String date, String format) Converts date in String format to Date
$util.getLocale(String language) Creates a locale object from language and country
$util.today() Returns the current date

Event to Invoke ETL Mapping

Event:

  • importData mappings - A list of String mappings to perform in order
  • params - All parameters are bundled into $params
Clone this wiki locally