Looking for the Best Tool for Handling Diverse/Multiple Data Sources for Report Building
Data sources for report building are extremely diverse. In theory, you may encounter any combination of data sources when developing reports: RDBs, NoSQL databases, CSV, Excel, JSON, Hadoop, and so on. This creates the diverse/multiple-source issue in report development. To address the set of problems these sources bring, this essay examines and evaluates the ways reporting tools handle diverse/multiple sources in the aspects listed below:
- The way of connecting to the data source;
- The way of handling (mixed) data coming from diverse/multiple sources;
- The capacity for handling diverse/multiple-source computations, particularly complex scenarios;
- The performance.
The diverse/multiple data sources can have the same or different structures, and different scenarios require different handling methods.
As for the ways of connecting to various data sources, *Looking for the Best Tool for Using Non-RDB Data Sources in Reporting Tools* offers a detailed discussion, so we skip them here.
Some databases provide routes for performing cross-database queries. Oracle's DBLink and MySQL's FEDERATED storage engine are among them; both help achieve mixed queries between databases of the same type. If the multiple data sources for report building are of the same type, the database's built-in route can be used to handle them.
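As a minimal sketch (the link name and table names are hypothetical, and the database link is assumed to have been created beforehand), a cross-database join through Oracle DBLink looks like this:

```sql
-- Assumes a database link created in advance, e.g.:
-- CREATE DATABASE LINK remotedb CONNECT TO scott IDENTIFIED BY pwd USING 'remote_tns';

-- Join a local table with a remote one via the @link suffix
SELECT o.OrderID, o.Amount, e.Name
FROM orders o
JOIN employee@remotedb e ON o.SellerId = e.EId;
```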
The built-in routes have disadvantages. The biggest is unreliable support for data sources of different types: both DBLink and FEDERATED perform unsatisfactorily when dealing with them. FEDERATED even requires same-structure tables to perform computations (it creates table definitions locally while the source data is stored in the remote databases). Another weakness is poor performance combined with heavy consumption of database resources. They also have other issues, including unstable data transmission, lack of support for operations on large objects, an error-prone and unstable computing process, and low scalability.
When data sources are CSV, Excel, JSON, or Spark, we can also use the corresponding computing engine to perform multi-source mixed computations. The Microsoft Text/XLS driver handles associative computations between Excel files and between Text, CSV, and TSV files (but does not support them between an Excel file and a CSV file). Spark SQL can deal with queries involving diverse data sources, too. Yet all these engines require that the data sources be of the same type, or that the to-be-associated sources have similar structures.
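A minimal Spark SQL sketch (file paths and field names are hypothetical) that joins a JSON source with a CSV source:

```sql
-- Register each external source as a temporary view
CREATE TEMPORARY VIEW orders
USING json OPTIONS (path 'orders.json');

CREATE TEMPORARY VIEW employee
USING csv OPTIONS (path 'Employee.csv', header 'true');

-- Mixed-source join in one query
SELECT o.OrderID, o.Client, e.Name
FROM orders o JOIN employee e ON o.SellerId = e.EId;
```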
A built-in route depends mainly on the database or computing engine's computing capacity and implements algorithms in SQL. Its ability to handle complicated computing scenarios is therefore determined by its degree of SQL support. Oracle offers the richest set of functions and thus the greatest computing ability; MySQL follows. The other computing engines have varying levels of ability, most of them weak.
These tools and engines have too many limitations in dealing with diverse/multiple report data sources, due to their requirement of same-type data sources and a series of other issues.
The popular strategy for handling diverse data sources is to convert the multiple sources into a single source: load data from all the sources into one RDB and perform queries in SQL. There are many ways of synchronizing data across sources, including writing a script, using an ETL tool, or using auto-sync software such as Hevo, Knowi, and Stitch.
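The synchronization step itself can be as simple as a scheduled bulk load. A minimal MySQL sketch (the file path and table name are hypothetical):

```sql
-- Bulk-load a periodically exported CSV into the consolidated reporting database
LOAD DATA INFILE '/data/exports/orders.csv'
INTO TABLE orders
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
```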
This indirect approach to the mixed-data-source problem is simple, yet it has noticeable weaknesses. It is costly. Apart from the possibly limited capacity of the target database for receiving imported data and the undesirable query performance, there is also the time-consuming data synchronization. The data range to be queried for a report may be rather large and spread across diverse sources; loading all the data into one database takes a long time and a lot of space, and even loading only the target data affects query speed. When some source data is dynamic rather than static, the cost of synchronization becomes exorbitant. Another issue is that, since data export and import are usually executed at fixed intervals, you cannot guarantee that queries see real-time data.
We build enterprise-level data warehouses using this approach on many occasions. Still, queries do not see real-time data, and the data sources' own strengths are wasted.
Neither cross-database query tools nor the approach of loading data into one source uses the reporting tool's own computing capability. Now let's look at two reporting tool features that handle data coming from multiple/diverse data sources.
Some reporting tools offer a Joint Data Set feature, which lets users join data sets to generate a new one. It is intended for data source combination. First, you create a data set based on each data source. Then Joint Data Set walks you through joining the data sets into a new one for performing mixed data source computations. Not all reporting tools support joining more than two data sets; in BIRT, you can only join two at a time, so when there are more data sources you must join them pairwise, creating N-1 intermediate data sets. This is difficult to maintain and leads to bad performance. The feature supports joins by a single field only (single-field equi-joins) and does not support equi-joins over two or more fields. A joint data set takes its parameters from the member data sets being joined, and they are view-only (you cannot modify or delete them); if same-named parameters exist, which is quite likely, collisions occur.
In a word, the feature has more limitations than functionalities, as well as poor performance. It can only manage simple mixed data source computing scenarios. Besides, not all reporting tools have this feature; it exhibits weak universality.
Some other reporting tools handle diverse/multiple data sources by associating cell expressions. You create a separate data set for each data source. Unlike the joint data set above, these data sets remain independent of each other in the data preparation phase. You then place them in different areas of the report template and associate them with the appropriate report function, such as dsDB.select(dsDBField, ID==A2), where A2 is the cell holding the other data set dsCSV, represented by the expression dsCSV.select(ID). The two data sets are associated via the ID field and displayed in two areas/columns of the resulting report.
No matter how many sources there are, you can create data sets for them separately in the data preparation phase, use them equally in the template, and join them through expressions. This approach is convenient to develop and maintain.
It has three demerits. First, it computes the data sets one by one because it cannot mix them. Suppose we have two data sources storing historical orders and current orders respectively, and we want to sum order amounts per customer across both; we cannot get this done with report expressions (though we can with the Joint Data Set). Second, its computing ability is limited. Reporting tools usually have rather limited computing ability to begin with, and it is further constrained by the fixed template layout; many computations simply cannot be expressed with cell expressions. Third, performance is bad. Data sets are associated by looping over them in the template, which takes quadratic time; when the data sets to be associated contain large volumes of data, performance is extremely disappointing.
esProc is an open-source, professional data processing engine. It provides interfaces for connecting to dozens of data sources and has powerful computing capacity based on its independent scripting language, SPL (Structured Process Language). This makes it convenient for handling mixed computations over diverse/multiple data sources. The tool offers standard JDBC and ODBC interfaces through which a reporting tool can call the result of an SPL script.
esProc SPL can be regarded as an enhanced version of Joint Data Set: it is not demanding about data source types and boasts more powerful capabilities for joins and subsequent computations. To associate a data set generated from MongoDB data with a CSV file, for instance, SPL uses the following script (cross.dfx):
| | A |
|---|---|
| 1 | =mongo_open("mongodb://127.0.0.1:27017/mongo") |
| 2 | =mongo_shell(A1,"Orders.find()").fetch() |
| 3 | =file("Employee.csv").import@tc() |
| 4 | =mongo_close(A1) |
| 5 | =join(A2,SellerId;A3,EId) |
| 6 | =A5.new(_1.OrderID,_1.Client,_2.Name,_2.Gender,_2.Dept) |
A3 gets data from the CSV file. If the data source is an RDB, we can rewrite A3 as follows:
=db.query@x("select * from Employee")
The code after that remains the same. esProc supplies a complete set of computing abilities independent of any data source; this is the tool's unique asset.
Now we create the esProc JDBC data source in the reporting tool and use it as if it were an ordinary database. Then we can call the SPL script from a report data set the same way we would access a stored procedure to achieve the join operation, as shown below:
call cross()
cross is the name of the above SPL script (cross.dfx).
esProc SPL can make the most of each data source's own computing strengths. An RDB, for instance, has great computing ability but slow I/O, so we can let it do more of the computing work before data retrieval. NoSQL, on the contrary, offers high I/O performance but weak computing ability, so we first retrieve the data and then perform computations in SPL. This is flexible to use. And handling data sources directly ensures the up-to-date data that report development requires.
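For example, with a hypothetical orders table in an RDB, pushing an aggregation into the database shrinks the data that crosses the wire before SPL takes over; the same aggregation over a NoSQL source would instead be performed in SPL after retrieval:

```sql
-- Push the heavy lifting into the RDB: retrieve pre-aggregated rows
-- instead of fetching the full orders table
SELECT Client, SUM(Amount) AS TotalAmount
FROM orders
GROUP BY Client;
```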
Through a series of techniques, including order-based computations, external-storage computing, and parallel processing, SPL can handle most big data computing scenarios and output the result set to the reporting tool fast.
To join diverse/multiple data sources with high performance, the tool adopts an efficient hash join algorithm (with multiple alternatives to choose from).
You can handle any mixed computation with esProc SPL as long as it supports the data sources involved. Below are some of the dozens of data sources esProc SPL supports:
- AliCloud
- Cassandra
- Multidimensional database
- ElasticSearch
- FTP
- HBase
- HDFS
- Hive
- InfluxDB
- Informix
- Kafka
- MongoDB
- Redis
- Salesforce
- SAP
- Spark
- Webcrawl
- WebService
- YModel
- Zip
Those are the ways reporting tools handle diverse/multiple data sources. Using databases' computing capacity is expensive: configuring and maintaining routes for cross-database queries, or loading data into one source, is difficult and of limited use. Reporting tools' own computing abilities apply only to simple scenarios and deliver unsatisfactory performance. By contrast, the open-source esProc SPL supports more types of data sources, offers simple ways of doing joins, possesses independent computing ability for achieving complicated computations, and handles big data excellently. It is the cheapest, fastest, and most efficient choice for dealing with diverse/multiple data sources during report building.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code