-
Notifications
You must be signed in to change notification settings - Fork 332
Looking for the Best Technique of Processing Retrieved WebService RESTful Data
Data formats of WebService/RESTful are the general, multilevel, structured JSON or XML. The multilevel format is more complicated and more difficult to handle after data is retrieved, than the conventional two-dimensional format. This essay compares five types of techniques for processing retrieved WebService/RESTful data. Among them, esProc SPL can effectively simplify JSON/XML computations by offering data object that supports multilevel data format and expressive syntax. Looking ${article} for details.
WebService/RESTful are widely used for application communication, including microservice, data exchange, public or proprietary data services, etc. The popularity lies on their use of the general structured text and support of multilevel data that can store sufficiently rich and general information. Issue is that the multilevel format is more complicated than the conventional two-dimensional format and hard to process after data is retrieved. Here we look at several common techniques for processing the WebService/RESTful data. Our focus is on the multilevel JSON computations, but other aspects, including data source interface and XML format, will also be involved.
High-level languages are very widely used. It is natural to use them for processing the retrieved WebService/RESTful data. Java offers class libraries JsonPath, fastjson and jackson to do this and C# provides counterparts Newtonsoft, MiniJSON and SimpleJson. Among them JsonPath has the most expressive syntax. Now we’ll take it as an example to make our illustrations.
A RESTful website returns employee information and corresponding orders in the format of multilevel JSON. Below is part of the original data:
|
\[{
"_id": {"$oid": "6074f6c7e85e8d46400dc4a7"},
"EId": 7,"State": "Illinois","Dept": "Sales","Name": "Alexis","Gender": "F","Salary": 9000,"Birthday": "1972-08-16",
"Orders": \[
{"OrderID": 70,"Client": "DSG","SellerId": 7,"Amount": 288,"OrderDate": "2009-09-30"},
{"OrderID": 131,"Client": "FOL","SellerId": 7,"Amount": 103.2,"OrderDate": "2009-12-10"}
\]
}
{
"_id": {"$oid": "6074f6c7e85e8d46400dc4a8"},
"EId": 8,"State": "California",
...
}\]
|
We are trying to use JsonPath to compute the above JSON strings in order to find all orders whose Amount is in the range of 1000-2000 and where Client contains “business”. Below is the core code:
String JsonStr=… // Skip the process of retrieving the JSON strings
Object document = Configuration.defaultConfiguration().jsonProvider().parse(JsonStr);
ArrayList l=JsonPath.read(document, "$\[*\].Orders\[?(@.Amount>1000 && @.Amount<2000 && @.Client =~ /.*?business.*?/i)\]");
In the above code, @.Amount>1000 && @.Amount<2000 is an interval query condition, and @.Client =~ /.?business.?/i is a fuzzy query condition. JsonPath has the advantages of generating short code and using SQL-like syntax to realize interval queries.
It has disadvantages. In detail, the syntax is not mature enough. It performs fuzzy queries using regular expressions instead of the easy-to-use functions (like SQL LIKE function). Overall, the class library has a weak computing ability with the support of only the simplest computations, such as conditional queries and aggregate operations. It does not support most of the common computations, including grouping & aggregation, joins and set-oriented operations. Yet JsonPath is the best among the high-level language class libraries. Others, such as Jackson and fastjson, are even weaker. It is suitable for doing the simple maintenance work, such as microservice client side. To handle data computations, even the general ones, you’d better use other techniques.
The reason behind high-level languages’ weak expressing abilities is that they do not have special data objects for describing the JSON-like multilevel structure. So, the languages cannot build professional syntax and corresponding functions.
In terms of retrieval interface, JsonPath does not have one. It uses a third-party interface to retrieve data or hardcodes the retrieval. There are many such third-party interfaces. Some are mature but heavy, such as Spring restTemplat and Apache httpclient. Some generate simple yet instable code, such as JourWon httpclientutil and Arronlong httpclientutil. Here we are trying to use Arronlong httpclientutil to retrieve data from RESTful with the following code:
String path= "http://127.0.0.1:6868/api/emp_orders";
String JsonStr= com.arronlong.httpclientutil.HttpClientUtil.get(com.arronlong.httpclientutil.common.HttpConfig.custom().url(path));
All these third-party class libraries package JDK HttpURLConnection class on the low level. The above code is equivalent to the following hardcode:
String path = "http://127.0.0.1:6868/api/emp_orders";
URL url = new URL(path);
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestMethod("GET");
conn.setConnectTimeout(5000);
StringBuilder builder = new StringBuilder();
if (conn.getResponseCode() == 200) {
System.out.println("connect ok!");
InputStream in = conn.getInputStream();
InputStreamReader isr = new InputStreamReader(in);
BufferedReader br = new BufferedReader(isr);
String line;
while ((line = br.readLine()) != null) {
builder.append(line);
}
br.close();
isr.close();
System.out.println("below is content from webservice");
System.out.println(builder);
} else {
System.out.println("connect failed!");
}
String JsonStr=builer.toString();
JsonPath’s data format support is weak, too. The library and the other similar class libraries listed above only support JSON format. They do not support XML format.
Relational databases have mature syntax and a great wealth of functions for structured data computations. Many users will convert multilevel data into structured data (two-dimensional structure) and use SQL’s computing ability to handle WebService/RESTful data.
There are two solutions of implementing that. One retrieves JSON strings from WebService/RESTful and creates a table containing a JSON type field in the database in the same statement; inserts JSON strings into the table using INSERT statement; and finally, queries the table through a SQL statement that uses JSON functions.
To retrieve JSON from RESTful in Java and perform a conditional query using SQLite, for instance, we have the following code:
String JsonStr=… // Skip the process of retrieving JSON strings
Connection connection = DriverManager.getConnection("jdbc:sqlite: d:/ex1");
Statement statement = connection.createStatement();
statement.execute("create table datatable ( path string , data json1)");
String sql="insert into datatable values('1', json('"+JsonStr +"'))";
statement.execute(sql);
sql="select value from(" + "select value" + "from datatable, json_tree(datatable.data,'$')" + "where type ='object'and parent!=0" + ")where json\_extract( value,'$.Amount') >1000 and json\_extract(value,'$.Amount') <2000 and json_extract(value,'$.Client') like'%business%'";
ResultSet results = statement.executeQuery(sql);
printResult(results);
if(connection != null) connection.close();
This block of code relies on JSON function json_extract (similar functions include json_tree) to achieve the conditional query though it uses SQL’s computing sources.
We can also perform grouping & aggregation in SQL in this way:
Select strftime(‘%Y’,Orderdate),sum(Amount) from(
select json\_extract(value,'$.OrderDate')OrderDate,json\_extract(value,'$.Amount')Amount from datatable, json_tree( datatable.data, '$') where type = 'object' and parent!=0)group by strftime('%Y',Orderdate)
Or perform a join, say, between employee table and orders table, in SQL, in the same way:
with base as (
select value,id,parent,type
from datatable, json_tree(datatable.data, '$')
),emp_orders as(
select orders.value o,emp.value e from base ordersArr,base orders,base emp
where ordersArr.parent=emp.id and orders.parent=ordersArr.id and emp.parent=0 and emp.type='object'
)select json_extract( o,'$.OrderID'),json_extract( o,'$.Client'),json_extract(o,'$.Amount'),json_extract(o,'$.OrderDate'), json_extract(e,'$.Name'), json_extract(e,'$.Gender'),json_extract( e,'$.Dept')
from emp_orders
The solution is light-weight and real-time, and particularly suitable for handling scenarios where the size of source data is small, historical data is not involved, and data structure is dynamic. There are also demerits. The code is lengthy and intelligible, which is related to the complexity, especially the number of levels, of JSON strings. It uses too many JSON functions, which compromises SQL’s common computing ability. JSON functions have special uses and hard to master. The statement “select…from table name, function where…” is different from common SQL statements and hard to understand. A join query is also difficult to understand thanks to the rather long code and too complicated relationship between tables. Besides, some old version databases do not support JSON functions, and some databases, such as Oracle, provide quite different uses for JSON functions from SQLite.
SQL data objects are two-dimensional structures and do not have direct support for multilevel data. Simply trying to compute multilevel data using the two-dimensional structure is sure to cause problems. That is why the above code is so long and hard to understand.
Then we have the second solution. Still, it uses a high-level language or an ETL tool to retrieve JSON strings from WebService/RESTful; splits JSON strings into multiple two-dimensional tables and writes them to corresponding database tables; and then computes the database tables using general SQL statement that do not include any JSON functions. Generally, the ETL tool is informatica, datastage or kettle, the high-level language is Java or C#, and the SQL-based database is one of the commonly-seen.
The solution is heavy and slow response, and suitable for handling scenarios where the source data size is large, data is appended at regular time, and data structure is stable. The biggest advantage is that it does not need any JSON functions and thus can made the most use of the common SQL abilities. What’s more, the degree of SQL complexity has nothing to do with the that of JSON string complexity.
In the aspect of retrieval interface, ETL tools get high scores as most of them support retrieving data from WebService/RESTful. Java, C# and other high-level languages generate complicated code and hard to learn since they need hardcoding or a third-party class library.
Let’s move on to XML format support performance. The first solution requires storing XML in the database, but SQLite does not support XML format while Oracle and MSSQL databases do. Apart from this, Oracle XML functions and MSSQL XML functions are mutually exclusive. It shows poor and messy performance. In the second solution, most ETL tools support XML format and high-level languages gives bad support since they need to resort to hardcoding.
Python boasts a lot of excellent third-party class libraries, among which requests is for accessing HTTP, numpy for mathematical statistics, and most importantly, Pandas for computing structured data. Pandas support various data sources, including JSON format data. The cooperation of these third-party class libraries is able to manipulate data retrieved from WebService/RESTful.
To retrieve JSON strings from RESTful and perform a conditional query, for instance, we have the following code:
import requests
import numpy as np
import pandas as pd
from pandas import json_normalize
resp=requests.get(url="http://127.0.0.1:6868/api/emp_orders")
JsonOBJ=resp.json()
df=json\_normalize(JsonOBJ, record\_path=\['Orders'\])
#dataframe cannot identify date type data
df\['OrderDate'\]=pd.to_datetime(df\['OrderDate'\])
result=df.query('Amount>1000 and Amount<2000 and contains("business")')
In the above code, the third-party class library requests accesses URL and converts strings into JSON objects and Pandas dataframe object performs the conditional query.
Similarly, we can introduce numpy class library to implement grouping & aggregation:
result=df.groupby(dfu\['OrderDate'\].dt.year)\['Amount'\].agg(\[len,np.sum\])
And perform a join, say between employee table and orders table:
df=json\_normalize(JsonOBJ,record\_path=\['Orders'\],meta=\['Name','Gender','Dept'\])
result=df\[\['Name','Gender','Dept','OrderID','Client','SellerId','Amount','OrderDate'\]\]
Python has the syntactic ability to produce succinct code and compute structured data excellently. Yet on the other hand, dataframe is a two-dimensional data object that does not support data retrieval by level and multilevel data computations. Users need to convert the multilevel data into two-dimensional data using the json_normalize function. The conversion process is equivalent to that where an ETL tool or a high-level language parses JSON strings into multiple two-dimensional tables. When there are too many levels, the conversion could be more complicated than the subsequent computation. Python has another shortcoming. It cannot use its own class libraries to process the retrieved WebService/RESTful data, instead it relies on multiple third-party class libraries. The issue is that the third-party class libraries do not belong to the same team and the use of them brings potential compatibility and stability risks.
Pandas does not support XML format, and even worse, it does not have a json_normalize counterpart function to convert multilevel XML into two-dimensional dataframe. Developers have to hardcode the conversion in a rather complicated way.
Spark is the most import Scala class library. Besides acting as the big data frame, the library can be independently used for processing retrieved WebService/RESTful data. Its general practice is to read in JSON/XML from the data source, convert it into DataFrame object, and perform computations on the DataFrame object.
To retrieve JSON from RESTful and perform a conditional query, for instance, we use the following code:
package test
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object JTest {
def main(args: Array\[String\]): Unit = {
val spark = SparkSession.builder()
.master("local")
.getOrCreate()
val result = scala.io.Source.fromURL("http://127.0.0.1:6868/api/emp_orders").mkString
val jsonRdd = spark.sparkContext.parallelize(result :: Nil)
val df=spark.read.json(jsonRdd)
val Orders = df.select(explode(df("Orders"))).select("col.OrderID","col.Client","col.SellerId","col.Amount","col.OrderDate")
val condition=Orders.where("Amount>1000 and Amount<=3000 and Client like'%business%' ")
condition.show()
}
Or to perform grouping & aggregation:
val groupBy=Orders.groupBy(year(Orders("OrderDate"))).agg(count("OrderID"),sum("Amount"))
Or perform a join between, say, employees table and orders table:
val df1=df.select(df("Name"),df("Gender"),df("Dept"),explode(df("Orders")))
val relation=df1.select("Name","Gender","Dept","col.OrderID","col.Client","col.SellerId","col.Amount","col.OrderDate")
In the above code, JSON strings are first converted into RDD objects and then DataFrame objects. DataFrame can store multilevel data, get data from a specific level (like Orders) using explode function, retrieve desired fields using select function, and finally, perform the target computation.
Scala has many syntactic strengths. DataFrame is able to store multilevel data, gets along well with the JSON structure, retrieves data intuitively by level using the dot operator, computes JSON conveniently.
Spark support JSON format well, but it gives no support for XML format. To make Spark support XML, we have to introduce databricks class library. Though the two libraries can make a team, they reduce the stability.
SPL language is intended to compute structured data. Using the same principle as Scala, it can handle different data sources, including WebService/RESTful, using consistent syntax and data structure. SPL is lighter, has simpler syntax, and offers loosely coupled JDBC interface.
To retrieve JSON from RESTful and perform a conditional query, for instance, SPL has the code as follows:
A | |
---|---|
1 | =json(httpfile("http://127.0.0.1:6868/api/emp_orders").read()) |
2 | =A1.conj(Orders) |
3 | =A2.select(Amount>1000 && Amount<=2000 && like@c(Client,"business")) |
SPL first reads in JSON strings, converts them into multilevel table sequence object using json function, concatenate all Orders records using conj function, and then perform the conditional query using select function.
We can debug or execute the SPL code in esProc IDE, or stores it as a script file (say condition.dfx), which can be called in a Java program via the JDBC. Below is the code for performing invocation:
package Test;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
public class test1 {
public static void main(String\[\] args)throws Exception {
Class.forName("com.esproc.jdbc.InternalDriver");
Connection connection = DriverManager.getConnection("jdbc:esproc:local://");
Statement statement = connection.createStatement();
ResultSet result = statement.executeQuery("call condition()");
printResult(result);
if(connection != null) connection.close();
}
…
}
The invocation process is similar to that of calling the stored procedure. SPL also supports the SQL-like integration that embeds the code directly into a Java program, without the need of a script file, as shown below:
…
ResultSet result = statement.executeQuery("=json(httpfile(\\"http://127.0.0.1:6868/api/emp_orders\\").read()).conj(Orders).select(Amount>1000 && Amount<=3000 && like@c(Client,\\"\*bro\*\\"))");
…
Below is the SPL code for achieving grouping & aggregation and a join operation:
A | B | |
---|---|---|
3 | … | |
4 | =A2.groups(year(OrderDate);sum(Amount)) | /Grouping & aggregation |
5 | =A1.new(Name,Gender,Dept,Orders.OrderID,Orders.Client,Orders.Client,Orders.SellerId,Orders.Amount,Orders.OrderDate) | /Join operation |
We can see that SPL has a more powerful expressing capability. This enables it to handle common computations with short and easy to understand code and makes it more integration-friendly than Scala. The language gives more intuitive support for the dot operator, making it convenient to retrieve values from the multilevel data during a join operation and producing even more succinct code.
With the extraordinarily expressive syntax, SPL is able to simplify many multilevel JSON computations. Here’s an example. A JSON string’s runners filed is a subdocument that has three fields – horseId, ownerColours and trainer. The trainer field also has its subfield trainerId. ownerColours contains comma-separated array values. Below is part of the source data:
\[
{
"race": {
"raceId":"1.33.1141109.2",
"meetingId":"1.33.1141109"
},
...
"numberOfRunners": 2,
"runners": \[
{ "horseId":"1.00387464",
"trainer": {
"trainerId":"1.00034060"
},
"ownerColours":"Maroon,pink,dark blue."
},
{ "horseId":"1.00373620",
"trainer": {
"trainerId":"1.00010997"
},
"ownerColours":"Black,Maroon,green,pink."
}
\]
},
...
\]
We are trying to group the JSON by trainerId and count members of ownerColours in each group. The following SPL code is used to perform the computing task:
A | |
---|---|
1 | …(Skip the process of retrieving the JSON string) |
2 | =A1(1).runners |
3 | =A2.groups(trainer.trainerId; ownerColours.array().count():times) |
About data format support, SPL supports both JSON and XML and can handle them in consistent syntax. To get data from the weather WebService description file, query a list of provinces according to the descriptions, and convert the returned XML into a table sequence, for instance, SPL has the following code:
A | |
---|---|
1 | =ws_client("http://www.webxml.com.cn/WebServices/WeatherWebService.asmx?wsdl") |
2 | =ws_call(A1,"WeatherWebService":"WeatherWebServiceSoap":"getSupportProvince") |
In the aspect of syntactic style, SPL has the most expressive syntax to simplify multilevel JSON computations; Scala ranks the second to support multilevel JSON computations; Python shows equal ability in generating two-dimensional data but it does not have a direct support for handling multilevel data. In terms of XML format support, SPL and Scala outperform the rest of the tools, which have poor performance, through unfortunately, Scala needs to depend on instable third-party class libraries.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code