-
Notifications
You must be signed in to change notification settings - Fork 332
Implementation technologies of T+0 query
T+0 query refers to real-time data query, and the latest data can be queried during data statistics. There are also T+1, T+7, etc., which means that you can only query data 1 day or 7 days ago. Obviously, T+N query is not real-time. This article will discuss some main T+0 query technologies.
The simplest way to realize T+0 query through database is to directly use a single business database for a query. All data are stored in one database, and it can naturally realize real-time data query. However, this method is only applicable to scenarios where the amount of data is small and the concurrency is not high. When the amount of data is large or the concurrency of queries is high, it will lead to excessive pressure on the database and affect business use. Therefore, in order to ensure the stability of business, many systems will do read-write separation, and data query is mainly based on the read database (Query database).
Once the data is distributed to multiple databases, it is not so easy to implement T+0 query. A fairly simple way is to use the master-slave framework of the database to copy the business database data to the analysis database with the help of the database data synchronization mechanism (such as Oracle golden gate and MySQL master-slave replication). However, master-slave cannot achieve real real-time. There are often delays when the salve database copies data, caused by database IO and the high concurrency of the master database. During MySQL master-slave replication, the slave database can only execute SQL serially for data replication (the master database can be in parallel), and the delay will be exacerbated. If the execution time of a SQL is long or stuck, the delay will be longer and longer.
In addition, master-slave often requires the same type of database, and RDB is often used to meet the needs of the transaction system. When facing query and analysis tasks, RDB has no advantages in terms of capacity or performance. Therefore, master-slave is more suitable for the "quasi- T+0" query scenarios where the real-time requirement of data is not so high, and the amount of data is not too large.
Another approach is to use the cross-database query capability of the database itself (such as DBLink of Oracle and federated engine of MySQL). The previous data comes from the analysis database, and the real-time data comes from the business database. Since the amount of real-time data is not large, the query will not have a great impact on the business database. At the same time, cross-database query does not require high data synchronization frequency of analysis database, and T+N can be used.
However, there are also many disadvantages to realize T+0 query with the help of the database's own cross-database query capability, and the most obvious is the poor support for heterogeneous data sources. Both DBLink and federated are not stable in heterogeneous sources. Even federated requires that the table structure be exactly the same (Federated will create table definitions locally, and the data files exist in the remote database). While it is very common for analysis database (data warehouse) to use different types of databases from business database. Secondly, this method will occupy a lot of database resources and the performance is very poor. There are also some disadvantages, including unstable data transmission, not supporting large object operation, prone to errors, poor stability and low scalability.
As can be seen from the above contents, as long as cross database (mixed real-time hot data and non-real-time cold data) query can be realized, T+0 query statistics can be realized. Since the support of the database itself is not good enough, a natural idea is to implement it on the application end. Get the two parts of data from the databases respectively, and then perform mixed calculations through hard coding at the application end, so as to realize T+0.
However, two problems should be considered when completing the calculations at the application end, development difficulty and computing performance.
We know that at present, most application systems are developed in Java, and Java is not good at structured data calculation due to the lack of structured calculation class library. Simple group aggregation requires dozens of lines of code, which is far less convenient than SQL in data calculation. Using java hard coding to deal with some simple query tasks (such as account transaction query) is not a problem, but when facing complex statistical analysis scenarios, the implementation complexity of Java is too high and often does not have practical feasibility. At this time, you can also use some Java computing engines to assist in data processing, so as to simplify the difficulty of development. Unfortunately, in reality, the computing engines are not mature enough. Some do not support database data source, some cannot realize association, and the computing power is insufficient. Even if they have certain computing power, the coding complexity is still very high, and it is difficult to reduce the development difficulty. There is still a big gap compared with SQL.
In order to take into account the development efficiency, the historical data will still be stored in the database, so that the computing power of the database can be used to simplify the development process through SQL. However, a large number of historical data stored in the database will often lead to excessive pressure on the database, performance degradation, and low execution efficiency of some complex computing tasks. In addition, the performance of the application program interface (such as JDBC) of the database is often very poor. If the result set is slightly larger, the return time is very long, which leads to higher IO cost and worse overall operation performance. Moreover, the database writing action is very slow, the time cost of dumping business database data to the analysis database is not low, and the ETL time window is often limited, which will also have a great impact on some highly periodical businesses (such as bank settlement at the end of the month and the end of the year).
In general, it is not easy to implement T+0 through multi-source mixed calculation in Java on the application end, hard coding is too difficult, and the performance with the help of database is not ideal.
Following the idea of cross source calculation to realize T+0, the use of esProc SPL is also an option. esProc is an open-source structured data computing engine, and it provides a rich computing class library and can easily complete all kinds of data calculations. esProc adopts embedded JDBC and can be seamlessly integrated in Java applications. From this point of view, it can be regarded as an enhanced Java computing engine.
The following capabilities are mainly used to realize T+0 query through esProc.
esProc provides support for a variety of data sources, including RDB, NoSQL, JSON, CSV, WebService and private data file formats. It is important that esProc can perform mixed calculation based on any variety of data sources, read cold and hot data from different data sources, and realize T+0.
A | ||
---|---|---|
1 | =cold=db1.cursor(“select * from orders where odate<?”,date(now())) | / Fetch cold data (data of yesterday and before) from historical database |
2 | =hot=db2.cursor(“select * from orders where odate>=?”,date(now())) | / Fetch hot data (data of today) from business database |
3 | =[cold,hot].conjx() | |
4 | =A3.groups(area,customer;sum(amout):amout) |
Cross source calculation can be carried out, and T+0 query can be realized directly based on business database and analysis database.
esProc provides SPL syntax specially designed for data processing, and realizes complex calculations with the help of rich calculation class library. For example, according to the stock record table, query the stocks whose share price has risen for more than 5 consecutive days and the days of rise (the same share price is regarded as rise).
The SPL code is as follows:
A | ||
---|---|---|
1 | =db.query@x("select * from stock_record order by ddate") | |
2 | =A1.group(code) | |
3 | =A2.new(code,~.group@i(price<price[-1]).max(~.len())-1:maxrisedays) | Calculate the max continuous rising days of each stock |
4 | =A3.select(maxrisedays>=5) | Select qualified records |
The same calculation can only be completed by nesting three layers in SQL, while Java is more complex, and SPL can complete in a few simple steps, which is an incomparable advantage of esProc SPL.
There is a third advantage of using SPL to implement T+0 calculation:
Unlike database and Java solutions, which are strongly dependent on databases, SPL has complete computing power and can directly calculate based on files so that historical data can be stored in files (IO fast, flexible compression and easy to parallel), and then with the help of SPL's cross-source computing power, mixed calculating the database (hot data) and file (cold data) can achieve more efficient T+0 query performance. At the same time, the use of SPL's own high-performance data format can further improve computing performance.
Not only that, SPL also has built-in a large number of high-performance algorithms, which are more efficient and have more advantages over SQL. SPL is interpreted executed and supports hot deployment / hot switching. Based on these capabilities, esProc can fully deal with T+0 data query and ensure that the query system is permanently online, which is simpler and more efficient than database and Java implementation.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code