Routine for regular and active update of small amounts of data
Composite table is an important file storage format of SPL, yet a composite table file does not support simultaneous read and write operations, and it often requires storing data in order to ensure high performance. In practice, however, data is not static: it needs to be continuously appended or modified, and the order of newly generated data often differs from that required by the composite table. How to maintain the data of a composite table without affecting ongoing queries, while keeping the data in order, thus becomes a problem we have to face. This article presents methods to solve this problem under different scenarios.
Some basic characteristics of application scenarios:
Real-time: data maintenance may be performed at any time.
Regular: data maintenance will only be performed at specific time (usually on a regular basis). The time interval between two maintenances is relatively long, allowing enough time for data organization.
Single composite table: the total amount of data is relatively small, and all data can be stored in a single composite table.
Multi-zone composite table: the total amount of data is very large, and the data is usually stored in multiple tables, forming a multi-zone composite table.
Append: the composite table has no primary key; the data will only be appended and will not be modified and deleted.
Update: the composite table has primary key; the data may be added, modified, or deleted.
Small-scale maintenance: the new data is small in amount and can be loaded into memory, usually passed in as a table sequence.
Large-scale maintenance: the new data is large in amount and passed in as a cursor.
Online: query can be performed along with data maintenance.
Offline: queries can be paused during data maintenance. Offline maintenance is a special case of online maintenance. The following routines are all written for online scenarios.
1. Routine for regular and active update of small amounts of data
This routine is used to regularly maintain small amounts of data utilizing the supplementary area functionality of composite table. It is only applicable to a single composite table, supports updating, and actively fetches the updated data from the data source.
2. Routine for regular maintenance of single composite table
This routine is used to regularly maintain large amounts of data, and provides two modes: append and update; the newly generated data is passed in from application.
3. Routine for regular maintenance of multi-zone composite table
This routine is used to regularly maintain large amounts of data, and provides two modes: append and update; the newly generated data is passed in from application.
4. Routine for real-time data appending
This routine is used to append small amounts of data in real time, and requires using the multi-zone composite table; the newly generated data is passed in from application.
5. Routine for real-time data updating
This routine is used to update small amounts of data in real time, and requires using the multi-zone composite table; the newly generated data is passed in from application.
For a composite table with primary key, updates of small amounts of data can be implemented with the supplementary area. However, queries cannot be performed while data is being written to the supplementary area; repeated updates make the supplementary area very large, which degrades query performance; and the regular reorganization that merges the supplementary area into the composite table also requires pausing queries.
This routine presents how to implement query when updating a composite table, which is applicable to scenarios where the amount of added, deleted and modified data is small.
Method: make a copy of the current composite table, perform the update action on the copy, and still use the original composite table for query. After updating, use the new composite table to query, and delete the original composite table in the next update.
When querying, in addition to reading the composite table being queried, also read the data updated after the last update time. Use update@y to apply these updates to the supplementary area in memory, without writing them to the composite table file, so that real-time hot data can participate in the calculation.
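The copy-and-swap method above can be sketched in a language-agnostic way. Below is a minimal Python sketch; the function name swap_tables is hypothetical, and the dict keys mirror the ubc.json fields described later:

```python
# Minimal sketch of the current/backup name swap, assuming a config
# dict that mirrors ubc.json. swap_tables is a hypothetical helper.
def swap_tables(config):
    # The backup table has just been updated; promote it to current.
    # The old current table becomes the backup, to be overwritten
    # (i.e., deleted) at the start of the next update.
    config["current"], config["backup"] = config["backup"], config["current"]
    return config

config = {"current": "data2.ctx", "backup": "data1.ctx"}
swap_tables(config)
print(config)  # {'current': 'data1.ctx', 'backup': 'data2.ctx'}
```

Because queries always open the table named by "current", a query that started before the swap keeps reading the old file, while queries after the swap see the freshly updated one.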
1. Current table: refers to the composite table currently being used for query, and its name is recorded in config.
2. Backup table: refers to the composite table copied for updating, and its name is recorded in config. After updating, exchange the file names of current table and backup table recorded in config. In this way, subsequent query will be performed based on the new current table (i.e., the previous backup table), and the new backup table (i.e., the previous current table) will be deleted in the next update.
3. read.splx: this script should be implemented by the application (programmer). The input parameters are the start time and end time, both long values. It must return the data updated within that period as a sequence composed of two table sequences: the first contains the updated data, i.e., the modified or newly inserted records; the second contains the deleted data, i.e., the records that need to be deleted.
Configuration file: ubc.json
By default, this file is located in the main path of esProc (if you want to save it to another directory, modify the code to use an absolute path). The configuration file includes the following contents:
[{
"updateTimes":3,
"resetTimes":5,
"lastUpdateTime":null,
"sortKey":"account,tdate",
"otherFields":"ColName1,ColName2",
"dataDir":"data/",
"current":"data2.ctx",
"backup":"data1.ctx"
}]
“updateTimes” refers to the cumulative update times from the last reorganization until the present. Fill in 0 when the server starts for the first time.
“resetTimes” refers to the reorganization times. When the cumulative update times reaches the reorganization times, reorganize the composite table and reset the cumulative update times to zero.
“lastUpdateTime” refers to the last update time, indicating that the data before this time has been updated, and the next update only reads the data after this time. Fill in null when the server starts for the first time, indicating that all data needs to be updated.
“sortKey” refers to the primary key field of composite table, or the sorting field, also known as the dimension field. When there are multiple primary key fields, they are separated by commas.
“otherFields” refers to the other fields of the composite table, separated by commas if there are multiple fields.
“dataDir” refers to the storage path of composite table relative to the main directory.
“current” refers to the name of composite table file currently in use, such as “data1.ctx”.
“backup” refers to the name of backup composite table file, such as “data2.ctx”.
Assume the store table of an e-commerce system is updated once an hour and reorganized once a day, that is, the table is reorganized after every 24 updates. The configuration file is:
[{
"updateTimes":0,
"resetTimes":24,
"lastUpdateTime":null,
"sortKey":"id",
"otherFields":"ShopID,EmpID,ProductID,Ttime,Quantity,Price,Amount,Discount",
"dataDir":"data/",
"current":"data2.ctx",
"backup":"data1.ctx"
}]
Files and sub-paths in the main path:
data: the storage path of the composite table; its name is set in ubc.json (refer to the description above).
ubc.json: the configuration file.
The files under ‘data’ are the composite table files; their names are set in ubc.json (refer to the description above).
Global variables:
current: the name of the current composite table file.
lastUpdateTime: the last update time.
This script will be executed when the server starts. If the server is started for the first time, it needs to initialize parameters and create an initial composite table.
 | A | B |
---|---|---|
1 | >config=json(file("ubc.json").read()) | |
2 | =file(config.dataDir/config.current) | |
3 | if(!A2.exists()) | |
4 | | =A2.create(${config.sortKey.split@c().("#"+trim(~)).concat@c()},${config.otherFields}) |
5 | | =B4.close() |
6 | | =movefile@cy(config.dataDir/config.current,config.backup) |
7 | =file(config.dataDir/config.backup) | |
8 | if(!A7.exists()) | |
9 | | =movefile@cy(config.dataDir/config.current,config.backup) |
10 | >env(current,config.dataDir/config.current) | |
11 | >env(lastUpdateTime,config.lastUpdateTime) | |
A1: read the configuration file; A3-B6: if the current composite table file does not exist, create a new one and make a backup copy; A8-B9: if the backup composite table file does not exist, make a copy of the current composite table; A10: set the name of the current composite table as global variable; A11: set the last update time as global variable.
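The macro in B4 builds the column list for creating the composite table by prefixing each sortKey field with “#” (which marks a dimension field) and appending the other fields. The string manipulation can be sketched in Python; create_column_spec is a hypothetical name:

```python
def create_column_spec(sort_key, other_fields):
    # Mimic the B4 macro: prefix each sortKey field with '#'
    # (the dimension marker) and append the other fields unchanged.
    dims = ",".join("#" + f.strip() for f in sort_key.split(","))
    return dims + "," + other_fields

spec = create_column_spec("account,tdate", "ColName1,ColName2")
print(spec)  # #account,#tdate,ColName1,ColName2
```

With the example configuration above (sortKey "id"), the generated spec would start with "#id" followed by the other fields.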
The following code is just a sample; in practice, it should be implemented by the application (programmer). The input parameters are the start time and end time, both long values. It must return the data updated within that period as a sequence composed of two table sequences: the first contains the updated data, i.e., the modified or newly inserted records; the second contains the deleted data, i.e., the records that need to be deleted. The field order of each table sequence is [primary key fields + other fields], consistent with that in config.
A | |
---|---|
1 | =connect("mysql") |
2 | =A1.query("select account,tdate,ColName1,ColName2 from tbl1 where t1>=? and t1<? and deleted=0",start,end) |
3 | =A1.query("select account,tdate,ColName1,ColName2 from tbl1 where t1>=? and t1<? and deleted=1",start,end) |
4 | =A1.close() |
5 | return [A2,A3] |
In A2-A3, t1 is the update timestamp field; ‘deleted’ is the deletion flag field (0 means that the record is normal, and 1 means that the record is already deleted).
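For readers more familiar with general-purpose languages, the same contract can be sketched in Python with the standard sqlite3 module (table and column names follow the SPL sample; read_updates is a hypothetical name):

```python
import sqlite3

def read_updates(conn, start, end):
    # Return [updated_rows, deleted_rows] for the half-open interval
    # start <= t1 < end, mirroring the read.splx contract: the first
    # list holds modified/inserted records, the second deleted ones.
    cur = conn.cursor()
    updated = cur.execute(
        "select account,tdate,ColName1,ColName2 from tbl1 "
        "where t1>=? and t1<? and deleted=0", (start, end)).fetchall()
    deleted = cur.execute(
        "select account,tdate,ColName1,ColName2 from tbl1 "
        "where t1>=? and t1<? and deleted=1", (start, end)).fetchall()
    return [updated, deleted]

# Usage with an in-memory database:
conn = sqlite3.connect(":memory:")
conn.execute("create table tbl1(account,tdate,ColName1,ColName2,t1,deleted)")
conn.execute("insert into tbl1 values('a1',1,'x','y',100,0)")
conn.execute("insert into tbl1 values('a2',2,'x','y',150,1)")
upd, dele = read_updates(conn, 0, 200)
print(len(upd), len(dele))  # 1 1
```

The half-open interval [start, end) matters: it guarantees that a record updated exactly at an update boundary is picked up by exactly one maintenance round.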
This script is called by an external program. It reads the data updated between the last update time and the current time, and writes the updates to the backup composite table. After updating, it sets the backup composite table as the current composite table; meanwhile, the original current composite table is discarded and will be deleted when the next update starts.
Count the cumulative update times. When the cumulative update times reaches the reorganization times, reorganize the composite table, and reset the cumulative update times to zero.
 | A | B |
---|---|---|
1 | >config=json(file("ubc.json").read()) | |
2 | =movefile@cy(config.dataDir/config.current,config.backup) | |
3 | =long(datetime@h(now())) | |
4 | =call("read.splx",lastUpdateTime,A3) | |
5 | if(A4(1).len()==0 && A4(2).len()==0) | |
6 | | return |
7 | =file(config.dataDir/config.backup).open() | |
8 | =A7.update(A4(1):A4(2)) | |
9 | =A7.close() | |
10 | >config.updateTimes+=1 | |
11 | if(config.updateTimes>=config.resetTimes) | |
12 | | =file(config.dataDir/config.backup).reset() |
13 | | >config.updateTimes=0 |
14 | =config.current | |
15 | >config.current=config.backup,current=config.dataDir/config.current | |
16 | >config.backup=A14 | |
17 | >config.lastUpdateTime=A3,lastUpdateTime=A3 | |
18 | =file("ubc.json").write(json(config)) | |
A1: read the configuration file; A2: make a copy of the current composite table as a backup; A3: the current update time, accurate to the hour; A4: read the data between the last update time and the current update time; A5-B6: return directly if there is no updated data; A7: open the backup composite table; A8: perform the update; A10: increment the cumulative update count; A11: if the cumulative update count reaches the reorganization count, then: B12: reorganize the backup composite table; B13: reset the cumulative update count to zero; A14-A16: exchange the file names of the backup composite table and the current composite table; A17: set the last update time to the current update time.
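The counting logic in A10-B13 (increment the update counter, and trigger reorganization when it reaches resetTimes) can be sketched as follows; bump_update_counter is a hypothetical name, and reorganize stands in for the composite table reset:

```python
def bump_update_counter(config, reorganize):
    # Increment updateTimes; when it reaches resetTimes, invoke the
    # supplied reorganize() callback and reset the counter to zero,
    # mirroring rows A10-B13 of the update script.
    config["updateTimes"] += 1
    if config["updateTimes"] >= config["resetTimes"]:
        reorganize()
        config["updateTimes"] = 0
    return config

config = {"updateTimes": 22, "resetTimes": 24}
calls = []
bump_update_counter(config, lambda: calls.append("reset"))  # 23, no reset
bump_update_counter(config, lambda: calls.append("reset"))  # reaches 24, resets
print(config["updateTimes"], calls)  # 0 ['reset']
```

Persisting the counter back to ubc.json (row A18) is what makes the reorganization schedule survive server restarts.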
This script is used for data query. It reads the real-time data between the last update time and the current time, stores it in the supplementary area in memory, and returns a composite table object.
A | B | |
---|---|---|
1 | =call("read.splx",lastUpdateTime,long(now())) | |
2 | =file(current).open() | |
3 | if(A1(1).len()>0 || A1(2).len()>0) | =A2.update@y(A1(1):A1(2)) |
4 | return A2 |
A1: read the updated data; A2: open the composite table; A3-B3: write the updated data into the supplementary area in memory to participate in the calculation; A4: return the composite table object. Note: the composite table object must be closed after use.
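The effect of update@y in B3 (apply hot updates and deletions in memory without touching the file) can be modeled in Python over rows keyed by primary key; all names here are illustrative:

```python
def merge_hot_data(base, updated, deleted):
    # Return a queryable view: the base rows keyed by primary key,
    # overlaid with updated rows and minus deleted keys. The base
    # (standing in for the composite table file) is left unmodified.
    view = dict(base)                      # copy; base stays intact
    for key, row in updated:
        view[key] = row                    # insert or modify
    for key, _ in deleted:
        view.pop(key, None)                # delete if present
    return view

base = {1: "old-1", 2: "old-2", 3: "old-3"}
updated = [(2, "new-2"), (4, "new-4")]
deleted = [(3, None)]
view = merge_hot_data(base, updated, deleted)
print(sorted(view.items()))  # [(1, 'old-1'), (2, 'new-2'), (4, 'new-4')]
```

As in the SPL script, the merge happens per query, so hot data is visible immediately while the on-disk composite table is only rewritten during the regular update.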