-
Notifications
You must be signed in to change notification settings - Fork 332
User Behavior Analysis in Practice 3:Order based Filtering Using Binary Search
We have a user events table T. Below is its structure and part of its data:
Time | UserID | EventTypeID | EventType |
---|---|---|---|
2022/6/1 10:20 | 1072755 | 3 | Search |
2022/6/1 12:12 | 1078030 | 2 | Browse |
2022/6/1 12:36 | 1005093 | 5 | Submit |
2022/6/1 13:21 | 1048655 | 1 | Login |
2022/6/1 14:46 | 1037824 | 6 | Logout |
2022/6/1 15:19 | 1049626 | 4 | AddtoCart |
2022/6/1 16:00 | 1009296 | 5 | Submit |
2022/6/1 16:39 | 1070713 | 2 | Browse |
2022/6/1 17:40 | 1090884 | 3 | Search |
Fields in table T:
Field name | Data type | Description |
---|---|---|
Time | Datetime | Time stamp of an event, accurate to milliseconds |
UserID | Integer | User ID |
EventTypeID | Integer | Event type ID |
EventType | String | Event type name |
Computing task:
Find the number of events under each type and that of distinct users who perform that type of event in the specified time period.
The first thing the task involves is to obtain data within a specified time period from huge amount of data.
Sort the original table by Time field and save the ordered data. This enables us to use binary search to perform the filtering without the need of traversing all data when trying to retrieve data in a specified time range, and thus can considerably speed up the filtering operation.
- Dump data from database and store it in a binary file ordered by time.
Stocked data: the data is retrieved from the database, sorted by time and written to a binary file:
A | |
---|---|
1 | =connect("demo").cursor@x("select * from T order by Time") |
2 | =file("T.btx").export@b(A1) |
A1 Sort table T by time while retrieving data from it.
Newly-increased data: The newly-increased data is identified through time stamp. Each day after 0 o’clock we append the newly-generated data in the past day to a bin file:
A | |
---|---|
1 | =connect("demo").cursor@x("select * from T where Time>=? && Time<? order by Time",date(now()-1), date(now())) |
2 | =file("T.btx").export@ba(A1) |
A1 Get data generated in the previous day through filtering condition to create a cursor. Data needs to be first sorted by time stamp.
A2 Fetch A1’s data from the cursor and append it to bin file T.btx. As the newly-increased data is identified according to time stamp, the new data is naturally generated later. There’s no need to sort-merge the new data and the old data. Just append the new data to the original file.
- Perform fast filtering using binary search and then grouping & aggregation
Suppose we need to summarize data that falls in between 2022-03-15 and 2022-06-16:
A | |
---|---|
1 | >start=date("2022-03-15","yyyy-MM-dd"),end=date("2022-06-16","yyyy-MM-dd") |
2 | =file("T.btx").iselect@br(start:end,Time).groups(EventTypeID; EventType, count(1):Num, icount(UserID):iNum) |
A2 iselect function reads records whose Time values are within the interval [start,end] from the file ordered by expression Time and return them as a cursor. In this case the time interval is closed. @r option indicates that not all Time values are distinct because there are concurrencies performed by multiple users; @b option enables reading data from a binary file.
The iselect function does not work with the option enabling parallel processing. In a single-task scenario, it is not necessarily that iselect is faster than the parallel-processing-based full traversal when the selected result set is large. In a multi-task scenario, parallel processing is futile (because the multiple CPUs will be fully occupied with handling concurrencies). In that case, the function has a great advantage.
Execution result:
EventTypeID | EventType | Num | iNum |
---|---|---|---|
1 | Login | 4033000 | 393400 |
2 | Browse | 3578901 | 348791 |
3 | Search | 2947931 | 257539 |
4 | AddtoCart | 1845674 | 175476 |
5 | Submit | 867345 | 83375 |
6 | Logout | 4033000 | 393400 |
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code