Skip to content

SPL practice:funnel analysis

esProcSPL edited this page Sep 29, 2024 · 1 revision

Problem description

Definition of funnel analysis

In e-commerce business, funnel analysis is a common statistical requirement. When users use a smart device to shop online, the system will establish a connection to form a session. Each session contains multiple operation events such as visit-type event (visit), view-type event (view), detail-page-type event (detail).

From the perspective of the occurrence time of events (event time for short), each user's operation events are in a certain order. The later the event order, the fewer the number of users involved in the event, just like a funnel. For the funnel conversion analysis, it first needs to count the number of deduplicated users for each operation event, and then perform further calculations like conversion rate based on the calculation results.

Let's take a three-step funnel analysis as an example. Three event types involved in the example are visit, view and detail, and the date range is from February 28, 2023 to March 27, 2023.

Step 1, filter out one user's visit events within this date range. Since more than one visit event may occur at the same time, it needs to find the record with the minimum session number from the records with the same event time, and regard the found record as the session corresponding to the user's visit event at this time point, which is called the visit event session.

Step 2, filter out the user's view events occurred within one day after the visit event from the visit event session obtained in step 1, and only retain each user's minimum session number, view time and visit time.

Step 3, filter out the user's detail events occurred within one day between the view time and visit time from the visit and view sessions obtained in step 2, and only retain each user's minimum session number, detail time and visit time.

For every user, it needs to determine whether the above three events occur, and count the total number of users of visit, view, and detail respectively, and divide the number of users of detail by the number of users of visit to find the three-step funnel conversion rate.

Data structure and data scale

Event type table (eventtype):

Field Field type Field meaning Sample data
id string Event type number, primary key ed3de1d0e3f475c9bf3e62dc984d4b79
name string Name confirm

The id is a 32-character string converted from a 16-digit hexadecimal number. The customer uses Snowflake, which may have special optimization measures for this type of data, yet the exported text data is a regular string. The primary keys of other tables are also of this data type.

Device type table (devicetype):

Field Field type Field meaning Sample data
id string Device type number, primary key dcb2f062f1436724259deab2bb93a7d5
name string Name pad

Session table (sessions)

Field Field type Field meaning Sample data
id string Session number, primary key bb0018615f7163de67b0d5d00593349d
userid string User number cde148d0c8296f1fcab3bf9fdab23d30
devicetype string Device type number eb58ad0ef88d405b3d2d43c1f55b2e82

Event table (events):

Field Field type Field meaning Sample data
id string Event number, primary key cb9112d01256d51ef7068cdf7c222ea5
sessionid string Session number bb0018615f7163de67b0d5d00593349d
userid string User number cde148d0c8296f1fcab3bf9fdab23d30
eventtime timestamp Time (milliseconds) 2023-03-07 12:03:33.034
eventtype string Event type ed3de1d0e3f475c9bf3e62dc984d4b79

Description of inter-table relationship:

The 'sessions' (s for short) and the 'events' (e for short) are in one-to-many relationship, and are associated on the userid and id of s and the userid and sessionid of e.

Each userid in s corresponds to unique id, and each session corresponds to unique event id.

The e and the eventtype table (et for short) are in many-to-one relationship, and are associated on the eventtype of e and the id of et.

The s and the devicetype table (dt for short) are in many-to-one relationship, and are associated on the devicetype of s and the id of dt.

The date range of all data is 2023-03-01 to 03-31, and the data volume of each table is:

sessions: 3 million rows events: 300 million rows eventtype: 700 rows devicetype: 8 rows

Condition and expectation

Now we want to arbitrarily choose a time span within 30 days before 2023-03-21 and calculate N-step (usually 3-7 steps) funnel conversion rate, and hope to obtain result within 30 seconds on Snowflake's Medium cluster (8-CPU EC2 with 4 nodes in total). Regrettably, actual tests showed that it failed to get the result in 3 minutes when coding in SQL to calculate a 3-step funnel of 14-day span.

The reason is that coding in SQL is very complicated. Even if the code is written, it needs to nest a subquery in each step, making the code difficult to understand and inefficient to execute.

Problem analysis

Ordered computing

The total amount of data involved in funnel analysis calculation is very large, yet the data amount of each user is not large, only a few to thousands of rows. Moreover, the events are calculated on a user-by-user manner and, there is no relationship between the events of different users.

For one user, multiple operation events are calculated in chronological order, which is a typical time-sequence calculation.

The time-sequence calculation logic on events is often extremely complex and requires writing multi-step code to implement.

SPL supports storing data in a physically ordered manner. If the event table is stored in order by userid, sessionid, and eventtime, then it will be easier to perform the time-sequence calculation by reading the data of each userid into memory one by one during calculation. For example, find the view event first, and then see if it follows by an adding-to-cart event. If there is an adding-to-cart event, and the time span between two events is less than 7 days, then it indicates this user has conducted two steps.

In this way, the conversion funnel analysis can be done during only one traversal of the data, and the calculation speed is fast and the memory space is less occupied. Moreover, such computing logic is perfectly in line with our natural thinking and can reduce the coding difficulty.

Although it is slow to pre-sort the event table, it only needs to sort once and, storing only one event table stored in order by user, session and time is enough, which avoids redundancy.

Fast de-duplication

Funnel statistics involves multiple sequential events, with each step corresponding to a COUNT(DISTINCT), allowing us to calculate the customer churn rate of the current step by combining with the previous step's COUNT(DISTINCT); the COUNT(DISTINCT) of the next step should be filtered based on that of previous step. And the order the events occur should be considered.

COUNT (DISTINCT) has always been a difficult problem in database computing, usually very slow. If the data volume is large (a large number of accounts), it may cause database to crash.

By utilizing SPL's ordered calculation described above, the process of reading the data of one user into memory for complex computation will not involve other users, nor does this user's data appear when calculating other users later. Therefore, the calculation results themselves don't have duplicate data and there is no need to deduplicate, hereby significantly improving performance and reducing memory consumption.

Data type conversion

Both the eventtype and the devicetype are small table. If the 32-character string is used to represent id, then the eventtype field of events and the devicetype field of sessions will have to store such id. Due to the large number of rows in the events and sessions, a huge number of 32-character strings need to be stored, resulting in high storage space occupation and poor computing performance.

If we store the eventtype and the devicetype in order, then the position sequence number corresponding to each record is fixed, so we can store the sequence number of record in events and sessions instead of storing the 32-character string. When calculating, we just need to read the two small tables into memory, and retrieve the corresponding record from memory based on a record's position sequence number as required. This optimization method is called sequence-numberization. Sequence number is small integer, boasting the advantages of less space occupation and fast calculation.

In particular, when filtering the events by eventtype field, we can also utilize the boolean dimension sequence to further improve performance by converting "comparison of set values" to "reference of sequence number".

Similarly, timestamp can also be converted to long (integerization), which can also improve performance.

Moreover, if the userid and sessionid fields in events and sessions are represented by 32-character string, it will result in inefficient computation, so they also need to be converted to integer. However, the 32-character string exceeds the value range of long (the largest integer in general). To solve this, we can use SPL's data type bytes, which consists of two long integers that are exactly the 16-digit hexadecimal number.

Practice process

Prepare the test data

A B
1 =8.(rands("bcded",2)/rands("abcdedf0123456789",30)) /devicetypes
2 [laptop,PC,gamebox,Spider,TV,phone,pad,unkown]
3 =A2.new(A1(#):id,~:name) =file("devicetype.txt").export(A3)
4 [visit,view,detail,login,cart,confirm,pay] /eventtypes
5 [50,150,200,300,400,500,550] =700.(rands("bcded",2)/rands("abcdedf0123456789",30))
6 =B5.new(~:id,"eventtype"/#:name) =A5.(A6(~).name=A4(#))
7 =file("sessions.txt") /sessions
8 =1501000.(rands("bcded",2)/rands("abcdedf0123456789",30)) =3001000.(rands("bcded",2)/rands("abcdedf0123456789",30))
9 =A8.id().to(15000) =B8.id().to(3000000)
10 =B9.new(~:id,A9(int((B9.#+1)/2)):userid,int(rand(1000)+1):sessiontime,A3(int(rand(8)+1)).id:devicetype)
11 =A7.export(A10)
12 =file("events1.btx") /events
13 for 50 =3010000.(rands("bcded",2)/rands("abcdedf0123456789",30))
14 =A12.export@ab(B13)
15 =A12.cursor@b().sortx(_1) =A15.group@1(_1)
16 =file("events2.btx") =A16.export@b(B15)
17 =movefile(A12)
18 [3812123,7180321,10090667,18100965,22000324,43250567,82788111] =A5.new(~:id,A4(#):name,A18(#):eventtime)
19 =A16.cursor@b().new(_1:id) =A7.cursor().new(_1:id,_2:userid).rename(id:sessionid)
20 =file("events3.btx") =to(700)\A5
21 for A19,100000 =B19.fetch(2000)
22 =B21.derive(A21.to((B21.#-1)*50+1,(B21.#)*50):events,B18.to(rand(7)+1):eventtype,elapse@ms(datetime(2023,3,1,9+rand(8),0,0)+rand(31),rand(1000)):eventtime1)
23 =B22.run((events|events).derive(B22.eventtype.m(#).id:eventtype,int(B22.eventtype.m(#).eventtime):et1):events)
24 =B23.run(events.derive(elapse@ms(B23.eventtime1,ifn(events.et1,0)):eventtime):events)
25 =B24.news(events;sessionid,userid,id,eventtype,eventtime)
26 =B25.run(if(eventtype==null,(eventtype=B20(rand(693)+1),eventtime=elapse@ms(datetime(2023,3,1,9+rand(8),0,0)+rand(31),rand(1000)))))
27 =A20.export@ab(B26)
28 =A20.cursor@b().new(id,userid,sessionid,A6(eventtype).id:eventtype,eventtime)
29 =A28.sortx(eventtime)
30 =file("events.txt").export(A29) =movefile(A20)

A1-B3: prepare the data of devicetypes;

A4-B6: prepare the data of eventtypes (700 in total), 7 of which to be calculated are fixed in A5;

A7-B11: prepare the data of sessions. Since this table is quite large, and the randomly generated 32-character strings may be repeated, both the userid and sessionid fields generate an extra 10,000 data, which need to be de-duplicated in memory;

A12-A15: prepare the data of events (150 million 32-character strings); write the data to events1.btx; sort and de-duplicate the data in external storage; store them in events2.btx and delete events1.btx;

A18 and B18: prepare the 7 steps to be used in calculation, and preset some time intervals to make more generated data satisfy the funnel rules;

A19- B27: loop through events2.btx and sessions.txt to generate 300 million data of events.btx, and in B28, write the generated data to events3.btx;

A28-A30: sort events3.btx by time and write it to events.txt;

B30: delete events3.btx.

Data preprocessing

The following codes are to convert the format of devicetype and eventtype from txt to btx:

A B
1 =file("devicetype.txt").import().new(_1:id,_2:name)
2 =file("devicetype.btx") =A2.export@b(A1)
A B
1 =file("eventtype.txt").import().new(_1:id,_2:name)
2 =file("eventtype.btx") =A2.export@b(A1)

Code to preprocess sessions:

A
1 =file("sessions.txt").cursor().new(_1:id,_2:userid,_3:sessiontime,_4:devicetype)
2 =file("devicetype.btx").import@b().keys@i(id)
3 =A1.new(k(bits@h(mid(userid,1,8)):4,bits@h(mid(userid,9,8)):4,bits@h(mid(userid,17,8)):4,bits@h(mid(userid,25,8)):4):userid, k(bits@h(mid(id,1,8)):4,bits@h(mid(id,9,8)):4,bits@h(mid(id,17,8)):4,bits@h(mid(id,25,8)):4):id, sessiontime, A2.pfind(devicetype):devicetype)
4 =A3.sortx(userid,id)
5 =file("sessions.ctx").create@py(#userid,#id,sessiontime,devicetype)
6 =A5.append@i(A4)

A3 uses the k() and bits() functions to convert the userid and id to bytes. Note that since the userid and id are strings composed of 0-9 and abcdef, the bits@h() function can be used here to convert them to integer according to hexadecimal number rules.

A3 also uses the pfind function to convert the 32-characcter string number of devicetype field to position sequence number, hereby implementing sequence-numberization.

Code to preprocess events:

A
1 =file("events.txt").cursor().new(_1:id,_2:userid,_3:sessionid,_4:eventtype,_5:eventtime)
2 =file("eventtype.btx").import@b().keys@i(id)
3 =A1.new(k(bits@h(mid(userid,1,8)):4,bits@h(mid(userid,9,8)):4,bits@h(mid(userid,17,8)):4,bits@h(mid(userid,25,8)):4):userid, k(bits@h(mid(sessionid,1,8)):4,bits@h(mid(sessionid,9,8)):4,bits@h(mid(sessionid,17,8)):4,bits@h(mid(sessionid,25,8)):4):sessionid, long(eventtime):eventtime, k(bits@h(mid(id,1,8)):4,bits@h(mid(id,9,8)):4,bits@h(mid(id,17,8)):4,bits@h(mid(id,25,8)):4):id, A2.pfind(eventtype):eventtype)
4 =file("events.ctx").create@py(#userid,#sessionid,#eventtime,#id,eventtype)
5 =A3.sortx(userid,sessionid,eventtime,id)
6 =A4.append@i(A5)

A3 converts the userid, sessionid and id to bytes, converts the eventtime to long, and performs the sequence-numberization of eventtype field.

Funnel analysis calculation

The given parameter arg_date is a date, such as 2023-03-21.

The arg_days is a time span, e.g. 14 days, 30 days.

The arg_steps refers to the funnel analysis steps to be calculated, e.g. three-step funnel: "visit,view,detail".

A
1 =eventtypes=file("eventtype.btx").import@b()
2 =devicetypes=file("devicetype.btx").import@b()
3 =long(elapse(arg_date,-arg_days))
4 =long(arg_date)
5 =long(arg_date+1)
6 =arg_steps.split@c()
7 =A1.(A6.pos(name))
8 =if(A6.len()>=3,to(3,A6.len()),[])
9 =file("events.ctx").open()
10 =sessions=file("sessions.ctx").open().cursor@mv(userid,id,devicetype)
11 =file("events.ctx").open().cursor@v(userid,sessionid,eventtime,eventtype;eventtime>=A3 && eventtime<A5,eventtype:A7:#;A9)
12 =A10.pjoin(userid:sessionid,userid,sessionid,eventtype,eventtime;A9,userid:id,userid,devicetype)
13 =A11.group(userid)
14 =A12.new(~.align@a(7,eventtype):e,e(1).select(eventtime<A4).group@u1(eventtime):e1,e(2).group@o(sessionid):e2${A8.(",e("/~/"):e"/~).concat()})
15 =A13.derive(join@m(e1:sub_e1,sessionid;e2:sub_e2,sessionid).derive@o(sub_e2.select(eventtime>sub_e1.eventtime && eventtime<sub_e1.eventtime+86400000).min(eventtime):sub_e2_min_time).select(sub_e2_min_time) :e1_join_e2 )
16 =A14.new(e1.id(devicetype):e1_id_devicetypeno,e1_join_e2.min(sub_e1.eventtime):e1_min_time,e1_join_e2.min(sub_e2_min_time):e2_min_time,e1_join_e2.min(sub_e1.sessionid):e1_min_sessionid${A8.(",e"/~/".select(sessionid==e1_min_sessionid && eventtime>e"/(~-1)/"_min_time && eventtime<e1_min_time+86400000):e"/~/"_1,e"/~/"_1.min(eventtime):e"/~/"_min_time").concat()})
17 =A15.news(e1_id_devicetypeno;~:devicetype,e2_min_time${A8.(",e"/~/"_min_time").concat()})
18 =A16.groups(devicetype;count(1):STEP1_COUNT,count(e2_min_time):STEP2_COUNT${A8.(",count(e"/~/"_min_time):STEP"/~/"_COUNT").concat()})

A3-A5: based on the given parameter arg_date, calculate the date arg_days before the parameter and the date 1 day after the parameter, and convert the three dates to long;

A6: calculate the sequence of 3-step funnel to be computed based on the given parameter;

A7: set the event types in the table sequence 'eventtypes', which are used for calculation, as 1, 2, 3 in turn, and set the rest as null. A7 is the boolean dimension sequence;

A8: calculate the sequence number 3 of the funnel with more than 2 steps;

A9: create multiple columnar cursors on sessions;

A10: create multiple columnar cursors on events, and segment synchronously with the cursors in A9; filter out three event types from the cursors using the boolean dimension sequence in A7, and retrieve one more day of data;

A11-A12: use the pjoin function to ordered merge of event and session columnar cursors, and perform ordered group by userid;

A13: for each userid group, align and group the eventtype by the order of 1, 2, 3, making them correspond to 3 event types respectively. The filter conditions from e1 through en include the judgment of substring;

A14: merge e1 and e2 in order by sessionid, calculate the min eventtime that satisfies condition from the merged result, and filter out min values that are not null;

A15: perform in-memory de-duplication calculation on devicetype in e1, and calculate the min eventtime and the min sessionid from the result of merging e1 and e2; starting from e3, filter out the records that meet sessionid and time conditions, i.e., the events in e3 that meet conditions;

A16: calculate the de-duplicated devicetype based on the cursors in A15, calculate the min time from e2 to e3 and merge into the original cursor;

A17: perform small result sets group and aggregate in A16; the grouping field is devicetype. Calculate the count value of each step in each group;

A18: convert the devicetype in the table sequence in A17 from sequence number to name;

This example is a three-step funnel analysis. To calculate more steps, we just need to modify the parameter arg_steps without the need to modify code. For example, for a seven-step funnel analysis, modifying the parameter as "visit,view,detail,login,cart,confirm,pay" will be OK.

Practice effect

When using SPL to perform the funnel analysis on an 8C32G virtual machine, it took 17 and 19 seconds respectively to calculate the 3-step and 7-step funnels of 14-day span, and took 28 and 30 seconds respectively to calculate the 3-step and 7-step funnels of 30-day span.

The main frequency of the CPU used in this test is only 1.7G, resulting in a relatively poor performance. When calculating the 3-step funnel of 14-day span on GCP's 16C128G virtual machine, the result can be obtained in 10 seconds.

Postscript

Funnel analysis problems will involve large numbers of users, and the event table is very large and needs to be stored in external storage in general. However, the data amount of each user is not large, only a few to thousands of rows.

Moreover, although the calculation of operation events is complex, it is done on a user-by-user manner and, there is no relationship between the events of different users, and the calculation of one user's events generally doesn't involve the events of other users.

We can utilize the ordered storage mechanism of SPL to separately load each user's data into memory for computation, which can reduce the computing complexity and effectively improve performance.

In doing so, the conversion funnel analysis can be done during only one traversal of the data, and the calculation speed is fast and the memory space is less occupied. Moreover, such computing logic is perfectly in line with our natural thinking and can reduce the coding difficulty.

SQL, which is commonly used in relational databases and big data technologies, is based on unordered set theory, and it cannot ensure the continuous storage of the data of each user, and hence the above-mentioned user- and time-based ordered algorithms cannot be adopted directly. If the high-level languages such as Java is employed to implement these algorithms, the amount of code will be huge and there are limitations in using such languages.

Clone this wiki locally