Skip to content

Python vs. SPL 10 One to N Association

esProcSPL edited this page May 13, 2024 · 2 revisions

In data analysis, we usually encounter the scenarios of associating two or more tables, and the association of tables can be divided into the following categories: one-to-one, one-to-many, many-to-one, and many-to-many. The association of one-to-one means that one record of a table corresponds to only one record of another table; the association of one-to-many indicates that one record of a table corresponds to multiple records of another table. This article will compare the computational abilities of Python and SPL in one-to-N association.

Homo-dimension association

One-to-one association is also known as homo dimension association, and the associated two tables are homo dimension tables to each other. For example:

Based on the final score table and the usual score table, calculate students’ scores of the semester (the percentages of final scores and usual scores are 0.7 and 0.3 respectively).

Final score table

studentid fscore
s1087 68.36422
s1049 78.17515
s1018 82.61762

Usual score table

studentid uscore
s1050 75.11157
s1039 88.02989
s1091 87.95997

Python

import pandas as pd

fexam_file="D:/data/FExamRes.csv"

uperf_file="D:/data/UPerformance.csv"

fexam=pd.read_csv(fexam_file)

uperf=pd.read_csv(uperf_file)

scores=pd.merge(fexam,uperf,on="studentid",how="outer").fillna(0)

scores["total"]=0.7scores["fscore"]+0.3scores["uscore"]

print(scores)




The scores of final exam

The usual scores

Add 0 after associating

Calculate the scores

The merge function in Python can associate two Dataframes. Some students may lack a certain score, so we use outer join, then the score is recorded as 0 when there is no score; in this way, the final score can be obtained after weighed sum. After associating, a new Dataframe will be generated in Python, in which data are copied. In short, this whole processing takes up both a lot of time and plenty of memory.

SPL


A B
1 D:/data/FExamRes.csv
2 D:/data/UPerformance.csv
3 =file(A1).import@tc()
4 =file(A2).import@tc()
5 =join@f(A3:fexam,studentid;A4:uperf,studentid) Homo dimension association
6 =A5.new([fexam.studentid,uperf.studentid].conj().id().~:studentid,fexam.fscore:fscore,uperf.uscore:uscore,0.7*fexam.fscore+0.3*uperf.uscore:total)

The join function in SPL performs the association, @f option is outer join, and in A6 the final scores are calculated. The association in SPL only uses the primary key (studentid in this example) to create association on two tables and does not copy the data, so it occupies a little memory and performs very fast.

Multiple homo-dimension tables

During associative calculation, we may often encounter scenarios where three or more homo-dimension tables need to be associated. For example:

Apart from the final scores and usual scores, students also have the scores of elective courses. And the percentages of the three are 0.6, 0.3, and 0.1 respectively.

The final score table and usual score table are the same as the previous example, and the elective score table is shown below:

Elective score table

studentid escore
s1051 74.66936
s1035 57.33134
s1047 61.29318

Python

#continue to use fexam and uperf

elec_file="D:/data/ElectiveScore.csv"

fexam=fexam.set_index(keys="studentid")

uperf=uperf.set_index(keys="studentid")

elec = pd.read_csv(elec_file,index_col='studentid')

student_score = pd.concat([fexam,uperf,elec],axis=1,join='outer',sort=True).fillna(0).reset_index()

student_score['total'] = 0.6student_score.fscore+0.3student_score.uscore+0.1*student_score.escore

print(student_score)



Set the index


Set the index while reading data

Associate multiple homo-dimension tables



Aggregate the scores

The merge function in Python can’t be used any longer for associating multiple homo-dimension tables because it only supports association of two tables. It will be a bit troublesome to execute the merge function twice when associating three tables. However, Python offers the concat function which can perform association based on the index, making association of multiple homo-dimension tables easier. However, in this way, we are required to master the usage of another function.

SPL


A B
/A3 is final score table, and A4 is usual score table
8 D:/data/ElectiveScore.csv
9 =file(A8).import@tc()
10 =A3.keys(studentid) /set the primary key
11 =A4.keys(studentid)
12 =A9.keys(studentid)
13 =join@f(A3:fexam;A4:uperf;A9:elec) /associate multiple homo-dimension tables
14 =A13.new([fexam.studentid,uperf.studentid,elec.studentid].conj().id().~:studentid,fexam.fscore:fscore,uperf.uscore:uscore,elec.escore:escore,0.6*fexam.fscore+0.3*uperf.uscore+0.1*escore:total) /calculate the total scores

The join function in SPL can associate two tables as well as three or more tables. After the primary key is set, the function will perform association according to the primary key automatically, and the subsequent operation is the same as that of two-table association. In SPL, it is very easy for us to adopt one method to other operations, which sparing the trouble of memorizing the usage of other functions.

Primary-sub tables

One-to-many association indicates that one record of a table is able to correspond to any record of another table. In such association, we call the table at the “one” end as primary table, and the table at the “many” end as sub table. For example, in the order table and order detail table, one order corresponds to one distinct orderid, but every orderid may correspond to many pieces of order detail information. And the order table is primary table and order detail table is sub table.

Based on the above two tables, calculate the consumption of customers of each province in s150 stores.

Order table (the primary table of order detail table)

storeid orderid client location
s101 o10001 c104231 Beijing
s101 o10002 c107032 Beijing
s101 o10003 c108875 Beijing

Order detail table (the sub table of order table)

storeid orderid productid price
s101 o10001 p1078 1273
s101 o10002 p1091 131
s101 o10002 p1008 780

Python

detail_file="D:/data/Detail.csv"

order_file="D:/data/Order.csv"

detail=pd.read_csv(detail_file)

order=pd.read_csv(order_file)

detail150=detail.query("storeid=='s150'")

order150=order.query("storeid=='s150'")

order_detail=pd.merge(order150,detail150,on="orderid",how="inner")

loc_consume=order_detail.groupby("location").price.sum()

print(loc_consume)




The order details of s150 stores

The orders of s150 stores

Associate two tables

Group and aggregate

The association in this example is performed on two tables, and the associative relation is also very clear. We can just use orderid to associate two tables and then group and aggregate the records according to location. Here the operation is one-to-many association, which can be converted to one-to-one association as well:

  1. Aggregate the consumption amount of each order in the order detail table;
  2. Associate order table with the aggregation result of Step 1, then the association is about one-to-one;
  3. Group and aggregate according to location. The code is written as:
  4.        order_sum=detail150.groupby(“orderid”).price.sum()
    
  5.        order_detail=pd.merge(order150, order_sum,on="orderid",how="inner")
    
  6.        loc_consume=order_detail.groupby("location").price.sum()
    

SPL


A B

16 D:/data/Detail.csv
17 D:/data/Order.csv
18 =file(A16).import@tc()
19 =file(A17).import@tc()
20 =A18.select(storeid=="s150") /select
21 =A19.select(storeid=="s150")
22 =A20.groups(orderid;sum(price):sprice) /group and aggregate
23 =join(A21:order,orderid;A22:consume,orderid) /associate
24 =A23.groups(order.location;sum(consume.sprice):consume) /group and aggregate

As for primary-sub tables association, the elaborately designed SPL advocates the method of grouping and aggregation -- homo-dimension association. Because in this way the associative relation is even clearer and the operation is less prone to errors when associated tables are too many and too complex. Also, SPL can associate first and aggregate later in the same way that Python does, and the code is like: A25=join(A21:order,orderid;A20:detail,orderid) A26=A25.groups(order.location;sum(detail.price):consume)

Multi-layer primary-sub tables

There are also primary-sub tables of multiple layers such as:

Apart form the same order table and order detail table of last example, a new store information table is added.

Store information table (primary table)

storeid province
s101 Heilongjiang
s102 Beijing
s103 Tianjin

Based on the above three tables, aggregate the consumption of stores in each province.

Python

#continue to use detail and order tables

store_file="D:/data/Store.csv"

store = pd.read_csv(store_file)

detail_sum = detail.groupby(['storeid','orderid']).price.sum()

order_detai = pd.merge(order,detail_sum,on=['storeid','orderid'])

store_amount = order_detai.groupby('storeid',as_index=False).price.sum()

store_inf = pd.merge(store,store_amount,on=["storeid"])

province_sale = store_inf.groupby('province',as_index=False).price.sum()

print(province_sale)

 

 

 

Aggregate the sales amount of each order

 

Associate orders with the aggregated sales amount

Aggregate the sales amount of each store

 

Associate store information with sales amount of stores

Group and aggregate according to the provinces where stores locate

 

The order detail table is the sub table of order table, and order table is the sub table of store table, which form two layers of primary-sub tables together. The operation follows the method of aggregation on sub tables -- association on primary table, which make the associative relation clearer, and does not lead to too big associated tables or wrong association of many-to-many. This example shows a multi-field association, which can be easily done in Python as well.

SPL


A B

28 D:/data/Store.csv
29 =file(A28).import@tc()
30 =A18.groups(storeid,orderid;sum(price):sprice) /aggregate the sales amount of each order
31 =join(A19:order,[storeid,orderid];A30:detail_sum) /associate orders with aggregated sales amount of orders
32 =A31.groups(order.storeid;sum(detail_sum.sprice):sales) /aggregate sales amount of each store
33 =join(A29:store,storeid;A32:store_amount) /associate store information with sales amount of stores
34 =A33.groups(store.province;sum(store_amount.sales):sales) /group and aggregate according to the provinces where stores locate

It is also easy for SPL to complete the association with the same method introduced previously.

One primary table & multiple sub tables

It is also very common that there is one primary table and multiple sub tables, such as order table, order detail table and order return table. For example:

Part of the order table, order detail table, and payment collection table are shown below:

Order table (primary table)

orderid clientid date
10012 100658 2019/2/13
10023 103478 2019/1/12
10040 108013 2019/1/4

Order detail table (sub table)

orderid productid price
10012 3018 428.5
10012 3019 349.2
10023 3019 349.2

Payment collection table (sub table)

orderid term payment
10012 1 259.2
10012 2 259.2
10012 3 259.3

Based on the above three tables, calculate the orders whose payments are not fully collected.

Python

detail_file="D:/data/Detail_2.csv"

order_file="D:/data/Order_2.csv"

pay_file="D:/data/Payment_2.csv"

detail_2 = pd.read_csv(detail_file)

order_2 = pd.read_csv(order_file,index_col='orderid')

pay = pd.read_csv(pay_file)

detail_order = detail_2.groupby('orderid').price.sum()

pay_order = pay.groupby('orderid').payment.sum()

order_d_p = order_2.join([pay_order,detail_order])

nopay_order = order_d_p.query('price>payment')

print(nopay_order)

 

 

 

 

 

 

Aggregate sales amount of orders

Aggregate amount of collected payments

Associate primary table with two sub tables

Select

Order detail table is the sub table of order table, and order return table is also the sub table of order table; so there is one primary table and two sub tables. This operation still follows the method of aggregation on sub tables -- association on primary table. But Python does differently to complete the operation because the merge function cannot associate three tables simultaneously, instead, the join function is provided to associate three tables using index. It creates a new wide table, and uses the wide table to select the result, whose associative logic is slightly different from the previous one.

SPL


A B

36 D:/data/Detail_2.csv
37 D:/data/Order_2.csv
38 D:/data/Payment_2.csv
39 =file(A36).import@tc()
40 =file(A37).import@tc()
41 =file(A38).import@tc()
42 =A39.groups(orderid;sum(price):amount) /aggregate sales amount of orders
43 =A41.groups(orderid;sum(payment):payment) /aggregate amount of collected payments
44 =join(A40:order,orderid;A42:amount;A43:payment) Associate the primary table with two sub tables
45 =A44.select(amount.amount>payment.payment).(order) /select

SPL still follows the logic described before, using the join function to associate multiple tables step by step, then just displaying the order table after selecting.

Summary

When performing association, Python copies two tables to create a new wide table, and uses the wide table to complete the subsequent operations, which consumes more memory and degrades the efficiency. Besides, when associating multiple tables, Python needs to use other functions and some particular conditions, otherwise, it has to use the merge function many times, which is a bit cumbersome.

While SPL only establishes associative relation on two tables rather than copying data, and uses the relation to complete the operation quickly, which occupies less memory and achieves high efficiency. In addition, when associating multiple tables, SPL can still use the same method of two-table association to deal with N tables without complex detour.

When associating primary table with sub tables (one-to-many), we first aggregate the sub tables and then associate them with primary table, which forms the homo-dimension association, so they are introduced together.

Clone this wiki locally