Performance comparison of Python and SPL in data processing
In the article Comparison test of Python and SPL in data reading and computing performance, we compared the performance of Python and SPL in data reading and computing. Since modifying datasets is common in daily data processing, this article compares the data processing performance of Python and SPL.
Test environment:
- System: Windows 11
- Memory: 16G
- CPU: 8 cores
- Data: TPCH 1G
The amount of data involved in this article is not particularly large and can be fully loaded into memory.
We use the orders table, which has 1.5 million rows. The text file is 172MB. Fields in each row are separated by the symbol “|” and presented in the following form:
1|18451|O|193738.97|1996-01-02|5-LOW|Clerk#000000951|0|nstructions sleep furiously among |
2|39001|O|41419.39|1996-12-01|1-URGENT|Clerk#000000880|0| foxes. pending accounts at the pending, silent asymptot|
3|61657|F|208403.57|1993-10-14|5-LOW|Clerk#000000955|0|sly final accounts boost. carefully regular ideas cajole carefully. depos|
4|68389|O|30234.04|1995-10-11|5-LOW|Clerk#000000124|0|sits. slyly regular warthogs cajole. regular, regular theodolites acro|
5|22243|F|149065.30|1994-07-30|5-LOW|Clerk#000000925|0|quickly. bold deposits sleep slyly. packages use slyly|
...
Insert the following record at row 8 of the orders table:
[8,80665,'F',192181.76,'1995-01-16','3-MEDIUM','Clerk#000000946',0,'efulpackages.blithelyfinalaccountssleepcare']
Python code:
import pandas as pd
import numpy as np
import time
orders_file = "D:/TPCHdata/tpchtbl1g/orders.tbl"
orders_data = pd.read_csv(orders_file,sep = '|')
s = time.time()
cols = orders_data.columns
rcd = [8,80665,'F',192181.76,'1995-01-16','3-MEDIUM','Clerk#000000946',0,'eful packages. blithely final accounts sleep care']
orders_data_insert = pd.DataFrame(np.insert(orders_data.values, 7, values=rcd, axis=0),columns=cols)
e = time.time()
print(e-s)
Time consumed: 0.32 seconds.
Pandas’ DataFrame is essentially a NumPy matrix, but it does not fully inherit the matrix’s methods. Take insert() as an example: NumPy supports inserting both rows and columns, while Pandas’ insert only supports inserting columns. To insert a row, we have to convert the data to a numpy array first and then convert it back to a DataFrame. Converting the data twice for such a simple insertion takes extra time.
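If the round trip through NumPy is undesirable, a common alternative (a sketch on toy data, not the article's benchmarked code) is to slice the DataFrame around the insertion point and glue the pieces back together with pd.concat:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4], "status": ["O", "O", "F", "O"]})

# Build the new record as a one-row DataFrame with the same columns.
new_row = pd.DataFrame([[99, "F"]], columns=df.columns)

# Insert before positional row 2 by concatenating the slices around it.
df2 = pd.concat([df.iloc[:2], new_row, df.iloc[2:]], ignore_index=True)
```

This stays within Pandas, though it still copies the data, so it is not necessarily faster than the np.insert route.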
SPL code:
A | |
1 | D:\TPCHdata\tpchtbl1g\orders.tbl |
2 | =file(A1).import@t(;,"|") |
3 | =now() |
4 | [8,80665,F,192181.76,1995-01-16,3-MEDIUM,Clerk#000000946,0,eful packages. blithely final accounts sleep care] |
5 | =A2.record@i(A4,8) |
6 | =interval@ms(A3,now()) |
Time consumed: 0.001 seconds.
Delete the 10000th record
Python code:
import pandas as pd
import time
orders_file = "D:/TPCHdata/tpchtbl1g/orders.tbl"
orders_data = pd.read_csv(orders_file,sep = '|')
s = time.time()
orders_data_delete = orders_data.drop(index=9999)
e = time.time()
print(e-s)
print(orders_data_delete.iloc[9998:10001])
Time consumed: 0.13 seconds.
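Note that drop removes the row but leaves a gap in the integer index. If a contiguous index matters downstream, reset_index restores it; a small sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({"v": [0, 1, 2, 3, 4]})

dropped = df.drop(index=2)                   # index is now 0, 1, 3, 4 -- a gap remains
contiguous = dropped.reset_index(drop=True)  # renumber as 0..3
```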
SPL code:
A | |
1 | D:\TPCHdata\tpchtbl1g\orders.tbl |
2 | =file(A1).import@t(;,"|") |
3 | =now() |
4 | =A2.delete(10000) |
5 | =interval@ms(A3,now()) |
Time consumed: 0.001 seconds.
Modify the O_CUSTKEY of the 1000000th record to 1000000, and its O_ORDERDATE to 1996-10-10.
Python code:
import pandas as pd
import time
orders_file = "D:/TPCHdata/tpchtbl1g/orders.tbl"
orders_data = pd.read_csv(orders_file,sep = '|',parse_dates=['O_ORDERDATE'],infer_datetime_format=True)
s = time.time()
orders_data.loc[999999,['O_CUSTKEY','O_ORDERDATE']]=[1000000,pd.to_datetime('1996-10-10')]
e = time.time()
print(e-s)
Time consumed: 0.006 seconds.
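For single-cell updates, Pandas also offers the .at accessor, which addresses exactly one cell by label and is generally faster than .loc for scalar writes; a sketch on toy data (not the benchmarked code):

```python
import pandas as pd

df = pd.DataFrame({"O_CUSTKEY": [1, 2, 3], "O_TOTALPRICE": [10.0, 20.0, 30.0]})

# .at targets a single cell, so each update is a plain scalar write.
df.at[1, "O_CUSTKEY"] = 1000000
df.at[1, "O_TOTALPRICE"] = 42.5
```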
SPL code:
A | |
1 | D:\TPCHdata\tpchtbl1g\orders.tbl |
2 | =file(A1).import@t(;,"|") |
3 | =now() |
4 | =A2.modify(1000000,1000000:O_CUSTKEY,date("1996-10-10"):O_ORDERDATE ) |
5 | =interval@ms(A3,now()) |
Time consumed: 0.001 seconds.
Rename O_ORDERKEY to O_KEY, and O_TOTALPRICE to O_T_PRICE.
Python code:
import pandas as pd
import time
orders_file = "D:/TPCHdata/tpchtbl1g/orders.tbl"
orders_data = pd.read_csv(orders_file,sep = '|',parse_dates=['O_ORDERDATE'],infer_datetime_format=True)
s = time.time()
orders_data.rename(columns={'O_ORDERKEY':'O_KEY','O_TOTALPRICE':'O_T_PRICE'},inplace=True)
e = time.time()
print(e-s)
Time consumed: 0.002 seconds.
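One pitfall worth noting: rename silently ignores mapping keys that do not match any column, so a typo goes unnoticed. Passing errors="raise" turns it into an immediate KeyError; a sketch:

```python
import pandas as pd

df = pd.DataFrame(columns=["O_ORDERKEY", "O_TOTALPRICE"])

# A misspelled key is silently skipped by default...
renamed = df.rename(columns={"O_ORDERKY": "O_KEY"})

# ...but errors="raise" surfaces it as a KeyError.
try:
    df.rename(columns={"O_ORDERKY": "O_KEY"}, errors="raise")
    raised = False
except KeyError:
    raised = True
```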
SPL code:
A | |
1 | D:\TPCHdata\tpchtbl1g\orders.tbl |
2 | =file(A1).import@t(;,"|") |
3 | =now() |
4 | =A2.rename(O_ORDERKEY:O_KEY,O_TOTALPRICE:O_T_PRICE) |
5 | =interval@ms(A3,now()) |
Time consumed: 0.001 seconds.
Since Pandas is good at computing on numbers but less good at computing on strings, we carry out two tests for adding a field: adding a numeric computed column and adding a string computed column.
Add a column: the difference between O_TOTALPRICE and the mean price.
Python code:
import pandas as pd
import time
orders_file = "D:/TPCHdata/tpchtbl1g/orders.tbl"
orders_data = pd.read_csv(orders_file,sep = '|')
s = time.time()
mprice = orders_data['O_TOTALPRICE'].mean()
orders_data['O_DIF_AVG'] = orders_data['O_TOTALPRICE']-mprice
e = time.time()
print(e-s)
Time consumed: 0.01 seconds.
SPL code:
A | |
1 | D:\TPCHdata\tpchtbl1g\orders.tbl |
2 | =file(A1).import@t(;,"|") |
3 | =now() |
4 | =A2.avg(O_TOTALPRICE) |
5 | =A2.derive(O_TOTALPRICE-A4:O_DIF_AVG) |
6 | =interval@ms(A3,now()) |
Time consumed: 0.30 seconds.
Column-wise computing code of SPL Enterprise Edition:
A | |
1 | D:\TPCHdata\tpchtbl1g\orders.tbl |
2 | =file(A1).import@t(;,"|").i() |
3 | =now() |
4 | =A2.avg(O_TOTALPRICE) |
5 | =A2.derive@o(O_TOTALPRICE-A4:O_DIF_AVG) |
6 | =interval@ms(A3,now()) |
Time consumed: 0.01 seconds.
SPL Enterprise Edition is well suited to column-wise computing, and the performance improves greatly.
Add a column: the numeric code contained in O_CLERK.
Python code:
import pandas as pd
import time
orders_file = "D:/TPCHdata/tpchtbl1g/orders.tbl"
orders_data = pd.read_csv(orders_file,sep = '|')
s = time.time()
orders_data['O_CLERK_NUM'] = orders_data['O_CLERK'].str.split("#",expand=True)[1].astype(int)
e = time.time()
print(e-s)
Time consumed: 2.2 seconds.
It can be seen that although both are column-adding operations, Python’s computing performance differs by two orders of magnitude depending on the computing objects (numbers versus strings).
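Because the O_CLERK values follow the fixed pattern Clerk#NNNNNNNNN, the split can also be avoided entirely by slicing off the constant 6-character "Clerk#" prefix, which is typically faster than str.split. A sketch that assumes this fixed format:

```python
import pandas as pd

clerks = pd.Series(["Clerk#000000951", "Clerk#000000880", "Clerk#000000955"])

# "Clerk#" is 6 characters, so everything from position 6 on is the number.
nums = clerks.str[6:].astype(int)
```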
SPL code:
A | |
1 | D:\TPCHdata\tpchtbl1g\orders.tbl |
2 | =file(A1).import@t(;,"|") |
3 | =now() |
4 | =A2.derive(int(O_CLERK.split("#")(2)):O_CLERK_NUM) |
5 | =interval@ms(A3,now()) |
Time consumed: 0.51 seconds.
While SPL runs slightly slower on strings than on numbers, it does not slow down by orders of magnitude.
Column-wise computing code of SPL Enterprise Edition:
A | |
1 | D:\TPCHdata\tpchtbl1g\orders.tbl |
2 | =file(A1).import@t(;,"|").i() |
3 | =now() |
4 | =A2.derive@o(int(O_CLERK.split("#")(2)):O_CLERK_NUM) |
5 | =interval@ms(A3,now()) |
Time consumed: 0.47 seconds.
Extract the first three fields.
Python code:
import pandas as pd
import numpy as np
import time
orders_file = "D:/TPCHdata/tpchtbl1g/orders.tbl"
orders_data = pd.read_csv(orders_file,sep = '|')
s = time.time()
cols = orders_data.columns
orders_3cols = orders_data.iloc[:,:3]
e = time.time()
print(e-s)
Time consumed: 0.02 seconds.
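Positional slicing with iloc works, but selecting by column name is often more robust when the physical column order might change; a sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({"O_ORDERKEY": [1, 2], "O_CUSTKEY": [10, 20],
                   "O_ORDERSTATUS": ["O", "F"], "O_TOTALPRICE": [1.0, 2.0]})

# Name-based selection does not depend on column positions.
first3 = df[["O_ORDERKEY", "O_CUSTKEY", "O_ORDERSTATUS"]]
```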
SPL code:
A | |
1 | D:\TPCHdata\tpchtbl1g\orders.tbl |
2 | =file(A1).import@t(;,"|") |
3 | =now() |
4 | =A2.new(#1,#2,#3) |
5 | =interval@ms(A3,now()) |
Time consumed: 0.14 seconds.
Column-wise computing code of SPL Enterprise Edition:
A | |
1 | D:\TPCHdata\tpchtbl1g\orders.tbl |
2 | =file(A1).import@t(;,"|").i() |
3 | =now() |
4 | =A2.new@o(#1,#2,#3) |
5 | =interval@ms(A3,now()) |
Time consumed: 0.001 seconds.
Reduce the O_TOTALPRICE of orders whose O_ORDERSTATUS is O by 10%.
Python code:
import pandas as pd
import time
orders_file = "D:/TPCHdata/tpchtbl1g/orders.tbl"
orders_data = pd.read_csv(orders_file,sep = '|',parse_dates=['O_ORDERDATE'],infer_datetime_format=True)
s = time.time()
update_price = orders_data[orders_data['O_ORDERSTATUS']=='O']['O_TOTALPRICE']*0.9
orders_data.loc[orders_data['O_ORDERSTATUS']=='O','O_TOTALPRICE'] = update_price
e = time.time()
print(e-s)
Time consumed: 0.20 seconds.
The train of thought for filtering and modifying is simple: filter first, then modify. However, Pandas does not support modifying a filtered subset directly; we have to modify the original DataFrame, so we first compute 90% of the selected prices and then write them back into the original data, which obviously filters the data twice.
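The double filtering can be avoided by computing the boolean mask once and applying an in-place multiplication through .loc; a sketch on toy data (not the benchmarked code):

```python
import pandas as pd

df = pd.DataFrame({"O_ORDERSTATUS": ["O", "F", "O"],
                   "O_TOTALPRICE": [100.0, 200.0, 300.0]})

mask = df["O_ORDERSTATUS"] == "O"    # evaluate the condition once
df.loc[mask, "O_TOTALPRICE"] *= 0.9  # scale the matching rows in place
```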
SPL code:
A | |
1 | D:\TPCHdata\tpchtbl1g\orders.tbl |
2 | =file(A1).import@t(;,"|") |
3 | =now() |
4 | =A2.select(O_ORDERSTATUS=="O") |
5 | =A4.run(O_TOTALPRICE*=0.9) |
6 | =interval@ms(A3,now()) |
Time consumed: 0.14 seconds.
In contrast, this task can be implemented in SPL simply by following the natural train of thought.
Dealing with missing values is in essence modifying data. We now compare the performance of Python and SPL on three missing-value operations.
Set 5-10 missing values for each field randomly.
Python code:
import pandas as pd
import numpy as np
import random
import time
orders_file = "D:/TPCHdata/tpchtbl1g/orders.tbl"
orders_data = pd.read_csv(orders_file,sep = '|',parse_dates=['O_ORDERDATE'],infer_datetime_format=True)
s = time.time()
l = len(orders_data)
cols = orders_data.columns
for i in cols:
    for j in range(random.randint(5, 11)):
        r = random.randint(0, l - 1)  # randint is inclusive; index l would append a new row
        orders_data.loc[r, i] = np.nan
e = time.time()
print(e-s)
Time consumed: 0.05 seconds.
SPL code:
A | B | |
1 | D:\TPCHdata\tpchtbl1g\orders.tbl | |
2 | =file(A1).import@t(;,"|") | |
3 | =now() | |
4 | =A2.len() | |
5 | for A2.fname() | =(rand(6)+5).(rand(A4)+1) |
6 | =A2(B5).field(A5,null) | |
7 | =interval@ms(A3,now()) |
Time consumed: 0.007 seconds.
Delete the records containing missing values from the data of the above example.
Python code:
import pandas as pd
import time
orders_file = "D:/TPCHdata/tpchtbl1g/orders_na.tbl"
orders_data = pd.read_csv(orders_file,sep = '|')
s = time.time()
orders_data = orders_data.dropna()
e = time.time()
print(e-s)
Time consumed: 0.75 seconds.
SPL code:
A | |
1 | D:\TPCHdata\tpchtbl1g\orders_na.tbl |
2 | =file(A1).import@t(;,"|") |
3 | =now() |
4 | =A2.select(!~.array().pos(null)) |
5 | =interval@ms(A3,now()) |
Time consumed: 0.40 seconds.
Fill missing values with the previous value.
Python code:
import pandas as pd
import time
orders_file = "D:/TPCHdata/tpchtbl1g/orders_na.tbl"
orders_data = pd.read_csv(orders_file,sep = '|')
s = time.time()
orders_data.fillna(method='ffill',inplace=True)
e = time.time()
print(e-s)
Time consumed: 0.71 seconds.
Since Pandas provides a ready-made method for filling missing values with the previous value, the code is simple.
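Note that in recent Pandas releases fillna(method='ffill') is deprecated; the dedicated DataFrame.ffill method is the current spelling, with the same forward-fill behavior. A sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, np.nan, 4.0]})

# ffill propagates the last valid observation forward.
filled = df.ffill()
```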
SPL code:
A | B | |
1 | D:\TPCHdata\tpchtbl1g\orders_na.tbl | |
2 | =file(A1).import@t(;,"|") | |
3 | =now() | |
5 | for A2.fname() | =A2.calc(A4,${A5}=if(!${A5},${A5}[-1],${A5})) |
6 | =interval@ms(A3,now()) |
Time consumed: 0.19 seconds.
SPL does not provide a ready-made method for filling missing values, but it is not difficult to code.
Fill each field’s missing values with a random existing value from that field.
Python code:
import pandas as pd
import numpy as np
import random
import time
orders_file = "D:/TPCHdata/tpchtbl1g/orders_na.tbl"
orders_data = pd.read_csv(orders_file,sep = '|')
s = time.time()
cols = orders_data.columns
l = len(orders_data)
for c in cols:
    while True:
        randn = random.randint(0, l - 1)
        rand = orders_data.loc[randn, c]
        if pd.notna(rand):  # rand != np.nan is always True and cannot detect NaN
            break
    orders_data[c].fillna(rand, inplace=True)
e = time.time()
print(e-s)
Time consumed: 0.21 seconds.
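One pitfall when checking for missing values: comparing against np.nan with != always yields True, because NaN compares unequal to everything, itself included. pd.notna (or pd.isna) is the reliable test; a minimal sketch:

```python
import numpy as np
import pandas as pd

x = np.nan

# NaN != NaN is True, so this comparison cannot detect a missing value.
always_true = (x != np.nan)

# pd.isna / pd.notna give the correct answer for scalars and arrays alike.
detected = pd.isna(x)
```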
SPL code:
A | B | C | |
1 | D:\TPCHdata\tpchtbl1g\orders_na.tbl | ||
2 | =file(A1).import@t(;,"|") | ||
3 | =now() | ||
4 | =A2.len() | ||
5 | =A2.pselect@a(~.array().pos(null)>0) | ||
6 | for A2.fname() | =null | |
7 | for !B7 | >B6=A2(rand(A4)+1).${A6} | |
8 | =A2.calc(A5,${A6}=if(!${A6},B6,${A6})) | ||
9 | =interval@ms(A3,now()) |
Time consumed: 0.17 seconds.
Convert non-numeric fields to numeric ones: within a field, the same string maps to the same number.
Python code:
import pandas as pd
import time
orders_file = "D:/TPCHdata/tpchtbl1g/orders.tbl"
orders_data = pd.read_csv(orders_file,sep = '|')
s = time.time()
dtp = orders_data.dtypes
o_cols = dtp[dtp=='object'].index
for c in o_cols:
    cmap = {}
    gc = orders_data.groupby(c)
    cn = 0
    for g in gc:
        cn += 1
        cmap[g[0]] = cn
    orders_data[c] = orders_data[c].map(cmap)
e = time.time()
print(e-s)
Time consumed: 20.4 seconds.
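Pandas actually has a one-pass primitive for exactly this mapping: pd.factorize assigns each distinct value an integer code and is far faster than the groupby loop above. The codes start at 0 and follow first-appearance order rather than the sorted group order, so the numbering differs, but equal strings still get equal codes; a sketch:

```python
import pandas as pd

s = pd.Series(["5-LOW", "1-URGENT", "5-LOW", "3-MEDIUM"])

# factorize returns (codes, uniques); equal strings receive equal codes.
codes, uniques = pd.factorize(s)
```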
SPL code:
A | B | |
1 | D:\TPCHdata\tpchtbl1g\orders.tbl | |
2 | =file(A1).import@t(;,"|") | |
3 | =now() | |
4 | =A2.fname() | |
5 | =A2(1).array().pselect@a(!ifnumber(~)) | |
6 | for A4(A5) | =A2.group(${A6}) |
7 | >B6.run(~.field(A6,B6.#)) | |
8 | =interval@ms(A3,now()) |
Time consumed: 2.85 seconds.
Comparison table of data processing performance (unit: seconds)

| Operation | Python | SPL Community Edition | SPL Enterprise Edition (column-wise computing) |
| --- | --- | --- | --- |
| Insert record | 0.32 | 0.001 | |
| Delete record | 0.13 | 0.001 | |
| Modify record | 0.006 | 0.001 | |
| Modify field name | 0.002 | 0.001 | |
| Add number computed column | 0.01 | 0.30 | 0.01 |
| Add string computed column | 2.20 | 0.51 | 0.47 |
| Extract field | 0.02 | 0.14 | 0.001 |
| Filter and modify | 0.20 | 0.14 | |
| Set missing value | 0.05 | 0.007 | |
| Delete missing value | 0.75 | 0.40 | |
| Fill missing value with previous value | 0.71 | 0.19 | |
| Fill missing value with random value | 0.21 | 0.17 | |
| Field type conversion | 20.4 | 2.85 | |
We can see from the comparison table that Python gets the upper hand over SPL Community Edition only in “add number computed column” and “extract field”, and lags behind SPL in all other items. This is because Pandas’ data structure is a matrix, whose biggest advantage is computing on pure numbers. But precisely because of the matrix, Pandas loses flexibility in computing; for example, it runs relatively slowly when processing strings. In addition, Pandas does not inherit all the methods of a matrix, and some actions, such as inserting records, can be done only after converting the data to NumPy. Another advantage of Python is that it offers more ready-made methods, such as filling missing values with the previous value, which does make coding easier. But since what this article compares is performance, Python has no advantage in that regard.
The data structure of SPL is a table sequence, which is an ordered set. These operations are nothing more than traversing, locating, and modifying, and an ordered set handles them flexibly and efficiently. SPL Community Edition is flexible and good at row-wise actions such as modification with A.run() and A.field(). However, its efficiency in column-wise computing, such as adding a numeric computed column or extracting fields, is not very high. The pure table sequence of SPL Enterprise Edition makes up for this deficiency and achieves column-wise performance comparable to Python’s. Since the pure table sequence is in turn not good at row-wise computing, the column-wise Enterprise Edition code is not given for the row-wise examples in this article; we should choose the more appropriate processing method based on the actual situation.
SPL Resource: SPL Official Website | SPL Blog | Download esProc SPL | SPL Source Code