This is library that provides efficient way to use JOIN
with comparison.
In this library, supported comparison conditions are:
JOIN yyy WHERE yyy.a BETWEEN xxx.b AND xxx.c
JOIN yyy WHERE yyy.a > xxx.b
JOIN yyy WHERE yyy.a >= xxx.b
JOIN yyy WHERE yyy.a < xxx.b
JOIN yyy WHERE yyy.a <= xxx.b
Please refer Performance if you need performance information.
0.0.2
- python >= 3.6
- nose >= 1.3.7
- numpy >= 1.14.0
- pandas >= 0.22.0
pip install pandas-bj
For BETWEEN,
import pandas_bj
import pandas
df1 = pandas.DataFrame({
'id1': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
'id2': [1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2],
's': [1, 2, 3, 4, 5, 2, 3, 4, 5, 6, 3, 4, 5, 6, 7],
'e': [5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]}
)
df2 = pandas.DataFrame({
'id3': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
'id4': [1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2],
'v': [1, 5, 2, 6, 3, 7, 4, 8, 5, 9, 6, 10, 7, 11, 8]}
)
# WHERE xxx.id1 = yyy.id3 AND
# xxx.id2 = yyy.id4 AND
# yyy.v BETWEEN xxx.s AND xxx.e
result = pandas_bj.merge(
left=df1, right=df2,
left_on=['id1', 'id2', pandas_bj.Between('s', 'e', True, True)], right_on=['id3', 'id4', 'v'],
how='inner'
)
print(result)
e id1 id2 s id3 id4 v
0 5.0 1.0 1.0 1.0 1.0 1.0 2.0
1 6.0 1.0 1.0 2.0 1.0 1.0 5.0
2 7.0 1.0 1.0 3.0 1.0 1.0 5.0
3 8.0 1.0 2.0 4.0 1.0 2.0 6.0
4 9.0 1.0 2.0 5.0 1.0 2.0 6.0
5 5.0 2.0 1.0 2.0 2.0 1.0 4.0
6 6.0 2.0 1.0 3.0 2.0 1.0 4.0
7 7.0 2.0 2.0 4.0 2.0 2.0 5.0
8 9.0 2.0 2.0 6.0 2.0 2.0 8.0
9 10.0 3.0 1.0 3.0 3.0 1.0 6.0
10 10.0 3.0 1.0 3.0 3.0 1.0 7.0
11 11.0 3.0 1.0 4.0 3.0 1.0 6.0
12 11.0 3.0 1.0 4.0 3.0 1.0 10.0
13 11.0 3.0 1.0 4.0 3.0 1.0 7.0
14 12.0 3.0 1.0 5.0 3.0 1.0 6.0
15 12.0 3.0 1.0 5.0 3.0 1.0 10.0
16 12.0 3.0 1.0 5.0 3.0 1.0 7.0
17 13.0 3.0 2.0 6.0 3.0 2.0 11.0
18 13.0 3.0 2.0 6.0 3.0 2.0 8.0
19 14.0 3.0 2.0 7.0 3.0 2.0 11.0
20 14.0 3.0 2.0 7.0 3.0 2.0 8.0
Use sort
for more performance.
result = pandas_bj.merge(
left=df1, right=df2,
left_on=['id1', 'id2', pandas_bj.Between('s', 'e', True, True)], right_on=['id3', 'id4', 'v'],
how='inner',
sort=True
)
For other conditions,
# WHERE xxx.id1 = yyy.id3 AND
# xxx.id2 = yyy.id4 AND
# xxx.s < yyy.v
result = pandas_bj.merge(
left=df1, right=df2,
left_on=['id1', 'id2', pandas_bj.GT('s')], right_on=['id3', 'id4', 'v'],
how='inner',
sort=True
)
print('GT')
print(result)
# WHERE xxx.id1 = yyy.id3 AND
# xxx.id2 = yyy.id4 AND
# xxx.s <= yyy.v
result = pandas_bj.merge(
left=df1, right=df2,
left_on=['id1', 'id2', pandas_bj.GE('s')], right_on=['id3', 'id4', 'v'],
how='inner',
sort=True
)
print('GE')
print(result)
# WHERE xxx.id1 = yyy.id3 AND
# xxx.id2 = yyy.id4 AND
# xxx.e > yyy.v
result = pandas_bj.merge(
left=df1, right=df2,
left_on=['id1', 'id2', pandas_bj.LT('e')], right_on=['id3', 'id4', 'v'],
how='inner',
sort=True
)
print('LT')
print(result)
# WHERE xxx.id1 = yyy.id3 AND
# xxx.id2 = yyy.id4 AND
# xxx.e >= yyy.v
result = pandas_bj.merge(
left=df1, right=df2,
left_on=['id1', 'id2', pandas_bj.LE('e')], right_on=['id3', 'id4', 'v'],
how='inner',
sort=True
)
print('LE')
print(result)
GT
e id1 id2 s id3 id4 v
0 5.0 1.0 1.0 1.0 1.0 1.0 5.0
1 5.0 1.0 1.0 1.0 1.0 1.0 2.0
2 6.0 1.0 1.0 2.0 1.0 1.0 5.0
3 7.0 1.0 1.0 3.0 1.0 1.0 5.0
4 8.0 1.0 2.0 4.0 1.0 2.0 6.0
5 9.0 1.0 2.0 5.0 1.0 2.0 6.0
6 5.0 2.0 1.0 2.0 2.0 1.0 7.0
7 5.0 2.0 1.0 2.0 2.0 1.0 4.0
8 6.0 2.0 1.0 3.0 2.0 1.0 7.0
9 6.0 2.0 1.0 3.0 2.0 1.0 4.0
10 7.0 2.0 2.0 4.0 2.0 2.0 8.0
11 7.0 2.0 2.0 4.0 2.0 2.0 5.0
12 7.0 2.0 2.0 4.0 2.0 2.0 9.0
13 8.0 2.0 2.0 5.0 2.0 2.0 8.0
14 8.0 2.0 2.0 5.0 2.0 2.0 9.0
15 9.0 2.0 2.0 6.0 2.0 2.0 8.0
16 9.0 2.0 2.0 6.0 2.0 2.0 9.0
17 10.0 3.0 1.0 3.0 3.0 1.0 6.0
18 10.0 3.0 1.0 3.0 3.0 1.0 10.0
19 10.0 3.0 1.0 3.0 3.0 1.0 7.0
20 11.0 3.0 1.0 4.0 3.0 1.0 6.0
21 11.0 3.0 1.0 4.0 3.0 1.0 10.0
22 11.0 3.0 1.0 4.0 3.0 1.0 7.0
23 12.0 3.0 1.0 5.0 3.0 1.0 6.0
24 12.0 3.0 1.0 5.0 3.0 1.0 10.0
25 12.0 3.0 1.0 5.0 3.0 1.0 7.0
26 13.0 3.0 2.0 6.0 3.0 2.0 11.0
27 13.0 3.0 2.0 6.0 3.0 2.0 8.0
28 14.0 3.0 2.0 7.0 3.0 2.0 11.0
29 14.0 3.0 2.0 7.0 3.0 2.0 8.0
GE
e id1 id2 s id3 id4 v
0 5.0 1.0 1.0 1.0 1.0 1.0 1.0
1 5.0 1.0 1.0 1.0 1.0 1.0 5.0
2 5.0 1.0 1.0 1.0 1.0 1.0 2.0
3 6.0 1.0 1.0 2.0 1.0 1.0 5.0
4 6.0 1.0 1.0 2.0 1.0 1.0 2.0
5 7.0 1.0 1.0 3.0 1.0 1.0 5.0
6 8.0 1.0 2.0 4.0 1.0 2.0 6.0
7 9.0 1.0 2.0 5.0 1.0 2.0 6.0
8 5.0 2.0 1.0 2.0 2.0 1.0 7.0
9 5.0 2.0 1.0 2.0 2.0 1.0 4.0
10 6.0 2.0 1.0 3.0 2.0 1.0 7.0
11 6.0 2.0 1.0 3.0 2.0 1.0 4.0
12 7.0 2.0 2.0 4.0 2.0 2.0 8.0
13 7.0 2.0 2.0 4.0 2.0 2.0 5.0
14 7.0 2.0 2.0 4.0 2.0 2.0 9.0
15 8.0 2.0 2.0 5.0 2.0 2.0 8.0
16 8.0 2.0 2.0 5.0 2.0 2.0 5.0
17 8.0 2.0 2.0 5.0 2.0 2.0 9.0
18 9.0 2.0 2.0 6.0 2.0 2.0 8.0
19 9.0 2.0 2.0 6.0 2.0 2.0 9.0
20 10.0 3.0 1.0 3.0 3.0 1.0 6.0
21 10.0 3.0 1.0 3.0 3.0 1.0 10.0
22 10.0 3.0 1.0 3.0 3.0 1.0 7.0
23 11.0 3.0 1.0 4.0 3.0 1.0 6.0
24 11.0 3.0 1.0 4.0 3.0 1.0 10.0
25 11.0 3.0 1.0 4.0 3.0 1.0 7.0
26 12.0 3.0 1.0 5.0 3.0 1.0 6.0
27 12.0 3.0 1.0 5.0 3.0 1.0 10.0
28 12.0 3.0 1.0 5.0 3.0 1.0 7.0
29 13.0 3.0 2.0 6.0 3.0 2.0 11.0
30 13.0 3.0 2.0 6.0 3.0 2.0 8.0
31 14.0 3.0 2.0 7.0 3.0 2.0 11.0
32 14.0 3.0 2.0 7.0 3.0 2.0 8.0
LT
e id1 id2 s id3 id4 v
0 5.0 1.0 1.0 1.0 1.0 1.0 1.0
1 5.0 1.0 1.0 1.0 1.0 1.0 2.0
2 6.0 1.0 1.0 2.0 1.0 1.0 1.0
3 6.0 1.0 1.0 2.0 1.0 1.0 5.0
4 6.0 1.0 1.0 2.0 1.0 1.0 2.0
5 7.0 1.0 1.0 3.0 1.0 1.0 1.0
6 7.0 1.0 1.0 3.0 1.0 1.0 5.0
7 7.0 1.0 1.0 3.0 1.0 1.0 2.0
8 8.0 1.0 2.0 4.0 1.0 2.0 6.0
9 8.0 1.0 2.0 4.0 1.0 2.0 3.0
10 9.0 1.0 2.0 5.0 1.0 2.0 6.0
11 9.0 1.0 2.0 5.0 1.0 2.0 3.0
12 5.0 2.0 1.0 2.0 2.0 1.0 4.0
13 6.0 2.0 1.0 3.0 2.0 1.0 4.0
14 7.0 2.0 2.0 4.0 2.0 2.0 5.0
15 8.0 2.0 2.0 5.0 2.0 2.0 5.0
16 9.0 2.0 2.0 6.0 2.0 2.0 8.0
17 9.0 2.0 2.0 6.0 2.0 2.0 5.0
18 10.0 3.0 1.0 3.0 3.0 1.0 6.0
19 10.0 3.0 1.0 3.0 3.0 1.0 7.0
20 11.0 3.0 1.0 4.0 3.0 1.0 6.0
21 11.0 3.0 1.0 4.0 3.0 1.0 10.0
22 11.0 3.0 1.0 4.0 3.0 1.0 7.0
23 12.0 3.0 1.0 5.0 3.0 1.0 6.0
24 12.0 3.0 1.0 5.0 3.0 1.0 10.0
25 12.0 3.0 1.0 5.0 3.0 1.0 7.0
26 13.0 3.0 2.0 6.0 3.0 2.0 11.0
27 13.0 3.0 2.0 6.0 3.0 2.0 8.0
28 14.0 3.0 2.0 7.0 3.0 2.0 11.0
29 14.0 3.0 2.0 7.0 3.0 2.0 8.0
LE
e id1 id2 s id3 id4 v
0 5.0 1.0 1.0 1.0 1.0 1.0 1.0
1 5.0 1.0 1.0 1.0 1.0 1.0 5.0
2 5.0 1.0 1.0 1.0 1.0 1.0 2.0
3 6.0 1.0 1.0 2.0 1.0 1.0 1.0
4 6.0 1.0 1.0 2.0 1.0 1.0 5.0
5 6.0 1.0 1.0 2.0 1.0 1.0 2.0
6 7.0 1.0 1.0 3.0 1.0 1.0 1.0
7 7.0 1.0 1.0 3.0 1.0 1.0 5.0
8 7.0 1.0 1.0 3.0 1.0 1.0 2.0
9 8.0 1.0 2.0 4.0 1.0 2.0 6.0
10 8.0 1.0 2.0 4.0 1.0 2.0 3.0
11 9.0 1.0 2.0 5.0 1.0 2.0 6.0
12 9.0 1.0 2.0 5.0 1.0 2.0 3.0
13 5.0 2.0 1.0 2.0 2.0 1.0 4.0
14 6.0 2.0 1.0 3.0 2.0 1.0 4.0
15 7.0 2.0 2.0 4.0 2.0 2.0 5.0
16 8.0 2.0 2.0 5.0 2.0 2.0 8.0
17 8.0 2.0 2.0 5.0 2.0 2.0 5.0
18 9.0 2.0 2.0 6.0 2.0 2.0 8.0
19 9.0 2.0 2.0 6.0 2.0 2.0 5.0
20 9.0 2.0 2.0 6.0 2.0 2.0 9.0
21 10.0 3.0 1.0 3.0 3.0 1.0 6.0
22 10.0 3.0 1.0 3.0 3.0 1.0 10.0
23 10.0 3.0 1.0 3.0 3.0 1.0 7.0
24 11.0 3.0 1.0 4.0 3.0 1.0 6.0
25 11.0 3.0 1.0 4.0 3.0 1.0 10.0
26 11.0 3.0 1.0 4.0 3.0 1.0 7.0
27 12.0 3.0 1.0 5.0 3.0 1.0 6.0
28 12.0 3.0 1.0 5.0 3.0 1.0 10.0
29 12.0 3.0 1.0 5.0 3.0 1.0 7.0
30 13.0 3.0 2.0 6.0 3.0 2.0 11.0
31 13.0 3.0 2.0 6.0 3.0 2.0 8.0
32 14.0 3.0 2.0 7.0 3.0 2.0 11.0
33 14.0 3.0 2.0 7.0 3.0 2.0 8.0
- inner
- left
- right
- outer
- bool
True
to use sort for all join keysFalse
not to use sort for all join keys
- List of ints
[0, 1]
to use sort for first and second join keys[]
equals toFalse
- Randomly generate X and Y data frames.
- Both X and Y has integer id1: (0, 100] and integer id2: (0, 50].
- X has float s: (0, 1000] and e: (0, 1000] and e >= s.
- Y has float v: (0, 1000].
- Use left_on=
['id1', 'id2', pandas_bj.Between('s', 'e')]
, right_on=['id1', 'id2', 'v']
See test/performance.py
for more information.
X record count | Y record count | use Sort | Time in sec | Joined Y record count per X |
---|---|---|---|---|
100 | 1,000 | False | 0.1499 | 1.0 |
100 | 1,000 | True | 0.0614 | 1.0 |
1,000 | 10,000 | False | 8.1311 | 1.4669 |
1,000 | 10,000 | True | 0.3608 | 1.4669 |
10,000 | 100,000 | True | 3.843 | 6.0406 |
10,000 | 1,000,000 | True | 28.5253 | 51.8505 |
When you need to Join 1,000,000
X records with 10,000
Y records with BETWEEN
,
and it is expected that 50
X records are joined per a Y record in average,
pandas-bj can create result in 30 seconds.