created a more user-friendly error message when bad data is found #56

ParticularMiner · 2021-05-09T10:43:35Z

Here I suggest a solution to one user's problem (see #43 (comment)). It was a bit more difficult to implement than I thought. :)

import random
import string
from datetime import datetime
import pandas as pd
import numpy as np
from string_grouper import compute_pairwise_similarities

Create a Series with a few random strings:

strings = [''.join(random.choices(string.ascii_uppercase + string.digits, k=10)) for _ in range(20)]
good_series = pd.Series(strings, name='left')
good_series.to_frame()

	left
0	6P1UMBC8D8
1	ONWZTJ53E1
2	TO7AADMIAD
3	6Y1QDGIKZ5
4	J53R2HZI96
5	Q383BO2VLK
6	0KINOSJ5JU
7	J8AHSMJNOE
8	IZL32I7VPC
9	9RHVQHA0N3
10	XUVDL96FDL
11	M7ROKPJ2IQ
12	MNXWZHRBPJ
13	1QSN3KG4DM
14	UW9EC83LDH
15	DHZLAQHUWI
16	M6HP4FH88Z
17	CNMKI44QWZ
18	DCVVKSSUO7
19	27B9P0B68L

Generate another Series of strings with some bad (non-string or empty string) values:

bad_series = pd.Series(
    random.choices(
        # bad values (repeated so they are drawn often), good strings, and a few integers
        [None, np.nan, "", datetime.now()] * 5
        + strings
        + list(range(111, 115)),
        k=20
    ),
    name='right'
).rename_axis('id')
bad_series.to_frame()

	right
id
0	MNXWZHRBPJ
1	M6HP4FH88Z
2	1QSN3KG4DM
3
4	None
5	2021-05-09 12:27:18.736565
6	2021-05-09 12:27:18.736565
7	2021-05-09 12:27:18.736565
8	DCVVKSSUO7
9	MNXWZHRBPJ
10	27B9P0B68L
11	IZL32I7VPC
12	UW9EC83LDH
13	112
14	MNXWZHRBPJ
15	1QSN3KG4DM
16	None
17	None
18	None
19	NaN

Notice the error message after the traceback log:

compute_pairwise_similarities(good_series, bad_series)

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-10-56153281113f> in <module>
----> 1 compute_pairwise_similarities(good_series, bad_series)


~\eclipse-workspace\string_grouper\string_grouper\string_grouper.py in this(*args, **kwargs)
     61     # function "this" in the first parameter position
     62     def this(*args, **kwargs):
---> 63         return func(this, *args, **kwargs)
     64     return this
     65 


~\eclipse-workspace\string_grouper\string_grouper\string_grouper.py in compute_pairwise_similarities(this, string_series_1, string_series_2, **kwargs)
     86         this.issues = sg.issues
     87         this.issues.rename(f'Non-strings in Series {sname}', inplace=True)
---> 88         raise TypeError(sg.error_msg(sname, 'compute_pairwise_similarities'))
     89     return sg.dot()
     90 


TypeError: 

ERROR: Input pandas Series 'right' (string_series_2) contains values that are not strings!
Display the pandas Series 'compute_pairwise_similarities.issues' to find where these values are:
   Non-strings in Series 'right' (string_series_2)
id                                                
3                                                 
4                                             None
5                       2021-05-09 12:27:18.736565
6                       2021-05-09 12:27:18.736565
7                       2021-05-09 12:27:18.736565
13                                             112
16                                            None
17                                            None
18                                            None
19                                             NaN

compute_pairwise_similarities.issues

id
3                               
4                           None
5     2021-05-09 12:27:18.736565
6     2021-05-09 12:27:18.736565
7     2021-05-09 12:27:18.736565
13                           112
16                          None
17                          None
18                          None
19                           NaN
Name: Non-strings in Series 'right' (string_series_2), dtype: object
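
The traceback above shows the wrapper `this` being passed to the function as its first argument, which is how `compute_pairwise_similarities.issues` is still available after the exception is raised. A minimal pure-Python sketch of that pattern follows. The names `with_self_reference` and `count_characters` are hypothetical and are not part of string_grouper, and a plain dict stands in for the pandas Series of issues.

```python
import functools


def with_self_reference(func):
    """Hypothetical decorator: pass the wrapper itself as the first argument."""
    @functools.wraps(func)
    def this(*args, **kwargs):
        return func(this, *args, **kwargs)
    return this


@with_self_reference
def count_characters(this, values):
    """Toy stand-in for a high-level function (not string_grouper code)."""
    # Collect every entry that is not a non-empty string, keyed by position.
    bad = {i: v for i, v in enumerate(values) if not (isinstance(v, str) and v)}
    if bad:
        # Stored on the function object, so it survives the raise below.
        this.issues = bad
        raise TypeError(
            f"{len(bad)} value(s) are not non-empty strings; "
            "see count_characters.issues"
        )
    this.issues = {}
    return [len(v) for v in values]


try:
    count_characters(["abc", None, "", 112])
except TypeError as err:
    message = str(err)

# The offending entries remain inspectable on the function after the error.
found = count_characters.issues
```

Attaching the issues to the function keeps the error message short while still letting the user look up exactly which entries were rejected.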

Similar functionality exists for the other high-level functions: group_similar_strings(), match_most_similar() and match_strings().
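
For users who would rather clean their data before calling any of these functions, the same kind of check can be approximated with pandas alone. This is only a sketch under the assumption that "bad" means anything that is not a non-empty string, as in the example above. The helper `find_bad_values` is hypothetical and is not provided by string_grouper.

```python
from datetime import datetime

import pandas as pd


def find_bad_values(series):
    """Return the entries of `series` that are not non-empty strings (hypothetical helper)."""
    is_good = series.map(lambda v: isinstance(v, str) and v != "").astype(bool)
    return series[~is_good]


# A small Series mixing one good string with several kinds of bad values.
right = pd.Series(
    ["MNXWZHRBPJ", "", None, datetime(2021, 5, 9), 112, float("nan")],
    name="right",
).rename_axis("id")

bad = find_bad_values(right)    # entries that would be reported as issues
clean = right.drop(bad.index)   # what remains is safe to pass on
```

Dropping rows changes the index, so any later join back to the original data should rely on the surviving `id` labels rather than on positions.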

ParticularMiner · 2021-05-12T19:03:40Z

Hi @Bergvca

Just noticed you merged the other PR. If you intend to merge the next two, perhaps it would be best to start with this one as it has fewer changes than the other.

:)

Bergvca · 2021-05-12T19:35:29Z

ok thanks, will do :)

ParticularMiner added 7 commits April 26, 2021 22:34

boosted _symmetrize_matches_list() (5x) and _get_matches_list() (33x)

92436e3

made more pypi-friendly changes in README.md

99545de

fixed bug related to single-valued input Series

35dddd9

made more pypi-friendly changes in README.md

bea485f

Merge branch 'boost' into squeeze

7aad99f

Conflicts (resolved): string_grouper/string_grouper.py

added unittest for get_groups() with single-valued input Series

2e5d9a3

fixed remaining squeeze() bugs

4a0b225

ParticularMiner force-pushed the nonstr_error branch from f2d3a6c to 82e3145 Compare May 9, 2021 16:09

added error-handler to capture non-strings in input Series

faa974c

ParticularMiner force-pushed the nonstr_error branch from 82e3145 to faa974c Compare May 9, 2021 16:29

made PEP8-conforming modifications

0bc533f

ParticularMiner force-pushed the nonstr_error branch from 5515f9c to 0bc533f Compare May 11, 2021 10:59

updated string_grouper_utils.py to quell unittest deprecated warnings

02ad030

ParticularMiner added 2 commits July 4, 2021 00:06

set max_n_matches=1 in match_most_similar() for a performance boost

e4686e5

Merge remote-tracking branch 'origin/most_similar' into nonstr_error

6711bb7

Conflicts (now resolved): README.md string_grouper/string_grouper.py string_grouper/test/test_string_grouper.py

ParticularMiner force-pushed the nonstr_error branch 2 times, most recently from b6180ae to 539757d Compare July 5, 2021 04:15

changed default value of kwarg max_n_matches to #strings in master

859aa4b

ParticularMiner force-pushed the nonstr_error branch from 539757d to 859aa4b Compare July 5, 2021 04:32
