created a more user-friendly error message when bad data is found #56

Open: wants to merge 13 commits into master

ParticularMiner
Contributor

Notify: @Bergvca

Here I suggest a solution to one user's problem (see #43 (comment)). It was a bit more difficult to implement than I thought. :)

import random
import string
from datetime import datetime
import pandas as pd
import numpy as np
from string_grouper import compute_pairwise_similarities

Create a Series with a few random strings:

strings = [''.join(random.choices(string.ascii_uppercase + string.digits, k=10)) for _ in range(20)]
good_series = pd.Series(strings, name='left')
good_series.to_frame()
left
0 6P1UMBC8D8
1 ONWZTJ53E1
2 TO7AADMIAD
3 6Y1QDGIKZ5
4 J53R2HZI96
5 Q383BO2VLK
6 0KINOSJ5JU
7 J8AHSMJNOE
8 IZL32I7VPC
9 9RHVQHA0N3
10 XUVDL96FDL
11 M7ROKPJ2IQ
12 MNXWZHRBPJ
13 1QSN3KG4DM
14 UW9EC83LDH
15 DHZLAQHUWI
16 M6HP4FH88Z
17 CNMKI44QWZ
18 DCVVKSSUO7
19 27B9P0B68L

Generate another Series of strings with some bad (non-string or empty string) values:

bad_series = pd.Series(
    random.choices(
        [None, np.nan, "", datetime.now()] * 5
        + strings
        + list(range(111, 115)),
        k=20,
    ),
    name='right',
).rename_axis('id')
bad_series.to_frame()
right
id
0 MNXWZHRBPJ
1 M6HP4FH88Z
2 1QSN3KG4DM
3
4 None
5 2021-05-09 12:27:18.736565
6 2021-05-09 12:27:18.736565
7 2021-05-09 12:27:18.736565
8 DCVVKSSUO7
9 MNXWZHRBPJ
10 27B9P0B68L
11 IZL32I7VPC
12 UW9EC83LDH
13 112
14 MNXWZHRBPJ
15 1QSN3KG4DM
16 None
17 None
18 None
19 NaN
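
For orientation, here is a minimal sketch of how such bad values could be located with plain pandas. It assumes that "bad" means anything that is not a non-empty `str`, as described above. The helper name `find_non_strings` is hypothetical and not part of string_grouper, and the check in this PR may be implemented differently.

```python
import numpy as np
import pandas as pd


def find_non_strings(series: pd.Series) -> pd.Series:
    """Return the entries of `series` that are not non-empty strings.

    Hypothetical helper for illustration only; not string_grouper's API.
    """
    # True where the value is a str with at least one character
    is_good = series.map(
        lambda value: isinstance(value, str) and len(value) > 0
    ).astype(bool)
    # Keep only the offending entries, with their original index labels
    return series[~is_good]


example = pd.Series(
    ["MNXWZHRBPJ", "", None, 112, np.nan, "27B9P0B68L"],
    name="right",
).rename_axis("id")

issues = find_non_strings(example)
print(issues.index.tolist())  # [1, 2, 3, 4]
```

Because the original index is preserved, the result points directly at the rows that need cleaning.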

Notice the error message after the traceback log:

compute_pairwise_similarities(good_series, bad_series)
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-10-56153281113f> in <module>
----> 1 compute_pairwise_similarities(good_series, bad_series)


~\eclipse-workspace\string_grouper\string_grouper\string_grouper.py in this(*args, **kwargs)
     61     # function "this" in the first parameter position
     62     def this(*args, **kwargs):
---> 63         return func(this, *args, **kwargs)
     64     return this
     65 


~\eclipse-workspace\string_grouper\string_grouper\string_grouper.py in compute_pairwise_similarities(this, string_series_1, string_series_2, **kwargs)
     86         this.issues = sg.issues
     87         this.issues.rename(f'Non-strings in Series {sname}', inplace=True)
---> 88         raise TypeError(sg.error_msg(sname, 'compute_pairwise_similarities'))
     89     return sg.dot()
     90 


TypeError: 

ERROR: Input pandas Series 'right' (string_series_2) contains values that are not strings!
Display the pandas Series 'compute_pairwise_similarities.issues' to find where these values are:
   Non-strings in Series 'right' (string_series_2)
id                                                
3                                                 
4                                             None
5                       2021-05-09 12:27:18.736565
6                       2021-05-09 12:27:18.736565
7                       2021-05-09 12:27:18.736565
13                                             112
16                                            None
17                                            None
18                                            None
19                                             NaN

compute_pairwise_similarities.issues
id
3                               
4                           None
5     2021-05-09 12:27:18.736565
6     2021-05-09 12:27:18.736565
7     2021-05-09 12:27:18.736565
13                           112
16                          None
17                          None
18                          None
19                           NaN
Name: Non-strings in Series 'right' (string_series_2), dtype: object
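
The traceback above shows a wrapper named `this` that passes itself to the wrapped function as its first argument, which is what allows `issues` to be stored as an attribute on the function. The following is a rough, pure-Python sketch of that pattern. The names `passes_itself` and `check_all_strings` are hypothetical, and the real code in `string_grouper.py` works on pandas Series rather than lists.

```python
import functools


def passes_itself(func):
    """Decorator that hands the wrapper to `func` as its first argument."""
    @functools.wraps(func)
    def this(*args, **kwargs):
        return func(this, *args, **kwargs)
    return this


@passes_itself
def check_all_strings(this, values):
    """Hypothetical stand-in for a high-level function that validates input."""
    # Record (position, value) for every entry that is not a non-empty str
    this.issues = [
        (i, v) for i, v in enumerate(values)
        if not (isinstance(v, str) and v)
    ]
    if this.issues:
        raise TypeError(f"{len(this.issues)} value(s) are not non-empty strings")
    return True


try:
    check_all_strings(["abc", "", None, 112])
except TypeError as err:
    print(err)
    # The offending values stay available on the function after the error
    print(check_all_strings.issues)
```

Storing the result on the function keeps the error message short while still letting the caller inspect the full list of bad values afterwards.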

Similar functionality exists for the other high-level functions: group_similar_strings(), match_most_similar() and match_strings().

@ParticularMiner
Contributor Author

Hi @Bergvca

Just noticed you merged the other PR. If you intend to merge the next two, perhaps it would be best to start with this one as it has fewer changes than the other.

:)

@Bergvca
Owner

Bergvca commented May 12, 2021

ok thanks, will do :)

Conflicts (now resolved):
	README.md
	string_grouper/string_grouper.py
	string_grouper/test/test_string_grouper.py
@ParticularMiner force-pushed the nonstr_error branch 2 times, most recently from b6180ae to 539757d (July 5, 2021 04:15)