Many companies in the same domain share common suffixes.
For example, many high-tech company names contain words such as
systems, technology, technologies, or tech.
Removing these words should help the matching later.
For example, my matching data currently contains "Cisco Systems" and the string to be matched is "Cisco", but the match score is only 37%. If I could preprocess "Cisco Systems" to "Cisco", I think the match score would be higher.
I think we just need another parameter in the name_matcher constructor to pass in a custom set of words, which would be stripped after punctuation, white space, etc. have been removed.
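To make the request concrete, here is a minimal sketch of the kind of preprocessing I mean, done outside the library. The function name strip_common_words and the word list are hypothetical and are not part of name_matcher; it only assumes that names are lowercased and punctuation is removed before the custom words are stripped.

```python
import string

def strip_common_words(name, common_words):
    """Lowercase a name, drop punctuation, then remove whole words in common_words."""
    # Replace every punctuation character with a space so "Cisco-Systems" splits in two
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    cleaned_words = name.lower().translate(table).split()
    blocked = {word.lower() for word in common_words}
    kept = [word for word in cleaned_words if word not in blocked]
    # If every word was stripped, fall back to the cleaned name instead of an empty string
    return " ".join(kept) if kept else " ".join(cleaned_words)

# Hypothetical word list for high-tech company names
tech_words = ["systems", "technology", "technologies", "tech"]
print(strip_common_words("Cisco Systems", tech_words))
```

Only whole words are removed, so a name like "Techtronic" would not lose its "tech" prefix.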
This can be done by setting common_words to True; the most common words will then be discounted when calculating the score. In the latest version (0.8.10), common_words can also be a list, so you can supply a custom set of words that should be discounted.
When constructing the name_matcher, the common_words argument can now be passed as a list; the words in this list won't count when calculating the score. This can be done as follows: nm = NameMatcher(common_words=['technology', 'systems', 'technologies'])
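For intuition about why discounting these words raises the score, here is a rough standalone sketch. It is not how name_matcher computes its score: it uses difflib.SequenceMatcher from the Python standard library as a stand-in metric, and the similarity helper is hypothetical, so the numbers will not match the 37% reported above.

```python
from difflib import SequenceMatcher

def similarity(a, b, common_words=()):
    """Compare two names with a stand-in metric, ignoring whole words in common_words."""
    ignored = {word.lower() for word in common_words}

    def normalise(name):
        # Lowercase and drop the ignored words before comparing
        return " ".join(w for w in name.lower().split() if w not in ignored)

    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

plain = similarity("Cisco", "Cisco Systems")
discounted = similarity("Cisco", "Cisco Systems", common_words=["systems"])
print(f"without discounting: {plain:.2f}, with discounting: {discounted:.2f}")
```

With "systems" ignored, both names reduce to the same string, so the stand-in score becomes 1.0; without it, the extra word pulls the score down.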