Allow ability to pass in list of company name suffixes to be stripped after your current preprocessing step. #23

Open
anatesan-stream opened this issue Apr 28, 2024 · 1 comment

Comments

@anatesan-stream

Many companies in the same domain share common suffixes.
For example, many high-tech company names contain words such as
- systems, technology, technologies, tech, etc.
Removing these words would help the matching later.

For example, my matching data currently contains "Cisco Systems" and the string to be matched is "Cisco", but the match score is only 37%. If I could preprocess "Cisco Systems" to "Cisco", I think the match score would be higher.

I think we just need another parameter in the name_matcher constructor to pass in a custom set of words, which would be stripped after punctuation, whitespace, etc. have been removed.
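
A minimal sketch of the kind of preprocessing described above, using only the standard library. The helper name `strip_suffixes` and the `DEFAULT_SUFFIXES` tuple are hypothetical and are not part of name_matching; the sketch assumes only trailing words should be removed and that at least one word of the name is always kept.

```python
import re

# Hypothetical suffix list for illustration; not part of name_matching.
DEFAULT_SUFFIXES = ("systems", "technology", "technologies", "tech")


def strip_suffixes(name, suffixes=DEFAULT_SUFFIXES):
    """Lowercase a name, drop punctuation, then remove trailing suffix words."""
    # Replace punctuation with spaces and normalise whitespace via split().
    words = re.sub(r"[^\w\s]", " ", name.lower()).split()
    suffix_set = set(suffixes)
    # Pop trailing suffix words, but never empty the name completely.
    while len(words) > 1 and words[-1] in suffix_set:
        words.pop()
    return " ".join(words)


print(strip_suffixes("Cisco Systems"))  # -> cisco
```

Applying such a function to both the matching data and the strings to be matched would make "Cisco Systems" and "Cisco" identical before any score is calculated.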

@mnijhuis-dnb
Collaborator

This can be done by setting the common_words bool to True; the most common words are then discounted when calculating the score. In the latest version (0.8.10), common_words can also be a list, so you can supply a custom set of words that should be discounted.

When constructing the name_matcher, the common_words argument can now be given as a list; the words in this list won't count when calculating the score. This can be done as follows:
nm = NameMatcher(common_words=['technology','systems','technologies'])
