Allow ability to pass in list of company name suffixes to be stripped after your current preprocessing step. #23

anatesan-stream · 2024-04-28T23:52:29Z

Many companies in the same domain share common suffixes.
For example, many high-tech company names contain words such as
systems, technology, technologies, or tech.
Removing these words would help the matching later.

For example, my matching data currently contains "Cisco Systems" and the string to be matched is "Cisco", but the match score is only 37%. If I could preprocess "Cisco Systems" to "Cisco", I think the match score would be higher.

I think we just need another parameter in the name_matcher constructor that takes a custom set of words to strip after punctuation, whitespace, etc. have been removed.
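To make the request concrete, here is a rough sketch of the kind of stripping step I have in mind. It is plain Python and does not use name_matching at all; the function name `strip_suffixes` and the `DEFAULT_SUFFIXES` set are made up for illustration, and the punctuation handling is only a stand-in for whatever the library's own preprocessing does.

```python
import re

# Hypothetical suffix set for illustration only; not part of name_matching.
DEFAULT_SUFFIXES = frozenset({"systems", "technology", "technologies", "tech"})


def strip_suffixes(name, suffixes=DEFAULT_SUFFIXES):
    """Lower-case a name, drop punctuation, then remove trailing suffix words."""
    # Replace anything that is not a word character or whitespace with a space.
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    tokens = cleaned.split()
    # Pop suffix words from the end, but always keep at least one token
    # so a name that consists only of a suffix word is not emptied.
    while len(tokens) > 1 and tokens[-1] in suffixes:
        tokens.pop()
    return " ".join(tokens)


print(strip_suffixes("Cisco Systems"))
```

Only trailing words are removed here, so a name like "Tech Data" is left alone. Whether the words should be stripped anywhere in the name or only at the end is a design choice I have no strong opinion on.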

mnijhuis-dnb · 2024-05-07T15:26:03Z

This can be done by setting the common_words argument to True; the most common words are then discounted when calculating the score. In the latest version (0.8.10), common_words can also be a list, so you can supply a custom set of words to discount.

When constructing the name_matcher, pass the list as the common_words argument; the words in this list won't count when calculating the score. For example:
nm = NameMatcher(common_words=['technology','systems','technologies'])
