How to include context words to the pre-existence recognizers #804

Idomingog · 2021-11-21T13:27:27Z

Hi,

I'm using presidio with a medical text in Spanish.
I have two different questions, both related to context words.

I would like to ask how I can add context words to existing recognizers. I checked the methods of the AnalyzerEngine() object, and that does not seem to be the right way to do it.

I can create a new recognizer for a specific PII type, like the Spanish zip code, where I'm able to provide context words, and it works perfectly.

For some common PII types, the system classifies the wrong word. For example (remember I work in Spanish):

Users’ data:
Nombre: Javier ( Name: xxxxxx )
Apellidos: Sanchez Casado ( Surname: xxxx )
...

Sometimes the system gives me this result:

Nombre:
:

My understanding is that if I could provide some context words, like "Apellidos", the system would identify the surnames as a PERSON, but not the field label itself.

The second question is related. I have a street address, a town name and a country name. All three are correctly classified as LOCATION, but I would like to relabel them as "STREET", "CITY" and "COUNTRY". I couldn't find how to change this. The same happens with the doctor's name and the user's name: I would like to identify each one separately.

I think both issues can be solved using context data, but I cannot find how to do it in the available methods.

Thanks for your help and knowledge in advance.

Answered by omri374


omri374 · 2021-11-22T19:10:29Z

Maintainer

Hi @Idomingog,

First, let me share an example of how to add more context words to existing recognizers. Then, I'll try to answer your questions about addresses and the doctor's vs. the user's name.

Updating / changing the context words of recognizers

To update the context words in a specific recognizer, you can either create a new recognizer, or update an existing one.
This example updates an existing one:

from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()
recognizers = analyzer.get_recognizers()

# Finding the one which supports US_PASSPORT

passport_recognizer = [rec for rec in recognizers if "US_PASSPORT" in rec.supported_entities][0]

# Existing list of context words
print(passport_recognizer.context)

['us', 'united', 'states', 'passport', 'passport#', 'travel', 'document']

# Let's assume the new context word is "pass"
analyzer.analyze(text="My pass: 191280342", language="en", entities=["US_PASSPORT"])

We get:

[type: US_PASSPORT, start: 9, end: 18, score: 0.05]

If pass is not a context word, the score is 0.05.

Adding a new context word:

passport_recognizer.context.append("pass")

# Re-running the analysis:
analyzer.analyze(text="My pass: 191280342", language="en", entities=["US_PASSPORT"])

Output:

[type: US_PASSPORT, start: 9, end: 18, score: 0.4]

So adding the pass context word increased the score from 0.05 to 0.4.

How recognizers use context words

For your second question, the answer depends on which recognizer is used and its underlying logic. For rule-based recognizers (like the US_PASSPORT one), context is leveraged in a rule-based way, i.e. looked for before and after an entity. For Named Entity Recognition based recognizers (like the PERSON recognizer), context is leveraged implicitly by the Machine Learning model, and updating the context words would not help. The model is trained on previously seen data and leverages this data to infer in which contexts entities usually appear.
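To make the rule-based case concrete, here is a minimal sketch of the idea of looking for context words before and after an entity. It is not Presidio's implementation: the `Match` class, the `boost_with_context` function, the window size and the boost value are all assumptions made for illustration, with the numbers chosen only to reproduce the 0.05 to 0.4 change seen in the passport example.

```python
import re
from dataclasses import dataclass


@dataclass
class Match:
    """Hypothetical stand-in for a recognizer result."""
    entity_type: str
    start: int
    end: int
    score: float


def boost_with_context(text, match, context_words, window=5, boost=0.35):
    """Return a copy of `match` with a higher score if a context word
    appears within `window` tokens before or after the matched span."""
    # Tokenize with character offsets so tokens can be compared to the span.
    tokens = [(m.group().lower(), m.start(), m.end())
              for m in re.finditer(r"\w+", text)]
    before = [tok for tok, _, end in tokens if end <= match.start][-window:]
    after = [tok for tok, start, _ in tokens if start >= match.end][:window]

    wanted = {word.lower() for word in context_words}
    if wanted & set(before + after):
        new_score = round(min(match.score + boost, 1.0), 2)
        return Match(match.entity_type, match.start, match.end, new_score)
    return match


text = "My pass: 191280342"
found = Match("US_PASSPORT", 9, 18, 0.05)

unchanged = boost_with_context(text, found, ["passport", "travel"])
boosted = boost_with_context(text, found, ["passport", "travel", "pass"])
print(unchanged.score, boosted.score)
```

This sketch compares exact lowercase tokens, so "passport" in the context list does not match "pass" in the text, which is why the extra context word had to be added explicitly.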

For the Street vs. City vs. Country question, this is another example of a Named Entity Recognition model, and the default one used in Presidio cannot distinguish between cities, countries, and streets (because of the way it was trained). One can add a post-processing step that looks up each identified location in lists of cities, countries, etc. to decide whether the entity type should be changed.
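A post-processing step of this kind could look roughly like the sketch below. Everything in it is hypothetical: the `Finding` class only mirrors the fields shown in the analyzer output (type, start, end, score), and the lookup lists and street prefixes are tiny placeholders that a real project would replace with proper gazetteers.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Finding:
    """Hypothetical stand-in for an analyzer result."""
    entity_type: str
    start: int
    end: int
    score: float


# Placeholder lookup data, assumed for illustration only.
CITIES = {"madrid", "barcelona", "sevilla"}
COUNTRIES = {"españa", "spain", "francia"}
STREET_PREFIXES = ("calle", "avenida", "avda", "paseo", "plaza", "c/")


def refine_location(text, finding):
    """Relabel a LOCATION finding as COUNTRY, CITY or STREET when the
    matched text is found in the lookup data; otherwise leave it as-is."""
    if finding.entity_type != "LOCATION":
        return finding
    value = text[finding.start:finding.end].strip().lower()
    if value in COUNTRIES:
        return replace(finding, entity_type="COUNTRY")
    if value in CITIES:
        return replace(finding, entity_type="CITY")
    if value.startswith(STREET_PREFIXES):
        return replace(finding, entity_type="STREET")
    return finding


text = "Calle Mayor 12, Madrid, España"
spans = ["Calle Mayor 12", "Madrid", "España"]
findings = [
    Finding("LOCATION", text.index(s), text.index(s) + len(s), 0.85)
    for s in spans
]
refined = [refine_location(text, f) for f in findings]
print([f.entity_type for f in refined])
```

The same approach could be tried for the doctor's vs. the user's name, for example by checking for a title such as "Dr." just before a PERSON span, though that is a heuristic and would need testing on real documents.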

Hope this helps!

2 replies

Idomingog Nov 22, 2021
Author

Thanks a lot for your answer.

omri374 Mar 12, 2022
Maintainer

Note that the context mechanism in Presidio was improved in version 2.2.25+. For a tutorial, see this example.
