Hi, I'm using Presidio on medical text in Spanish. I would like to ask how I can add context words to existing recognizers. I checked the methods of the AnalyzerEngine() object, and that doesn't seem to be the right way to do it. I can create a new recognizer for a specific PII type, like the Spanish zip code, where I'm able to provide context words, and it works perfectly.
For some common PII, the system classifies the wrong word. For example (remember I work in Spanish): Users' data: Sometimes the system gives me as a result: Nombre: I understand that if I could provide some context words, like "Apellidos", the system would identify the surnames as PERSON, but not the first attribute as PERSON.
My second question is related. I have a street address, a town name and a country name. All three are correctly classified as LOCATION, but I would like to change the labels to "STREET", "CITY" and "COUNTRY". I couldn't find how to modify this information. The same happens with the doctor's name and the user's name: I would like to identify each one separately.
I think both issues can be solved using context data, but I can't find how to do it with the available methods.
Thanks in advance for your help and knowledge.
Hi @Idomingog,
First, let me share an example of how to add more context words to existing recognizers. Then, I'll try to answer your questions about addresses and the doctor's vs. the user's name.
Updating / changing the context words of recognizers
To update the context words in a specific recognizer, you can either create a new recognizer or update an existing one. This example updates an existing one:
```python
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()
recognizers = analyzer.get_recognizers()

# Find the recognizer that supports US_PASSPORT
passport_recognizer = [rec for rec in recognizers if "US_PASSPORT" in rec.supported_entities][0]

# Existing list of context words
print(passport_recognizer.context)

# Let's assume the new context word is "pass"
analyzer.analyze(text="My pass: 191280342", language="en", entities=["US_PASSPORT"])
```
We get:
Adding a new context word:
```python
# Append to the context list of the recognizer found above
passport_recognizer.context.append("pass")

# Re-running the analysis:
analyzer.analyze(text="My pass: 191280342", language="en", entities=["US_PASSPORT"])
```
Output:
So adding the context word raises the confidence score of the match.
How recognizers use context words
For your second question, the answer depends on which recognizer is used and its underlying logic. For rule-based recognizers (like the
For the Street vs. City vs. Country question, this is another example of a Named Entity Recognition model, and the default one used in Presidio cannot distinguish between cities, countries, and streets (because of the way it was trained). One option is to add a post-processing step that looks up each identified location in sets of known cities, countries, etc. to decide whether the entity type should be changed.
Hope this helps!
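As a follow-up to the post-processing idea, here is a minimal sketch of what such a step could look like. It is only an illustration under stated assumptions: `Span` is a hypothetical stand-in for an analyzer result (Presidio's `RecognizerResult` carries the same `entity_type`, `start` and `end` fields plus a score), and the tiny lookup sets and the `refine_location` helper are invented for the example, not part of Presidio.

```python
from dataclasses import dataclass, replace


# Hypothetical stand-in for an analyzer result, so the sketch runs without Presidio.
@dataclass(frozen=True)
class Span:
    entity_type: str
    start: int
    end: int


# Illustrative lookup data only (assumption); real gazetteers would be far larger.
COUNTRIES = {"españa", "francia", "portugal"}
CITIES = {"madrid", "barcelona", "sevilla"}
STREET_PREFIXES = ("calle", "avenida", "plaza", "paseo")


def refine_location(text: str, span: Span) -> Span:
    """Return a copy of `span` with a finer entity type when a lookup matches."""
    if span.entity_type != "LOCATION":
        return span  # only LOCATION results are refined
    value = text[span.start:span.end].strip().lower()
    if value in COUNTRIES:
        return replace(span, entity_type="COUNTRY")
    if value in CITIES:
        return replace(span, entity_type="CITY")
    if value.startswith(STREET_PREFIXES):
        return replace(span, entity_type="STREET")
    return span  # no match: keep the generic LOCATION label


text = "Vive en Calle Mayor 5, Madrid, España"
# Spans as an NER model might return them, all labelled LOCATION
spans = [Span("LOCATION", 8, 21), Span("LOCATION", 23, 29), Span("LOCATION", 31, 37)]
refined = [refine_location(text, s) for s in spans]
print([s.entity_type for s in refined])
```

With real analyzer results, the same idea would apply to each result whose `entity_type` is `LOCATION`; anything not found in the lookups simply keeps its original label.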
Hi @Idomingog,
First, let me share an example on how to technically add more context words to existing recognizers. Then, I'll try to answer your questions about addresses and doctor's vs user's name.
Updating / changing the context words of recognizers
To update the context words in a specific recognizer, you can either create a new recognizer, or update an existing one.
This example updates an existing one: