-
Notifications
You must be signed in to change notification settings - Fork 581
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Get patterns from regex match in ITIN recognizer #959
base: main
Are you sure you want to change the base?
Get patterns from regex match in ITIN recognizer #959
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! This is a great addition. Added some comments with questions
presidio-analyzer/presidio_analyzer/predefined_recognizers/us_itin_recognizer.py
Outdated
Show resolved
Hide resolved
@omri374 @SharonHart I updated the PR with the proposed changes. Since the proposal is incompatible with the existing approach, I implemented a new recognizer for this. We can migrate the rest of recognizers to this step by step. I included the ssn recognizer in this PR to demonstrate a perfect example of pattern level score improvement, and recognizer level score improvement. I also added a string sanitizer class to make more extensible the replace list we are using. This string sanitizer uses now python translate tables, that are faster than |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
@aperezfals, thanks for the update of the PR! I might be missing something, but could you elaborate on why this change cannot be added to the existing |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
@omri374 Mainly yes. It could be confusing to have a validate, invalidate and improve. Which method should have precedence? I guess that if we implement |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
Thanks @aperezfals. Some considerations I'm thinking of:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding my review on the code. Hopefully we can have this resolved (together with the previous comment on adding vs. modifying) and released in the next version!
:param name: the name of the pattern | ||
:param regex: the regex pattern to detect | ||
:param score: the pattern's strength (values varies 0-1) | ||
:param get_improved_pattern_func: a function that improve the score of the analysis explanation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:param get_improved_pattern_func: a function that improve the score of the analysis explanation | |
:param improve_score_fn: a function that improve the score of the analysis explanation |
|
||
class RegexReplaceSanitizer(StringSanitizer): | ||
""" | ||
Replace parts of a string using a regex to search the term to replace. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add the params (regex, replace) to the docstring
self.replace = replace | ||
|
||
def sanitize(self, text: str) -> str: | ||
return re.sub(self.regex, self.replace, text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add docstring
|
||
|
||
class StringSanitizer: | ||
"""Cleans a string.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add a more detailed description? What type of cleaning does this do?
import regex as re | ||
|
||
|
||
class StringSanitizer: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have this as an abstract class?
|
||
See https://docs.python.org/3/library/stdtypes.html#str.maketrans | ||
""" | ||
self.trans_table = str.maketrans(*trans_table) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool!
class ImprovablePatternRecognizer(LocalRecognizer): | ||
""" | ||
PII entity recognizer using regular expressions or deny-lists. | ||
Analysis explanations can be improved by a pattern or by the recognizer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not just the explanations that could be improved, but also the score. Am I correct?
This would be nice |
Get patterns from regex match in ITIN recognizer
Adds a new prop
get_improved_pattern_fn
toPattern
that returns a newPattern
based on the regex match info.UsItinRecognizer
is change, by passing animprove_itin_pattern
function to its only pattern , reducing the regex used from 3 to only 1. The regex inUsItinRecognizer
use now named groups for the separators-
andimprove_itin_pattern
inUsItinRecognizer
uses the named groups to return different pattern names and scores.Issue reference
#956
Future PRs will reduce the number of regex in the recognizers list defined in the issue.
Checklist
get_improved_pattern_fn
prop toPatternRecognizer
UsItinRecognizer
by passing a functionimprove_itin_pattern
to the pattern