I have an issue when redacting files with the Presidio-Image-Redactor while defining an "allow list".
The problem is that words from the "allow list" are redacted by the function.
Here is my code:
# Import necessary modules
import spacy
from pdf2image import convert_from_path
from PIL import Image
from presidio_analyzer import AnalyzerEngine, PatternRecognizer, Pattern
from presidio_analyzer.nlp_engine import NlpEngineProvider
from presidio_analyzer.predefined_recognizers import SpacyRecognizer
from presidio_image_redactor import ImageRedactorEngine, ImageAnalyzerEngine
# Set up the NLP engine with specific spaCy models
NLPconfig = {
    "nlp_engine_name": "spacy",
    "models": [
        {"lang_code": "fr", "model_name": "fr_core_news_lg"},
        {"lang_code": "en", "model_name": "en_core_web_lg"},
    ],
}
provider = NlpEngineProvider(nlp_configuration = NLPconfig)
nlp_engine_with_french = provider.create_engine()
# Launch Recognizer Engine
analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine_with_french,
    supported_languages=["fr", "en"],
)
# Set-up specific recognizers and allow-list
# Allow List
Allow = ["Fonds", "Investisseur", "Société", "Portefeuille", "titres", "Parts", "Valeur", "Définition","Introduction", "Gestion", "Société de Gestion", "Participation"]
# Company Recognizer
Entities1 = ["ORGANIZATION"]
Context1 = ["Fonds", "Société"]
OrgRecognizer = SpacyRecognizer(supported_language="fr", supported_entities=Entities1, ner_strength=0.8, context=Context1)
analyzer.registry.add_recognizer(OrgRecognizer)
# GP-code Recognizer
GP_code = Pattern(name="GP_code", regex="GP-[0-9]+", score=0.7)
GP_code_recognizer = PatternRecognizer(supported_entity="GP_code", patterns=[GP_code])
analyzer.registry.add_recognizer(GP_code_recognizer)
analyzer.get_supported_entities()
# Launch ImageRedactorEngine and specify AnalyzerEngine
ImageAnalyzer = ImageAnalyzerEngine(analyzer_engine = analyzer)
ImageRedactor = ImageRedactorEngine(image_analyzer_engine = ImageAnalyzer)
# Perform PDF Redaction
images = convert_from_path("/Users/tcp/Documents/Projet_3_Contracts/Python/Presidio/Sources/TestA.pdf")
number_of_pages = len(images)
redacted_pages = []
for i in range(number_of_pages):
    image_to_redact = images[i]
    redacted_image = ImageRedactor.redact(image=image_to_redact, language='fr', allow_list=Allow)
    redacted_pages.append(redacted_image)
redacted_pages[0].save("Test.pdf", save_all=True, append_images=redacted_pages[1:])
As an example, TestA.pdf contains this simple text:
"Au Premier Jour de Souscription, la Société de Gestion gérera aux côtés du Fonds le fonds professionnel de capital investissement dénommé Microsoft Croissance & Transmission I (le « Fonds Précédent »)." -> The only thing to redact here is "Microsoft Croissance & Transmission"
The redacted PDF is:
[image of the redacted page]
Thus, it only recognizes "Fonds Successeur" as an organization, even though "Fonds" is in the allow list and is not sensitive information.
If I run this simple script, which processes the same text using only the AnalyzerEngine, it works well:
import spacy
from presidio_analyzer import AnalyzerEngine, PatternRecognizer, Pattern
from presidio_analyzer.nlp_engine import NlpEngineProvider
from presidio_analyzer.predefined_recognizers import SpacyRecognizer
# Set up the NLP engine with specific spaCy models
NLPconfig = {
    "nlp_engine_name": "spacy",
    "models": [
        {"lang_code": "fr", "model_name": "fr_core_news_lg"},
        {"lang_code": "en", "model_name": "en_core_web_lg"},
    ],
}
provider = NlpEngineProvider(nlp_configuration = NLPconfig)
nlp_engine_with_french = provider.create_engine()
# Launch Recognizer Engine
analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine_with_french,
    supported_languages=["fr", "en"],
)
# Set-up specific recognizers and allow-list
# Allow List
Allow = ["Fonds", "Investisseur", "Société", "Portefeuille", "titres", "Parts", "Valeur", "Définition","Introduction", "Gestion", "Société de Gestion", "Participation"]
# Company Recognizer
Entities1 = ["ORGANIZATION"]
Context1 = ["Fonds","Société"]
OrgRecognizer = SpacyRecognizer(supported_language="fr", supported_entities=Entities1, ner_strength=0.8, context=Context1)
analyzer.registry.add_recognizer(OrgRecognizer)
# Test
Text = "Au Premier Jour de Souscription, la Société de Gestion gérera aux côtés du Fonds le fonds professionnel de capital investissement dénommé Microsoft Croissance & Transmission I (le « Fonds Précédent »)."
Results = analyzer.analyze(text = Text, language='fr', allow_list = Allow)
print(Results)
It returns: [type: ORGANIZATION, start: 182, end: 197, score: 0.8], which I read as a correct detection of "Microsoft Croissance & Transmission I".
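The reported offsets can be checked by slicing the input string, which shows exactly which substring the analyzer flagged. The sketch below assumes the text is exactly the string passed to analyzer.analyze() in the script above. It uses only plain Python and needs neither Presidio nor spaCy.

```python
# Text copied from the test script above.
text = (
    "Au Premier Jour de Souscription, la Société de Gestion gérera aux côtés "
    "du Fonds le fonds professionnel de capital investissement dénommé "
    "Microsoft Croissance & Transmission I (le « Fonds Précédent »)."
)

# Offsets reported in the analyzer result.
start, end = 182, 197

# The substring that the result actually covers.
detected = text[start:end]
print(repr(detected))

# For comparison: where the fund name sits in the same string.
fund_name = "Microsoft Croissance & Transmission I"
fund_start = text.find(fund_name)
print(fund_start, fund_start + len(fund_name))
```

With this exact string, the span 182-197 is 15 characters long and covers "Fonds Précédent", while the fund name starts at offset 138. If that holds for the real run as well, the interpretation of the result given above may need revisiting.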
My impression is that the ImageRedactor in my first script does not properly take into account (i) the context words of my OrgRecognizer (otherwise it would probably detect the fund name) and (ii) the allow_list (otherwise it would probably not redact "Fonds").
Does anyone understand where I am misusing the allow list?
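One thing that may be worth ruling out is how allow-list matching is applied. The sketch below is a stand-alone illustration, not Presidio's code. It assumes that the whole detected span is compared to the allow-list entries by exact string match, and that assumption about the library's behavior is unverified here. The helper names filter_with_allow_list and span_of are hypothetical.

```python
def filter_with_allow_list(text, spans, allow_list):
    """Keep only spans whose full text is NOT an exact allow-list entry.

    Assumption: matching is done on the whole detected span, not on
    the individual words inside it.
    """
    kept = []
    for start, end in spans:
        if text[start:end] not in allow_list:
            kept.append((start, end))
    return kept


def span_of(text, phrase):
    """Return the (start, end) offsets of the first occurrence of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase))


allow = ["Fonds", "Société", "Société de Gestion"]
text = "la Société de Gestion gérera le « Fonds Précédent »"

# Pretend a recognizer flagged these two spans as organizations.
spans = [span_of(text, "Société de Gestion"), span_of(text, "Fonds Précédent")]

kept = filter_with_allow_list(text, spans, allow)
kept_texts = [text[s:e] for s, e in kept]
print(kept_texts)
```

Under that assumption, "Société de Gestion" is dropped because the full span is an allow-list entry, while "Fonds Précédent" is kept even though "Fonds" is listed. A multi-word detection such as "Fonds Successeur" would then still be redacted unless the full phrase is added to the allow list.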
Many thanks in advance!
PS: Please note that my files are in French, so I had to fork the repository and change the Image_Analyzer_Engine to support all languages (basically by deleting the "language = 'en'" argument and specifying the language in the **kwargs), but I don't think that is the source of my problem.
EDIT: I changed my question after the code block because I also have an issue with context words for the OrgRecognizer.