Image Redactor - Allow list & Context Words not properly working #1125

LSD-98 · 2023-07-19T23:00:08Z

Hello,

I have an issue when redacting files with the Presidio-Image-Redactor while defining an "allow list".
The problem is that words from the "allow list" are redacted by the function.

Here is my code:

# Import necessary modules
import spacy
from pdf2image import convert_from_path
from PIL import Image
from presidio_analyzer import AnalyzerEngine, PatternRecognizer, Pattern
from presidio_analyzer.nlp_engine import NlpEngineProvider
from presidio_analyzer.predefined_recognizers import SpacyRecognizer
from presidio_image_redactor import ImageRedactorEngine, ImageAnalyzerEngine

# Set up the spaCy NLP models (French and English)
NLPconfig = {
    "nlp_engine_name": "spacy",
    "models" : [{"lang_code" : "fr", "model_name" : "fr_core_news_lg"},
                {"lang_code" : "en", "model_name" : "en_core_web_lg"}]
}

provider = NlpEngineProvider(nlp_configuration = NLPconfig)
nlp_engine_with_french = provider.create_engine()

# Launch Recognizer Engine
analyzer = AnalyzerEngine(
    nlp_engine = nlp_engine_with_french,
    supported_languages = ["fr","en"]
)

# Set-up specific recognizers and allow-list
# Allow List
Allow = ["Fonds", "Investisseur", "Société", "Portefeuille", "titres", "Parts", "Valeur", "Définition","Introduction", "Gestion", "Société de Gestion", "Participation"]

# Company Recognizer
Entities1 = ["ORGANIZATION"]
Context1 = ["Fonds", "Société"]
OrgRecognizer = SpacyRecognizer(supported_language="fr", supported_entities=Entities1, ner_strength=0.8, context=Context1)
analyzer.registry.add_recognizer(OrgRecognizer)

# GP-code Recognizer
GP_code = Pattern(name="GP_code", regex="GP-[0-9]+", score=0.7)
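# supported_language defaults to "en"; set it to "fr" so this recognizer is used when analyzing with language='fr'
GP_code_recognizer = PatternRecognizer(supported_entity="GP_code", patterns=[GP_code], supported_language="fr")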

analyzer.registry.add_recognizer(GP_code_recognizer)
analyzer.get_supported_entities()

# Launch ImageRedactorEngine and specify AnalyzerEngine
ImageAnalyzer = ImageAnalyzerEngine(analyzer_engine = analyzer)
ImageRedactor = ImageRedactorEngine(image_analyzer_engine = ImageAnalyzer)

# Perform PDF Redaction
images = convert_from_path("/Users/tcp/Documents/Projet_3_Contracts/Python/Presidio/Sources/TestA.pdf")
number_of_pages = len(images)

redacted_pages = []
for i in range(number_of_pages):
    image_to_redact = images[i]
    redacted_image = ImageRedactor.redact(image=image_to_redact, language='fr',allow_list=Allow)
    redacted_pages.append(redacted_image)

redacted_pages[0].save("Test.pdf", save_all=True, append_images=redacted_pages[1:])
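
The GP-code pattern can be sanity-checked on its own, outside Presidio. This is a minimal sketch using only the standard `re` module with its default flags. The recognizer may apply its own regex flags (for example case-insensitivity), which is not modeled here. The sample strings are made up and do not come from TestA.pdf.

```python
import re

# Same pattern string as the GP_code Pattern defined in the script above.
GP_CODE_REGEX = re.compile(r"GP-[0-9]+")

# Hypothetical sample strings, for illustration only.
samples = [
    "Référence GP-12345 du Fonds",  # one code
    "GP-",                          # prefix without digits: no match
    "Codes GP-7 et GP-88",          # two codes
]

# Collect every match per sample.
matches = [GP_CODE_REGEX.findall(sample) for sample in samples]
print(matches)
```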

As an example, TestA.pdf contains this simple text:

"Au Premier Jour de Souscription, la Société de Gestion gérera aux côtés du Fonds le fonds professionnel de capital investissement dénommé Microsoft Croissance & Transmission I (le « Fonds Précédent »)." The only thing to redact here is "Microsoft Croissance & Transmission".

The redacted PDF is:

Thus, it only recognizes "Fonds Successeur" as an organization, even though "Fonds" is in the allow list and is not sensitive information.

If I run this simpler script, which processes the same text with the AnalyzerEngine only, it works well:

import spacy
from presidio_analyzer import AnalyzerEngine, PatternRecognizer, Pattern
from presidio_analyzer.nlp_engine import NlpEngineProvider
from presidio_analyzer.predefined_recognizers import SpacyRecognizer

# Set up the spaCy NLP models (French and English)
NLPconfig = {
    "nlp_engine_name": "spacy",
    "models" : [{"lang_code" : "fr", "model_name" : "fr_core_news_lg"},
                {"lang_code" : "en", "model_name" : "en_core_web_lg"}]
}

provider = NlpEngineProvider(nlp_configuration = NLPconfig)
nlp_engine_with_french = provider.create_engine()

# Launch Recognizer Engine
analyzer = AnalyzerEngine(
    nlp_engine = nlp_engine_with_french,
    supported_languages = ["fr","en"]
)

# Set-up specific recognizers and allow-list
# Allow List
Allow = ["Fonds", "Investisseur", "Société", "Portefeuille", "titres", "Parts", "Valeur", "Définition","Introduction", "Gestion", "Société de Gestion", "Participation"]

# Company Recognizer
Entities1 = ["ORGANIZATION"]
Context1 = ["Fonds","Société"]
OrgRecognizer = SpacyRecognizer(supported_language="fr", supported_entities=Entities1, ner_strength=0.8, context=Context1)
analyzer.registry.add_recognizer(OrgRecognizer)

# Test
Text = "Au Premier Jour de Souscription, la Société de Gestion gérera aux côtés du Fonds le fonds professionnel de capital investissement dénommé Microsoft Croissance & Transmission I (le « Fonds Précédent »)."
Results = analyzer.analyze(text = Text, language='fr', allow_list = Allow)

print(Results)

It returns: [type: ORGANIZATION, start: 182, end: 197, score: 0.8], which I took to mean that it correctly detected "Microsoft Croissance & Transmission I".
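
To double-check which substring a reported span actually covers, the offsets can be sliced back out of the input. A minimal sketch, assuming `start` and `end` are Python string indices into `Text` (end exclusive). The helper name `span_text` is hypothetical.

```python
import unicodedata

# The test sentence from the script above, normalized to NFC so that each
# accented letter counts as a single character when slicing.
Text = unicodedata.normalize(
    "NFC",
    "Au Premier Jour de Souscription, la Société de Gestion gérera aux côtés "
    "du Fonds le fonds professionnel de capital investissement dénommé "
    "Microsoft Croissance & Transmission I (le « Fonds Précédent »).",
)


def span_text(text: str, start: int, end: int) -> str:
    """Return the substring covered by a [start, end) span (hypothetical helper)."""
    return text[start:end]


# Offsets taken from the analyzer output quoted above.
detected = span_text(Text, 182, 197)
print(repr(detected))

# Where the fund name actually starts in the same string.
fund_start = Text.find("Microsoft Croissance & Transmission I")
print(fund_start)
```

With the sentence as typed above, the slice 182:197 comes out as "Fonds Précédent", while the fund name starts at offset 138. So it is worth confirming whether the analyzer-only result really differs from what the image redactor flags.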

So I believe the ImageRedactor in my first script does not properly take into account (i) the context words of my ORG recognizer (otherwise it would probably detect the fund name) and (ii) the allow_list (otherwise it would probably not redact "Fonds").
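
One thing worth ruling out for point (ii): a minimal sketch, assuming the allow list is compared against the whole detected span rather than against the individual words inside it. This is a plain-Python illustration of that assumption, not Presidio's code. The function `filter_exact_allow_list` and the sample sentence are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) offsets into the text, end exclusive


def filter_exact_allow_list(
    text: str, spans: List[Span], allow_list: List[str]
) -> List[Span]:
    """Drop spans whose full text equals an allow-list entry (assumed behavior)."""
    return [(s, e) for (s, e) in spans if text[s:e] not in allow_list]


allow = ["Fonds", "Société", "Société de Gestion"]
text = "le Fonds et le Fonds Précédent"

# Two hypothetical detections: the bare word and the two-word phrase.
first = text.index("Fonds")
second = text.index("Fonds Précédent")
spans = [
    (first, first + len("Fonds")),
    (second, second + len("Fonds Précédent")),
]

kept = filter_exact_allow_list(text, spans, allow)
kept_texts = [text[s:e] for (s, e) in kept]
print(kept_texts)
```

Under this assumption, a span such as "Fonds Précédent" would survive the filter even though "Fonds" alone is on the allow list, because the two strings are not equal.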

Does anyone understand what I am doing wrong with the allow list?

Many thanks in advance!

PS: Please note that my files are in French, so I had to fork the repository and change the Image_Analyzer_Engine to support other languages (basically by deleting the "language = 'en'" argument and specifying the language in the **kwargs), but I don't think that is the source of my problem.

EDIT: I changed my question after the code block because I also have an issue with context words for the ORG recognizer.
