Unexpected silent exits of presidio application #1505

Open
grafandreas opened this issue Jan 2, 2025 · 8 comments

@grafandreas

First of all, thanks for the great work on this project.

I am encountering the following problem: the Python app silently and nondeterministically exits during a call to anonymize_text().
Setting the logging level to DEBUG shows the following:

DEBUG:presidio-analyzer:Returning a total of 10 recognizers
INFO:presidio-analyzer:Fetching all recognizers for language de
DEBUG:presidio-analyzer:Returning a total of 10 recognizers

That is the last output before the application simply returns to the command line. Texts passed earlier are anonymized correctly.

  • We do not have a custom analyzer, so this is out of the box
  • Running with Python 3.12.3
  • No error messages / stack trace shown

Any pointers or hints on what might cause this problem?
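
When a process returns to the command line without a traceback, its exit status is often the first clue: on POSIX systems a negative return code from `subprocess` means the child was killed by a signal (for example a segfault in a native extension, or the kernel's out-of-memory killer). The following is a minimal sketch, not something confirmed in this thread as the cause; the script name in the usage comment is hypothetical.

```python
import signal
import subprocess
import sys


def describe_returncode(returncode: int) -> str:
    """Translate a subprocess return code into a human-readable description."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # On POSIX, a negative value means the child was killed by that signal.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"


def run_and_report(argv: list[str]) -> str:
    """Run a command in a child process and describe how it ended."""
    completed = subprocess.run(argv, check=False)
    return describe_returncode(completed.returncode)


# Hypothetical usage: replace "anonymize_job.py" with the real script name.
# print(run_and_report([sys.executable, "anonymize_job.py"]))
```

A result such as "killed by SIGSEGV" would point at native code, while "killed by SIGKILL" would be consistent with the process being terminated from outside, for instance for using too much memory.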

@omri374
Contributor

omri374 commented Jan 2, 2025

Hi, thanks for raising this. Would it be possible to create a slightly more detailed reproducible example?
Is this running on pure Python, in Docker, or in pyspark?

@grafandreas
Author

Hi,
it is really difficult / impossible to create a concise reproducible example, since it seems non-deterministic and I cannot share the data set. A bit more information:

  • We are running pure Python (in a VS Code terminal)
  • Base setup:
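# Imports and logger setup that the snippet relies on
import logging

import presidio_anonymizer
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

logger = logging.getLogger(__name__)

configuration = {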
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "de", "model_name": "de_core_news_lg"}],
}

# Create NLP engine based on configuration
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine = provider.create_engine()
# the languages are needed to load country-specific recognizers 
# for finding phones, passport numbers, etc.
analyzer = AnalyzerEngine(nlp_engine=nlp_engine,
                          supported_languages=["de"])

def anonymize_text(text: str) -> str:
    logger.info(f"Anonymizing text: {text}")
    analyzer_results = analyzer.analyze(text=text,
                            language='de')
    
    logger.info(f"Anonymizer results: {analyzer_results}")

    engine = presidio_anonymizer.AnonymizerEngine()
    result = engine.anonymize(text=text, analyzer_results=analyzer_results)
    logger.info(result)
    # Restructuring anonymizer results

    anonymization_results =  {"anonymized": result.text,"found": [entity.to_dict() for entity in analyzer_results]}
    return anonymization_results["anonymized"]

anonymize_text() is then called in a loop that fetches data from a SQL (MariaDB) table and writes the anonymized data into another table. Are there any other trace options that would produce further output?
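
One standard-library trace option that may help with a silent exit is `faulthandler`, which dumps the Python traceback of every thread when the interpreter receives a fatal signal such as SIGSEGV or SIGABRT. This is a minimal sketch that assumes the exit is caused by a native crash, which has not been established in this thread; the helper name and log file name are hypothetical.

```python
import faulthandler
import os
import tempfile


def enable_crash_tracing(path=None):
    """Enable faulthandler and return the open log file it writes to.

    The caller must keep the returned file object alive: faulthandler
    writes to its file descriptor at the moment a fatal signal arrives.
    """
    if path is None:
        # Hypothetical default location for the crash log.
        path = os.path.join(tempfile.gettempdir(), "presidio_crash.log")
    log_file = open(path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Call once at the top of the script, before the processing loop:
# crash_log = enable_crash_tracing()
```

The same handler can be switched on without code changes by running the script with `python -X faulthandler`. Note that faulthandler cannot report a SIGKILL, so an empty log after a silent exit is itself a hint that the process was killed from outside.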

@grafandreas
Author

I also tried to check whether the problem lies with one of the registered recognizers, excluding some of them with variations of

analyzer.registry.recognizers = analyzer.registry.recognizers[0:1]

to no avail.

@janorivera

Hi,
I have the same issue, and I'm also using pure Python.

Below is the function that I'm using. It worked once; the other times it failed at some point with no errors.
The loop runs the scrubber on every message body inside a transcript object.

def scrub_transcript_messages(transcript, analyzer, anonymizer, entities=None):
    if "transcript" not in transcript or "messages" not in transcript["transcript"]:
        raise ValueError("Invalid transcript format. Expected 'transcript' key with 'messages' list.")

    if entities is None:
        entities = ["PHONE_NUMBER", "PERSON"]

    scrubbed_transcript = {"transcript": {"messages": []}}
    messages_list = transcript["transcript"]["messages"]

    for message in messages_list:
        scrubbed_message = message.copy()
        try:
            log.logger.info("Processing")
            print(message["body"])
            results = analyzer.analyze(
                text=message["body"],
                entities=entities,
                language='en'
            )
            anonymized_text = anonymizer.anonymize(
                text=message["body"],
                analyzer_results=results
            )
            scrubbed_message["body"] = anonymized_text.text
            log.logger.info("Scrubbed message")
            print(anonymized_text.text)

        except Exception as e:
            scrubbed_message["body"] = f"Error anonymizing message: {e}"
        
        scrubbed_transcript["transcript"]["messages"].append(scrubbed_message)

    return scrubbed_transcript

@omri374
Contributor

omri374 commented Jan 5, 2025

Thanks, we're trying to reproduce this. @janorivera in your case, I see that you're collecting exceptions into the body of the scrubbed message. Do you have instances where the scrubbed message contains an error and not the scrubbed text?

Also, it could be more scalable to use BatchAnalyzerEngine and BatchAnonymizerEngine to run Presidio on a list of texts. See https://microsoft.github.io/presidio/samples/python/batch_processing/ and https://microsoft.github.io/presidio/api/analyzer_python/#presidio_analyzer.batch_analyzer_engine.BatchAnalyzerEngine.analyze_iterator

Could you please check if this happens with batch mode too?
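
A rough sketch of what the batch variant of the loop above might look like. The helper `scrub_bodies` is hypothetical, and the engines are passed in as arguments so the function itself does not import Presidio; the wiring in the trailing comments assumes `BatchAnalyzerEngine.analyze_iterator` and `BatchAnonymizerEngine.anonymize_list` as described in the linked documentation.

```python
from typing import Any, List, Sequence


def scrub_bodies(
    bodies: Sequence[str],
    batch_analyzer: Any,
    batch_anonymizer: Any,
    language: str = "en",
) -> List[str]:
    """Anonymize a list of texts with one batch call per engine."""
    texts = list(bodies)
    # One list of recognizer results per input text, in input order.
    results = list(batch_analyzer.analyze_iterator(texts=texts, language=language))
    return batch_anonymizer.anonymize_list(
        texts=texts, recognizer_results_list=results
    )


# Assumed wiring (requires presidio and a spaCy model to be installed):
# from presidio_analyzer import AnalyzerEngine, BatchAnalyzerEngine
# from presidio_anonymizer import BatchAnonymizerEngine
# batch_analyzer = BatchAnalyzerEngine(analyzer_engine=AnalyzerEngine())
# batch_anonymizer = BatchAnonymizerEngine()
# bodies = [m["body"] for m in transcript["transcript"]["messages"]]
# scrubbed = scrub_bodies(bodies, batch_analyzer, batch_anonymizer)
```

Unlike the per-message loop, a failure here affects the whole batch, so this is mainly useful for checking whether the silent exit also occurs in batch mode.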

@omri374
Contributor

omri374 commented Jan 5, 2025

@grafandreas I'm trying to reproduce your case. I'm using this code. Is it different in any way from yours?

from logging import getLogger
logger = getLogger()

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider
import presidio_anonymizer



configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "de", "model_name": "de_core_news_lg"}],
}

# Create NLP engine based on configuration
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine = provider.create_engine()
# the languages are needed to load country-specific recognizers 
# for finding phones, passport numbers, etc.
analyzer = AnalyzerEngine(nlp_engine=nlp_engine,
                          supported_languages=["de"])

def anonymize_text(text: str) -> str:
    logger.info(f"Anonymizing text: {text}")
    analyzer_results = analyzer.analyze(text=text,
                            language='de')
    
    logger.info(f"Anonymizer results: {analyzer_results}")

    engine = presidio_anonymizer.AnonymizerEngine()
    result = engine.anonymize(text=text, analyzer_results=analyzer_results)
    logger.info(result)
    # Restructuring anonymizer results

    anonymization_results =  {"anonymized": result.text,"found": [entity.to_dict() for entity in analyzer_results]}
    return anonymization_results["anonymized"]


text = """
Hier sind ein paar Beispielsätze, die wir derzeit unterstützen:

Hallo, mein Name ist David Johnson, und ich komme ursprünglich aus Liverpool.
Meine Kreditkartennummer ist 4095-2609-9393-4932, und meine Krypto-Wallet-ID ist 16Yeky6GMjeNkAiNcBY7ZhrLoMSgg1BoyZ.

Am 11.10.2024 habe ich www.microsoft.com besucht und eine E-Mail an [email protected] von der IP-Adresse 192.168.0.1 gesendet.

Mein Reisepass: 191280342 und meine Telefonnummer: (212) 555-1234.

Dies ist eine gültige internationale Bankkontonummer: IL150120690000003111111. Können Sie bitte den Status des Bankkontos 954567876544 überprüfen?

Kates Sozialversicherungsnummer ist 078-05-1126. Ihr Führerschein? Er lautet 1234567A.

"""

for i in range(100000):
    if i % 100 == 0:
        print(i)
    anonymize_text(text)

@grafandreas
Author

@omri374 Yes, that looks very much like the code I use, with the obvious exception of me using different texts.

@omri374
Contributor

omri374 commented Jan 6, 2025

Are the texts much longer? Do they contain non-Unicode values, or is there anything else that could be special about them?
Are you running this in a particular compute environment?
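
Since the data set cannot be shared, one way to answer these questions is to log a summary of each text rather than the text itself. The helper below is a hypothetical, standard-library-only sketch; which properties actually matter for this issue is an open question.

```python
import unicodedata


def profile_text(text: str) -> dict:
    """Summarize properties of a text that could make it unusual."""
    suspicious = [
        ch for ch in text
        # Cc = control, Cs = surrogate, Co = private use, Cn = unassigned
        if unicodedata.category(ch) in {"Cc", "Cs", "Co", "Cn"}
        and ch not in "\n\r\t"
    ]
    return {
        "chars": len(text),
        "utf8_bytes": len(text.encode("utf-8", errors="replace")),
        "non_ascii": sum(1 for ch in text if ord(ch) > 127),
        "suspicious": sorted({f"U+{ord(ch):04X}" for ch in suspicious}),
    }
```

Logging `profile_text(text)` immediately before each call to anonymize_text() means the last log line before a silent exit describes the input that was being processed, without exposing its content.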
