This project provides a script that automatically removes direct personal identifiers from pdf, docx, txt and log files. Note that this does not constitute full anonymization in the sense of the GDPR. However, masking direct identifiers can be one of the technical measures taken for data privacy.
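To illustrate what masking direct identifiers means, here is a minimal sketch. It is not this project's implementation: the function name `mask_direct_identifiers`, the two regular expressions, and the placeholder tokens `[EMAIL]` and `[PHONE]` are assumptions chosen for illustration only.

```python
import re

# Hypothetical patterns for illustration; the project's actual rules may differ.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d /-]{6,}\d")


def mask_direct_identifiers(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


if __name__ == "__main__":
    sample = "Contact Jane at jane.doe@example.org or +49 521 1234567."
    print(mask_direct_identifiers(sample))
```

In this sketch the name "Jane" is left untouched, because simple patterns cannot recognize names. This is one reason why masking of this kind should not be mistaken for full anonymization.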
We assume that Python and Git are installed and that you have basic knowledge of both tools. The commands below install the requirements in a Python virtualenv; they have been tested on Linux.
git clone https://github.com/caretech-owl/Text-De-Identifizierer
cd Text-De-Identifizierer
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
To de-identify a single file:
source venv/bin/activate
python anonymize.py path/to/file
To de-identify all files in a directory:
source venv/bin/activate
python anonymize.py path/
Results are saved as txt files in a directory called output.
For simple testing purposes, we added small examples in the examples folder. Give it a try:
source venv/bin/activate
python anonymize.py examples/