
Feat: Add Skeleton to support Dataset anonymization with differential privacy #455

Merged: 36 commits, Jun 17, 2024

Conversation

ETHenzlere
Contributor

@ETHenzlere ETHenzlere commented Jan 24, 2024

DP-Anonymization for BenchBase

In order to measure the influence of modern anonymization techniques on datasets and their query performance, we would like to add support for differential privacy mechanisms to BenchBase. In this PR, we have added the skeleton of our approach to the codebase. We would like to introduce a new flag, --anonymize=true, which allows tables to be anonymized after the loading step and before execution.

The anonymization will:

  1. Pull data from the DBMS via the JDBC connection already specified in the config
  2. Run DP algorithms on the data to create a new, synthetic dataset
  3. Push the synthetic data back to the DBMS as an anonymized copy
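The three steps above can be sketched roughly as follows. This is a minimal illustration only: the function name, the use of sqlite3 as a stand-in for the JDBC connection, and the plain Laplace mechanism are all assumptions for this sketch, not the PR's actual implementation.

```python
# Hypothetical sketch of the pull -> anonymize -> push pipeline.
# sqlite3 stands in for the JDBC connection from the config, and a
# simple Laplace mechanism stands in for the real DP algorithms.
import sqlite3
import numpy as np

def anonymize_table(conn, table, column, epsilon=1.0):
    # 1. Pull data from the DBMS
    rows = conn.execute(f"SELECT {column} FROM {table}").fetchall()
    values = np.array([r[0] for r in rows], dtype=float)
    # 2. Run a DP mechanism on the data (here: Laplace noise, with a
    # crude sensitivity estimate taken from the observed value range)
    sensitivity = float(values.max() - values.min())
    noisy = values + np.random.laplace(0.0, sensitivity / epsilon, size=len(values))
    # 3. Push the synthetic data back as an anonymized copy
    conn.execute(f"CREATE TABLE {table}_anonymized ({column} REAL)")
    conn.executemany(f"INSERT INTO {table}_anonymized VALUES (?)",
                     [(float(v),) for v in noisy])
    return len(noisy)
```

A real DP synthesizer would fit a model over the whole table rather than perturb one column, but the data flow through the DBMS is the same.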

The anonymization information must be provided in the config file. The process works with minimal information but also allows for fine-tuning. A separate README file has been added that lists all the features and how to use them:
/scripts/anonymization/README.md

Minimal config:

    <anonymization>
      <table name="item">
        <differential_privacy />
      </table>
    </anonymization>
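Parsing this minimal block might look like the sketch below. The element and attribute names follow the snippet above; the parsing code itself is an assumption for illustration, not the PR's implementation.

```python
# Hypothetical parser for the minimal <anonymization> config block.
import xml.etree.ElementTree as ET

CONFIG = """
<anonymization>
  <table name="item">
    <differential_privacy />
  </table>
</anonymization>
"""

def parse_anonymization(xml_text):
    """Return one plan dict per <table> element."""
    root = ET.fromstring(xml_text)
    plans = []
    for table in root.findall("table"):
        plans.append({
            "table": table.get("name"),
            # presence of the empty element enables DP with defaults
            "differential_privacy": table.find("differential_privacy") is not None,
        })
    return plans
```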

Sensitive value handling is one feature we want to add to the process immediately. It replaces the actual values of specified columns with fake ones. The code has already been written, tested, and used privately within BenchBase.

The column faking approach will be decoupled from differential privacy to allow for more control.

    <anonymization>
      <table name="item">
        <differential_privacy> ... </differential_privacy>
        <value_faking> ... </value_faking>
      </table>
    </anonymization>
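Decoupled value faking could look roughly like this: sensitive values in a column are swapped for opaque, consistent fakes, independently of any DP mechanism. The function name and the hash-based faking scheme are assumptions for illustration only.

```python
# Illustrative sketch of column value faking, decoupled from DP.
# Equal inputs map to equal fakes, so join keys stay consistent.
import hashlib

def fake_column(rows, column, prefix="val"):
    """Replace values in `column` with deterministic opaque fakes."""
    faked = []
    for row in rows:
        digest = hashlib.sha256(str(row[column]).encode()).hexdigest()[:8]
        new_row = dict(row)
        new_row[column] = f"{prefix}_{digest}"
        faked.append(new_row)
    return faked
```

A production implementation would likely generate realistic-looking fakes (names, addresses) rather than hashes, but the column-level, DP-independent shape is the point here.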

Disclaimer: The anonymization itself is not part of this PR, in order to reduce its complexity. Currently, the anonymization flag only calls the script and parses the config. The rest of the code is ready to be added.

Architecture Benchbase.pdf

Review comments (all resolved) were left on config/postgres/sample_templated_config.xml and scripts/anonymizer.py.
@ETHenzlere ETHenzlere marked this pull request as ready for review March 13, 2024 13:19
Collaborator

@bpkroth bpkroth left a comment


Please add the comments I asked for and then I think we can merge this.
Thanks!

@bpkroth bpkroth merged commit 422a573 into cmu-db:main Jun 17, 2024
137 checks passed