
Feat: Add Skeleton to support Dataset anonymization with differential privacy #455

Merged: 36 commits, Jun 17, 2024

Conversation

ETHenzlere
Contributor

@ETHenzlere ETHenzlere commented Jan 24, 2024

DP-Anonymization for BenchBase

In order to measure the influence of modern anonymization techniques on datasets and their query performance, we would like to add support for differential privacy mechanisms to BenchBase. In this PR, we have added the skeleton of our approach to the codebase. We would like to introduce a new flag, --anonymize=true, which allows tables to be anonymized after the loading step and before execution.

The anonymization will:

  1. Pull data from the DBMS via the JDBC connection already specified in the config
  2. Run DP algorithms on the data to create a new, synthetic dataset
  3. Push the synthetic data back to the DBMS as an anonymized copy
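The three steps above can be sketched roughly as follows. This is a minimal illustration only: the function name, the use of sqlite3 as a stand-in for the JDBC connection, and the plain Laplace mechanism are all assumptions for this sketch, not the PR's actual implementation.

```python
# Hypothetical sketch of the pull -> anonymize -> push pipeline.
# sqlite3 stands in for the JDBC connection from the config, and a
# simple Laplace mechanism stands in for the real DP algorithms.
import sqlite3
import numpy as np

def anonymize_table(conn, table, column, epsilon=1.0):
    # 1. Pull data from the DBMS
    rows = conn.execute(f"SELECT {column} FROM {table}").fetchall()
    values = np.array([r[0] for r in rows], dtype=float)
    # 2. Run a DP mechanism on the data (here: Laplace noise, with a
    # crude sensitivity estimate taken from the observed value range)
    sensitivity = float(values.max() - values.min())
    noisy = values + np.random.laplace(0.0, sensitivity / epsilon, size=len(values))
    # 3. Push the synthetic data back as an anonymized copy
    conn.execute(f"CREATE TABLE {table}_anonymized ({column} REAL)")
    conn.executemany(f"INSERT INTO {table}_anonymized VALUES (?)",
                     [(float(v),) for v in noisy])
    return len(noisy)
```

A real DP synthesizer would fit a model over the whole table rather than perturb one column, but the data flow through the DBMS is the same.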

The anonymization information must be provided in the config file. The process works with minimal information but also allows for fine-tuning. A separate README file has been added that lists all the features and how to use them:
/scripts/anonymization/README.md

Minimal config:

    <anonymization>
      <table name="item">
        <differential_privacy />
      </table>
    </anonymization>
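Parsing this minimal block might look like the sketch below. The element and attribute names follow the snippet above; the parsing code itself is an assumption for illustration, not the PR's implementation.

```python
# Hypothetical parser for the minimal <anonymization> config block.
import xml.etree.ElementTree as ET

CONFIG = """
<anonymization>
  <table name="item">
    <differential_privacy />
  </table>
</anonymization>
"""

def parse_anonymization(xml_text):
    """Return one plan dict per <table> element."""
    root = ET.fromstring(xml_text)
    plans = []
    for table in root.findall("table"):
        plans.append({
            "table": table.get("name"),
            # presence of the empty element enables DP with defaults
            "differential_privacy": table.find("differential_privacy") is not None,
        })
    return plans
```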

Sensitive value handling is one feature we want to add to the process immediately. It replaces the actual values of specified columns with fake ones. The code has already been written, tested, and used privately within BenchBase.

The column faking approach will be decoupled from differential privacy to allow for more control.

    <anonymization>
      <table name="item">
        <differential_privacy> ... </differential_privacy>
        <value_faking> ... </value_faking>
      </table>
    </anonymization>
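Decoupled value faking could look roughly like this: sensitive values in a column are swapped for opaque, consistent fakes, independently of any DP mechanism. The function name and the hash-based faking scheme are assumptions for illustration only.

```python
# Illustrative sketch of column value faking, decoupled from DP.
# Equal inputs map to equal fakes, so join keys stay consistent.
import hashlib

def fake_column(rows, column, prefix="val"):
    """Replace values in `column` with deterministic opaque fakes."""
    faked = []
    for row in rows:
        digest = hashlib.sha256(str(row[column]).encode()).hexdigest()[:8]
        new_row = dict(row)
        new_row[column] = f"{prefix}_{digest}"
        faked.append(new_row)
    return faked
```

A production implementation would likely generate realistic-looking fakes (names, addresses) rather than hashes, but the column-level, DP-independent shape is the point here.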

Disclaimer: The anonymization itself is not part of this PR, in order to reduce its complexity. Currently, the anonymization flag only calls the script and parses the config. The rest of the code is ready to be added.

Architecture Benchbase.pdf

Review comments (all resolved) were left on config/postgres/sample_templated_config.xml and scripts/anonymizer.py.
@ETHenzlere ETHenzlere marked this pull request as ready for review March 13, 2024 13:19
Collaborator

@bpkroth bpkroth left a comment


Please add the comments I asked for and then I think we can merge this.
Thanks!

@bpkroth bpkroth merged commit 422a573 into cmu-db:main Jun 17, 2024
137 checks passed