Feat: Add Skeleton to support Dataset anonymization with differential privacy (#455)

**DP-Anonymization for BenchBase**

To measure the influence of modern anonymization techniques on datasets and their query performance, we would like to add support for differential privacy mechanisms to BenchBase. This PR adds the skeleton of our approach to the codebase.

We would like to introduce a new flag, `--anonymize=true`, that anonymizes tables after the loading step and before execution. The anonymization will:

1. Pull data from the DBMS via the JDBC connection already specified in the config
2. Run DP algorithms on the data to create a new, synthetic dataset
3. Push the synthetic data back to the DBMS as an anonymized copy

The anonymization information must be provided in the config file. The process works with minimal information but also allows for fine-tuning. A separate README file lists all the features and how to use them: `/scripts/anonymization/README.md`

Minimal config:

```
<anonymization>
    <table name="item">
        <differential_privacy />
    </table>
</anonymization>
```

Sensitive value handling is one feature we want to add to the process immediately. It replaces the actual values of specified columns with fake ones. The code has already been written, tested, and used privately within BenchBase. The column faking approach will be decoupled from differential privacy to allow for more control.

```
<anonymization>
    <table name="item">
        <differential_privacy> ... </differential_privacy>
        <value_faking> ... </value_faking>
    </table>
</anonymization>
```

**Disclaimer: The anonymization itself is not part of this PR, in order to reduce complexity. Currently, the anonymization flag calls the script and parses the config. The rest of the code is ready to be added.**

[Architecture Benchbase.pdf](https://github.com/cmu-db/benchbase/files/14356135/Architecture.Benchbase.pdf)

---------

Co-authored-by: Brian Kroth <[email protected]>
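
The three steps listed in the commit message (pull, synthesize, push) can be sketched as follows. This is only an illustration of the data flow, not the PR's implementation: it uses an in-memory SQLite database as a stand-in for the JDBC-connected DBMS, the `synthesize` function is a placeholder that is explicitly not differentially private, and the names `anonymize_table`, `synthesize`, and the `_anonymized` suffix are assumptions made for this example.

```python
import random
import sqlite3


def synthesize(rows, seed=0):
    """Placeholder for step 2. NOT differentially private.

    It only shuffles every column independently so the data flow can be
    demonstrated; a real DP algorithm would replace this function.
    """
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        rng.shuffle(col)
    return list(zip(*columns))


def anonymize_table(conn, table):
    """Pull a table, build a synthetic version, push it back as a copy.

    The table name is interpolated into SQL, so it must be trusted input.
    """
    # Step 1: pull the data from the DBMS.
    rows = conn.execute(f"SELECT * FROM {table}").fetchall()
    # Step 2: create the synthetic dataset (placeholder, see above).
    synthetic = synthesize(rows)
    # Step 3: push the synthetic data back as a separate, empty-schema copy.
    copy = f"{table}_anonymized"  # hypothetical naming scheme
    conn.execute(f"CREATE TABLE {copy} AS SELECT * FROM {table} WHERE 0")
    if synthetic:
        placeholders = ", ".join("?" for _ in synthetic[0])
        conn.executemany(f"INSERT INTO {copy} VALUES ({placeholders})", synthetic)
    return copy


# Stand-in database with a tiny "item" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (i_id INTEGER, i_name TEXT, i_price REAL)")
conn.executemany(
    "INSERT INTO item VALUES (?, ?, ?)",
    [(1, "apple", 2.5), (2, "pear", 3.0), (3, "plum", 9.9)],
)
copy_name = anonymize_table(conn, "item")
copy_count = conn.execute(f"SELECT COUNT(*) FROM {copy_name}").fetchone()[0]
```

Writing to a separate table leaves the loaded data untouched, which matches the "anonymized copy" wording of step 3.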
1 parent 98ccd4d, commit 422a573
Showing 14 changed files with 1,152 additions and 2 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
name: BenchBase (Python) | ||
|
||
on: | ||
push: | ||
branches: [ main ] | ||
tags: | ||
- 'v*' | ||
pull_request: | ||
branches: [ main ] | ||
|
||
jobs: | ||
build: | ||
runs-on: ubuntu-latest | ||
strategy: | ||
matrix: | ||
python-version: ["3.10"] | ||
|
||
steps: | ||
- uses: actions/checkout@v3 | ||
- name: Set up Python ${{ matrix.python-version }} | ||
uses: actions/setup-python@v4 | ||
with: | ||
python-version: ${{ matrix.python-version }} | ||
- name: Install dependencies | ||
working-directory: ./scripts/anonymization | ||
run: | | ||
python -m pip install --upgrade pip | ||
pip install -r requirements.txt | ||
- name: Check anonymization files with pylint | ||
run: | | ||
pylint --rcfile=.pylintrc ./scripts/anonymization/src | ||
- name: Test anonymization with pytest | ||
working-directory: ./scripts/anonymization/src | ||
run: pytest test.py |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -56,5 +56,8 @@ build/ | |
.*.swp | ||
|
||
.env | ||
|
||
docker-compose-*.tar.gz | ||
|
||
# Python | ||
__pycache__/ | ||
*.py[cod] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
[MAIN] | ||
|
||
# Specify a score threshold under which the program will exit with error. | ||
fail-under=0.9 | ||
|
||
[DESIGN] | ||
|
||
# Maximum number of arguments for function / method. | ||
max-args=9 | ||
|
||
# Maximum number of locals for function / method body. | ||
max-locals=20 | ||
|
||
# Maximum number of return / yield for function / method body. | ||
max-returns=6 | ||
|
||
# Maximum number of statements in function / method body. | ||
max-statements=50 | ||
|
||
# Minimum number of public methods for a class (see R0903). | ||
min-public-methods=0 | ||
|
||
|
||
[FORMAT] | ||
|
||
# Maximum number of characters on a single line. | ||
max-line-length=120 | ||
|
||
# Maximum number of lines in a module. | ||
max-module-lines=1000 | ||
|
||
[REPORTS] | ||
|
||
# Python expression which should return a score less than or equal to 10. You | ||
# have access to the variables 'fatal', 'error', 'warning', 'refactor', | ||
# 'convention', and 'info' which contain the number of messages in each | ||
# category, as well as 'statement' which is the total number of statements | ||
# analyzed. This score is used by the global evaluation report (RP0004). | ||
evaluation=max(0, 0 if fatal else 10.0 - ((float(5 * error + warning + refactor + convention) / statement) * 10)) | ||
|
||
# Tells whether to display a full report or only the messages. | ||
reports=no | ||
|
||
# Activate the evaluation score. | ||
score=yes |
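
To make the `evaluation` expression above concrete, the following sketch evaluates it for a set of message counts. The counts are hypothetical and `pylint_score` is a name introduced only for this example; the formula itself is copied from the setting above.

```python
def pylint_score(fatal, error, warning, refactor, convention, statement):
    """Evaluate the expression from the `evaluation=` setting."""
    if fatal:
        # A fatal message forces the score to 0.
        return 0
    penalty = float(5 * error + warning + refactor + convention) / statement
    return max(0, 10.0 - penalty * 10)


# Hypothetical message counts for a module with 100 statements.
score = pylint_score(fatal=0, error=1, warning=2, refactor=0, convention=3, statement=100)
print(score)  # 9.0

FAIL_UNDER = 0.9  # value of `fail-under` in this file
print(score >= FAIL_UNDER)  # True
```

Since the score is on a 0 to 10 scale, a `fail-under` of 0.9 only fails the run when the score drops below 0.9, which makes it a very permissive threshold.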
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
<?xml version="1.0"?> | ||
<parameters> | ||
|
||
<!-- Connection details --> | ||
<type>MYSQL</type> | ||
<driver>com.mysql.cj.jdbc.Driver</driver> | ||
<url>jdbc:mysql://localhost:3306/benchbase?rewriteBatchedStatements=true&amp;allowPublicKeyRetrieval=True&amp;sslMode=DISABLED</url> | ||
<username>admin</username> | ||
<password>password</password> | ||
<reconnectOnConnectionFailure>true</reconnectOnConnectionFailure> | ||
<isolation>TRANSACTION_SERIALIZABLE</isolation> | ||
<batchsize>128</batchsize> | ||
|
||
<!-- Note: this example anonymizes the "item" table of the tpcc workload. | ||
To run, use the `anonymize=true` flag | ||
--> | ||
|
||
<!-- The anonymization configuration --> | ||
<anonymization> | ||
<table name="item"> | ||
<differential_privacy epsilon="1.0" pre_epsilon="0.0" algorithm="mst"> | ||
<!-- Column categorization --> | ||
<ignore> | ||
<column name="i_id"/> | ||
<column name="i_data" /> | ||
<column name="i_im_id" /> | ||
</ignore> | ||
<categorical> | ||
<column name="i_name" /> | ||
</categorical> | ||
<!-- Continuous column fine-tuning --> | ||
<continuous> | ||
<column name="i_price" bins="1000" lower="2.0" upper="100.0" /> | ||
</continuous> | ||
</differential_privacy> | ||
<!-- Sensitive value handling --> | ||
<value_faking> | ||
<column name="i_name" method="name" locales="en_US" seed="0"/> | ||
</value_faking> | ||
</table> | ||
</anonymization> | ||
</parameters> |
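
The commit message says the anonymization flag currently calls the script and parses the config. As a rough sketch of what reading the `<anonymization>` section could look like, the example below uses Python's standard `xml.etree.ElementTree`. It is not the project's parser: the function name `parse_anonymization` and the shape of the returned dictionary are assumptions, and attribute values are kept as raw strings so no defaults are invented.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the anonymization section from the sample config above.
CONFIG = """
<parameters>
    <anonymization>
        <table name="item">
            <differential_privacy epsilon="1.0" pre_epsilon="0.0" algorithm="mst">
                <ignore>
                    <column name="i_id"/>
                    <column name="i_data" />
                    <column name="i_im_id" />
                </ignore>
                <categorical>
                    <column name="i_name" />
                </categorical>
                <continuous>
                    <column name="i_price" bins="1000" lower="2.0" upper="100.0" />
                </continuous>
            </differential_privacy>
            <value_faking>
                <column name="i_name" method="name" locales="en_US" seed="0"/>
            </value_faking>
        </table>
    </anonymization>
</parameters>
"""


def parse_anonymization(root):
    """Collect per-table anonymization settings (hypothetical structure)."""
    tables = {}
    for table in root.iterfind("anonymization/table"):
        settings = {"dp": None, "faking": []}
        dp = table.find("differential_privacy")
        if dp is not None:
            settings["dp"] = {
                # Attributes such as epsilon and algorithm, as raw strings.
                "attributes": dict(dp.attrib),
                "ignore": [c.get("name") for c in dp.iterfind("ignore/column")],
                "categorical": [c.get("name") for c in dp.iterfind("categorical/column")],
                # Map each continuous column to its fine-tuning attributes.
                "continuous": {
                    c.get("name"): {k: v for k, v in c.attrib.items() if k != "name"}
                    for c in dp.iterfind("continuous/column")
                },
            }
        # Value faking is read independently of differential privacy.
        settings["faking"] = [dict(c.attrib) for c in table.iterfind("value_faking/column")]
        tables[table.get("name")] = settings
    return tables


tables = parse_anonymization(ET.fromstring(CONFIG))
item = tables["item"]
```

Reading `<value_faking>` separately from `<differential_privacy>` mirrors the decoupling described in the commit message: a table can carry either section, both, or an empty `<differential_privacy />` element as in the minimal config.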